// Define the thread layouts (static)
TiledCopy copyA = make_tiled_copy(Copy_Atom<SM80_CP_ASYNC_CACHEALWAYS<uint128_t>, TA>{},
Layout<Shape<_16,_8>>{}, // Thr layout 32x4 m-major
Layout<Shape< _8,_1>>{});// Val layout 8x1 m-major
TiledCopy copyB = make_tiled_copy(Copy_Atom<SM80_CP_ASYNC_CACHEALWAYS<uint128_t>, TB>{},
Layout<Shape<_16,_8>>{}, // Thr layout 32x4 n-major
Layout<Shape< _8,_1>>{});// Val layout 8x1 n-major
The comments on gemm_nt's tiled copyA/copyB say the thread layout is 32x4, but the thread layout actually used is 16x8 (Shape<_16,_8>).
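
To show why the wrong comment is misleading, here is a minimal sketch in plain C++. It does not use CuTe itself: `ThreadLayout`, `Coord`, and `print_thread` are hypothetical helpers written only for illustration. It assumes the default compact column-major ordering (left-most mode fastest) that a `Layout<Shape<...>>` without explicit strides gets. Both shapes cover 128 threads, so the thread count alone hides the mismatch, but the same thread index lands on different coordinates.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical helpers, NOT part of CuTe: a compact column-major
// (left-most mode fastest) thread layout of shape Rows x Cols.
struct Coord { int row; int col; };

template <int Rows, int Cols>
struct ThreadLayout {
    static constexpr int rows = Rows;
    static constexpr int cols = Cols;
    static constexpr int size = Rows * Cols;

    // Map a linear thread index to its (row, col) coordinate.
    static constexpr Coord coord(int tid) {
        return Coord{tid % Rows, tid / Rows};
    }
};

using FromCode    = ThreadLayout<16, 8>;  // Shape<_16,_8>, as written in the snippet
using FromComment = ThreadLayout<32, 4>;  // 32x4, as the comment claims

// Both shapes describe 128 threads.
static_assert(FromCode::size == 128, "16x8 covers 128 threads");
static_assert(FromComment::size == 128, "32x4 covers 128 threads");

// Print where a given thread lands under each layout.
inline void print_thread(int tid) {
    assert(tid >= 0 && tid < FromCode::size);
    const Coord a = FromCode::coord(tid);
    const Coord b = FromComment::coord(tid);
    std::printf("tid %3d: 16x8 -> (%d,%d), 32x4 -> (%d,%d)\n",
                tid, a.row, a.col, b.row, b.col);
}
```

Under these assumptions, thread 16 sits at (0,1) in the 16x8 layout but at (16,0) in a 32x4 one, so a reader who trusts the comment would picture a different thread-to-element mapping than the code produces.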