The objective of this project is to develop and implement matrix multiplication concepts that may differ from the commonly used optimizations. In combination with other implementations, these concepts could potentially exceed state-of-the-art performance in some cases of matrix multiplication.

matmul_shmem_tc_async_opt_port_0.cu

The gemm_shmem_tc_async_opt_port kernel was designed to investigate how different workload levels, which the compiler maps to registers, affect matrix multiplication.

The evaluation of the matmul_shmem kernel (matmul_shmem.cu) showed that optimizing the number of accumulators per thread could improve performance without using Tensor Cores. The gemm_shmem_tc_async_opt_port kernel then applies the same idea at the level of warps, optimizing the number of accumulators per warp while using Tensor Cores. The kernel is portable and tunable.
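The role of the per-thread accumulator count can be illustrated with a minimal sketch. It is not the repository's matmul_shmem kernel: the tile sizes `BM`, `BN`, `BK`, the kernel name `matmul_reg_tile`, and the helper `max_abs_error_vs_cpu` are all hypothetical, and row-major storage is assumed. The template parameters `TM` and `TN` set how many accumulators each thread holds, so the count can be tuned at compile time.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

// Hypothetical block-tile sizes, chosen for illustration only.
constexpr int BM = 32, BN = 32, BK = 8;

// Each thread owns TM x TN accumulators. The fully unrolled inner loops let
// the compiler keep acc[][] in registers.
template <int TM, int TN>
__global__ void matmul_reg_tile(const float* A, const float* B, float* C,
                                int M, int N, int K) {
    static_assert(BM % TM == 0 && BN % TN == 0, "tile must divide evenly");
    __shared__ float As[BM][BK];
    __shared__ float Bs[BK][BN];
    const int tx = threadIdx.x, ty = threadIdx.y;
    const int tid = ty * blockDim.x + tx;
    const int nthreads = blockDim.x * blockDim.y;
    const int row0 = blockIdx.y * BM, col0 = blockIdx.x * BN;

    float acc[TM][TN] = {};

    for (int k0 = 0; k0 < K; k0 += BK) {
        // Cooperative, bounds-checked loads of the A and B tiles.
        for (int i = tid; i < BM * BK; i += nthreads) {
            const int r = i / BK, c = i % BK;
            As[r][c] = (row0 + r < M && k0 + c < K)
                           ? A[(row0 + r) * K + (k0 + c)] : 0.0f;
        }
        for (int i = tid; i < BK * BN; i += nthreads) {
            const int r = i / BN, c = i % BN;
            Bs[r][c] = (k0 + r < K && col0 + c < N)
                           ? B[(k0 + r) * N + (col0 + c)] : 0.0f;
        }
        __syncthreads();
#pragma unroll
        for (int k = 0; k < BK; ++k) {
#pragma unroll
            for (int i = 0; i < TM; ++i) {
#pragma unroll
                for (int j = 0; j < TN; ++j) {
                    acc[i][j] += As[ty * TM + i][k] * Bs[k][tx * TN + j];
                }
            }
        }
        __syncthreads();
    }
    for (int i = 0; i < TM; ++i) {
        for (int j = 0; j < TN; ++j) {
            const int r = row0 + ty * TM + i, c = col0 + tx * TN + j;
            if (r < M && c < N) C[r * N + c] = acc[i][j];
        }
    }
}

// Runs the kernel on small integer-valued inputs and returns the largest
// absolute difference from a CPU reference.
template <int TM, int TN>
float max_abs_error_vs_cpu(int M, int N, int K) {
    std::vector<float> hA(M * K), hB(K * N), hC(M * N, -1.0f), ref(M * N, 0.0f);
    for (int i = 0; i < M * K; ++i) hA[i] = float(i % 3 - 1);
    for (int i = 0; i < K * N; ++i) hB[i] = float(i % 5 - 2);
    for (int r = 0; r < M; ++r)
        for (int c = 0; c < N; ++c)
            for (int k = 0; k < K; ++k)
                ref[r * N + c] += hA[r * K + k] * hB[k * N + c];

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, hA.size() * sizeof(float));
    cudaMalloc(&dB, hB.size() * sizeof(float));
    cudaMalloc(&dC, hC.size() * sizeof(float));
    cudaMemcpy(dA, hA.data(), hA.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), hB.size() * sizeof(float), cudaMemcpyHostToDevice);

    const dim3 block(BN / TN, BM / TM);
    const dim3 grid((N + BN - 1) / BN, (M + BM - 1) / BM);
    matmul_reg_tile<TM, TN><<<grid, block>>>(dA, dB, dC, M, N, K);
    cudaMemcpy(hC.data(), dC, hC.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);

    float err = 0.0f;
    for (size_t i = 0; i < hC.size(); ++i)
        err = std::fmax(err, std::fabs(hC[i] - ref[i]));
    return err;
}
```

Instantiating the helper with different `TM` and `TN` values, for example `max_abs_error_vs_cpu<2, 2>` and `max_abs_error_vs_cpu<4, 4>`, changes the register workload per thread and the block size while the result stays the same. Which setting is fastest depends on the device and has to be measured.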

matmul_shmem_tc_async_opt_port_1.cu

The synchronization scheme in the gemm_shmem_tc_async_opt_port kernel was designed to (i) decouple the consumer warps from each other, including at the level of accumulators, and (ii) shift the start of execution of each consumer warp according to the order of the load instructions for the A and transposed B segments of the K dimension. The load instructions are issued from top to bottom, so a segment that is loaded earlier should let its consumer warp start matrix multiply-accumulate earlier, and a segment that is loaded later should result in a later start. This shift should be preserved across the pipeline stages, and may provide better utilization of Tensor Cores in settings where the number of stages is small and the tiles are large.
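The shifted start can be modeled with a much simpler kernel than the one in the repository. The sketch below is an assumption-laden illustration: it uses one producer warp, plain shared-memory loads, and spin-wait flags that stand in for the kernel's barriers, and it replaces the Tensor Core multiply-accumulate with a per-segment sum. It also uses a single stage, so the preservation of the shift across pipeline stages is not modeled. The names `shifted_start`, `run_shifted_start`, `CONSUMERS`, and `SEG` are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int CONSUMERS = 4;  // hypothetical number of consumer warps
constexpr int SEG = 256;      // hypothetical number of elements per segment

// Warp 0 is the producer: it copies segments into shared memory from top to
// bottom and publishes one flag per segment. Consumer warp w waits only on
// flag w, so it starts as soon as its own segment has landed and does not
// depend on the other consumer warps.
__global__ void shifted_start(const float* in, float* out) {
    __shared__ float seg[CONSUMERS][SEG];
    __shared__ volatile int ready[CONSUMERS];
    const int warp = threadIdx.x / 32, lane = threadIdx.x % 32;

    if (threadIdx.x < CONSUMERS) ready[threadIdx.x] = 0;
    __syncthreads();

    if (warp == 0) {
        for (int s = 0; s < CONSUMERS; ++s) {
            for (int i = lane; i < SEG; i += 32) seg[s][i] = in[s * SEG + i];
            __threadfence_block();  // order the data writes before the flag
            __syncwarp();           // every lane has written and fenced
            if (lane == 0) ready[s] = 1;
        }
    } else {
        const int w = warp - 1;
        while (ready[w] == 0) { }   // wait for this warp's segment only
        __threadfence_block();      // order the flag read before the data reads
        float sum = 0.0f;           // stand-in for multiply-accumulate work
        for (int i = lane; i < SEG; i += 32) sum += seg[w][i];
        for (int off = 16; off > 0; off /= 2)
            sum += __shfl_down_sync(0xffffffffu, sum, off);
        if (lane == 0) out[w] = sum;
    }
}

// Fills segment s with the value s + 1, runs the kernel with one producer
// warp plus CONSUMERS consumer warps, and returns the per-warp sums.
std::vector<float> run_shifted_start() {
    std::vector<float> h_in(CONSUMERS * SEG), h_out(CONSUMERS, -1.0f);
    for (int s = 0; s < CONSUMERS; ++s)
        for (int i = 0; i < SEG; ++i) h_in[s * SEG + i] = float(s + 1);

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, h_in.size() * sizeof(float));
    cudaMalloc(&d_out, h_out.size() * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), h_in.size() * sizeof(float), cudaMemcpyHostToDevice);
    shifted_start<<<1, 32 * (CONSUMERS + 1)>>>(d_in, d_out);
    cudaMemcpy(h_out.data(), d_out, h_out.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in); cudaFree(d_out);
    return h_out;
}
```

Because each consumer warp waits on its own flag, consumer warp 0 can begin while the producer is still loading the segments for warps 1 to 3. A production kernel would more likely use hardware-backed barriers than spin-wait flags; the flags are used here only to keep the sketch short.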

matmul_shmem_tc_async_opt_port_2.cu

This file optimizes the shift synchronization scheme implemented in matmul_shmem_tc_async_opt_port_1.cu. Instead of using a separate barrier for every consumer warp, or for every row and column of a warp tile, separate barriers are used only for the first few upper-left consumer warps, or the first few upper-left rows and columns of a warp tile. The remaining consumer warps, or rows and columns of a warp tile, are controlled with single barriers. With this optimization, an early start of matrix multiply-accumulate may be achieved with fewer barriers and lower synchronization overhead. The ability to handle the rest of the warp tile with arrays of fragments is preserved.
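One way to picture the reduced barrier count is as a mapping from a consumer warp's position to a barrier index. The sketch below assumes a hypothetical `WR` x `WC` grid of consumer warps in which the `ER` x `EC` warps in the upper-left corner each keep a private barrier and all remaining warps share one. The layout, the constants, and the names `barrier_index`, `record_barrier_index`, and `barrier_map` are illustrative assumptions and are not taken from matmul_shmem_tc_async_opt_port_2.cu.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical layout: WR x WC consumer warps, of which the ER x EC warps in
// the upper-left corner get a private barrier each.
constexpr int WR = 4, WC = 4, ER = 2, EC = 2;
constexpr int NUM_BARRIERS = ER * EC + 1;  // private barriers + one shared

// Maps a warp at (row wr, column wc) to its barrier index. Upper-left warps
// get indices 0 .. ER*EC-1; every other warp gets the shared index ER*EC.
__host__ __device__ inline int barrier_index(int wr, int wc) {
    return (wr < ER && wc < EC) ? wr * EC + wc : ER * EC;
}

// Each warp records the barrier index it would wait on.
__global__ void record_barrier_index(int* out) {
    const int warp = threadIdx.x / 32, lane = threadIdx.x % 32;
    if (lane == 0) out[warp] = barrier_index(warp / WC, warp % WC);
}

// Launches one block with WR * WC warps and returns the recorded mapping,
// indexed by linear warp id (row-major over the warp grid).
std::vector<int> barrier_map() {
    std::vector<int> h(WR * WC, -1);
    int* d = nullptr;
    cudaMalloc(&d, h.size() * sizeof(int));
    record_barrier_index<<<1, WR * WC * 32>>>(d);
    cudaMemcpy(h.data(), d, h.size() * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return h;
}
```

With these example constants the block needs 5 barriers instead of 16, while the four upper-left warps can still start as soon as their own data is ready.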
