alfin3/cuda-kernel-opt

The objective of this project is to develop and implement concepts for matrix multiplication that differ from the commonly used optimizations and that, in combination with other implementations, could exceed state-of-the-art performance in some cases of matrix multiplication.

matmul_shmem_tc_async_opt_port_0.cu

The design of the gemm_shmem_tc_async_opt_port kernel makes it possible to investigate how different workload levels, mapped to registers by the compiler, affect matrix multiplication.

The evaluation of the matmul_shmem kernel (matmul_shmem.cu) showed that optimizing the number of accumulators per thread could improve performance without using Tensor Cores. The gemm_shmem_tc_async_opt_port kernel then applies the same idea at the warp level with Tensor Cores, where the number of accumulators per warp is optimized. This kernel is portable and tunable.
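The idea of tuning the number of accumulators per thread can be illustrated with a minimal sketch. It is not the repository's matmul_shmem kernel: the names `matmul_acc_sketch` and `max_error_acc_sketch`, the template parameters `BS`, `TM`, `TN`, and the requirement that `N` is a multiple of the block tile are all assumptions made for illustration. Each thread keeps `TM x TN` accumulators, and the fully unrolled loops let the compiler map them to registers.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

// Hypothetical sketch, not the repository's kernel. Each thread owns TM x TN
// accumulators. Assumes square N x N matrices, N a multiple of BS*TM and BS*TN.
template <int BS, int TM, int TN>
__global__ void matmul_acc_sketch(const float* A, const float* B, float* C, int N)
{
    __shared__ float As[BS * TM][BS];  // block tile of A: (BS*TM) x BS
    __shared__ float Bs[BS][BS * TN];  // block tile of B: BS x (BS*TN)

    const int tx = threadIdx.x, ty = threadIdx.y;
    const int row0 = blockIdx.y * BS * TM;  // first row of C for this block
    const int col0 = blockIdx.x * BS * TN;  // first column of C for this block

    float acc[TM][TN] = {};  // per-thread accumulators

    for (int k0 = 0; k0 < N; k0 += BS) {
        // Each thread loads TM elements of A and TN elements of B.
        for (int i = 0; i < TM; ++i)
            As[ty + i * BS][tx] = A[(row0 + ty + i * BS) * N + k0 + tx];
        for (int j = 0; j < TN; ++j)
            Bs[ty][tx + j * BS] = B[(k0 + ty) * N + col0 + tx + j * BS];
        __syncthreads();

        #pragma unroll
        for (int k = 0; k < BS; ++k) {
            #pragma unroll
            for (int i = 0; i < TM; ++i) {
                #pragma unroll
                for (int j = 0; j < TN; ++j)
                    acc[i][j] += As[ty + i * BS][k] * Bs[k][tx + j * BS];
            }
        }
        __syncthreads();  // tiles are overwritten in the next iteration
    }

    for (int i = 0; i < TM; ++i)
        for (int j = 0; j < TN; ++j)
            C[(row0 + ty + i * BS) * N + col0 + tx + j * BS] = acc[i][j];
}

// Hypothetical host helper: runs the sketch and returns the maximum absolute
// difference from a CPU reference.
template <int TM, int TN>
float max_error_acc_sketch(int N)
{
    constexpr int BS = 16;
    std::vector<float> hA(N * N), hB(N * N), hC(N * N, 0.0f);
    for (int i = 0; i < N * N; ++i) {
        hA[i] = 0.5f * float(i % 7);
        hB[i] = float(i % 5) - 2.0f;
    }
    const size_t bytes = sizeof(float) * N * N;
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(BS, BS), grid(N / (BS * TN), N / (BS * TM));
    matmul_acc_sketch<BS, TM, TN><<<grid, block>>>(dA, dB, dC, N);
    cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);

    float max_err = 0.0f;
    for (int r = 0; r < N; ++r)
        for (int c = 0; c < N; ++c) {
            float ref = 0.0f;
            for (int k = 0; k < N; ++k) ref += hA[r * N + k] * hB[k * N + c];
            max_err = std::fmax(max_err, std::fabs(ref - hC[r * N + c]));
        }
    return max_err;
}
```

Instantiating the helper with different `TM` and `TN` values, for example `max_error_acc_sketch<2, 2>(64)` and `max_error_acc_sketch<1, 4>(64)`, changes only the number of accumulators per thread, which is the quantity the evaluation above varies.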



matmul_shmem_tc_async_opt_port_1.cu

The synchronization scheme in the gemm_shmem_tc_async_opt_port kernel is designed to i) decouple the consumer warps from each other, including at the level of accumulators, and ii) shift the start of execution of each consumer warp according to the order of the load instructions for the A and transposed B segments of the K dimension. The load instructions are ordered from top to bottom, so a consumer warp whose segments are loaded earlier should start matrix multiply and accumulate earlier, and a warp whose segments are loaded later should start later. This shift should be preserved across the pipeline stages and may provide better utilization of Tensor Cores in settings where the number of stages is small and the tiles are large.
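The shifted start can be illustrated with a much smaller, single-stage sketch. It is not the repository's kernel: it uses `cuda::barrier` from libcu++ (compute capability 7.0 or newer), one producer warp, plain loads instead of asynchronous copies, and a row-band sum instead of Tensor Core operations. The names `shifted_start_sketch`, `run_shifted_start_sketch`, `kBands`, and `kBandElems` are hypothetical. The producer fills the bands from top to bottom and signals one barrier per band, so each consumer warp waits only for its own band and the upper warps can start first.

```cuda
#include <cuda/barrier>
#include <cuda_runtime.h>
#include <vector>

using block_barrier = cuda::barrier<cuda::thread_scope_block>;

constexpr int kBands = 4;        // hypothetical: one row band per consumer warp
constexpr int kBandElems = 256;  // hypothetical: floats per band

// Warp 0 is the producer, warps 1..kBands are consumers.
__global__ void shifted_start_sketch(const float* in, float* out)
{
    __shared__ float tile[kBands][kBandElems];
    // nvcc may warn about dynamic initialization here; init() below sets the state.
    __shared__ block_barrier filled[kBands];

    const int warp = threadIdx.x / 32;
    const int lane = threadIdx.x % 32;

    if (threadIdx.x == 0)
        for (int w = 0; w < kBands; ++w)
            init(&filled[w], 64);  // 32 producer lanes + 32 lanes of one consumer warp
    __syncthreads();

    if (warp == 0) {
        // Producer: load bands top to bottom and signal each band separately.
        for (int w = 0; w < kBands; ++w) {
            for (int i = lane; i < kBandElems; i += 32)
                tile[w][i] = in[w * kBandElems + i];
            (void)filled[w].arrive();  // band w is ready; do not wait
        }
    } else {
        // Consumer: wait only for its own band, independent of other consumers.
        const int w = warp - 1;
        filled[w].arrive_and_wait();
        float sum = 0.0f;
        for (int i = lane; i < kBandElems; i += 32) sum += tile[w][i];
        for (int off = 16; off > 0; off >>= 1)
            sum += __shfl_down_sync(0xffffffffu, sum, off);
        if (lane == 0) out[w] = sum;
    }
}

// Hypothetical host helper: band w is filled with the value w + 1.
std::vector<float> run_shifted_start_sketch()
{
    std::vector<float> h_in(kBands * kBandElems), h_out(kBands, 0.0f);
    for (int w = 0; w < kBands; ++w)
        for (int i = 0; i < kBandElems; ++i)
            h_in[w * kBandElems + i] = float(w + 1);

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, h_in.size() * sizeof(float));
    cudaMalloc(&d_out, h_out.size() * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), h_in.size() * sizeof(float), cudaMemcpyHostToDevice);
    shifted_start_sketch<<<1, 32 * (kBands + 1)>>>(d_in, d_out);
    cudaMemcpy(h_out.data(), d_out, h_out.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in); cudaFree(d_out);
    return h_out;
}
```

A multi-stage version would also need barriers that tell the producer when a band may be overwritten; the sketch leaves this out to keep the per-warp decoupling visible.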



matmul_shmem_tc_async_opt_port_2.cu

This file optimizes the shift synchronization scheme implemented in matmul_shmem_tc_async_opt_port_1.cu. Instead of a separate barrier for every consumer warp, or for every row and column of a warp tile, separate barriers are used only for the first few upper-left consumer warps, or the first few upper-left rows and columns of a warp tile. The residual consumer warps, or rows and columns of a warp tile, are controlled with single barriers. With this optimization, an early start of matrix multiply and accumulate may be achieved with fewer barriers and lower synchronization overhead. The ability to handle the rest of the warp tile with arrays of fragments is preserved.
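A minimal single-stage sketch of the reduced-barrier idea follows, under the same simplifying assumptions as a toy producer and consumer setup: `cuda::barrier` from libcu++, one producer warp, plain loads, and a row-band sum in place of Tensor Core work. The names `barrier_index`, `reduced_barrier_sketch`, `run_reduced_barrier_sketch`, `kNumBands`, `kSeparate`, and `kElems` are hypothetical, and the one-dimensional band order stands in for the upper-left ordering described above. The first `kSeparate` bands get their own barriers, and all residual bands share one.

```cuda
#include <cuda/barrier>
#include <cuda_runtime.h>
#include <vector>

using block_barrier = cuda::barrier<cuda::thread_scope_block>;

constexpr int kNumBands = 6;  // hypothetical: one row band per consumer warp
constexpr int kSeparate = 2;  // hypothetical: leading bands with their own barrier
constexpr int kElems = 128;   // hypothetical: floats per band

// Leading bands map to their own barrier, all residual bands to the last one.
__host__ __device__ constexpr int barrier_index(int band)
{
    return band < kSeparate ? band : kSeparate;
}

// Warp 0 is the producer, warps 1..kNumBands are consumers.
__global__ void reduced_barrier_sketch(const float* in, float* out)
{
    __shared__ float tile[kNumBands][kElems];
    __shared__ block_barrier filled[kSeparate + 1];

    const int warp = threadIdx.x / 32;
    const int lane = threadIdx.x % 32;

    if (threadIdx.x == 0) {
        for (int b = 0; b < kSeparate; ++b)
            init(&filled[b], 64);  // producer warp + one consumer warp
        // Shared barrier: producer warp + all residual consumer warps.
        init(&filled[kSeparate], 32 * (1 + kNumBands - kSeparate));
    }
    __syncthreads();

    if (warp == 0) {
        for (int w = 0; w < kNumBands; ++w) {
            for (int i = lane; i < kElems; i += 32)
                tile[w][i] = in[w * kElems + i];
            // Signal early for leading bands, once at the end for the residual.
            if (w < kSeparate || w == kNumBands - 1)
                (void)filled[barrier_index(w)].arrive();
        }
    } else {
        const int w = warp - 1;
        filled[barrier_index(w)].arrive_and_wait();
        float sum = 0.0f;
        for (int i = lane; i < kElems; i += 32) sum += tile[w][i];
        for (int off = 16; off > 0; off >>= 1)
            sum += __shfl_down_sync(0xffffffffu, sum, off);
        if (lane == 0) out[w] = sum;
    }
}

// Hypothetical host helper: band w is filled with the value w + 1.
std::vector<float> run_reduced_barrier_sketch()
{
    std::vector<float> h_in(kNumBands * kElems), h_out(kNumBands, 0.0f);
    for (int w = 0; w < kNumBands; ++w)
        for (int i = 0; i < kElems; ++i)
            h_in[w * kElems + i] = float(w + 1);

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, h_in.size() * sizeof(float));
    cudaMalloc(&d_out, h_out.size() * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), h_in.size() * sizeof(float), cudaMemcpyHostToDevice);
    reduced_barrier_sketch<<<1, 32 * (kNumBands + 1)>>>(d_in, d_out);
    cudaMemcpy(h_out.data(), d_out, h_out.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in); cudaFree(d_out);
    return h_out;
}
```

With these assumed values, three barriers replace six: the first two consumer warps can still start early, while the remaining four wait on one shared barrier.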



About

Design and evaluation of potential algorithmic optimizations of matrix multiplication.
