What is your question?
In example 91, a GEMV with NVFP4 epilogue store is provided (https://github.com/NVIDIA/cutlass/blob/main/examples/91_fp4_gemv/91_fp4_gemv.cu). However, the usage of FP16 for accumulation mainloop seems abnormal. For example, the timing of the following shape, (M, N, K) = (1, 1024, 7168) has around a 5% performance increase on SM100 of FP16 to FP32 for the accumulator.
Is it for consumer RTX SM120, where the performance of FP32 accumulate is not good? Yet, that would be even more confusing, since this uses CUDA core based fma.rn.f16x2, and not the tensor core path (mma.sync.aligned.kind::mxf4nvf4.block_scale…f32.e2m1.e2m1.f32) (which, after further investigation, does not even support FP16 accumulation). Thus, I'm not sure if there is a specific target, or was it just an oversight of the example.
What is your question?
In example 91, a GEMV with NVFP4 epilogue store is provided (https://github.com/NVIDIA/cutlass/blob/main/examples/91_fp4_gemv/91_fp4_gemv.cu). However, the usage of FP16 for accumulation mainloop seems abnormal. For example, the timing of the following shape,
(M, N, K) = (1, 1024, 7168)has around a 5% performance increase on SM100 of FP16 to FP32 for the accumulator.Is it for consumer RTX SM120, where the performance of FP32 accumulate is not good? Yet, that would be even more confusing, since this uses CUDA core based
fma.rn.f16x2, and not the tensor core path (mma.sync.aligned.kind::mxf4nvf4.block_scale…f32.e2m1.e2m1.f32) (which, after further investigation, does not even support FP16 accumulation). Thus, I'm not sure if there is a specific target, or was it just an oversight of the example.