[QST] Why is the accumulation dtype in the CUTLASS FP4 GEMV example in FP16, instead of FP32?

**What is your question?**

In example 91, a GEMV with NVFP4 epilogue store is provided (https://github.com/NVIDIA/cutlass/blob/main/examples/91_fp4_gemv/91_fp4_gemv.cu). However, the usage of FP16 for accumulation mainloop seems abnormal. For example, the timing of the following shape, `(M, N, K) = (1, 1024, 7168)` has around a 5% performance increase on SM100 of FP16 to FP32 for the accumulator.

Is it for consumer RTX SM120, where the performance of FP32 accumulate is not good? Yet, that would be even more confusing, since this uses CUDA core based `fma.rn.f16x2`, and not the tensor core path (`mma.sync.aligned.kind::mxf4nvf4.block_scale…f32.e2m1.e2m1.f32`) (which, after further investigation, does not even support FP16 accumulation). Thus, I'm not sure if there is a specific target, or was it just an oversight of the example.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[QST] Why is the accumulation dtype in the CUTLASS FP4 GEMV example in FP16, instead of FP32? #3330

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

[QST] Why is the accumulation dtype in the CUTLASS FP4 GEMV example in FP16, instead of FP32? #3330

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions