Skip to content

[QST] Why is the accumulation dtype in the CUTLASS FP4 GEMV example in FP16, instead of FP32? #3330

Description

@b8zhong

What is your question?

In example 91, a GEMV with NVFP4 epilogue store is provided (https://github.com/NVIDIA/cutlass/blob/main/examples/91_fp4_gemv/91_fp4_gemv.cu). However, the usage of FP16 for accumulation mainloop seems abnormal. For example, the timing of the following shape, (M, N, K) = (1, 1024, 7168) has around a 5% performance increase on SM100 of FP16 to FP32 for the accumulator.

Is it for consumer RTX SM120, where the performance of FP32 accumulate is not good? Yet, that would be even more confusing, since this uses CUDA core based fma.rn.f16x2, and not the tensor core path (mma.sync.aligned.kind::mxf4nvf4.block_scale…f32.e2m1.e2m1.f32) (which, after further investigation, does not even support FP16 accumulation). Thus, I'm not sure if there is a specific target, or was it just an oversight of the example.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions