[REFACTOR][CUDA] Clarify CUDA codegen flow#19929
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a registry manifest for CUDA codegen intrinsics, grouping registered codegen operations by namespace and exposing them via a new list_registered_codegen Python API. Additionally, it refactors the C++ CodeGenCUDA class by modularizing the large PrintType and VisitExpr_ methods into smaller, type-specific helper functions. The review feedback highlights a potential correctness issue where sub-byte shared memory allocations could be under-allocated due to integer division without rounding up, and suggests optimizing stream insertions in CodeGenCUDA to avoid temporary string allocations.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| if (IsSharedSubByteAllocation(op->buffer->dtype, scope)) { | ||
| constant_size = constant_size / (32 / op->buffer->dtype.bits()); | ||
| } |
There was a problem hiding this comment.
Correctness Issue: Potential Buffer Under-allocation for Sub-Byte Types
When allocating shared memory for sub-byte types (such as 4-bit or 1-bit integers), the buffer size is divided by the packing factor 32 / bits. Using integer division (/) without rounding up can result in under-allocation if constant_size is not a multiple of the packing factor. This can lead to out-of-bounds memory accesses or compilation errors.
We should round up the division to ensure sufficient memory is allocated.
| if (IsSharedSubByteAllocation(op->buffer->dtype, scope)) { | |
| constant_size = constant_size / (32 / op->buffer->dtype.bits()); | |
| } | |
| if (IsSharedSubByteAllocation(op->buffer->dtype, scope)) { | |
| const int bits = op->buffer->dtype.bits(); | |
| constant_size = (constant_size + (32 / bits) - 1) / (32 / bits); | |
| } |
| os << local_ptr << "[" + local_offset + " + i] = " << smem_ptr | ||
| << "[(i % 8) / 4 * " + smem_stride + " * 16 + (threadIdx.x % 4) * 4 * " + smem_stride + | ||
| "+ (i % 4) * " + smem_stride + " + threadIdx.x / 4 + (i / 8) * 8];\n"; |
There was a problem hiding this comment.
Efficiency Issue: Unnecessary String Concatenation
Using string concatenation (+) inside stream insertion (os << ...) creates multiple temporary std::string objects on the heap. Since os is a stream, we can chain stream insertion operators (<<) directly to avoid any temporary allocations and improve codegen performance.
| os << local_ptr << "[" + local_offset + " + i] = " << smem_ptr | |
| << "[(i % 8) / 4 * " + smem_stride + " * 16 + (threadIdx.x % 4) * 4 * " + smem_stride + | |
| "+ (i % 4) * " + smem_stride + " + threadIdx.x / 4 + (i / 8) * 8];\n"; | |
| os << local_ptr << "[" << local_offset << " + i] = " << smem_ptr | |
| << "[(i % 8) / 4 * " << smem_stride << " * 16 + (threadIdx.x % 4) * 4 * " << smem_stride | |
| << " + (i % 4) * " << smem_stride << " + threadIdx.x / 4 + (i / 8) * 8];\n"; |
Split CUDA codegen responsibilities into focused helpers and expose registry manifest inspection for CUDA intrinsic coverage tests.
Keep the CUDA intrinsic registry focused on dispatch and remove the inspection-only manifest API from the refactor.
b5bd971 to
6114e37
Compare
Summary
Test plan