Skip to content

[REFACTOR][CUDA] Clarify CUDA codegen flow#19929

Open
Ubospica wants to merge 2 commits into
apache:mainfrom
Ubospica:refactor-cuda-codegen-flow
Open

[REFACTOR][CUDA] Clarify CUDA codegen flow#19929
Ubospica wants to merge 2 commits into
apache:mainfrom
Ubospica:refactor-cuda-codegen-flow

Conversation

@Ubospica

@Ubospica Ubospica commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Split CUDA codegen type, vector lane, barrier, attr, allocation, and CallNode handling into focused helpers.
  • Keep Python CUDA intrinsic codegen dispatch on the existing registry path while preserving canonical op-name aliasing.

Test plan

  • git diff --check

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a registry manifest for CUDA codegen intrinsics, grouping registered codegen operations by namespace and exposing them via a new list_registered_codegen Python API. Additionally, it refactors the C++ CodeGenCUDA class by modularizing the large PrintType and VisitExpr_ methods into smaller, type-specific helper functions. The review feedback highlights a potential correctness issue where sub-byte shared memory allocations could be under-allocated due to integer division without rounding up, and suggests optimizing stream insertions in CodeGenCUDA to avoid temporary string allocations.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +1858 to +1860
if (IsSharedSubByteAllocation(op->buffer->dtype, scope)) {
constant_size = constant_size / (32 / op->buffer->dtype.bits());
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Correctness Issue: Potential Buffer Under-allocation for Sub-Byte Types

When allocating shared memory for sub-byte types (such as 4-bit or 1-bit integers), the buffer size is divided by the packing factor 32 / bits. Using integer division (/) without rounding up can result in under-allocation if constant_size is not a multiple of the packing factor. This can lead to out-of-bounds memory accesses or compilation errors.

We should round up the division to ensure sufficient memory is allocated.

Suggested change
if (IsSharedSubByteAllocation(op->buffer->dtype, scope)) {
constant_size = constant_size / (32 / op->buffer->dtype.bits());
}
if (IsSharedSubByteAllocation(op->buffer->dtype, scope)) {
const int bits = op->buffer->dtype.bits();
constant_size = (constant_size + (32 / bits) - 1) / (32 / bits);
}

Comment on lines +1307 to +1309
os << local_ptr << "[" + local_offset + " + i] = " << smem_ptr
<< "[(i % 8) / 4 * " + smem_stride + " * 16 + (threadIdx.x % 4) * 4 * " + smem_stride +
"+ (i % 4) * " + smem_stride + " + threadIdx.x / 4 + (i / 8) * 8];\n";

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Efficiency Issue: Unnecessary String Concatenation

Using string concatenation (+) inside stream insertion (os << ...) creates multiple temporary std::string objects on the heap. Since os is a stream, we can chain stream insertion operators (<<) directly to avoid any temporary allocations and improve codegen performance.

Suggested change
os << local_ptr << "[" + local_offset + " + i] = " << smem_ptr
<< "[(i % 8) / 4 * " + smem_stride + " * 16 + (threadIdx.x % 4) * 4 * " + smem_stride +
"+ (i % 4) * " + smem_stride + " + threadIdx.x / 4 + (i / 8) * 8];\n";
os << local_ptr << "[" << local_offset << " + i] = " << smem_ptr
<< "[(i % 8) / 4 * " << smem_stride << " * 16 + (threadIdx.x % 4) * 4 * " << smem_stride
<< " + (i % 4) * " << smem_stride << " + threadIdx.x / 4 + (i / 8) * 8];\n";

@yongwww yongwww closed this Jul 2, 2026
@yongwww yongwww reopened this Jul 2, 2026
Ubospica added 2 commits July 3, 2026 15:35
Split CUDA codegen responsibilities into focused helpers and expose registry manifest inspection for CUDA intrinsic coverage tests.
Keep the CUDA intrinsic registry focused on dispatch and remove the inspection-only manifest API from the refactor.
@Ubospica Ubospica force-pushed the refactor-cuda-codegen-flow branch from b5bd971 to 6114e37 Compare July 3, 2026 19:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants