[REFACTOR][CUDA] Clarify CUDA codegen flow by Ubospica · Pull Request #19929 · apache/tvm

Ubospica · 2026-07-02T17:11:43Z

Summary

Split CUDA codegen type, vector lane, barrier, attr, allocation, and CallNode handling into focused helpers.
Keep Python CUDA intrinsic codegen dispatch on the existing registry path while preserving canonical op-name aliasing.

Test plan

git diff --check

gemini-code-assist

Code Review

This pull request introduces a registry manifest for CUDA codegen intrinsics, grouping registered codegen operations by namespace and exposing them via a new list_registered_codegen Python API. Additionally, it refactors the C++ CodeGenCUDA class by modularizing the large PrintType and VisitExpr_ methods into smaller, type-specific helper functions. The review feedback highlights a potential correctness issue where sub-byte shared memory allocations could be under-allocated due to integer division without rounding up, and suggests optimizing stream insertions in CodeGenCUDA to avoid temporary string allocations.


gemini-code-assist · 2026-07-02T17:15:48Z

+  if (IsSharedSubByteAllocation(op->buffer->dtype, scope)) {
+    constant_size = constant_size / (32 / op->buffer->dtype.bits());
+  }


Correctness Issue: Potential Buffer Under-allocation for Sub-Byte Types

When allocating shared memory for sub-byte types (such as 4-bit or 1-bit integers), the buffer size is divided by the packing factor 32 / bits. Using integer division (/) without rounding up can result in under-allocation if constant_size is not a multiple of the packing factor. This can lead to out-of-bounds memory accesses or compilation errors.

We should round up the division to ensure sufficient memory is allocated.

Suggested change

-  if (IsSharedSubByteAllocation(op->buffer->dtype, scope)) {
-    constant_size = constant_size / (32 / op->buffer->dtype.bits());
-  }
+  if (IsSharedSubByteAllocation(op->buffer->dtype, scope)) {
+    const int bits = op->buffer->dtype.bits();
+    constant_size = (constant_size + (32 / bits) - 1) / (32 / bits);
+  }
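
To make the rounding concern concrete, here is a minimal standalone sketch comparing truncating and round-up division for packed sub-byte elements. The helper names `PackedWordsFloor` and `PackedWordsCeil` are hypothetical and exist only for this illustration; they are not TVM functions. The sketch assumes `bits` evenly divides 32, and whether TVM can actually reach a `constant_size` that is not a multiple of the packing factor is not established here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers for illustration only; not part of TVM.
// Assumes bits is one of 1, 2, 4, 8, 16 so that 32 / bits is exact.

// Truncating division: a trailing partial 32-bit word gets no storage.
inline int64_t PackedWordsFloor(int64_t num_elements, int bits) {
  const int64_t lanes_per_word = 32 / bits;
  return num_elements / lanes_per_word;
}

// Round-up division: a trailing partial 32-bit word still gets a full word.
inline int64_t PackedWordsCeil(int64_t num_elements, int bits) {
  const int64_t lanes_per_word = 32 / bits;
  return (num_elements + lanes_per_word - 1) / lanes_per_word;
}
```

For example, ten 4-bit elements pack eight to a word, so truncation yields one word while rounding up yields two. When the element count is already a multiple of the packing factor, both forms agree.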

gemini-code-assist · 2026-07-02T17:15:48Z

+    os << local_ptr << "[" + local_offset + " + i] = " << smem_ptr
+       << "[(i % 8) / 4 * " + smem_stride + " * 16 + (threadIdx.x % 4) * 4 * " + smem_stride +
+              "+ (i % 4) * " + smem_stride + " + threadIdx.x / 4 +  (i / 8) * 8];\n";


Efficiency Issue: Unnecessary String Concatenation

Concatenating strings with + inside a stream insertion (os << ...) builds temporary std::string objects, each of which may allocate on the heap. Because os is a stream, the pieces can be inserted directly with chained << operators, which avoids those temporaries and reduces codegen overhead.

Suggested change

-    os << local_ptr << "[" + local_offset + " + i] = " << smem_ptr
-       << "[(i % 8) / 4 * " + smem_stride + " * 16 + (threadIdx.x % 4) * 4 * " + smem_stride +
-              "+ (i % 4) * " + smem_stride + " + threadIdx.x / 4 + (i / 8) * 8];\n";
+    os << local_ptr << "[" << local_offset << " + i] = " << smem_ptr
+       << "[(i % 8) / 4 * " << smem_stride << " * 16 + (threadIdx.x % 4) * 4 * " << smem_stride
+       << " + (i % 4) * " << smem_stride << " + threadIdx.x / 4 + (i / 8) * 8];\n";
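
The two styles can be compared in isolation with a small sketch. The functions `EmitWithConcat` and `EmitWithStream` are hypothetical names used only here, and the index expression is shortened from the one in the diff. The sketch shows that both forms emit the same text, with the second avoiding the intermediate `std::string` temporaries.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helpers for illustration only; not part of TVM.

// Builds temporary std::string objects with operator+ before streaming them.
inline std::string EmitWithConcat(const std::string& local_ptr,
                                  const std::string& local_offset,
                                  const std::string& smem_ptr,
                                  const std::string& smem_stride) {
  std::ostringstream os;
  os << local_ptr << "[" + local_offset + " + i] = " << smem_ptr
     << "[(i % 4) * " + smem_stride + " + threadIdx.x / 4];\n";
  return os.str();
}

// Streams each piece directly, so no intermediate strings are built.
inline std::string EmitWithStream(const std::string& local_ptr,
                                  const std::string& local_offset,
                                  const std::string& smem_ptr,
                                  const std::string& smem_stride) {
  std::ostringstream os;
  os << local_ptr << "[" << local_offset << " + i] = " << smem_ptr
     << "[(i % 4) * " << smem_stride << " + threadIdx.x / 4];\n";
  return os.str();
}
```

Because the output is identical, the change is behavior-preserving for the generated CUDA source. Any speedup is likely small relative to the rest of codegen.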


gemini-code-assist Bot reviewed Jul 2, 2026


yongwww closed this Jul 2, 2026

yongwww reopened this Jul 2, 2026

Ubospica added 2 commits July 3, 2026 15:35

[REFACTOR][CUDA] Clarify CUDA codegen flow

d9195fc

Split CUDA codegen responsibilities into focused helpers and expose registry manifest inspection for CUDA intrinsic coverage tests.

[REFACTOR][CUDA] Drop codegen manifest listing

6114e37

Keep the CUDA intrinsic registry focused on dispatch and remove the inspection-only manifest API from the refactor.

Ubospica force-pushed the refactor-cuda-codegen-flow branch from b5bd971 to 6114e37 Compare July 3, 2026 19:35

Ubospica wants to merge 2 commits into apache:main from Ubospica:refactor-cuda-codegen-flow
