Add grid-stride loops and ROCm cap to reorder_batched_ad_{lengths,indices,sequence_embeddings}_kernel by q10 · Pull Request #5938 · pytorch/FBGEMM

q10 · 2026-06-21T23:19:20Z

Summary:
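Tier-2 fix for HIP grid-overflow in `sparse_ops/sparse_reorder_batched_ad.cu`. Three kernels are affected. Each computes `b_t = blockIdx.x * blockDim.y + threadIdx.y` directly and relies on an early return for out-of-range indices, so capping the host-side grid without first adding a grid-stride loop would silently drop work.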

Changes:

- `reorder_batched_ad_lengths_kernel`: Pattern B grid-stride loop over `b_t`. The `t >= T` early-return becomes the implicit loop bound.
- `reorder_batched_ad_indices_kernel` and `reorder_batched_ad_indices_kernel_vec`: Pattern B grid-stride loop. All per-iteration locals (`output_segment_*`, `input_segment_*`, `num_elements`, `dst_ptr`, `src_ptr`) reset naturally. Inner `if (num_elements <= 64) / else if ... <= 128 / else > 128` branch dispatch is also per-iteration.
- `reorder_batched_sequence_embeddings_kernel`: Pattern B grid-stride loop. Inner per-row and per-D loops are intra-block and unchanged.
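
To illustrate why the loop has to land before the cap, here is a minimal host-side C++ emulation of the indexing, not the kernel code itself. It assumes that "Pattern B" means starting at `blockIdx.x * blockDim.y + threadIdx.y` and striding by `gridDim.x * blockDim.y`, and that `t` is derived as `b_t / B`. Both are assumptions made for this sketch. The helper names `visit_counts` and `all_visited_once` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Emulates a 1-D grid of blocks, each with block_dim_y thread rows, and
// returns how many times each flattened index b_t in [0, B*T) is visited.
// grid_stride=false mimics the original "one b_t per row, early return" shape;
// grid_stride=true mimics the assumed Pattern B loop.
std::vector<int> visit_counts(int64_t B, int64_t T, int64_t grid_dim_x,
                              int64_t block_dim_y, bool grid_stride) {
  assert(B > 0 && T > 0 && grid_dim_x > 0 && block_dim_y > 0);
  std::vector<int> counts(B * T, 0);
  const int64_t stride = grid_dim_x * block_dim_y;
  for (int64_t block_idx_x = 0; block_idx_x < grid_dim_x; ++block_idx_x) {
    for (int64_t thread_idx_y = 0; thread_idx_y < block_dim_y; ++thread_idx_y) {
      const int64_t first = block_idx_x * block_dim_y + thread_idx_y;
      if (!grid_stride) {
        const int64_t t = first / B;  // assumed mapping from b_t to t
        if (t >= T) continue;         // stands in for the early return
        ++counts[first];
      } else {
        // The loop bound b_t < B*T replaces the t >= T early return.
        for (int64_t b_t = first; b_t < B * T; b_t += stride) {
          ++counts[b_t];
        }
      }
    }
  }
  return counts;
}

// True when every index was processed exactly once.
bool all_visited_once(const std::vector<int>& counts) {
  for (int c : counts) {
    if (c != 1) return false;
  }
  return true;
}
```

With `B = 10`, `T = 100` and 32 rows per block, the uncapped grid of 32 blocks covers every index in the direct form. A grid capped to 4 blocks covers only the first 128 indices unless the grid-stride form is used.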

Apply the standard `#ifdef USE_ROCM min(grid_uncapped, get_max_thread_blocks(stream)) #else grid_uncapped #endif` cap to all three launch sites:

- `reorder_batched_ad_lengths_gpu`: `grid_size = (B*T+31)/32` (manual ceil-div).
- `reorder_batched_ad_indices_gpu`: the uncapped grid is `cuda_calc_xblock_count(B*T, NUM_WARPS)`. The cap is applied once, outside the `#if defined __HIP_PLATFORM_AMD__ / #else / #endif` block, after `NUM_WARPS` is determined.
- `reorder_batched_sequence_embeddings_gpu`: `(B*T+31)/32` (manual ceil-div).
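
The launch-side arithmetic can be sketched in plain C++ as follows. This is an illustration under stated assumptions rather than the code in the diff. `max_blocks` is a hypothetical stand-in for the value returned by `get_max_thread_blocks(stream)`, and a runtime `is_rocm` flag replaces the `#ifdef USE_ROCM` switch so that both branches can be exercised in one build.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Manual ceil-div, as in (B*T + 31) / 32 for 32 rows per block.
constexpr int64_t ceil_div(int64_t n, int64_t d) {
  return (n + d - 1) / d;
}

// Returns the grid size for B*T work items. On the emulated ROCm path the
// uncapped size is clamped to max_blocks (hypothetical parameter); otherwise
// the uncapped size is used unchanged.
int64_t capped_grid_size(int64_t B, int64_t T, int64_t rows_per_block,
                         int64_t max_blocks, bool is_rocm) {
  assert(rows_per_block > 0 && max_blocks > 0);
  const int64_t grid_uncapped = ceil_div(B * T, rows_per_block);
  return is_rocm ? std::min(grid_uncapped, max_blocks) : grid_uncapped;
}
```

The cap only changes the result when the uncapped grid exceeds `max_blocks`. Small problems launch the same grid on both paths.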

Stacked on top of D105029511 (Tier-2 Diff 6/7). Plan: `/home/bensonma415/.llms/plans/sparse_ops_rocm_grid_overflow_tier2_fix.plan.md` (Diff 7/7), the final diff in the Tier-2 stack.

Differential Revision: D105030655

…_kernel (pytorch#5934)

Summary:
X-link: facebookresearch/FBGEMM#2852

Tier-2 fix for HIP grid-overflow in `sparse_ops/sparse_index_add.cu`. `index_add_2d_with_unique_indices_kernel` previously used `blockIdx.x` directly to index unique indices. Capping the host-side grid without first adding a grid-stride loop would silently drop work.

Changes:

- Add `const int num_unique_indices` as a new kernel parameter.
- Convert kernel to a grid-stride loop over `u = blockIdx.x; u < num_unique_indices; u += gridDim.x` (Pattern C). All `blockIdx.x` references replaced with `u`. Hoist `start_D` and `has_remainder` outside the loop since they depend only on `blockIdx.y` / `threadIdx.x`.
- Reset per-iteration register state at the top of each iteration: `sum[MAX_ELEMENTS_PER_THREAD]` re-zeroed and `sum_remainder = 0`.
- Apply standard `#ifdef USE_ROCM min(blocks_x_uncapped, get_max_thread_blocks(stream)) #else blocks_x_uncapped #endif` cap to the x-dim of the launch grid. The y-dim is bounded by D/stride_D and needs no cap.

Stacked on top of D105029028 (Tier-2 Diff 5/7). Plan: `/home/bensonma415/.llms/plans/sparse_ops_rocm_grid_overflow_tier2_fix.plan.md` (Diff 6/7).

Reviewed By: henrylhtsang

Differential Revision: D105029511
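
The per-iteration reset matters because a block now handles more than one `u`. The following host-side C++ sketch is a simplified emulation, not the kernel: a single scalar `sum` stands in for the per-thread `sum[MAX_ELEMENTS_PER_THREAD]` and `sum_remainder` state, and the function name `segment_sums` and the `segments` input are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Each emulated block walks u = block_idx_x, block_idx_x + grid_dim_x, ...
// and writes the sum of segments[u] to out[u]. When reset_each_iteration is
// false, the accumulator leaks from one u into the next.
std::vector<float> segment_sums(const std::vector<std::vector<float>>& segments,
                                int grid_dim_x, bool reset_each_iteration) {
  assert(grid_dim_x > 0);
  const int num_unique_indices = static_cast<int>(segments.size());
  std::vector<float> out(segments.size(), 0.0f);
  for (int block_idx_x = 0; block_idx_x < grid_dim_x; ++block_idx_x) {
    float sum = 0.0f;  // stands in for per-thread register state
    for (int u = block_idx_x; u < num_unique_indices; u += grid_dim_x) {
      if (reset_each_iteration) {
        sum = 0.0f;  // reset at the top of each iteration
      }
      for (float v : segments[u]) {
        sum += v;
      }
      out[u] = sum;
    }
  }
  return out;
}
```

With one block per `u` the reset is redundant, which is why the original kernel did not need it. Once the grid is smaller than `num_unique_indices`, skipping the reset contaminates every index after the first one a block handles.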


meta-codesync · 2026-06-21T23:19:30Z

@q10 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D105030655.

q10 added 2 commits June 21, 2026 16:19

pytorch-bot Bot added ciflow/rocm module: rocm labels Jun 21, 2026

meta-cla Bot added the cla signed label Jun 21, 2026

meta-codesync Bot added the meta-exported label Jun 21, 2026

q10 wants to merge 2 commits into pytorch:main from q10:export-D105030655
