[minAsyncMoE] Fix active swiglu int32 indexing overflow by IvanKobzarev · Pull Request #3769 · pytorch/torchtitan

IvanKobzarev · 2026-06-24T16:10:07Z

The 64-GPU BS16 MAST repro was DSV3GB64_FLASHB256_BS16_CLEAN_PERFONLY_ACTSWIGTO_ACTSWIGTO_-x613qn1b. It ran DeepSeek V3 16B with DP64, EP64, TP1, local batch size 16, and seq_len 4096. That gives each rank 16 * 4096 = 65,536 local tokens. With EP64 and one local expert per rank in this config, MinimalAsyncEP's receive capacity is 64 * 65,536 = 4,194,304 rows.
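The capacity arithmetic above can be reproduced with a short sketch. The variable names are illustrative and are not taken from torchtitan; the formula simply mirrors the numbers quoted for this config (EP64, one local expert per rank).

```python
# Illustrative names only; the values are the ones quoted in the description.
local_batch_size = 16
seq_len = 4096
ep_degree = 64  # EP64, one local expert per rank in this config

# Tokens each rank holds locally.
local_tokens = local_batch_size * seq_len

# Receive capacity as stated in the description: EP degree times local tokens.
receive_capacity = ep_degree * local_tokens

print(f"local tokens per rank: {local_tokens:,}")
print(f"receive capacity (rows): {receive_capacity:,}")
```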

The row-copy kernels address row-major hidden buffers as row_id * row_stride + col. At the MAST shape, element offsets in a 2048-column hidden buffer reach 4,194,304 * 2,048 = 8,589,934,592; the ActiveSwiGLU stress from the source branch also used 1,408-column buffers, where offsets reach 5,905,580,032. Both are well past the signed int32 limit of 2,147,483,647.
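As a quick check of these figures, the sketch below recomputes the row * stride products and compares them with the int32 limit. `max_row_offset` and `safe_rows` are hypothetical names introduced for illustration; the second dictionary shows roughly how many rows fit before the product leaves int32 range.

```python
INT32_MAX = 2**31 - 1

receive_capacity = 4_194_304  # rows, from the MAST config above


def max_row_offset(rows: int, row_stride: int) -> int:
    """Hypothetical helper: the rows * row_stride product quoted in the text."""
    return rows * row_stride


# Column counts mentioned in the description.
offsets = {cols: max_row_offset(receive_capacity, cols) for cols in (2048, 1408)}
for cols, offset in offsets.items():
    print(f"{cols} cols: {offset:,} elements, exceeds int32: {offset > INT32_MAX}")

# Largest row index whose row * stride product still fits in signed int32.
safe_rows = {cols: INT32_MAX // cols for cols in (2048, 1408)}
for cols, rows in safe_rows.items():
    print(f"{cols} cols: products stay in int32 up to row {rows:,}")
```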

Triton builds the row vector from tl.arange, which is int32 unless promoted. In the no-src-rows path, src_row inherits that int32 row vector, so row_id * stride can wrap before the pointer arithmetic is formed. This change promotes the row ids to int64 before the stride multiply; the loaded source and destination rows are promoted too, so all row-derived address math uses int64. The routing metadata itself was valid: the MAST failures showed valid ranges up to receive_capacity and only then surfaced as CUDA illegal memory accesses or cudaGraphLaunch crashes, because the address calculation wrapped at replay/runtime scale.
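The wrap itself can be modeled on the host without a GPU. The following is a minimal sketch in plain Python, not the kernel code: `wrap_int32` is a hypothetical helper that emulates a two's-complement 32-bit result, and the row id is an arbitrary example below receive_capacity. In a Triton kernel the promotion would typically be a cast such as `rows.to(tl.int64)` before the multiply; that spelling is an assumption about the shape of the fix, not a quote from the diff.

```python
def wrap_int32(value: int) -> int:
    """Emulate the signed 32-bit result of an integer operation."""
    value &= 0xFFFFFFFF
    return value - 0x1_0000_0000 if value >= 0x8000_0000 else value


row_stride = 2048
row_id = 2_000_000  # arbitrary example row, below the 4,194,304-row capacity

# int32 path: the product is truncated to 32 bits before it reaches the pointer.
offset_int32 = wrap_int32(row_id * row_stride)

# int64 path: the row id is promoted first, so the product is exact.
offset_int64 = row_id * row_stride

print("int32 product:", offset_int32)
print("int64 product:", offset_int64)
print("wrapped:", offset_int32 != offset_int64)
```

A valid row id therefore turns into a negative or otherwise unrelated element offset, which is consistent with valid routing metadata producing illegal memory accesses.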

I did not add a synthetic overflow unit test here. The source-inspection test was removed, and a behavioral pre/post test needs the same 4,194,304-row, large-stride CUDA address geometry or an equivalent multi-GB pointer setup. A smaller unit test would not exercise this overflow.

Test Plan:

/home/ivankobzarev/local/b/pytorch-env/bin/python -m py_compile torchtitan/distributed/minimal_async_ep/kernels.py

/home/ivankobzarev/local/b/pytorch-env/bin/python -m unittest tests.unit_tests.test_minimal_async_ep_kernels

git diff --check

Authored with assistance from OpenAI Codex.

Commit 69d8af0 fixed the same address-generation bug on a side branch that also carried the ActiveSwiGLU op. This checkout does not have that op, so this backport keeps the minimal part that applies to main: promote MinimalAsyncEP row-copy row ids before stride arithmetic.

The 64-GPU BS16 MAST repro was DSV3GB64_FLASHB256_BS16_CLEAN_PERFONLY_ACTSWIGTO_ACTSWIGTO_-x613qn1b. It ran DeepSeek V3 16B with DP64, EP64, TP1, local batch size 16, and seq_len 4096. That gives each rank 16 * 4096 = 65,536 local tokens. With EP64 and one local expert per rank in this config, MinimalAsyncEP's receive capacity is 64 * 65,536 = 4,194,304 rows.

The row-copy kernels address row-major hidden buffers as row_id * row_stride + col. At the MAST shape, a 2048-column hidden buffer reaches 4,194,304 * 2,048 = 8,589,934,592 element offsets; the ActiveSwiGLU stress from the source branch also hit 1,408-column offsets, or 5,905,580,032 elements. Both are well past the signed int32 limit of 2,147,483,647.

Triton builds the row vector from tl.arange, which is int32 unless promoted. In the no-src-rows path, src_row inherits that int32 row vector, so row_id * stride can wrap before pointer arithmetic is formed. The loaded source and destination rows are promoted too so all row-derived address math uses int64. The routing metadata itself was valid; the MAST failures showed valid ranges up to receive_capacity, then surfaced as CUDA illegal memory accesses or cudaGraphLaunch crashes because the address calculation wrapped at replay/runtime scale.

The regression test constructs the smallest practical CUDA address shape for this bug with uint8 storage. It places the source tensor base at a 2**31 byte offset inside a larger allocation and launches the production row-copy kernel with row * stride == 2**31. Before the fix, the int32 row product sign-wraps to -2**31 and copies the sentinel at the start of the allocation. After the fix, the int64 product copies the sentinel at the true high address.
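The address geometry of that regression test can be modeled on the host as follows. This is a sketch of the reasoning only, not the test in the PR: the allocation is a sparse dictionary rather than CUDA memory, the sentinel labels are placeholders, and `wrap_int32` and `read_source` are hypothetical helpers.

```python
def wrap_int32(value: int) -> int:
    """Emulate the signed 32-bit result of an integer operation."""
    value &= 0xFFFFFFFF
    return value - 0x1_0000_0000 if value >= 0x8000_0000 else value


BASE_OFFSET = 2**31       # source tensor base inside the larger allocation (bytes)
ROW_TIMES_STRIDE = 2**31  # the row * stride product the test launches with

# Sparse stand-in for the uint8 allocation: byte offset -> sentinel label.
allocation = {
    0: "low sentinel",                               # start of the allocation
    BASE_OFFSET + ROW_TIMES_STRIDE: "high sentinel",  # true target address
}


def read_source(product: int):
    """Return what lives at base + product in the modeled allocation."""
    return allocation.get(BASE_OFFSET + product)


before_fix = read_source(wrap_int32(ROW_TIMES_STRIDE))  # int32 product wraps
after_fix = read_source(ROW_TIMES_STRIDE)               # int64 product is exact

# Bytes that must be addressable for the high sentinel to exist.
required_bytes = BASE_OFFSET + ROW_TIMES_STRIDE + 1

print(before_fix, "|", after_fix, "|", required_bytes)
```

With a 2**31 base, the wrapped product of -2**31 lands exactly on offset 0, so the two outcomes are distinguishable by which sentinel is copied.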
The test has a CUDA free-memory guard because this faithful overflow shape needs just over 4 GiB of addressable storage.

Test Plan:

```bash
CUDA_LAUNCH_BLOCKING=1 /home/ivankobzarev/local/b/pytorch-env/bin/python -m unittest tests.unit_tests.test_minimal_async_ep_kernels.TestMinimalAsyncEPKernels.test_copy_rows_uses_int64_for_source_stride_arithmetic
```

The same command failed before restoring the row promotion, with `AssertionError: 17 != 93`.

```bash
/home/ivankobzarev/local/b/pytorch-env/bin/python -m py_compile torchtitan/distributed/minimal_async_ep/kernels.py tests/unit_tests/test_minimal_async_ep_kernels.py
```

```bash
CUDA_LAUNCH_BLOCKING=1 /home/ivankobzarev/local/b/pytorch-env/bin/python -m unittest tests.unit_tests.test_minimal_async_ep_kernels
```

```bash
git diff --check
```

```bash
uv tool run --from pre-commit --with pyrefly==0.45.1 --with-requirements requirements.txt --with-requirements requirements-dev.txt pre-commit run --all-files --show-diff-on-failure
```

Authored with assistance from OpenAI Codex.

tianyu-l

unblock

pytorch-bot Bot added the ciflow/8gpu label Jun 24, 2026

meta-cla Bot added the CLA Signed label Jun 24, 2026

IvanKobzarev marked this pull request as ready for review June 24, 2026 16:10

IvanKobzarev requested review from fegin, tianyu-l, wconstab and wwwjn as code owners June 24, 2026 16:10

IvanKobzarev requested review from aditvenk and sanketpurandare June 24, 2026 16:10

IvanKobzarev changed the title ~~[minAsyncMoE] Fix high-capacity row-copy addressing~~ [minAsyncMoE] Fix active swiglu int32 indexing overflow Jun 24, 2026

IvanKobzarev force-pushed the fix-mamoe-int32-overflow branch from 43ec06b to 5adde9a June 24, 2026 16:15

aditvenk approved these changes Jun 24, 2026

sanketpurandare approved these changes Jun 24, 2026

tianyu-l approved these changes Jun 24, 2026

[minAsyncMoE] Fix active swiglu int32 indexing overflow #3769
IvanKobzarev wants to merge 1 commit into pytorch:main from IvanKobzarev:fix-mamoe-int32-overflow
