[minAsyncMoE] Fix active swiglu int32 indexing overflow#3769
Open
IvanKobzarev wants to merge 1 commit into
Commit 69d8af0 fixed the same address-generation bug on a side branch that also carried the ActiveSwiGLU op. This checkout does not have that op, so this backport keeps the minimal part that applies to main: promote MinimalAsyncEP row-copy row ids before stride arithmetic.

The 64-GPU BS16 MAST repro was DSV3GB64_FLASHB256_BS16_CLEAN_PERFONLY_ACTSWIGTO_ACTSWIGTO_-x613qn1b. It ran DeepSeek V3 16B with DP64, EP64, TP1, local batch size 16, and seq_len 4096. That gives each rank 16 * 4096 = 65,536 local tokens. With EP64 and one local expert per rank in this config, MinimalAsyncEP's receive capacity is 64 * 65,536 = 4,194,304 rows.

The row-copy kernels address row-major hidden buffers as row_id * row_stride + col. At the MAST shape, a 2048-column hidden buffer reaches 4,194,304 * 2,048 = 8,589,934,592 element offsets; the ActiveSwiGLU stress from the source branch also hit 1,408-column offsets, or 5,905,580,032 elements. Both are well past the signed int32 limit of 2,147,483,647.

Triton builds the row vector from tl.arange, which is int32 unless promoted. In the no-src-rows path, src_row inherits that int32 row vector, so row_id * stride can wrap before pointer arithmetic is formed. The loaded source and destination rows are promoted too so all row-derived address math uses int64. The routing metadata itself was valid; the MAST failures showed valid ranges up to receive_capacity, then surfaced as CUDA illegal memory accesses or cudaGraphLaunch crashes because the address calculation wrapped at replay/runtime scale.

The regression test constructs the smallest practical CUDA address shape for this bug with uint8 storage. It places the source tensor base at a 2**31 byte offset inside a larger allocation and launches the production row-copy kernel with row * stride == 2**31. Before the fix, the int32 row product sign-wraps to -2**31 and copies the sentinel at the start of the allocation. After the fix, the int64 product copies the sentinel at the true high address.
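The overflow arithmetic above can be checked on the host with plain Python integers. This is only a sketch of the numbers, not the kernel code: `wrap_int32` is a hypothetical helper that emulates signed 32-bit two's-complement wraparound, and the variable names are chosen here for illustration.

```python
INT32_MIN = -(2**31)
INT32_MAX = 2**31 - 1


def wrap_int32(value: int) -> int:
    """Emulate signed 32-bit two's-complement wraparound (hypothetical helper)."""
    return (value - INT32_MIN) % 2**32 + INT32_MIN


# Shape from the MAST repro in the description.
local_tokens = 16 * 4096              # local batch size * seq_len
receive_capacity = 64 * local_tokens  # EP64, one local expert per rank

for row_stride in (2048, 1408):
    exact = receive_capacity * row_stride  # what int64 arithmetic yields
    wrapped = wrap_int32(exact)            # what int32 arithmetic would yield
    print(
        f"stride={row_stride}: exact={exact:,} int32={wrapped:,} "
        f"overflows={exact > INT32_MAX}"
    )

# Regression-test shape: row * stride == 2**31, one past INT32_MAX.
test_product = 2**31
print("test product wraps to", wrap_int32(test_product))
```

The last line models the pre-fix behaviour of the regression test, where the wrapped product points back toward the start of the allocation instead of the high address. It says nothing about how Triton lowers the multiplication; it only shows that every product involved exceeds what int32 can represent.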
The test has a CUDA free-memory guard because this faithful overflow shape needs just over 4 GiB of addressable storage.

Test Plan:

```bash
CUDA_LAUNCH_BLOCKING=1 /home/ivankobzarev/local/b/pytorch-env/bin/python -m unittest tests.unit_tests.test_minimal_async_ep_kernels.TestMinimalAsyncEPKernels.test_copy_rows_uses_int64_for_source_stride_arithmetic
```

The same command failed before restoring the row promotion, with `AssertionError: 17 != 93`.

```bash
/home/ivankobzarev/local/b/pytorch-env/bin/python -m py_compile torchtitan/distributed/minimal_async_ep/kernels.py tests/unit_tests/test_minimal_async_ep_kernels.py
```

```bash
CUDA_LAUNCH_BLOCKING=1 /home/ivankobzarev/local/b/pytorch-env/bin/python -m unittest tests.unit_tests.test_minimal_async_ep_kernels
```

```bash
git diff --check
```

```bash
uv tool run --from pre-commit --with pyrefly==0.45.1 --with-requirements requirements.txt --with-requirements requirements-dev.txt pre-commit run --all-files --show-diff-on-failure
```

Authored with assistance from OpenAI Codex.
43ec06b to
5adde9a
Compare
aditvenk
approved these changes
Jun 24, 2026
sanketpurandare
approved these changes
Jun 24, 2026
I did not add a synthetic overflow unit test here. The source-inspection test was removed, and a behavioral pre/post test needs the same 4,194,304-row, large-stride CUDA address geometry or an equivalent multi-GB pointer setup. A smaller unit test would not exercise this overflow.