[minAsyncMoE] Fix active swiglu int32 indexing overflow #3769

Open
IvanKobzarev wants to merge 1 commit into pytorch:main from IvanKobzarev:fix-mamoe-int32-overflow
@IvanKobzarev

The 64-GPU BS16 MAST repro was DSV3GB64_FLASHB256_BS16_CLEAN_PERFONLY_ACTSWIGTO_ACTSWIGTO_-x613qn1b. It ran DeepSeek V3 16B with DP64, EP64, TP1, local batch size 16, and seq_len 4096. That gives each rank 16 * 4096 = 65,536 local tokens. With EP64 and one local expert per rank in this config, MinimalAsyncEP's receive capacity is 64 * 65,536 = 4,194,304 rows.

The row-copy kernels address row-major hidden buffers as row_id * row_stride + col. At the MAST shape, a 2048-column hidden buffer reaches 4,194,304 * 2,048 = 8,589,934,592 element offsets; the ActiveSwiGLU stress from the source branch also hit 1,408-column offsets, or 5,905,580,032 elements. Both are well past the signed int32 limit of 2,147,483,647.
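The arithmetic above can be checked with a few lines of plain Python. This is only a sketch: the variable names and the `first_overflowing_row` helper are hypothetical and do not come from `kernels.py`.

```python
# Illustrative arithmetic only; names are hypothetical, not taken from kernels.py.
INT32_MAX = 2**31 - 1

local_batch_size = 16
seq_len = 4096
ep_degree = 64

local_tokens = local_batch_size * seq_len      # tokens per rank
receive_capacity = ep_degree * local_tokens    # rows in the receive buffer


def first_overflowing_row(row_stride: int) -> int:
    """Smallest row id whose row_id * row_stride no longer fits in signed int32."""
    return INT32_MAX // row_stride + 1


for row_stride in (2048, 1408):
    end_offset = receive_capacity * row_stride
    print(
        f"stride={row_stride}: end offset={end_offset:,}, "
        f"past int32 max={end_offset > INT32_MAX}, "
        f"first overflowing row={first_overflowing_row(row_stride):,}"
    )
```

Under these numbers, only the first quarter of a fully used 2048-column receive buffer (rows below 1,048,576) is addressable with an int32 row product.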

Triton builds the row vector from tl.arange, which is int32 unless promoted. In the no-src-rows path, src_row inherits that int32 row vector, so row_id * stride can wrap before the pointer arithmetic is formed. The fix promotes the row ids to int64 before the stride multiply; the loaded source and destination rows are promoted as well, so all row-derived address math uses int64. The routing metadata itself was valid: the MAST failures showed valid ranges up to receive_capacity, and then surfaced as CUDA illegal memory accesses or cudaGraphLaunch crashes because the address calculation wrapped at replay/runtime scale.
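Why the promotion has to happen before the multiply can be shown with a host-side emulation. This sketch does not run Triton; it uses `ctypes.c_int32` to mimic two's-complement int32 wraparound, and the sample row ids and the `int32_mul` helper are hypothetical. In a Triton kernel the equivalent step would typically be a `.to(tl.int64)` on the row vector ahead of the stride multiply.

```python
import ctypes


def int32_mul(a: int, b: int) -> int:
    """Multiply with two's-complement int32 wraparound, as a 32-bit lane would."""
    return ctypes.c_int32(a * b).value


row_stride = 2048
rows = [0, 1_048_575, 1_048_576, 4_194_303]  # hypothetical sample row ids

for row in rows:
    # Multiply in int32 and widen afterwards: the product has already wrapped.
    wrapped = int32_mul(row, row_stride)
    # Widen first, then multiply: Python's exact ints stand in for int64 here.
    promoted = row * row_stride
    print(f"row={row:>9,}  int32 product={wrapped:>14,}  int64 product={promoted:>14,}")
```

Casting the product to int64 after the fact would not help, because the wrapped value is what gets widened. For the last valid row at this stride, the int32 product comes out as -2048, which points just before the buffer base rather than near its end.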

I did not add a synthetic overflow unit test here. The source-inspection test was removed, and a behavioral pre/post test needs the same 4,194,304-row, large-stride CUDA address geometry or an equivalent multi-GB pointer setup. A smaller unit test would not exercise this overflow.

Test Plan:

/home/ivankobzarev/local/b/pytorch-env/bin/python -m py_compile torchtitan/distributed/minimal_async_ep/kernels.py
/home/ivankobzarev/local/b/pytorch-env/bin/python -m unittest tests.unit_tests.test_minimal_async_ep_kernels
git diff --check

Authored with assistance from OpenAI Codex.

@meta-cla (bot) added the CLA Signed label Jun 24, 2026
@IvanKobzarev marked this pull request as ready for review June 24, 2026 16:10
@IvanKobzarev changed the title from "[minAsyncMoE] Fix high-capacity row-copy addressing" to "[minAsyncMoE] Fix active swiglu int32 indexing overflow" Jun 24, 2026
Commit 69d8af0 fixed the same address-generation bug on a side branch that also carried the ActiveSwiGLU op. This checkout does not have that op, so this backport keeps the minimal part that applies to main: promote MinimalAsyncEP row-copy row ids before stride arithmetic.

The regression test constructs the smallest practical CUDA address shape for this bug with uint8 storage. It places the source tensor base at a 2**31 byte offset inside a larger allocation and launches the production row-copy kernel with row * stride == 2**31. Before the fix, the int32 row product sign-wraps to -2**31 and copies the sentinel at the start of the allocation. After the fix, the int64 product copies the sentinel at the true high address. The test has a CUDA free-memory guard because this faithful overflow shape needs just over 4 GiB of addressable storage.
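The address geometry of that test can be modelled on the host without a GPU. The following is a sketch under stated assumptions, not the actual test: the allocation is a sparse dict rather than real uint8 storage, the row and stride values are hypothetical choices whose product is 2**31, and the sentinel values 17 and 93 are borrowed from the assertion message in the test plan, with their placement (17 low, 93 high) assumed.

```python
import ctypes

# Sparse stand-in for a uint8 allocation slightly larger than 4 GiB:
# only the two sentinel bytes are modelled. Placement is an assumption.
LOW_SENTINEL, HIGH_SENTINEL = 17, 93
allocation = {0: LOW_SENTINEL, 2**32: HIGH_SENTINEL}

base_offset = 2**31              # source tensor starts this far into the allocation
row, row_stride = 2**20, 2**11   # hypothetical values with row * row_stride == 2**31


def read_byte(product: int) -> int:
    """Read the byte at base_offset + product within the modelled allocation."""
    return allocation[base_offset + product]


int32_product = ctypes.c_int32(row * row_stride).value  # wraps to -2**31
int64_product = row * row_stride                         # exact

print("int32 path reads", read_byte(int32_product))  # start of the allocation
print("int64 path reads", read_byte(int64_product))  # true high address
```

In this model the wrapped product cancels the base offset exactly, so the int32 path lands on a valid byte at the start of the allocation instead of faulting, which is why a sentinel comparison can detect the bug deterministically.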

Test Plan:
```bash
CUDA_LAUNCH_BLOCKING=1 /home/ivankobzarev/local/b/pytorch-env/bin/python -m unittest tests.unit_tests.test_minimal_async_ep_kernels.TestMinimalAsyncEPKernels.test_copy_rows_uses_int64_for_source_stride_arithmetic
```

The same command failed before restoring the row promotion, with `AssertionError: 17 != 93`.

```bash
/home/ivankobzarev/local/b/pytorch-env/bin/python -m py_compile torchtitan/distributed/minimal_async_ep/kernels.py tests/unit_tests/test_minimal_async_ep_kernels.py
```

```bash
CUDA_LAUNCH_BLOCKING=1 /home/ivankobzarev/local/b/pytorch-env/bin/python -m unittest tests.unit_tests.test_minimal_async_ep_kernels
```

```bash
git diff --check
```

```bash
uv tool run --from pre-commit --with pyrefly==0.45.1 --with-requirements requirements.txt --with-requirements requirements-dev.txt pre-commit run --all-files --show-diff-on-failure
```

Authored with assistance from OpenAI Codex.
@IvanKobzarev force-pushed the fix-mamoe-int32-overflow branch from 43ec06b to 5adde9a June 24, 2026 16:15

@tianyu-l tianyu-l left a comment

unblock

Labels

ciflow/8gpu, CLA Signed

4 participants