Fix x86 CPU CI OOM by capping Inductor compile threads by gchalump · Pull Request #5930 · pytorch/FBGEMM

gchalump · 2026-06-18T17:03:11Z

Summary:
The FBGEMM_GPU-CPU CI (fbgemm_gpu_ci_cpu.yml) test job dies on every x86
linux.4xlarge matrix cell with "The self-hosted runner lost communication
with the server", while every arm linux.arm64.m7g.4xlarge cell passes the
identical tests.

Root cause is memory, not a code bug:

- x86 linux.4xlarge is a c5.4xlarge: 16 vCPU / 32 GB RAM.
- arm linux.arm64.m7g.4xlarge is an m7g.4xlarge: 16 vCPU / 64 GB RAM.
- The torch.compile-heavy TBE tests (e.g. test_backward_adagrad_*, compile=True)
  trigger TorchInductor, which by default spawns one C++ compile worker per
  vCPU (16). Peak memory blows past 32 GB on x86 and the kernel OOM-killer
  takes down the runner agent, so the job log cuts off mid-test with no error
  and the server later reports "lost communication". The 64 GB arm box
  absorbs the same spike and passes.
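
The memory argument above can be made concrete with a back-of-the-envelope sketch. Only the vCPU and RAM figures come from the instance specs listed above; the per-worker peak (3 GB) and the headroom reserved for the test process and OS (4 GB) are hypothetical placeholders, not measurements from this CI.

```python
def max_safe_workers(ram_gb: float, vcpus: int,
                     per_worker_gb: float, reserved_gb: float) -> int:
    """Largest worker count whose combined peak fits in RAM.

    The result is capped at the vCPU count and never drops below 1.
    """
    budget_gb = ram_gb - reserved_gb
    fit = int(budget_gb // per_worker_gb)
    return max(1, min(vcpus, fit))


# Hypothetical figures, chosen only to illustrate the shape of the problem.
PER_WORKER_GB = 3.0
RESERVED_GB = 4.0

# c5.4xlarge: 16 vCPU / 32 GB, so 16 workers would not fit.
x86_workers = max_safe_workers(32, 16, PER_WORKER_GB, RESERVED_GB)
# m7g.4xlarge: 16 vCPU / 64 GB, so all 16 workers fit.
arm_workers = max_safe_workers(64, 16, PER_WORKER_GB, RESERVED_GB)

print(x86_workers, arm_workers)
```

Under these assumed numbers the x86 runner cannot hold one worker per vCPU while the arm runner can, which matches the observed pass/fail split. Real per-worker peaks would need to be measured before picking a cap this way.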

Job log evidence: the log terminates mid-hypothesis-example during
test_backward_adagrad_fp32_pmNONE_cpu (~16 min into PyTest, before the 20 min
step timeout), with zero error/OOM/traceback markers -- the signature of an
abrupt host death.

Fix: set TORCHINDUCTOR_COMPILE_THREADS=1 on the "Test with PyTest" step to
serialize Inductor's compile workers. This keeps peak memory within the 32 GB
budget without changing test coverage, the matrix, or the runner instance type
(so no CI cost increase). If compile time in these tests becomes a concern,
the value can be raised (e.g. 2-4) as long as peak memory stays under the OOM
threshold.
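
To reproduce the capped behaviour locally, the same variable can be set for a child PyTest process. This is a minimal sketch, not the change made in the workflow file: the helper name `env_with_compile_cap` and the test path in the usage comment are hypothetical. The variable is placed in the child's environment before the process starts, so it is already set whenever Inductor reads it.

```python
import os
from typing import Mapping, Optional


def env_with_compile_cap(threads: int,
                         base: Optional[Mapping[str, str]] = None) -> dict:
    """Return a copy of the environment with the Inductor compile-thread cap set."""
    if threads < 1:
        raise ValueError("threads must be >= 1")
    env = dict(os.environ if base is None else base)
    # Environment values must be strings.
    env["TORCHINDUCTOR_COMPILE_THREADS"] = str(threads)
    return env


capped = env_with_compile_cap(1, base={"PATH": "/usr/bin"})

# Usage (the test path is a placeholder, not the real FBGEMM layout):
#   import subprocess, sys
#   subprocess.run([sys.executable, "-m", "pytest", "path/to/tests"],
#                  env=env_with_compile_cap(1), check=True)
```

Raising the argument to 2-4 gives the tuning knob mentioned above without touching the test code.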

Differential Revision: D109038088

meta-codesync · 2026-06-18T17:03:20Z

@gchalump has exported this pull request. If you are a Meta employee, you can view the originating Diff in D109038088.

meta-cla Bot added the cla signed label Jun 18, 2026

meta-codesync Bot added the meta-exported label Jun 18, 2026

gchalump wants to merge 1 commit into pytorch:main from gchalump:export-D109038088
