Fix x86 CPU CI OOM by capping Inductor compile threads #5930

Open
gchalump wants to merge 1 commit into pytorch:main from gchalump:export-D109038088
Conversation

@gchalump

Contributor

Summary:
The FBGEMM_GPU-CPU CI (fbgemm_gpu_ci_cpu.yml) test job dies on every x86
linux.4xlarge matrix cell with "The self-hosted runner lost communication
with the server", while every arm linux.arm64.m7g.4xlarge cell passes the
identical tests.

Root cause is memory, not a code bug:

  • x86 linux.4xlarge is a c5.4xlarge: 16 vCPU / 32 GB RAM.
  • arm linux.arm64.m7g.4xlarge is an m7g.4xlarge: 16 vCPU / 64 GB RAM.
  • The torch.compile-heavy TBE tests (e.g. test_backward_adagrad_*, compile=True)
    trigger TorchInductor, which by default spawns one C++ compile worker per
    vCPU (16 on these runners). Peak memory exceeds 32 GB on x86, and the kernel
    OOM killer takes down the runner agent, so the job log cuts off mid-test with
    no error and the server later reports "lost communication". The 64 GB arm
    box absorbs the same spike and passes.
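
The memory argument above can be illustrated with a rough budget calculation. This is only a sketch: the per-worker peak and the baseline footprint are hypothetical figures chosen for illustration, not measurements from the CI run, and the helper name `estimated_peak_gb` is invented here.

```python
def estimated_peak_gb(workers, per_worker_gb, baseline_gb):
    """Rough peak memory: fixed baseline plus one compile worker's peak per worker."""
    return baseline_gb + workers * per_worker_gb


# Hypothetical figures (assumptions, not measured values):
ASSUMED_PER_WORKER_GB = 2.5  # peak of a single C++ compile worker
ASSUMED_BASELINE_GB = 6.0    # test process, OS, and runner agent

X86_RAM_GB = 32  # c5.4xlarge, from the summary
ARM_RAM_GB = 64  # m7g.4xlarge, from the summary

peak_16 = estimated_peak_gb(16, ASSUMED_PER_WORKER_GB, ASSUMED_BASELINE_GB)
peak_1 = estimated_peak_gb(1, ASSUMED_PER_WORKER_GB, ASSUMED_BASELINE_GB)

print(f"16 workers: {peak_16} GB, fits x86: {peak_16 <= X86_RAM_GB}, fits arm: {peak_16 <= ARM_RAM_GB}")
print(f" 1 worker:  {peak_1} GB, fits x86: {peak_1 <= X86_RAM_GB}")
```

Under these assumed figures, 16 workers overshoot the 32 GB box but fit in 64 GB, which matches the x86-fails/arm-passes pattern; the real per-worker footprint would need to be measured on the runner.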

Job log evidence: the log terminates mid-hypothesis-example during
test_backward_adagrad_fp32_pmNONE_cpu (~16 min into PyTest, before the 20 min
step timeout), with zero error/OOM/traceback markers -- the signature of an
abrupt host death.

Fix: set TORCHINDUCTOR_COMPILE_THREADS=1 on the "Test with PyTest" step so that
Inductor compiles serially instead of fanning out to parallel workers. This
keeps peak memory within the 32 GB budget without changing test coverage, the
matrix, or the runner instance type (so no CI cost increase). If test runtime
on the compile-heavy tests becomes a concern, the value can be raised (e.g. 2-4)
as long as peak memory stays under the OOM threshold.
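
For reproducing the cap outside CI, the same environment variable can be set from Python. The sketch below is not part of this PR: the helper name `cap_inductor_compile_threads` is hypothetical, and it assumes the variable is set before `torch` (and hence Inductor's config) is imported, since the value is read from the environment.

```python
import os


def cap_inductor_compile_threads(limit=1):
    """Set TORCHINDUCTOR_COMPILE_THREADS unless the caller already chose a value.

    Returns the value that is in effect afterwards.
    """
    if limit < 1:
        raise ValueError("limit must be at least 1")
    # setdefault leaves an explicit value from the shell or CI step untouched.
    os.environ.setdefault("TORCHINDUCTOR_COMPILE_THREADS", str(limit))
    return int(os.environ["TORCHINDUCTOR_COMPILE_THREADS"])


effective = cap_inductor_compile_threads(1)
print(f"TORCHINDUCTOR_COMPILE_THREADS={effective}")

# Import torch only after this point so the cap is visible to Inductor:
# import torch
```

Using `setdefault` keeps the CI step's own setting authoritative, so a later decision to raise the value in the workflow is not silently overridden by local tooling.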

Differential Revision: D109038088

@meta-cla meta-cla Bot added the cla signed label Jun 18, 2026
@meta-codesync

meta-codesync Bot commented Jun 18, 2026

Contributor

@gchalump has exported this pull request. If you are a Meta employee, you can view the originating Diff in D109038088.
