Fix x86 CPU CI OOM by capping Inductor compile threads #5930
Open
gchalump wants to merge 1 commit into
Summary:

The FBGEMM_GPU-CPU CI (fbgemm_gpu_ci_cpu.yml) test job dies on every x86 `linux.4xlarge` matrix cell with "The self-hosted runner lost communication with the server", while every arm `linux.arm64.m7g.4xlarge` cell passes the identical tests.

Root cause is memory, not a code bug:

- x86 linux.4xlarge is a c5.4xlarge: 16 vCPU / 32 GB RAM.
- arm linux.arm64.m7g.4xlarge is an m7g.4xlarge: 16 vCPU / 64 GB RAM.
- The torch.compile-heavy TBE tests (e.g. test_backward_adagrad_*, compile=True) trigger TorchInductor, which by default spawns one C++ compile worker per vCPU (16). Peak memory blows past 32 GB on x86 and the kernel OOM-killer takes down the runner agent, so the job log cuts off mid-test with no error and the server later reports "lost communication". The 64 GB arm box absorbs the same spike and passes.

Job log evidence: the log terminates mid-hypothesis-example during test_backward_adagrad_fp32_pmNONE_cpu (~16 min into PyTest, before the 20 min step timeout), with zero error/OOM/traceback markers -- the signature of an abrupt host death.

Fix: set TORCHINDUCTOR_COMPILE_THREADS=1 on the "Test with PyTest" step to serialize Inductor's compile workers. This keeps peak memory within the 32 GB budget without changing test coverage, the matrix, or the runner instance type (so no CI cost increase). If build time on the compile tests becomes a concern, the value can be tuned up (e.g. 2-4) as long as it stays under the OOM threshold.

Differential Revision: D109038088
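For anyone who wants to try the same cap outside CI, here is a minimal sketch of launching PyTest with `TORCHINDUCTOR_COMPILE_THREADS` set in the child process environment. It is not part of this change: the helper names `capped_compile_env` and `run_pytest` are hypothetical, and the sketch assumes that setting the variable before the test process starts is enough for Inductor to pick it up.

```python
import os
import subprocess
import sys


def capped_compile_env(threads=1, base_env=None):
    """Return a copy of the environment with the Inductor compile-thread cap set.

    `threads=1` mirrors the value used in this PR; larger values (e.g. 2-4)
    trade more peak memory for shorter compile time.
    """
    if threads < 1:
        raise ValueError("threads must be >= 1")
    # Copy so the caller's (or the parent process's) environment is not mutated.
    env = dict(os.environ if base_env is None else base_env)
    env["TORCHINDUCTOR_COMPILE_THREADS"] = str(threads)
    return env


def run_pytest(test_args, threads=1):
    """Run pytest in a child process that starts with the cap already in place."""
    cmd = [sys.executable, "-m", "pytest", *test_args]
    # check=False: return pytest's exit code instead of raising on test failures.
    return subprocess.run(cmd, env=capped_compile_env(threads), check=False).returncode
```

Calling `run_pytest([...], threads=1)` with the test paths of interest runs them under the same setting as the "Test with PyTest" step. Raising `threads` is a quick way to check how much headroom a given machine has before it approaches the OOM threshold.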
@gchalump has exported this pull request. If you are a Meta employee, you can view the originating Diff in D109038088.