perf(pt_expt): use inductor+dynamic for torch.compile training #5393

wanghan-iapcm wants to merge 2 commits into deepmodeling:master
Conversation
Add max_autotune, epilogue_fusion, triton.cudagraphs, max_fusion_size options to match the reference implementation.
📝 Walkthrough

Training compilation now uses shape‑polymorphic symbolic FX tracing and fully dynamic compilation.

Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks: ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
source/tests/pt_expt/test_training.py (1)
167-177: ⚠️ Potential issue | 🟡 Minor: Add the required 60s timeout to this training test.
This new training-path test does not set the repository’s required timeout, so a compile regression can hang CI instead of failing fast.
As per coding guidelines,
**/tests/**/*training*.py: Set training test timeouts to 60 seconds maximum for validation purposes, as real training takes hours or days.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@source/tests/pt_expt/test_training.py` around lines 167 - 177, This training test lacks the repository-required 60s timeout; add a 60-second timeout to test_compiled_handles_varying_nall (in class TestCompiledDynamicShapes) by decorating the test method with a timeout decorator (e.g., `@pytest.mark.timeout`(60)) and import pytest at the top if not present, so the test will fail fast instead of hanging CI.
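For illustration, the requested fix might look like the following sketch (it assumes the `pytest-timeout` plugin, which provides `pytest.mark.timeout`; the test body is elided):

```python
import pytest


# Hypothetical sketch of the requested change: cap the training test at
# 60 seconds so a compile regression fails fast instead of hanging CI.
@pytest.mark.timeout(60)
def test_compiled_handles_varying_nall():
    # ... run a few compiled training steps and assert the loss is finite ...
    pass
```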
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 2f7348ce-2f4d-4c7f-b068-a1d65cf52d0b
📒 Files selected for processing (2)

- deepmd/pt_expt/train/training.py
- source/tests/pt_expt/test_training.py
```diff
  # Run several training steps — each may have different nall
  trainer.wrapper.train()
- trainer.optimizer.zero_grad(set_to_none=True)
- inp, lab = trainer.get_data(is_train=True)
- lr = trainer.scheduler.get_last_lr()[0]
- _, loss, more_loss = trainer.wrapper(**inp, cur_lr=lr, label=lab)
- loss.backward()
- trainer.optimizer.step()
-
- # max_nall should have grown beyond 1
- new_max_nall = compiled_model._max_nall
- self.assertGreater(new_max_nall, 1)
-
- # compiled_forward_lower should be a new object
- self.assertIsNot(
-     compiled_model.compiled_forward_lower,
-     old_compiled_lower,
- )
-
- # Loss should be a finite scalar
- self.assertFalse(torch.isnan(loss))
- self.assertFalse(torch.isinf(loss))
+ for _ in range(3):
+     trainer.optimizer.zero_grad(set_to_none=True)
+     inp, lab = trainer.get_data(is_train=True)
+     lr = trainer.scheduler.get_last_lr()[0]
+     _, loss, _more_loss = trainer.wrapper(**inp, cur_lr=lr, label=lab)
+     loss.backward()
+     trainer.optimizer.step()
+
+ # Loss should be a finite scalar
+ self.assertFalse(torch.isnan(loss))
+ self.assertFalse(torch.isinf(loss))
```
Make this loop prove the dynamic-shape path was exercised.
Right now the test only checks that three compiled steps produce finite loss. It never asserts that any step actually hit a different nall, so it can pass without covering the regression this PR is meant to prevent. It also calls trainer.optimizer.step() directly, which skips Trainer._optimizer_step()’s scheduler update and @torch.compiler.disable guard, so the test no longer matches the real training path. Please record/assert at least two distinct observed nall values and use the trainer helper for stepping.
Minimal fix for the stepping path:

```diff
- trainer.optimizer.step()
+ trainer._optimizer_step()
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Suggested change:

```python
# Run several training steps — each may have different nall
trainer.wrapper.train()
for _ in range(3):
    trainer.optimizer.zero_grad(set_to_none=True)
    inp, lab = trainer.get_data(is_train=True)
    lr = trainer.scheduler.get_last_lr()[0]
    _, loss, _more_loss = trainer.wrapper(**inp, cur_lr=lr, label=lab)
    loss.backward()
    trainer._optimizer_step()

# Loss should be a finite scalar
self.assertFalse(torch.isnan(loss))
self.assertFalse(torch.isinf(loss))
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@source/tests/pt_expt/test_training.py` around lines 199 - 211, The loop needs
to record and assert that the dynamic-shape path ran by capturing distinct nall
values returned by the model and use the trainer's stepping helper instead of
calling optimizer.step() directly: inside the loop keep a set (e.g.
observed_nall) and extract nall from the wrapper's extra output (the
_more_loss/aux dict returned by trainer.wrapper(**inp, cur_lr=lr, label=lab)),
add it to the set, call the trainer's stepping helper (use
Trainer._optimizer_step or the public trainer.step helper instead of
trainer.optimizer.step()), and after the loop assert that len(observed_nall) >=
2 to prove at least two different nall values were seen; also retain the
existing finite-loss assertions.
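Stripped of the trainer specifics, the pattern this comment asks for is collecting the per-step nall values in a set and asserting that at least two distinct values were seen. A self-contained illustration (the nall values here are made up, and the training step is elided):

```python
from itertools import cycle

# stand-in for "each batch may report a different nall"
nall_per_batch = cycle([96, 128, 160])

observed_nall = set()
for _ in range(3):
    nall = next(nall_per_batch)
    observed_nall.add(nall)
    # ... in the real test: forward, backward, trainer._optimizer_step() ...

# at least two distinct nall values prove the dynamic-shape path ran
assert len(observed_nall) >= 2
```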
♻️ Duplicate comments (1)
deepmd/pt_expt/train/training.py (1)
312-312: ⚠️ Potential issue | 🟠 Major: Remove or use the unused local at line 312.

`actual_nall` is assigned but never read. This has already been flagged by prior scanning and may fail lint/static-analysis gates.

Suggested change:

```diff
- actual_nall = ext_coord.shape[1]
  out: dict[str, torch.Tensor] = {}
```

As per coding guidelines:

**/*.py: Install the linter and run `ruff check .` before committing changes or the CI will fail.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deepmd/pt_expt/train/training.py` at line 312, Remove the unused local variable assignment actual_nall = ext_coord.shape[1] (it is assigned but never read); either delete this line or replace its usage where intended—search for actual_nall or ext_coord in the surrounding function in training.py and, if a column count is needed, use ext_coord.shape[1] directly or assign to a used variable name so the value is consumed.
🧹 Nitpick comments (1)
deepmd/pt_expt/train/training.py (1)
221-225: Avoid mutating caller-owned `compile_opts` in place.

`pop()`/`setdefault()` currently mutate the dict from the training config. A local copy is safer and avoids side effects if the same config is reused.

Suggested change:

```diff
 def _trace_and_compile(
@@
-    # Override backend and dynamic — the inductor backend with
+    # Work on a local copy to avoid mutating caller-owned config.
+    compile_opts = deepcopy(compile_opts)
+
+    # Override backend and dynamic — the inductor backend with
     # dynamic=True handles varying shapes automatically.
     compile_opts.pop("dynamic", None)
     compile_opts.pop("backend", None)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deepmd/pt_expt/train/training.py` around lines 221 - 225, The code mutates the caller-owned dict compile_opts using pop() and by adding "options"; instead create a shallow copy (e.g., local_compile_opts = compile_opts.copy()) and operate on that copy, then use local_compile_opts.pop("dynamic", None), local_compile_opts.pop("backend", None) and ensure "options" exists on local_compile_opts before setting opts = local_compile_opts["options"]; leave the original compile_opts untouched so callers can reuse it.
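A runnable sketch of this nitpick (the helper name and config keys are illustrative, not the trainer's actual API):

```python
import copy


def normalize_compile_opts(compile_opts):
    """Illustrative helper: strip keys the trainer overrides, without
    mutating the caller-owned dict."""
    compile_opts = copy.deepcopy(compile_opts)
    compile_opts.pop("dynamic", None)
    compile_opts.pop("backend", None)
    compile_opts.setdefault("options", {})
    return compile_opts


cfg = {"backend": "inductor", "dynamic": True}
local = normalize_compile_opts(cfg)

# the caller's config is untouched; only the local copy was modified
assert cfg == {"backend": "inductor", "dynamic": True}
assert local == {"options": {}}
```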
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 547d2b4a-1e06-47cc-9b70-259db8391305
📒 Files selected for processing (1)
deepmd/pt_expt/train/training.py
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           master    #5393      +/-   ##
==========================================
- Coverage   80.33%   80.32%    -0.02%
==========================================
  Files         819      819
  Lines       85356    85316       -40
  Branches     4139     4139
==========================================
- Hits        68571    68529       -42
- Misses      15509    15510        +1
- Partials     1276     1277        +1
```

☔ View full report in Codecov by Sentry.
Summary

- Replace the `aot_eager` backend + manual nall padding with the `inductor` backend + `dynamic=True` for training compilation
- Use `make_fx(tracing_mode="symbolic")` instead of `tracing_mode="real"` to capture shape-polymorphic ops
- Set the inductor options `shape_padding=True`, `max_autotune=False`, `epilogue_fusion=False`, `triton.cudagraphs=False`, `max_fusion_size=8`
- Remove the manual recompilation machinery (`_CompiledModel._recompile`, max_nall estimation from 20 sampled batches, etc.)

Speed Benchmark (V100 GPU, se_atten_compressible: rcut=6, sel=120, fitting=[240,240,240], float64)
Convergence Benchmark (1000 steps, stop_lr=1e-4, se_atten_compressible)
Force validation RMSE (`rmse_f_val`): All three converge to similar levels. Spread at step 1000 is within normal run-to-run variation from random batch ordering.
Test plan
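The symbolic-tracing change described in the summary can be sketched as follows (a minimal example, not the trainer's actual tracing code):

```python
import torch
from torch.fx.experimental.proxy_tensor import make_fx


def f(x):
    return x.sum(dim=-1)


# tracing_mode="symbolic" records sizes as SymInts, so the traced graph
# stays shape-polymorphic when the leading dimension (e.g. nall) varies,
# whereas tracing_mode="real" bakes in the example input's concrete shapes.
gm = make_fx(f, tracing_mode="symbolic")(torch.zeros(4, 3))
```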