perf(pt2): optimize .pt2 C++ inference path #5407
wanghan-iapcm wants to merge 45 commits into deepmodeling:master
Conversation
The reshape-first approach changed the error from a custom ValueError to a generic reshape error, breaking test_self_exception assertions.
- Remove unused _min_nbor_dist variables in main.py
- Remove unused _MULTITASK_JSON/_MULTITASK_SHAREFIT_JSON globals
- Fix duplicate unittest import (use unittest.mock.patch directly)
- Bind test socket to 127.0.0.1 instead of all interfaces
- Remove redundant nframes assignment in _compile_model
Under DDP self.wrapper is DistributedDataParallel which has no .model attribute. Use .module to access the underlying ModelWrapper.
Cover the _compile_model DDP unwrap fix with single-task and multi-task tests that enable_compile=True under 2-rank gloo DDP.
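The unwrap described above can be sketched with stand-in classes (hypothetical names; only the `.module`-vs-`.model` attribute layout is taken from the commit message, not the actual code):

```python
class ModelWrapper:
    """Stand-in for the training wrapper that owns the model."""
    def __init__(self, model):
        self.model = model

class FakeDDP:
    """Stand-in for torch.nn.parallel.DistributedDataParallel, which
    exposes the wrapped module as `.module` (it has no `.model`)."""
    def __init__(self, module):
        self.module = module

def unwrap(wrapper):
    # Under DDP the ModelWrapper sits one level down on `.module`;
    # without DDP the wrapper itself is the ModelWrapper.
    inner = getattr(wrapper, "module", wrapper)
    return inner.model

model = object()
assert unwrap(ModelWrapper(model)) is model
assert unwrap(FakeDDP(ModelWrapper(model))) is model
```

The `getattr(..., "module", wrapper)` form handles both the single-process and DDP cases without an explicit isinstance check.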
Replace aot_eager+padding+manual recompile with symbolic make_fx + torch.compile(backend="inductor", dynamic=True). The compiled graph natively handles varying nframes/nloc/nall so the per-batch padding and runtime _recompile pass can be removed. Use a trace-time nframes of 7 (prime) and reshape with -1 in dpmodel (general_fitting, env_mat) to prevent PyTorch's symbolic tracer from unifying the batch dim with numb_fparam / numb_aparam / ntypes / dim_case_embd. Add TestCompiledVaryingNframesWithParams covering collisions with fparam/aparam, and TestCompileCaseEmbdVaryingNframes covering dim_case_embd > 0 with runtime nframes matching the embed dim.
…modeling#5393 Add silut/custom_silu support to _torch_activation using native torch ops (sigmoid, tanh, where) so the custom silu stays traceable by make_fx / torch.export. Cross-backend consistency tests cover multiple thresholds across the silu/tanh branches, and a pt_expt unit file exercises default/custom threshold, gradient flow, make_fx, and torch.export. Also port DescrptBlockRepformers accessor tests (get_rcut_smth, get_env_protection). The underlying accessor methods already exist on this branch; these tests guard against regressions.
Extend the compiled-vs-uncompiled assertions in TestCompiledConsistency (single-task) and _check_compile_correctness (multi-task) to also cover ``atom_energy`` and the reduced ``virial``. Atomic virial is intentionally not exercised because training never sets ``do_atomic_virial=True``.
…deling#5393 Adds the remaining tests from PR deepmodeling#5393 that were not yet on this branch: ``test_training_loop_compiled_silu`` (silu activation under torch.compile) and ``TestCompiledVaryingNatoms`` (compiled training across systems with different atom counts). Also drops a stray unused ``threshold`` variable in ``test_silut_below_threshold_is_silu`` to match the upstream PR.
Replace the two finite-loss smoke tests with a single test that builds both trainers, syncs weights, and per-step asserts identical predictions, loss, and per-parameter gradients (second-order through the force loss). Also add a silu full-model consistency test and write virial.npy in the small synthetic system so the virial passthrough is exercised on every step. Factor the prediction/grad comparison loops into shared helpers.
Parametrize TestCompiledVaryingNatoms over se_e2_a, DPA2 and DPA3 with strict atol=rtol=1e-10 on float64 (machine epsilon). DPA1 (se_atten) is intentionally omitted: its compiled path is intermittently incorrect (~20% of compiles produce grad diffs up to 0.67 at the first embedding layer), and including it would have required masking the bug with a loose tolerance.
Enable use_three_body=True on the DPA2 varying-natoms test so the compiled three-body neighbor path is also covered. three_body_rcut=3.0 matches the repformer rcut and is large enough to find neighbors in the 6-atom small system (~2.75Å nearest-neighbor distance).
…ckward decomp Revert (-1, nloc*nnei, ...) reshapes back to (nf, -1, ...) in env_mat.py and general_fitting.py. The -1-for-nf pattern breaks zero-atom systems: numpy cannot infer -1 when other dims multiply to zero (0/0), and torch.export shape assertions hit Mod(0,0). Using nf is safe because _TRACE_NFRAMES=7 already prevents symbolic-tracer specialisation during training compile. Add silu_backward decomposition table to make_fx in training.py so inductor can compile second-order gradients through silu without requiring a fused higher-order derivative kernel.
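The zero-atom failure mode is easy to reproduce in plain NumPy (toy shapes, not the project's actual arrays):

```python
import numpy as np

nf = 7    # frames, matching the spirit of _TRACE_NFRAMES
nloc = 0  # zero-atom system
flat = np.zeros(nf * nloc * 4)

# Explicit frame count: the remaining -1 resolves to 0 unambiguously.
ok = flat.reshape(nf, -1, 4)
assert ok.shape == (7, 0, 4)

# Inferring the frame dimension instead: -1 would need size / (0 * 4),
# a 0/0 division NumPy refuses to resolve.
try:
    flat.reshape(-1, nloc, 4)
    raised = False
except ValueError:
    raised = True
assert raised
```

This is why `(nf, -1, ...)` is safe for zero-atom systems while `(-1, nloc * nnei, ...)` is not.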
Replace assert with if/raise ValueError for user-facing config validation (data_stat_protect, finetune branch/head checks). Wrap train() in try/finally for destroy_process_group cleanup. Add parents=True, exist_ok=True to stat_file mkdir. Add strict=True to zip() calls. Fix minor test issues.
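The assert-to-raise conversion can be sketched as follows (the parameter name comes from the commit message; its exact semantics are assumed for illustration):

```python
def check_data_stat_protect(value: float) -> None:
    # Before: `assert value >= 0` -- silently stripped under `python -O`,
    # and a config mistake surfaces as a bare AssertionError.
    # After: an explicit, user-facing error that always fires.
    if value < 0:
        raise ValueError(
            f"data_stat_protect must be non-negative, got {value}"
        )

check_data_stat_protect(1e-2)  # a valid config passes through silently
```

Unlike `assert`, the `if`/`raise` form survives optimized bytecode and lets the message name the offending config key.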
Match the dpmodel try/except pattern so shape mismatches produce a clear error instead of a raw RuntimeError from torch.view.
DPA1's se_atten attention path produces incorrect force gradients under inductor compile. Add get_numb_attn_layer() API and isinstance guard to reject attn_layer>0 at compile time with a clear error message. DPA1 with attn_layer=0 compiles correctly and is now tested. The guard also covers se_atten_v2 which inherits from DPA1.
DPA1 with se_atten attention layers compiles correctly under inductor+dynamic — the _check_varying_natoms test passes at 1e-10 tolerance for attn_layer=2. The guard added in 1e694a3 was based on an inconclusive diagnostic; replace the rejection test with a compile-correctness test for DPA1 with attention.
The compile guard that used this method was removed in 6d39ddf.
Use ``pt/model/water`` instead of the ``pt/water`` symlink in test_fitting_stat.py to avoid FileNotFoundError on CI. Remove unused ``get_numb_attn_layer()`` from DescrptDPA1.
…faults After Graph.erase_node() stale C-level prev/next pointers may remain on neighbouring Node objects. Dynamo re-tracing can dereference them and segfault. Rebuild into a fresh graph to eliminate stale pointers.
- max_fusion_size 8 → 64 to avoid scheduler timeouts on large descriptors
- add triton.mix_order_reduction=False for PyTorch <=2.11 bugs (pytorch/pytorch#174379, #178080, #179494)
- hardcode defaults, let user compile_options override per-key
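The per-key override semantics can be sketched with a plain dict merge (the values below are placeholders; the merge order, not the numbers, is the point):

```python
# Hardcoded defaults apply unless the user's compile_options sets the key.
defaults = {"max_fusion_size": 8}
user_compile_options = {"max_fusion_size": 64}

# Later keys win, so any user-supplied key overrides the default
# while untouched defaults remain in effect.
effective = {**defaults, **user_compile_options}

assert effective["max_fusion_size"] == 64
```

Merging this way means adding a new default never clobbers a user's explicit choice for the same key.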
DDPOptimizer splits the inner compiled graph at bucket boundaries, producing subgraph outputs with symbolic integers that crash AOT Autograd (pytorch/pytorch#134182).
max_fusion_size=64 causes DPA3 force divergence and triton.mix_order_reduction=False causes DPA1-attention divergence (both on CPU, float64, 1e-10 tolerance). Revert to max_fusion_size=8 and remove mix_order_reduction (kept as comment for GPU users). Also revert test_fitting_stat.py path back to pt/water symlink which was confirmed working at c2efbf1.
Remove os.environ["DEVICE"] = "cpu" from all worker functions and replace hardcoded backend="gloo" with auto-detected _DDP_BACKEND (nccl on CUDA, gloo on CPU). env.py now evaluates DEVICE naturally based on hardware availability.
…links Move torch._dynamo.config.optimize_ddp = False from module level into _compile_model() so it only applies when compile is active. Resolve symlinks in test_fitting_stat.py data paths for reliable CI access.
TestFparam.tearDown deleted the committed fparam.npy, breaking other tests (e.g. test_fitting_stat) running later in the same CI shard. Also revert the unnecessary .resolve() workaround in test_fitting_stat.
NCCL rejects two ranks sharing the same GPU device, causing all DDP tests to fail on single-GPU CI runners. Skip the entire module when the backend is NCCL and device_count < 2.
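The skip condition reduces to a small predicate (a hedged sketch; the backend string and device count are stand-ins for the `torch.distributed`/`torch.cuda` probes the tests would actually use):

```python
def should_skip_ddp_module(backend: str, device_count: int) -> bool:
    # NCCL rejects two ranks sharing one GPU, so a 2-rank test needs
    # at least 2 devices; gloo runs both ranks on CPU and is unaffected.
    return backend == "nccl" and device_count < 2

assert should_skip_ddp_module("nccl", 1)       # single-GPU CI runner
assert not should_skip_ddp_module("nccl", 2)   # multi-GPU runner
assert not should_skip_ddp_module("gloo", 0)   # CPU-only runner
```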
Three optimizations for .pt2 (AOTInductor) C++ inference:
1. Replace buildTypeSortedNlist with createNlistTensor: avoids expensive CPU-side distance computation and type-based sorting. The .pt2 model's compiled graph contains format_nlist, which handles distance sorting on-device. The new createNlistTensor pads/truncates to the expected nnei (= sum(sel)) stored in model metadata, since torch.export specializes nnei as a static dimension.
2. Export with do_atomic_virial=False by default: avoids 3 extra torch.autograd.grad backward passes in atomic_virial_corr(). The reduced virial (energy_derv_c_redu) is still correct without the correction (it sums to zero). Add --atomic-virial flag to dp convert-backend for users who need per-atom virial (~2.5x cost). C++ raises an error if atomic virial is requested but the model lacks it.
3. Cache the mapping tensor: only rebuild when ago == 0 (nlist updated), avoiding redundant CPU-to-device copies every step.
Also adds nnei and do_atomic_virial to .pt2 metadata for C++ to read.
Codecov Report
❌ Patch coverage is
@@ Coverage Diff @@
## master #5407 +/- ##
==========================================
+ Coverage 80.46% 82.35% +1.88%
==========================================
Files 820 820
Lines 86005 86215 +210
Branches 4139 4184 +45
==========================================
+ Hits 69207 71003 +1796
+ Misses 15521 13936 -1585
+ Partials 1277 1276 -1
Instead of truncating the neighbor list to sum(sel) in C++ (which drops neighbors without distance sorting), make the nnei dimension dynamic in torch.export. The .pt2 model's compiled format_nlist now sorts by distance on-device and truncates to sum(sel), matching the .pth behavior. C++ createNlistTensor pads to at least nnei+1 columns so the format_nlist sort branch is always active.
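A toy NumPy sketch of the sort-and-truncate that the compiled format_nlist performs on-device (shapes, coordinates, and the sum(sel) value are illustrative, not the project's):

```python
import numpy as np

rng = np.random.default_rng(0)
coords = rng.random((8, 3))  # toy atom coordinates
center = coords[0]
nnei = 3                     # stands in for sum(sel)

# Oversized, unsorted neighbor list for the center atom; -1 is padding.
nlist = np.array([5, 1, 7, 2, 4, -1, 6])

# Distance to each listed neighbor, with padding pushed to the back.
dist = np.where(
    nlist >= 0,
    np.linalg.norm(coords[np.maximum(nlist, 0)] - center, axis=-1),
    np.inf,
)

# Sort by distance, then truncate to the nnei closest neighbors.
order = np.argsort(dist, kind="stable")
formatted = nlist[order][:nnei]

assert formatted.shape == (nnei,)
assert bool(np.all(formatted >= 0))               # real neighbors survive
assert bool(np.all(np.diff(dist[order][:nnei]) >= 0))  # closest-first
```

Naively slicing `nlist[:nnei]` instead would keep whatever happened to occupy the first columns, which is exactly the dropped-neighbor bug the dynamic-nnei export avoids.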
Actionable comments posted: 2
🧹 Nitpick comments (2)
source/api_cc/src/DeepSpinPTExpt.cc (2)
322-326: Also cache firstneigh_tensor when ago != 0.
nlist_data.jlist is only mutated inside the if (ago == 0) block, so for subsequent MD steps with the same neighbor list this rebuilds the exact same [1, nloc, nnei+1] int32 tensor, materializes it on CPU, and copies it to the device every step. Same rationale as the mapping_tensor caching just above; worth extending to avoid the per-step H2D copy.
♻️ Proposed caching (mirrors mapping_tensor)
  if (ago == 0) {
    nlist_data.copy_from_nlist(lmp_list, nall - nghost);
    nlist_data.shuffle_exclude_empty(fwd_map);
    nlist_data.padding();
+   firstneigh_tensor = createNlistTensor(nlist_data.jlist, nnei + 1)
+                           .to(torch::kInt64)
+                           .to(device);
    // Rebuild mapping tensor only when nlist is updated (ago == 0).
    ...
  }
- // Build raw nlist tensor ...
- at::Tensor firstneigh_tensor = createNlistTensor(nlist_data.jlist, nnei + 1)
-                                    .to(torch::kInt64)
-                                    .to(device);
(Declare firstneigh_tensor as a member alongside mapping_tensor.)
296-321: Optional: collapse the two mapping_tensor branches.
The only difference between the lmp_list.mapping and identity branches is how mapping[ii] is populated; the from_blob → clone → to(device) → assign-to-mapping_tensor tail is identical. Factoring out the tail would reduce duplication and the risk of future drift between the two paths.
♻️ Proposed
- if (lmp_list.mapping) {
-   std::vector<std::int64_t> mapping(nall_real);
-   for (int ii = 0; ii < nall_real; ii++) {
-     mapping[ii] = fwd_map[lmp_list.mapping[bkw_map[ii]]];
-   }
-   mapping_tensor =
-       torch::from_blob(mapping.data(), {1, nall_real}, int_option)
-           .clone()
-           .to(device);
- } else {
-   std::vector<std::int64_t> mapping(nall_real);
-   for (int ii = 0; ii < nall_real; ii++) {
-     mapping[ii] = ii;
-   }
-   mapping_tensor =
-       torch::from_blob(mapping.data(), {1, nall_real}, int_option)
-           .clone()
-           .to(device);
- }
+ std::vector<std::int64_t> mapping(nall_real);
+ if (lmp_list.mapping) {
+   for (int ii = 0; ii < nall_real; ii++) {
+     mapping[ii] = fwd_map[lmp_list.mapping[bkw_map[ii]]];
+   }
+ } else {
+   for (int ii = 0; ii < nall_real; ii++) {
+     mapping[ii] = ii;
+   }
+ }
+ mapping_tensor =
+     torch::from_blob(mapping.data(), {1, nall_real}, int_option)
+         .clone()
+         .to(device);
Inline comments:
In `@source/api_cc/src/DeepSpinPTExpt.cc`:
- Around line 412-418: The atomic-virial capability check (do_atomic_virial)
must be performed before invoking run_model to avoid wasted inference: in the
compute methods that handle atomic output (the branch gated by the atomic flag
in DeepSpinPTExpt::compute and the standalone path around the run_model call),
move the existing do_atomic_virial check to the top of the atomic-capable branch
immediately after you detect atomic==true (i.e., before building input tensors
and calling run_model), and remove the later redundant check that currently runs
after run_model returns; you may also consider performing this validation in
init() if convenient, but the minimal fix is to hoist the check above run_model
in the compute paths referencing run_model and do_atomic_virial.
- Around line 135-151: The loader currently falls back do_atomic_virial and
computes nnei but does not detect older .pt2 graphs that lack the new
format_nlist padding, which can cause silent incorrect inference; update
DeepSpinPTExpt.cc to read a metadata schema/version or explicitly check for a
"format_nlist" flag in metadata during load (before creating/passing raw padded
neighbor lists) and reject the archive with a clear error if the graph was
exported pre-format_nlist (or if metadata version < required_version),
referencing the existing symbols/do_atomic_virial, nnei, and format_nlist to
locate the logic that computes nnei and constructs neighbor lists, so
incompatible .pt2 files are detected and loading aborted rather than proceeding
silently.
📒 Files selected for processing (7)
deepmd/pt_expt/utils/serialization.py
source/api_cc/src/DeepPotPTExpt.cc
source/api_cc/src/DeepSpinPTExpt.cc
source/api_cc/src/commonPTExpt.h
source/tests/pt_expt/export_helpers.py
source/tests/pt_expt/model/test_ener_model.py
source/tests/pt_expt/model/test_export_pipeline.py
🚧 Files skipped from review as they are similar to previous changes (1)
- source/api_cc/src/DeepPotPTExpt.cc
The .pt2 model's compiled format_nlist expects n_nnei > sum(sel) to trigger the sort branch. Pad the nlist with one extra -1 column before calling the AOTInductor runner.
🧹 Nitpick comments (1)
deepmd/pt_expt/infer/deep_eval.py (1)
762-774: Deduplicate the nlist padding block.
The same 13-line padding snippet is repeated verbatim in _eval_model (lines 762-774) and _eval_model_spin (lines 914-926). Consider extracting a small helper (e.g. _pad_nlist_for_pt2(nlist_t)) to keep the two call sites in sync and document the "n_nnei > nnei triggers format_nlist sort branch" invariant in one place.
♻️ Proposed refactor
+ @staticmethod
+ def _pad_nlist_for_pt2(nlist_t: torch.Tensor) -> torch.Tensor:
+     """Pad nlist with one extra -1 column so n_nnei > nnei, ensuring
+     format_nlist's compiled sort branch executes in the .pt2 model."""
+     pad = -torch.ones(
+         (*nlist_t.shape[:2], 1), dtype=nlist_t.dtype, device=nlist_t.device
+     )
+     return torch.cat([nlist_t, pad], dim=-1)
Then at both call sites:
- nlist_t = torch.cat(
-     [
-         nlist_t,
-         -torch.ones(
-             (*nlist_t.shape[:2], 1),
-             dtype=nlist_t.dtype,
-             device=nlist_t.device,
-         ),
-     ],
-     dim=-1,
- )
+ nlist_t = self._pad_nlist_for_pt2(nlist_t)
📒 Files selected for processing (1)
deepmd/pt_expt/infer/deep_eval.py
…orted --atomic-virial Move the do_atomic_virial check before run_model in both DeepPotPTExpt and DeepSpinPTExpt so we fail fast without wasting GPU computation. Also reject --atomic-virial flag in convert_backend when the output backend doesn't support it.
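The fail-fast ordering can be sketched in Python (the real check lives in the C++ DeepPotPTExpt/DeepSpinPTExpt compute paths; names and return value here are illustrative):

```python
def compute(request_atomic_virial: bool, model_has_atomic_virial: bool):
    # Validate capabilities BEFORE any expensive model evaluation, so a
    # misconfigured request fails immediately instead of after a full
    # forward pass on the GPU.
    if request_atomic_virial and not model_has_atomic_virial:
        raise RuntimeError(
            "model was exported without atomic virial; "
            "re-export with --atomic-virial"
        )
    return "ran model"  # stands in for the run_model call

assert compute(False, False) == "ran model"
```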
Move nlist padding (+1 column of -1s) inside the `fn` closure in both `make_model.forward_common_lower_exportable` and `SpinModel.forward_common_lower_exportable`, making it part of the traced graph. This fixes proxy tensor shape mismatches from make_fx and removes the need for external padding in deep_eval.py. Also apply `_strip_shape_assertions` unconditionally (not just spin models) to remove spurious torch.export guards like Ne(nnei, sum(sel)). Export tests that verify atomic virial now pass `do_atomic_virial=True` to `deserialize_to_file` so the exported model includes the correction.
The test_models.py tests compare per-atom virial against reference values that include the atomic virial correction. Since convert_backend now defaults to atomic_virial=False for performance, test model generation must explicitly request atomic_virial=True.
Actionable comments posted: 1
Inline comments:
In `@deepmd/pt_expt/utils/serialization.py`:
- Around line 188-210: The dynamic nnei lower bound is incorrectly forced to 2;
update the min for the nnei dimension from max(2, model_nnei) to max(1,
model_nnei) inside _build_dynamic_shapes (the code that constructs nnei_dim for
torch.export) so single-neighbor models are allowed; also update the
corresponding metadata that records the nnei minimum (the place that currently
sets the metadata nnei bound to 2) and any call sites or uses that assume the
hard-coded 2 (including where _build_dynamic_shapes is invoked) so they use the
new max(1, model_nnei) logic consistently.
📒 Files selected for processing (6)
deepmd/pt_expt/model/make_model.py
deepmd/pt_expt/model/spin_model.py
deepmd/pt_expt/utils/serialization.py
source/tests/infer/case.py
source/tests/pt_expt/export_helpers.py
source/tests/pt_expt/infer/test_deep_eval.py
✅ Files skipped from review due to trivial changes (1)
- source/tests/pt_expt/export_helpers.py
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 711a1f4ce3
Move createNlistTensor() inside the `if (ago == 0)` block so the nlist tensor is only rebuilt when LAMMPS updates the neighbor list, matching the existing caching pattern for mapping_tensor. Previously, the nlist was rebuilt every step: flatten jagged list, allocate std::vector<int>, copy into tensor, cast int32->int64, transfer to device. This was the dominant CPU+H2D cost outside the model itself.
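The caching pattern generalizes to a small stand-in in plain Python (a hedged sketch; the real code is C++, and `build` stands in for the createNlistTensor + int64 cast + device-transfer chain):

```python
class CachedTensor:
    """Rebuild an expensive value only when the neighbor list changes."""

    def __init__(self):
        self._value = None

    def get(self, ago, build):
        # LAMMPS sets ago == 0 on the steps where it rebuilt the
        # neighbor list; only then is the cached tensor stale.
        if ago == 0 or self._value is None:
            self._value = build()
        return self._value

builds = []
cache = CachedTensor()
for ago in [0, 1, 2, 0, 1]:
    t = cache.get(ago, lambda: builds.append(1) or "tensor")

assert t == "tensor"
assert len(builds) == 2  # rebuilt only on the two ago == 0 steps
```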
Adds test_oversized_nlist to both .pte and .pt2 test classes. The test shuffles neighbor columns so real neighbors appear past sum(sel), forcing format_nlist to sort by distance rather than just truncating. This catches the bug fixed in b7509db where missing distance sort would produce wrong results with oversized neighbor lists.
Adds test_oversized_nlist to both .pte and .pt2 test classes. The test verifies two things: 1. The exported model correctly handles an oversized, shuffled nlist (sorts by distance, keeps closest sum(sel) neighbors). 2. Naive truncation of the shuffled nlist to sum(sel) columns (simulating the old C++ createNlistTensor bug) produces wrong energy, proving that close neighbors beyond sum(sel) are lost without distance sorting.
The fn-internal +1 pad already guarantees n_nnei > sum(sel) at trace time, so _format_nlist's distance-sort branch is always traced. The monkey-patch (mutating the model, try/finally cleanup) was redundant. Two mechanisms remain and suffice: - Inner pad (+1 column) inside the traced fn: ensures sort branch is in the compiled graph and fires at runtime. - C++ createNlistTensor pads to nnei+1: satisfies torch.export Dim guard at the model input level.
The fn-internal +1 pad inside the traced graph already ensures format_nlist's distance-sort branch fires at runtime. The C++ side no longer needs to pad nlist to nnei+1. Simplify createNlistTensor to just flatten the jagged nlist without a min_nnei parameter. Also add test_oversized_nlist to TestSpinEnerModelExportable to verify the spin model export path handles oversized nlist correctly.
The test binary (runUnitTests_cc) was missing TORCH_LIBRARIES in its link line, so __has_include for the AOTInductor header always failed and BUILD_PT_EXPT/BUILD_PT_EXPT_SPIN evaluated to 0. All PtExpt C++ tests were silently skipping in CI. Also fix DeepSpin test guards: change #ifndef BUILD_PYTORCH to #if !defined(BUILD_PYTORCH) || !BUILD_PT_EXPT_SPIN, and add #include "DeepSpinPTExpt.h" so the macro is defined.
Pull request overview
Optimizes the PyTorch-exportable (.pt2/.pte) inference/export pipeline by moving neighbor-list formatting into the compiled graph, making atomic-virial export optional, and updating C++/tests to match the new metadata and dynamic-shape behavior.
Changes:
- Replace CPU-side type-sorted/truncated nlist construction with a raw nlist tensor path and rely on on-device format_nlist in the compiled graph.
- Add --atomic-virial / atomic_virial plumbing and persist do_atomic_virial + nnei in export metadata; add a C++ "fail fast" when atomic virial is requested but not exported.
- Extend pt_expt tests to cover dynamic nnei and oversized/shuffled neighbor lists; adjust test model generators to export with atomic virial where required.
Reviewed changes
Copilot reviewed 29 out of 29 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| source/tests/pt_expt/test_training.py | Adds additional descriptor config for compile-correctness coverage. |
| source/tests/pt_expt/model/test_spin_ener_model.py | Adds helper + new oversized-nlist export correctness test for spin model. |
| source/tests/pt_expt/model/test_export_pipeline.py | Passes model nnei into dynamic-shape builder for export tests. |
| source/tests/pt_expt/model/test_ener_model.py | Updates dynamic-shape builder call to include model nnei. |
| source/tests/pt_expt/infer/test_deep_eval.py | Exports test artifacts with atomic virial enabled; adds oversized-nlist tests. |
| source/tests/pt_expt/export_helpers.py | Pads sample nlist, strips shape assertions, and passes model nnei for export round-trip tests. |
| source/tests/infer/gen_spin_model_devi.py | Exports .pt2 test models with atomic virial enabled via convert-backend. |
| source/tests/infer/gen_spin.py | Exports spin .pt2 with atomic virial enabled. |
| source/tests/infer/gen_sea.py | Exports .pt2 with atomic virial enabled. |
| source/tests/infer/gen_model_devi.py | Exports model-devi .pt2 with atomic virial enabled. |
| source/tests/infer/gen_fparam_aparam.py | Exports fparam/aparam .pt2 artifacts with atomic virial enabled. |
| source/tests/infer/gen_dpa3.py | Exports DPA3 .pt2 with atomic virial enabled. |
| source/tests/infer/gen_dpa2.py | Exports DPA2 .pt2 with atomic virial enabled. |
| source/tests/infer/gen_dpa1.py | Exports DPA1 .pt2 with atomic virial enabled. |
| source/tests/infer/case.py | Ensures .pte/.pt2 test exports include atomic virial when needed. |
| source/api_cc/tests/test_deepspin_model_devi_ptexpt.cc | Includes PTExpt spin header and tightens skip conditions for spin builds. |
| source/api_cc/tests/test_deeppot_dpa_ptexpt_spin.cc | Includes PTExpt spin header and tightens skip conditions for spin builds. |
| source/api_cc/tests/CMakeLists.txt | Links Torch libraries for C++ unit tests when PyTorch is enabled. |
| source/api_cc/src/commonPTExpt.h | Replaces CPU-side sorted/truncated nlist builder with raw tensor flattener. |
| source/api_cc/src/DeepSpinPTExpt.cc | Uses raw nlist tensor path, caches mapping/nlist tensors, reads new metadata, and adds atomic-virial guard. |
| source/api_cc/src/DeepPotPTExpt.cc | Uses raw nlist tensor path, caches mapping/nlist tensors, reads new metadata, and adds atomic-virial guard. |
| source/api_cc/include/DeepSpinPTExpt.h | Stores do_atomic_virial/nnei metadata and caches mapping/nlist tensors. |
| source/api_cc/include/DeepPotPTExpt.h | Stores do_atomic_virial/nnei metadata and caches mapping/nlist tensors. |
| deepmd/pt_expt/utils/serialization.py | Adds nnei metadata, dynamic nnei export dims, atomic-virial export option, and shape-assert stripping. |
| deepmd/pt_expt/train/training.py | Minor formatting cleanup. |
| deepmd/pt_expt/model/spin_model.py | Adds internal +1 nlist padding during tracing to force sort branch capture. |
| deepmd/pt_expt/model/make_model.py | Adds internal +1 nlist padding during tracing to force sort branch capture. |
| deepmd/main.py | Adds dp convert-backend --atomic-virial CLI flag. |
| deepmd/entrypoints/convert_backend.py | Plumbs atomic_virial into pt_expt deserialization (as do_atomic_virial). |
Comments suppressed due to low confidence (1)
deepmd/pt_expt/utils/serialization.py:46
- _strip_shape_assertions deletes every aten._assert_scalar node unconditionally. That's broader than the specific "spurious" guards described in the docstring and can also remove useful/intentional runtime shape checks produced by torch.export, making invalid inputs fail later in harder-to-debug ways. Consider narrowing the removal to known-bad guard patterns (or making this opt-in) rather than erasing all _assert_scalar nodes.
def _strip_shape_assertions(graph_module: torch.nn.Module) -> None:
"""Remove shape-guard assertion nodes from an exported graph.
``torch.export`` inserts ``aten._assert_scalar`` nodes for symbolic shape
relationships discovered during tracing. These guards can be spurious:
* **Spin models**: atom-doubling logic creates slice patterns that depend
on ``(nall - nloc)``, producing guards like ``Ne(nall, nloc)``.
* **All models**: the nlist padding inside ``forward_common_lower_exportable``
and the subsequent sort/truncate in ``_format_nlist`` can produce guards
like ``Ne(nnei, sum(sel))``. These are spurious because the compiled
graph handles any ``nnei >= sum(sel)`` correctly.
The assertion messages use opaque symbolic variable names (e.g.
``Ne(s22, s96)``) rather than human-readable names, so filtering by
message content is not reliable. Since
``prefer_deferred_runtime_asserts_over_guards=True`` converts all shape
guards into these deferred assertions, removing all of them is safe.
"""
graph = graph_module.graph
for node in list(graph.nodes):
if (
node.op == "call_function"
and node.target is torch.ops.aten._assert_scalar.default
):
graph.erase_node(node)
graph.eliminate_dead_code()
graph_module.recompile()
cell = (cell + cell.T) + 5.0 * torch.eye(3, device="cpu")
coord = torch.rand([natoms, 3], dtype=dtype, device="cpu", generator=generator)
coord = torch.matmul(coord, cell)
atype = torch.tensor([0, 0, 1, 0, 1, 1], dtype=torch.int64)
_build_extended_inputs takes a natoms parameter, but atype is hard-coded to length 6. If natoms is ever changed from the default, this will produce inconsistent shapes (coord/spin vs atype). Consider either generating atype based on natoms or removing the parameter to avoid accidental misuse.
- atype = torch.tensor([0, 0, 1, 0, 1, 1], dtype=torch.int64)
+ base_atype = torch.tensor([0, 0, 1, 0, 1, 1], dtype=torch.int64)
+ atype = base_atype.repeat((natoms + len(base_atype) - 1) // len(base_atype))[
+     :natoms
+ ]
natoms is never changed from the default 6 in this test. The parameter exists for consistency with the test helper pattern, not for arbitrary values. Adding dynamic atype generation would be over-engineering for a fixed test fixture.
- Move createNlistTensor to shared commonPT.h, replacing duplicates in DeepPotPT.cc, DeepSpinPT.cc, and commonPTExpt.h. The unified version handles jagged rows with a min_nnei parameter for .pt2 models. - Fix shuffle() in common.cc to skip -1 padding entries (was causing out-of-bounds access on fwd_map[-1]). - Add oversized nlist C++ tests for both DeepPot and DeepSpin .pt2 backends, with _pad_shuffle_nlist test helper.
Summary
- Replace buildTypeSortedNlist with createNlistTensor(data, expected_nnei): avoids distance computation and type sorting every step; the model's compiled format_nlist handles this on-device.
- Export with do_atomic_virial=False by default: avoids 3 extra torch.autograd.grad backward passes; add --atomic-virial flag to dp convert-backend.
- Cache mapping_tensor as a member variable; only rebuild when ago == 0.
- Store nnei and do_atomic_virial in .pt2 metadata for C++ to read at init.
Benchmark (V100-SXM2-16GB, 192-atom water, LAMMPS MD)
Before this PR
.pt2 was 9x slower than .pth due to CPU-side nlist sorting, baked-in atomic virial backward passes, and excessive clones:
After this PR
DPA1 L0 (se_atten nlayer=0)
DPA1 L2 (se_atten nlayer=2)
DPA2 (repinit + repformer)
For models with more compute (DPA1 L2, DPA2), .pt2 is 24-45% faster than .pth. For the smallest model (DPA1 L0), .pt2 has higher per-call overhead that dominates at large atom counts.
Test plan
Summary by CodeRabbit
--atomic-virial command-line flag to enable atomic virial correction during model conversion and export operations