Summary
Checklist
[x] I have searched existing issues to make sure this feature has not been requested before.
[x] I have checked the documentation of DeepMD-kit.
Description of the problem
When users compile a custom libdeepmd_op_pt.so (e.g., to match specific CUDA/GCC environments) and place it in a custom directory (not the default deepmd/lib/), the PyTorch backend silently fails to load it during dp --pt convert-backend or dp --pt compress.
The root cause is in deepmd/pt/cxx_op.py:
```python
SHARED_LIB_DIR = Path(deepmd.lib.__path__[0])
module_file = (SHARED_LIB_DIR / (prefix + module_name)).with_suffix(ext).resolve()
if module_file.is_file():
    # loads the library
    ...
```
It strictly looks for the .so in the Python package installation path and ignores environment variables like LD_LIBRARY_PATH. If the file is missing, ENABLE_CUSTOMIZED_OP becomes False, and the model is serialized with dummy placeholder functions (e.g., tabulate_fusion_se_a that just raise NotImplementedError).
This results in a broken .pth model that passes conversion without any warnings, but crashes immediately in LAMMPS.
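The placeholder mechanism described above can be sketched as follows (illustrative only, not the actual DeePMD-kit source):

```python
# Minimal sketch of the failure mode: when libdeepmd_op_pt.so cannot be
# found, ENABLE_CUSTOMIZED_OP stays False and dummy placeholders are
# registered in place of the real customized OPs.
ENABLE_CUSTOMIZED_OP = False  # set to False when the .so fails to load


def tabulate_fusion_se_a(*args, **kwargs):
    """Dummy placeholder registered when the real OP library is missing."""
    raise NotImplementedError(
        "tabulate_fusion_se_a is not available: libdeepmd_op_pt.so was not loaded."
    )
```

A model serialized with such a placeholder converts without complaint and only fails when the OP is actually invoked at inference time.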
Currently, the only workaround is to manually symlink the .so into deepmd/lib/, which is non-intuitive and breaks upon environment updates.
Detailed Description
Describe the solution
- Add a critical warning/error during model serialization (convert/compress/freeze)
When a model architecture requires customized OPs (e.g., uses tabulate_fusion_se_a for compressed SeA) but ENABLE_CUSTOMIZED_OP is False, dp --pt convert and dp --pt compress should not proceed silently.
It should raise an explicit error or a prominent warning like:
```
[ERROR] The current model requires customized PyTorch OPs (e.g., tabulate_fusion_se_a), but libdeepmd_op_pt.so was not loaded. The exported model will fail during inference. Please ensure the custom OP library is installed correctly.
```
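A guard along these lines could be called from the serialization entry points; the function name and signature below are hypothetical, not an existing DeePMD-kit API:

```python
# Hypothetical pre-serialization check: fail loudly if the model declares
# customized OPs but the OP library was never loaded.
def assert_customized_ops_available(required_ops, enable_customized_op):
    if required_ops and not enable_customized_op:
        raise RuntimeError(
            "The current model requires customized PyTorch OPs "
            f"({', '.join(sorted(required_ops))}), but libdeepmd_op_pt.so was "
            "not loaded. The exported model will fail during inference."
        )


# A plain (non-compressed) model declares no customized OPs, so this passes:
assert_customized_ops_available(set(), enable_customized_op=False)
```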
- Allow overriding SHARED_LIB_DIR via an environment variable
In deepmd/pt/cxx_op.py, the path resolution logic should fall back to a dedicated environment variable (e.g., DEEPMD_OP_DIR) when the hardcoded SHARED_LIB_DIR does not contain the library.
Proposed logic
```python
module_file = (SHARED_LIB_DIR / (prefix + module_name)).with_suffix(ext).resolve()
if not module_file.is_file():
    # Check environment variable override
    env_dir = os.environ.get("DEEPMD_OP_DIR")
    if env_dir:
        module_file = (Path(env_dir) / (prefix + module_name)).with_suffix(ext).resolve()
```
This would allow users to point to their custom-compiled OP libraries without modifying source code or creating symlinks.
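Wrapped as a standalone helper (the name `resolve_op_library` is illustrative, not an existing API), the proposed fallback behaves like this:

```python
import os
from pathlib import Path


def resolve_op_library(shared_lib_dir, prefix, module_name, ext):
    """Resolve the OP library path, falling back to DEEPMD_OP_DIR.

    Hypothetical helper mirroring the proposed fallback logic for
    deepmd/pt/cxx_op.py.
    """
    module_file = (Path(shared_lib_dir) / (prefix + module_name)).with_suffix(ext).resolve()
    if not module_file.is_file():
        # Fall back to a user-specified directory if the packaged path is empty.
        env_dir = os.environ.get("DEEPMD_OP_DIR")
        if env_dir:
            module_file = (Path(env_dir) / (prefix + module_name)).with_suffix(ext).resolve()
    return module_file
```

Usage would then be as simple as `export DEEPMD_OP_DIR=~/deepmd-kit/lib/` before running `dp --pt convert-backend` or `dp --pt compress`.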
Describe alternatives you’ve considered
- Manually symlinking the .so into site-packages/deepmd/lib/ (the current workaround; fragile).
- Modifying cxx_op.py source code directly (gets overwritten on updates).
- Setting LD_LIBRARY_PATH (does not work, because cxx_op.py resolves an explicit absolute path and tests it with Path.is_file() instead of calling torch.ops.load_library("deepmd_op_pt") with linker search semantics).
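To see why LD_LIBRARY_PATH has no effect here: `Path.is_file()` performs a plain filesystem check on a single absolute path and never consults the dynamic-linker search path.

```python
import os
from pathlib import Path

# pathlib checks exactly one path on disk; LD_LIBRARY_PATH is only
# consulted by the dynamic linker when a library is actually dlopen'd
# by its bare name, which is not what cxx_op.py does.
os.environ["LD_LIBRARY_PATH"] = "/tmp"
missing = Path("/nonexistent/libdeepmd_op_pt.so")
print(missing.is_file())  # False, regardless of LD_LIBRARY_PATH
```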
Additional context
- DeepMD-kit version: v3.1.0 (likely also affects the v2.x PT backend)
- PyTorch version: built with _GLIBCXX_USE_CXX11_ABI=0 (conda-forge)
How to reproduce:
1. Compile libdeepmd_op_pt.so in a custom directory (e.g., ~/deepmd-kit/lib/).
2. Set export LD_LIBRARY_PATH=~/deepmd-kit/lib/:$LD_LIBRARY_PATH.
3. Run dp --pt convert-backend in.pb out.pth (the .pb may contain OPs such as se_a).
4. Observe no warnings; check ENABLE_CUSTOMIZED_OP -> it is False.
5. Run dp --pt compress out.pth compress.pth.
6. Observe no warnings; check ENABLE_CUSTOMIZED_OP -> it is False.
7. Running LAMMPS with out.pth works, BUT running LAMMPS with compress.pth raises NotImplementedError (somewhat like #4530).
Further Information, Files, and Links
No response