I'm trying to run a simple simulation of aspirin in the gas phase with LAMMPS, using a model trained with example.yaml from the allegro repository, just to validate a container on LUMI (AMD GPUs).
The container has PyTorch 1.13 and ROCm 5.2.3, the develop branch of nequip and the main branch of allegro installed.
It also has LAMMPS stable-12Aug2023-update2 patched with the multicut branch of pair_allegro.
I can train and deploy models without a problem.
However, when I try to use a deployed model to run a LAMMPS simulation I get:
+ srun lmp -k on g 1 -sf kk -pk kokkos newton on neigh full -in in.lammps
terminate called after throwing an instance of 'c10::Error'
  what():  expected scalar type Double but found Float
Exception raised from data_ptr<double> at aten/src/ATen/core/TensorMethods.cpp:20 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x153decdd690c in /opt/lammps/bin/../libtorch/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xfa (0x153decd9d008 in /opt/lammps/bin/../libtorch/libc10.so)
frame #2: double* at::TensorBase::data_ptr<double>() const + 0x403 (0x153dd01a9bd3 in /opt/lammps/bin/../libtorch/libtorch_cpu.so)
frame #3: lmp() [0x1df3fe8]
frame #4: lmp() [0x1da7155]
frame #5: lmp() [0x18e2d35]
frame #6: lmp() [0x130c0ef]
frame #7: lmp() [0x130af4e]
frame #8: lmp() [0x12f454d]
frame #9: __libc_start_main + 0xef (0x153d9661924d in /lib64/libc.so.6)
frame #10: lmp() [0x12f444a]
srun: error: nid007955: task 0: Aborted
srun: launch/slurm: _step_signal: Terminating StepId=6343105.0
This looks similar to #12 and #37, but I'm already using the develop branch of nequip. Changing the pair style to pair_style allegro3232 does not solve the issue either; it fails with a different error:
+ srun lmp -k on g 1 -sf kk -pk kokkos newton on neigh full -in in.lammps
terminate called after throwing an instance of 'std::out_of_range'
  what():  Argument passed to at() was not in the map.
srun: error: nid005002: task 0: Aborted
srun: launch/slurm: _step_signal: Terminating StepId=6345098.0
Is this some incompatibility between versions?