I just encountered an error I've never seen before. I used the --load-model command line argument to resume training from a checkpoint. At first everything seemed to be working correctly, but after completing four epochs it exited with this error:
Traceback (most recent call last):
  File "/global/homes/p/peastman/torchmd-net/scripts/train.py", line 164, in <module>
    main()
  File "/global/homes/p/peastman/torchmd-net/scripts/train.py", line 160, in main
    trainer.test(model, data)
  File "/global/homes/p/peastman/miniconda3/envs/torchmd/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 936, in test
    return self._call_and_handle_interrupt(self._test_impl, model, dataloaders, ckpt_path, verbose, datamodule)
  File "/global/homes/p/peastman/miniconda3/envs/torchmd/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/global/homes/p/peastman/miniconda3/envs/torchmd/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 983, in _test_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/global/homes/p/peastman/miniconda3/envs/torchmd/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1222, in _run
    self._log_hyperparams()
  File "/global/homes/p/peastman/miniconda3/envs/torchmd/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1277, in _log_hyperparams
    raise MisconfigurationException(
pytorch_lightning.utilities.exceptions.MisconfigurationException: Error while merging hparams: the keys ['load_model'] are present in both the LightningModule's and LightningDataModule's hparams but have different values.
It looks like training didn't actually stop after 4 epochs, since the error occurred while calling trainer.test in the training script. After fit we load the best model checkpoint and then evaluate it on the test set. The best checkpoint that was loaded was probably saved in a previous training run, so its stored value of load_model is probably None, while the current DataModule contains a different value for load_model.
The problem in this specific case is probably something else, since stopping training after 4 epochs was presumably not intended. This error is definitely not very intuitive though.
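To see why a stale load_model value in the checkpoint trips this exception, here is a simplified, self-contained sketch of the consistency check Lightning performs before logging hyperparameters. This is an illustration of the logic only, not Lightning's actual implementation; the merge_hparams function and the example values are hypothetical:

```python
def merge_hparams(module_hparams, datamodule_hparams):
    """Simplified sketch of Lightning's hparams merge check:
    a key shared by the LightningModule and LightningDataModule
    must have the same value in both, otherwise merging fails."""
    conflicts = [
        key
        for key in module_hparams.keys() & datamodule_hparams.keys()
        if module_hparams[key] != datamodule_hparams[key]
    ]
    if conflicts:
        raise ValueError(
            f"Error while merging hparams: the keys {sorted(conflicts)} are "
            "present in both the LightningModule's and LightningDataModule's "
            "hparams but have different values."
        )
    # No conflicts: module values take precedence over datamodule values.
    return {**datamodule_hparams, **module_hparams}


# The checkpoint restored before trainer.test() carries load_model=None,
# while the current DataModule was built with the path passed on the
# command line, so the merge fails. (Values here are made up.)
module_hp = {"load_model": None, "lr": 1e-4}
datamodule_hp = {"load_model": "epoch=3.ckpt", "lr": 1e-4}
try:
    merge_hparams(module_hp, datamodule_hp)
except ValueError as err:
    print(err)
```

Any key whose values agree on both sides (like lr above) merges cleanly; only the mismatched load_model triggers the exception.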
A potential fix would be to pass just the test dataloader instead of the full DataModule, e.g. trainer.test(model, dataloaders=data.test_dataloader()), so there is only one set of hparams and nothing to merge.