FileNotFoundError looking for ckpt files #1009
-
Hi @codeananda, thanks for raising this issue and providing a detailed traceback! The checkpoints are created automatically by the PyTorch Lightning learning rate finder, which is activated whenever you don't provide a learning rate. Lightning checkpoints the model before trying out different learning rates, then reloads the model from that checkpoint to restore the weights to their initial values. This would be the line from your traceback where that happens.
So a quick workaround could be to provide a learning rate manually (which I understand is not feasible in many cases). I assume you are working on Colab, is that correct? Could you manually check whether the referenced file exists at the given location? I have a slight suspicion that the models trained in parallel might overwrite each other's checkpoints, or that the mount point has brief outages that cause the learning rate finder not to find the file.
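To make that suspicion concrete, here is a minimal sketch of what a clash over one checkpoint path would look like. It is not Lightning's actual code: `remove_checkpoint`, `simulate_shared_checkpoint`, and the file name `shared.ckpt` are hypothetical, and the two "workers" simply run one after the other on a temporary directory.

```python
import os
import tempfile
from pathlib import Path


def remove_checkpoint(path: Path) -> bool:
    """Delete a checkpoint file; return False if it was already gone."""
    try:
        os.remove(path)
        return True
    except FileNotFoundError:
        # Someone else (another worker, or a lagging mount) got there first.
        return False


def simulate_shared_checkpoint() -> list:
    """Two 'workers' try to clean up the same checkpoint path in turn."""
    with tempfile.TemporaryDirectory() as tmp:
        ckpt = Path(tmp) / "shared.ckpt"  # hypothetical shared file name
        ckpt.write_bytes(b"weights")
        # Worker A removes the file, then worker B tries the same path.
        return [remove_checkpoint(ckpt), remove_checkpoint(ckpt)]


outcomes = simulate_shared_checkpoint()
print(outcomes)
```

Without the `try`/`except`, the second call would raise the same kind of `FileNotFoundError` as in the report, which is why a shared path is a plausible (but unconfirmed) explanation.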
-
Hi @karl-richter thanks for your super speedy response!
I'm working in VS Code but storing/accessing files on Google Drive. Similar to Colab but not identical.
The file did exist. However, see my comment below, as I don't think it existed when the program tried to access it. I ran it once with a specified learning rate and it worked. Then I re-ran it and got this error. In general, we've adopted the policy of manually deleting the lightning_logs folder before we run our code, as it often causes errors. But obviously it would be great if we didn't have to do that!
Error traceback:
2022-12-01 10:38:45.192 | ERROR | ForwardPredictor:_predict_parallel:583 - An error has been caught in function '_predict_parallel', process 'LokyProcess-2' (4123), thread 'MainThread' (140568232052544):
Traceback (most recent call last):
File "/home/codeananda/anaconda3/envs/neuralprophet/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
│ │ └ {'__name__': '__main__', '__doc__': None, '__package__': 'joblib.externals.loky.backend', '__loader__': <_frozen_importlib_ex...
│ └ <code object <module> at 0x7fd892bd6ea0, file "/home/codeananda/anaconda3/envs/neuralprophet/lib/python3.9/site-packages/jobl...
└ <function _run_code at 0x7fd8970a6940>
File "/home/codeananda/anaconda3/envs/neuralprophet/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
│ └ {'__name__': '__main__', '__doc__': None, '__package__': 'joblib.externals.loky.backend', '__loader__': <_frozen_importlib_ex...
└ <code object <module> at 0x7fd892bd6ea0, file "/home/codeananda/anaconda3/envs/neuralprophet/lib/python3.9/site-packages/jobl...
File "/home/codeananda/anaconda3/envs/neuralprophet/lib/python3.9/site-packages/joblib/externals/loky/backend/popen_loky_posix.py", line 170, in <module>
exitcode = process_obj._bootstrap()
│ └ <function BaseProcess._bootstrap at 0x7fd89699b8b0>
└ <LokyProcess name='LokyProcess-2' parent=4087 started>
File "/home/codeananda/anaconda3/envs/neuralprophet/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self.run()
│ └ <function BaseProcess.run at 0x7fd896a06ee0>
└ <LokyProcess name='LokyProcess-2' parent=4087 started>
File "/home/codeananda/anaconda3/envs/neuralprophet/lib/python3.9/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
│ │ │ │ │ └ {}
│ │ │ │ └ <LokyProcess name='LokyProcess-2' parent=4087 started>
│ │ │ └ (<joblib.externals.loky.process_executor._SafeQueue object at 0x7fd892bf1130>, <joblib.externals.loky.backend.queues.SimpleQu...
│ │ └ <LokyProcess name='LokyProcess-2' parent=4087 started>
│ └ <function _process_worker at 0x7fd896813dc0>
└ <LokyProcess name='LokyProcess-2' parent=4087 started>
File "/home/codeananda/anaconda3/envs/neuralprophet/lib/python3.9/site-packages/joblib/externals/loky/process_executor.py", line 428, in _process_worker
r = call_item()
└ CallItem(2, <joblib._parallel_backends.SafeFunction object at 0x7fd892c01940>, (), {})
File "/home/codeananda/anaconda3/envs/neuralprophet/lib/python3.9/site-packages/joblib/externals/loky/process_executor.py", line 275, in __call__
return self.fn(*self.args, **self.kwargs)
│ │ │ │ │ └ {}
│ │ │ │ └ CallItem(2, <joblib._parallel_backends.SafeFunction object at 0x7fd892c01940>, (), {})
│ │ │ └ ()
│ │ └ CallItem(2, <joblib._parallel_backends.SafeFunction object at 0x7fd892c01940>, (), {})
│ └ <joblib._parallel_backends.SafeFunction object at 0x7fd892c01940>
└ CallItem(2, <joblib._parallel_backends.SafeFunction object at 0x7fd892c01940>, (), {})
File "/home/codeananda/anaconda3/envs/neuralprophet/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 620, in __call__
return self.func(*args, **kwargs)
│ │ │ └ {}
│ │ └ ()
│ └ <joblib.parallel.BatchedCalls object at 0x7fd801e04cd0>
└ <joblib._parallel_backends.SafeFunction object at 0x7fd892c01940>
File "/home/codeananda/anaconda3/envs/neuralprophet/lib/python3.9/site-packages/joblib/parallel.py", line 288, in __call__
return [func(*args, **kwargs)
File "/home/codeananda/anaconda3/envs/neuralprophet/lib/python3.9/site-packages/joblib/parallel.py", line 288, in <listcomp>
return [func(*args, **kwargs)
│ │ └ {}
│ └ (date
│ 2013-11-24 NaN
│ 2013-11-25 NaN
│ 2013-11-26 NaN
│ 2013-11-27 NaN
│ 2013-11-28 NaN...
└ <bound method ForwardPredictor._predict_parallel of <ForwardPredictor.ForwardPredictor object at 0x7fd892b94040>>
> File "/mnt/g/My Drive/1 Projects/1 AltDG - Adam/ForwardPredictor/ForwardPredictor.py", line 583, in _predict_parallel
forecast = self.predict(series)
│ │ └ date
│ │ 2013-11-24 NaN
│ │ 2013-11-25 NaN
│ │ 2013-11-26 NaN
│ │ 2013-11-27 NaN
│ │ 2013-11-28 NaN
│ │ ...
│ └ <function ForwardPredictor.predict at 0x7fd801e068b0>
└ <ForwardPredictor.ForwardPredictor object at 0x7fd892b94040>
File "/mnt/g/My Drive/1 Projects/1 AltDG - Adam/ForwardPredictor/ForwardPredictor.py", line 328, in predict
model.fit(train_df)
│ │ └ ds y
│ │ 0 2017-06-11 13.428571
│ │ 1 2017-06-12 13.428571
│ │ 2 2017-06-13 13.428571
│ │ 3 2017-06-14 ...
│ └ <function NeuralProphet.fit at 0x7fd805fb7f70>
└ <neuralprophet.forecaster.NeuralProphet object at 0x7fd801e11f40>
File "/home/codeananda/anaconda3/envs/neuralprophet/lib/python3.9/site-packages/neuralprophet/forecaster.py", line 730, in fit
metrics_df = self._train(df, minimal=minimal, continue_training=continue_training)
│ │ │ │ └ False
│ │ │ └ False
│ │ └ ds y ID
│ │ 0 2017-06-11 13.428571 __df__
│ │ 1 2017-06-12 13.428571 __df__
│ │ 2 2017-06-13 13....
│ └ <function NeuralProphet._train at 0x7fd805fbd0d0>
└ <neuralprophet.forecaster.NeuralProphet object at 0x7fd801e11f40>
File "/home/codeananda/anaconda3/envs/neuralprophet/lib/python3.9/site-packages/neuralprophet/forecaster.py", line 2567, in _train
self.trainer.fit(
│ │ └ <function Trainer.fit at 0x7fd806346040>
│ └ <pytorch_lightning.trainer.trainer.Trainer object at 0x7fd8015918e0>
└ <neuralprophet.forecaster.NeuralProphet object at 0x7fd801e11f40>
File "/home/codeananda/anaconda3/envs/neuralprophet/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
self._call_and_handle_interrupt(
│ └ <function Trainer._call_and_handle_interrupt at 0x7fd8063aaf70>
└ <pytorch_lightning.trainer.trainer.Trainer object at 0x7fd8015918e0>
File "/home/codeananda/anaconda3/envs/neuralprophet/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
│ │ └ {}
│ └ (TimeNet(
│ (metrics_train): MetricCollection(
│ (MAE): MeanAbsoluteError()
│ (RMSE): MeanSquaredError()
│ )
│ (metrics_v...
└ <bound method Trainer._fit_impl of <pytorch_lightning.trainer.trainer.Trainer object at 0x7fd8015918e0>>
File "/home/codeananda/anaconda3/envs/neuralprophet/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 735, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
│ │ │ │ └ <property object at 0x7fd8963571d0>
│ │ │ └ <pytorch_lightning.trainer.trainer.Trainer object at 0x7fd8015918e0>
│ │ └ TimeNet(
│ │ (metrics_train): MetricCollection(
│ │ (MAE): MeanAbsoluteError()
│ │ (RMSE): MeanSquaredError()
│ │ )
│ │ (metrics_va...
│ └ <function Trainer._run at 0x7fd8063465e0>
└ <pytorch_lightning.trainer.trainer.Trainer object at 0x7fd8015918e0>
File "/home/codeananda/anaconda3/envs/neuralprophet/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1166, in _run
results = self._run_stage()
│ └ <function Trainer._run_stage at 0x7fd806346820>
└ <pytorch_lightning.trainer.trainer.Trainer object at 0x7fd8015918e0>
File "/home/codeananda/anaconda3/envs/neuralprophet/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1252, in _run_stage
return self._run_train()
│ └ <function Trainer._run_train at 0x7fd806346940>
└ <pytorch_lightning.trainer.trainer.Trainer object at 0x7fd8015918e0>
File "/home/codeananda/anaconda3/envs/neuralprophet/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1283, in _run_train
self.fit_loop.run()
│ └ <property object at 0x7fd8963ac950>
└ <pytorch_lightning.trainer.trainer.Trainer object at 0x7fd8015918e0>
File "/home/codeananda/anaconda3/envs/neuralprophet/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 201, in run
self.on_advance_end()
│ └ <function FitLoop.on_advance_end at 0x7fd807469790>
└ <pytorch_lightning.loops.fit_loop.FitLoop object at 0x7fd8015414c0>
File "/home/codeananda/anaconda3/envs/neuralprophet/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 299, in on_advance_end
self.trainer._call_callback_hooks("on_train_epoch_end")
│ └ <property object at 0x7fd8064d9ef0>
└ <pytorch_lightning.loops.fit_loop.FitLoop object at 0x7fd8015414c0>
File "/home/codeananda/anaconda3/envs/neuralprophet/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1597, in _call_callback_hooks
fn(self, self.lightning_module, *args, **kwargs)
│ │ │ │ │ └ {}
│ │ │ │ └ ()
│ │ │ └ <property object at 0x7fd896357b80>
│ │ └ <pytorch_lightning.trainer.trainer.Trainer object at 0x7fd8015918e0>
│ └ <pytorch_lightning.trainer.trainer.Trainer object at 0x7fd8015918e0>
└ <bound method ModelCheckpoint.on_train_epoch_end of <pytorch_lightning.callbacks.model_checkpoint.ModelCheckpoint object at 0...
File "/home/codeananda/anaconda3/envs/neuralprophet/lib/python3.9/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 311, in on_train_epoch_end
self._save_topk_checkpoint(trainer, monitor_candidates)
│ │ │ └ {'MAE': tensor(5.9332), 'RMSE': tensor(7.2154), 'Loss': tensor([0.4365]), 'RegLoss': tensor([0.]), 'epoch': tensor(1), 'step'...
│ │ └ <pytorch_lightning.trainer.trainer.Trainer object at 0x7fd8015918e0>
│ └ <function ModelCheckpoint._save_topk_checkpoint at 0x7fd80740d160>
└ <pytorch_lightning.callbacks.model_checkpoint.ModelCheckpoint object at 0x7fd801321070>
File "/home/codeananda/anaconda3/envs/neuralprophet/lib/python3.9/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 384, in _save_topk_checkpoint
self._save_none_monitor_checkpoint(trainer, monitor_candidates)
│ │ │ └ {'MAE': tensor(5.9332), 'RMSE': tensor(7.2154), 'Loss': tensor([0.4365]), 'RegLoss': tensor([0.]), 'epoch': tensor(1), 'step'...
│ │ └ <pytorch_lightning.trainer.trainer.Trainer object at 0x7fd8015918e0>
│ └ <function ModelCheckpoint._save_none_monitor_checkpoint at 0x7fd80740daf0>
└ <pytorch_lightning.callbacks.model_checkpoint.ModelCheckpoint object at 0x7fd801321070>
File "/home/codeananda/anaconda3/envs/neuralprophet/lib/python3.9/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 674, in _save_none_monitor_checkpoint
trainer.strategy.remove_checkpoint(previous)
│ │ └ '/mnt/g/My Drive/1 Projects/1 AltDG - Adam/ForwardPredictor/lightning_logs/version_146/checkpoints/epoch=0-step=62.ckpt'
│ └ <property object at 0x7fd89634bf90>
└ <pytorch_lightning.trainer.trainer.Trainer object at 0x7fd8015918e0>
File "/home/codeananda/anaconda3/envs/neuralprophet/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 455, in remove_checkpoint
self.checkpoint_io.remove_checkpoint(filepath)
│ │ └ '/mnt/g/My Drive/1 Projects/1 AltDG - Adam/ForwardPredictor/lightning_logs/version_146/checkpoints/epoch=0-step=62.ckpt'
│ └ <property object at 0x7fd80660fcc0>
└ <pytorch_lightning.strategies.single_device.SingleDeviceStrategy object at 0x7fd8015418b0>
File "/home/codeananda/anaconda3/envs/neuralprophet/lib/python3.9/site-packages/pytorch_lightning/plugins/io/torch_plugin.py", line 95, in remove_checkpoint
fs.rm(path, recursive=True)
│ │ └ '/mnt/g/My Drive/1 Projects/1 AltDG - Adam/ForwardPredictor/lightning_logs/version_146/checkpoints/epoch=0-step=62.ckpt'
│ └ <function LocalFileSystem.rm at 0x7fd80747d310>
└ <fsspec.implementations.local.LocalFileSystem object at 0x7fd801591400>
File "/home/codeananda/anaconda3/envs/neuralprophet/lib/python3.9/site-packages/fsspec/implementations/local.py", line 169, in rm
os.remove(p)
│ │ └ '/mnt/g/My Drive/1 Projects/1 AltDG - Adam/ForwardPredictor/lightning_logs/version_146/checkpoints/epoch=0-step=62.ckpt'
│ └ <built-in function remove>
└ <module 'os' from '/home/codeananda/anaconda3/envs/neuralprophet/lib/python3.9/os.py'>
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/g/My Drive/1 Projects/1 AltDG - Adam/ForwardPredictor/lightning_logs/version_146/checkpoints/epoch=0-step=62.ckpt'
These are the files in the checkpoints dir (I'm only training for 3 epochs, for speed). So this file doesn't exist.
In general though, it seems that if I a) delete the lightning_logs folder and b) run with a specified learning rate, it works. However, I'm only doing this on 8 series for 3 epochs each. My colleague says that when he runs it on 1k+ series for 300+ epochs, he sometimes still gets the ckpt error even when manually specifying a learning rate.
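The manual cleanup of the lightning_logs folder could be scripted along these lines. This is only a sketch: `clear_lightning_logs` is a hypothetical helper, and it assumes the folder sits directly inside the project directory, as the paths in the traceback suggest. The demo runs against a temporary directory so nothing real is deleted.

```python
import shutil
import tempfile
from pathlib import Path


def clear_lightning_logs(project_dir) -> bool:
    """Remove <project_dir>/lightning_logs; return True if it existed."""
    logs = Path(project_dir) / "lightning_logs"
    existed = logs.exists()
    # ignore_errors=True keeps a missing or half-synced folder from raising.
    shutil.rmtree(logs, ignore_errors=True)
    return existed


# Demo on a throwaway directory that mimics the layout from the traceback.
with tempfile.TemporaryDirectory() as project:
    ckpt_dir = Path(project) / "lightning_logs" / "version_0" / "checkpoints"
    ckpt_dir.mkdir(parents=True)
    (ckpt_dir / "example.ckpt").write_bytes(b"weights")

    first = clear_lightning_logs(project)   # folder was there
    second = clear_lightning_logs(project)  # already gone, still no error
    print(first, second)
```

In a real run this would be called once on the project directory before starting the parallel jobs, not from inside the workers, since deleting the folder while other workers are writing to it could itself cause missing-file errors.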
-
Adding a minimal reproducible example. @ourownstory @karl-richter
import pandas as pd
import numpy as np
from joblib import Parallel, delayed
from neuralprophet import NeuralProphet

# Load the example dataset and make 8 identical copies to fit in parallel.
data_location = "https://raw.githubusercontent.com/ourownstory/neuralprophet-data/main/datasets/"
df = pd.read_csv(data_location + "wp_log_peyton_manning.csv")
dfs = [df.copy() for _ in range(8)]

# No learning rate is given, so the learning rate finder runs in every worker.
model = NeuralProphet(epochs=10)
results = Parallel(n_jobs=-1)(delayed(model.fit)(df) for df in dfs)
Output:
INFO - (NP.df_utils._infer_frequency) - Major frequency D corresponds to 99.966% of the data.
INFO - (NP.df_utils._infer_frequency) - Dataframe freq automatically defined as D
INFO - (NP.config.init_data_params) - Setting normalization to global as only one dataframe provided for training.
INFO - (NP.utils.set_auto_seasonalities) - Disabling daily seasonality. Run NeuralProphet with daily_seasonality=True to override this.
INFO - (NP.df_utils._infer_frequency) - Major frequency D corresponds to 99.966% of the data.
INFO - (NP.df_utils._infer_frequency) - Dataframe freq automatically defined as D
INFO - (NP.df_utils._infer_frequency) - Major frequency D corresponds to 99.966% of the data.
INFO - (NP.df_utils._infer_frequency) - Dataframe freq automatically defined as D
INFO - (NP.config.init_data_params) - Setting normalization to global as only one dataframe provided for training.
INFO - (NP.config.init_data_params) - Setting normalization to global as only one dataframe provided for training.
INFO - (NP.utils.set_auto_seasonalities) - Disabling daily seasonality. Run NeuralProphet with daily_seasonality=True to override this.
INFO - (NP.utils.set_auto_seasonalities) - Disabling daily seasonality. Run NeuralProphet with daily_seasonality=True to override this.
INFO - (NP.df_utils._infer_frequency) - Major frequency D corresponds to 99.966% of the data.
INFO - (NP.df_utils._infer_frequency) - Dataframe freq automatically defined as D
INFO - (NP.df_utils._infer_frequency) - Major frequency D corresponds to 99.966% of the data.
INFO - (NP.df_utils._infer_frequency) - Dataframe freq automatically defined as D
INFO - (NP.config.init_data_params) - Setting normalization to global as only one dataframe provided for training.
INFO - (NP.config.init_data_params) - Setting normalization to global as only one dataframe provided for training.
INFO - (NP.utils.set_auto_seasonalities) - Disabling daily seasonality. Run NeuralProphet with daily_seasonality=True to override this.
INFO - (NP.utils.set_auto_seasonalities) - Disabling daily seasonality. Run NeuralProphet with daily_seasonality=True to override this.
INFO - (NP.df_utils._infer_frequency) - Major frequency D corresponds to 99.966% of the data.
INFO - (NP.df_utils._infer_frequency) - Dataframe freq automatically defined as D
INFO - (NP.config.init_data_params) - Setting normalization to global as only one dataframe provided for training.
INFO - (NP.utils.set_auto_seasonalities) - Disabling daily seasonality. Run NeuralProphet with daily_seasonality=True to override this.
INFO - (NP.df_utils._infer_frequency) - Major frequency D corresponds to 99.966% of the data.
INFO - (NP.df_utils._infer_frequency) - Dataframe freq automatically defined as D
INFO - (NP.config.init_data_params) - Setting normalization to global as only one dataframe provided for training.
INFO - (NP.df_utils._infer_frequency) - Major frequency D corresponds to 99.966% of the data.
INFO - (NP.df_utils._infer_frequency) - Dataframe freq automatically defined as D
INFO - (NP.config.set_auto_batch_epoch) - Auto-set batch_size to 32
INFO - (NP.config.init_data_params) - Setting normalization to global as only one dataframe provided for training.
INFO - (NP.utils.set_auto_seasonalities) - Disabling daily seasonality. Run NeuralProphet with daily_seasonality=True to override this.
INFO - (NP.utils.set_auto_seasonalities) - Disabling daily seasonality. Run NeuralProphet with daily_seasonality=True to override this.
INFO - (NP.config.set_auto_batch_epoch) - Auto-set batch_size to 32
INFO - (NP.config.set_auto_batch_epoch) - Auto-set batch_size to 32
INFO - (NP.config.set_auto_batch_epoch) - Auto-set batch_size to 32
INFO - (NP.config.set_auto_batch_epoch) - Auto-set batch_size to 32
INFO - (NP.config.set_auto_batch_epoch) - Auto-set batch_size to 32
INFO - (NP.config.set_auto_batch_epoch) - Auto-set batch_size to 32
INFO - (NP.config.set_auto_batch_epoch) - Auto-set batch_size to 32
WARNING - (NP.config.set_lr_finder_args) - Learning rate finder: The number of batches (93) is too small than the required number for the learning rate finder (237). The results might not be optimal.
WARNING - (NP.config.set_lr_finder_args) - Learning rate finder: The number of batches (93) is too small than the required number for the learning rate finder (237). The results might not be optimal.
WARNING - (NP.config.set_lr_finder_args) - Learning rate finder: The number of batches (93) is too small than the required number for the learning rate finder (237). The results might not be optimal.
WARNING - (NP.config.set_lr_finder_args) - Learning rate finder: The number of batches (93) is too small than the required number for the learning rate finder (237). The results might not be optimal.
WARNING - (NP.config.set_lr_finder_args) - Learning rate finder: The number of batches (93) is too small than the required number for the learning rate finder (237). The results might not be optimal.
WARNING - (NP.config.set_lr_finder_args) - Learning rate finder: The number of batches (93) is too small than the required number for the learning rate finder (237). The results might not be optimal.
WARNING - (NP.config.set_lr_finder_args) - Learning rate finder: The number of batches (93) is too small than the required number for the learning rate finder (237). The results might not be optimal.
WARNING - (NP.config.set_lr_finder_args) - Learning rate finder: The number of batches (93) is too small than the required number for the learning rate finder (237). The results might not be optimal.
Finding best initial lr: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 237/237 [00:02<00:00, 107.18it/s]
Finding best initial lr: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 237/237 [00:02<00:00, 97.37it/s]
Finding best initial lr: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 237/237 [00:02<00:00, 92.92it/s]
Finding best initial lr: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 237/237 [00:02<00:00, 97.13it/s]
Finding best initial lr: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 237/237 [00:02<00:00, 95.01it/s]
Finding best initial lr: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 237/237 [00:02<00:00, 91.83it/s]
Finding best initial lr: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 237/237 [00:02<00:00, 92.11it/s]
Finding best initial lr: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 237/237 [00:02<00:00, 92.44it/s]
Missing logger folder: /mnt/g/My Drive/1 Projects/1 AltDG - Adam/BreakoutDetector/lightning_logs
joblib.externals.loky.process_executor._RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/codeananda/anaconda3/envs/local/lib/python3.9/site-packages/joblib/externals/loky/process_executor.py", line 428, in _process_worker
r = call_item()
File "/home/codeananda/anaconda3/envs/local/lib/python3.9/site-packages/joblib/externals/loky/process_executor.py", line 275, in __call__
return self.fn(*self.args, **self.kwargs)
File "/home/codeananda/anaconda3/envs/local/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 620, in __call__
return self.func(*args, **kwargs)
File "/home/codeananda/anaconda3/envs/local/lib/python3.9/site-packages/joblib/parallel.py", line 288, in __call__
return [func(*args, **kwargs)
File "/home/codeananda/anaconda3/envs/local/lib/python3.9/site-packages/joblib/parallel.py", line 288, in <listcomp>
return [func(*args, **kwargs)
File "/home/codeananda/anaconda3/envs/local/lib/python3.9/site-packages/neuralprophet/forecaster.py", line 795, in fit
metrics_df = self._train(
File "/home/codeananda/anaconda3/envs/local/lib/python3.9/site-packages/neuralprophet/forecaster.py", line 2648, in _train
lr_finder = self.trainer.tuner.lr_find(
File "/home/codeananda/anaconda3/envs/local/lib/python3.9/site-packages/pytorch_lightning/tuner/tuning.py", line 199, in lr_find
result = self.trainer.tune(
File "/home/codeananda/anaconda3/envs/local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1052, in tune
result = self.tuner._tune(
File "/home/codeananda/anaconda3/envs/local/lib/python3.9/site-packages/pytorch_lightning/tuner/tuning.py", line 70, in _tune
result["lr_find"] = lr_find(self.trainer, model, **lr_find_kwargs)
File "/home/codeananda/anaconda3/envs/local/lib/python3.9/site-packages/pytorch_lightning/tuner/lr_finder.py", line 269, in lr_find
trainer._checkpoint_connector.restore(ckpt_path)
File "/home/codeananda/anaconda3/envs/local/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 140, in restore
self.restore_model()
File "/home/codeananda/anaconda3/envs/local/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 178, in restore_model
if self._hpc_resume_path is not None:
File "/home/codeananda/anaconda3/envs/local/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 66, in _hpc_resume_path
max_version = self.__max_ckpt_version_in_folder(dir_path_hpc, "hpc_ckpt_")
File "/home/codeananda/anaconda3/envs/local/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 506, in __max_ckpt_version_in_folder
files = [os.path.basename(f["name"]) for f in fs.listdir(dir_path)]
File "/home/codeananda/anaconda3/envs/local/lib/python3.9/site-packages/fsspec/spec.py", line 1313, in listdir
return self.ls(path, detail=detail, **kwargs)
File "/home/codeananda/anaconda3/envs/local/lib/python3.9/site-packages/fsspec/implementations/local.py", line 60, in ls
return [self.info(f) for f in it]
File "/home/codeananda/anaconda3/envs/local/lib/python3.9/site-packages/fsspec/implementations/local.py", line 60, in <listcomp>
return [self.info(f) for f in it]
File "/home/codeananda/anaconda3/envs/local/lib/python3.9/site-packages/fsspec/implementations/local.py", line 71, in info
out = path.stat(follow_symlinks=False)
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/g/My Drive/1 Projects/1 AltDG - Adam/BreakoutDetector/.lr_find_2d942da8-285c-4fdd-b0bd-fb810083bb2a.ckpt'
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/mnt/g/My Drive/1 Projects/1 AltDG - Adam/BreakoutDetector/BreakoutDetector.py", line 1471, in <module>
results = Parallel(n_jobs=-1)(delayed(model.fit)(df) for df in dfs)
File "/home/codeananda/anaconda3/envs/local/lib/python3.9/site-packages/joblib/parallel.py", line 1098, in __call__
self.retrieve()
File "/home/codeananda/anaconda3/envs/local/lib/python3.9/site-packages/joblib/parallel.py", line 975, in retrieve
self._output.extend(job.get(timeout=self.timeout))
File "/home/codeananda/anaconda3/envs/local/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 567, in wrap_future_result
return future.result(timeout=timeout)
File "/home/codeananda/anaconda3/envs/local/lib/python3.9/concurrent/futures/_base.py", line 446, in result
return self.__get_result()
File "/home/codeananda/anaconda3/envs/local/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
raise self._exception
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/g/My Drive/1 Projects/1 AltDG - Adam/BreakoutDetector/.lr_find_2d942da8-285c-4fdd-b0bd-fb810083bb2a.ckpt'
(local) codeananda@King:/mnt/g/My Drive/1 Projects/1 AltDG - Adam/BreakoutDetector$ /home/codeananda/anaconda3/envs/local/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 8 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
Using:
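One untested idea for keeping the workers out of each other's way is to run each fit in its own scratch directory. The sketch below assumes that the `.lr_find_*.ckpt` file and the lightning_logs folder are written relative to the current working directory, as the paths in the traceback suggest. `isolated_workdir`, `fit_in_isolation`, and `fake_fit` are hypothetical names, and `fake_fit` only stands in for `model.fit`.

```python
import contextlib
import os
import tempfile
from pathlib import Path


@contextlib.contextmanager
def isolated_workdir():
    """Temporarily switch the process into a fresh temporary directory."""
    previous = Path.cwd()
    with tempfile.TemporaryDirectory() as tmp:
        os.chdir(tmp)
        try:
            yield Path(tmp)
        finally:
            # Leave the directory before it is deleted.
            os.chdir(previous)


def fit_in_isolation(fit_fn, *args, **kwargs):
    """Call fit_fn with its relative-path files confined to a scratch dir."""
    with isolated_workdir():
        return fit_fn(*args, **kwargs)


def fake_fit(name):
    # Stand-in for model.fit: writes a scratch file relative to the cwd.
    Path("scratch.ckpt").write_text(name)
    return Path("scratch.ckpt").read_text()


results = [fit_in_isolation(fake_fit, f"series-{i}") for i in range(3)]
print(results)
```

With joblib this would become something like `delayed(fit_in_isolation)(model.fit, df)`. Two caveats: `os.chdir` affects the whole process, so this only makes sense with process-based workers, and anything the fit writes to a relative path is deleted along with the scratch directory.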
-
When training multiple NeuralProphet models on multiple time series in parallel, I often get a FileNotFoundError.
It's looking for a ckpt file that isn't there, even though I have not told NP to store checkpoints.
I can provide more info but need to rush off now, and thought someone might know the solution just from this. What else would you like?
Note that this does not happen every time. Sometimes parallel execution works for all series, sometimes not. The series on which it fails also change on each run.
NP v0.5 installed from source
Full traceback below (using betterexceptions)
@ourownstory @karl-richter