Don't overwrite logs when resuming training #153

peastman · 2022-11-18T19:14:26Z

I often want to resume training from a checkpoint with --load-model. When I do that, I don't want to lose all the information in the log and metrics.csv files. The obvious way to do that is to create a new log directory for the continuation and use --log-dir and --redirect to tell it to put all new files in the new directory. But it doesn't work. Instead, it ignores those options and uses the same log directory as the original training run, deleting and overwriting the existing logs in the process. To prevent that, you first need to copy your existing log directory to a new location. I've lost work several times by forgetting to do that.

How about making it so that --load-model does not override --log-dir and --redirect? That's just telling it what model to load. It wouldn't prevent you from saving logs to a different directory.

PhilippThoelke · 2022-11-18T21:09:58Z

I'm not sure why overwriting log_dir doesn't work properly when load_model is set. The arguments are parsed in the order in which they are defined in train.py. Since --load-model is the first argument there, the loaded model's hparams should always be overwritten by specifying further arguments via config file or CLI.

peastman · 2022-11-18T21:51:29Z

Is it because of this line in LoadFromCheckpoint?

torchmd-net/torchmdnet/utils.py, line 181 in df7c906:

namespace.__dict__.update(config)

namespace contains the options that were passed in. That line overwrites them with ones from the checkpoint, before they've yet had a chance to be processed.
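A minimal sketch of how this kind of overwrite can happen with plain argparse. The `OverwritingLoad` class, the `STORED` dictionary, and the `parse` helper are hypothetical stand-ins, not the actual LoadFromCheckpoint code; the stored values are hard-coded here instead of being read from a checkpoint. It relies on the fact that argparse invokes actions in the order the options appear on the command line, so any option given before --load-model is already in the namespace when the update runs.

```python
import argparse

# Hypothetical stand-in for the hyperparameters saved in a checkpoint.
STORED = {"log_dir": "old_logs"}


class OverwritingLoad(argparse.Action):
    """Illustrative action that blindly copies stored values into the namespace."""

    def __call__(self, parser, namespace, values, option_string=None):
        # Same pattern as the quoted line: everything already in the
        # namespace under these keys is replaced.
        namespace.__dict__.update(STORED)
        namespace.load_model = values


def parse(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("--load-model", action=OverwritingLoad)
    parser.add_argument("--log-dir", default="logs")
    return parser.parse_args(argv)


# --log-dir given first: it is parsed, then overwritten by the update.
before = parse(["--log-dir", "new_logs", "--load-model", "ckpt"])
# --log-dir given last: it is parsed after the update and survives.
after = parse(["--load-model", "ckpt", "--log-dir", "new_logs"])
print(before.log_dir, after.log_dir)
```

In this sketch the result depends only on command-line order, not on the order in which the arguments are defined in the parser.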

PhilippThoelke · 2022-11-19T00:13:44Z

I somehow thought that arguments are parsed in the order in which they are defined in the code, but a quick test showed that this is clearly not true. So yes, that line is the problem. We should probably only update the namespace with arguments that were not specified by the user.
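One possible way to sketch that idea, not necessarily the fix the project adopted: copy a stored value into the namespace only if the option still holds its parser default. `LoadFromCheckpointSketch`, `CHECKPOINT_HPARAMS`, and `build_parser` are hypothetical names for illustration, and the stored values are hard-coded rather than read from a real checkpoint.

```python
import argparse

# Hypothetical stand-in for the hyperparameters saved in a checkpoint.
CHECKPOINT_HPARAMS = {"log_dir": "old_logs", "lr": 0.001}


class LoadFromCheckpointSketch(argparse.Action):
    """Illustrative action that keeps values the user has already set."""

    def __call__(self, parser, namespace, values, option_string=None):
        setattr(namespace, self.dest, values)
        # Real code would read the config from the file named by `values`.
        config = CHECKPOINT_HPARAMS
        for key, value in config.items():
            # Only fill in options that are still at their default, i.e.
            # that were not given earlier on the command line.
            if getattr(namespace, key, None) == parser.get_default(key):
                setattr(namespace, key, value)


def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("--load-model", action=LoadFromCheckpointSketch, default=None)
    parser.add_argument("--log-dir", default="logs")
    parser.add_argument("--lr", type=float, default=0.01)
    return parser


first = build_parser().parse_args(["--log-dir", "new_logs", "--load-model", "ckpt"])
last = build_parser().parse_args(["--load-model", "ckpt", "--log-dir", "new_logs"])
resumed = build_parser().parse_args(["--load-model", "ckpt"])
print(first.log_dir, last.log_dir, resumed.log_dir)
```

A known limitation of this approach: an option explicitly set to its default value before --load-model cannot be told apart from an option that was never given, so it would still be replaced by the stored value.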
