How to resume learning? #2569
Hi @kazuma0606 ,

In a few lines of code, you can do the following: see ignite/examples/contrib/cifar10/main.py, line 334 and lines 351 to 357 (at commit 315b6b9).

HTH
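For readers without the linked file at hand, here is a rough sketch of the save/resume pattern those lines illustrate. The names (model, optimizer, trainer) and paths are assumptions for illustration, not the example's actual code:

```python
import torch
from ignite.engine import Events
from ignite.handlers import Checkpoint, DiskSaver

# model, optimizer and trainer are assumed to be defined elsewhere.
to_save = {"model": model, "optimizer": optimizer, "trainer": trainer}

# Save a checkpoint every 500 iterations (directory is illustrative).
checkpoint_handler = Checkpoint(to_save, DiskSaver("/tmp/output", require_empty=False), n_saved=2)
trainer.add_event_handler(Events.ITERATION_COMPLETED(every=500), checkpoint_handler)

# On restart, reload the saved state before calling trainer.run(...).
checkpoint = torch.load("/tmp/output/checkpoint_500.pt", map_location="cpu")  # hypothetical file
Checkpoint.load_objects(to_load=to_save, checkpoint=checkpoint)
```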
Thanks for the reply. Am I correct in assuming that the code below does not include the epoch and loss information needed to resume learning?

```python
from ignite.engine import Events
from ignite.handlers import ModelCheckpoint, TerminateOnNan

checkpoint_handler = ModelCheckpoint(
    dirname="/content/drive/My Drive/Colab Notebooks/CycleGAN_Project/pytorch-CycleGAN-and-pix2pix/datasets/T1W2T2W/cpk",
    filename_prefix="",
    require_empty=False,
)
to_save = {
    "generator_A2B": generator_A2B,
    "discriminator_B": discriminator_B,
    "generator_B2A": generator_B2A,
    "discriminator_A": discriminator_A,
    "optimizer_G": optimizer_G,
    "optimizer_D": optimizer_D,
}
trainer.add_event_handler(Events.ITERATION_COMPLETED(every=500), checkpoint_handler, to_save)
trainer.add_event_handler(Events.ITERATION_COMPLETED, TerminateOnNan())
```
@kazuma0606 Yes, you are correct. In order to save the epoch and iteration, we need to save the trainer as well:

```python
to_save = {
    "generator_A2B": generator_A2B,
    "discriminator_B": discriminator_B,
    "generator_B2A": generator_B2A,
    "discriminator_A": discriminator_A,
    "optimizer_G": optimizer_G,
    "optimizer_D": optimizer_D,
    "trainer": trainer,
}
```

As for the batch loss, there is no need to save it: once the models are restored, they will give similar batch loss values.
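A minimal sketch of the resuming side, assuming the to_save dict above (the checkpoint filename, train_loader and max_epochs are hypothetical). Because trainer is included in the dict, Checkpoint.load_objects also restores the epoch and iteration counters:

```python
import torch
from ignite.handlers import Checkpoint

checkpoint_path = "/content/drive/.../cpk/checkpoint_26500.pt"  # hypothetical saved file
checkpoint = torch.load(checkpoint_path, map_location="cpu")

# Restores the models, the optimizers and the trainer state (epoch, iteration).
Checkpoint.load_objects(to_load=to_save, checkpoint=checkpoint)

# A subsequent run continues from the restored epoch/iteration.
trainer.run(train_loader, max_epochs=max_epochs)
```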
Hi, @vfdev-5

Regards.
Hi @kazuma0606

I'm not sure about the academic point of view, but if this is about deterministic training and reproducibility while resuming from a checkpoint, there are a few things to take into account; see the sketch below.

More info: https://pytorch.org/ignite/engine.html#deterministic-training

You can also reach the team through the channels listed here: https://github.com/pytorch/ignite#communication

We can try to prioritize this feature. A related issue is already open: #966
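If full reproducibility on resume matters, a minimal sketch using ignite's DeterministicEngine (train_step is an assumed user-defined process function):

```python
from ignite.engine import DeterministicEngine
from ignite.utils import manual_seed

manual_seed(42)  # seeds torch, random and numpy in one call

# Drop-in replacement for Engine: makes the dataflow reproducible,
# so resuming from a checkpoint replays batches in the same order.
trainer = DeterministicEngine(train_step)
```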
Hi, @vfdev-5

Are the functions run_evaluation() and log_generated_images() called automatically at the start of training, and can they capture variables, like lambda expressions do?
Hi @kazuma0606

The complete code is the following:

```python
@trainer.on(Events.EPOCH_STARTED)
def run_evaluation(engine):
    evaluator.run(eval_train_loader)
    evaluator.run(eval_test_loader)


def log_generated_images(engine, logger, event_name):
    # ...
    pass


tb_logger.attach(
    evaluator,
    log_handler=log_generated_images,
    event_name=Events.COMPLETED,
)
```

As you can see, run_evaluation is attached to the trainer on Events.EPOCH_STARTED, while log_generated_images is attached to the evaluator via tb_logger.

Yes, I think you can use any variables from your global scope in these functions. If you want to pass an argument explicitly, you can do something like:

```python
another_lambda = lambda: "check another lambda"

@trainer.on(Events.EPOCH_STARTED, lambda: "check lambda")
def run_evaluation(engine, fn):
    print(fn(), another_lambda())
```

Extra positional arguments passed to trainer.on(...) are forwarded to the handler after engine, which is why run_evaluation receives fn here.
Hi, @vfdev-5

By the way, is the TerminateOnNan handler meant to suppress overfitting? Sorry for all the questions.
I don't know if this is relevant, but I had to prepare the training Dataset on my own.
Hi @kazuma0606

No, it is not about overfitting: when the loss goes NaN, learning is not possible anymore, as the weights are NaN as well, and we just waste resources. The loss can go NaN in various cases.

I'm not sure I understand your point here, sorry.

Yes, less GPU memory usage and faster training on Nvidia GPUs with Turing cores.

I do not think that your data is responsible for the NaN; try the two points above first and see if it helps.
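For reference, a small sketch of how TerminateOnNan is typically attached. output_transform is a real parameter for picking the loss out of a structured output; the dict key "loss" here is an assumption about the trainer's output format:

```python
from ignite.engine import Events
from ignite.handlers import TerminateOnNan

# Checks engine.state.output after every iteration and calls
# engine.terminate() if it contains NaN or infinite values.
trainer.add_event_handler(
    Events.ITERATION_COMPLETED,
    TerminateOnNan(output_transform=lambda out: out["loss"]),  # "loss" key is assumed
)
```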
❓ Questions/Help/Support

Hi, support team.

This is my first time asking a question. I believe the following code will load the checkpoints:

```python
checkpoint_path = "/tmp/cycle_gan_checkpoints/checkpoint_26500.pt"
```

If the learning process takes a long time, there will be interruptions along the way. In such a case, what code can I use to resume learning?

I look forward to hearing from you. Regards.