Bump completed progress on load with checkpoints saved on_train_batch_end #20379
base: master
⛈️ Required checks status: Has failure 🔴
Groups summary:
🔴 pytorch_lightning: Tests workflow
🟢 pytorch_lightning: Azure GPU
🟢 pytorch_lightning: Benchmarks
🟢 pytorch_lightning: Docs
🔴 mypy
🟡 install
Thank you for your contribution! 💜
What does this PR do?
Fixes #14579
The following code
will produce skewed progress information in the checkpoints, compared to the case where there is no restart.
This happens because when `ModelCheckpoint` is triggered in `on_train_batch_end`, it does not see `batch_progress.total.completed` updated for the batch that was just processed: progress is updated right after the hook is called. Upon restart, there is no opportunity to register the actual completion of that batch, which causes a skew proportional to the number of restarts. This affects when validation is called, which in turn becomes dependent on the number of restarts.
This PR addresses this issue by reconciling progress upon restart.
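The mechanism can be illustrated with a small, self-contained toy model. This is only a sketch under stated assumptions, not the code changed in this PR: the `Progress` class, its `processed` counter, and the helpers `run_until_checkpoint`, `load_checkpoint`, and `simulate` are hypothetical names introduced for illustration. It assumes a counter is incremented before the hook and `completed` is incremented after it, and that reconciliation means bumping `completed` up to that earlier counter on load.

```python
from dataclasses import asdict, dataclass


@dataclass
class Progress:
    # Hypothetical stand-in for batch progress counters (not Lightning's class).
    processed: int = 0  # assumed to be bumped before the batch-end hook
    completed: int = 0  # bumped right after the hook returns


def run_until_checkpoint(progress: Progress, n_batches: int) -> dict:
    """Run n_batches and save a checkpoint from the hook on the last one."""
    checkpoint = {}
    for i in range(n_batches):
        progress.processed += 1
        # Batch-end hook: a checkpoint saved here does not yet see
        # `completed` updated for the current batch.
        if i == n_batches - 1:
            checkpoint = asdict(progress)
        progress.completed += 1
    return checkpoint


def load_checkpoint(checkpoint: dict, reconcile: bool) -> Progress:
    """Restore progress; optionally bump `completed` to match `processed`."""
    progress = Progress(**checkpoint)
    if reconcile:
        progress.completed = max(progress.completed, progress.processed)
    return progress


def simulate(restarts: int, reconcile: bool, batches_per_run: int = 4) -> Progress:
    """Repeat run -> checkpoint -> restart and return the final progress."""
    progress = Progress()
    for _ in range(restarts):
        checkpoint = run_until_checkpoint(progress, batches_per_run)
        progress = load_checkpoint(checkpoint, reconcile)
    return progress


if __name__ == "__main__":
    # Without reconciliation, `completed` falls one batch behind per restart.
    print(simulate(restarts=3, reconcile=False))
    # With reconciliation on load, the counters stay in agreement.
    print(simulate(restarts=3, reconcile=True))
```

In this toy model each restart without reconciliation leaves `completed` one batch further behind, while reconciling on load keeps the two counters equal regardless of how many restarts occur.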
Before submitting
PR review
Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:
Reviewer checklist
📚 Documentation preview 📚: https://pytorch-lightning--20379.org.readthedocs.build/en/20379/