Register batch complete progress before batch_end hooks #20376
⛈️ Required checks status: Has failure 🔴
Groups summary
🔴 pytorch_lightning: Tests workflow
🟢 pytorch_lightning: Azure GPU
🟢 pytorch_lightning: Benchmarks
🟢 pytorch_lightning: Docs
🟢 mypy
🟢 install
Thank you for your contribution! 💜
What does this PR do?
Fixes #14579
The following code
will produce skewed progress information in the checkpoints, compared to the case where there is no restart.
This happens because when ModelCheckpoint is triggered on on_train_batch_end, it does not see batch_progress.total.completed updated to include the batch that was just processed: progress is only updated right after the hook is called.
After a restart, there is no opportunity to register the actual completion of that batch, so the saved progress stays behind, and the skew grows in proportion to the number of restarts. This shifts the point at which validation is triggered, which therefore becomes dependent on the number of restarts.
This PR addresses this issue by first updating batch progress and then calling batch_end hooks.
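The ordering change can be illustrated with a small, self-contained sketch. This is not Lightning's actual loop code: ToyLoop, update_before_hook, and seen_by_hook are hypothetical names made up for illustration, and the hook here merely records the counter value that a checkpoint callback would have saved.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyLoop:
    """Hypothetical stand-in for a training loop (not Lightning's real API)."""

    update_before_hook: bool  # True mimics the ordering proposed in this PR
    completed: int = 0  # plays the role of batch_progress.total.completed
    seen_by_hook: List[int] = field(default_factory=list)

    def on_train_batch_end(self) -> None:
        # A checkpoint callback would save `completed` at this point.
        self.seen_by_hook.append(self.completed)

    def run(self, num_batches: int) -> None:
        for _ in range(num_batches):
            # (the training step for the batch would run here)
            if self.update_before_hook:
                self.completed += 1  # register completion first
                self.on_train_batch_end()
            else:
                self.on_train_batch_end()  # hook sees a stale counter
                self.completed += 1


old = ToyLoop(update_before_hook=False)
old.run(3)
new = ToyLoop(update_before_hook=True)
new.run(3)

print(old.seen_by_hook)  # [0, 1, 2]
print(new.seen_by_hook)  # [1, 2, 3]
```

In the old ordering, every value the hook observes is one batch behind, so a checkpoint written from the hook would under-report progress by one batch. In the new ordering, the hook sees the count that includes the batch it was called for.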
Before submitting
PR review
Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:
Reviewer checklist
📚 Documentation preview 📚: https://pytorch-lightning--20376.org.readthedocs.build/en/20376/