Fix geneformer training instability bug #421
Merged
Conversation
jstjohn requested review from farhadrgh, dorotat-nv, malcolmgreaves, pstjohn and skothenhill-nv as code owners on November 8, 2024 23:08
/pre-commit
/build-ci
Partial revert of #408
…hn/loss-jump-geneformer
/build-ci
/build-ci
/build-ci
/build-ci
/build-ci
jstjohn changed the title from "Undo change to position ids to debug loss curve increase" to "Fix geneformer training instability bug" on Nov 11, 2024
/build-ci
pstjohn approved these changes on Nov 12, 2024
malcolmgreaves approved these changes on Nov 12, 2024
/build-ci
polinabinder1 approved these changes on Nov 12, 2024
See wandb runs here: https://wandb.ai/clara-discovery/geneformer_bionemo2_timing2
See the results below: we can precisely control whether there is a grad-norm instability by setting or unsetting the two NVTE env variables. Adding these variables to our container is also a recent change. Based on these results we are unsetting them for now; doing so does not cause a significant performance hit.
- Old run where this was not an issue.
- Representative new run where we see a spike in grad norm.
- We can make this spike go away by unsetting `NVTE_FUSED_ATTN` and `NVTE_FLASH_ATTN`.
- We can introduce this spike on the old image that didn't have these env variables by setting them.
- Example longer/larger batch run that fails with these env variables set.
- We can stabilize this run by unsetting these env variables.
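The toggle described above can be sketched as a shell snippet (the training command is a placeholder, not the actual BioNeMo entrypoint). The key detail is that the stable configuration removes the variables from the environment entirely via `unset`, rather than setting them to 0:

```shell
#!/bin/bash
# State that reproduces the grad-norm spike: both NVTE toggles present in the env.
export NVTE_FUSED_ATTN=1
export NVTE_FLASH_ATTN=0
# python train.py ...   # placeholder for the real geneformer pretraining command

# State that makes the spike go away: unset removes the variables entirely,
# so Transformer Engine falls back to its default attention-backend selection.
unset NVTE_FUSED_ATTN NVTE_FLASH_ATTN
# python train.py ...

# Confirm the variables are really gone from the environment a child
# process would inherit:
env | grep -cE '^NVTE_(FUSED|FLASH)_ATTN=' || true   # prints 0
```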
The instability seems to be relatively recent, so this PR tests some recent changes to see whether any of them is the cause.
Find out when this worked, using image `bionemo2-pr312--136b1889fc390d9dad04f077b32b8fbecf50e25d`:
- `NVTE_FUSED_ATTN=1` and `NVTE_FLASH_ATTN=0` set in my script: **did not work**.
- `NVTE_FUSED_ATTN=1` and `NVTE_FLASH_ATTN=0` unset in my script: **WORKED!!**
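When bisecting images like this, a small helper makes the check repeatable. The `docker run` line and image tag are illustrative (not from this PR); the helper just reports whether an `env`-style dump contains the suspect variables:

```shell
#!/bin/bash
# Report whether an `env`-style dump (read from stdin) contains the suspect
# NVTE attention variables; prints the matching lines, or a note if absent.
check_nvte() {
  grep -E '^NVTE_(FUSED|FLASH)_ATTN=' || echo "NVTE attention variables not set"
}

# Illustrative usage against a container image's baked-in environment:
#   docker run --rm <image-tag> env | check_nvte
printf 'NVTE_FUSED_ATTN=1\nPATH=/usr/bin\n' | check_nvte   # prints NVTE_FUSED_ATTN=1
```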
Notes on differences between TOT and `pr312--136b1889fc390d9dad04f077b32b8fbecf50e25d`:
- `env` doesn't have the `NVTE_FUSED*` settings. Unclear whether the slurm script adds them properly or not.
- `NVTE_FUSED_ATTN` and `NVTE_FLASH_ATTN` are set in `bionemo2-pr373--db2fe9cc240b12bfaf045654fc5350a7b985c9de`, for example.
- `--export=ALL` is the default and passes all env variables through. Perhaps that happens here, so the run where I have those env variables added might fail if those are causing the issue.
- `pip freeze` differences pop out that distinguish the branch that passes from the set that fail.
- `NVTE_FUSED_ATTN=1` and `NVTE_FLASH_ATTN=0`: I am pretty sure these settings are what cause the training instability in geneformer. Unsetting them makes the old PR work, and setting them makes that same old PR fail with the explosion of gradients.
- Next step: `unset` those variables in my script so that they are removed from the container env prior to executing the script. If this fixes the TOT training curve I will feel very confident that this is what's going on, and we can focus on purging references to these variables from our docs, other than maybe highlighting how they result in training instability.
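The fix described in that last note can be sketched as part of the slurm launch script (assumed structure; the `echo` stands in for the real training command, which is not shown in this thread). Since sbatch's `--export=ALL` default forwards the submitting shell's environment, and the container image itself may bake the variables in, the script strips them explicitly just before launch:

```shell
#!/bin/bash
# Sketch of the launch-script fix (assumed structure, not the actual script).
# --export=ALL is the sbatch default, so NVTE_* from the submit environment
# would otherwise propagate into the job; the container image may also set
# them. Remove both variables before executing the training command.
unset NVTE_FUSED_ATTN NVTE_FLASH_ATTN

# Placeholder for the real training launch:
echo "NVTE_FUSED_ATTN=${NVTE_FUSED_ATTN:-unset} NVTE_FLASH_ATTN=${NVTE_FLASH_ATTN:-unset}"
# prints: NVTE_FUSED_ATTN=unset NVTE_FLASH_ATTN=unset
```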