
Fix geneformer training instability bug #421

Merged: 14 commits from jstjohn/loss-jump-geneformer into main on Nov 12, 2024

Conversation

@jstjohn (Collaborator) commented Nov 8, 2024

See wandb runs here: https://wandb.ai/clara-discovery/geneformer_bionemo2_timing2

See the results below: we can precisely control whether or not the grad norm instability appears by setting or unsetting the two NVTE env variables. Adding the NVTE env variables to our container is also a recent change. Based on these results we are unsetting these variables for now; the change does not cause a significant performance hit.

Old run where this was not an issue:

[Screenshot: 2024-11-12 9:42:45 AM]

Representative new run where we see a spike in grad norm:

[Screenshot: 2024-11-12 9:43:25 AM]

We can make this spike go away by unsetting NVTE_FUSED_ATTN and NVTE_FLASH_ATTN:

[Screenshot: 2024-11-12 9:43:44 AM]

We can introduce this spike on the old image, which didn't have these env variables, by setting them:

[Screenshot: 2024-11-12 9:44:16 AM]

Example of a longer/larger-batch run that fails with these env variables set:

[Screenshot: 2024-11-12 9:45:07 AM]

We can stabilize this run by unsetting these env variables:

[Screenshot: 2024-11-12 9:45:30 AM]
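
For reference, a minimal sketch of what the workaround can look like at launch time; the image tag (bionemo2-image) and the training entrypoint (train_geneformer.py) below are placeholders rather than the exact ones used in these runs:

```bash
# Check whether the container image bakes the variables in
# (the image tag here is a placeholder).
docker run --rm bionemo2-image env | grep -i NVTE

# Unset them inside the container right before training starts, so any value
# baked into the image (or inherited from the host) is removed.
docker run --rm --gpus all bionemo2-image \
  bash -c 'unset NVTE_FUSED_ATTN NVTE_FLASH_ATTN && python train_geneformer.py'
```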

The regression seems to be relatively recent, so this PR will test some recent changes to see whether any of them is causing it:

  • Check whether the arange change is causing this
  • Check whether the grad buffer change (which should not be enabled) is causing this
  • Check the bias fusions
  • Check the garbage collection callback

Find out when this worked:

  • PR 409 right before second perf change and dset change
  • PR 410 after first perf change, CLI refactor, and wandb fix
  • PR 404 right before new CLI
  • PR 362 (2 weeks ago) but restarting job before the gradients start to increase
  • PR 362 (2 weeks ago)
  • worked: https://wandb.ai/clara-discovery/geneformer_bionemo2/runs/0sSIf3tl?nw=nwusernvjstjohn (uses bionemo2-pr312--136b1889fc390d9dad04f077b32b8fbecf50e25d)
  • bionemo2-pr312--136b1889fc390d9dad04f077b32b8fbecf50e25d but with NVTE_FUSED_ATTN=1 and NVTE_FLASH_ATTN=0 set in my script: **did not work**
  • bionemo2-pr312--136b1889fc390d9dad04f077b32b8fbecf50e25d but with the NVTE_FUSED_ATTN=1 and NVTE_FLASH_ATTN=0 settings unset in my script: WORKED!!
  • bionemo2-pr419--f2599382e4afaf061c9948628f3f72bb8e233fd6 (most recent merged PR) but with the NVTE_FUSED_ATTN=1 and NVTE_FLASH_ATTN=0 settings manually unset

Notes on differences between TOT (top of tree) and pr312--136b1889fc390d9dad04f077b32b8fbecf50e25d

  • The env doesn't have the NVTE_FUSED* settings. It's unclear whether the slurm script adds them properly or not.
    • NVTE_FUSED_ATTN and NVTE_FLASH_ATTN are set in bionemo2-pr373--db2fe9cc240b12bfaf045654fc5350a7b985c9de, for example.
    • In slurm, --export=ALL is the default and passes through all env variables. Perhaps that is what happens here, so the run where I have those env variables added might fail if they are causing the issue.
  • The successful run used bs=32 vs. 64. I'm running a test now that has the NVTE* settings in the docker script but not in the image.
  • This was a closed branch; maybe some key changes didn't make it to main.
  • No pip freeze differences stand out that distinguish the branch that passes from the set that fails.
  • NOTE: See the experiments above around NVTE_FUSED_ATTN=1 and NVTE_FLASH_ATTN=0. I am pretty sure these settings are what cause the training instability in geneformer. Unsetting them makes the old PR work, and setting them causes that old PR to fail with this explosion of gradients.
  • Currently I'm rerunning tests on a TOT branch but calling unset on those variables in my script so that they are removed from the container env prior to executing the script (a sketch of this pattern follows this list). If this fixes the TOT training curve, I will feel very confident that this is what's going on, and we can focus on purging references to these variables from our docs, other than perhaps highlighting how they cause training instability.
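
A rough sketch of the unset-before-exec pattern from the last bullet, assuming a generic sbatch-style job script; the job name and the launch command are placeholders, not the actual script used in these runs:

```bash
#!/bin/bash
#SBATCH --job-name=geneformer-nvte-debug   # placeholder job name

# slurm's default --export=ALL forwards every variable from the submitting
# shell into the job, and the container image may also set these, so remove
# them explicitly right before launching training.
unset NVTE_FUSED_ATTN
unset NVTE_FLASH_ATTN

# Sanity check: warn if anything NVTE-related is still set.
env | grep -i '^NVTE' && echo "WARNING: NVTE variables still set"

# Placeholder launch command for the actual pretraining entrypoint.
python train_geneformer.py
```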

@jstjohn (Collaborator, Author) commented Nov 8, 2024

/pre-commit

@jstjohn (Collaborator, Author) commented Nov 8, 2024

/build-ci

@jstjohn (Collaborator, Author) commented Nov 9, 2024

Partial revert of #408

@jstjohn (Collaborator, Author) commented Nov 9, 2024

/build-ci

@jstjohn (Collaborator, Author) commented Nov 9, 2024

/build-ci

@jstjohn (Collaborator, Author) commented Nov 9, 2024

/build-ci

@jstjohn (Collaborator, Author) commented Nov 9, 2024

/build-ci

@jstjohn (Collaborator, Author) commented Nov 11, 2024

/build-ci

@jstjohn changed the title from "Undo change to position ids to debug loss curve increase" to "Fix geneformer training instability bug" on Nov 11, 2024
@jstjohn (Collaborator, Author) commented Nov 12, 2024

/build-ci

@jstjohn enabled auto-merge (squash) on November 12, 2024 at 17:50
@jstjohn (Collaborator, Author) commented Nov 12, 2024

/build-ci

@jstjohn merged commit 7192b5b into main on Nov 12, 2024
4 checks passed
@jstjohn deleted the jstjohn/loss-jump-geneformer branch on November 12, 2024 at 19:06