
Fix geneformer training instability bug #421

Merged: 14 commits from jstjohn/loss-jump-geneformer into main on Nov 12, 2024

Conversation

@jstjohn (Collaborator) commented Nov 8, 2024

See wandb runs here: https://wandb.ai/clara-discovery/geneformer_bionemo2_timing2

See the results below: we can precisely control whether or not the grad norm instability appears by setting or unsetting the two NVTE env variables. Adding the NVTE env variables to our container is also a recent change. Based on these results we are unsetting these variables for now; the change does not cause a significant performance hit.

Old run where this was not an issue:

[Screenshot: 2024-11-12 9:42:45 AM]

Representative new run where we see a spike in grad norm:

[Screenshot: 2024-11-12 9:43:25 AM]

We can make this spike go away by unsetting NVTE_FUSED_ATTN and NVTE_FLASH_ATTN:

[Screenshot: 2024-11-12 9:43:44 AM]

We can introduce this spike on the old image, which didn't have these env variables, by setting them:

[Screenshot: 2024-11-12 9:44:16 AM]

Example of a longer/larger-batch run that fails with these env variables set:

[Screenshot: 2024-11-12 9:45:07 AM]

We can stabilize this run by unsetting these env variables:

[Screenshot: 2024-11-12 9:45:30 AM]
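
For reference, a minimal sketch of what the workaround can look like at launch time; the image tag (bionemo2-image) and the training entrypoint (train_geneformer.py) below are placeholders rather than the exact ones used in these runs:

```bash
# Check whether the container image bakes the variables in
# (the image tag here is a placeholder).
docker run --rm bionemo2-image env | grep -i NVTE

# Unset them inside the container right before training starts, so any value
# baked into the image (or inherited from the host) is removed.
docker run --rm --gpus all bionemo2-image \
  bash -c 'unset NVTE_FUSED_ATTN NVTE_FLASH_ATTN && python train_geneformer.py'
```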

The regression seems to be relatively recent, so this PR will test some recent changes to see whether any of them is causing it:

  • Check whether the arange change is causing this
  • Check whether the grad buffer change (which should not be enabled) is causing this
  • Check the bias fusions
  • Check the garbage collection callback

Find out when this worked:

  • PR 409 right before second perf change and dset change
  • PR 410 after first perf change, CLI refactor, and wandb fix
  • PR 404 right before new CLI
  • PR 362 (2 weeks ago) but restarting job before the gradients start to increase
  • PR 362 (2 weeks ago)
  • worked: https://wandb.ai/clara-discovery/geneformer_bionemo2/runs/0sSIf3tl?nw=nwusernvjstjohn (uses bionemo2-pr312--136b1889fc390d9dad04f077b32b8fbecf50e25d)
  • bionemo2-pr312--136b1889fc390d9dad04f077b32b8fbecf50e25d but with NVTE_FUSED_ATTN=1 and NVTE_FLASH_ATTN=0 set in my script: **did not work**
  • bionemo2-pr312--136b1889fc390d9dad04f077b32b8fbecf50e25d but with the NVTE_FUSED_ATTN=1 and NVTE_FLASH_ATTN=0 settings unset in my script: WORKED!!
  • bionemo2-pr419--f2599382e4afaf061c9948628f3f72bb8e233fd6 (most recent merged PR) but with the NVTE_FUSED_ATTN=1 and NVTE_FLASH_ATTN=0 settings manually unset

Notes on differences between TOT (top of tree) and pr312--136b1889fc390d9dad04f077b32b8fbecf50e25d

  • The env doesn't have the NVTE_FUSED* settings. It's unclear whether the slurm script adds them properly or not.
    • NVTE_FUSED_ATTN and NVTE_FLASH_ATTN are set in bionemo2-pr373--db2fe9cc240b12bfaf045654fc5350a7b985c9de, for example.
    • In slurm, --export=ALL is the default and passes through all env variables. Perhaps that is what happens here, so the run where I have those env variables added might fail if they are causing the issue.
  • The successful run used bs=32 vs. 64. I'm running a test now that has the NVTE* settings in the docker script but not in the image.
  • This was a closed branch; maybe some key changes didn't make it to main.
  • No pip freeze differences stand out that distinguish the branch that passes from the set that fails.
  • NOTE: See the experiments above around NVTE_FUSED_ATTN=1 and NVTE_FLASH_ATTN=0. I am pretty sure these settings are what cause the training instability in geneformer. Unsetting them makes the old PR work, and setting them causes that old PR to fail with this explosion of gradients.
  • Currently I'm rerunning tests on a TOT branch but calling unset on those variables in my script so that they are removed from the container env prior to executing the script (a sketch of this pattern follows this list). If this fixes the TOT training curve, I will feel very confident that this is what's going on, and we can focus on purging references to these variables from our docs, other than perhaps highlighting how they cause training instability.
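
A rough sketch of the unset-before-exec pattern from the last bullet, assuming a generic sbatch-style job script; the job name and the launch command are placeholders, not the actual script used in these runs:

```bash
#!/bin/bash
#SBATCH --job-name=geneformer-nvte-debug   # placeholder job name

# slurm's default --export=ALL forwards every variable from the submitting
# shell into the job, and the container image may also set these, so remove
# them explicitly right before launching training.
unset NVTE_FUSED_ATTN
unset NVTE_FLASH_ATTN

# Sanity check: warn if anything NVTE-related is still set.
env | grep -i '^NVTE' && echo "WARNING: NVTE variables still set"

# Placeholder launch command for the actual pretraining entrypoint.
python train_geneformer.py
```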

@jstjohn (Collaborator, Author) commented Nov 8, 2024

/pre-commit

@jstjohn (Collaborator, Author) commented Nov 8, 2024

/build-ci

@jstjohn (Collaborator, Author) commented Nov 9, 2024

Partial revert of #408

@jstjohn (Collaborator, Author) commented Nov 9, 2024

/build-ci

@jstjohn (Collaborator, Author) commented Nov 9, 2024

/build-ci

@jstjohn (Collaborator, Author) commented Nov 9, 2024

/build-ci

@jstjohn (Collaborator, Author) commented Nov 9, 2024

/build-ci

@jstjohn (Collaborator, Author) commented Nov 11, 2024

/build-ci

@jstjohn changed the title from "Undo change to position ids to debug loss curve increase" to "Fix geneformer training instability bug" on Nov 11, 2024
@jstjohn (Collaborator, Author) commented Nov 12, 2024

/build-ci

@jstjohn enabled auto-merge (squash) on November 12, 2024 at 17:50
@jstjohn (Collaborator, Author) commented Nov 12, 2024

/build-ci

@jstjohn merged commit 7192b5b into main on Nov 12, 2024
4 checks passed
@jstjohn deleted the jstjohn/loss-jump-geneformer branch on November 12, 2024 at 19:06