Update NeMo/Megatron #302

Open · wants to merge 36 commits into main

Conversation

@sichu2023 (Collaborator) commented Oct 10, 2024

Summary

  • avoid checkpoint directory name clashes
  • IOMixin only serializes non-default values
  • update ESM-2 tokenizer serialization test
  • turn off ckpt_async_save in stop-and-go test
  • switch to trainer.should_stop instead of raising an exception (see the sketch below)
  • identify commit 19766a217, which impacts training resumption
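
The stop half of the stop-and-go test now requests a graceful stop rather than raising. A minimal sketch of the pattern, assuming a plain PyTorch Lightning callback (the repo's actual callback may differ):

# Illustrative only: stop the run after a fixed number of steps by setting
# trainer.should_stop, so checkpointing and teardown still run normally.
from pytorch_lightning.callbacks import Callback

class StopAfterNSteps(Callback):
    def __init__(self, stop_step: int):
        self.stop_step = stop_step

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        if trainer.global_step >= self.stop_step:
            trainer.should_stop = True  # graceful stop, unlike raising an exception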

Details

Megatron now checks whether a non-empty checkpoint directory already exists before overwriting it (commit). Our unit tests use a small validate_every_n_steps, which produces identical monitored metric values across validations and therefore identical checkpoint directory names.
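
A toy illustration of the clash (the filename template here is an assumption, not BioNeMo's actual one): when the checkpoint directory name is built from the monitored metric and that metric is identical at two consecutive validations, both saves resolve to the same directory, which Megatron now refuses to overwrite.

def checkpoint_dirname(monitor: str, value: float) -> str:
    # hypothetical "{monitor}={value}" naming scheme
    return f"epoch=0-{monitor}={value:.2f}"

first = checkpoint_dirname("val_loss", 2.30)   # save after validation 1
second = checkpoint_dirname("val_loss", 2.30)  # save after validation 2, metric unchanged
assert first == second                         # same non-empty directory -> Megatron error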

NeMo updated IOMixin.io_dump to include yaml_attrs. In addition, IOMixin now serializes only non-default values, to avoid the constant version conflicts in Megatron config objects.
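
A sketch of the "serialize only non-default values" idea (not NeMo's actual IOMixin code): compare the captured init kwargs against the signature defaults and keep only the overridden ones, so defaults that change between Megatron versions no longer conflict on reload.

import inspect

def non_default_hparams(cls, init_kwargs: dict) -> dict:
    defaults = {
        name: p.default
        for name, p in inspect.signature(cls.__init__).parameters.items()
        if p.default is not inspect.Parameter.empty
    }
    return {k: v for k, v in init_kwargs.items() if defaults.get(k, object()) != v}

class Config:  # hypothetical config class for illustration
    def __init__(self, hidden_size: int = 128, num_layers: int = 4):
        self.hidden_size, self.num_layers = hidden_size, num_layers

# only the overridden value is kept
assert non_default_hparams(Config, {"hidden_size": 256, "num_layers": 4}) == {"hidden_size": 256}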

Notes

There appears to be unit test state leakage, but only in the few test functions listed below.

Pytest errors

On NeMo/Megatron TOT (top of tree)

FAILED sub-packages/bionemo-esm2/tests/bionemo/esm2/model/test_stop_and_go.py::TestESM2StopAndGo::test_stop_and_go_consistency[ValidOutputCallback] - AssertionError: Tensor-likes are not close!
FAILED sub-packages/bionemo-esm2/tests/bionemo/esm2/model/test_stop_and_go.py::TestESM2StopAndGo::test_stop_and_go_consistency[ValidLossCallback] - AssertionError: Scalars are not close!

Error from test leakage

FAILED sub-packages/bionemo-esm2/tests/bionemo/esm2/model/finetune/test_finetune.py::test_esm2_finetune_token_classifier[False] - TypeError: unsupported operand type(s) for /: 'PosixPath' and 'int'
FAILED sub-packages/bionemo-esm2/tests/bionemo/esm2/model/finetune/test_finetune.py::test_esm2_finetune_regressor[False] - TypeError: unsupported operand type(s) for /: 'PosixPath' and 'int'
FAILED sub-packages/bionemo-example_model/tests/bionemo/example_model/test_lightning_basic.py::test_train_mnist_litautoencoder_with_megatron_strategy_single_gpu[32] - torch.distributed.checkpoint.api.CheckpointException: CheckpointException ranks:dict_keys([0])
FAILED sub-packages/bionemo-example_model/tests/bionemo/example_model/test_lightning_basic.py::test_train_mnist_litautoencoder_with_megatron_strategy_single_gpu[bf16-mixed] - torch.distributed.checkpoint.api.CheckpointException: CheckpointException ranks:dict_keys([0])
FAILED sub-packages/bionemo-testing/tests/bionemo/testing/data/test_load.py::test_default_pbss_client - botocore.exceptions.ConfigParseError: Unable to parse config file: /home/bionemo/.aws/config

@jstjohn (Collaborator) left a comment

This is related in some ways to #304, maybe sync with @farhadrgh and make sure that you are on a new enough NeMo for his needs as well? I think his stuff was recently merged.

@sichu2023 (Collaborator Author) commented:

Related to #253

@sichu2023 (Collaborator Author) commented:

Observed a warning regarding the LR scheduler:

------------------------------------------------------------------------------------------------------------------- Captured stderr call --------------------------------------------------------------------------------------------------------------------
Trainer already configured with model summary callbacks: [<class 'pytorch_lightning.callbacks.rich_model_summary.RichModelSummary'>]. Skipping setting a default `ModelSummary` callback.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
[NeMo W 2024-10-31 15:26:44 nemo_logger:123] No version folders would be created under the log folder as 'resume_if_exists' is enabled.
[NeMo W 2024-10-31 15:26:44 nemo_logger:173] "update_logger_directory" is True. Overwriting tensorboard logger "save_dir" to /tmp/pytest-of-bionemo/pytest-13/test_esm2_finetune_token_class0/pretrain/../../tmp/pytest-of-bionemo/pytest-13/test_esm2_finetune_token_class0/pretrain
[NeMo W 2024-10-31 15:26:44 nemo_logger:189] The Trainer already contains a ModelCheckpoint callback. This will be overwritten.
[NeMo W 2024-10-31 15:26:44 resume:215] There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :/tmp/pytest-of-bionemo/pytest-13/test_esm2_finetune_token_class0/pretrain/test_experiment/checkpoints. Training from scratch.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
[NeMo W 2024-10-31 15:26:44 megatron_strategy:310] Could not copy Trainer's 'max_steps' to LR scheduler's 'max_steps'. If you are not using an LR scheduler, this warning can safely be ignored.

@sichu2023 (Collaborator Author) commented:

Related to NVIDIA/TransformerEngine#1130

@sichu2023 (Collaborator Author) commented Nov 4, 2024

@jstjohn Looking into sub-packages/bionemo-llm/tests/bionemo/llm/utils/test_iomixin_utils.py::TestIOMixin::test_dataclass_out_of_sync, it seems the inherited non-default_factory class attribute "c" is no longer accessible via get_hparams.

(Pdb) v1.a, v1.b, v1.c
(4, 3, 3)
(Pdb) v1.get_hparams()
{'b': 7} 
(Pdb) v1_copy = io.reinit(v1)
(Pdb) v1_copy.get_hparams()
{'b': 7}
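
A plain-Python illustration of the behavior above (not the bionemo-llm IOMixin itself; the field names only mirror the pdb session): when hparams record only the non-default __init__ arguments, an attribute such as the inherited "c" never appears in get_hparams(), so values set after construction cannot survive a reinit from those hparams.

from dataclasses import dataclass

@dataclass
class Base:
    c: int = 1  # inherited attribute, left at its default at construction

@dataclass
class Child(Base):
    a: int = 2
    b: int = 3

v1 = Child(b=7)              # only b is non-default, so recorded hparams would be {"b": 7}
v1.a, v1.b, v1.c = 4, 3, 3   # mutate attributes out of sync with the recorded hparams

v1_copy = Child(**{"b": 7})  # reinit from hparams: everything else falls back to defaults
assert (v1_copy.a, v1_copy.b, v1_copy.c) == (2, 7, 1)  # post-init changes, including c, are lost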

@@ -62,7 +61,6 @@ def dummy_protein_dataset(tmp_path):
return db_file


@pytest.mark.skip("duplicate unittest")
@pytest.fixture
def dummy_parquet_train_val_inputs(tmp_path):
Collaborator review comment:

While you're at it, you may want to consider importing these fixtures from bionemo.testing.

@@ -17,7 +17,7 @@ RUN git clone https://github.com/NVIDIA/apex.git && \
--config-settings "--build-option=--cpp_ext --cuda_ext --fast_layer_norm --distributed_adam --deprecated_fused_adam --group_norm"

# Transformer Engine pre-1.7.0. 1.7 standardizes the meaning of bits in the attention mask to match
-ARG TE_COMMIT=7d576ed25266a17a7b651f2c12e8498f67e0baea
+ARG TE_COMMIT=c27ee60ec746210bcea4ec33958dbbff06706506
Collaborator review comment:

Attempting to split this as a separate PR in #399

@sichu2023 (Collaborator Author) commented:

Depends on #414.

sichu2023 and others added 26 commits November 8, 2024 12:41
…plicated checkpoint_dir name"

This reverts commit b936fda.
@sichu2023 (Collaborator Author) commented:

Quick NeMo patch to fix stop-and-go. Can remove the xfail and merge after NeMo's PR:
NVIDIA/NeMo#11029

@sichu2023 (Collaborator Author) commented:

Here is the long-term NeMo fix:
Lightning-AI/pytorch-lightning#20379
