Update NeMo/Megatron #302

Open · wants to merge 36 commits into main

Conversation

@sichu2023 (Collaborator) commented Oct 10, 2024

Summary

  • avoid checkpoint directory name clashes
  • IOMixin only serializes non-default values
  • update ESM-2 tokenizer serialization test
  • turn off ckpt_async_save in stop-and-go test
  • switch to trainer.should_stop instead of raising an exception (see the sketch below)
  • identify commit 19766a217, which impacts training resumption
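
The stop half of the stop-and-go test now requests a graceful stop rather than raising. A minimal sketch of the pattern, assuming a plain PyTorch Lightning callback (the repo's actual callback may differ):

# Illustrative only: stop the run after a fixed number of steps by setting
# trainer.should_stop, so checkpointing and teardown still run normally.
from pytorch_lightning.callbacks import Callback

class StopAfterNSteps(Callback):
    def __init__(self, stop_step: int):
        self.stop_step = stop_step

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        if trainer.global_step >= self.stop_step:
            trainer.should_stop = True  # graceful stop, unlike raising an exception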

Details

Megatron now checks whether a non-empty checkpoint directory already exists before overwriting it (commit). Our unit tests use a small validate_every_n_steps, which produces identical monitored metric values across validations and therefore identical checkpoint directory names.
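
A toy illustration of the clash (the filename template here is an assumption, not BioNeMo's actual one): when the checkpoint directory name is built from the monitored metric and that metric is identical at two consecutive validations, both saves resolve to the same directory, which Megatron now refuses to overwrite.

def checkpoint_dirname(monitor: str, value: float) -> str:
    # hypothetical "{monitor}={value}" naming scheme
    return f"epoch=0-{monitor}={value:.2f}"

first = checkpoint_dirname("val_loss", 2.30)   # save after validation 1
second = checkpoint_dirname("val_loss", 2.30)  # save after validation 2, metric unchanged
assert first == second                         # same non-empty directory -> Megatron error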

NeMo updated IOMixin.io_dump to include yaml_attrs. In addition, IOMixin now serializes only non-default values, to avoid the constant version conflicts in Megatron config objects.
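
A sketch of the "serialize only non-default values" idea (not NeMo's actual IOMixin code): compare the captured init kwargs against the signature defaults and keep only the overridden ones, so defaults that change between Megatron versions no longer conflict on reload.

import inspect

def non_default_hparams(cls, init_kwargs: dict) -> dict:
    defaults = {
        name: p.default
        for name, p in inspect.signature(cls.__init__).parameters.items()
        if p.default is not inspect.Parameter.empty
    }
    return {k: v for k, v in init_kwargs.items() if defaults.get(k, object()) != v}

class Config:  # hypothetical config class for illustration
    def __init__(self, hidden_size: int = 128, num_layers: int = 4):
        self.hidden_size, self.num_layers = hidden_size, num_layers

# only the overridden value is kept
assert non_default_hparams(Config, {"hidden_size": 256, "num_layers": 4}) == {"hidden_size": 256}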

Notes

There appears to be unit test state leakage, but only in the few test functions listed below.

Pytest errors

On NeMo/Megatron TOT (top of tree)

FAILED sub-packages/bionemo-esm2/tests/bionemo/esm2/model/test_stop_and_go.py::TestESM2StopAndGo::test_stop_and_go_consistency[ValidOutputCallback] - AssertionError: Tensor-likes are not close!
FAILED sub-packages/bionemo-esm2/tests/bionemo/esm2/model/test_stop_and_go.py::TestESM2StopAndGo::test_stop_and_go_consistency[ValidLossCallback] - AssertionError: Scalars are not close!

Error from test leakage

FAILED sub-packages/bionemo-esm2/tests/bionemo/esm2/model/finetune/test_finetune.py::test_esm2_finetune_token_classifier[False] - TypeError: unsupported operand type(s) for /: 'PosixPath' and 'int'
FAILED sub-packages/bionemo-esm2/tests/bionemo/esm2/model/finetune/test_finetune.py::test_esm2_finetune_regressor[False] - TypeError: unsupported operand type(s) for /: 'PosixPath' and 'int'
FAILED sub-packages/bionemo-example_model/tests/bionemo/example_model/test_lightning_basic.py::test_train_mnist_litautoencoder_with_megatron_strategy_single_gpu[32] - torch.distributed.checkpoint.api.CheckpointException: CheckpointException ranks:dict_keys([0])
FAILED sub-packages/bionemo-example_model/tests/bionemo/example_model/test_lightning_basic.py::test_train_mnist_litautoencoder_with_megatron_strategy_single_gpu[bf16-mixed] - torch.distributed.checkpoint.api.CheckpointException: CheckpointException ranks:dict_keys([0])
FAILED sub-packages/bionemo-testing/tests/bionemo/testing/data/test_load.py::test_default_pbss_client - botocore.exceptions.ConfigParseError: Unable to parse config file: /home/bionemo/.aws/config

@jstjohn (Collaborator) left a comment

This is related in some ways to #304, maybe sync with @farhadrgh and make sure that you are on a new enough NeMo for his needs as well? I think his stuff was recently merged.

@sichu2023 (Collaborator Author) commented:

Related to #253

@sichu2023 (Collaborator Author) commented:

Observed a warning regarding the LR scheduler:

------------------------------------------------------------------------------------------------------------------- Captured stderr call --------------------------------------------------------------------------------------------------------------------
Trainer already configured with model summary callbacks: [<class 'pytorch_lightning.callbacks.rich_model_summary.RichModelSummary'>]. Skipping setting a default `ModelSummary` callback.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
[NeMo W 2024-10-31 15:26:44 nemo_logger:123] No version folders would be created under the log folder as 'resume_if_exists' is enabled.
[NeMo W 2024-10-31 15:26:44 nemo_logger:173] "update_logger_directory" is True. Overwriting tensorboard logger "save_dir" to /tmp/pytest-of-bionemo/pytest-13/test_esm2_finetune_token_class0/pretrain/../../tmp/pytest-of-bionemo/pytest-13/test_esm2_finetune_token_class0/pretrain
[NeMo W 2024-10-31 15:26:44 nemo_logger:189] The Trainer already contains a ModelCheckpoint callback. This will be overwritten.
[NeMo W 2024-10-31 15:26:44 resume:215] There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :/tmp/pytest-of-bionemo/pytest-13/test_esm2_finetune_token_class0/pretrain/test_experiment/checkpoints. Training from scratch.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
[NeMo W 2024-10-31 15:26:44 megatron_strategy:310] Could not copy Trainer's 'max_steps' to LR scheduler's 'max_steps'. If you are not using an LR scheduler, this warning can safely be ignored.

@sichu2023 (Collaborator Author) commented:

Related to NVIDIA/TransformerEngine#1130

@sichu2023 (Collaborator Author) commented Nov 4, 2024

@jstjohn Looking into sub-packages/bionemo-llm/tests/bionemo/llm/utils/test_iomixin_utils.py::TestIOMixin::test_dataclass_out_of_sync, it seems the inherited non-default_factory class attribute "c" is no longer accessible via get_hparams.

(Pdb) v1.a, v1.b, v1.c
(4, 3, 3)
(Pdb) v1.get_hparams()
{'b': 7} 
(Pdb) v1_copy = io.reinit(v1)
(Pdb) v1_copy.get_hparams()
{'b': 7}
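
A plain-Python illustration of the behavior above (not the bionemo-llm IOMixin itself; the field names only mirror the pdb session): when hparams record only the non-default __init__ arguments, an attribute such as the inherited "c" never appears in get_hparams(), so values set after construction cannot survive a reinit from those hparams.

from dataclasses import dataclass

@dataclass
class Base:
    c: int = 1  # inherited attribute, left at its default at construction

@dataclass
class Child(Base):
    a: int = 2
    b: int = 3

v1 = Child(b=7)              # only b is non-default, so recorded hparams would be {"b": 7}
v1.a, v1.b, v1.c = 4, 3, 3   # mutate attributes out of sync with the recorded hparams

v1_copy = Child(**{"b": 7})  # reinit from hparams: everything else falls back to defaults
assert (v1_copy.a, v1_copy.b, v1_copy.c) == (2, 7, 1)  # post-init changes, including c, are lost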

@@ -62,7 +61,6 @@ def dummy_protein_dataset(tmp_path):
return db_file


@pytest.mark.skip("duplicate unittest")
@pytest.fixture
def dummy_parquet_train_val_inputs(tmp_path):
Collaborator review comment:

While you're at it, you may want to consider importing these fixtures from bionemo.testing.

@@ -17,7 +17,7 @@ RUN git clone https://github.com/NVIDIA/apex.git && \
--config-settings "--build-option=--cpp_ext --cuda_ext --fast_layer_norm --distributed_adam --deprecated_fused_adam --group_norm"

# Transformer Engine pre-1.7.0. 1.7 standardizes the meaning of bits in the attention mask to match
-ARG TE_COMMIT=7d576ed25266a17a7b651f2c12e8498f67e0baea
+ARG TE_COMMIT=c27ee60ec746210bcea4ec33958dbbff06706506
Collaborator review comment:

Attempting to split this as a separate PR in #399

@sichu2023 (Collaborator Author) commented:

Depends on #414.

sichu2023 and others added 26 commits November 8, 2024 12:41
…plicated checkpoint_dir name"

This reverts commit b936fda.
@sichu2023 (Collaborator Author) commented:

Quick NeMo patch to fix stop-and-go. Can remove the xfail and merge after NeMo's PR:
NVIDIA/NeMo#11029

@sichu2023 (Collaborator Author) commented:

Here is the long-term NeMo fix:
Lightning-AI/pytorch-lightning#20379
