Update NeMo/Megatron #302
base: main
Conversation
This is related in some ways to #304; maybe sync with @farhadrgh and make sure that you are on a new enough NeMo for his needs as well? I think his stuff was recently merged.
Force-pushed from 28505eb to dcd025e
Related to #253.
Force-pushed from 933103a to 8ee5d4f
Observe the warning in the captured stderr:

---------------------------------- Captured stderr call ----------------------------------
Trainer already configured with model summary callbacks: [<class 'pytorch_lightning.callbacks.rich_model_summary.RichModelSummary'>]. Skipping setting a default `ModelSummary` callback.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
[NeMo W 2024-10-31 15:26:44 nemo_logger:123] No version folders would be created under the log folder as 'resume_if_exists' is enabled.
[NeMo W 2024-10-31 15:26:44 nemo_logger:173] "update_logger_directory" is True. Overwriting tensorboard logger "save_dir" to /tmp/pytest-of-bionemo/pytest-13/test_esm2_finetune_token_class0/pretrain/../../tmp/pytest-of-bionemo/pytest-13/test_esm2_finetune_token_class0/pretrain
[NeMo W 2024-10-31 15:26:44 nemo_logger:189] The Trainer already contains a ModelCheckpoint callback. This will be overwritten.
[NeMo W 2024-10-31 15:26:44 resume:215] There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :/tmp/pytest-of-bionemo/pytest-13/test_esm2_finetune_token_class0/pretrain/test_experiment/checkpoints. Training from scratch.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
[NeMo W 2024-10-31 15:26:44 megatron_strategy:310] Could not copy Trainer's 'max_steps' to LR scheduler's 'max_steps'. If you are not using an LR scheduler, this warning can safely be ignored.
Related to NVIDIA/TransformerEngine#1130.
@jstjohn Looking into this.
Force-pushed from 2dbdb6a to 42138d2
@@ -62,7 +61,6 @@ def dummy_protein_dataset(tmp_path):
    return db_file


@pytest.mark.skip("duplicate unittest")
@pytest.fixture
def dummy_parquet_train_val_inputs(tmp_path):
Now that you're at it, you may want to consider importing these fixtures from bionemo.testing.
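A sketch of what that could look like. This assumes bionemo.testing exposes these fixtures from an importable module; the module path below is a guess, not the package's actual layout:

```python
# Hypothetical sketch: reuse shared fixtures instead of redefining them per test module.
# The module path and the availability of these fixtures in bionemo.testing are assumptions.
from bionemo.testing.data import (  # assumed module path
    dummy_parquet_train_val_inputs,
    dummy_protein_dataset,
)

# Importing the fixtures into this test module's (or conftest.py's) namespace is enough
# for pytest to resolve them, so existing tests can keep the same parameter names:
def test_dataset_fixture_is_available(dummy_protein_dataset):
    assert dummy_protein_dataset is not None
```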
@@ -17,7 +17,7 @@ RUN git clone https://github.com/NVIDIA/apex.git && \
    --config-settings "--build-option=--cpp_ext --cuda_ext --fast_layer_norm --distributed_adam --deprecated_fused_adam --group_norm"


# Transformer Engine pre-1.7.0. 1.7 standardizes the meaning of bits in the attention mask to match
ARG TE_COMMIT=7d576ed25266a17a7b651f2c12e8498f67e0baea
ARG TE_COMMIT=c27ee60ec746210bcea4ec33958dbbff06706506
Attempting to split this as a separate PR in #399
Force-pushed from e32d247 to 1667d6c
Force-pushed from 1667d6c to 48142ab
Review threads on sub-packages/bionemo-testing/src/bionemo/testing/harnesses/stop_and_go.py (outdated, resolved)
Dependent on #414.
NeMo quick patch to fix stop-and-go. We can remove the xfail and merge after NeMo's PR.
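For reference, this is the general shape of gating a test on an upstream fix with an expected-failure marker; the test name and reason string below are placeholders, not the actual marker used in this PR:

```python
import pytest


# Placeholder: remove this marker once the upstream NeMo patch is released.
@pytest.mark.xfail(
    reason="Pending upstream NeMo fix for stop-and-go checkpoint resumption",
    strict=False,  # a surprise pass should not fail the suite while the fix is in flight
)
def test_stop_and_go_resumption():
    ...
```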
Here is the NeMo long-term fix.
Issue tracking: https://gitlab-master.nvidia.com/dl/JoC/nemo-ci/-/issues/888
Force-pushed from 5285f0f to e53a5ac
Summary
- IOMixin only serializes non-default values
- ckpt_async_save in stop-and-go test
- trainer.should_stop instead of raising an exception (see the sketch after this list)
- 19766a217, which impacts training resumption
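For context on the trainer.should_stop item: a minimal sketch, assuming a PyTorch Lightning callback is used to request the stop; the class name and step threshold are illustrative, not the harness's actual code.

```python
from pytorch_lightning import Callback, LightningModule, Trainer


class StopAfterNSteps(Callback):
    """Illustrative callback: request a clean stop after `n_steps` training steps."""

    def __init__(self, n_steps: int) -> None:
        self.n_steps = n_steps

    def on_train_batch_end(self, trainer: Trainer, pl_module: LightningModule, *args, **kwargs) -> None:
        if trainer.global_step >= self.n_steps:
            # Unlike raising an exception mid-step, setting should_stop lets the Trainer
            # finish the current step, fire teardown hooks, and save checkpoints normally.
            trainer.should_stop = True
```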
Details
Megatron now checks whether a non-empty checkpoint directory exists before overwriting (commit). Our unit tests use a small validate_every_n_steps, which leads to identical monitored metric values and thus identical checkpoint directory names.
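A toy illustration of the collision (not NeMo's actual checkpoint-naming code): when the directory name embeds the monitored metric and the metric value repeats, two saves resolve to the same path.

```python
# Toy example only: real NeMo/Megatron checkpoint names include more fields than this.
def checkpoint_dirname(val_loss: float) -> str:
    return f"test_experiment--val_loss={val_loss:.4f}"


first = checkpoint_dirname(2.3456)   # metric at one validation step
second = checkpoint_dirname(2.3456)  # identical metric at a later step
assert first == second  # same directory name, so Megatron's non-empty-dir check trips
```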
NeMo updated IOMixin.io_dump to include yaml_attrs. Also, IOMixin now only serializes non-default values, due to constant version conflicts in Megatron config objects.
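For intuition on the serialization change, a minimal sketch of the "only serialize non-default values" idea; this is not NeMo's IOMixin implementation, just a comparison against a freshly constructed default config.

```python
from dataclasses import dataclass, fields

import yaml


@dataclass
class ToyConfig:
    hidden_size: int = 128
    num_layers: int = 2
    activation: str = "gelu"


def dump_non_defaults(cfg: ToyConfig) -> str:
    """Serialize only the fields whose values differ from the dataclass defaults."""
    defaults = ToyConfig()
    changed = {
        f.name: getattr(cfg, f.name)
        for f in fields(cfg)
        if getattr(cfg, f.name) != getattr(defaults, f.name)
    }
    return yaml.safe_dump(changed)


print(dump_non_defaults(ToyConfig(hidden_size=256)))  # -> "hidden_size: 256"
```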
Notes
Apparently there is unit test leakage, but only in the few test functions listed below.
Pytest errors
On NeMo/Megatron TOT
Error from test leakage