From d6a8fc6d52460cbdeaa6813b1ca61464042502a6 Mon Sep 17 00:00:00 2001
From: "Peter St. John"
Date: Wed, 8 Jan 2025 11:47:17 -0700
Subject: [PATCH] Add pre-training page for ESM-2 (#578)

---
 .../images/esm2/esm2_pretrain_convergence.svg | 1522 +++++++++++++++++
 docs/docs/models/ESM-2/SUMMARY.md             |    2 +
 docs/docs/models/{esm2.md => ESM-2/index.md}  |    5 +-
 docs/docs/models/ESM-2/pre-training.md        |  155 ++
 docs/mkdocs.yml                               |    2 +
 .../src/bionemo/core/data/resources/esm2.yaml |   27 +
 6 files changed, 1710 insertions(+), 3 deletions(-)
 create mode 100644 docs/docs/assets/images/esm2/esm2_pretrain_convergence.svg
 create mode 100644 docs/docs/models/ESM-2/SUMMARY.md
 rename docs/docs/models/{esm2.md => ESM-2/index.md} (99%)
 create mode 100644 docs/docs/models/ESM-2/pre-training.md

diff --git a/docs/docs/assets/images/esm2/esm2_pretrain_convergence.svg b/docs/docs/assets/images/esm2/esm2_pretrain_convergence.svg
new file mode 100644
index 0000000000..3d8db001da
--- /dev/null
+++ b/docs/docs/assets/images/esm2/esm2_pretrain_convergence.svg
@@ -0,0 +1,1522 @@
[1,522 lines of SVG markup omitted: ESM-2 pre-training convergence plot; Matplotlib v3.6.3 (https://matplotlib.org/), image/svg+xml, created 2025-01-07T14:20:56.875575]
diff --git a/docs/docs/models/ESM-2/SUMMARY.md b/docs/docs/models/ESM-2/SUMMARY.md
new file mode 100644
index 0000000000..cffc007da9
--- /dev/null
+++ b/docs/docs/models/ESM-2/SUMMARY.md
@@ -0,0 +1,2 @@
+- [Model Overview](index.md)
+- [Pre-trained Checkpoints](pre-training.md)
diff --git a/docs/docs/models/esm2.md b/docs/docs/models/ESM-2/index.md
similarity index 99%
rename from docs/docs/models/esm2.md
rename to docs/docs/models/ESM-2/index.md
index 789223a787..0ef474a35b 100644
--- a/docs/docs/models/esm2.md
+++ b/docs/docs/models/ESM-2/index.md
@@ -13,13 +13,14 @@ dimension of 1280. The 3B model has 36 layers, 40 attention heads, and a hidden
 These models are ready for commercial use.
 
 ### Third-Party Community Consideration
+
 This model is not owned or developed by NVIDIA. This model has been developed and built to a third-party’s
 requirements for this application and use case [1]; see link to [Non-NVIDIA Model Card for ESM-2 3B model](
 https://huggingface.co/facebook/esm2_t36_3B_UR50D) and [non-NVIDIA Model Card for ESM-2 650M model](
 https://huggingface.co/facebook/esm2_t33_650M_UR50D)
 
-
 ### References
+
 [1] Lin, Z., Akin, H., Rao, R., Hie, B., Zhu, Z., Lu, W., Smetanin, N., Verkuil, R., Kabeli, O., Shmueli, Y. and dos
 Santos Costa, A., 2023. Evolutionary-scale prediction of atomic-level protein structure with a language model.
 Science, 379(6637), pp.1123-1130.
@@ -98,7 +99,6 @@ Dataset](../datasets/uniprot.md).
 
 ESM-2 is as provided under the Apache 2.0 license.
 
-
 ## Competitive Benchmarking
 
 ### Accuracy
@@ -112,7 +112,6 @@ checkpoints is consistent with their outputs when evaluated with the HuggingFace
 | 650M | 7.001 | 7.002 | 6.95 :material-information-outline: |
 | 3B | 6.003 | 6.004 | 6.49 :material-information-outline: |
 
-
 !!! info "Different Validation Sets"
 
     The HuggingFace and converted BioNeMo2 checkpoints were evaluated on a newly curated validation set. Perplexities
diff --git a/docs/docs/models/ESM-2/pre-training.md b/docs/docs/models/ESM-2/pre-training.md
new file mode 100644
index 0000000000..37701b4a87
--- /dev/null
+++ b/docs/docs/models/ESM-2/pre-training.md
@@ -0,0 +1,155 @@
+# Pre-training ESM-2
+
+Pre-trained checkpoints for ESM-2 are available at the 8M, 650M, and 3B model sizes. These models were trained by the
+bionemo-framework team to reproduce the original training results from Lin et al., Science (2023), using more recent
+UniProt data and the bionemo training infrastructure. The full [pre-training data](../../datasets/uniprot.md) and
+train/test splits are available.
+
+## Model Convergence
+
+Validation perplexity evaluated on the NVIDIA validation set.
+
+ ![ESM-2 Pre-training Convergence](../assets/images/esm2/esm2_pretrain_convergence.svg){ width="350" } +
+ +| Model Size | Perplexity at 500k updates | +| -------------- | ------ | +| 8M | 10.26 | +| 650M | 7.14 | +| 3B | 6.42 | + +## Pre-training recipes + +=== "8M" + + ```python + esm2_8m_ckpt_path = load("esm2/nv_8m:2.0") + ``` + + ### Training Script + + | Training Parameters | Value | + | ----------------------- | ------ | + | # of GPUs | 32 | + | GPU Type | A100 | + | Batch size (per device) | 64 | + + ```bash + train_esm2 \ + --create-tensorboard-logger \ + --resume-if-exists \ + --wandb-project= \ + --save-top-k=10 \ + --train-cluster-path=/data/train_clusters.parquet \ # (1)! + --train-database-path=/data/train.db \ + --valid-cluster-path=/data/valid_clusters.parquet \ + --valid-database-path=/data/validation.db \ + --num-steps=500_000 \ + --metric-to-monitor-for-checkpoints=val_loss \ + --micro-batch-size=64 \ + --num-nodes=4 \ + --num-gpus=8 \ + --val-check-interval=10000 \ + --limit-val-batches=1.0 \ + --result-dir=/results/esm2_pretrain_8m \ + --experiment-name=esm2_pretrain_8m \ + --num-layers=6 \ + --hidden-size=320 \ + --num-attention-heads=20 \ + --ffn-hidden-size=1280; + ``` + + 1. Paths here must be mounted into the `bionemo-framework` docker image. + +=== "650M" + + ```python + esm2_650m_ckpt_path = load("esm2/nv_650m:2.1") + ``` + + ### Training Script + + | Training Parameters | Value | + | ----------------------- | ------ | + | # of GPUs | 64 | + | GPU Type | H100 | + | Batch size (per device) | 32 | + + ```bash + train_esm2 \ + --create-tensorboard-logger \ + --resume-if-exists \ + --wandb-project= \ + --save-top-k=10 \ + --train-cluster-path=/data/train_clusters.parquet \ # (1)! + --train-database-path=/data/train.db \ + --valid-cluster-path=/data/valid_clusters.parquet \ + --valid-database-path=/data/validation.db \ + --num-steps=500_000 \ + --metric-to-monitor-for-checkpoints=val_loss \ + --micro-batch-size=32 \ + --num-nodes=8 \ + --num-gpus=8 \ + --val-check-interval=10000 \ + --limit-val-batches=1.0 \ + --result-dir=/results/esm2_pretrain_650m \ + --experiment-name=esm2_pretrain_650m \ + --min-seq-length=1024 \ + --max-seq-length=1024 \ + --num-layers=33 \ + --hidden-size=1280 \ + --num-attention-heads=20 \ + --ffn-hidden-size=5120; + ``` + + 1. Paths here must be mounted into the `bionemo-framework` docker image. + +=== "3B" + + ```python + esm2_3b_ckpt_path = load("esm2/nv_3b:2.1") + ``` + + ### Training Script + + | Training Parameters | Value | + | ----------------------- | ------ | + | # of GPUs | 128 | + | GPU Type | H100 | + | Batch size (per device) | 16 | + | warmup steps | 20,000 | + + ```bash + train_esm2 \ + --create-tensorboard-logger \ + --resume-if-exists \ + --wandb-project= \ + --save-top-k=10 \ + --train-cluster-path=/data/train_clusters.parquet \ # (2)! + --train-database-path=/data/train.db \ + --valid-cluster-path=/data/valid_clusters.parquet \ + --valid-database-path=/data/validation.db \ + --num-steps=500_000 \ + --warmup-steps=20_000 \ # (1)! + --metric-to-monitor-for-checkpoints=val_loss \ + --micro-batch-size=16 \ + --num-nodes=16 \ + --num-gpus=8 \ + --val-check-interval=2500 \ + --limit-val-batches=1.0 \ + --result-dir=/results/esm2_pretrain_3b \ + --experiment-name=esm2_pretrain_3b \ + --min-seq-length=1024 \ + --max-seq-length=1024 \ + --num-layers=36 \ + --hidden-size=2560 \ + --num-attention-heads=40 \ + --ffn-hidden-size=10240; + ``` + + 1. We had to increase the number of warmup steps 10x over the published training recipe for ESM-2 3B, which was + likely trained with fp16 precision. 
This gave us an overall similar initial curve, but avoided convergence issues + at around 2,000 steps. + + 2. Paths here must be mounted into the `bionemo-framework` docker image. diff --git a/docs/mkdocs.yml b/docs/mkdocs.yml index 4b50d623bb..6adbf103cc 100644 --- a/docs/mkdocs.yml +++ b/docs/mkdocs.yml @@ -91,6 +91,8 @@ markdown_extensions: options: custom_icons: - overrides/.icons + - pymdownx.tabbed: + alternate_style: true - def_list - admonition - footnotes diff --git a/sub-packages/bionemo-core/src/bionemo/core/data/resources/esm2.yaml b/sub-packages/bionemo-core/src/bionemo/core/data/resources/esm2.yaml index bb3b4467e3..d93139d756 100644 --- a/sub-packages/bionemo-core/src/bionemo/core/data/resources/esm2.yaml +++ b/sub-packages/bionemo-core/src/bionemo/core/data/resources/esm2.yaml @@ -7,6 +7,33 @@ description: > A pretrained 650M parameter ESM2 model. See https://ngc.nvidia.com/catalog/models/nvidia:clara:esm2nv650m. +- tag: nv_3b:2.1 + ngc: "nvidia/clara/esm2nv3b:2.1" + ngc_registry: model + pbss: "s3://general-purpose/esm2/checkpoints/3b/esm2_3b_checkpoint.tar.gz" + sha256: a79327a4054bf8d1d7075e1b3c961dbc503da02d72ed15f707d9cbbd49d181b6 # pragma: allowlist secret + owner: Peter St John + description: > + An ESM-2 3B model pre-trained on NVIDIA's train/test data split. + +- tag: nv_650m:2.1 + ngc: "nvidia/clara/esm2nv650m:2.1" + ngc_registry: model + pbss: "s3://general-purpose/esm2/checkpoints/650m/esm2_650m_checkpoint.tar.gz" + sha256: b83e9b5d62f1499b443817c5cd0facd3bdd4013a51a897e05e17228bf650befe # pragma: allowlist secret + owner: Peter St John + description: > + An ESM-2 650M model pre-trained on NVIDIA's train/test data split. + +- tag: nv_8m:2.0 + ngc: "nvidia/clara/esm2nv8m:2.0" + ngc_registry: model + pbss: "s3://general-purpose/esm2/checkpoints/8m/esm2_8m_checkpoint.tar.gz" + sha256: b4ea4d52eea8a25d2c2838617ff678f0da22d384cee195b0c192686816078dcd # pragma: allowlist secret + owner: Peter St John + description: > + An ESM-2 8M model pre-trained on NVIDIA's train/test data split. + - tag: 650m:2.0 ngc: nvidia/clara/esm2nv650m:2.0 ngc_registry: model
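
A note on using the checkpoint resources registered above: the `load(...)` calls in the new pre-training page refer to the same tags added to `esm2.yaml` in this patch, prefixed with the resource file stem (for example, `load("esm2/nv_650m:2.1")` for the `nv_650m:2.1` entry). A minimal, self-contained sketch of that call follows; the `bionemo.core.data.load` import path is an assumption, since the docs snippets in this patch call `load` without showing its import.

```python
# Minimal sketch: fetch one of the checkpoints registered in esm2.yaml above.
# The import path below is an assumption; adjust it to wherever the `load`
# helper is exposed in your bionemo-framework installation.
from bionemo.core.data.load import load

# "esm2/" is the resource file stem; "nv_650m:2.1" is the tag added in this patch.
# As used in the pre-training page snippets, `load` resolves the tag and returns
# the local path to the downloaded checkpoint.
esm2_650m_ckpt_path = load("esm2/nv_650m:2.1")
print(esm2_650m_ckpt_path)
```

The same pattern applies to the `nv_8m:2.0` and `nv_3b:2.1` entries shown in the 8M and 3B tabs of the pre-training page.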