Merge branch 'main' into pstjohn/changes-for-azure-test-suite

NVIDIA · Jan 7, 2025 · b1e2e64 · b1e2e64
2 parents d6b0cec + 5eddee1
commit b1e2e64
Show file tree

Hide file tree

Showing 18 changed files with 522 additions and 101 deletions.
diff --git a/.github/pull_request_template.md b/.github/pull_request_template.md
@@ -1,45 +1,34 @@
-(**NOTE:** _**delete** these instructional lines as you fill-out this PR template_)
+### Description
+<!-- Provide a detailed description of the changes in this PR -->
 
-(**NOTE:** _template is designed to be filled-in and used as the **squashed commit message for the entire PR**. _Italicized text_ is intended to be deleted as you fill in this template. Use the text between the `---`)
+### Type of changes
+<!-- Mark the relevant option with an [x] -->
 
----
+- [ ]  Bug fix (non-breaking change which fixes an issue)
+- [ ]  New feature (non-breaking change which adds functionality)
+- [ ]  Refactor
+- [ ]  Documentation update
+- [ ]  Other (please describe):
 
-_High level summary of changes. Try to keep this as short and informative as possible: less is more._
+### CI Pipeline Configuration
+Configure CI behavior by checking relevant boxes below. This will automatically apply labels.
 
-_Describe your changes. You can be more detailed and descriptive here. If it is a code change, Be sure to answer:_
-  - _What is changing?_
-  - _What is the new or fixed functionality?_
-  - _Why or when would someone want to use these changes?_
-  - _How can someone use these changes?_
----
+- [ ] [SKIP_CI](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#skip_ci) - Skip all continuous integration tests
+- [ ] [INCLUDE_NOTEBOOKS_TESTS](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#include_notebooks_tests) - Execute notebook validation tests in pytest
 
-## Summary
-_High level summary of changes. Try to keep this as short and informative as possible: less is more._
+> [!NOTE]
+> By default, the notebooks validation tests are skipped unless explicitly enabled.
 
-## Details
-_Describe your changes. You can be more detailed and descriptive here._
-
-## Usage
-_How does a user interact with the changed code?_
+### Usage
+<!--- How does a user interact with the changed code -->
 ```python
-python -m your.new.module -and -all -options
-```
-
-## Testing
-_How do you prove that your code behaves the way you claim?_
-
-Tests for these changes can be run via:
-```shell
-pytest -v tests/your/new/or/existing/test_functions.py::test_function
+TODO: Add code snippet
 ```
 
+### Pre-submit Checklist
+<!--- Ensure all items are completed before submitting -->
 
-(**NOTE:** _also **delete** this checklist as you fill-out this PR template_)
-
-**Most of the changes** to files with extensions `*.py`, `*.yaml`, `*.yml`, `Dockerfile*` or `requirements.txt` **DO REQUIRE both `pytest-` and `jet-` CI stages**.
-
-- [ ] Did you review the [Before your PR is "Ready for review" section](https://github.com/NVIDIA/bionemo-framework/-/blob/dev/CONTRIBUTING.md?ref_type=heads#before-pr-ready) before asking for review?
-- [ ] Did you make sure your changes have tests? Did you test your changes locally?
-- [ ] Can you add [the `SKIP_CI` label](https://github.com/NVIDIA/bionemo-framework/-/blob/dev/CONTRIBUTING.md?ref_type=heads#skip-ci) to your PR?
-- [ ] Can you add [the `PYTEST_NOT_REQUIRED` label](https://github.com/NVIDIA/bionemo-framework/-/blob/dev/CONTRIBUTING.md?ref_type=heads#skip-pytest) to your PR?
-- [ ] Can you add [the `JET_NOT_REQUIRED` label](https://github.com/NVIDIA/bionemo-framework/-/blob/dev/CONTRIBUTING.md?ref_type=heads#skip-jet) to your PR?
+ - [ ] I have tested these changes locally
+ - [ ] I have updated the documentation accordingly
+ - [ ] I have added/updated tests as needed
+ - [ ] All existing tests pass successfully
diff --git a/.github/workflows/pr-labels.yml b/.github/workflows/pr-labels.yml
@@ -0,0 +1,48 @@
+name: PR Label Management
+
+on:
+  pull_request:
+    types: [opened, edited, synchronize]
+
+jobs:
+  manage-labels:
+    runs-on: ubuntu-latest
+    permissions:
+      pull-requests: write
+
+    steps:
+      - name: Check PR body and manage labels
+        uses: actions/github-script@v6
+        with:
+          script: |
+            const prBody = context.payload.pull_request.body;
+
+            const labelChecks = {
+              'SKIP_CI': /\[x\]\s*\[SKIP_CI\]/i,
+              'INCLUDE_NOTEBOOKS_TESTS': /\[x\]\s*\[INCLUDE_NOTEBOOKS_TESTS\]/i
+            };
+
+            const currentLabels = new Set(
+              context.payload.pull_request.labels.map(label => label.name)
+            );
+
+            for (const [label, pattern] of Object.entries(labelChecks)) {
+              const shouldHaveLabel = pattern.test(prBody);
+              const hasLabel = currentLabels.has(label);
+
+              if (shouldHaveLabel && !hasLabel) {
+                await github.rest.issues.addLabels({
+                  owner: context.repo.owner,
+                  repo: context.repo.repo,
+                  issue_number: context.payload.pull_request.number,
+                  labels: [label]
+                });
+              } else if (!shouldHaveLabel && hasLabel) {
+                await github.rest.issues.removeLabel({
+                  owner: context.repo.owner,
+                  repo: context.repo.repo,
+                  issue_number: context.payload.pull_request.number,
+                  name: label
+                });
+              }
+            }
diff --git a/ci/benchmarks/partial-conv/esm2_pretrain.yaml b/ci/benchmarks/partial-conv/esm2_pretrain.yaml
@@ -0,0 +1,54 @@
+scope: partial-conv
+time_limit: 14400
+script_args:
+  # All arguments referenced in the script string must be specified here.
+  # Arguments not referenced in the script string must have the 'arg' field specified.
+  # See jet/core/configs.py for the specification of the configuration class
+  workspace:
+    value: /workspace/bionemo2
+    key_segment: False
+  data_path:
+    value: /data/20240809_uniref_2024_03/data
+    key_segment: False
+  model:
+    value: esm2
+  variant:
+    value: train
+  config_name:
+    value: 650M
+  precision:
+    value: [bf16-mixed]
+  nodes:
+    value: [4]
+  gpus:
+    value: 8
+  batch_size:
+    value: 16
+  max_steps:
+    value: 26500
+script: |-
+  WANDB_API_KEY=$BIONEMO_WANDB_API_KEY ${variant}_${model} \
+    --train-cluster-path=${data_path}/train_clusters.parquet \
+    --train-database-path=${data_path}/train.db \
+    --valid-cluster-path=${data_path}/valid_clusters.parquet \
+    --valid-database-path=${data_path}/validation.db \
+    --micro-batch-size=${batch_size} \
+    --num-nodes=${nodes} \
+    --num-gpus=${gpus} \
+    --val-check-interval=1000 \
+    --limit-val-batches=1 \
+    --num-steps=${max_steps} \
+    --min-seq-length=1024 \
+    --max-seq-length=1024 \
+    --num-layers=33 \
+    --hidden-size=1280 \
+    --num-attention-heads=20 \
+    --ffn-hidden-size=5120 \
+    --create-tensorboard-logger \
+    --experiment-name=${batch_size}bs_${nodes}node_${gpus}gpu_${max_steps}s_${precision}prec \
+    --result-dir=${tensorboard_dir} \
+    --wandb-project=${wandb_project_name} \
+    --wandb-group=${model}_${variant}_${config_name} \
+    --wandb-job-type=${pipeline_label} \
+    --log-every-n-steps=50 \
+    --disable-checkpointing;
diff --git a/ci/benchmarks/perf/esm2_pretrain.yaml b/ci/benchmarks/perf/esm2_pretrain.yaml
@@ -0,0 +1,65 @@
+scope: perf
+time_limit: 1800
+script_args:
+  # All arguments referenced in the script string must be specified here.
+  # Arguments not referenced in the script string must have the 'arg' field specified.
+  # See jet/core/configs.py for the specification of the configuration class
+  workspace:
+    value: /workspace/bionemo2
+    key_segment: False
+  data_path:
+    value: /data/20240809_uniref_2024_03/data
+    key_segment: False
+  model: esm2
+  variant: train
+  config_name: 650M
+  precision: bf16-mixed
+  max_steps: 200
+  gpus: 8
+  acc_grad: 1
+  products:
+    - nodes: 1
+      batch_size: 16
+      pp: 1
+      tp: 1
+    - nodes: 2
+      batch_size: 16
+      pp: 2
+      tp: 1
+    - nodes: 2
+      batch_size: 16
+      pp: 1
+      tp: 2
+    - nodes: 2
+      batch_size: 16
+      pp: 1
+      tp: 1
+script: |-
+  WANDB_API_KEY=$BIONEMO_WANDB_API_KEY ${variant}_${model} \
+    --train-cluster-path=${data_path}/train_clusters.parquet \
+    --train-database-path=${data_path}/train.db \
+    --valid-cluster-path=${data_path}/valid_clusters.parquet \
+    --valid-database-path=${data_path}/validation.db \
+    --micro-batch-size=${batch_size} \
+    --num-nodes=${nodes} \
+    --num-gpus=${gpus} \
+    --val-check-interval=50 \
+    --limit-val-batches=1 \
+    --num-steps=${max_steps} \
+    --min-seq-length=1024 \
+    --max-seq-length=1024 \
+    --num-layers=33 \
+    --hidden-size=1280 \
+    --num-attention-heads=20 \
+    --ffn-hidden-size=5120 \
+    --create-tensorboard-logger \
+    --experiment-name=${batch_size}bs_${nodes}node_${gpus}gpu_${max_steps}s_${precision}prec \
+    --result-dir=${tensorboard_dir} \
+    --wandb-project=${wandb_project_name} \
+    --wandb-group=${model}_${variant}_${config_name} \
+    --wandb-job-type=${pipeline_label} \
+    --log-every-n-steps=10 \
+    --accumulate-grad-batches=${acc_grad} \
+    --pipeline-model-parallel-size=${pp} \
+    --tensor-model-parallel-size={tp} \
+    --disable-checkpointing;
diff --git a/ci/scripts/run_pytest.sh b/ci/scripts/run_pytest.sh
@@ -15,30 +15,107 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-set -xueo pipefail
-export PYTHONDONTWRITEBYTECODE=1
-# NOTE: if a non-nvidia user wants to run the test suite, just run `export BIONEMO_DATA_SOURCE=ngc` prior to this call.
-export BIONEMO_DATA_SOURCE="${BIONEMO_DATA_SOURCE:-pbss}"
-# flexible GPU memory management, reducing the risk of fragmentation-related CUDA OOM
-export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
-source "$(dirname "$0")/utils.sh"
-
-if ! set_bionemo_home; then
-    exit 1
-fi
 
-python -m coverage erase
+# Enable strict mode with better error handling
+set -ueo pipefail
+
+# Function to display usage information
+usage() {
+    cat << EOF
+Usage: $(basename "$0") [OPTIONS]
+
+Options:
+    --skip-docs    Skip running tests in the docs directory
+    --no-nbval     Skip jupyter notebook validation tests
+
+Note: Documentation tests (docs/) are only run when notebook validation
+      is enabled (--no-nbval not set) and docs are not skipped
+      (--skip-docs not set)
+    -h, --help     Display this help message
+EOF
+    exit "${1:-0}"
+}
+
+# Function to handle cleanup on script exit
+cleanup() {
+    local exit_code=$?
+    [ -n "${coverage_files[*]:-}" ] && rm -f "${coverage_files[@]:-}"
+    exit "$exit_code"
+}
 
+# Set default environment variables
+: "${BIONEMO_DATA_SOURCE:=pbss}"
+: "${PYTHONDONTWRITEBYTECODE:=1}"
+: "${PYTORCH_CUDA_ALLOC_CONF:=expandable_segments:True}"
+
+# Export necessary environment variables
+export BIONEMO_DATA_SOURCE PYTHONDONTWRITEBYTECODE PYTORCH_CUDA_ALLOC_CONF
+
+# Initialize variables
+declare -a coverage_files
+SKIP_DOCS=false
+NO_NBVAL=false
 error=false
-for dir in docs/ ./sub-packages/bionemo-*/; do
+
+# Parse command line arguments
+while (( $# > 0 )); do
+    case "$1" in
+        --skip-docs) SKIP_DOCS=true ;;
+        --no-nbval) NO_NBVAL=true ;;
+        -h|--help) usage ;;
+        *) echo "Unknown option: $1" >&2; usage 1 ;;
+    esac
+    shift
+done
+
+# Set up trap for cleanup
+trap cleanup EXIT
+
+# Source utility functions
+SCRIPT_DIR="$(dirname "$(readlink -f "$0")")"
+source "$SCRIPT_DIR/utils.sh" || { echo "Failed to source utils.sh" >&2; exit 1; }
+
+# Set up BioNeMo home directory
+set_bionemo_home || exit 1
+
+# Clear previous coverage data
+python -m coverage erase
+
+# Set up pytest options
+PYTEST_OPTIONS=(
+    -v
+    --durations=0
+    --durations-min=60.0
+)
+[[ "$NO_NBVAL" != true ]] && PYTEST_OPTIONS+=(--nbval-lax)
+
+# Define test directories
+TEST_DIRS=(./sub-packages/bionemo-*/)
+if [[ "$NO_NBVAL" != true && "$SKIP_DOCS" != true ]]; then
+    TEST_DIRS+=(docs/)
+fi
+
+echo "Test directories: ${TEST_DIRS[*]}"
+
+# Run tests with coverage
+for dir in "${TEST_DIRS[@]}"; do
     echo "Running pytest in $dir"
-    python -m coverage run --parallel-mode --source=bionemo \
-    -m pytest -v --nbval-lax --durations=0 --durations-min=60.0 "$dir" || error=true
+    coverage_file=".coverage.${dir//\//_}"
+    coverage_files+=("$coverage_file")
+
+    if ! python -m coverage run \
+        --parallel-mode \
+        --source=bionemo \
+        --data-file="$coverage_file" \
+        -m pytest "${PYTEST_OPTIONS[@]}" "$dir"; then
+        error=true
+    fi
 done
 
+# Combine and report coverage
 python -m coverage combine
 python -m coverage report --show-missing
 
-if [ "$error" = true ]; then
-    exit 1
-fi
+# Exit with appropriate status
+$error && exit 1
+exit 0
diff --git a/docs/docs/docs.todo b/docs/docs/docs.todo
diff --git a/docs/docs/user-guide/contributing/contributing.md b/docs/docs/user-guide/contributing/contributing.md
@@ -64,7 +64,7 @@ repository (unless external constraints prevent it).
 
 ## Pull Request (PR) Guidelines
 
-### Labeling Your PR
+### Labeling Your PR as External Contributor
 
 If you are an external contributor (not an NVIDIA employee), please add the `contribution` label to your PR before submitting. Labels can be accessed in the right sidebar of the GitHub user interface when creating or editing a PR.
 
@@ -135,6 +135,24 @@ If you are an external contributor (not an NVIDIA employee), please add the `con
     redistributed consistent with this project or the open source license(s) involved.
     ```
 
+### CI Pipeline Configuration Controls
+
+CI pipeline behavior can be controlled via checkboxes in PR descriptions to optimize test execution:
+
+Key behaviors:
+- Controls processed automatically on PR submit/update
+- Labels applied based on checkbox status
+- Invalid combinations default to most restrictive option
+
+#### **SKIP_CI**
+ - Skips entire CI pipeline
+ - Use for documentation typos, README updates
+
+#### **INCLUDE_NOTEBOOKS_TESTS**
+ - Enables notebook validation tests
+ - Use when modifying notebooks or notebook-related code
+ - Disabled by default
+
 ### Developer workflows:
 
 You should always carefully test your changes. Run `pytest ...` in your container locally. All tests are done via `pytest`.