Releases: aws/sagemaker-hyperpod-recipes

Release v1.1.0

31 Dec 21:16
66e49e0

What's Changed

New recipes

  • Added support for Llama 3.1 70B and Mixtral 22B 128-node pre-training.
  • Added support for Llama 3.3 fine-tuning with SFT and LoRA.
  • Added support for Llama 405B 32K-sequence-length QLoRA fine-tuning.

All new recipes are listed under the "Model Support" section of the README.

Release v1.0.1

24 Dec 01:45
5f8b472

What's Changed

Bug fixes

  • Upgraded the Transformers library in the enroot Slurm code path to support running Llama 3.2 recipes with an enroot container.

HyperPod Enhancements

  • Added support for additional HyperPod instance types, including p5e and g6.

Release v1.0.0

07 Dec 00:52
5c66df4

We're thrilled to announce the initial release of sagemaker-hyperpod-recipes!

🎉 Features

  • Unified Job Submission: Submit training and fine-tuning workflows to SageMaker HyperPod or SageMaker training jobs through a single entry point.
  • Flexible Configuration: Customize your training jobs with three types of configuration files:
    • General configuration (e.g., recipes_collection/config.yaml)
    • Cluster configuration (e.g., recipes_collection/cluster/slurm.yaml)
    • Recipe configuration (e.g., recipes_collection/recipes/training/llama/hf_llama3_8b_seq16k_gpu_p5x16_pretrain.yaml)
  • Pre-defined LLM Recipes: Access a collection of ready-to-use recipes for training large language models.
  • Cluster Agnostic: Compatible with SageMaker HyperPod (with Slurm or Amazon EKS orchestrators) and with SageMaker training jobs.
  • Built on the NVIDIA NeMo Framework: Leverages the NVIDIA NeMo Framework Launcher for efficient job management.
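To make the three-layer configuration concrete, a general configuration might compose the other two files roughly as follows. This is a minimal sketch: the key names and composition style are assumptions inferred from the file paths above, not the exact schema shipped in the repository.

```yaml
# Illustrative sketch of recipes_collection/config.yaml (keys are assumptions).
# The general config selects which cluster and recipe configs to compose.
defaults:
  - cluster: slurm   # picks recipes_collection/cluster/slurm.yaml
  - recipes: training/llama/hf_llama3_8b_seq16k_gpu_p5x16_pretrain

# Common settings shared by all jobs, e.g. environment variables.
env_vars:
  NCCL_DEBUG: WARN
```

In a layered setup like this, swapping the `cluster` entry (for example, from a Slurm file to an EKS one) retargets the same recipe to a different orchestrator without touching the training settings.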

🗂️ Repository Structure

  • main.py: Primary entry point for submitting training jobs
  • launcher_scripts/: Collection of commonly used scripts for LLM training
  • recipes_collection/: Pre-defined LLM recipes provided by developers

🔧 Key Components

  1. General Configuration: Common settings such as default parameters and environment variables
  2. Cluster Configuration: Cluster-specific settings (e.g., volumes and labels for Kubernetes; job names for Slurm)
  3. Recipe Configuration: Training-job settings, including model type, sharding degree, and dataset paths
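As an example of the third component, a recipe configuration could carry settings of the following shape. The field names here are hypothetical, chosen only to illustrate the categories listed above (model type, sharding degree, dataset paths); consult the shipped recipes for the real schema.

```yaml
# Hypothetical recipe fragment (field names are illustrative, not the real schema)
model:
  model_type: llama_v3     # which model family to train
  shard_degree: 8          # how many devices the model is sharded across
data:
  train_dir: /fsx/datasets/example/train   # example dataset path
  val_dir: /fsx/datasets/example/val
```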

📚 Documentation

  • Refer to the README.md for detailed usage instructions and examples

🤝 Contributing

We welcome contributions to enhance the capabilities of sagemaker-hyperpod-recipes. Please refer to our contributing guidelines for more information.

Thank you for choosing sagemaker-hyperpod-recipes for your large-scale language model training needs!