Releases · aws/sagemaker-hyperpod-recipes
Release v1.1.0
What's Changed
New recipes
- Added support for Llama 3.1 70B and Mixtral 22B 128-node pre-training.
- Added support for Llama 3.3 fine-tuning with SFT and LoRA.
- Added support for Llama 405B QLoRA fine-tuning with a 32k sequence length.
All new recipes are listed under the "Model Support" section of the README.
Release v1.0.1
What's Changed
Bug fixes
- Upgraded the Transformers library in the enroot Slurm code path to support running Llama 3.2 recipes with an enroot container
HyperPod Enhancements
- Added support for additional HyperPod instance types, including p5e and g6
Release v1.0.0
We're thrilled to announce the initial release of sagemaker-hyperpod-recipes!
🎉 Features
- Unified Job Submission: Submit training and fine-tuning workflows to SageMaker HyperPod or SageMaker training jobs using a single entry point
- Flexible Configuration: Customize your training jobs with three types of configuration files (see the sketch after this list):
  - General Configuration (e.g., recipes_collection/config.yaml)
  - Cluster Configuration (e.g., recipes_collection/cluster/slurm.yaml)
  - Recipe Configuration (e.g., recipes_collection/recipes/training/llama/hf_llama3_8b_seq16k_gpu_p5x16_pretrain.yaml)
- Pre-defined LLM Recipes: Access a collection of ready-to-use recipes for training Large Language Models
- Cluster Agnostic: Compatible with SageMaker HyperPod (with Slurm or Amazon EKS orchestrators) and SageMaker training jobs
- Built on the NVIDIA NeMo Framework: Leverages the NVIDIA NeMo Framework Launcher for efficient job management
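
To make the three configuration layers concrete, here is a minimal sketch of how a general configuration could compose a cluster file and a recipe file, assuming a Hydra-style layout like the NeMo Framework Launcher's. The key names (defaults, cluster, recipes, base_results_dir) are illustrative assumptions, not a verbatim copy of the shipped recipes_collection/config.yaml:

```yaml
# Hypothetical general configuration (config.yaml) sketch.
# Assumes Hydra-style composition; the shipped file's keys may differ.
defaults:
  - cluster: slurm  # assumed: selects recipes_collection/cluster/slurm.yaml
  - recipes: training/llama/hf_llama3_8b_seq16k_gpu_p5x16_pretrain  # assumed: selects a recipe
  - _self_

base_results_dir: /path/to/results  # assumed: where job outputs are written
```

Command-line overrides in this composition style are what let a single entry point target different clusters and recipes without editing files.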
🗂️ Repository Structure
- main.py: Primary entry point for submitting training jobs
- launcher_scripts/: Collection of commonly used scripts for LLM training
- recipes_collection/: Pre-defined LLM recipes provided by developers
🔧 Key Components
- General Configuration: Common settings such as default parameters and environment variables
- Cluster Configuration: Cluster-specific settings (e.g., volumes and labels for Kubernetes; job names for Slurm)
- Recipe Configuration: Training job settings, including model type, sharding degree, and dataset paths (sketched below)
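
As a concrete illustration of these components, a recipe configuration might group its settings as below. This is a hedged sketch only: the field names (run, trainer, model, shard_degree, train_dir) are assumptions chosen to mirror the components named above, not the exact schema of the shipped recipes:

```yaml
# Hypothetical recipe configuration sketch; field names are illustrative
# and may not match the shipped recipes_collection/recipes/*.yaml schema.
run:
  name: hf-llama3-8b-pretrain  # assumed: job name used by the orchestrator
trainer:
  num_nodes: 16                # assumed: number of instances for the run
  devices: 8                   # assumed: accelerators per instance
model:
  model_type: llama_v3         # assumed: model type selector
  shard_degree: 8              # assumed: sharding degree for model parallelism
  data:
    train_dir: /fsx/data/train # assumed: training dataset path
    val_dir: /fsx/data/val     # assumed: validation dataset path
```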
📚 Documentation
- Refer to the README.md for detailed usage instructions and examples
🤝 Contributing
We welcome contributions to enhance the capabilities of sagemaker-hyperpod-recipes. Please refer to our contributing guidelines for more information.
Thank you for choosing sagemaker-hyperpod-recipes for your large-scale language model training needs!