- Follow steps in README.md
- Launch script in 2.2 Run this recipe for BART

  If running on AzureML,

      cd huggingface/script
      python hf-ort.py --gpu_cluster_name <gpu_cluster_name> --hf_model bart-large --run_config ort

  If running locally,

      cd huggingface/script
      python hf-ort.py --hf_model bart-large --run_config ort --process_count <process_count> --local_run

| Run configuration | PyTorch (samples/sec) | ORTModule (samples/sec) | Gain |
|---|---|---|---|
| fp16 | 338.04 | 384.61 | 13.8% |
| fp16 with deepspeed stage 1 | 417.20 | 496.59 | 19.0% |
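
The Gain column is consistent with the relative throughput improvement of ORTModule over PyTorch. The short sketch below recomputes it from the table values; the function name `gain_percent` is illustrative and not part of the recipe.

```python
def gain_percent(pytorch_sps: float, ort_sps: float) -> float:
    """Relative throughput improvement of ORTModule over PyTorch, in percent."""
    return (ort_sps / pytorch_sps - 1.0) * 100.0


# Samples/sec values copied from the table above: (PyTorch, ORTModule).
results = {
    "fp16": (338.04, 384.61),
    "fp16 with deepspeed stage 1": (417.20, 496.59),
}

for name, (pytorch_sps, ort_sps) in results.items():
    print(f"{name}: {gain_percent(pytorch_sps, ort_sps):.1f}%")
# fp16: 13.8%
# fp16 with deepspeed stage 1: 19.0%
```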
These numbers are the average samples/sec over 10 runs on ND40rs_v2 VMs (V100 32G x 8) with CUDA 11, the stable release onnxruntime_training-1.8.0%2Bcu111-cp36-cp36m-manylinux2014_x86_64.whl, and a batch size of 16. A CUDA 10.2 option is also available through the `--use_cu102` flag. Please check dependency details in the Dockerfile.

We look at the metric `stable_train_samples_per_second` in the log, which discards the first step because that step includes setup time. Also note that, since ORTModule takes some time to do its initial setup, a small `--max_steps` value may lead to a longer total run time for ORTModule than for PyTorch. However, if you want finetuning to finish faster, set `--max_steps` to a smaller value.

Lastly, we do not recommend running this recipe on NC series VMs, which use an old GPU architecture (K80).
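
To illustrate why the first step is discarded, here is a minimal sketch of a "stable" throughput calculation. It is not the recipe's actual implementation: the function names, the per-step timings, and the assumption that each step processes exactly `batch_size` samples are all hypothetical.

```python
from typing import Sequence


def overall_samples_per_second(step_seconds: Sequence[float], batch_size: int) -> float:
    """Throughput over all steps, including the slow first step."""
    return batch_size * len(step_seconds) / sum(step_seconds)


def stable_samples_per_second(
    step_seconds: Sequence[float], batch_size: int, warmup_steps: int = 1
) -> float:
    """Throughput after dropping the first `warmup_steps` steps (setup time)."""
    stable = step_seconds[warmup_steps:]
    if not stable:
        raise ValueError("need at least one step after the warmup steps")
    return batch_size * len(stable) / sum(stable)


# Hypothetical timings: the first step carries 30 s of one-time setup,
# every later step takes 0.5 s.
step_seconds = [30.0, 0.5, 0.5, 0.5, 0.5]

print(overall_samples_per_second(step_seconds, batch_size=16))  # 2.5
print(stable_samples_per_second(step_seconds, batch_size=16))   # 32.0
```

With only a few steps, the one-time setup dominates the overall figure, which is the same effect that can make a short `--max_steps` run take longer in total with ORTModule even though its steady-state throughput is higher.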