Compact Automated Reproducible Assessment of Machine Learning (CARAML) is a benchmark to assess mainstream Computer Vision and Natural Language Processing workloads on novel accelerators. It is developed and tested on systems of the Jülich Supercomputing Centre (JSC).
The CARAML benchmark is automated and made compact with the help of JUBE, a scripting-based framework to easily create benchmark sets, run them on different computer systems and evaluate the results. Additionally, the benchmarks are supplemented with a power/energy measurement feature using jpwr.
With the usage of JUBE, CARAML provides an easy and reproducible way to benchmark different systems and model configurations with minimal effort.
CARAML has been tested on the JURECA-DC EVALUATION PLATFORM, JURECA-DC, JEDI and WEST-AI Nodes. These include the accelerators:
- AMD MI200 node with 4 × MI250 GPUs (tag: `MI250`)
- Graphcore IPU-POD4 M2000 with 4 × GC200 IPUs (tag: `GC200`)
- NVIDIA Ampere node (SXM) with 4 × A100 GPUs (tag: `A100`)
- NVIDIA Hopper node (PCIe) with 4 × H100 GPUs (tag: `H100`)
- NVIDIA Hopper node (NVLink) with 4 × H100 GPUs (tag: `WAIH100`)
- NVIDIA Grace-Hopper chip with 1 × GH200 GPU (tag: `GH200`)
- NVIDIA Grace-Hopper node with 4 × GH200 GPUs (tag: `JEDI`)
CARAML currently offers two benchmarks written in Python:

- Computer Vision: ResNet50 benchmark implemented in TensorFlow, curated from forked versions of
  - tensorflow/benchmarks for NVIDIA and AMD
  - graphcore/examples for Graphcore
- GPT Language Model: LLM training implemented in PyTorch, curated from
  - Megatron-LM with commit: `f7727433293427bef04858f67b2889fe9b177d88` and patch applied for NVIDIA
  - Megatron-LM-ROCm with commit: `21045b59127cd2d5509f1ca27d81fae7b485bd22` and patch applied for AMD
  - graphcore/examples forked version for Graphcore
  - Megatron-LM with commit:
To run the benchmarks, JUBE must be installed; refer to the JUBE Installation Documentation. The containers are deployed using Apptainer images and SLURM on the accelerators.
For ResNet50, either download the ImageNet LSVRC 2012 dataset from the source or via Kaggle (disk space required: 144 GB), or use the `synthetic` tag with JUBE to run the benchmark with synthetic data.
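If the Kaggle route is chosen, the download can also be scripted. The following is a minimal sketch using the Kaggle Python API, assuming the `kaggle` package is installed and API credentials are configured in `~/.kaggle/kaggle.json`; the competition slug and target path are illustrative assumptions, not part of CARAML:

```python
# Sketch: fetch the ImageNet LSVRC 2012 archive via the Kaggle Python API.
# The competition slug and download path are assumptions for illustration.
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()  # reads ~/.kaggle/kaggle.json
api.competition_download_files(
    "imagenet-object-localization-challenge",  # assumed Kaggle slug for ILSVRC 2012
    path="datasets/imagenet",                  # illustrative target directory (~144 GB needed)
    quiet=False,
)
```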
For LLM training, a subset (790 samples, 10 MB) of the small version of the OSCAR dataset, pre-processed with the GPT-2 tokenizer, is provided in llm_data.
The JUBE file resnet50_benchmark.xml sets up the environment by

- Pulling TensorFlow containers and `pip` installing additional packages using the get_tensorflow_container.sh file
- Cloning:
  - tf_cnn_benchmarks (forked version) for NVIDIA & AMD
  - examples (forked version) for Graphcore
The performance is measured in images/sec and the energy is reported in Wh.
The JUBE files llm_benchmark_nvidia_amd.yaml and llm_benchmark_ipu.yaml set up the environment by

- Pulling PyTorch containers and `pip` installing additional packages using the get_pytorch_container.sh file
- Cloning:
  - Megatron-LM with commit: `f7727433293427bef04858f67b2889fe9b177d88` and applying the patch using the setup_llm.sh file for NVIDIA
  - Megatron-LM-ROCm with commit: `21045b59127cd2d5509f1ca27d81fae7b485bd22` and applying the patch using the setup_llm_amd.sh file for AMD
  - examples (forked version) for Graphcore
  - Megatron-LM with commit:
The performance is measured in tokens/sec and the energy is reported in Wh.
Clone this repository and `cd` into it:

```bash
git clone https://github.com/FZJ-JSC/CARAML.git
cd CARAML
```
Set the required system and model parameters and the path to the downloaded ImageNet data in resnet50_benchmark.xml.
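To see which parameters the file exposes before editing it, the parameter sets can be listed with a few lines of Python. This is only a convenience sketch; it assumes the standard JUBE XML layout of `<parameterset>`/`<parameter>` elements and that it is run from the repository root:

```python
# Sketch: list parameter sets and parameter names defined in the JUBE file,
# to see what can be adjusted before a run (assumes execution from the repo root).
import xml.etree.ElementTree as ET

tree = ET.parse("resnet50/resnet50_benchmark.xml")
for pset in tree.getroot().iter("parameterset"):
    names = [p.get("name") for p in pset.iter("parameter")]
    print(f"{pset.get('name')}: {names}")
```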
- To pull the required container, use the `container` tag:
  - NVIDIA A100 and H100 GPUs

    ```bash
    jube run resnet50/resnet50_benchmark.xml --tag container H100
    ```

  - NVIDIA GH200 and JEDI GPUs

    ```bash
    jube run resnet50/resnet50_benchmark.xml --tag container GH200
    ```

  - AMD MI250

    ```bash
    jube run resnet50/resnet50_benchmark.xml --tag container MI250
    ```

  - Graphcore GC200

    ```bash
    jube run resnet50/resnet50_benchmark.xml --tag container GC200
    ```
- To run the benchmark with the defined configurations, do

  ```bash
  jube run resnet50/resnet50_benchmark.xml --tag A100
  ```

  OR, with synthetic data,

  ```bash
  jube run resnet50/resnet50_benchmark.xml --tag A100 synthetic
  ```

  `A100` can be replaced with `H100`, `WAIH100`, `GH200`, `JEDI`, `MI250` and `GC200` for the respective systems.
- After the benchmark has been executed, use `jube continue` to postprocess the results:

  ```bash
  jube continue resnet50/resnet50_benchmark_run -i last
  ```
- After the postprocessing, to get the result do

  ```bash
  jube result resnet50/resnet50_benchmark_run -i last
  ```
- Example result:

  ```
  JobID,System,Version,Queue,Runtime(s),Model,Dataset,Nodes,Devices,Tasks/Node,Threads/Task,GlobalBatchSize,BatchSize/Device,Images/second,EnergyFile
  13077565,MI250,2024.01,dc-mi200,54.71,resnet50_v2,ImageNet,1,8,8,4,64,8,2107.00,CARAML/resnet50/resnet50_benchmark_run/000004/000002_combine_energy/work/combined_energy.csv

  JobID,System,Version,Queue,Runtime(s),Model,Dataset,Nodes,Devices,Tasks/Node,Threads/Task,GlobalBatchSize,BatchSize/Device,Images/second,EnergyFile
  13082568,GC200,2024.01,dc-ipu,1.0,resnet50_mlperf_pod4_bs20,ImageNet,1,4,1,12,32,8,3556.18,CARAML/resnet50/resnet50_benchmark_run/000000/000000_execute/work/GC200_power.0.energy.csv

  JobID,System,Version,Queue,Runtime(s),Model,Dataset,Nodes,Devices,Tasks/Node,Threads/Task,GlobalBatchSize,BatchSize/Device,Images/second,EnergyFile
  13080521,H100,2024.01,dc-h100,89.67,resnet50_v2,ImageNet,1,4,4,4,32,8,1994.69,CARAML/resnet50/resnet50_benchmark_run/000000/000001_combine_energy/work/combined_energy.csv
  ```
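The result lines can also be consumed programmatically, for example to compare runs across systems. Below is a minimal sketch, assuming the output above was saved to a file named `resnet50_result.csv` (the filename is an assumption) and that `Devices` is the total device count; the measured energy values themselves live in the files listed in the `EnergyFile` column:

```python
# Sketch: parse saved CARAML ResNet50 result lines and derive per-device throughput.
# Filename and the interpretation of "Devices" as a total count are assumptions.
import csv

with open("resnet50_result.csv", newline="") as f:
    for row in csv.DictReader(f):
        if row["JobID"] == "JobID":  # skip repeated header lines
            continue
        total = float(row["Images/second"])
        devices = int(row["Devices"])
        print(f"{row['System']}: {total:.1f} images/s total, "
              f"{total / devices:.1f} images/s per device "
              f"(energy log: {row['EnergyFile']})")
```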
Set the required system and model parameters in llm_benchmark_nvidia_amd.yaml for NVIDIA and AMD devices and in llm_benchmark_ipu.yaml for Graphcore.
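Analogous to the XML sketch above, the parameter sets defined in the YAML files can be listed before editing. This is a sketch only; it assumes PyYAML is available and deliberately walks the document instead of relying on the exact nesting of JUBE's YAML schema:

```python
# Sketch: list the parameter sets defined in a JUBE YAML benchmark file
# (assumes PyYAML; traverses the document rather than assuming a fixed nesting).
import yaml

def find_parametersets(node):
    """Yield any 'parameterset' entries regardless of nesting depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "parameterset":
                yield from (value if isinstance(value, list) else [value])
            else:
                yield from find_parametersets(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_parametersets(item)

with open("llm_training/llm_benchmark_nvidia_amd.yaml") as f:
    spec = yaml.safe_load(f)

for pset in find_parametersets(spec):
    print(pset.get("name"))
```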
- To pull the required container and build the packages, use the `container` tag:
  - NVIDIA A100 and H100 GPUs

    ```bash
    jube run llm_training/llm_benchmark_nvidia_amd.yaml --tag container H100
    ```

  - NVIDIA GH200 and JEDI GPUs

    ```bash
    jube run llm_training/llm_benchmark_nvidia_amd.yaml --tag container GH200
    ```

  - AMD MI250

    ```bash
    jube run llm_training/llm_benchmark_nvidia_amd.yaml --tag container MI250
    ```

  - Graphcore GC200

    ```bash
    jube run llm_training/llm_benchmark_ipu.yaml --tag container
    ```
- To run the benchmark with the defined configurations for the `800M` GPT model with OSCAR data, do:

  ```bash
  jube run llm_training/llm_benchmark_nvidia_amd.yaml --tag 800M A100
  ```

  `A100` can be replaced with `H100`, `WAIH100`, `GH200`, `JEDI` and `MI250` for the respective systems, and `800M` can be replaced with `13B` and `175B` for systems with more node resources like `JEDI`, `H100`, `A100` and `MI250`.
To run the benchmark with defined configurations for
117M
GPT model on Graphcore with synthetic data dojube run llm_training/llm_benchmark_ipu.yaml --tag 117M synthetic
If tag
synthetic
is not given, the benchmark will use OSCAR data -
- After the benchmark has been executed, use `jube continue` to postprocess the results:

  ```bash
  jube continue llm_training/llm_benchmark_nvidia_amd_run -i last
  ```

  OR

  ```bash
  jube continue llm_training/llm_benchmark_ipu_run -i last
  ```
- After the postprocessing, to get the result do

  ```bash
  jube result llm_training/llm_benchmark_nvidia_amd_run -i last
  ```

  OR

  ```bash
  jube result llm_training/llm_benchmark_ipu_run -i last
  ```
- Example result:

  ```
  JobID,System,Version,Queue,JobTime,Runtime(min),Model,ModelSize,Dataset,Nodes,Devices,GlobalBatchSize,PipelineParallel,TensorParallel,DataParallel,Iterations,Time/iteration(s),Tokens/second,Avg_TFLOPs/GPU,EnergyFile
  13077019,MI250,2024.01,dc-mi200,00:15:00,10,GPT,800M,OSCAR,1,8,32,1,1,8,750,0.74,88620.76,69.35,CARAML/llm_training/llm_benchmark_nvidia_amd_run/000006/000002_combine_energy/work/combined_energy.csv

  JobID,System,Version,Queue,JobTime,Model,ModelSize,Dataset,Nodes,Devices,DataParallel,IPU/replica,GlobalBatchSize,Time/iteration(s),Tokens/second,EnergyFile
  13011841,GC200,2024.01,dc-ipu,00:40:00,GPT,117M,Synthetic,1,4,1,4,2048,11.17,183.37,CARAML/llm_training/llm_benchmark_ipu_run/000003/000002_combine_energy/work/combined_energy.csv

  JobID,System,Version,Queue,JobTime,Runtime(min),Model,ModelSize,Dataset,Nodes,Devices,GlobalBatchSize,PipelineParallel,TensorParallel,DataParallel,Iterations,Time/iteration(s),Tokens/second,Avg_TFLOPs/GPU,EnergyFile
  3914,JEDI,2024.01,all,00:34:00,30,GPT,800M,OSCAR,1,4,2048,1,1,4,25,26.52,158152.80,321.65,CARAML/llm_training/llm_benchmark_nvidia_amd_run/000025/000002_combine_energy/work/combined_energy.csv
  ```
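As a sanity check, the reported throughput follows from the other columns as Tokens/second ≈ GlobalBatchSize × sequence length / Time per iteration. Plugging in the JEDI row suggests a sequence length of about 2048 tokens; this is inferred from the example numbers above, not a value documented here:

```python
# Quick consistency check of the JEDI example row (illustration only).
# The sequence length of 2048 is inferred from the reported numbers.
global_batch_size = 2048    # GlobalBatchSize column
seq_length = 2048           # inferred, see note above
time_per_iteration = 26.52  # Time/iteration(s) column
tokens_per_second = global_batch_size * seq_length / time_per_iteration
print(round(tokens_per_second))  # 158156, close to the reported 158152.80
```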
In order to use the PyTorch `torchrun` API on JSC systems, the fixed_torch_run.py fix is required. The fix solves the issue described here. Additionally, the hostname is appended with an `i` to allow communication over InfiniBand, as described here.
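For illustration, the InfiniBand adjustment amounts to appending an `i` to the node hostname used as the rendezvous address. The snippet below is a sketch of that idea, not the exact code used by the benchmark scripts; the use of `MASTER_ADDR` and the fallback to the local hostname are assumptions:

```python
# Sketch of the InfiniBand hostname adjustment on JSC systems: "<hostname>i"
# resolves to the node's InfiniBand interface. Illustration only; the launch
# scripts in this repository implement their own variant of this.
import os
import socket

master_host = os.environ.get("MASTER_ADDR", socket.gethostname())
os.environ["MASTER_ADDR"] = master_host + "i"  # append "i" -> InfiniBand-resolvable name
print("rendezvous address:", os.environ["MASTER_ADDR"])
```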