Compact Automated Reproducible Assessment of Machine Learning (CARAML) is a benchmark to assess mainstream Computer Vision and Natural Language Processing workloads on novel accelerators. It is developed and tested on systems of the Jülich Supercomputing Centre (JSC).
The CARAML benchmark is automated and made compact with the help of JUBE, a scripting-based framework to easily create benchmark sets, run them on different computer systems and evaluate the results. Additionally, the benchmarks are supplemented with a power/energy measurement feature using jpwr.
With the usage of JUBE, CARAML provides an easy and reproducible way to benchmark different systems and model configurations with minimal effort.
CARAML has been tested on the JURECA-DC EVALUATION PLATFORM, JURECA-DC, JEDI and WEST-AI Nodes. These include the accelerators:
- AMD MI200 node with 4 × MI250 GPUs (tag: `MI250`)
- Graphcore IPU-POD4 M2000 with 4 × GC200 IPUs (tag: `GC200`)
- NVIDIA Ampere node (SXM) with 4 × A100 GPUs (tag: `A100`)
- NVIDIA Hopper node (PCIe) with 4 × H100 GPUs (tag: `H100`)
- NVIDIA Hopper node (NVLink) with 4 × H100 GPUs (tag: `WAIH100`)
- NVIDIA Grace-Hopper chip with 1 × GH200 GPU (tag: `GH200`)
- NVIDIA Grace-Hopper node with 4 × GH200 GPUs (tag: `JEDI`)
CARAML currently offers two benchmarks written in Python:

- Computer Vision: ResNet50 benchmark implemented in TensorFlow, curated from forked versions of
  - tensorflow/benchmarks for NVIDIA and AMD
  - graphcore/examples for Graphcore
- GPT Language Model: LLM training implemented in PyTorch, curated from
  - Megatron-LM with commit: `f7727433293427bef04858f67b2889fe9b177d88` and patch applied for NVIDIA
  - Megatron-LM-ROCm with commit: `21045b59127cd2d5509f1ca27d81fae7b485bd22` and patch applied for AMD
  - graphcore/examples forked version for Graphcore
  - Megatron-LM with commit:
To run the benchmarks, JUBE must be installed; refer to the JUBE Installation Documentation. The containers are deployed using Apptainer images and SLURM on the accelerators.
For ResNet50, either download the ImageNet LSVRC 2012 dataset from the source or via Kaggle (disk space required: 144 GB), or use the `synthetic` tag with JUBE to run the benchmark with synthetic data.
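If the Kaggle route is chosen, the download can also be scripted. The following is a minimal sketch using the Kaggle Python API, assuming the `kaggle` package is installed and API credentials are configured in `~/.kaggle/kaggle.json`; the competition slug and target path are illustrative assumptions, not part of CARAML:

```python
# Sketch: fetch the ImageNet LSVRC 2012 archive via the Kaggle Python API.
# The competition slug and download path are assumptions for illustration.
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()  # reads ~/.kaggle/kaggle.json
api.competition_download_files(
    "imagenet-object-localization-challenge",  # assumed Kaggle slug for ILSVRC 2012
    path="datasets/imagenet",                  # illustrative target directory (~144 GB needed)
    quiet=False,
)
```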
For LLM training, a subset (790 samples, 10 MB) of the small version of the OSCAR dataset, pre-processed with the GPT-2 tokenizer, is provided in llm_data.
The JUBE file resnet50_benchmark.xml sets up the environment by

- Pulling TensorFlow containers and `pip` installing additional packages using the get_tensorflow_container.sh file
- Cloning:
  - tf_cnn_benchmarks (forked version) for NVIDIA & AMD
  - examples (forked version) for Graphcore
The performance is measured in images/sec and the energy is reported in Wh.
The JUBE files llm_benchmark_nvidia_amd.yaml and llm_benchmark_ipu.yaml set up the environment by

- Pulling PyTorch containers and `pip` installing additional packages using the get_pytorch_container.sh file
- Cloning:
  - Megatron-LM with commit: `f7727433293427bef04858f67b2889fe9b177d88` and applying the patch using the setup_llm.sh file for NVIDIA
  - Megatron-LM-ROCm with commit: `21045b59127cd2d5509f1ca27d81fae7b485bd22` and applying the patch using the setup_llm_amd.sh file for AMD
  - examples (forked version) for Graphcore
  - Megatron-LM with commit:
The performance is measured in tokens/sec and the energy is reported in Wh.
Clone this repository and `cd` into it:

```bash
git clone https://github.com/FZJ-JSC/CARAML.git
cd CARAML
```
Set the required system and model parameters and the path to the downloaded ImageNet data in resnet50_benchmark.xml.
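To see which parameters the file exposes before editing it, the parameter sets can be listed with a few lines of Python. This is only a convenience sketch; it assumes the standard JUBE XML layout of `<parameterset>`/`<parameter>` elements and that it is run from the repository root:

```python
# Sketch: list parameter sets and parameter names defined in the JUBE file,
# to see what can be adjusted before a run (assumes execution from the repo root).
import xml.etree.ElementTree as ET

tree = ET.parse("resnet50/resnet50_benchmark.xml")
for pset in tree.getroot().iter("parameterset"):
    names = [p.get("name") for p in pset.iter("parameter")]
    print(f"{pset.get('name')}: {names}")
```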
- To pull the required container, use the `container` tag:
  - NVIDIA A100 and H100 GPUs

    ```bash
    jube run resnet50/resnet50_benchmark.xml --tag container H100
    ```

  - NVIDIA GH200 and JEDI GPUs

    ```bash
    jube run resnet50/resnet50_benchmark.xml --tag container GH200
    ```

  - AMD MI250

    ```bash
    jube run resnet50/resnet50_benchmark.xml --tag container MI250
    ```

  - Graphcore GC200

    ```bash
    jube run resnet50/resnet50_benchmark.xml --tag container GC200
    ```
- To run the benchmark with the defined configurations, do

  ```bash
  jube run resnet50/resnet50_benchmark.xml --tag A100
  ```

  OR, with synthetic data,

  ```bash
  jube run resnet50/resnet50_benchmark.xml --tag A100 synthetic
  ```

  `A100` can be replaced with `H100`, `WAIH100`, `GH200`, `JEDI`, `MI250` and `GC200` for the respective systems.
- After the benchmark has been executed, use `jube continue` to postprocess the results:

  ```bash
  jube continue resnet50/resnet50_benchmark_run -i last
  ```
- After the postprocessing, to get the result do

  ```bash
  jube result resnet50/resnet50_benchmark_run -i last
  ```
- Example result:

  ```
  JobID,System,Version,Queue,Runtime(s),Model,Dataset,Nodes,Devices,Tasks/Node,Threads/Task,GlobalBatchSize,BatchSize/Device,Images/second,EnergyFile
  13077565,MI250,2024.01,dc-mi200,54.71,resnet50_v2,ImageNet,1,8,8,4,64,8,2107.00,CARAML/resnet50/resnet50_benchmark_run/000004/000002_combine_energy/work/combined_energy.csv

  JobID,System,Version,Queue,Runtime(s),Model,Dataset,Nodes,Devices,Tasks/Node,Threads/Task,GlobalBatchSize,BatchSize/Device,Images/second,EnergyFile
  13082568,GC200,2024.01,dc-ipu,1.0,resnet50_mlperf_pod4_bs20,ImageNet,1,4,1,12,32,8,3556.18,CARAML/resnet50/resnet50_benchmark_run/000000/000000_execute/work/GC200_power.0.energy.csv

  JobID,System,Version,Queue,Runtime(s),Model,Dataset,Nodes,Devices,Tasks/Node,Threads/Task,GlobalBatchSize,BatchSize/Device,Images/second,EnergyFile
  13080521,H100,2024.01,dc-h100,89.67,resnet50_v2,ImageNet,1,4,4,4,32,8,1994.69,CARAML/resnet50/resnet50_benchmark_run/000000/000001_combine_energy/work/combined_energy.csv
  ```
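The result lines can also be consumed programmatically, for example to compare runs across systems. Below is a minimal sketch, assuming the output above was saved to a file named `resnet50_result.csv` (the filename is an assumption) and that `Devices` is the total device count; the measured energy values themselves live in the files listed in the `EnergyFile` column:

```python
# Sketch: parse saved CARAML ResNet50 result lines and derive per-device throughput.
# Filename and the interpretation of "Devices" as a total count are assumptions.
import csv

with open("resnet50_result.csv", newline="") as f:
    for row in csv.DictReader(f):
        if row["JobID"] == "JobID":  # skip repeated header lines
            continue
        total = float(row["Images/second"])
        devices = int(row["Devices"])
        print(f"{row['System']}: {total:.1f} images/s total, "
              f"{total / devices:.1f} images/s per device "
              f"(energy log: {row['EnergyFile']})")
```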
Set the required system and model parameters in llm_benchmark_nvidia_amd.yaml for NVIDIA and AMD devices and in llm_benchmark_ipu.yaml for Graphcore.
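Analogous to the XML sketch above, the parameter sets defined in the YAML files can be listed before editing. This is a sketch only; it assumes PyYAML is available and deliberately walks the document instead of relying on the exact nesting of JUBE's YAML schema:

```python
# Sketch: list the parameter sets defined in a JUBE YAML benchmark file
# (assumes PyYAML; traverses the document rather than assuming a fixed nesting).
import yaml

def find_parametersets(node):
    """Yield any 'parameterset' entries regardless of nesting depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "parameterset":
                yield from (value if isinstance(value, list) else [value])
            else:
                yield from find_parametersets(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_parametersets(item)

with open("llm_training/llm_benchmark_nvidia_amd.yaml") as f:
    spec = yaml.safe_load(f)

for pset in find_parametersets(spec):
    print(pset.get("name"))
```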
- To pull the required container and build the packages, use the `container` tag:
  - NVIDIA A100 and H100 GPUs

    ```bash
    jube run llm_training/llm_benchmark_nvidia_amd.yaml --tag container H100
    ```

  - NVIDIA GH200 and JEDI GPUs

    ```bash
    jube run llm_training/llm_benchmark_nvidia_amd.yaml --tag container GH200
    ```

  - AMD MI250

    ```bash
    jube run llm_training/llm_benchmark_nvidia_amd.yaml --tag container MI250
    ```

  - Graphcore GC200

    ```bash
    jube run llm_training/llm_benchmark_ipu.yaml --tag container
    ```
- To run the benchmark with the defined configurations for the `800M` GPT model with OSCAR data, do:

  ```bash
  jube run llm_training/llm_benchmark_nvidia_amd.yaml --tag 800M A100
  ```

  `A100` can be replaced with `H100`, `WAIH100`, `GH200`, `JEDI` and `MI250` for the respective systems, and `800M` can be replaced with `13B` and `175B` for systems with more node resources like `JEDI`, `H100`, `A100` and `MI250`.
To run the benchmark with defined configurations for
117M
GPT model on Graphcore with synthetic data dojube run llm_training/llm_benchmark_ipu.yaml --tag 117M synthetic
If tag
synthetic
is not given, the benchmark will use OSCAR data -
- After the benchmark has been executed, use `jube continue` to postprocess the results:

  ```bash
  jube continue llm_training/llm_benchmark_nvidia_amd_run -i last
  ```

  OR

  ```bash
  jube continue llm_training/llm_benchmark_ipu_run -i last
  ```
- After the postprocessing, to get the result do

  ```bash
  jube result llm_training/llm_benchmark_nvidia_amd_run -i last
  ```

  OR

  ```bash
  jube result llm_training/llm_benchmark_ipu_run -i last
  ```
- Example result:

  ```
  JobID,System,Version,Queue,JobTime,Runtime(min),Model,ModelSize,Dataset,Nodes,Devices,GlobalBatchSize,PipelineParallel,TensorParallel,DataParallel,Iterations,Time/iteration(s),Tokens/second,Avg_TFLOPs/GPU,EnergyFile
  13077019,MI250,2024.01,dc-mi200,00:15:00,10,GPT,800M,OSCAR,1,8,32,1,1,8,750,0.74,88620.76,69.35,CARAML/llm_training/llm_benchmark_nvidia_amd_run/000006/000002_combine_energy/work/combined_energy.csv

  JobID,System,Version,Queue,JobTime,Model,ModelSize,Dataset,Nodes,Devices,DataParallel,IPU/replica,GlobalBatchSize,Time/iteration(s),Tokens/second,EnergyFile
  13011841,GC200,2024.01,dc-ipu,00:40:00,GPT,117M,Synthetic,1,4,1,4,2048,11.17,183.37,CARAML/llm_training/llm_benchmark_ipu_run/000003/000002_combine_energy/work/combined_energy.csv

  JobID,System,Version,Queue,JobTime,Runtime(min),Model,ModelSize,Dataset,Nodes,Devices,GlobalBatchSize,PipelineParallel,TensorParallel,DataParallel,Iterations,Time/iteration(s),Tokens/second,Avg_TFLOPs/GPU,EnergyFile
  3914,JEDI,2024.01,all,00:34:00,30,GPT,800M,OSCAR,1,4,2048,1,1,4,25,26.52,158152.80,321.65,CARAML/llm_training/llm_benchmark_nvidia_amd_run/000025/000002_combine_energy/work/combined_energy.csv
  ```
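As a sanity check, the reported throughput follows from the other columns as Tokens/second ≈ GlobalBatchSize × sequence length / Time per iteration. Plugging in the JEDI row suggests a sequence length of about 2048 tokens; this is inferred from the example numbers above, not a value documented here:

```python
# Quick consistency check of the JEDI example row (illustration only).
# The sequence length of 2048 is inferred from the reported numbers.
global_batch_size = 2048    # GlobalBatchSize column
seq_length = 2048           # inferred, see note above
time_per_iteration = 26.52  # Time/iteration(s) column
tokens_per_second = global_batch_size * seq_length / time_per_iteration
print(round(tokens_per_second))  # 158156, close to the reported 158152.80
```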
In order to use the PyTorch `torchrun` API on JSC systems, the fixed_torch_run.py fix is required. The fix solves the issue described here. Additionally, the hostname is appended with an `i` to allow communication over InfiniBand, as described here.
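For illustration, the InfiniBand adjustment amounts to appending an `i` to the node hostname used as the rendezvous address. The snippet below is a sketch of that idea, not the exact code used by the benchmark scripts; the use of `MASTER_ADDR` and the fallback to the local hostname are assumptions:

```python
# Sketch of the InfiniBand hostname adjustment on JSC systems: "<hostname>i"
# resolves to the node's InfiniBand interface. Illustration only; the launch
# scripts in this repository implement their own variant of this.
import os
import socket

master_host = os.environ.get("MASTER_ADDR", socket.gethostname())
os.environ["MASTER_ADDR"] = master_host + "i"  # append "i" -> InfiniBand-resolvable name
print("rendezvous address:", os.environ["MASTER_ADDR"])
```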