HCCL demo is a program that demonstrates HCCL usage and supports communication via Gaudi
based scale-out or Host NIC scale-out.
This README provides HCCL demo setup and usage as well as example run commands. In
addition, it provides further setup steps required when using Host NIC scale-out.
Host NIC scale-out is achieved using OFI. The Host NIC Scale-Out Setup
section details the steps required to download, install and build OFI. It also provides
the required environment variables to run Host NIC scale-out with Gaudi Direct.
The following lists the supported collective operations:
- All_reduce
- All_gather
- All2All
- Reduce_scatter
- Broadcast
- Reduce
Send/Recv is the supported point-to-point communication. It illustrates exchanging data between pairs of Gaudis in the same box or in different boxes, via Gaudi based scale-out or Host NIC scale-out.
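To make the semantics of the collectives listed above concrete, here is a small pure-Python model. It does not call HCCL; each inner list stands for the buffer held by one rank, and the reduction is assumed to be a sum (the function names are illustrative only).

```python
# Pure-Python model of collective semantics; bufs[i] is the buffer of rank i.
# Assumption: the reduction operation is an element-wise sum.

def all_reduce(bufs):
    """Every rank ends with the element-wise sum of all input buffers."""
    total = [sum(vals) for vals in zip(*bufs)]
    return [list(total) for _ in bufs]

def all_gather(bufs):
    """Every rank ends with all input buffers concatenated in rank order."""
    gathered = [x for buf in bufs for x in buf]
    return [list(gathered) for _ in bufs]

def reduce_scatter(bufs):
    """Rank i ends with the i-th chunk of the element-wise sum."""
    total = [sum(vals) for vals in zip(*bufs)]
    chunk = len(total) // len(bufs)
    return [total[i * chunk:(i + 1) * chunk] for i in range(len(bufs))]

def broadcast(bufs, root=0):
    """Every rank ends with a copy of the root rank's buffer."""
    return [list(bufs[root]) for _ in bufs]

ranks = [[1, 2], [10, 20]]  # two ranks, two elements each
print(all_reduce(ranks))
print(all_gather(ranks))
print(reduce_scatter(ranks))
print(broadcast(ranks, root=1))
```

Reduce behaves like all_reduce except that only the root rank receives the result.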
The demo contains:
- C++ project which includes all tests and a makefile
- Python wrapper which builds and runs the tests on multiple processes according to the number of devices
Copyright (c) 2022 Habana Labs, Ltd.
SPDX-License-Identifier: Apache-2.0
The Python wrapper builds and cleans the project (for cleaning please use "-clean").
Alternatively, the project can be built using the following command:
make
For building the project with MPI:
MPI=1 make
By default, the demo is built with affinity configuration.
When switching between MPI and non MPI modes, please remember to run with "-clean".
libfabric should be downloaded and installed in order to use it.
Please follow the instructions below:
- Define required version to be installed:
export REQUIRED_VERSION=1.20.0
- Download libfabric tarball from https://github.com/ofiwg/libfabric/releases:
wget https://github.com/ofiwg/libfabric/releases/download/v$REQUIRED_VERSION/libfabric-$REQUIRED_VERSION.tar.bz2 -P /tmp/libfabric
- Store temporary download directory in stack:
pushd /tmp/libfabric
- Open the file:
tar -xf libfabric-$REQUIRED_VERSION.tar.bz2
- Define libfabric root location:
export LIBFABRIC_ROOT=<libFabric library location>
- Create folder for libfabric:
mkdir -p ${LIBFABRIC_ROOT}
- Change permissions for libfabric folder:
chmod 777 ${LIBFABRIC_ROOT}
- Change directory to libfabric folder created after opening tar file:
cd libfabric-$REQUIRED_VERSION/
- Configure libfabric:
./configure --prefix=$LIBFABRIC_ROOT --with-synapseai=/usr
- Build and install libfabric:
make -j 32 && make install
- Remove temporary download directory from stack:
popd
- Delete temporary download directory:
rm -rf /tmp/libfabric
- Include LIBFABRIC_ROOT in LD_LIBRARY_PATH:
export LD_LIBRARY_PATH=$LIBFABRIC_ROOT/lib:$LD_LIBRARY_PATH
Installation can be verified by running:
fi_info --version
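The verification step can also be scripted. The following is a minimal sketch, assuming the fi_info binary is reachable through PATH (for example after adding $LIBFABRIC_ROOT/bin to it); the helper name fi_info_version is hypothetical.

```python
import shutil
import subprocess

def fi_info_version(binary="fi_info"):
    """Return the text printed by '<binary> --version', or None if not found."""
    path = shutil.which(binary)  # assumption: the binary is on PATH
    if path is None:
        return None
    result = subprocess.run([path, "--version"],
                            capture_output=True, text=True, check=False)
    return result.stdout.strip() or result.stderr.strip()

print(fi_info_version() or "fi_info not found on PATH")
```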
For more information please see: https://github.com/ofiwg/libfabric
To use the libfabric library, the HCCL OFI wrapper should be built.
Please follow the instructions below:
- Clone wrapper from https://github.com/HabanaAI/hccl_ofi_wrapper:
git clone https://github.com/HabanaAI/hccl_ofi_wrapper.git
- Define LIBFABRIC_ROOT:
export LIBFABRIC_ROOT=/tmp/libfabric-1.20.0
- Change directory to hccl_ofi_wrapper:
cd hccl_ofi_wrapper
- Build wrapper:
make
- Copy wrapper to /usr/lib/habanalabs/:
cp libhccl_ofi_wrapper.so /usr/lib/habanalabs/libhccl_ofi_wrapper.so
- Run ldconfig utility:
ldconfig
- Include libhccl_ofi_wrapper.so location in LD_LIBRARY_PATH:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/habanalabs/
Gaudi Direct (GDR) enables direct fabric access to Gaudi memory. This mode is supported with the Verbs or EFA provider if all of the following conditions are met:
- OFI version 1.16.0 (or higher) for EFA and 1.20.0 (or higher) for Verbs
- Kernel version 5.12 (or higher)
- The following environment variables are set:
  - FI_EFA_USE_DEVICE_RDMA=1 (for AWS EFA)
  - RDMAV_FORK_SAFE=1
  - MLX5_SCATTER_TO_CQE=0 (for MLX Verbs)
- PCIe ACS (Access Control Services) is disabled
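The version and environment conditions above can be checked with a short script. This is only a sketch: the function names are hypothetical, RDMAV_FORK_SAFE is read here as applying to both providers, and the PCIe ACS state is not checked.

```python
import re

# Minimum versions taken from the conditions listed above.
MIN_OFI = {"efa": (1, 16, 0), "verbs": (1, 20, 0)}
MIN_KERNEL = (5, 12, 0)
# Assumption: RDMAV_FORK_SAFE applies to both providers.
REQUIRED_ENV = {
    "efa": {"FI_EFA_USE_DEVICE_RDMA": "1", "RDMAV_FORK_SAFE": "1"},
    "verbs": {"RDMAV_FORK_SAFE": "1", "MLX5_SCATTER_TO_CQE": "0"},
}

def as_tuple(version):
    """'5.15.0-91-generic' -> (5, 15, 0); missing fields are padded with 0."""
    nums = [int(n) for n in re.findall(r"\d+", version)[:3]]
    return tuple(nums + [0] * (3 - len(nums)))

def gdr_problems(provider, ofi_version, kernel_release, env):
    """Return a list of unmet conditions (empty when all checks pass)."""
    problems = []
    if as_tuple(ofi_version) < MIN_OFI[provider]:
        problems.append(f"OFI {ofi_version} is too old for {provider}")
    if as_tuple(kernel_release) < MIN_KERNEL:
        problems.append(f"kernel {kernel_release} is older than 5.12")
    for name, wanted in REQUIRED_ENV[provider].items():
        if env.get(name) != wanted:
            problems.append(f"{name} should be set to {wanted}")
    return problems

example_env = {"RDMAV_FORK_SAFE": "1", "MLX5_SCATTER_TO_CQE": "0"}
print(gdr_problems("verbs", "1.20.0", "5.15.0-91-generic", example_env))
```

In practice, os.environ and platform.release() could be passed in place of the example values.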
--nranks - int, Number of ranks participating in the demo
--ranks_per_node - int, Number of ranks participating in the demo on the current node
--node_id - int, ID of the running host. Each host should have a unique ID between 0 and num_nodes-1
--test - str, Which hccl test to run (for example: broadcast/all_reduce) (default: broadcast)
--size - str, Data size in units of G,M,K,B or no unit (default: 33554432 Bytes)
--data_type - str, Type of data, float or bfloat16 (default: float)
--loop - int, Number of iterations (must be positive, default: 10)
--ranks_list - str, Comma separated list of pairs of ranks, for the send_recv test only, e.g. 0,8,1,8 (optional, default is to perform a regular send_recv test with all ranks)
--test_root - int, Index of root rank for broadcast and reduce tests
--csv_path - str, Path to a file for results output
--size_range - pair of str, Test will run from MIN to MAX, units of G,M,K,B or no unit. Default is Bytes, e.g. --size_range 32B 1M
--size_range_inc - int, Test will run on all sizes from MIN to MAX, multiplying by 2^size_range_inc at each step (default: 1)
-mpi - Use MPI for managing execution
-clean - Clear old executable and compile a new one
-list - Display a list of available tests
-help - Display detailed help for HCCL demo in a form of docstring
-ignore_mpi_errors - Ignore generic MPI errors
-no_color - Disable the usage of colors in console output
HCCL_COMM_ID - IP of node_id=0 host and an available port, in the format <IP:PORT>
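Before launching, the <IP:PORT> value can be sanity-checked. The helper below is a hypothetical sketch that only validates the format described above; it does not check that the port is actually free.

```python
import ipaddress

def parse_comm_id(value):
    """Split an '<IP:PORT>' string into (ip, port), raising ValueError if malformed."""
    host, sep, port = value.rpartition(":")
    if not sep:
        raise ValueError(f"expected <IP:PORT>, got {value!r}")
    ipaddress.ip_address(host)  # raises ValueError for a non-IP host part
    port_num = int(port)        # raises ValueError for a non-numeric port
    if not 0 < port_num < 65536:
        raise ValueError(f"port out of range: {port_num}")
    return host, port_num

print(parse_comm_id("127.0.0.1:5555"))
```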
Set the following when using an operating system with a Linux kernel version between 5.9.x and 5.16.x. Currently, this applies to Ubuntu20 and Amazon Linux AMIs:
echo 0 > /proc/sys/kernel/numa_balancing
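Whether the setting above is relevant on a given host can be checked as follows. This is a sketch with hypothetical helper names; it only reads the current state and does not change it.

```python
import platform
import re
from pathlib import Path

NUMA_BALANCING = Path("/proc/sys/kernel/numa_balancing")

def needs_numa_balancing_off(kernel_release):
    """True when the kernel version falls between 5.9.x and 5.16.x inclusive."""
    nums = [int(n) for n in re.findall(r"\d+", kernel_release)[:2]]
    if len(nums) < 2:
        return False  # unrecognized release string
    return (5, 9) <= tuple(nums) <= (5, 16)

def numa_balancing_enabled(path=NUMA_BALANCING):
    """Return True/False, or None when the file cannot be read."""
    try:
        return path.read_text().strip() != "0"
    except OSError:
        return None

release = platform.release()
print(release, needs_numa_balancing_off(release), numa_balancing_enabled())
```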
Run the execution command:
HCCL_COMM_ID=<IP:PORT> ./run_hccl_demo.py [options]
Results are printed to the display.
Results per rank can also be written to an output file by using --csv_path <path_to_file>.
Note: The following examples are applicable for Gaudi based and Host NIC scale-out.
Configuration: One server with 8 ranks, 32 MB size, all_reduce collective, 1000 iterations
HCCL_COMM_ID=127.0.0.1:5555 python3 run_hccl_demo.py --nranks 8 --node_id 0 --size 32m --test all_reduce --loop 1000 --ranks_per_node 8
Output example:
###############################################################################
[BENCHMARK] hcclAllReduce(src!=dst, count=8388608, dtype=float, iterations=1000)
[BENCHMARK] NW Bandwidth : <Test results> MB/s
[BENCHMARK] Algo Bandwidth : <Test results> MB/s
###############################################################################
Different options for running one server with 8 ranks and size of 32 MB:
HCCL_COMM_ID=127.0.0.1:5555 python3 run_hccl_demo.py --nranks 8 --node_id 0 --size 32m --test all_reduce
HCCL_COMM_ID=127.0.0.1:5555 python3 run_hccl_demo.py --nranks 8 --node_id 0 --size 32M --test all_reduce
HCCL_COMM_ID=127.0.0.1:5555 python3 run_hccl_demo.py --nranks 8 --node_id 0 --size 33554432 --test all_reduce
HCCL_COMM_ID=127.0.0.1:5555 python3 run_hccl_demo.py --nranks 8 --node_id 0 --size 33554432b --test all_reduce
HCCL_COMM_ID=127.0.0.1:5555 python3 run_hccl_demo.py --nranks 8 --node_id 0 --size 33554432B --test all_reduce
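The five commands above all describe the same size. The sketch below shows one way such strings could be interpreted; it is not the demo's own parser. It assumes binary multiples (which 32m = 33554432 implies for M, and which is assumed by analogy for K and G), integer values only, and 4 bytes per float or 2 per bfloat16 when deriving the element count.

```python
import re

# Assumption: units are binary multiples, matching 32m == 33554432 bytes.
UNIT_BYTES = {"": 1, "B": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
DTYPE_BYTES = {"float": 4, "bfloat16": 2}

def parse_size(text):
    """Convert strings such as '32m', '32M' or '33554432B' to a byte count."""
    match = re.fullmatch(r"(\d+)([BKMG]?)", text.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    return int(match.group(1)) * UNIT_BYTES[match.group(2)]

def element_count(size_bytes, data_type="float"):
    """Number of elements of data_type that fit in size_bytes."""
    return size_bytes // DTYPE_BYTES[data_type]

for text in ("32m", "32M", "33554432", "33554432b", "33554432B"):
    print(text, parse_size(text), element_count(parse_size(text)))
```

For 32 MB of float data this gives 8388608 elements, which matches the count shown in the benchmark output.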
Configuration: 16 ranks, 32 MB size, all_reduce collective, 1000 iterations
First server command:
HCCL_COMM_ID=10.111.12.234:5555 python3 run_hccl_demo.py --test all_reduce --nranks 16 --loop 1000 --node_id 0 --size 32m --ranks_per_node 8
Second server command:
HCCL_COMM_ID=10.111.12.234:5555 python3 run_hccl_demo.py --test all_reduce --nranks 16 --loop 1000 --node_id 1 --size 32m --ranks_per_node 8
First server output:
###############################################################################
[BENCHMARK] hcclAllReduce(src!=dst, count=8388608, dtype=float, iterations=1000)
[BENCHMARK] NW Bandwidth : <Test results> MB/s
[BENCHMARK] Algo Bandwidth : <Test results> MB/s
###############################################################################
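The per-server commands above differ only in --node_id, so they can be generated. The sketch below uses only the flags shown in this README and assumes every server runs the same number of ranks; the function name is hypothetical.

```python
import shlex

def node_commands(master_ip, port, num_nodes, ranks_per_node, test, size, loop):
    """Build one launch command per server; node_id 0 runs on master_ip."""
    env = f"HCCL_COMM_ID={master_ip}:{port}"
    commands = []
    for node_id in range(num_nodes):
        args = [
            "python3", "run_hccl_demo.py",
            "--test", test,
            "--nranks", str(num_nodes * ranks_per_node),
            "--loop", str(loop),
            "--node_id", str(node_id),
            "--size", size,
            "--ranks_per_node", str(ranks_per_node),
        ]
        commands.append(f"{env} {shlex.join(args)}")
    return commands

for command in node_commands("10.111.12.234", 5555, 2, 8, "all_reduce", "32m", 1000):
    print(command)
```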
Configuration: One server with 8 ranks, size range 32B to 1 MB, all_reduce collective, 1 iteration
HCCL_COMM_ID=127.0.0.1:5555 python3 run_hccl_demo.py --nranks 8 --node_id 0 --size_range 32b 1m --test all_reduce --loop 1 --ranks_per_node 8
Output example:
################################################
[SUMMARY REPORT]
(src!=dst, collective=all_reduce, iterations=1)
size     count       type   redop  time    algo_bw      nw_bw
(B)      (elements)                (us)    (GB/s)       (GB/s)
32       8           float  sum    <time>  <bandwidth>  <bandwidth>
64       16          float  sum    <time>  <bandwidth>  <bandwidth>
128      32          float  sum    <time>  <bandwidth>  <bandwidth>
256      64          float  sum    <time>  <bandwidth>  <bandwidth>
512      128         float  sum    <time>  <bandwidth>  <bandwidth>
1024     256         float  sum    <time>  <bandwidth>  <bandwidth>
2048     512         float  sum    <time>  <bandwidth>  <bandwidth>
4096     1024        float  sum    <time>  <bandwidth>  <bandwidth>
8192     2048        float  sum    <time>  <bandwidth>  <bandwidth>
16384    4096        float  sum    <time>  <bandwidth>  <bandwidth>
32768    8192        float  sum    <time>  <bandwidth>  <bandwidth>
65536    16384       float  sum    <time>  <bandwidth>  <bandwidth>
131072   32768       float  sum    <time>  <bandwidth>  <bandwidth>
262144   65536       float  sum    <time>  <bandwidth>  <bandwidth>
524288   131072      float  sum    <time>  <bandwidth>  <bandwidth>
1048576  262144      float  sum    <time>  <bandwidth>  <bandwidth>
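The size and count columns of the report above can be reproduced with a few lines. This is a sketch of the stepping rule as described for --size_range and --size_range_inc, not the demo's implementation; it assumes 4-byte float elements and that MAX is included when a step lands on it exactly.

```python
def sizes_in_range(min_bytes, max_bytes, inc=1):
    """Sizes from min_bytes up to max_bytes, multiplying by 2**inc each step."""
    if min_bytes <= 0 or inc < 1:
        raise ValueError("min_bytes must be positive and inc at least 1")
    sizes = []
    size = min_bytes
    while size <= max_bytes:
        sizes.append(size)
        size *= 2 ** inc
    return sizes

# --size_range 32b 1m with the default increment of 1
for size in sizes_in_range(32, 1024 ** 2):
    print(size, size // 4)  # bytes, float element count
```

With an increment of 2, the same range would step 32, 128, 512 and so on.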
Note: The following examples are applicable for Gaudi based and Host NIC scale-out.
All available MPI options are supported.
- For the different MPI running options, please refer to: https://www.open-mpi.org/faq/?category=running#mpirun
Configuration: One server with 8 ranks, 32 MB size, all_reduce collective, 1000 iterations
python3 run_hccl_demo.py --size 32m --test all_reduce --loop 1000 -mpi -np 8
Output example:
###############################################################################
[BENCHMARK] hcclAllReduce(src!=dst, count=8388608, dtype=float, iterations=1000)
[BENCHMARK] NW Bandwidth : <Test results> MB/s
[BENCHMARK] Algo Bandwidth : <Test results> MB/s
###############################################################################
Configuration: 16 ranks, 32 MB size, all_reduce collective, 1000 iterations
First option using MPI hostfile:
python3 run_hccl_demo.py --test all_reduce --loop 1000 --size 32m -mpi --hostfile hostfile.txt
- For MPI --hostfile option, please refer to: https://www.open-mpi.org/faq/?category=running#mpirun-hostfile
Second option using MPI host:
python3 run_hccl_demo.py --test all_reduce --loop 1000 --size 32m -mpi --host 10.111.12.234:8,10.111.12.235:8
- For MPI --host option, please refer to: https://www.open-mpi.org/faq/?category=running#mpirun-host
First server output:
###############################################################################
[BENCHMARK] hcclAllReduce(src!=dst, count=8388608, dtype=float, iterations=1000)
[BENCHMARK] NW Bandwidth : <Test results> MB/s
[BENCHMARK] Algo Bandwidth : <Test results> MB/s
###############################################################################
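Both the hostfile content and the --host value used in the two options above can be derived from the same list of servers. The sketch below relies on the Open MPI hostfile syntax ("<host> slots=<n>"); the function names are hypothetical.

```python
def hostfile_text(hosts, slots_per_host=8):
    """Open MPI hostfile content: one '<host> slots=<n>' line per server."""
    return "".join(f"{host} slots={slots_per_host}\n" for host in hosts)

def host_option(hosts, slots_per_host=8):
    """Value for the --host option, e.g. 'a:8,b:8'."""
    return ",".join(f"{host}:{slots_per_host}" for host in hosts)

servers = ["10.111.12.234", "10.111.12.235"]
print(hostfile_text(servers), end="")
print(host_option(servers))
```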
Running on 1 server:
Configuration: One server with 8 ranks, 32 MB size, all_reduce collective, 1 iteration, communicator includes only ranks 0 and 1:
HCCL_COMM_ID=127.0.0.1:5555 python3 run_hccl_demo.py --nranks 8 --node_id 0 --size 32m --test all_reduce --loop 1 --ranks_per_node 8 --custom_comm 0,1
Running on 2 servers with MPI (16 Gaudi devices):
* Note: When defining a custom communicator, each rank in the communicator should have at least one peer rank that is also included in the communicator.
* The following example uses an MPI hostfile; using the MPI host option works as well.
Configuration: 16 ranks, 32 MB size, all_reduce collective, 1000 iterations, communicator includes only ranks 0 and 8:
python3 run_hccl_demo.py --test all_reduce --loop 1000 --size 32m --custom_comm 0,8 -mpi --hostfile hostfile.txt
Running on 2 servers without MPI (16 Gaudi devices):
Configuration: 16 ranks, 32 MB size, all_reduce collective, 1000 iterations, communicator includes only ranks 0,1,8,9:
First node:
HCCL_COMM_ID=10.111.12.234:5555 python3 run_hccl_demo.py --test all_reduce --nranks 16 --loop 1000 --node_id 0 --custom_comm 0,1,8,9
Second node:
HCCL_COMM_ID=10.111.12.234:5555 python3 run_hccl_demo.py --test all_reduce --nranks 16 --loop 1000 --node_id 1 --custom_comm 0,1,8,9
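To see how a custom communicator such as 0,1,8,9 is spread over the servers, the ranks can be grouped by node. The sketch below assumes ranks are numbered consecutively per server (rank = node_id * ranks_per_node + local index), which is consistent with rank 8 living on the second server in the examples. The peer check reads "peer" as the rank with the same local index on another server; that reading is an interpretation, not something this README states.

```python
from collections import defaultdict

def group_by_node(comm, ranks_per_node=8):
    """Map node_id -> communicator ranks hosted on that node."""
    nodes = defaultdict(list)
    for rank in comm:
        nodes[rank // ranks_per_node].append(rank)
    return dict(nodes)

def ranks_missing_peer(comm, ranks_per_node=8):
    """Ranks with no same-local-index rank on another node (assumed peer rule)."""
    locations = [divmod(rank, ranks_per_node) for rank in comm]
    if len({node for node, _ in locations}) < 2:
        return []  # single-server communicator: rule not applied
    missing = []
    for rank, (node, local) in zip(comm, locations):
        if not any(n != node and l == local for n, l in locations):
            missing.append(rank)
    return missing

print(group_by_node([0, 1, 8, 9]))
print(ranks_missing_peer([0, 1, 8, 9]))
print(ranks_missing_peer([0, 1, 8]))
```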