HCCL demo is a program that demonstrates HCCL usage and supports communication via Gaudi
based scale-out or Host NIC scale-out.
This README provides HCCL demo setup and usage as well as example run commands. In
addition, it provides further setup steps required when using Host NIC scale-out.
Host NIC scale-out is achieved using OFI. The Host NIC Scale-Out Setup
section details the steps required to download, build and install OFI. It also provides
the required environment variables to run Host NIC scale-out with Gaudi Direct.
The following lists the supported collective operations:
- All_reduce
- All_gather
- All2All
- Reduce_scatter
- Broadcast
- Reduce
Send/Recv is the supported point-to-point communication. It illustrates exchanging data between pairs of Gaudis in the same box or in different boxes, via Gaudi based scale-out or Host NIC scale-out.
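For readers new to these operations, the following is a minimal pure-Python sketch of the data movement each collective conventionally performs. Every rank is modeled as a plain list, and a sum reduction is assumed. It does not call HCCL; all function names are hypothetical and only illustrate the semantics.

```python
# Reference semantics only: each "rank" is a Python list; no devices are involved.
# All names below are illustrative and are not part of HCCL or the demo.

def _elementwise_sum(bufs):
    return [sum(vals) for vals in zip(*bufs)]

def all_reduce(bufs):
    """Every rank ends up with the element-wise sum over all ranks."""
    total = _elementwise_sum(bufs)
    return [list(total) for _ in bufs]

def all_gather(bufs):
    """Every rank ends up with the concatenation of all ranks' buffers."""
    gathered = [x for buf in bufs for x in buf]
    return [list(gathered) for _ in bufs]

def reduce_scatter(bufs):
    """Rank i ends up with the i-th chunk of the element-wise sum."""
    n = len(bufs)
    total = _elementwise_sum(bufs)
    chunk = len(total) // n
    return [total[i * chunk:(i + 1) * chunk] for i in range(n)]

def all_to_all(bufs):
    """Rank i sends its j-th chunk to rank j."""
    n = len(bufs)
    chunk = len(bufs[0]) // n
    return [
        [x for src in bufs for x in src[dst * chunk:(dst + 1) * chunk]]
        for dst in range(n)
    ]

def broadcast(bufs, root=0):
    """Every rank ends up with a copy of the root's buffer."""
    return [list(bufs[root]) for _ in bufs]

def reduce_to_root(bufs, root=0):
    """Only the root receives the sum; other ranks keep their own data."""
    total = _elementwise_sum(bufs)
    return [total if i == root else list(buf) for i, buf in enumerate(bufs)]
```

With two ranks holding `[1, 2]` and `[3, 4]`, `all_reduce` leaves `[4, 6]` on both ranks, while `reduce_scatter` leaves `[4]` on rank 0 and `[6]` on rank 1.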
The HCCL demo consists of:
- C++ project which includes all tests and a makefile
- Python wrapper which builds and runs the tests on multiple processes according to the number of devices
Copyright (c) 2022 Habana Labs, Ltd.
SPDX-License-Identifier: Apache-2.0
The Python wrapper builds and cleans the project (for cleaning please use "-clean").
Alternatively, the project can be built using the following command:
make
For building the project with MPI:
MPI=1 make
By default, the demo is built with affinity configuration.
When switching between MPI and non MPI modes, please remember to run with "-clean".
libfabric should be downloaded and installed in order to use it.
Please follow the instructions below:
- Define required version to be installed:
export REQUIRED_VERSION=1.20.0
- Download libfabric tarball from https://github.com/ofiwg/libfabric/releases:
wget https://github.com/ofiwg/libfabric/releases/download/v$REQUIRED_VERSION/libfabric-$REQUIRED_VERSION.tar.bz2 -P /tmp/libfabric
- Push the temporary download directory onto the directory stack:
pushd /tmp/libfabric
- Extract the tarball:
tar -xf libfabric-$REQUIRED_VERSION.tar.bz2
- Define libfabric root location:
export LIBFABRIC_ROOT=<libFabric library location>
- Create folder for libfabric:
mkdir -p ${LIBFABRIC_ROOT}
- Change permissions for libfabric folder:
chmod 777 ${LIBFABRIC_ROOT}
- Change directory to the libfabric folder created by extracting the tarball:
cd libfabric-$REQUIRED_VERSION/
- Configure libfabric:
./configure --prefix=$LIBFABRIC_ROOT --with-synapseai=/usr
- Build and install libfabric:
make -j 32 && make install
- Pop the temporary download directory from the directory stack:
popd
- Delete temporary download directory:
rm -rf /tmp/libfabric
- Include LIBFABRIC_ROOT in LD_LIBRARY_PATH:
export LD_LIBRARY_PATH=$LIBFABRIC_ROOT/lib:$LD_LIBRARY_PATH
Installation can be verified by running:
fi_info --version
For more information please see: https://github.com/ofiwg/libfabric
To use libfabric library, HCCL OFI wrapper should be built.
Please follow the instructions below:
- Clone wrapper from https://github.com/HabanaAI/hccl_ofi_wrapper:
git clone https://github.com/HabanaAI/hccl_ofi_wrapper.git
- Define LIBFABRIC_ROOT (the libfabric installation location), for example:
export LIBFABRIC_ROOT=/tmp/libfabric-1.20.0
- Change directory to hccl_ofi_wrapper:
cd hccl_ofi_wrapper
- Build wrapper:
make
- Copy wrapper to /usr/lib/habanalabs/:
cp libhccl_ofi_wrapper.so /usr/lib/habanalabs/libhccl_ofi_wrapper.so
- Run ldconfig utility:
ldconfig
- Include libhccl_ofi_wrapper.so location in LD_LIBRARY_PATH:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/habanalabs/
Gaudi Direct (GDR) enables direct fabric access to Gaudi memory. This mode is supported with the Verbs or EFA provider if the following conditions are met:
- OFI version 1.16.0 (or higher) for EFA and 1.20.0 (or higher) for Verbs
- Kernel version 5.12 (or higher)
- The following environment variables are set:
FI_EFA_USE_DEVICE_RDMA=1 (for AWS EFA)
RDMAV_FORK_SAFE=1
MLX5_SCATTER_TO_CQE=0 (for MLX Verbs)
- PCIe ACS (Access Control Services) is disabled
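The conditions above can be checked before a run. The following is a rough sketch, not part of the demo: `gdr_problems` and `version_tuple` are hypothetical helpers, the caller supplies the OFI version (for example as reported by `fi_info --version`) and the kernel release string, and `RDMAV_FORK_SAFE=1` is assumed to apply to both providers because the list does not tie it to one. The PCIe ACS state is not checked here.

```python
import re

# Minimum versions taken from the conditions listed above.
MIN_OFI = {"efa": (1, 16, 0), "verbs": (1, 20, 0)}
MIN_KERNEL = (5, 12, 0)

# Assumption: RDMAV_FORK_SAFE=1 is expected for both providers.
REQUIRED_ENV = {
    "efa": {"FI_EFA_USE_DEVICE_RDMA": "1", "RDMAV_FORK_SAFE": "1"},
    "verbs": {"RDMAV_FORK_SAFE": "1", "MLX5_SCATTER_TO_CQE": "0"},
}

def version_tuple(text):
    """Parse leading numeric components, e.g. '5.15.0-91-generic' -> (5, 15, 0)."""
    match = re.match(r"(\d+(?:\.\d+)*)", text.strip())
    if match is None:
        raise ValueError(f"no version number in {text!r}")
    parts = [int(p) for p in match.group(1).split(".")]
    parts += [0] * (3 - len(parts))  # pad so that '5.12' compares like '5.12.0'
    return tuple(parts[:3])

def gdr_problems(provider, ofi_version, kernel_release, env):
    """Return human-readable problems; an empty list means all checks passed."""
    problems = []
    if version_tuple(ofi_version) < MIN_OFI[provider]:
        needed = ".".join(map(str, MIN_OFI[provider]))
        problems.append(f"OFI {ofi_version} is older than {needed}")
    if version_tuple(kernel_release) < MIN_KERNEL:
        problems.append(f"kernel {kernel_release} is older than 5.12")
    for name, value in REQUIRED_ENV[provider].items():
        if env.get(name) != value:
            problems.append(f"{name} should be set to {value}")
    return problems
```

A typical call would pass `os.environ` as `env` and `platform.release()` as `kernel_release`.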
-h, --help                  Show this help message and exit.
--clean, -clean             Clean previous artifacts including logs, recipe and csv results.
-list, --list_tests         Display a list of available tests.
--doc                       Display detailed help for HCCL demo in the form of a docstring.
--nranks NRANKS             Number of ranks in the communicator.
--ranks_per_node RANKS_PER_NODE
                            Number of ranks per node (default read from h/w or set by MPI).
--scaleup_group_size        The scaleup group size per node (default is ranks_per_node).
--node_id NODE_ID           Box index. Value in the range [0, NUM_BOXES).
--mpi, -mpi                 Use MPI for managing execution.
--test TEST                 Specify test (use '-l' option for test list).
--size N                    Data size in units of G,M,K,B or no unit. Default is Bytes.
--size_range MIN MAX        Test will run from MIN to MAX, units of G,M,K,B or no unit. Default is Bytes. E.g. --size_range 32B 1M.
--size_range_inc M          Test will run on all sizes from MIN to MAX, multiplying by 2^size_range_inc at each step.
--loop LOOP                 Number of loop iterations.
--test_root TEST_ROOT       Index of root rank for broadcast and reduce tests (optional).
--ranks_list RANKS_LIST     List of pairs of ranks for send_recv ranks scaleout. E.g. 0,8,1,8 (optional).
--data_type DATA_TYPE       Data type, float or bfloat16. Default is float.
--custom_comm CUSTOM_COMM   List of HCCL processes that will open a communicator.
--no_correctness            Skip correctness validation.
--reduction_op              <sum|min|max> (default=sum)
--result_csv CSV_FILE       Path to a file for results output (optional).
--ignore_mpi_errors, -ignore_mpi_errors
                            Ignore generic MPI errors.
--no_color, -no_color       Disable colored output in terminal.
--data_csv CSV_PATH, -data_csv CSV_PATH
                            Creates 2 CSV files for each rank, one for data input and one for data output.
                            CSV_PATH is the path to a file for results output.
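To make the unit suffixes accepted by --size and --size_range concrete, here is a small sketch of how such strings could be converted to bytes. `parse_size` is a hypothetical helper, not the demo's own parser. The run examples in this README show that 32m equals 33554432 bytes, so M is 1024^2; treating K and G as 1024 and 1024^3 is an assumption by analogy, as is accepting whole numbers only.

```python
import re

# M = 1024**2 matches this README's examples (32m == 33554432 bytes).
# K and G are assumed to follow the same power-of-1024 convention.
_UNIT_BYTES = {"": 1, "B": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}

def parse_size(text):
    """Convert strings such as '32m', '32M', '33554432' or '33554432B' to bytes."""
    match = re.fullmatch(r"(\d+)([GMKB]?)", text.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, unit = match.groups()
    return int(number) * _UNIT_BYTES[unit]
```

For example, `parse_size("32m")` and `parse_size("33554432b")` both return 33554432.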
HCCL_COMM_ID - IP of node_id=0 host and an available port, in the format <IP:PORT>
Run the following command when using any operating system with a Linux kernel version between 5.9.x and 5.16.x. Currently, this applies to Ubuntu 20 and Amazon Linux AMIs:
echo 0 > /proc/sys/kernel/numa_balancing
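A quick way to decide whether this setting is needed on a given host is sketched below. Both helpers are hypothetical, and the kernel range is interpreted as inclusive of every 5.9.x through 5.16.x release, which is an assumption about the wording above.

```python
import re

def in_affected_kernel_range(release):
    """True if a release string such as '5.15.0-91-generic' is within 5.9.x..5.16.x."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    major_minor = (int(match.group(1)), int(match.group(2)))
    return (5, 9) <= major_minor <= (5, 16)

def numa_balancing_enabled(path="/proc/sys/kernel/numa_balancing"):
    """Return True/False from the sysctl file, or None if it cannot be read."""
    try:
        with open(path) as handle:
            return handle.read().strip() != "0"
    except OSError:
        return None
```

On a live system, `in_affected_kernel_range(platform.release())` together with `numa_balancing_enabled()` indicates whether the `echo 0` command above still has to be run.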
Run the execution command:
HCCL_COMM_ID=<IP:PORT> ./run_hccl_demo.py [options]
Results are printed to the display.
Results per rank can also be written to an output file by using --result_csv <path_to_file>.
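When the demo is launched from another script, the environment variable and the options can be assembled programmatically. The sketch below only builds the environment and argument list; `build_demo_command` is a hypothetical helper, and it uses only options documented above.

```python
import os

def build_demo_command(comm_id, test, nranks, node_id, size,
                       loop=None, ranks_per_node=None, result_csv=None):
    """Return (env, argv) for one run_hccl_demo.py invocation on one node."""
    argv = ["python3", "run_hccl_demo.py",
            "--nranks", str(nranks),
            "--node_id", str(node_id),
            "--size", str(size),
            "--test", test]
    if loop is not None:
        argv += ["--loop", str(loop)]
    if ranks_per_node is not None:
        argv += ["--ranks_per_node", str(ranks_per_node)]
    if result_csv is not None:
        argv += ["--result_csv", result_csv]
    # HCCL_COMM_ID carries <IP:PORT> of the node_id=0 host.
    env = dict(os.environ)
    env["HCCL_COMM_ID"] = comm_id
    return env, argv
```

The result can be passed to `subprocess.run(argv, env=env, check=True)` from the directory that contains `run_hccl_demo.py`.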
Note: The following examples are applicable for Gaudi based and Host NIC scale-out.
Configuration: One server with 8 ranks, 32 MB size, all_reduce collective, 1000 iterations
HCCL_COMM_ID=127.0.0.1:5555 python3 run_hccl_demo.py --nranks 8 --node_id 0 --size 32m --test all_reduce --loop 1000 --ranks_per_node 8
Output example:
###############################################################################
[BENCHMARK] hcclAllReduce(src!=dst, count=8388608, dtype=float, iterations=1000)
[BENCHMARK] NW Bandwidth : <Test results> GB/s
[BENCHMARK] Algo Bandwidth : <Test results> GB/s
###############################################################################
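The `count=8388608` in the output above is consistent with dividing the 32 MB buffer by the 4-byte width of a float. The sketch below spells out that arithmetic; `element_count` is a hypothetical helper, and applying the same rule to bfloat16 (2 bytes per element) is an assumption about how the demo reports counts.

```python
# Bytes per element for the data types accepted by --data_type.
DTYPE_BYTES = {"float": 4, "bfloat16": 2}

def element_count(size_bytes, dtype="float"):
    """Number of elements of `dtype` that fit exactly into `size_bytes`."""
    width = DTYPE_BYTES[dtype]
    if size_bytes % width != 0:
        raise ValueError(f"{size_bytes} bytes is not a multiple of {width}")
    return size_bytes // width
```

For the run above, `element_count(32 * 1024 * 1024)` gives 8388608.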
Different options for running one server with 8 ranks and size of 32 MB:
HCCL_COMM_ID=127.0.0.1:5555 python3 run_hccl_demo.py --nranks 8 --node_id 0 --size 32m --test all_reduce
HCCL_COMM_ID=127.0.0.1:5555 python3 run_hccl_demo.py --nranks 8 --node_id 0 --size 32M --test all_reduce
HCCL_COMM_ID=127.0.0.1:5555 python3 run_hccl_demo.py --nranks 8 --node_id 0 --size 33554432 --test all_reduce
HCCL_COMM_ID=127.0.0.1:5555 python3 run_hccl_demo.py --nranks 8 --node_id 0 --size 33554432b --test all_reduce
HCCL_COMM_ID=127.0.0.1:5555 python3 run_hccl_demo.py --nranks 8 --node_id 0 --size 33554432B --test all_reduce
Configuration: 16 ranks, 32 MB size, all_reduce collective, 1000 iterations
First server command:
HCCL_COMM_ID=10.111.12.234:5555 python3 run_hccl_demo.py --test all_reduce --nranks 16 --loop 1000 --node_id 0 --size 32m --ranks_per_node 8
Second server command:
HCCL_COMM_ID=10.111.12.234:5555 python3 run_hccl_demo.py --test all_reduce --nranks 16 --loop 1000 --node_id 1 --size 32m --ranks_per_node 8
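The two commands above differ only in --node_id: every server uses the same HCCL_COMM_ID (the address of the node_id=0 host) and the same total --nranks. The sketch below generates one command line per server under that pattern; `per_node_commands` is a hypothetical helper and assumes every server runs the same number of ranks.

```python
def per_node_commands(master_ip, port, num_nodes, ranks_per_node, test, size, loop):
    """Return one shell command string per node, indexed by node_id."""
    comm_id = f"{master_ip}:{port}"        # shared by all nodes
    nranks = num_nodes * ranks_per_node    # total ranks in the communicator
    return [
        f"HCCL_COMM_ID={comm_id} python3 run_hccl_demo.py "
        f"--test {test} --nranks {nranks} --loop {loop} "
        f"--node_id {node_id} --size {size} --ranks_per_node {ranks_per_node}"
        for node_id in range(num_nodes)
    ]
```

Calling `per_node_commands("10.111.12.234", 5555, 2, 8, "all_reduce", "32m", 1000)` reproduces the first and second server commands shown above.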
First server output:
###############################################################################
[BENCHMARK] hcclAllReduce(src!=dst, count=8388608, dtype=float, iterations=1000)
[BENCHMARK] NW Bandwidth : <Test results> GB/s
[BENCHMARK] Algo Bandwidth : <Test results> GB/s
###############################################################################
Configuration: One server with 8 ranks, size range 32B to 1 MB, all_reduce collective, 1 iteration
HCCL_COMM_ID=127.0.0.1:5555 python3 run_hccl_demo.py --nranks 8 --node_id 0 --size_range 32b 1m --test all_reduce --loop 1 --ranks_per_node 8
Output example:
################################################
[SUMMARY REPORT]
(src!=dst, collective=all_reduce, iterations=1)
size      count        type    redop   time      algo_bw       nw_bw
(B)       (elements)                   (ms)      (GB/s)        (GB/s)
32        8            float   sum     <time>    <bandwidth>   <bandwidth>
64        16           float   sum     <time>    <bandwidth>   <bandwidth>
128       32           float   sum     <time>    <bandwidth>   <bandwidth>
256       64           float   sum     <time>    <bandwidth>   <bandwidth>
512       128          float   sum     <time>    <bandwidth>   <bandwidth>
1024      256          float   sum     <time>    <bandwidth>   <bandwidth>
2048      512          float   sum     <time>    <bandwidth>   <bandwidth>
4096      1024         float   sum     <time>    <bandwidth>   <bandwidth>
8192      2048         float   sum     <time>    <bandwidth>   <bandwidth>
16384     4096         float   sum     <time>    <bandwidth>   <bandwidth>
32768     8192         float   sum     <time>    <bandwidth>   <bandwidth>
65536     16384        float   sum     <time>    <bandwidth>   <bandwidth>
131072    32768        float   sum     <time>    <bandwidth>   <bandwidth>
262144    65536        float   sum     <time>    <bandwidth>   <bandwidth>
524288    131072       float   sum     <time>    <bandwidth>   <bandwidth>
1048576   262144       float   sum     <time>    <bandwidth>   <bandwidth>
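The size and count columns of the report above follow a simple doubling pattern. The sketch below regenerates them for a given range; `size_sweep` is a hypothetical helper, and the default step of 2^1 is inferred from the table, which doubles at every row although the command does not pass --size_range_inc.

```python
def size_sweep(min_bytes, max_bytes, size_range_inc=1, dtype_bytes=4):
    """Return (size_in_bytes, element_count) pairs from min to max inclusive.

    Each step multiplies the size by 2**size_range_inc; dtype_bytes=4
    corresponds to the float data type used in the report.
    """
    if min_bytes <= 0 or size_range_inc < 1:
        raise ValueError("min_bytes must be positive and size_range_inc >= 1")
    rows = []
    size = min_bytes
    while size <= max_bytes:
        rows.append((size, size // dtype_bytes))
        size *= 2 ** size_range_inc
    return rows
```

`size_sweep(32, 1048576)` yields the 16 rows listed in the report, from `(32, 8)` to `(1048576, 262144)`.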
Note: The following examples are applicable for Gaudi based and Host NIC scale-out.
All available MPI options are supported.
- For the different MPI running options, please refer to: https://www.open-mpi.org/faq/?category=running#mpirun
Configuration: One server with 8 ranks, 32 MB size, all_reduce collective, 1000 iterations
python3 run_hccl_demo.py --size 32m --test all_reduce --loop 1000 -mpi -np 8
Output example:
###############################################################################
[BENCHMARK] hcclAllReduce(src!=dst, count=8388608, dtype=float, iterations=1000)
[BENCHMARK] NW Bandwidth : <Test results> GB/s
[BENCHMARK] Algo Bandwidth : <Test results> GB/s
###############################################################################
Configuration: 16 ranks, 32 MB size, all_reduce collective, 1000 iterations
First option using MPI hostfile:
python3 run_hccl_demo.py --test all_reduce --loop 1000 --size 32m -mpi --hostfile hostfile.txt
- For MPI --hostfile option, please refer to: https://www.open-mpi.org/faq/?category=running#mpirun-hostfile
Second option using MPI host:
python3 run_hccl_demo.py --test all_reduce --loop 1000 --size 32m -mpi --host 10.111.12.234:8,10.111.12.235:8
- For MPI --host option, please refer to: https://www.open-mpi.org/faq/?category=running#mpirun-host
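The two options describe the same placement in different forms. As an illustration, the sketch below converts a --host style list into Open MPI hostfile text using the `slots=N` keyword; `hostfile_from_host_arg` is a hypothetical helper, and defaulting to one slot when no count is given is an assumption modeled on Open MPI's usual behavior.

```python
def hostfile_from_host_arg(host_arg):
    """Turn 'hostA:8,hostB:8' into Open MPI hostfile lines 'hostA slots=8'."""
    lines = []
    for entry in host_arg.split(","):
        host, _, slots = entry.strip().partition(":")
        if not host:
            raise ValueError(f"empty host in {host_arg!r}")
        # Assumption: a host without an explicit count gets a single slot.
        lines.append(f"{host} slots={slots or 1}")
    return "\n".join(lines) + "\n"
```

Writing `hostfile_from_host_arg("10.111.12.234:8,10.111.12.235:8")` to `hostfile.txt` gives a file with one `slots=8` line per server.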
First server output:
###############################################################################
[BENCHMARK] hcclAllReduce(src!=dst, count=8388608, dtype=float, iterations=1000)
[BENCHMARK] NW Bandwidth : <Test results> GB/s
[BENCHMARK] Algo Bandwidth : <Test results> GB/s
###############################################################################
Running on 1 server:
Configuration: One server with 8 ranks, 32 MB size, all_reduce collective, 1 iteration, communicator includes only ranks 0 and 1:
HCCL_COMM_ID=127.0.0.1:5555 python3 run_hccl_demo.py --nranks 8 --node_id 0 --size 32m --test all_reduce --loop 1 --ranks_per_node 8 --custom_comm 0,1
Running on 2 servers with MPI (16 Gaudi devices):
* Note: When defining a custom communicator, each rank in the communicator must have at least one peer rank that is also included in it.
* The following examples use an MPI hostfile; the MPI --host option works as well.
Configuration: 16 ranks, 32 MB size, all_reduce collective, 1000 iterations, communicator includes only ranks 0 and 8:
python3 run_hccl_demo.py --test all_reduce --loop 1000 --size 32m --custom_comm 0,8 -mpi --hostfile hostfile.txt
Running on 2 servers without MPI (16 Gaudi devices):
Configuration: 16 ranks, 32 MB size, all_reduce collective, 1000 iterations, communicator includes only ranks 0,1,8,9:
First node:
HCCL_COMM_ID=10.111.12.234:5555 python3 run_hccl_demo.py --test all_reduce --nranks 16 --loop 1000 --node_id 0 --custom_comm 0,1,8,9
Second node:
HCCL_COMM_ID=10.111.12.234:5555 python3 run_hccl_demo.py --test all_reduce --nranks 16 --loop 1000 --node_id 1 --custom_comm 0,1,8,9
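To see which servers a custom communicator touches, the rank list can be grouped by node. The sketch below assumes ranks are numbered consecutively per server, so that rank r runs on node r // ranks_per_node; this matches the examples above (ranks 8 and 9 on the second of two 8-rank servers) but is not confirmed by the demo itself. Both helper names are hypothetical.

```python
def parse_custom_comm(text, nranks):
    """Parse a --custom_comm value such as '0,1,8,9' and validate it."""
    ranks = [int(token) for token in text.split(",")]
    if len(set(ranks)) != len(ranks):
        raise ValueError(f"duplicate ranks in {text!r}")
    for rank in ranks:
        if not 0 <= rank < nranks:
            raise ValueError(f"rank {rank} is outside 0..{nranks - 1}")
    return ranks

def ranks_by_node(ranks, ranks_per_node):
    """Group ranks by node index, assuming rank r runs on node r // ranks_per_node."""
    groups = {}
    for rank in sorted(ranks):
        groups.setdefault(rank // ranks_per_node, []).append(rank)
    return groups
```

For the last example, `ranks_by_node(parse_custom_comm("0,1,8,9", 16), 8)` returns `{0: [0, 1], 1: [8, 9]}`.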