XCCL - Mellanox Collective Communication Library

The XCCL library framework is a continuous extension of the advanced research and development on extreme-scale collective communications used by Mellanox for HPC, AI/ML application domains. The XCCL library implements "Teams API" concepts which is flexible and feature-rich for current and emerging programming models and runtimes.

XCCL Goals

Provides collective operations for HPC and AI/ML programming models
Enables hierarchial collectives (dynamic and static hiearchies)
Enables direct use of hardware collectives by programming model
Supports a variety of resource allocation models
Supports relaxed ordering model
Supports a variery of synchronization models
Supports repetitive collective operations (init once and invoke multiple times)
Support point-to-point operations in the context of group
Supports global memory management
Support multiple vendors' open and proprietary plugins

The library consists of two layers:

XCCL - teams collective communication layer, is the lower layer and implements a subset of the Teams API under consideration by the UCF Collectives WG for the following:
- A UCX Team
- A SHARP Team
- A shared memory Team (proprietary)
- A VNC - Hardware multicast Team (proprietary)
XCCL - is the upper layer and implements a light-weight, highly scalable framework for expressing hierarchical collectives in terms of the Team abstraction.

Quick Start Guide

The SHARP teams require Mellanox's SHARP software library, and the hardware multicast team requires Mellanox's VMC software library.

Build and install XCCL and XCCL libraries:

HPCX can be downloaded from https://www.mellanox.com/products/hpc-x-toolkit

# Line below is needed for all "HPCX_*" variables used in examples below
% module load /path/to/hpcx/dir/modulefiles/hpcx-stack
% export XCCL_DIR=$PWD/xccl

% git clone https://github.com/openucx/xccl.git $XCCL_DIR
% cd $XCCL_DIR
% ./autogen.sh
% ./configure --prefix=$PWD/install --with-vmc=$HPCX_VMC_DIR \ 
  --with-ucx=$HPCX_UCX_DIR --with-sharp=$HPCX_SHARP_DIR
% make -j install

Build and install Open MPI :

OpenMPI is taken from PR open-mpi/ompi#7409

% export OMPI_XCCL_DIR=$PWD/ompi-xccl
% git clone https://github.com/open-mpi/ompi
% cd $OMPI_XCCL_DIR
% git fetch origin pull/7409/head
% git submodule update --init --recursive
% ./autogen.pl
% ./configure --prefix=$OMPI_XCCL_DIR/install \
  --with-platform=contrib/platform/mellanox/optimized \
  --with-xccl=$XCCL_DIR/install
% make -j install

Run :

Example shows how to run osu_allreduce benchmark (https://mvapich.cse.ohio-state.edu/benchmarks/) with XCCL support

% export LD_LIBRARY_PATH="$XCCL_DIR/install/lib:$XCCL_DIR/install/lib/xccl"
% export LD_LIBRARY_PATH="$OMPI_XCCL_DIR/install/lib:$LD_LIBRARY_PATH"
% export nnodes=2 ppn=28

% mpirun -np $((nnodes*ppn)) --map-by ppr:$ppn:node --bind-to core ./osu_allreduce -f

Performance

One-level Allreduce: SHARP team

Helios Cluster: EDR 16 nodes, 1 process-per-node

OSU Allreduce

msglen	HCOLL (SHARP)	XCCL
4	2.81	2.21
8	2.75	2.04
16	2.74	2.21
32	2.78	2.09
64	2.73	2.18
128	2.88	2.15
256	3.57	2.59
512	3.77	2.86

Two-level Allreduce: shared memory socket team, shared memory NUMA team

Single node POWER9 168 threads

OSU Allreduce

msglen	hcoll	xccl
4	4.6	5.6
8	4.53	5.6
16	4.65	5.72
32	4.66	5.86
64	4.84	6.47
128	5.47	7.26
256	6.13	8.51
512	7.41	11.23
1024	9.18	15.93
2048	12.5	25.18

Three-level Broadcast: UCX team, UCX team, Hardware Multicast team (VMC) :

Hercules test bed: HDR100 110 nodes, 32 processes-per-node

OSU Bcast

msglen	hcoll	xccl
1	5.33	4.53
2	4.62	4.48
4	4.51	4.33
8	4.73	4.45
16	4.36	4.24
32	4.44	4.80
64	4.48	5.30
128	4.64	6.30
256	5.49	6.69
512	5.88	7.39
1024	6.45	8.46
2048	7.94	9.90
4096	11.30	13.42
8192	17.09	19.49
16384	30.41	30.95
32768	38.37	41.54

Publications

This framework is a continuous extension of the advanced research and development on extreme-scale collective communications published in the following scientific papers. The shared memory team is code ported from the HCOLL shared memory BCOL component. The XCCL layer is a "distillate" of the HCOLL framework. The HCOLL framework began its life as the Cheetah framework:

Cheetah: A Framework for Scalable Hierarchical Collective Operations
Date: May 2011
Publication description: IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing (CCGRID)
ConnectX-2 CORE-Direct Enabled Asynchronous Broadcast Collective Communications
Date: May 2011
Publication: First Workshop on Communication Architecture for Scalable Systems (CASS) held in conjunction with the International Parallel and Distributed Processing Symposium (IPDPS)
Design and Implementation of Broadcast Algorithms for Extreme-Scale Systems
Date: Sept 2011
Publication: IEEE Cluster 2011
Analyzing the Effect of Multicore Architectures and On-host Communication Characteristics on Collective Communications
Date: Sept 2011
Publication: Workshop on Scheduling and Resource Management for Parallel and Distributed Systems held in conjunction with the International Conference on Parallel Processing (ICPP)
Assessing the Performance and Scalability of a Novel K-Nomial Allgather on CORE-Direct Systems
Date: Aug 2012
Publication: 18th International Conference, Euro-Par 2012
Exploring the All-to-All Collective Optimization Space with ConnectX CORE-Direct
Date: Sept 2012
Publication: 41st International Conference on Parallel Processing, ICPP 2012
Optimizing Blocking and Nonblocking Reduction Operations for Multicore Systems: Hierarchical Design and Implementation
Date: Sept 2013
Publication: IEEE Cluster 2013

Name		Name	Last commit message	Last commit date
Latest commit History 188 Commits
.github/workflows		.github/workflows
external		external
m4		m4
src		src
test		test
.gitignore		.gitignore
LICENSE		LICENSE
Makefile.am		Makefile.am
README.md		README.md
autogen.sh		autogen.sh
configure.ac		configure.ac
cudalt.py		cudalt.py
ompi_coll_xccl.patch		ompi_coll_xccl.patch

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

XCCL - Mellanox Collective Communication Library

XCCL Goals

The library consists of two layers:

Quick Start Guide

Build and install XCCL and XCCL libraries:

Build and install Open MPI :

Run :

Performance

One-level Allreduce: SHARP team

Two-level Allreduce: shared memory socket team, shared memory NUMA team

Three-level Broadcast: UCX team, UCX team, Hardware Multicast team (VMC) :

Publications

About

Releases

Packages

Contributors 8

Languages

License

openucx/xccl

Folders and files

Latest commit

History

Repository files navigation

XCCL - Mellanox Collective Communication Library

XCCL Goals

The library consists of two layers:

Quick Start Guide

Build and install XCCL and XCCL libraries:

Build and install Open MPI :

Run :

Performance

One-level Allreduce: SHARP team

Two-level Allreduce: shared memory socket team, shared memory NUMA team

Three-level Broadcast: UCX team, UCX team, Hardware Multicast team (VMC) :

Publications

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 8

Languages

Packages