UCCL
About
UCCL is an efficient communication library for GPUs, covering collectives, P2P (e.g., KV cache transfer, RL weight transfer), and EP (e.g., IBGDA), with two key focuses:
- Flexibility for high performance in fast-evolving ML workloads
- Portability for connecting heterogeneous GPUs in ML workloads
A UCCL overview can be found in this slide deck. UCCL has the following components:
- UCCL-collective serves as a drop-in replacement for NCCL/RCCL (i.e., requiring no changes to application code), and significantly outperforms them in both latency and throughput across various settings.

  <details>
  <summary>UCCL-collective performance comparison</summary>

  - On six HGX servers (across two racks) with 8x400G CX-7 RoCE NICs and 8xH100 GPUs, UCCL-collective outperforms NCCL by up to 2.5x for AllReduce:
    <p align="left"> <img src="./docs/images/allreduce_6_hgx.png" alt="" width="600"> </p>
  - On two AWS `g4dn.8xlarge` instances with 1x50G ENA NICs and 1xT4 GPUs within the same cluster placement group, UCCL-collective outperforms NCCL by up to 3.7x for AllReduce:
    <p align="left"> <img src="./docs/images/allreduce_2_g4dn.png" alt="" width="600"> </p>

  </details>
- UCCL-collective aims to:
  - rearchitect the CCL layer (while keeping NCCL APIs) to unleash the full potential of network hardware
  - rearchitect the network transport layer to be fast and extensible
  - support heterogeneous GPU and networking vendors such as Nvidia, AMD, and Broadcom
  - become an open and collaborative platform for GPU communication research
- UCCL-collective has built a fast and extensible transport layer in software, which brings several benefits:
  - For example, existing network transports under NCCL (i.e., kernel TCP and RDMA) use one or a few network paths to stream huge data volumes, and are thus prone to congestion in datacenter networks.
  - Instead, UCCL-collective employs packet spraying in software to use the many available network paths and avoid a "single path of congestion".
  - More benefits include: 1) packet spraying with 256 paths, 2) advanced congestion control such as latency-based and receiver-driven schemes, 3) efficient loss recovery by selective repeat, and 4) wide usability in public clouds with legacy NICs and Ethernet. Feel free to check out our full technical report.
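To make packet spraying and selective repeat concrete, here is a toy, in-memory simulation. It is only a sketch and not UCCL's implementation: the round-robin path choice, the `Receiver` class, and the `send` helper are all assumptions made for illustration, and congestion control is left out entirely. It shows a message being split into packets, spread across many paths, and only the lost packets being resent.

```python
from dataclasses import dataclass, field

NUM_PATHS = 256  # path count quoted in the list above


@dataclass
class Receiver:
    """Buffers out-of-order packets and reports which sequence numbers are missing."""
    total: int
    received: dict = field(default_factory=dict)

    def deliver(self, seq: int, payload: bytes) -> None:
        self.received[seq] = payload

    def missing(self) -> list:
        return [s for s in range(self.total) if s not in self.received]

    def message(self) -> bytes:
        return b"".join(self.received[s] for s in range(self.total))


def packetize(data: bytes, mtu: int) -> list:
    """Split a message into MTU-sized payloads; the list index is the sequence number."""
    return [data[i:i + mtu] for i in range(0, len(data), mtu)]


def spray(seqs, num_paths: int = NUM_PATHS, start: int = 0) -> dict:
    """Assign each sequence number to a path in round-robin order (an assumed policy)."""
    return {seq: (start + i) % num_paths for i, seq in enumerate(seqs)}


def send(data: bytes, mtu: int, failed_paths=frozenset(),
         num_paths: int = NUM_PATHS, max_rounds: int = 16):
    """Spray packets over paths, drop those on failed paths, resend only the gaps.

    Returns (reassembled message, sequence numbers that needed retransmission).
    """
    packets = packetize(data, mtu)
    rx = Receiver(total=len(packets))
    pending = list(range(len(packets)))
    retransmitted = []
    for rnd in range(max_rounds):
        # Shift the starting path each round so resent packets take different paths.
        for seq, path in spray(pending, num_paths, start=rnd).items():
            if path not in failed_paths:
                rx.deliver(seq, packets[seq])
        pending = rx.missing()  # selective repeat: only the gaps are resent
        if not pending:
            return rx.message(), retransmitted
        retransmitted.extend(pending)
    raise RuntimeError("message not delivered within max_rounds")
```

With a single-path transport, one congested or failed path stalls the whole stream; in this sketch a failed path costs only the few packets that happened to be assigned to it.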
- UCCL-P2P provides both NIXL-style initiator-target transfer APIs and NCCL-style collective APIs, with the same or better performance than both. UCCL-P2P is purpose-built for next-generation 800 Gbps NICs, with efficient multi-threaded transfer engines.

  <details>
  <summary>UCCL-P2P performance comparison</summary>

  - Message transfer bandwidth over RDMA on AMD MI300X + Broadcom Thor-2:
    <p align="left"> <img src="./docs/images/p2p-mi300x-thor2.png" alt="" width="600"> </p>

  </details>
- UCCL-EP allows running DeepEP on top of heterogeneous hardware platforms, including AMD and Nvidia GPUs and any RDMA NICs such as AWS EFA NICs and Broadcom NICs, while achieving IBGDA-level performance.

  <details>
  <summary>UCCL-EP performance comparison</summary>

  - EP32 dispatch and combine on AWS p5en (8x H200 + 16x 200Gb/s EFA):
    <p align="left"> <img src="./docs/images/ep32_dispatch_p5en.png" alt="" width="300" style="display:inline-block; vertical-align:middle; margin-right:10px;"> <img src="./docs/images/ep32_combine_p5en.png" alt="" width="300" style="display:inline-block; vertical-align:middle;"> </p>

  </details>

UCCL has been adopted as part of the AMD TheRock ecosystem.
Road Map
More UCCL features are under development in this repo, currently including:
- ✅ More efficient KV cache transfer engine (e.g., better Mooncake)
  - ✅ Supporting AMD GPUs
  - 🚧 Supporting RDMA (NVIDIA, Broadcom), AWS EFA, GCP TCPX, TCP
- ✅ Efficient and portable expert-parallel communication
  - ✅ Supporting all NIC vendors, including Nvidia, AWS EFA, and Broadcom
  - ✅ Supporting AMD GPUs
  - 🚧 Better flow control to avoid congestion
  - ☐ Supporting other AI accelerators, such as TPUs and Trainium
- 🚧 Re-architecting NCCL to unleash network hardware performance
  - 🚧 Scalable and efficient CPU proxy
  - ☐ Fast async collectives with compute-communication ordering guarantee
  - ☐ Device kernels in vendor-agnostic Triton language
  - ☐ Dynamic membership with GPU servers joining and exiting
Quick Start
The easiest way to use UCCL is to build it for your platform first. The build script automatically detects the Python version of your current environment. To compile UCCL for a specific Python version, pass it as `py_version`, e.g., `3.10`.
git clone https://github.com/uccl-project/uccl.git --recursive && cd uccl
# E.g., bash build.sh cu12 ep --install
bash build.sh [cu12|cu13|rocm|therock] [all|ccl_rdma|ccl_efa|p2p|ep] \
[py_version] [rocm_index_url] --install
Note:
- By default, `build.sh cu12` targets CUDA 12.8 and `build.sh rocm` targets ROCm 7.1, but you can also specify `cu13|rocm6` to target CUDA 13.0 or ROCm 6.4.
- UCCL uses nanobind for C++/Python bindings. On Python 3.12+, wheels are tagged `cp312-abi3` (stable ABI, one wheel for all 3.12+ interpreters); on older Pythons, wheels are CPython-version-specific.
- When building for ROCm with Python packaging through TheRock, please specify your ROCm index URL; the default is `https://rocm.prereleases.amd.com/whl/gfx94X-dcgpu` and it may not be what you want. When installing UCCL wheels for TheRock, please provide pip with the index URL and add the optional extra `[rocm]` to the wheel, e.g., `pip install --extra-index-url https://rocm.prereleases.amd.com/whl/gfx94X-dcgpu wheelhouse-therock/uccl-0.0.1.post4-py3-none-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl[rocm]`.
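If you drive builds from a script, a small wrapper can catch argument mistakes before `build.sh` runs. The sketch below is not part of UCCL; `build_command` is a hypothetical helper. It assumes the optional arguments are positional in the order shown in the usage line above (so `rocm_index_url` can only follow `py_version`), and it takes the accepted targets and components from the usage line plus the `rocm6` target mentioned in the note.

```python
# Hypothetical helper, not shipped with UCCL: assembles the build.sh argument list.
TARGETS = ("cu12", "cu13", "rocm", "rocm6", "therock")  # rocm6 comes from the note above
COMPONENTS = ("all", "ccl_rdma", "ccl_efa", "p2p", "ep")


def build_command(target, component, py_version=None, rocm_index_url=None, install=True):
    """Return the argv list for `bash build.sh ...`, validating the choices first."""
    if target not in TARGETS:
        raise ValueError(f"unknown target {target!r}; expected one of {TARGETS}")
    if component not in COMPONENTS:
        raise ValueError(f"unknown component {component!r}; expected one of {COMPONENTS}")
    # Assumption: arguments are positional, so the index URL needs py_version before it.
    if rocm_index_url is not None and py_version is None:
        raise ValueError("py_version must be given when rocm_index_url is set")

    cmd = ["bash", "build.sh", target, component]
    if py_version is not None:
        cmd.append(py_version)
    if rocm_index_url is not None:
        cmd.append(rocm_index_url)
    if install:
        cmd.append("--install")
    return cmd


if __name__ == "__main__":
    # Reproduces the example from the usage comment: bash build.sh cu12 ep --install
    print(" ".join(build_command("cu12", "ep")))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root.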
Then, when running your PyTorch applications, set the environment variables accordingly:
# NCCL over IB/RoCE on x86 or GH200 ARM hosts
NCCL_NET_PLUGIN=`python -c "import uccl; print(uccl.nccl_plugin_path())"`
# RCCL over IB/RoCE on x86 hosts
NCCL_NET_PLUGIN=`python -c "import uccl; print(uccl.rccl_plugin_path())"`
# NCCL over AWS EFA NICs (p4d and p4de only)
LD_PRELOAD=`python -c "import uccl; print(uccl.efa_nccl_path())"`
NCCL_NET_PLUGIN=`python -c "import uccl; print(uccl.efa_plugin_path())"`
Now, you can just run your PyTorch applications and enjoy UCCL performance benefits!
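If you launch training from Python rather than a shell, the same variables can be set on the child process environment. This is a minimal sketch: `uccl_env` is a hypothetical helper, and the `torchrun train.py` launch in the comment is only an illustrative command. The `uccl.nccl_plugin_path()` and `uccl.efa_nccl_path()` calls are the ones used in the shell commands above. Note that `LD_PRELOAD` only takes effect when a process starts, so it has to be set by the launcher rather than inside the training script.

```python
import os


def uccl_env(plugin_path, preload_path=None, base_env=None):
    """Return a copy of the environment with the UCCL plugin variables set.

    plugin_path  -> NCCL_NET_PLUGIN
    preload_path -> prepended to LD_PRELOAD (only needed for the EFA setup above)
    """
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_NET_PLUGIN"] = plugin_path
    if preload_path is not None:
        existing = env.get("LD_PRELOAD", "")
        # Keep any libraries that were already preloaded.
        env["LD_PRELOAD"] = f"{preload_path}:{existing}" if existing else preload_path
    return env


# Example launch (requires the uccl wheel; train.py is a placeholder script name):
#
#   import subprocess, uccl
#   env = uccl_env(uccl.nccl_plugin_path())
#   subprocess.run(["torchrun", "train.py"], env=env, check=True)
```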
Dev Guide
<details>
<summary>Click me</summary>

First clone the UCCL repo and init submodules:
git clone https://github.com/uccl-project/uccl.git --recursive
export UCCL_HOME=$(pwd)/uccl
To build UCCL for development, you need to install some common dependencies:
# Note if you are using docker+wheel build, there is no need to install the following dependencies.
sudo apt update
sudo apt install linux-tools-$(uname -r) clang llvm cmake m4 build-essential \
net-tools libgtest-dev libgflags-dev \
libelf-dev libpcap-dev libc6-dev-i386 libpci-dev \
libopenmpi-dev libibverbs-dev clang-format -y
# Install and activate Miniconda (you can choose any recent versions)
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash ./Miniconda3-latest-Linux-x86_64.sh -b
source ~/miniconda3/bin/activate
source ~/.bashrc # or .zshrc and others
conda init
# Install python ssh lib and more
pip install paramiko intervaltree pybind11 nanobind
# Upgrade conda's libstdc++/libgcc runtime libraries to modern versions
conda install -c conda-forge "libstdcxx-ng>=12" "libgcc-ng>=12"
For quick installation with Docker, you can dive directly into:
- UCCL-Collective RDMA: Collectives for Nvidia/AMD GPUs + IB/RoCE RDMA NICs (currently supports Nvidia and Broadcom NICs)
- UCCL-Collective EFA: Collectives for AWS EFA NICs (currently supports p4d.24xlarge)
  - On p5/p5e/p5en/p6, the official aws-ofi-nccl NCCL plugin with proper env variables already makes NCCL perform excellently
- UCCL-Collective AFXDP: Collectives for non-RDMA NICs (currently supports AWS ENA NICs and IBM VirtIO NICs)
- UCCL-P2P: P2P for RDMA NICs and GPU IPCs (currently supports Nvidia/AMD GPUs and Nvidia/Broadcom NICs)
- UCCL-EP: EP for MoE training and inference with DeepEP-compatible APIs (currently supports Nvidia/AMD GPUs and Nvidia/Broadcom/EFA NICs)

</details>

Citation
The code in this repository is mostly described in the papers below. Please consider citing this work if you find the repository helpful.
@article{uccl_transp
