Prime Collective Communications Library (PCCL)

<div align="center"> <picture> <source media="(prefers-color-scheme: dark)" srcset="docs/images/pccl_logo_dark.png"> <source media="(prefers-color-scheme: light)" srcset="docs/images/pccl_logo_light.png"> <img alt="Project logo" src="docs/images/pccl_logo_light.png" width="7200"> </picture> </div>

The Prime Collective Communications Library (PCCL) implements efficient, fault-tolerant collective communication operations, such as reductions, over IP. It provides shared-state synchronization mechanisms to keep peers in sync, allows peers to join and leave dynamically at any point during training, and performs automatic bandwidth-aware topology optimization. PCCL implements a novel TCP-based network protocol, "Collective Communications over IP" (CCoIP).
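The fault-tolerance model can be illustrated with a toy simulation (no PCCL calls; the peer vectors and the dropped peer are made up for illustration): when a collective fails because a peer left mid-operation, the caller shrinks the world and simply retries.

```python
def all_reduce_sum(peers: list[list[float]]) -> list[float]:
    """Stand-in for a collective sum-reduction across per-peer vectors."""
    return [sum(vals) for vals in zip(*peers)]

def reduce_with_retry(peers: list[list[float]], failing_peer: int) -> list[float]:
    """Simulate the PCCL usage pattern: on failure, drop the dead peer and retry."""
    attempt = 0
    while True:
        attempt += 1
        if attempt == 1 and failing_peer < len(peers):
            # first attempt fails because one peer disconnected mid-operation
            peers = peers[:failing_peer] + peers[failing_peer + 1:]
            continue  # retry with the shrunken world
        return all_reduce_sum(peers)

result = reduce_with_retry([[1.0, 2.0], [10.0, 20.0], [100.0, 200.0]], failing_peer=1)
print(result)  # → [101.0, 202.0], the contributions of the two surviving peers
```

The real library performs the failure detection and world-size bookkeeping for you; only the retry loop remains visible in application code, as the examples below show.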

Example

The following is a simplified example of an application using PCCL in C++. Refer to the documentation for further details and fault-tolerance considerations.

C++ Example

#include <pccl.h>


int main() {
    PCCL_CHECK(pcclInit());
    pcclCommCreateParams_t params {
        .master_address = { 
            .inet = {.protocol = inetIPv4, .ipv4 = { 127, 0, 0, 1 }},
            .port = 48148
        },
        .peer_group = 0, .p2p_connection_pool_size = 16
    };
    pcclComm_t* comm{};
    PCCL_CHECK(pcclCreateCommunicator(&params, &comm));
    PCCL_CHECK(pcclConnect(comm));
    
    // declare shared state (tensor infos elided for brevity)
    // ...
    pcclSharedState_t sstate{.revision = 0, .count = 1, .infos = infos};

    // training loop
    constexpr int MAX_STEPS = 1000; // placeholder step count
    int local_iter = 0;
    while (true) {
        if (local_iter > 0) {
            while (pcclUpdateTopology(comm) == pcclUpdateTopologyFailed) {
                // retry until the topology update succeeds
            }
        }
        int world_size{};
        PCCL_CHECK(pcclGetAttribute(comm, PCCL_ATTRIBUTE_GLOBAL_WORLD_SIZE, &world_size));
        
        // ...
        // synchronize shared state (no-op if the revision advanced correctly)
        while (pcclSynchronizeSharedState(comm, &sstate, &sync_info) == pcclSharedStateSyncFailed) {
            // retry until shared state synchronization succeeds
        }

        // Do some work here, e.g. training step
        float local_data[4];
        for (int k = 0; k < 4; k++) {
            local_data[k] = float(local_iter * 10 + (k + 1));
        }
        float result_data[4] = {};

        // perform collective operation
        while (world_size > 1 && pcclAllReduce(local_data, result_data, &desc, comm, &reduce_info) != pcclSuccess) {
            PCCL_CHECK(pcclGetAttribute(comm, PCCL_ATTRIBUTE_GLOBAL_WORLD_SIZE, &world_size)); // re-obtain world size
        }
        
        // ...
        
        // advance shared state revision
        sstate.revision++;
        if (sstate.revision >= MAX_STEPS) {
            break;
        }
        local_iter++;
    }
    PCCL_CHECK(pcclDestroyCommunicator(comm));
    return 0;
}

Python Example

The following is a simplified example of the same application using PCCL's Python bindings.

import torch
from pccl import Communicator, Attribute, TensorInfo, SharedState, PCCLError

device = torch.device("cpu")  # placeholder; use "cuda" where applicable

communicator: Communicator = Communicator("127.0.0.1:48148", peer_group=0, p2p_connection_pool_size=16)
communicator.connect(n_attempts=15)

shared_state_dict = {
    "myWeights": torch.zeros(8, dtype=torch.float32).to(device),
}

# declare shared state
entries = [TensorInfo.from_torch(tensor, name, allow_content_inequality=False) for name, tensor in
           shared_state_dict.items()]
shared_state: SharedState = SharedState(entries)

local_iter = 0
while True:
    if local_iter > 0:
        communicator.update_topology()
    world_size = communicator.get_attribute(Attribute.GLOBAL_WORLD_SIZE)

    # synchronize shared state
    sync_info = communicator.sync_shared_state(shared_state)
    
    # Do some work here, e.g. training step
    local_data = torch.rand(4).to(device)
    result_data = torch.zeros_like(local_data)

    # perform collective operation
    while world_size > 1:
        try:
            communicator.all_reduce(local_data, result_data, desc)
            break
        except PCCLError as e:
            print(f"AllReduce failed => {e}. Retrying...")
            world_size = communicator.get_attribute(Attribute.GLOBAL_WORLD_SIZE)  # re-obtain world size
    
    local_iter += 1
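The retry pattern around the collective above generalizes: any operation that can fail when a peer drops is wrapped in a loop that refreshes its view of the world and tries again. A PCCL-free sketch of that pattern (`RetryableError` and the flaky operation are stand-ins for `PCCLError` and a collective, invented here for illustration):

```python
class RetryableError(Exception):
    """Stand-in for PCCLError: signals a failed but safely retryable operation."""

def make_flaky_op(failures: int):
    """Build an operation that fails `failures` times, then succeeds."""
    state = {"remaining": failures}
    def op() -> str:
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise RetryableError("peer dropped mid-operation")
        return "ok"
    return op

def run_with_retry(op, max_attempts: int = 10) -> str:
    """Retry until success, as the training loops above do for collectives."""
    for _ in range(max_attempts):
        try:
            return op()
        except RetryableError:
            continue  # e.g. re-obtain the world size here, then retry
    raise RuntimeError("operation did not succeed within max_attempts")

print(run_with_retry(make_flaky_op(failures=2)))  # succeeds on the third attempt
```

The key property making this safe in PCCL is that a failed collective leaves the input buffers intact, so re-issuing the operation is well-defined; see the documentation for the exact guarantees.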

To install PCCL from PyPI, run the following command:

pip install pypccl

Prerequisites

  • Git
  • CMake (3.22.1 or higher)
  • C++ compiler with C++20 support (MSVC 17+, gcc 12+ or clang 12+)
  • Python 3.12+ (if bindings are used)
  • NVIDIA CUDA Computing Toolkit v12+ (if building with CUDA support)

Supported Operating Systems

  • Windows
  • macOS
  • Linux

Supported architectures

PCCL aims to be compatible with all architectures. While specialized kernels exist to optimize crucial operations like CRC32 hashing and quantization, fallback to a generic implementation should always be possible. Feel free to create issues for architecture-induced compilation failures.
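As an illustration of the generic-fallback idea, a portable, table-driven CRC-32 in plain Python produces the same digest as an optimized implementation (zlib's C routine stands in for a specialized kernel here; this is a sketch, not PCCL's actual code):

```python
import zlib

def make_crc32_table() -> list[int]:
    """Precompute the 256-entry lookup table for CRC-32 (reflected polynomial 0xEDB88320)."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0xEDB88320 if crc & 1 else crc >> 1
        table.append(crc)
    return table

_TABLE = make_crc32_table()

def crc32_generic(data: bytes) -> int:
    """Portable fallback path: one table lookup per input byte."""
    crc = 0xFFFFFFFF
    for b in data:
        crc = (crc >> 8) ^ _TABLE[(crc ^ b) & 0xFF]
    return crc ^ 0xFFFFFFFF

payload = b"collective communications over ip"
assert crc32_generic(payload) == zlib.crc32(payload)  # fallback matches optimized path
```

Hardware-specific kernels (e.g. using CRC instructions on x86_64 or aarch64) compute the same function faster; since the result is identical, any architecture can fall back to the generic path.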

For optimal performance, we recommend compiling with an OpenMP-capable compiler. PCCL will compile without OpenMP, but performance may degrade significantly. NOTE: The default clang distribution on macOS does not support OpenMP! We recommend compiling with clang as distributed by the llvm Homebrew package.

Explicitly supported are:

  • x86_64
  • aarch64 (incl. Apple Silicon)

Building

Installing prerequisites

This section describes one way to install the prerequisites for building PCCL on Windows, macOS, and Linux.

Windows

With the winget package manager installed & up-to-date from the Microsoft Store, you can install the prerequisites as follows:

winget install Microsoft.VisualStudio.2022.Community --silent --override "--wait --quiet --add ProductLang En-us --add Microsoft.VisualStudio.Workload.NativeDesktop --includeRecommended"
winget install Git.Git --source winget
winget install Kitware.CMake --source winget
winget install Python.Python.3.12 --source winget # (if you want to use the Python bindings)

After installing these packages, refresh your PATH by restarting explorer.exe in the Task Manager and opening a new terminal from that Explorer instance.

Installing CUDA on Windows

Go to https://developer.nvidia.com/cuda-downloads and download & click through the CUDA Toolkit installer.

CAUTION: It is crucial to perform this installation after installing Visual Studio, because the CUDA installer only installs the "Visual Studio Integration" if Visual Studio is present at the time the installer runs. If you freshly installed Visual Studio but still have an old CUDA Toolkit installation lying around, you will have to uninstall and reinstall CUDA. Make sure "Visual Studio Integration" has status "Installed" in the installer summary. Without the Visual Studio Integration of the CUDA Toolkit, the CMake generation phase will fail in a specific way documented in the build section below.

macOS

xcode-select --install # if not already installed

# install Homebrew package manager if not already installed
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

brew install git # if not already installed by xcode command line tools
brew install cmake
brew install python@3.12 # (if you want to use the Python bindings)
brew install llvm libomp # when OpenMP support is desired (strongly recommended!)

We recommend using Python distributions from Homebrew to avoid conflicts with the system Python. Additionally, Homebrew Python is built to allow attaching debuggers such as lldb and gdb, making it possible to debug Python and native code end to end.

After installing the llvm package, make sure the clang compiler from said package is first in your PATH.

First, check the install location of the llvm package:

brew --prefix llvm

Depending on your system this path might look like /opt/homebrew/opt/llvm.

Then, in the bin directory of the llvm package, you will find the clang compiler:

ls /opt/homebrew/opt/llvm/bin | grep clang

We will then export variables for the current shell to use the clang compiler from the llvm package:

export CC=/opt/homebrew/opt/llvm/bin/clang
export CXX=/opt/homebrew/opt/llvm/bin/clang++

NOTE: These variables will only be set for the current shell session. To make these changes permanent, add these lines to your shell profile (e.g. ~/.bash_profile or ~/.zshrc).

Ubuntu

sudo apt update
sudo apt install -y build-essential
sudo apt install -y git
sudo apt install -y cmake

# (if you want to use the Python bindings)
sudo apt install -y python3.12 python3.12-venv python3-pip

Installing CUDA (if not already installed)

The NVIDIA CUDA Computing Toolkit can be installed using any prevalent method as long as nvcc ends up in the system PATH of the shell that performs the cmake build.
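Whether nvcc is reachable from the build shell can be checked programmatically; a small sketch using `shutil.which`, which performs the same PATH search that CMake's compiler detection effectively relies on (the helper name is made up here):

```python
import shutil

def nvcc_on_path() -> bool:
    """True if the nvcc compiler driver is discoverable on PATH."""
    return shutil.which("nvcc") is not None

if nvcc_on_path():
    print("nvcc found; CUDA support can be built")
else:
    print("nvcc not on PATH; install CUDA or fix PATH before configuring with CUDA")
```

Running this in the same shell you will invoke cmake from avoids surprises where nvcc is installed but only on the PATH of a different shell or user.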

Install using nvidia provided apt repository

This is the recommended way to install the CUDA toolkit on Ubuntu. Unless you have a good reason to deviate (such as needing a custom driver like the geohot p2p driver), you should stick to this method.

Go to https://developer.nvidia.com/cuda-downloads and follow the instructions provided.

Install using .run file (not recommended on Ubuntu)

It is also possible to install CUDA via the NVIDIA-provided .run file. This is not the recommended approach on Ubuntu, but there can be good reasons to avoid the system packages, for example when using a custom-built NVIDIA driver such as the geohot p2p driver: NVIDIA driver-related packages, including the userspace libraries and the CUDA toolkit, may bring kernel-module dependencies that can be undesirable in such setups.
