# CUDA Core Compute Libraries (CCCL)
Welcome to the CUDA Core Compute Libraries (CCCL), where our mission is to make CUDA more delightful.

This repository unifies three essential CUDA C++ libraries into a single, convenient repository: Thrust, CUB, and libcudacxx.

The goal of CCCL is to provide CUDA C++ developers with building blocks that make it easier to write safe and efficient code. Bringing these libraries together streamlines your development process and broadens your ability to leverage the power of CUDA C++. For more information about the decision to unify these projects, see the announcement here.
## Overview
The concept for the CUDA Core Compute Libraries (CCCL) grew organically out of the Thrust, CUB, and libcudacxx projects that were developed independently over the years with a similar goal: to provide high-quality, high-performance, and easy-to-use C++ abstractions for CUDA developers. Naturally, there was a lot of overlap among the three projects, and it became clear the community would be better served by unifying them into a single repository.
- **Thrust** is the C++ parallel algorithms library which inspired the introduction of parallel algorithms to the C++ Standard Library. Thrust's high-level interface greatly enhances programmer productivity while enabling performance portability between GPUs and multicore CPUs via configurable backends that allow using multiple parallel programming frameworks (such as CUDA, TBB, and OpenMP).
- **CUB** is a lower-level, CUDA-specific library designed for speed-of-light parallel algorithms across all GPU architectures. In addition to device-wide algorithms, it provides cooperative algorithms like block-wide reduction and warp-wide scan, providing CUDA kernel developers with building blocks to create speed-of-light, custom kernels.
- **libcudacxx** is the CUDA C++ Standard Library. It provides an implementation of the C++ Standard Library that works in both host and device code. Additionally, it provides abstractions for CUDA-specific hardware features like synchronization primitives, cache control, atomics, and more.
The main goal of CCCL is to fill a similar role that the Standard C++ Library fills for Standard C++: provide general-purpose, speed-of-light tools to CUDA C++ developers, allowing them to focus on solving the problems that matter. Unifying these projects is the first step towards realizing that goal.
## Example
This is a simple example demonstrating the use of CCCL functionality from Thrust, CUB, and libcudacxx.
It shows how to use Thrust/CUB/libcudacxx to implement a simple parallel reduction kernel.
Each thread block computes the sum of a subset of the array using cub::BlockReduce.
The sum of each block is then reduced to a single value using an atomic add via cuda::atomic_ref from libcudacxx.
It then shows how the same reduction can be done using Thrust's reduce algorithm and compares the results.
```cpp
#include <thrust/execution_policy.h>
#include <thrust/device_vector.h>
#include <cub/block/block_reduce.cuh>
#include <cuda/atomic>
#include <cuda/cmath>
#include <cuda/std/span>
#include <cassert>
#include <cstdio>

template <int block_size>
__global__ void reduce(cuda::std::span<int const> data, cuda::std::span<int> result) {
  using BlockReduce = cub::BlockReduce<int, block_size>;
  __shared__ typename BlockReduce::TempStorage temp_storage;

  int const index = threadIdx.x + blockIdx.x * blockDim.x;
  int sum = 0;
  if (index < data.size()) {
    sum += data[index];
  }

  // Block-wide reduction: each block computes the sum of its subset
  sum = BlockReduce(temp_storage).Sum(sum);

  // Thread 0 of each block atomically accumulates the block's sum
  if (threadIdx.x == 0) {
    cuda::atomic_ref<int, cuda::thread_scope_device> atomic_result(result.front());
    atomic_result.fetch_add(sum, cuda::memory_order_relaxed);
  }
}

int main() {
  // Allocate and initialize input data
  int const N = 1000;
  thrust::device_vector<int> data(N);
  thrust::fill(data.begin(), data.end(), 1);

  // Allocate output data
  thrust::device_vector<int> kernel_result(1);

  // Compute the sum reduction of `data` using a custom kernel
  constexpr int block_size = 256;
  int const num_blocks = cuda::ceil_div(N, block_size);
  reduce<block_size><<<num_blocks, block_size>>>(
      cuda::std::span<int const>(thrust::raw_pointer_cast(data.data()), data.size()),
      cuda::std::span<int>(thrust::raw_pointer_cast(kernel_result.data()), 1));

  auto const err = cudaDeviceSynchronize();
  if (err != cudaSuccess) {
    std::printf("Error: %s\n", cudaGetErrorString(err));
    return -1;
  }

  int const custom_result = kernel_result[0];

  // Compute the same sum reduction using Thrust
  int const thrust_result = thrust::reduce(thrust::device, data.begin(), data.end(), 0);

  // Ensure the two solutions are identical
  std::printf("Custom kernel sum: %d\n", custom_result);
  std::printf("Thrust reduce sum: %d\n", thrust_result);
  assert(custom_result == thrust_result);
  return 0;
}
```
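Assuming the example above is saved as `example.cu` (a hypothetical filename), it can be built with `nvcc` along these lines. `-std=c++17` is shown because the example uses `cuda::std::span`; exact flags may vary with your toolkit version and target architecture:

```bash
# Compile the example with nvcc and run it on a CUDA-capable machine
nvcc -std=c++17 example.cu -o example
./example
```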
## Getting Started

### Users
Everything in CCCL is header-only. Therefore, users need only concern themselves with how they get the header files and how they incorporate them into their build system.
#### CUDA Toolkit
The easiest way to get started using CCCL is via the CUDA Toolkit which includes the CCCL headers.
When you compile with nvcc, it automatically adds CCCL headers to your include path so you can simply #include any CCCL header in your code with no additional configuration required.
If compiling with another compiler, you will need to update your build system's include search path to point to the CCCL headers in your CTK install (e.g., /usr/local/cuda/include).
```cpp
#include <thrust/device_vector.h>
#include <cub/cub.cuh>
#include <cuda/std/atomic>
```
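For instance, a host-only translation unit using libcudacxx facilities can be built with a non-`nvcc` compiler by adding the Toolkit's include directory explicitly. This is a sketch: `/usr/local/cuda/include` is the typical default install location and may differ on your system, and `main.cpp` is a hypothetical filename:

```bash
# Point a non-nvcc host compiler at the CCCL headers shipped with the CUDA Toolkit
g++ -std=c++17 -I/usr/local/cuda/include main.cpp -o main
```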
#### GitHub
Users who want to stay on the cutting edge of CCCL development are encouraged to use CCCL from GitHub. Using a newer version of CCCL with an older version of the CUDA Toolkit is supported, but not the other way around. For complete information on compatibility between CCCL and the CUDA Toolkit, see our platform support.
Everything in CCCL is header-only, so cloning and including it in a simple project is as easy as the following:
```bash
git clone https://github.com/NVIDIA/cccl.git
nvcc -Icccl/thrust -Icccl/libcudacxx/include -Icccl/cub main.cu -o main
```
> **Note:** Use `-I` and not `-isystem` to avoid collisions with the CCCL headers implicitly included by `nvcc` from the CUDA Toolkit. All CCCL headers use `#pragma system_header` to ensure warnings will still be silenced as if using `-isystem`; see https://github.com/NVIDIA/cccl/issues/527 for more information.
#### Installation
The default CMake options generate only installation rules, so the familiar `cmake . && make install` workflow just works:
```bash
git clone https://github.com/NVIDIA/cccl.git
cd cccl
cmake . -DCMAKE_INSTALL_PREFIX=/usr/local
make install
```
A convenience script is also provided:
```bash
ci/install_cccl.sh /usr/local
```
##### Advanced installation using presets
CMake presets are also available with options for including experimental libraries:
```bash
cmake --preset install -DCMAKE_INSTALL_PREFIX=/usr/local
cmake --build --preset install --target install
```
Use the `install-unstable` preset to include experimental libraries, or `install-unstable-only` to install only experimental libraries.
#### Conda
CCCL also provides conda packages of each release via the conda-forge channel:
```bash
conda config --add channels conda-forge
conda install cccl
```
This will install the latest CCCL to the conda environment's `$CONDA_PREFIX/include/` and `$CONDA_PREFIX/lib/cmake/` directories.
It is discoverable by CMake via `find_package(CCCL)` and can be used by any compilers in the conda environment.
For more information, see this introduction to conda-forge.
If you want to use the same CCCL version that shipped with a particular CUDA Toolkit, e.g. CUDA 12.4, you can install CCCL with:
```bash
conda config --add channels conda-forge
conda install cuda-cccl cuda-version=12.4
```
The `cuda-cccl` metapackage installs the `cccl` version that shipped with the CUDA Toolkit corresponding to `cuda-version`.
If you wish to update to the latest `cccl` after installing `cuda-cccl`, uninstall `cuda-cccl` before updating `cccl`:
```bash
conda uninstall cuda-cccl
conda install -c conda-forge cccl
```
> **Note:** There are also conda packages with names like `cuda-cccl_linux-64`. Those packages contain the CCCL versions shipped as part of the CUDA Toolkit, but are designed for internal use by the CUDA Toolkit. Install `cccl` or `cuda-cccl` instead, for compatibility with conda compilers. For more information, see the cccl conda-forge recipe.
### CMake Integration

CCCL uses CMake for all build and installation infrastructure, including tests, as well as targets to link against in other CMake projects. Therefore, CMake is the recommended way to integrate CCCL into your project.
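As a minimal sketch of such an integration, a consuming project's `CMakeLists.txt` might look like the following. This assumes CCCL is installed somewhere CMake can find it (e.g. via `make install` or the conda package); `CCCL::CCCL` is the interface target exported by CCCL's CMake package, and `example.cu` is a hypothetical source file:

```cmake
cmake_minimum_required(VERSION 3.21)
project(cccl_example LANGUAGES CXX CUDA)

# Locate an installed CCCL and import its targets
find_package(CCCL REQUIRED)

add_executable(example example.cu)

# CCCL::CCCL carries the Thrust, CUB, and libcudacxx include paths
target_link_libraries(example PRIVATE CCCL::CCCL)
```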