COSMA
Distributed Communication-Optimal Matrix-Matrix Multiplication Algorithm
Table of Contents
- Overview
- COSMA Literature
- Features
- Building COSMA
- COSMA Dependencies
- Using COSMA
- COSMA on Multi-GPU Systems
- COSMA in production
- Examples - Miniapps
- Tunable Parameters
- Performance Profiling
- Authors
- Questions?
- Acknowledgements
Overview
COSMA is a parallel, high-performance, GPU-accelerated, matrix-matrix multiplication algorithm that is communication-optimal for all combinations of matrix dimensions, number of processors and memory sizes, without the need for any parameter tuning. The key idea behind COSMA is to first derive a tight optimal sequential schedule and only then parallelize it, preserving I/O optimality between processes. This stands in contrast with the 2D and 3D algorithms, which fix the process domain decomposition upfront and then map it to the matrix dimensions, which may result in asymptotically more communication. The final design of COSMA facilitates the overlap of computation and communication, ensuring speedups and applicability of modern mechanisms such as RDMA. COSMA may also leave some processors idle in order to optimize the processor grid, which reduces the communication volume even further and increases the computation volume per processor.
COSMA received the Best Student Paper Award at the prestigious Supercomputing 2019 conference in Denver, US.
COSMA alleviates the issues of current state-of-the-art algorithms, which can be summarized as follows:
- 2D (SUMMA): requires manual tuning and is not communication-optimal in the presence of extra memory.
- 2.5D: optimal for `m = n`, but inefficient for `m << n` or `n << m` and for some numbers of processes `p`.
- Recursive (CARMA): asymptotically communication-optimal for all `m, n, k, p`, but always splitting the largest dimension can increase the communication volume by a factor of up to `√3`.
- COSMA (this work): strictly communication-optimal (not only asymptotically) for all `m, n, k, p` and memory sizes, yielding speedups of up to 8.3x over the second-fastest algorithm.
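For context, "communication-optimal" means matching the memory-dependent I/O lower bound for parallel matrix multiplication. A sketch of the classical asymptotic form of that bound (due to Irony, Toledo and Tiskin), for `p` processors each with local memory of size `M`:

```latex
% Classical memory-dependent I/O lower bound for multiplying an
% (m x k) matrix by a (k x n) matrix on p processors, each with
% local memory of size M (Irony, Toledo & Tiskin, 2004):
Q \;=\; \Omega\!\left(\frac{m\,n\,k}{p\sqrt{M}}\right)
```

The COSMA paper derives a tight (exact-constant) version of this bound and a schedule attaining it, which is what "strictly communication-optimal" refers to above.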
In addition to being communication-optimal, this implementation is highly optimized to reduce the memory footprint in the following ways:
- Buffer Reuse: all buffers are pre-allocated and carefully reused during execution, including the buffers needed for communication, which reduces total memory usage.
- Reduced Local Data Movement: the assignment of data blocks to processes is fully adapted to the communication pattern, which minimizes the local data reshuffling that arises after each communication step.
The library supports both one-sided and two-sided MPI communication backends. It uses `dgemm` for local computations, but also supports GPU acceleration through our Tiled-MM library using cuBLAS or rocBLAS.
COSMA Literature
The paper and other materials on COSMA are available at the following links:
- ACM Digital Library (Best Student Paper Award at SC19): https://dl.acm.org/doi/10.1145/3295500.3356181
- Arxiv: https://arxiv.org/abs/1908.09606
- YouTube Presentation: https://www.youtube.com/watch?v=5wiZWw5ltR0
- Press Release: https://www.cscs.ch/science/computer-science-hpc/2019/new-matrix-multiplication-algorithm-pushes-the-performance-to-the-limits/
Features
- [NEW] Multi-GPU Systems Support: COSMA is now able to take advantage of fast GPU-to-GPU interconnects, either through the NCCL/RCCL libraries or through GPU-aware MPI. Both NVIDIA and AMD GPUs are supported.
- ScaLAPACK API Support: it is enough to link to COSMA, without changing the code, and all `p?gemm` calls will use the ScaLAPACK wrappers provided by COSMA.
- C/Fortran Interface: written in C++, but provides C and Fortran interfaces.
- Custom Types: fully templatized types.
- GPU acceleration: supports both NVIDIA and AMD GPUs.
- Supported BLAS (CPU) backends: MKL, LibSci, NETLIB, BLIS, ATLAS.
- Custom Data Layout Support: natively uses its own blocked data layout of matrices, but supports an arbitrary grid-like data layout of matrices.
- Transposition/Conjugation Support: matrices `A` and `B` can be transposed and/or conjugated.
- Communication and Computation Overlap: supports overlapping of communication and computation.
- Spack Installation: can be built and installed with Spack since v14.1.
- Julia Package: see https://github.com/haampie/COSMA.jl/ on how to use COSMA in the Julia language.
Building COSMA
See Installation Instructions.
COSMA Dependencies
COSMA is a CMake project and requires a recent CMake (>= 3.17).
External dependencies:
- MPI 3 (required)
- BLAS: when the problem becomes local, COSMA uses the provided `?gemm` backend, which can be one of the following:
  - MKL (default)
  - OPENBLAS
  - BLIS
  - ATLAS
  - CRAY_LIBSCI: Cray-libsci or Cray-libsci_acc (GPU-accelerated)
  - CUDA: cublas is used for NVIDIA GPUs
  - ROCM: rocBLAS is used for AMD GPUs
  - CUSTOM: user-provided BLAS API
Some dependencies are bundled as submodules and need not be installed explicitly:
- TiledMM: a cublasXt GEMM replacement, also ported to AMD GPUs.
- COSTA: distributed matrix reshuffle and transpose algorithm.
- semiprof: profiling utility.
- gtest_mpi: an MPI utility wrapper over GoogleTest (unit-testing library).
Using COSMA
To allow easy integration, COSMA can be used in the following ways:
- without changing your code: if your code already uses the ScaLAPACK API, you can just link to COSMA before linking to any other library providing `pxgemm`, and all `pxgemm` calls will use COSMA, without any changes to your code. To get a feeling for the performance you can expect, have a look at the pdgemm miniapp. To see how to link your code to COSMA `pxgemm`, have a look at the 30-seconds tutorial below. In this way, we integrated COSMA into the CP2K quantum chemistry simulator, which you can read more about in the production example.
- adapting your code: if your code does not use ScaLAPACK, there are two interfaces that can be used:
  - custom layout: if your matrices are distributed in a custom way, it is enough to pass the descriptors of your data layout to the `multiply_using_layout` function, which will then adapt COSMA to your layout.
  - native COSMA layout: to get the maximum performance, the native COSMA matrix layout should be used. To get an idea of the performance you can expect, have a look at the matrix multiplication miniapp.

The documentation for the latter option will soon be published here.
Using COSMA in 30 seconds
For easy integration, it is enough to build COSMA with the ScaLAPACK API and then link your code to COSMA before linking to any other library providing ScaLAPACK `pxgemm`. This way, all `pxgemm` calls will use the COSMA `pxgemm` wrappers (see the minimal call-site sketch at the end of this section). To achieve this, follow these steps:
- Build COSMA with the ScaLAPACK API:

```bash
###############
# get COSMA
###############
git clone --recursive https://github.com/eth-cscs/COSMA cosma && cd cosma

##############################
# build and install COSMA
##############################
mkdir build && cd build
# set up the compiler, e.g. with:
export CC=`which cc`
export CXX=`which CC`
# choose the BLAS and SCALAPACK versions you want to use
# COSMA_BLAS can be: MKL, OpenBLAS, CRAY_LIBSCI, CUDA, ROCM, CUSTOM
# COSMA_SCALAPACK can be: MKL, CRAY_LIBSCI, CUSTOM
cmake -DCOSMA_BLAS=CUDA -DCOSMA_SCALAPACK=MKL -DCMAKE_INSTALL_PREFIX=<installation dir>/cosma ..
make -j 8
make install
```

!! Note the `--recursive` flag !!
- Link your code to COSMA:
  - CPU-only version of COSMA:
    - link your code to:
      `-L<installation dir>/cosma/lib64 -lcosma_pxgemm -lcosma -lcosta_scalapack`
    - then link to the BLAS and ScaLAPACK you built COSMA with (see the `COSMA_BLAS` and `COSMA_SCALAPACK` flags in cmake), e.g. for MKL:
      `-L${MKLROOT}/lib/intel64 -Wl,--no-as-needed -lmkl_scalapack_lp64 -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -lmkl_blacs_intelmpi_lp64 -lgomp -lpthread -lm`
  - GPU-accelerated version of COSMA:
    - link your code to:
      `-L<installation dir>/cosma/lib64 -lcosma_pxgemm -lcosma -lcosta_scalapack -lTiled-MM`
    - link to the GPU backend you built COSMA with (see the `COSMA_BLAS` flag in cmake), e.g. for CUDA:
      `-lcublas -lcudart -lrt`
    - then link to the ScaLAPACK you built COSMA with (see the `COSMA_SCALAPACK` flag in cmake), e.g. for MKL:
      `-L${MKLROOT}/lib/intel64 -Wl,--no-as-needed -lmkl_scalapack_lp64 -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -lmkl_blacs_intelmpi_lp64 -lgomp -lpthread -lm`
- Include headers:
  `-I<installation dir>/cosma/include`
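To illustrate what such an unchanged call site looks like, below is a minimal sketch (not taken from the COSMA repository) of a standard ScaLAPACK `pdgemm` call in C++. Everything in it is plain BLACS/ScaLAPACK; the process grid shape, the matrix sizes and the block size are arbitrary choices for this example. Once the binary is linked in the order shown above, this very call is served by COSMA's `pdgemm` wrapper instead of the vendor ScaLAPACK:

```cpp
#include <mpi.h>
#include <algorithm>
#include <cstddef>
#include <vector>

// Standard BLACS / ScaLAPACK prototypes (Fortran symbols).
extern "C" {
void Cblacs_get(int ctxt, int what, int* val);
void Cblacs_gridinit(int* ctxt, const char* order, int nprow, int npcol);
void Cblacs_gridinfo(int ctxt, int* nprow, int* npcol, int* myrow, int* mycol);
void Cblacs_gridexit(int ctxt);
int numroc_(const int* n, const int* nb, const int* iproc,
            const int* isrcproc, const int* nprocs);
void descinit_(int* desc, const int* m, const int* n, const int* mb,
               const int* nb, const int* irsrc, const int* icsrc,
               const int* ictxt, const int* lld, int* info);
void pdgemm_(const char* transa, const char* transb,
             const int* m, const int* n, const int* k, const double* alpha,
             const double* a, const int* ia, const int* ja, const int* desca,
             const double* b, const int* ib, const int* jb, const int* descb,
             const double* beta,
             double* c, const int* ic, const int* jc, const int* descc);
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    // 1 x nprocs process grid, just so the example runs on any rank count.
    int ctxt, nprow = 1, npcol = nprocs, myrow, mycol;
    Cblacs_get(0, 0, &ctxt);
    Cblacs_gridinit(&ctxt, "Row-major", nprow, npcol);
    Cblacs_gridinfo(ctxt, &nprow, &npcol, &myrow, &mycol);

    // Global problem C (m x n) = A (m x k) * B (k x n), block size nb.
    const int m = 1024, n = 1024, k = 1024, nb = 128;
    const int izero = 0, ione = 1;

    // Local buffer sizes implied by the block-cyclic distribution.
    int a_rows = std::max(1, numroc_(&m, &nb, &myrow, &izero, &nprow));
    int a_cols = std::max(1, numroc_(&k, &nb, &mycol, &izero, &npcol));
    int b_rows = std::max(1, numroc_(&k, &nb, &myrow, &izero, &nprow));
    int b_cols = std::max(1, numroc_(&n, &nb, &mycol, &izero, &npcol));
    int c_rows = std::max(1, numroc_(&m, &nb, &myrow, &izero, &nprow));
    int c_cols = std::max(1, numroc_(&n, &nb, &mycol, &izero, &npcol));

    std::vector<double> a(static_cast<std::size_t>(a_rows) * a_cols, 1.0);
    std::vector<double> b(static_cast<std::size_t>(b_rows) * b_cols, 1.0);
    std::vector<double> c(static_cast<std::size_t>(c_rows) * c_cols, 0.0);

    // ScaLAPACK array descriptors.
    int desca[9], descb[9], descc[9], info;
    descinit_(desca, &m, &k, &nb, &nb, &izero, &izero, &ctxt, &a_rows, &info);
    descinit_(descb, &k, &n, &nb, &nb, &izero, &izero, &ctxt, &b_rows, &info);
    descinit_(descc, &m, &n, &nb, &nb, &izero, &izero, &ctxt, &c_rows, &info);

    // A completely ordinary ScaLAPACK call: with the link order shown
    // above, this is routed to COSMA's pdgemm wrapper.
    double alpha = 1.0, beta = 0.0;
    pdgemm_("N", "N", &m, &n, &k, &alpha,
            a.data(), &ione, &ione, desca,
            b.data(), &ione, &ione, descb,
            &beta, c.data(), &ione, &ione, descc);

    Cblacs_gridexit(ctxt);
    MPI_Finalize();
    return 0;
}
```

Compile with `mpicxx` and the link lines from the steps above; because the interception happens at the ScaLAPACK symbol level, no COSMA-specific code appears at the call site itself.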
COSMA on Multi-GPU Systems
COSMA is able to take advantage of fast GPU-to-GPU interconnects on multi-GPU systems, either through the NCCL/RCCL libraries or through GPU-aware MPI, on both NVIDIA and AMD GPUs.