
COSMA

Distributed Communication-Optimal Matrix-Matrix Multiplication Algorithm


<p align="center"><img src="./docs/cosma-logo.svg" width="70%"></p>

Overview

COSMA is a parallel, high-performance, GPU-accelerated matrix-matrix multiplication algorithm that is communication-optimal for all combinations of matrix dimensions, number of processors and memory sizes, without the need for any parameter tuning. The key idea behind COSMA is to first derive a tight optimal sequential schedule and only then parallelize it, preserving I/O optimality between processes. This stands in contrast with the 2D and 3D algorithms, which fix the process domain decomposition upfront and then map it to the matrix dimensions, which may result in asymptotically more communication. The final design of COSMA facilitates the overlap of computation and communication, ensuring speedups and enabling modern mechanisms such as RDMA. COSMA can also leave some processors idle in order to optimize the processor grid, which reduces the communication volume even further and increases the computation volume per processor.

COSMA received the Best Student Paper Award at the prestigious Supercomputing 2019 (SC19) conference in Denver, US.

COSMA alleviates the issues of current state-of-the-art algorithms, which can be summarized as follows:

  • 2D (SUMMA): requires manual tuning and is not communication-optimal in the presence of extra memory.
  • 2.5D: optimal for m = n, but inefficient for m << n or n << m and for some numbers of processes p.
  • Recursive (CARMA): asymptotically communication-optimal for all m, n, k, p, but always splitting the largest dimension can increase the communication volume by a factor of up to √3.
  • COSMA (this work): strictly communication-optimal (not just asymptotically) for all m, n, k, p and memory sizes, yielding speedups of up to 8.3x over the second-fastest algorithm.

In addition to being communication-optimal, this implementation is highly optimized to reduce the memory footprint in the following sense:

  • Buffer Reuse: all the buffers are pre-allocated and carefully reused during execution, including the buffers necessary for the communication, which reduces the total memory usage.
  • Reduced Local Data Movement: the assignment of data blocks to processes is fully adapted to the communication pattern, which minimizes the local data reshuffling that would otherwise arise after each communication step.

The library supports both one-sided and two-sided MPI communication backends. It uses dgemm for the local computations, but also supports GPU acceleration through our Tiled-MM library, using cuBLAS or rocBLAS.

COSMA Literature

The paper and other materials on COSMA are available at the following links:

  • ACM Digital Library (Best Student Paper Award at SC19): https://dl.acm.org/doi/10.1145/3295500.3356181
  • Arxiv: https://arxiv.org/abs/1908.09606
  • YouTube Presentation: https://www.youtube.com/watch?v=5wiZWw5ltR0
  • Press Release: https://www.cscs.ch/science/computer-science-hpc/2019/new-matrix-multiplication-algorithm-pushes-the-performance-to-the-limits/

Features

  • [NEW] Multi-GPU Systems Support: COSMA is now able to take advantage of fast GPU-to-GPU interconnects, either through the NCCL/RCCL libraries or through GPU-aware MPI. Both NVIDIA and AMD GPUs are supported.
  • ScaLAPACK API Support: it is enough to link to COSMA, without changing your code, and all p?gemm calls will use the ScaLAPACK wrappers provided by COSMA.
  • C/Fortran Interface: written in C++, but provides C and Fortran interfaces.
  • Custom Types: fully templatized types.
  • GPU acceleration: supports both NVIDIA and AMD GPUs.
  • Supported BLAS (CPU) backends: MKL, LibSci, NETLIB, BLIS, ATLAS.
  • Custom Data Layout Support: natively uses its own blocked data layout of matrices, but supports arbitrary grid-like data layout of matrices.
  • Transposition/Conjugation Support: matrices A and B can be transposed and/or conjugated.
  • Communication and Computation Overlap: supports overlapping of communication and computation.
  • Spack Installation: can be built and installed with Spack since v14.1
  • Julia Package: see https://github.com/haampie/COSMA.jl/ on how to use COSMA in the Julia language.

Building COSMA

See Installation Instructions.

COSMA Dependencies

COSMA is a CMake project and requires a recent CMake (>= 3.17).

External dependencies:

  • MPI 3: (required)
  • BLAS: when the problem becomes local, COSMA uses the provided ?gemm backend, which can be one of the following:
    • MKL (default)
    • OPENBLAS
    • BLIS
    • ATLAS
    • CRAY_LIBSCI: Cray-libsci or Cray-libsci_acc (GPU-accelerated)
    • CUDA: cublas is used for NVIDIA GPUs
    • ROCM: rocBLAS is used for AMD GPUs
    • CUSTOM: user-provided BLAS API

Some dependencies are bundled as submodules and need not be installed explicitly:

  • Tiled-MM - cublasXt GEMM replacement, also ported to AMD GPUs.
  • COSTA - distributed matrix reshuffle and transpose algorithm.
  • semiprof - profiling utility.
  • gtest_mpi - MPI utility wrapper over GoogleTest (unit testing library).

Using COSMA

To allow easy integration, COSMA can be used in the following ways:

  • without changing your code: if your code already uses the ScaLAPACK API, then you can simply link to COSMA before linking to any other library providing pxgemm, and all pxgemm calls will use COSMA, without the need to change your code at all. To get a feeling for the performance you can expect, please have a look at the pdgemm miniapp. To see how to link your code to COSMA pxgemm, have a look at the 30-seconds tutorial. In this way, we integrated COSMA into the CP2K quantum chemistry simulator, which you can read more about in the production example.

  • adapting your code: if your code is not using ScaLAPACK, then there are two interfaces that can be used:

    • custom layout: if your matrices are distributed in a custom way, then it is enough to pass the descriptors of your data layout to the multiply_using_layout function, which will then adapt COSMA to your own layout.
    • native COSMA layout: to get the maximum performance, the native COSMA matrix layout should be used. To get an idea of the performance you can expect to get, please have a look at the matrix multiplication miniapp.

The documentation for the latter option will soon be published here.
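If your application itself is built with CMake, the ScaLAPACK drop-in link step can alternatively be expressed through COSMA's installed CMake package. The package and target names below (`cosma`, `cosma::cosma`) are assumptions for this sketch; verify them against the config files installed under your COSMA prefix:

```cmake
# Hypothetical CMakeLists.txt fragment; the package and target names are
# assumptions -- check the cmake/ directory under your COSMA install prefix.
find_package(cosma REQUIRED)
target_link_libraries(my_app PRIVATE cosma::cosma)
```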

Using COSMA in 30 seconds

For easy integration, it is enough to build COSMA with the ScaLAPACK API and then link your code to COSMA before linking to any other library providing ScaLAPACK pxgemm. This way, all pxgemm calls will use the COSMA pxgemm wrappers. To achieve this, please follow these steps:

  1. Build COSMA with ScaLAPACK API:
###############
# get COSMA
###############
git clone --recursive https://github.com/eth-cscs/COSMA cosma && cd cosma

##############################
# build and install COSMA
##############################
mkdir build && cd build

# set up the compiler, e.g. with:
export CC=`which cc`
export CXX=`which CC`

# choose BLAS and SCALAPACK versions you want to use
# COSMA_BLAS can be: MKL, OpenBLAS, CRAY_LIBSCI, CUDA, ROCM, CUSTOM
# COSMA_SCALAPACK can be MKL, CRAY_LIBSCI, CUSTOM
cmake -DCOSMA_BLAS=CUDA -DCOSMA_SCALAPACK=MKL -DCMAKE_INSTALL_PREFIX=<installation dir>/cosma ..
make -j 8
make install

!! Note the --recursive flag when cloning: it is required to fetch the bundled submodules !!

  2. Link your code to COSMA:

    • CPU-only version of COSMA:

      • link your code to:

      -L<installation dir>/cosma/lib64 -lcosma_pxgemm -lcosma -lcosta_scalapack

      • then link to the BLAS and ScaLAPACK you built COSMA with (see COSMA_BLAS and COSMA_SCALAPACK flags in cmake):

      -L${MKLROOT}/lib/intel64 -Wl,--no-as-needed -lmkl_scalapack_lp64 -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -lmkl_blacs_intelmpi_lp64 -lgomp -lpthread -lm

    • using GPU-accelerated version of COSMA:

      • link your code to:

      -L<installation dir>/cosma/lib64 -lcosma_pxgemm -lcosma -lcosta_scalapack -lTiled-MM

      • link to the GPU backend you built COSMA with (see COSMA_BLAS flag in cmake):

      -lcublas -lcudart -lrt

      • then link to the ScaLAPACK you built COSMA with (see COSMA_SCALAPACK flag in cmake):

      -L${MKLROOT}/lib/intel64 -Wl,--no-as-needed -lmkl_scalapack_lp64 -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -lmkl_blacs_intelmpi_lp64 -lgomp -lpthread -lm

  3. Include headers:

-I<installation dir>/cosma/include
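Putting the steps above together for the CPU-only case, a full compile-and-link command might look like this (illustrative only: the install prefix stays a placeholder, and the ScaLAPACK line assumes the COSMA_SCALAPACK=MKL configuration from step 1):

```shell
# Illustrative command; <installation dir> is a placeholder, and the MKL
# link line matches the COSMA_SCALAPACK=MKL configuration shown above.
# Note that the COSMA libraries come before the ScaLAPACK ones.
mpicxx my_app.cpp -o my_app \
    -I<installation dir>/cosma/include \
    -L<installation dir>/cosma/lib64 -lcosma_pxgemm -lcosma -lcosta_scalapack \
    -L${MKLROOT}/lib/intel64 -Wl,--no-as-needed -lmkl_scalapack_lp64 \
    -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -lmkl_blacs_intelmpi_lp64 \
    -lgomp -lpthread -lm
```

The link order matters: because COSMA's pxgemm wrappers appear before the ScaLAPACK library, the linker resolves pxgemm symbols to COSMA first.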

COSMA on Multi-GPU Systems

COSMA is able to take advantage of fast GPU-to-GPU interconnects on multi-GPU systems, either through the NCCL/RCCL libraries or through GPU-aware MPI (see the Multi-GPU Systems Support feature above).
