Overview
Publication
Features
Installing in 30 seconds
Examples
Advanced Features
- Transforming Multiple Layouts
- Achieving Communication-Optimality
Performance Results
COSTA in Production
Miniapps (for testing and benchmarking)
Questions?
Acknowledgements

Overview

COSTA is a communication-optimal, highly-optimised algorithm for data redistribution accross multiple processors, using MPI and OpenMP and offering the possibility to transpose and scale some or all data. It implements scalapack routines for matrix scale & transpose operations (sub(C) = alpha * sub(A)^T + beta * C, provided by pxtran(u)) and data redistribution (sub(C) = sub(A), provided by pxgemr2d) and outperforms other scalapack implementations by orders of magnitude in some cases. Unlike previous redistribution algorithms, COSTA will also propose the relabelling of MPI ranks that minimizes the data reshuffling cost, leaving to users to decide if they want to use it. This way, if the initial and the target data distributions differ up to a rank permutation, COSTA will perform no communication, whereas other algorithms will reshuffle all the data. Thanks to its optimizations, significant speedups will be achieved even if the proposed rank relabelling is not used.

What makes COSTA more general than scalapack routines is that it is not limited only to block-cyclic data distributions, but can deal with completely arbitrary and irregular matrix distributions and can be easily generalized for n-dimensional tensors.

Thanks to its scalapack wrappers, scalapack users do not need to change their code in order to use COSTA: it is enough to link your library to COSTA before linking to scalapack and all pxtran, pxtranu and pxgemr2d routines will automatically be using the COSTA algorithm.

Publication

This work is published in the Proceedings of the International Conference on High Performance Computing (ISC21) and is available under the following links:

published version: https://link.springer.com/chapter/10.1007/978-3-030-78713-4_12
arxiv preprint: https://arxiv.org/abs/2106.06601

It can be cited as:

@InProceedings{costa_algorithm_2021,
    author="Kabi{\'{c}}, Marko and Pintarelli, Simon and Kozhevnikov, Anton and VandeVondele, Joost",
    editor="Chamberlain, Bradford L. and Varbanescu, Ana-Lucia and Ltaief, Hatem and Luszczek, Piotr",
    title="COSTA: Communication-Optimal Shuffle and Transpose Algorithm with Process Relabeling",
    booktitle="High Performance Computing",
    year="2021",
    publisher="Springer International Publishing",
    address="Cham",
    pages="217--236",
    isbn="978-3-030-78713-4"
}

Features

COSTA has the following features:

scale, transpose & reshuffle: apart from redistribution, can also transpose, scale and sum initial and final layouts:

sub(B) = beta * sub(B) + alpha * sub(op(A)) ; op=N, T or C; sub = submatrix

Arbitrary Layouts: COSTA is not limited to block cyclic matrix layouts and can handle complitely irregular and arbitrary matrix distributions.
Multiple Layouts: can transform multiple layouts at once (in the same communication round).
Highly-optimized: it is highly-optimized in distributed and multithreaded settings.
Communication-Optimal: proposes (but does not enforce) the optimal rank relabelling to minimize communication.
SCALAPACK wrappers: provides scalapack wrappers for pxgemr2d and pxtran(u).
Well Documented: a detailed documentation is provided in this README.

Installing in 30 seconds

Please refer to INSTALL.md.

Examples

Block-cyclic (Scalapack) Layout

To represent an arbitrary block-cyclic (scalapack) layout, we can use the following function defined in costa/layout.hpp header:

#include <costa/layout.hpp>
// ...
template <typename T>
grid_layout<T> costa::block_cyclic_layout<double>(
                   const int m, const int n,         // global matrix dimensions
                   const int b_m, const int b_n,     // block dimensions
                   const int i, const int j,         // submatrix start
                                                     // (1-based, scalapack-compatible)
                   const int sub_m, const int sub_n, // submatrix size
                   const int p_m, const int p_n,     // processor grid dimension
                   const char order,                 // rank grid ordering ('R' or 'C')
                   const int rsrc, const int csrc,   // coordinates of ranks oweing 
                                                     // the first row (0-based)
                   T* ptr,                           // local data of matrix A 
                                                     // (not the submatrix)
                   const int lld,                    // local leading dimension
                   const char data_ordering,         // 'R' or 'C' depending on whether 
                                                     // each local block
                                                     // is given in row- or col-major
                                                     // ordering
                   const int rank                    // processor rank
               );

The arguments can be nicely visualized with the following figure, where the red submatrix is represented:

In case we want to represent the full matrix (instead of a submatrix), it suffices to put:

// start of the submatrix is the start of the full matrix
int i = 1; int j = 1 // 1-based due to scalapack-compatibility
// size of the submatrix is that size of the full matrix
int sub_m = m; int sub_n = n

For a complete example of transforming between two block-cyclic matrix layouts, please refer to examples/example0.cpp.

Custom (Arbitrary) Layout

To represent an arbitrary grid-like layout, we can use the following function defined in costa/layout.hpp header:

#include <costa/layout.hpp>
// ...
template <typename T>
grid_layout<T> costa::custom_layout(
                   int rowblocks,           // number of global blocks (N_rb)
                   int colblocks,           // number of global blocks (N_cb)
                   int* rowsplit,           // [rowsplit[i], rowsplit[i+1]) is range of rows of block i
                   int* colsplit,           // [colsplit[i], colsplit[i+1]) is range of columns of block i
                   int* owners,             // owners[i][j] is the rank owning block (i,j). 
                                            // Owners are given in row-major order as assumed by C++.
                   int nlocalblocks,        // number of blocks owned by the current rank (N_lb)
                   block_t* localblocks,    // an array of block descriptions for the current rank
                   const char data_ordering // 'R' or 'C' depending on whether each
                                            // local block is given in row- or col-major
                                            // order
               );

where block_t is a simple struct defined in the same header:

// each local block is assumed to be stored in col-major order
struct costa::block_t {
    void *data; // a pointer to the start of the local matrix
    int ld;     // leading dimension or distance between two consecutive local columns
    int row;    // the global block row index
    int col;    // the global block colum index
};

The arguments can be nicely visualized with the following figure:

For a complete example of transforming between a block-cyclic and a custom matrix layout, please refer to examples/example1.cpp.

Initializing Layouts

Once the layouts are created as previously described, we can initialize them by providing a simple lambda function that maps global element coordinates (i,j) to the value to which the element should be initialized:

// some user-defined layout
grid_layout<double> layout; 

// function f(i, j) := value of element (i, j) in the global matrix
// an arbitrary function
auto f = [](int i, int j) -> double {
    return i + j; 
};

// initialize it
layout.initialize(f);

In exactly the same way, we can check if the elements in the layout are equal to the values provided by the lambda function:

// check if the values in the final layout correspond to function f
// that was used for the initialization of the initial layout
bool ok = layout.validate(f, 1e-12); // the second argument is the tolerance

Transforming Matrix Layouts

Once the layouts are created as previously described, we can transform between two layouts in the following way (defined in header <costa/grid2grid/transform.hpp>):

Redistribute with optional scaling and/or transpose. Performs B = beta * B + alpha * op(A), where op can be transpose, conjugate or none (i.e. identity). ``

COSTA

Install / Use

README

Table of Contents