COSTA
Distributed Communication-Optimal Shuffle and Transpose Algorithm
Install / Use
/learn @eth-cscs/COSTAREADME
Table of Contents
- Overview
- Publication
- Features
- Installing in 30 seconds
- Examples
- Advanced Features
- Performance Results
- COSTA in Production
- Miniapps (for testing and benchmarking)
- Questions?
- Acknowledgements
Overview
COSTA is a communication-optimal, highly-optimised algorithm for data redistribution accross multiple processors, using MPI and OpenMP and offering the possibility to transpose and scale some or all data. It implements scalapack routines for matrix scale & transpose operations (sub(C) = alpha * sub(A)^T + beta * C, provided by pxtran(u)) and data redistribution (sub(C) = sub(A), provided by pxgemr2d) and outperforms other scalapack implementations by orders of magnitude in some cases. Unlike previous redistribution algorithms, COSTA will also propose the relabelling of MPI ranks that minimizes the data reshuffling cost, leaving to users to decide if they want to use it. This way, if the initial and the target data distributions differ up to a rank permutation, COSTA will perform no communication, whereas other algorithms will reshuffle all the data. Thanks to its optimizations, significant speedups will be achieved even if the proposed rank relabelling is not used.
What makes COSTA more general than scalapack routines is that it is not limited only to block-cyclic data distributions, but can deal with completely arbitrary and irregular matrix distributions and can be easily generalized for n-dimensional tensors.
Thanks to its scalapack wrappers, scalapack users do not need to change their code in order to use COSTA: it is enough to link your library to COSTA before linking to scalapack and all pxtran, pxtranu and pxgemr2d routines will automatically be using the COSTA algorithm.
Publication
This work is published in the Proceedings of the International Conference on High Performance Computing (ISC21) and is available under the following links:
- published version: https://link.springer.com/chapter/10.1007/978-3-030-78713-4_12
- arxiv preprint: https://arxiv.org/abs/2106.06601
It can be cited as:
@InProceedings{costa_algorithm_2021,
author="Kabi{\'{c}}, Marko and Pintarelli, Simon and Kozhevnikov, Anton and VandeVondele, Joost",
editor="Chamberlain, Bradford L. and Varbanescu, Ana-Lucia and Ltaief, Hatem and Luszczek, Piotr",
title="COSTA: Communication-Optimal Shuffle and Transpose Algorithm with Process Relabeling",
booktitle="High Performance Computing",
year="2021",
publisher="Springer International Publishing",
address="Cham",
pages="217--236",
isbn="978-3-030-78713-4"
}
Features
COSTA has the following features:
- scale, transpose & reshuffle: apart from redistribution, can also transpose, scale and sum initial and final layouts:
sub(B) = beta * sub(B) + alpha * sub(op(A)) ; op=N, T or C; sub = submatrix
- Arbitrary Layouts: COSTA is not limited to block cyclic matrix layouts and can handle complitely irregular and arbitrary matrix distributions.
- Multiple Layouts: can transform multiple layouts at once (in the same communication round).
- Highly-optimized: it is highly-optimized in distributed and multithreaded settings.
- Communication-Optimal: proposes (but does not enforce) the optimal rank relabelling to minimize communication.
- SCALAPACK wrappers: provides scalapack wrappers for
pxgemr2dandpxtran(u). - Well Documented: a detailed documentation is provided in this README.
Installing in 30 seconds
Please refer to INSTALL.md.
Examples
Block-cyclic (Scalapack) Layout
To represent an arbitrary block-cyclic (scalapack) layout, we can use the following function defined in costa/layout.hpp header:
#include <costa/layout.hpp>
// ...
template <typename T>
grid_layout<T> costa::block_cyclic_layout<double>(
const int m, const int n, // global matrix dimensions
const int b_m, const int b_n, // block dimensions
const int i, const int j, // submatrix start
// (1-based, scalapack-compatible)
const int sub_m, const int sub_n, // submatrix size
const int p_m, const int p_n, // processor grid dimension
const char order, // rank grid ordering ('R' or 'C')
const int rsrc, const int csrc, // coordinates of ranks oweing
// the first row (0-based)
T* ptr, // local data of matrix A
// (not the submatrix)
const int lld, // local leading dimension
const char data_ordering, // 'R' or 'C' depending on whether
// each local block
// is given in row- or col-major
// ordering
const int rank // processor rank
);
The arguments can be nicely visualized with the following figure, where the red submatrix is represented:
<p align="center"><img src="./docs/block-cyclic.svg" width="100%"></p>In case we want to represent the full matrix (instead of a submatrix), it suffices to put:
// start of the submatrix is the start of the full matrix
int i = 1; int j = 1 // 1-based due to scalapack-compatibility
// size of the submatrix is that size of the full matrix
int sub_m = m; int sub_n = n
For a complete example of transforming between two block-cyclic matrix layouts, please refer to examples/example0.cpp.
Custom (Arbitrary) Layout
To represent an arbitrary grid-like layout, we can use the following function defined in costa/layout.hpp header:
#include <costa/layout.hpp>
// ...
template <typename T>
grid_layout<T> costa::custom_layout(
int rowblocks, // number of global blocks (N_rb)
int colblocks, // number of global blocks (N_cb)
int* rowsplit, // [rowsplit[i], rowsplit[i+1]) is range of rows of block i
int* colsplit, // [colsplit[i], colsplit[i+1]) is range of columns of block i
int* owners, // owners[i][j] is the rank owning block (i,j).
// Owners are given in row-major order as assumed by C++.
int nlocalblocks, // number of blocks owned by the current rank (N_lb)
block_t* localblocks, // an array of block descriptions for the current rank
const char data_ordering // 'R' or 'C' depending on whether each
// local block is given in row- or col-major
// order
);
where block_t is a simple struct defined in the same header:
// each local block is assumed to be stored in col-major order
struct costa::block_t {
void *data; // a pointer to the start of the local matrix
int ld; // leading dimension or distance between two consecutive local columns
int row; // the global block row index
int col; // the global block colum index
};
The arguments can be nicely visualized with the following figure:
<p align="center"><img src="./docs/custom-layout.svg" width="90%"></p>For a complete example of transforming between a block-cyclic and a custom matrix layout, please refer to examples/example1.cpp.
Initializing Layouts
Once the layouts are created as previously described, we can initialize them by providing a simple lambda function that maps global element coordinates (i,j) to the value to which the element should be initialized:
// some user-defined layout
grid_layout<double> layout;
// function f(i, j) := value of element (i, j) in the global matrix
// an arbitrary function
auto f = [](int i, int j) -> double {
return i + j;
};
// initialize it
layout.initialize(f);
In exactly the same way, we can check if the elements in the layout are equal to the values provided by the lambda function:
// check if the values in the final layout correspond to function f
// that was used for the initialization of the initial layout
bool ok = layout.validate(f, 1e-12); // the second argument is the tolerance
Transforming Matrix Layouts
Once the layouts are created as previously described, we can transform between two layouts in the following way (defined in header <costa/grid2grid/transform.hpp>):
- Redistribute with optional scaling and/or transpose. Performs
B = beta * B + alpha * op(A), whereopcan be transpose, conjugate or none (i.e. identity). ``
