TAMPI

The Task-Aware MPI or TAMPI library extends the functionality of standard MPI libraries by providing new mechanisms for improving the interoperability between parallel task-based programming models, such as OpenMP and OmpSs-2, and MPI communications. This library allows the safe and efficient execution of MPI operations from concurrent tasks and guarantees the transparent management and progress of these communications.
The TAMPI library is not an MPI implementation. Instead, TAMPI is an independent library
that works on top of any standard MPI library supporting the MPI_THREAD_MULTIPLE level.
At build time, TAMPI is linked against the specified MPI library.
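As an illustration, a link line for a C application might look like the following sketch; the TAMPI_HOME variable, the include/lib paths, and the -ltampi-c library name are assumptions that depend on your installation:

```shell
# Hypothetical build line: TAMPI_HOME and -ltampi-c are assumptions;
# check your TAMPI installation for the actual paths and library name.
mpicc -I"$TAMPI_HOME/include" app.c -o app -L"$TAMPI_HOME/lib" -ltampi-c
```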
Following the MPI Standard, programmers must pay close attention to avoid deadlocks that may occur in hybrid applications (e.g., MPI+OpenMP) where MPI calls take place inside tasks. These deadlocks arise from the out-of-order execution of tasks, which alters the execution order of the enclosed MPI calls. The TAMPI library ensures a deadlock-free execution of such hybrid applications by implementing a cooperation mechanism between the MPI library and the parallel task-based runtime system.
TAMPI provides two main mechanisms: the blocking mode and the non-blocking mode. The blocking mode targets the efficient and safe execution of blocking MPI operations (e.g., MPI_Recv) from inside tasks, while the non-blocking mode focuses on the efficient execution of non-blocking or immediate MPI operations (e.g., MPI_Irecv), also from inside tasks.
On the one hand, TAMPI is currently compatible with two task-based programming model implementations: a derivative version of LLVM/OpenMP, and OmpSs-2. However, this derivative OpenMP does not support the full set of features provided by TAMPI: OpenMP programs can only make use of the non-blocking mode of TAMPI, whereas OmpSs-2 programs can leverage both the blocking and non-blocking modes.
On the other hand, TAMPI is compatible with mainstream MPI implementations that support the
MPI_THREAD_MULTIPLE threading level, which is the minimum requirement to provide its task-aware
features. The following sections describe in detail the blocking (OmpSs-2) and non-blocking
(OpenMP & OmpSs-2) modes of TAMPI.
Important Notice
The current library implements significant optimizations using a delegation technique. All communications are delegated to a polling task, so the MPI interface is mostly accessed by this single task. By delegating communications, we avoid threading contention at the MPI layer and achieve the MPI performance of single-threaded scenarios. Furthermore, another polling task handles the post-processing of tickets and calls into the tasking runtime system. We call this optimized library TAMPI-OPT.
Some features have been dropped and others are not yet supported in TAMPI-OPT. The following changes apply:
- TAMPI-OPT does not support request-based MPI operations. The following functions are no longer implemented: MPI_Wait, MPI_Waitall, TAMPI_Iwait and TAMPI_Iwaitall. Please use blocking operations (e.g., MPI_Recv, MPI_Send) or non-blocking TAMPI operations (e.g., TAMPI_Isend, TAMPI_Irecv). The latter do not provide a request and are recommended for performance. Check these variants in src/include/TAMPI_Wrappers.h.
- All point-to-point and collective operations are supported, except MPI_Sendrecv and MPI_Sendrecv_replace.
- Fortran applications are not supported.
- The documentation in the following sections may be outdated.
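As a minimal porting sketch under these constraints (the variable names are illustrative, and the fragment assumes an enclosing task and initialized MPI), a task that previously paired MPI_Irecv with MPI_Wait can call the blocking MPI_Recv instead:

```c
/* Before (not supported by TAMPI-OPT): request-based wait inside a task.
 *
 *   MPI_Request req;
 *   MPI_Irecv(buf, count, MPI_INT, src, tag, MPI_COMM_WORLD, &req);
 *   MPI_Wait(&req, MPI_STATUS_IGNORE);
 *
 * After: a plain blocking receive; TAMPI pauses the task (rather than
 * blocking the execution resource inside MPI) until the message arrives. */
MPI_Recv(buf, count, MPI_INT, src, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
```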
Blocking Mode (OmpSs-2)
The blocking mode of TAMPI targets the safe and efficient execution of blocking MPI operations (e.g., MPI_Recv) from inside tasks. This mode virtualizes the execution resources (e.g., hardware threads) of the underlying system when tasks call blocking MPI functions. When a task calls a blocking operation that cannot complete immediately, the task is paused, and the underlying execution resource is leveraged to execute other ready tasks until the MPI operation completes. This means that the execution resource is not blocked inside MPI. Once the operation completes, the paused task is resumed, eventually continuing its execution and returning from the blocking MPI call.
All this is done transparently to the user, meaning that all blocking MPI functions maintain the blocking semantics described in the MPI Standard. Also, this TAMPI mode ensures that the user application makes progress even when multiple communication tasks are executing blocking MPI operations.
This virtualization prevents applications from blocking all execution resources inside MPI (waiting for the completion of some operations), which could result in a deadlock due to the lack of progress. Thus, programmers can instantiate multiple communication tasks (that call blocking MPI functions) without needing to serialize them with dependencies, which would be necessary if this TAMPI mode were not enabled. In this way, communication tasks can run in parallel and their execution can be re-ordered by the task scheduler.
This mode provides support for the following set of blocking MPI operations:
- Blocking primitives: MPI_Recv, MPI_Send, MPI_Bsend, MPI_Rsend and MPI_Ssend.
- Blocking collectives: MPI_Gather, MPI_Scatter, MPI_Barrier, MPI_Bcast, MPI_Scatterv, etc.
- Waiters of a complete set of requests: MPI_Wait and MPI_Waitall.
- MPI_Waitany and MPI_Waitsome are not supported yet; the standard behavior is applied.
As stated previously, this mode is only supported by OmpSs-2.
Using the blocking mode
This library provides a header named TAMPI.h (or TAMPIf.h in Fortran). Apart from other
declarations and definitions, this header defines a new MPI level of thread support named
MPI_TASK_MULTIPLE, which is monotonically greater than the standard MPI_THREAD_MULTIPLE. To
activate this mode from an application, users must:
- Include the TAMPI.h header in C, or TAMPIf.h in Fortran.
- Initialize MPI with MPI_Init_thread, requesting the new MPI_TASK_MULTIPLE threading level.
The blocking TAMPI mode is considered activated once MPI_Init_thread successfully returns,
and the provided threading level is MPI_TASK_MULTIPLE. A valid and safe usage of TAMPI's blocking
mode is shown in the following OmpSs-2 + MPI example:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <TAMPI.h>

int main(int argc, char **argv)
{
	int provided;
	MPI_Init_thread(&argc, &argv, MPI_TASK_MULTIPLE, &provided);
	if (provided != MPI_TASK_MULTIPLE) {
		fprintf(stderr, "Error: MPI_TASK_MULTIPLE not supported!\n");
		return 1;
	}

	int rank;
	MPI_Comm_rank(MPI_COMM_WORLD, &rank);

	int *data = (int *) malloc(N * sizeof(int));
	// ...

	if (rank == 0) {
		for (int n = 0; n < N; ++n) {
			#pragma oss task in(data[n]) // T1
			{
				MPI_Ssend(&data[n], 1, MPI_INT, 1, n, MPI_COMM_WORLD);
				// Data buffer could already be reused
			}
		}
	} else if (rank == 1) {
		for (int n = 0; n < N; ++n) {
			#pragma oss task out(data[n]) // T2
			{
				MPI_Status status;
				MPI_Recv(&data[n], 1, MPI_INT, 0, n, MPI_COMM_WORLD, &status);
				check_status(&status);
				fprintf(stdout, "data[%d] = %d\n", n, data[n]);
			}
		}
	}
	#pragma oss taskwait

	//...

	MPI_Finalize();
	return 0;
}
In this example, the first MPI rank sends the data buffer of integers to the second rank
in messages of one single integer, while the second rank receives the integers and prints
them. On the one hand, the sender rank creates a task (T1) for sending each single-integer
MPI message with an input dependency on the corresponding position of the data buffer
(MPI_Ssend reads from data). Note that these tasks can run in parallel.
On the other hand, the receiver rank creates a task (T2) for receiving each integer.
Similarly, each task declares an output dependency on the corresponding position of the data
buffer (MPI_Recv writes on data). After receiving the integer, each task checks the status
of the MPI operation, and finally, it prints the received integer. These tasks can also
run in parallel.
This program would be incorrect when using the standard MPI_THREAD_MULTIPLE since it could
result in a deadlock situation depending on the task scheduling policy. Because of the
out-of-order execution of tasks in each rank, all available execution resources could end up
blocked inside MPI, hanging the application due to the lack of progress. However, with TAMPI's
MPI_TASK_MULTIPLE threading level, execution resources are prevented from blocking inside
blocking MPI calls and they can execute other ready tasks while communications are taking
place, guaranteeing application progress.
See the articles listed in the References section for more information.
Non-Blocking Mode (OpenMP & OmpSs-2)
The non-blocking mode of TAMPI focuses on the execution of non-blocking or immediate MPI operations from inside tasks. Like the blocking TAMPI mode, its objective is to allow the safe and efficient execution of multiple communication tasks in parallel, but without pausing these tasks.
The idea is to allow tasks to bind their completion to the finalization of one or more MPI requests. Thus, the completion of a task is delayed until (1) it finishes the execution of its body code and (2) all MPI requests that it bound during its execution complete. Notice that the completion of a task usually implies the release of its dependencies, the freeing of its data structures, etc.
For that reason, TAMPI defines two asynchronous and non-blocking functions named TAMPI_Iwait and TAMPI_Iwaitall, which have the same parameters as their standard synchronous counterparts MPI_Wait and MPI_Waitall, respectively. They bind the completion of the calling task to the finalization of the corresponding MPI requests.
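As a hedged OmpSs-2 sketch of this mechanism (buffer setup and MPI initialization are elided, the variable names are illustrative, and note that TAMPI-OPT no longer implements TAMPI_Iwait, as stated in the notice above):

```c
#pragma oss task out(buf[0;count]) // releases buf's dependency on completion
{
	MPI_Request request;
	MPI_Irecv(buf, count, MPI_INT, src, tag, MPI_COMM_WORLD, &request);
	// Bind the task's completion to the request and return immediately:
	// the task body finishes without blocking, but its dependencies are
	// released only once the receive actually completes.
	TAMPI_Iwait(&request, MPI_STATUS_IGNORE);
}
```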
