MyIntrinsics++ (MIPP)

Purpose

MIPP is a portable, open-source wrapper (MIT license) for vector intrinsic functions (SIMD), written in C++11. It works with SSE, AVX, AVX-512, ARM NEON and SVE (work in progress) instructions. The MIPP wrapper supports single- and double-precision floating-point numbers as well as signed and unsigned integer arithmetic (64-bit, 32-bit, 16-bit and 8-bit).

With the MIPP wrapper you no longer need to write architecture-specific intrinsic code. Just use the provided functions and the wrapper will automatically generate the right intrinsic calls for your architecture.

If you are interested in the ARM SVE development status, please follow this link.

Short Documentation

Supported Compilers

At this time, MIPP has been tested on the following compilers:

Intel: icpc >= 16,
GNU: g++ >= 4.8,
Clang: clang++ >= 3.6,
Microsoft: msvc >= 14.

On msvc 14.10 (Microsoft Visual Studio 2017), performance is lower than with the other compilers because the compiler is not able to fully inline all the MIPP methods. This has been fixed in msvc 14.21 (Microsoft Visual Studio 2019), where you can expect high performance.

Install and Configure your Code

You don't have to install MIPP because it consists of simple C++ header files. The headers are located in the include folder (note that this location has changed since commit 6795891; before that, they were located in the src folder).

Just include the header in your source files wherever the wrapper is needed.

#include "mipp.h"

mipp.h uses a C++ namespace, mipp. If you do not want to prefix all the MIPP calls with mipp::, you can do this:

#include "mipp.h"
using namespace mipp;

Before compiling, remember to tell the compiler which vector instructions you want to use. For instance, with the GNU compiler (g++) you simply have to add the -march=native option for SSE- and AVX-compatible CPUs. For ARMv7 CPUs with NEON instructions you have to add the -mfpu=neon option (since most current NEONv1 instructions are not IEEE-754 compliant). This is no longer the case on ARMv8 processors, so the -march=native option works there too. MIPP also uses some features provided by C++11, so you have to add the -std=c++11 flag to compile the code. You are now ready to run your code with the MIPP wrapper.

If MIPP is installed on the system, it can be integrated into a CMake project in the standard way. Example:

# install MIPP
cd MIPP/
export MIPP_ROOT=$PWD/build/install
cmake -B build -DCMAKE_INSTALL_PREFIX=$MIPP_ROOT
cmake --build build -j5
cmake --install build

In your CMakeLists.txt:

# find the installation of MIPP on the system
find_package(MIPP REQUIRED)

# define your executable
add_executable(gemm gemm.cpp)

# link your executable to MIPP
target_link_libraries(gemm PRIVATE MIPP::mipp)

cd your_project/
# if MIPP is installed in a system standard path: MIPP will be found automatically with cmake
cmake -B build
# if MIPP is installed in a non-standard path: use CMAKE_PREFIX_PATH
cmake -B build -DCMAKE_PREFIX_PATH=$MIPP_ROOT

Generate Sources & Compile the Static Library

MIPP is mainly a header-only library. However, some macro operations require a small compiled library. This is particularly true for the compress operation, which relies on generated LUTs (lookup tables) stored in the static library.

To generate the source files containing these LUTs you need to install Python3 with the Jinja2 package:

sudo apt install python3 python3-pip
pip3 install --user -r codegen/requirements.txt

Then you can call the generator as follows:

python3 codegen/gen_compress.py

Finally, you can compile the MIPP static library:

cmake -B build -DMIPP_STATIC_LIB=ON
cmake --build build -j4

Note that compiling the static library is optional. If you choose not to compile it, only some macro operations will be missing.

Sequential Mode

By default, MIPP tries to recognize the instruction set from the preprocessor definitions. If MIPP can't match the instruction set (for instance when MIPP does not support the targeted instruction set), MIPP falls back on standard sequential instructions. In this mode, vectorization is no longer guaranteed, but the compiler can still perform auto-vectorization.

It is possible to force MIPP to use the sequential mode with the following compiler definition: -DMIPP_NO_INTRINSICS. This can be useful for debugging or for benchmarking code.

If you want to check the MIPP mode configuration, you can print the following global variable: mipp::InstructionFullType (std::string).
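To see what the preprocessor exposes for a given set of flags, the sketch below reports the widest SIMD extension advertised by the predefined macros of GCC and Clang. It only illustrates macro-based detection and is not MIPP's actual logic: `detected_isa` is a hypothetical helper, the list of macros is not exhaustive, and the handling of MIPP_NO_INTRINSICS simply mirrors the description above.

```cpp
#include <string>

// Hypothetical helper (not part of MIPP): name of the widest SIMD extension
// advertised by the compiler's predefined macros for the current flags.
inline std::string detected_isa() {
#if defined(MIPP_NO_INTRINSICS)
    return "sequential (forced)";   // mirrors -DMIPP_NO_INTRINSICS
#elif defined(__AVX512F__)
    return "AVX-512";
#elif defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE2__)
    return "SSE2";
#elif defined(__ARM_NEON) || defined(__ARM_NEON__)
    return "NEON";
#else
    return "sequential";            // no known SIMD macro was found
#endif
}
```

Printing `detected_isa()` next to mipp::InstructionFullType is a quick way to check that flags such as -march=native had the intended effect.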

Vector Register Declaration

Just use the mipp::Reg<T> type.

mipp::Reg<T> r1, r2, r3; // we have declared 3 vector registers

But we do not know the number of elements per register here. This number of elements can be obtained by calling the mipp::N<T>() function (T is a template parameter, it can be double, float, int64_t, uint64_t, int32_t, uint32_t, int16_t, uint16_t, int8_t or uint8_t type).

for (int i = 0; i < n; i += mipp::N<float>()) {
	// ...
}
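The loop above assumes that n is a multiple of mipp::N<float>(). When that is not guaranteed, a common pattern is to cover the largest multiple with full-width steps and finish the remainder element by element. The sketch below shows only the index arithmetic in plain C++: `lanes` stands in for the value returned by mipp::N<float>(), the inner loop stands in for one vector operation, and `scale_in_place` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: multiply every element of v by k, walking the data in
// steps of `lanes` elements (stand-in for mipp::N<float>()).
// Returns the number of elements handled by the scalar tail loop.
inline std::size_t scale_in_place(std::vector<float>& v, float k, std::size_t lanes) {
    assert(lanes > 0); // a register holds at least one element

    const std::size_t n = v.size();
    const std::size_t vec_end = n - (n % lanes); // largest multiple of lanes <= n

    // main loop: each iteration covers exactly one register worth of elements
    for (std::size_t i = 0; i < vec_end; i += lanes)
        for (std::size_t j = 0; j < lanes; ++j) // stand-in for one vector multiply
            v[i + j] *= k;

    // tail loop: the n % lanes leftover elements
    for (std::size_t i = vec_end; i < n; ++i)
        v[i] *= k;

    return n - vec_end;
}
```

With 10 elements and 4 lanes, the main loop covers indices 0 to 7 and the tail loop handles the last 2 elements.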

The number of elements per register directly depends on the precision of the data we are working on: for a given register width, a narrower type gives more elements.
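As a worked example of that dependence, the sketch below computes how many elements of a type fit in a register of a given width. `lanes` is a hypothetical helper, not part of the MIPP API; in real code mipp::N<T>() reports the value for the instruction set selected at compile time. The widths used are the usual ones: 128 bits for SSE and NEON, 256 bits for AVX, 512 bits for AVX-512.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of MIPP): number of elements of type T that
// fit in a vector register of `register_bits` bits.
template <typename T>
constexpr std::size_t lanes(std::size_t register_bits) {
    return register_bits / (8 * sizeof(T));
}

// 128-bit registers (SSE, NEON)
static_assert(lanes<float>(128)       ==  4, "4 x 32-bit in 128 bits");
static_assert(lanes<double>(128)      ==  2, "2 x 64-bit in 128 bits");
// 256-bit registers (AVX)
static_assert(lanes<float>(256)       ==  8, "8 x 32-bit in 256 bits");
static_assert(lanes<std::int16_t>(256) == 16, "16 x 16-bit in 256 bits");
// 512-bit registers (AVX-512)
static_assert(lanes<std::int8_t>(512) == 64, "64 x 8-bit in 512 bits");
```

This is also why the register diagrams in this document show four elements: they correspond to float data in a 128-bit register.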

Register `load` and `store` Instructions

Loading data from a std::vector into a register:

int n = mipp::N<float>() * 10;
std::vector<float> myVector(n);
int i = 0;
mipp::Reg<float> r1;
r1.load(&myVector[i*mipp::N<float>()]);

The last two lines can be shortened as follows, where the load call becomes implicit:

mipp::Reg<float> r1 = &myVector[i*mipp::N<float>()];

Store can be done with the store(...) method:

int n = mipp::N<float>() * 10;
std::vector<float> myVector(n);
int i = 0;
mipp::Reg<float> r1 = &myVector[i*mipp::N<float>()];

// do something with r1

r1.store(&myVector[(i+1)*mipp::N<float>()]);

By default the loads and stores work on unaligned memory. It is possible to control this behavior with the -DMIPP_ALIGNED_LOADS definition: when specified, the loads and stores work on aligned memory by default. In the aligned memory mode, it is still possible to perform unaligned memory operations with the mipp::loadu and mipp::storeu functions. However, it is not possible to perform aligned loads and stores in the unaligned memory mode.
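An aligned load or store requires the address to be a multiple of the register size in bytes (16 for 128-bit registers, 32 for 256-bit registers, 64 for 512-bit registers). A minimal sketch of that check, where `is_aligned` is a hypothetical helper and not part of MIPP:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of MIPP): true if ptr is a multiple of
// `alignment` bytes, which is what an aligned load/store expects.
inline bool is_aligned(const void* ptr, std::size_t alignment) {
    assert(alignment != 0);
    return reinterpret_cast<std::uintptr_t>(ptr) % alignment == 0;
}
```

Note that a pointer into the middle of an aligned buffer is only aligned at offsets that are multiples of the register size, so a check like this can be handy in debug builds before enabling -DMIPP_ALIGNED_LOADS.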

To allocate aligned data you can use the MIPP aligned memory allocator wrapped into the mipp::vector class. mipp::vector is fully compatible with the standard std::vector class and can be used everywhere you can use std::vector.

mipp::vector<float> myVector(n);

Register Initialization

You can initialize a vector register from a scalar value:

mipp::Reg<float> r1; // r1 = | unknown | unknown | unknown | unknown |
r1 = 1.0;            // r1 = |    +1.0 |    +1.0 |    +1.0 |    +1.0 |

Or from an initializer list (std::initializer_list):

mipp::Reg<float> r1;       // r1 = | unknown | unknown | unknown | unknown |
r1 = {1.0, 2.0, 3.0, 4.0}; // r1 = |    +1.0 |    +2.0 |    +3.0 |    +4.0 |

Computational Instructions

Add two vector registers:

mipp::Reg<float> r1, r2, r3;

r1 = 1.0;     // r1 = | +1.0 | +1.0 | +1.0 | +1.0 |
r2 = 2.0;     // r2 = | +2.0 | +2.0 | +2.0 | +2.0 |

r3 = r1 + r2; // r3 = | +3.0 | +3.0 | +3.0 | +3.0 |

Subtract two vector registers:

mipp::Reg<float> r1, r2, r3;

r1 = 1.0;     // r1 = | +1.0 | +1.0 | +1.0 | +1.0 |
r2 = 2.0;     // r2 = | +2.0 | +2.0 | +2.0 | +2.0 |

r3 = r1 - r2; // r3 = | -1.0 | -1.0 | -1.0 | -1.0 |

Multiply two vector registers:

mipp::Reg<float> r1, r2, r3;

r1 = 1.0;     // r1 = | +1.0 | +1.0 | +1.0 | +1.0 |
r2 = 2.0;     // r2 = | +2.0 | +2.0 | +2.0 | +2.0 |

r3 = r1 * r2; // r3 = | +2.0 | +2.0 | +2.0 | +2.0 |

Divide two vector registers:

mipp::Reg<float> r1, r2, r3;

r1 = 1.0;     // r1 = | +1.0 | +1.0 | +1.0 | +1.0 |
r2 = 2.0;     // r2 = | +2.0 | +2.0 | +2.0 | +2.0 |

r3 = r1 / r2; // r3 = | +0.5 | +0.5 | +0.5 | +0.5 |

Fused multiply and add of three vector registers:

mipp::Reg<float> r1, r2, r3, r4;

r1 = 2.0;                     // r1 = | +2.0 | +2.0 | +2.0 | +2.0 |
r2 = 3.0;                     // r2 = | +3.0 | +3.0 | +3.0 | +3.0 |
r3 = 1.0;                     // r3 = | +1.0 | +1.0 | +1.0 | +1.0 |

// r4 = (r1 * r2) + r3
r4 = mipp::fmadd(r1, r2, r3); // r4 = | +7.0 | +7.0 | +7.0 | +7.0 |

Fused negative multiply and add of three vector registers:

mipp::Reg<float> r1, r2, r3, r4;

r1 = 2.0;                      // r1 = | +2.0 | +2.0 | +2.0 | +2.0 |
r2 = 3.0;                      // r2 = | +3.0 | +3.0 | +3.0 | +3.0 |
r3 = 1.0;                      // r3 = | +1.0 | +1.0 | +1.0 | +1.0 |

// r4 = -(r1 * r2) + r3
r4 = mipp::fnmadd(r1, r2, r3); // r4 = | -5.0 | -5.0 | -5.0 | -5.0 |
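For reference, the per-element arithmetic of the two fused operations above can be written in scalar C++ with std::fma. This is only a sketch of the semantics, with hypothetical names `fmadd_ref` and `fnmadd_ref`; whether the vector operation is truly fused (a single rounding) depends on the target instruction set, which is an assumption not stated in this document.

```cpp
#include <cmath>

// Scalar reference for one element of mipp::fmadd: (a * b) + c
inline float fmadd_ref(float a, float b, float c) {
    return std::fma(a, b, c);
}

// Scalar reference for one element of mipp::fnmadd: -(a * b) + c
inline float fnmadd_ref(float a, float b, float c) {
    return std::fma(-a, b, c);
}
```

With the values used above (a = 2, b = 3, c = 1), these give 7 and -5 respectively, matching the register diagrams.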

Square root of a vector register:

mipp::Reg<float> r1, r2;

r1 = 9.0;             // r1 = | +9.0 | +9.0 | +9.0 | +9.0 |

r2 = mipp::sqrt(r1);  // r2 = | +3.0 | +3.0 | +3.0 | +3.0 |

Reciprocal square root of a vector register (be careful: this intrinsic exists only for single-precision floating-point numbers):

mipp::Reg<float> r1, r2;

r1 = 9.0;             // r1 = | +9.0 | +9.0 | +9.0 | +9.0 |

r2 = mipp::rsqrt(r1); // r2 = | +0.3 | +0.3 | +0.3 | +0.3 |
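The +0.3 shown above is 1/3 rounded to one decimal. A scalar reference for one element is sketched below; `rsqrt_ref` is a hypothetical name. On several instruction sets the hardware reciprocal square root is an approximation, so when comparing a vector result against this reference it is safer to use a tolerance than an exact equality (this precaution is an assumption, it is not stated in this document).

```cpp
#include <cassert>
#include <cmath>

// Scalar reference for one element of a reciprocal square root: 1 / sqrt(x)
inline float rsqrt_ref(float x) {
    assert(x > 0.0f); // the result is only finite for strictly positive inputs
    return 1.0f / std::sqrt(x);
}
```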

Selections

Select the minimum between two vector registers:

mipp::Reg<float> r1, r2, r3;

r1 = 2.0;               // r1 = | +2.0 | +2.0 | +2.0 | +2.0 |
r2 = 3.0;               // r2 = | +3.0 | +3.0 | +3.0 | +3.0 |
