TransPimLib: A Library for Efficient Transcendental Functions on Processing-in-Memory Systems

Processing-in-memory (PIM) promises to alleviate the data movement bottleneck in modern computing systems. However, current real-world PIM systems have the inherent disadvantage that their hardware is more constrained than in conventional processors (CPU, GPU), due to the difficulty and cost of building processing elements near or inside the memory. As a result, general-purpose PIM architectures support fairly limited instruction sets and struggle to execute complex operations such as transcendental functions and other hard-to-calculate operations (e.g., square root). These operations are particularly important for some modern workloads, e.g., activation functions in machine learning applications.

TransPimLib is a library that supports transcendental (and other hard-to-calculate) functions in general-purpose PIM systems. It provides CORDIC-based and LUT-based methods for trigonometric functions, hyperbolic functions, exponentiation, logarithm, square root, etc. The first implementation of TransPimLib is for the UPMEM PIM architecture.

Citation

Please cite the following papers if you find this repository useful.

ISPASS2023 paper version:

Maurus Item, Juan Gómez-Luna, Yuxin Guo, Geraldo F. Oliveira, Mohammad Sadrosadati, and Onur Mutlu, "TransPimLib: Efficient Transcendental Functions for Processing-in-Memory Systems". 2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2023.

BibTeX entry for citation:

@inproceedings{item2023transpimlibispass,
  title={{TransPimLib: Efficient Transcendental Functions for Processing-in-Memory Systems}}, 
  author={Maurus Item and Juan Gómez-Luna and Yuxin Guo and Geraldo F. Oliveira and Mohammad Sadrosadati and Onur Mutlu},
  year={2023},
  booktitle = {ISPASS}
}

arXiv paper version:

Maurus Item, Juan Gómez-Luna, Yuxin Guo, Geraldo F. Oliveira, Mohammad Sadrosadati, and Onur Mutlu, "TransPimLib: A Library for Efficient Transcendental Functions on Processing-in-Memory Systems". arXiv:2304.01951 [cs.MS], 2023.

BibTeX entry for citation:

@misc{item2023transpimlib,
  title={{TransPimLib: A Library for Efficient Transcendental Functions on Processing-in-Memory Systems}}, 
  author={Maurus Item and Juan Gómez-Luna and Yuxin Guo and Geraldo F. Oliveira and Mohammad Sadrosadati and Onur Mutlu},
  year={2023},
  howpublished={arXiv:2304.01951 [cs.MS]}
}

Installation

Prerequisites

Using TransPimLib requires installing the UPMEM SDK. This implementation of the library is designed to run on a server with real UPMEM modules, but it can also run on the functional simulator included in the UPMEM SDK.

Getting Started

Clone the repository:

git clone https://github.com/CMU-SAFARI/transpimlib.git
cd transpimlib

Repository Structure

The repository structure, with some important folders and files, is shown below. The repository also includes run_*.sh scripts to run the experiments in the paper.

.
+-- LICENSE
+-- README.md
+-- run_strong_full.py
+-- run_strong_rank.py
+-- run_weak.py
+-- benchmarks/
|   +-- blackscholes/
|   |   +-- parsec/
|   +-- sigmoid/
|   +-- softmax/
|   +-- makefile
|   +-- polynomial.c
|   +-- run_benchmarks.sh
+-- dpu/
+-- host/
+-- microbenchmarks/
|   +-- dpu/
|   +-- host/
|   +-- makefile
|   +-- run_extension_performance_sin.sh
|   +-- run_extension_performance.sh
|   +-- run_method_performance_sin.sh
|   +-- run_method_performance.sh
|   +-- run_method_setup_sin.sh
|   +-- run_method_setup.sh
+-- validation/
|   +-- dpu/
|   +-- host/
|   +-- makefile

Usage

Here is a minimal example of how to use TransPimLib in your code. On the host side:

#include <stdio.h>
#include <dpu.h>
#include "lut_ldexpf_host.c" // <<--- Add this

#define DPU_BINARY "dpu"

int main(void) {
    struct dpu_set_t set, dpu;

    // Get a DPU and load our kernel on it
    DPU_ASSERT(dpu_alloc(1, NULL, &set));
    DPU_ASSERT(dpu_load(set, DPU_BINARY, NULL));

    // Get some number and move it over to the DPU
    float number = 0.5f;
    DPU_ASSERT(dpu_broadcast_to(set, "number", 0, &number, sizeof(float), DPU_XFER_DEFAULT));

    // Fill the tables on the DPU
    broadcast_tables(set); // <<--- Add this

    // Run the DPU Kernel
    DPU_ASSERT(dpu_launch(set, DPU_SYNCHRONOUS));

    // Retrieve the result
    DPU_FOREACH(set, dpu) {
        DPU_ASSERT(dpu_copy_from(dpu, "number", 0, &number, sizeof(float)));
        printf("Result of sin() on dpu: %f\n", number);
    }

    // Release the DPU
    DPU_ASSERT(dpu_free(set));

    return 0;
}

and on the DPU side:

#include "lut_ldexpf_interpolate.c" // <<--- Add this

__host float number; // Same type and size as on the host side

int main(){
    number = sinf(number); // <<--- Now you can use transcendental functions!
    return 0;
}

TransPimLib's Methods

TransPimLib contains different implementation methods with different memory requirements, host setup time, accuracy, and performance.

Recommended Methods

cordic.c Has the smallest table sizes
cordic_lut.c Has medium table sizes (size independent of precision, but bigger tables mean fewer cycles)
lut_ldexpf_interpolate.c Has large table sizes but runs significantly faster
lut_direct_ldexpf.c Same as lut_ldexpf_interpolate.c but with non-linear spacing that might be beneficial for machine learning activation functions

Prototype Methods

lut_multi_interpolate.c Same as lut_ldexpf_interpolate.c but slower
lut_direct.c Same as lut_direct_ldexpf.c but without any table values for small numbers

Non-interpolated Methods

lut_ldexpf.c Can be run with lut_ldexpf_interpolate.c (same on host side) and is faster but less accurate
lut_multi.c Can be run with lut_multi_interpolate.c (same on host side) and is faster but less accurate

Not all methods have all functions, because some functions rely on some parts of the implementation. Functions in brackets () have a limited range.

| Method                     | sinf | cosf | tanf | sinhf | coshf | tanhf | expf | logf | sqrtf | gelu |
|----------------------------|------|------|------|-------|-------|-------|------|------|-------|------|
| cordic.c                   | x    | x    | x    | (x)   | (x)   | (x)   | x    | x    | x     |      |
| cordic_lut.c               | x    | x    | x    | (x)   | (x)   | (x)   | x    |      |       |      |
| lut_ldexpf_interpolate.c   | x    | x    | x    |       |       |       | x    | x    | x     |      |
| lut_direct_ldexpf.c        |      |      |      |       |       | x     |      |      |       | x    |
| lut_multi_interpolate.c    | x    | x    | x    |       |       |       | x    | x    | x     |      |
| lut_direct.c               | x    | x    | x    |       |       |       | x    | x    | x     |      |
| lut_ldexpf.c               | x    | x    | x    |       |       |       | x    | x    | x     |      |
| lut_multi.c                |      |      |      |       |       | x     |      |      |       | x    |

Check the paper for explanations and use cases.

Customization

Precision

By default, each implementation uses the highest precision that does not yet incur diminishing performance returns. To change the precision for all methods, add #define PRECISION xyz before the include. Alternatively, if extra precision is needed for only one or two functions, change the define in the corresponding implementation section. This change must be made in both the host and the DPU code! E.g., in lut_ldexpf_interpolate.c change the precision define to some value:

...
#define SIN_COS_TAN_PRECISION 16 // This needs to match on CPU and DPU side!
...

and the same change of value needs to be applied in lut_ldexpf_host.c.
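
For the global override, the define goes before the include on both sides. The fragment below is a minimal sketch, assuming the host program includes lut_ldexpf_host.c and the DPU program includes lut_ldexpf_interpolate.c as in the usage example; the value 12 is only illustrative, and the two parts belong in two separate source files.

```c
/* ---- host program (excerpt) ---- */
#define PRECISION 12               /* illustrative value; must match the DPU side */
#include "lut_ldexpf_host.c"

/* ---- DPU program (excerpt, separate file) ---- */
#define PRECISION 12               /* same value as on the host side */
#include "lut_ldexpf_interpolate.c"
```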

Approximate table sizes for LUT-based implementations (i.e., all methods except cordic.c):

| Precision | Table Size | Memory Requirement (with 6 tables, as in lut_ldexpf_interpolate.c) |
|-----------|------------|--------------------------------------------------------------------|
| 6         | 256 bytes  | 1.5 KiB                                                            |
| 8         | 1 KiB      | 6 KiB                                                              |
| 10        | 4 KiB      | 24 KiB (recommended size)                                          |
| 12        | 16 KiB     | 96 KiB                                                             |
| 14        | 64 KiB     | 384 KiB                                                            |
| 16        | 256 KiB    | 1.5 MiB                                                            |
| 18        | 1 MiB      | 6 MiB                                                              |
| 20        | 4 MiB      | 24 MiB                                                             |
| 22        | 16 MiB     | 96 MiB                                                             |
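
The sizes listed above are consistent with each table holding 2^precision entries of 4 bytes each. That relationship is inferred from the numbers, not taken from the library source, and the helper names below are hypothetical. A small sketch for estimating the footprint of other configurations:

```c
#include <stddef.h>

/* Bytes for one table, assuming 2^precision entries of 4 bytes each
   (an inference from the listed sizes, not from the library source). */
static size_t lut_table_bytes(unsigned precision) {
    return ((size_t)1 << precision) * 4u;
}

/* Bytes for num_tables tables of the same precision. */
static size_t lut_total_bytes(unsigned precision, unsigned num_tables) {
    return lut_table_bytes(precision) * num_tables;
}
```

For example, lut_total_bytes(10, 6) evaluates to 24576 bytes, i.e., the 24 KiB listed for precision 10 with six tables.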

MRAM / WRAM

We suggest storing the LUTs in MRAM, as the performance gain from storing them in WRAM is small. To change this, there is one define per table on the DPU side. E.g., in lut_ldexpf_interpolate.c set:

...
#define SIN_COS_TAN_STORE_IN_WRAM 1
...

Adding a New Function

New functions can be integrated into TransPimLib with the following three major steps:

On the host side, add new code to calculate the lookup tables, and then transfer them over.

In all methods, there is a function called fill_table or similar, which creates a lookup table. This function just needs to be given the right inputs. Then, the generated table needs to be transferred to the PIM side.

All methods are split into sections that do everything needed for one particular function. For example, in lut_ldexpf_host.c there is a sec
