MaPLe (MPL)
The MaPLe compiler: efficient and scalable parallel functional programming
MaPLe is a functional language for provably efficient and safe multicore parallelism.
Features:
- Support for the full Standard ML programming language, extended with task-parallel and data-parallel primitives.
- Native performance on both x86 and Arm architectures.
- Whole-program compilation based on MLton, with aggressive optimizations to achieve performance competitive with languages such as C/C++.
- Efficient memory representations, including:
  - Untagged and unboxed native integers and floating-point numbers.
  - Flattened tuples and records.
  - Native arrays with contiguous unboxed elements.
- Simple and fast foreign-function calls into C, based on MLtonFFI.
- Support for both regular and irregular fine-grained parallelism, with provably efficient automatic parallelism management [7] to control the overheads of task creation.
- Provably efficient parallel garbage collection based on hierarchical memory management and disentanglement [1,2,3,4,5,6].
- Support for large core counts and large memory sizes. MPL scales to hundreds of cores and can efficiently handle heap sizes of 1TB or more.
MPL is being actively developed. If you are interested in contributing to the project, PRs are welcome!
If you are interested in using MPL, consider checking out the tutorial. You might also be interested in exploring mpllib (a library for MPL) and the Parallel ML benchmark suite.
References
[<a name="wfra24">7</a>] Automatic Parallelism Management. Sam Westrick, Matthew Fluet, Mike Rainey, and Umut A. Acar. POPL 2024.
[<a name="awa23">6</a>] Efficient Parallel Functional Programming with Effects. Jatin Arora, Sam Westrick, and Umut A. Acar. PLDI 2023.
[<a name="waa22">5</a>] Entanglement Detection with Near-Zero Cost. Sam Westrick, Jatin Arora, and Umut A. Acar. ICFP 2022.
[<a name="awa21">4</a>] Provably Space-Efficient Parallel Functional Programming. Jatin Arora, Sam Westrick, and Umut A. Acar. POPL 2021.
[<a name="wyfa20">3</a>] Disentanglement in Nested-Parallel Programs. Sam Westrick, Rohan Yadav, Matthew Fluet, and Umut A. Acar. POPL 2020.
[<a name="gwraf18">2</a>] Hierarchical Memory Management for Mutable State. Adrien Guatto, Sam Westrick, Ram Raghunathan, Umut A. Acar, and Matthew Fluet. PPoPP 2018.
[<a name="rmab16">1</a>] Hierarchical Memory Management for Parallel Programs. Ram Raghunathan, Stefan K. Muller, Umut A. Acar, and Guy Blelloch. ICFP 2016.
Try It Out
Instructions for installing MPL natively (on Linux or Mac) are further below.
If you want to quickly try out using MPL, you can download the Docker image and run one of the examples.
```
$ docker pull shwestrick/mpl
$ docker run -it shwestrick/mpl /bin/bash
...# examples/bin/primes @mpl procs 4 --
```
To write and compile your own code, we recommend
mounting a local directory inside the container. For example, here's how you
can use MPL to compile and run your own main.mlb in the current directory.
(To mount some other directory, replace $(pwd -P) with a different path.)
```
$ ls
main.mlb
$ docker run -it -v $(pwd -P):/root/mycode shwestrick/mpl /bin/bash
...# cd /root/mycode
...# mpl main.mlb
...# ./main @mpl procs 4 --
```
Benchmark Suite
The Parallel ML benchmark suite provides many examples of sophisticated parallel algorithms and applications in MPL, as well as cross-language performance comparisons with C++, Go, Java, and multicore OCaml.
Libraries and Projects
We recommend using the smlpkg package manager. MaPLe supports the full SML language, so existing libraries for SML can be used.
Here are a few libraries and projects that make use of MaPLe for parallelism:
- github.com/MPLLang/mpllib: grab-bag library with a variety of data structures (sequences, sets, dictionaries, graphs, matrices, meshes, images, etc.) and parallel algorithms (map, reduce, scan, filter, sorting, search, tokenization, graph processing, computational geometry, etc.). Also includes basic utilities (e.g. parsing command-line arguments) and benchmarking infrastructure.
- github.com/shwestrick/sml-audio: a library for audio processing and I/O support for .wav files.
- github.com/shwestrick/sml-matrix-market: a library for parsing MatrixMarket files.
- github.com/shwestrick/mpl-seam-carve: image resizing by seam carving, demonstrating parallelism in MaPLe.
- github.com/shwestrick/mpl-wc: a simple clone of the Unix wc utility, demonstrating parallelism in MaPLe.
- github.com/shwestrick/mpl-1brc: an example solution to the 1 Billion Row Challenge, demonstrating some features of MaPLe.
- github.com/shwestrick/mpl-tinykaboom: a port of a port (Futhark) of the KABOOM (C++) tiny graphics program.
- github.com/UmutAcarLab/grafeyn: a hybrid Feynman-Schrödinger pure quantum simulator written in MaPLe.
Parallel and Concurrent Extensions
MaPLe extends SML with a number of primitives for parallelism and concurrency.
Take a look at examples/ to see these primitives in action.
The ForkJoin Structure
```sml
val par: (unit -> 'a) * (unit -> 'b) -> 'a * 'b
val parfor: int -> (int * int) -> (int -> unit) -> unit
val alloc: int -> 'a array
val parform: (int * int) -> (int -> unit) -> unit
val reducem: ('a * 'a -> 'a) -> 'a -> (int * int) -> (int -> 'a) -> 'a
```
The par primitive takes two functions to execute in parallel and
returns their results.
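For example, the two recursive calls of a naive Fibonacci can run in parallel with par (a small sketch, for illustration only; real code would sequentialize small inputs):

```sml
(* Naive parallel Fibonacci: the two recursive calls execute in parallel. *)
fun fib n =
  if n <= 1 then n
  else
    let
      val (a, b) = ForkJoin.par (fn _ => fib (n - 1), fn _ => fib (n - 2))
    in
      a + b
    end
```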
The parfor primitive is a "parallel for loop". It takes a grain-size g, a
range (i, j), and a function f, and executes f(k) in parallel for each
i <= k < j. The grain-size g is for manual granularity
control: parfor splits the input range into approximately (j-i)/g subranges,
each of size at most g, and each subrange is processed sequentially. The
grain-size must be at least 1; with a grain-size of 1, the loop is "fully parallel".
Note: This function should only be used if a reasonable grain size can be
passed as argument. This is often cumbersome.
In general, we recommend using parform instead (described below).
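As a sketch of manual granularity control, here is parfor squaring every element of an array in place, with an arbitrarily chosen grain of 1000:

```sml
(* Square each element of arr in place. Subranges of up to 1000
 * elements are each processed sequentially. *)
fun squareAll (arr: int array) =
  ForkJoin.parfor 1000 (0, Array.length arr)
    (fn i => Array.update (arr, i, Array.sub (arr, i) * Array.sub (arr, i)))
```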
The alloc primitive takes a length and returns a fresh, uninitialized array
of that size. Warning: To guarantee no errors, the programmer must be
careful to initialize the array before reading from it. alloc is intended to
be used as a low-level primitive in the efficient implementation of
high-performance libraries. It is integrated with the scheduler and memory
management system to perform allocation in parallel and be safe-for-GC.
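A common pattern pairs alloc with a parallel loop to build a fresh array, writing every index before any read. A minimal sketch of a parallel tabulate:

```sml
(* Parallel tabulate: allocate an uninitialized array, then fill
 * every index in parallel before the array is ever read. *)
fun tabulate (n: int) (f: int -> 'a) : 'a array =
  let
    val arr = ForkJoin.alloc n
  in
    ForkJoin.parform (0, n) (fn i => Array.update (arr, i, f i));
    arr
  end
```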
The parform primitive is a parallel for loop, similar to parfor above,
except with no grain parameter. The parallelism of the loop is automatically
managed. In general, we recommend using parform instead of parfor.
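Following this recommendation, the same kind of loop written with parform needs no grain parameter:

```sml
(* Increment each element of arr; the loop's parallelism is
 * managed automatically, with no manual grain size. *)
fun incrAll (arr: int array) =
  ForkJoin.parform (0, Array.length arr)
    (fn i => Array.update (arr, i, Array.sub (arr, i) + 1))
```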
The reducem primitive performs an automatically managed parallel reduction.
It takes a "sum" function c, a "zero" element z, a range (i,j), and
a function f, and computes the "sum" of [f(i), ..., f(j-1)] in parallel
with respect to c. For example:
- `reducem op+ 0 (0, Array.length a) (fn i => Array.sub (a, i))` computes the sum of an array `a`.
- `reducem Real.max Real.negInf (0, n) (fn i => f (Real.fromInt i / Real.fromInt n))` samples a function `f: real -> real` at `n` evenly-spaced locations in the range `[0.0, 1.0]` to find the maximum value.
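Another illustrative sketch: a parallel dot product of two real arrays (assumed to have equal length) is a one-liner with reducem:

```sml
(* Dot product of two equal-length real arrays, reduced with +. *)
fun dot (a: real array, b: real array) : real =
  ForkJoin.reducem op+ 0.0 (0, Array.length a)
    (fn i => Array.sub (a, i) * Array.sub (b, i))
```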
The MLton.Parallel Structure
```sml
val compareAndSwap: 'a ref -> ('a * 'a) -> 'a
val arrayCompareAndSwap: ('a array * int) -> ('a * 'a) -> 'a
val fetchAndAdd: int ref -> int -> int
val arrayFetchAndAdd: int array * int -> int -> int
```
compareAndSwap r (x, y) performs an atomic
CAS
which attempts to atomically swap the contents of r from x to y,
returning the original value stored in r before the CAS.
Polymorphic equality is determined
in the same way as MLton.eq, which is a
standard equality check for simple types (char, int, word, etc.) and
a pointer equality check for other types (array, string, tuples, datatypes,
etc.). The semantics are a bit murky.
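As an illustration, a classic CAS retry loop can implement an atomic increment on an int ref (a sketch; for plain counters, fetchAndAdd below is simpler):

```sml
(* Atomically increment r: retry the CAS until the cell still holds
 * the value we observed, i.e. no other thread raced our update. *)
fun atomicIncr (r: int ref) =
  let
    val old = !r
  in
    if MLton.Parallel.compareAndSwap r (old, old + 1) = old
    then ()
    else atomicIncr r
  end
```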
arrayCompareAndSwap (a, i) (x, y) behaves the same as compareAndSwap but
on arrays instead of references. This performs a CAS at index i of array
a, and does not read or write at any other locations of the array.
fetchAndAdd r d performs an atomic fetch-and-add,
which atomically retrieves the value of the specified memory cell r and
adds d to it, returning the original value stored in r before the update.
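For example, fetchAndAdd can maintain a shared counter across a parallel loop (a sketch combining it with parform from above):

```sml
(* Count the even elements of arr using a shared atomic counter.
 * fetchAndAdd returns the counter's previous value, ignored here. *)
fun countEvens (arr: int array) : int =
  let
    val count = ref 0
  in
    ForkJoin.parform (0, Array.length arr) (fn i =>
      if Array.sub (arr, i) mod 2 = 0
      then ignore (MLton.Parallel.fetchAndAdd count 1)
      else ());
    !count
  end
```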
arrayFetchAndAdd (a, i) d behaves the same as fetchAndAdd but
on arrays instead of references. This performs an atomic fetch-and-add at
index i of array a, and does not read or write at any other locations of the array.