Weave, a state-of-the-art multithreading runtime

"Good artists borrow, great artists steal." -- Pablo Picasso

Weave (codenamed "Project Picasso") is a multithreading runtime for the Nim programming language.

It is continuously tested on Linux, macOS and Windows, on the x86, x86_64 and ARM64 CPU architectures, with the C and C++ backends.

Weave aims to provide a composable, high-performance, ultra-low overhead and fine-grained parallel runtime that frees developers from the common worries of "are my tasks big enough to be parallelized?", "what should be my grain size?", "what if the time they take is completely unknown or different?" or "is parallel-for worth it if it's just a matrix addition? On what CPUs? What if it's exponentiation?".

Thorough benchmarks track Weave's performance against industry-standard runtimes written in C, C++ and Cilk, on both task parallelism and data parallelism, across a variety of workloads:

Compute-bound
Memory-bound
Load Balancing
Runtime-overhead bound (i.e. trillions of tasks in a couple milliseconds)
Nested parallelism

Benchmarks are drawn from recursive tree algorithms, finance, linear algebra, High Performance Computing and game simulations. In particular, Weave displays 3x to 10x less overhead than Intel TBB and GCC OpenMP on overhead-bound benchmarks.

At the implementation level, Weave's unique feature is that it is based on message passing instead of traditional work-stealing with shared-memory deques.

⚠️ Disclaimer:

Only 1 of the 2 complex synchronization primitives was formally verified to be deadlock-free. Neither was submitted to an additional data race detection tool to ensure proper implementation.

Furthermore, worker threads are state machines and were not formally verified either.

Weave does, however, limit synchronization to simple SPSC and MPSC channels, which greatly reduces the potential bug surface.

Installation

Weave can be simply installed with

nimble install weave

or for the devel version

nimble install weave@#master

Weave requires at least Nim v1.2.0

Changelog

The latest changes are available in the changelog.md file.

Demos

A raytracing demo is available in demos/raytracing.

API

Task parallelism

Weave provides a simple API based on spawn/sync which works like async/await for IO-based futures.

The traditional parallel recursive Fibonacci would be written like this:

import weave

proc fib(n: int): int =
  # int64 on x86-64
  if n < 2:
    return n

  let x = spawn fib(n-1)
  let y = fib(n-2)

  result = sync(x) + y

proc main() =
  var n = 20

  init(Weave)
  let f = fib(n)
  exit(Weave)

  echo f

main()

Data parallelism

Weave provides nestable parallel for loops.

A nested matrix transposition would be written like this:

import weave

func initialize(buffer: ptr UncheckedArray[float32], len: int) =
  for i in 0 ..< len:
    buffer[i] = i.float32

proc transpose(M, N: int, bufIn, bufOut: ptr UncheckedArray[float32]) =
  ## Transpose a MxN matrix into a NxM matrix with nested for loops

  parallelFor j in 0 ..< N:
    captures: {M, N, bufIn, bufOut}
    parallelFor i in 0 ..< M:
      captures: {j, M, N, bufIn, bufOut}
      bufOut[j*M+i] = bufIn[i*N+j]

proc main() =
  let M = 200
  let N = 2000

  var input = newSeq[float32](M*N)
  # We can't work with seq directly as it's managed by GC, take a ptr to the buffer.
  # `input` is a `var` because initialize() writes through this pointer.
  let bufIn = cast[ptr UncheckedArray[float32]](input[0].addr)
  bufIn.initialize(M*N)

  var output = newSeq[float32](N*M)
  let bufOut = cast[ptr UncheckedArray[float32]](output[0].addr)

  init(Weave)
  transpose(M, N, bufIn, bufOut)
  exit(Weave)

main()

Strided loops

Loops with a non-unit stride can be written with the following syntax.

import weave

init(Weave)

# expandMacros:
parallelForStrided i in 0 ..< 100, stride = 30:
  parallelForStrided j in 0 ..< 200, stride = 60:
    captures: {i}
    log("Matrix[%d, %d] (thread %d)\n", i, j, myID())

exit(Weave)
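
To make the iteration space concrete, here is a minimal serial sketch of which indices a strided loop is expected to visit, assuming `parallelForStrided` follows the usual convention of starting at the lower bound and stepping by `stride` while staying below the exclusive upper bound. `stridedIndices` is a hypothetical helper written for this illustration, not part of Weave.

```nim
# Hypothetical helper (not a Weave API): lists the indices of a strided
# loop over `start ..< stop`, assuming the usual start + k*stride rule.
proc stridedIndices(start, stop, stride: int): seq[int] =
  for i in countup(start, stop - 1, stride):
    result.add i

echo stridedIndices(0, 100, 30)  # @[0, 30, 60, 90]
echo stridedIndices(0, 200, 60)  # @[0, 60, 120, 180]
```

Under that assumption, the nested example above would log 4 x 4 = 16 (i, j) pairs, in an order that depends on scheduling.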

Complete list

The list is split by threading context.

Root thread

The root thread is the thread that started the Weave runtime. It has special privileges.

init(Weave), exit(Weave) to start and stop the runtime. Forgetting this will give you nil pointer exceptions on spawn.
The thread that calls init will become the root thread.
syncRoot(Weave) is a global barrier. The root thread will not continue past it until all tasks in the runtime are finished.
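
As an illustration of the root-thread calls listed above, the following minimal sketch spawns a few tasks that return nothing and uses syncRoot(Weave) to wait for all of them. The names `completed`, `work` and `runAll` are hypothetical, and the atomic counter exists only so the effect of the barrier can be observed.

```nim
import weave
import std/atomics

# Hypothetical global counter, used only to observe task completion.
var completed: Atomic[int]

proc work(id: int) =
  # Stand-in for real work: just record that the task ran.
  discard completed.fetchAdd(1)

proc runAll(n: int): int =
  completed.store(0)
  init(Weave)              # the calling thread becomes the root thread
  for i in 0 ..< n:
    spawn work(i)          # no Flowvar is returned for a proc without result
  syncRoot(Weave)          # global barrier: all tasks are finished after this
  result = completed.load()
  exit(Weave)

let done = runAll(8)
echo done
```

Because the tasks return no value there is no Flowvar to sync on, which is where a global barrier is useful. On Nim 1.x this sketch presumably needs to be compiled with `--threads:on`.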

Weave worker thread

A worker thread is automatically created per (logical) core on the machine. The root thread is also a worker thread. Worker threads are tuned to maximize throughput of computational tasks.

spawn fnCall(args) spawns a function that may run on another thread and gives you an awaitable Flowvar handle.
newFlowEvent, trigger, spawnOnEvent and spawnOnEvents (experimental) to delay a task until some dependencies are met. This allows expressing precise data dependencies and producer-consumer relationships.
sync(Flowvar) will await a Flowvar and block until you receive a result.
isReady(Flowvar) will check if sync will actually block or return the result immediately.
syncScope is a scope barrier. The thread will not move beyond the scope until all tasks and parallel loops spawned in it, and their descendants, are finished. syncScope is composable: it can be called by any thread and it can be nested. It has the syntax of a block statement:
```
syncScope():
  parallelFor i in 0 ..< N:
    captures: {a, b}
    parallelFor j in 0 ..< N:
      captures: {i, a, b}
  spawn foo()
```
In this example, the thread encountering syncScope will create all the tasks for parallel loop i, will spawn foo() and will then wait at the end of the scope. A thread blocked at the end of its scope is not idle: it still helps process the existing work and any work that may be created by the current tasks.
parallelFor, parallelForStrided, parallelForStaged, parallelForStagedStrided are described above and in the experimental section.
loadBalance(Weave) gives the runtime the opportunity to distribute work. Insert this within long computations: by design, Weave's busy workers are also the ones in charge of load balancing. This is done automatically when using parallelFor.
isSpawned(Flowvar) allows you to build speculative algorithms where a task is spawned only if certain conditions are valid. See the nqueens benchmark for an example.
getThreadId(Weave) returns a unique thread ID. The thread ID is in the range 0 ..< number of threads.
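
The event-based entries in the list above can be sketched as a small producer-consumer pair. This is a minimal sketch that assumes the `spawnOnEvent event, fnCall(args)` form; `payload`, `observed`, `produce` and `consume` are hypothetical names, and the atomics are there only to make the ordering observable.

```nim
import weave
import std/atomics

var payload: Atomic[int]    # hypothetical: written by the producer
var observed: Atomic[int]   # hypothetical: what the consumer saw

proc produce(ready: FlowEvent) =
  payload.store(42)
  ready.trigger()           # signal that the payload is now available

proc consume() =
  # Scheduled only once `ready` has been triggered.
  observed.store(payload.load())

proc main() =
  init(Weave)
  let ready = newFlowEvent()
  spawnOnEvent ready, consume()   # delayed until the event fires
  spawn produce(ready)            # fires the event when its work is done
  syncRoot(Weave)                 # wait for both tasks
  exit(Weave)

main()
echo observed.load()
```

The consumer is registered before the producer is spawned, yet it should only run after the trigger, which is how a data dependency is expressed without blocking a thread on it.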

The maximum number of worker threads can be configured with the environment variable WEAVE_NUM_THREADS and defaults to your number of logical cores (including HyperThreading). Weave uses Nim's countProcessors() from std/cpuinfo.
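
The rule above can be mirrored in a few lines of standard-library Nim, for instance to log how many workers a program should expect. This is a sketch of the documented behaviour under the assumption that the variable holds a plain integer; it is not Weave's internal code, and `expectedWorkerCount` is a hypothetical helper.

```nim
import std/[cpuinfo, os, strutils]

proc expectedWorkerCount(): int =
  ## Hypothetical helper, not part of Weave's API.
  ## Uses WEAVE_NUM_THREADS when set, else the number of logical cores.
  let fromEnv = getEnv("WEAVE_NUM_THREADS")
  if fromEnv.len > 0:
    parseInt(fromEnv)      # raises ValueError if the value is not an integer
  else:
    countProcessors()

echo "Expected worker threads: ", expectedWorkerCount()
```

Setting the variable in the shell before launching the program, rather than from inside it, avoids any dependence on when the runtime reads it.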

Foreign thread & Background service (experimental)

Weave can also be run as a background service and process jobs similar to the Executor concept in C++. Jobs will be processed in FIFO order.

Experimental: The distinction between spawn/sync on a Weave thread and submit/waitFor on a foreign thread may be removed in the future.

A background service can be started with either:

thr.runInBackground(Weave)
or thr.runInBackground(Weave, signalShutdown: ptr Atomic[bool])

with thr an uninitialized Thread[void] or Thread[ptr Atomic[bool]]

Then the f
