Weave
A state-of-the-art multithreading runtime: message-passing based, fast, scalable, ultra-low overhead
Weave, a state-of-the-art multithreading runtime
"Good artists borrow, great artists steal." -- Pablo Picasso
Weave (codenamed "Project Picasso") is a multithreading runtime for the Nim programming language.
It is continuously tested on Linux, MacOS and Windows for the following CPU architectures: x86, x86_64 and ARM64 with the C and C++ backends.
Weave aims to provide a composable, high-performance, ultra-low overhead and fine-grained parallel runtime that frees developers from the common worries of "are my tasks big enough to be parallelized?", "what should be my grain size?", "what if the time they take is completely unknown or different?" or "is parallel-for worth it if it's just a matrix addition? On what CPUs? What if it's exponentiation?".
Thorough benchmarks track Weave's performance against industry-standard runtimes in C, C++ and Cilk, on both task parallelism and data parallelism, with a variety of workloads:
- Compute-bound
- Memory-bound
- Load Balancing
- Runtime-overhead bound (i.e. trillions of tasks in a couple milliseconds)
- Nested parallelism
Benchmarks are drawn from recursive tree algorithms, finance, linear algebra, High Performance Computing and game simulations. In particular, Weave displays 3x to 10x lower overhead than Intel TBB and GCC OpenMP on overhead-bound benchmarks.
At the implementation level, Weave's unique feature is being based on message passing instead of traditional work-stealing with shared-memory deques.
⚠️ Disclaimer:
Only 1 out of 2 complex synchronization primitives was formally verified to be deadlock-free, and they were not submitted to an additional data-race detection tool to validate the implementation.
Furthermore, worker threads are state machines and were not formally verified either.
Weave does, however, limit synchronization to simple SPSC and MPSC channels, which greatly reduces the potential bug surface.
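To illustrate why SPSC channels keep that surface small, here is a minimal sketch of a bounded single-producer single-consumer ring buffer. This is not Weave's actual implementation, only an illustration of the key property: each index has exactly one writer thread, so no lock is needed.

```nim
# Illustrative SPSC ring buffer, NOT Weave's implementation.
# Only the producer advances `tail`; only the consumer advances `head`.
import std/atomics

const Capacity = 8  # power of two, so modulo is a cheap mask

type SpscChannel*[T] = object
  buf: array[Capacity, T]
  head: Atomic[int]  # read position, written only by the consumer
  tail: Atomic[int]  # write position, written only by the producer

proc trySend*[T](ch: var SpscChannel[T], item: T): bool =
  let tail = ch.tail.load(moRelaxed)
  if tail - ch.head.load(moAcquire) == Capacity:
    return false                      # channel is full
  ch.buf[tail and (Capacity - 1)] = item
  ch.tail.store(tail + 1, moRelease)  # publish the item to the consumer
  true

proc tryRecv*[T](ch: var SpscChannel[T], item: var T): bool =
  let head = ch.head.load(moRelaxed)
  if head == ch.tail.load(moAcquire):
    return false                      # channel is empty
  item = ch.buf[head and (Capacity - 1)]
  ch.head.store(head + 1, moRelease)  # free the slot for the producer
  true

when isMainModule:
  var ch: SpscChannel[int]
  doAssert ch.trySend(42)
  var x: int
  doAssert ch.tryRecv(x)
  doAssert x == 42
```

With a single writer per index, the only subtlety left is memory ordering (release on publish, acquire on read), which is far easier to audit than a shared work-stealing deque.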
Installation
Weave can be simply installed with:

```shell
nimble install weave
```

or, for the devel version:

```shell
nimble install weave@#master
```
Weave requires at least Nim v1.2.0
Changelog
The latest changes are available in the changelog file.
Demos
A raytracing demo is available, head over to demos/raytracing.

API
Task parallelism
Weave provides a simple API based on spawn/sync which works like async/await for IO-based futures.
The traditional parallel recursive Fibonacci would be written like this:
```nim
import weave

proc fib(n: int): int =
  # int64 on x86-64
  if n < 2:
    return n

  let x = spawn fib(n-1)
  let y = fib(n-2)
  result = sync(x) + y

proc main() =
  var n = 20

  init(Weave)
  let f = fib(n)
  exit(Weave)

  echo f

main()
```
Data parallelism
Weave provides nestable parallel for loops.
A nested matrix transposition would be written like this:
```nim
import weave

func initialize(buffer: ptr UncheckedArray[float32], len: int) =
  for i in 0 ..< len:
    buffer[i] = i.float32

proc transpose(M, N: int, bufIn, bufOut: ptr UncheckedArray[float32]) =
  ## Transpose a MxN matrix into a NxM matrix with nested for loops
  parallelFor j in 0 ..< N:
    captures: {M, N, bufIn, bufOut}
    parallelFor i in 0 ..< M:
      captures: {j, M, N, bufIn, bufOut}
      bufOut[j*M+i] = bufIn[i*N+j]

proc main() =
  let M = 200
  let N = 2000

  let input = newSeq[float32](M*N)
  # We can't work with seq directly as it's managed by GC, take a ptr to the buffer.
  let bufIn = cast[ptr UncheckedArray[float32]](input[0].unsafeAddr)
  bufIn.initialize(M*N)

  var output = newSeq[float32](N*M)
  let bufOut = cast[ptr UncheckedArray[float32]](output[0].addr)

  init(Weave)
  transpose(M, N, bufIn, bufOut)
  exit(Weave)

main()
```
Strided loops
You might want to use loops with a non-unit stride. This can be done with the following syntax:

```nim
import weave

init(Weave)

# expandMacros:
parallelForStrided i in 0 ..< 100, stride = 30:
  parallelForStrided j in 0 ..< 200, stride = 60:
    captures: {i}
    log("Matrix[%d, %d] (thread %d)\n", i, j, myID())

exit(Weave)
```
Complete list
We separate the list depending on the threading context
Root thread
The root thread is the thread that started the Weave runtime. It has special privileges.
- `init(Weave)`, `exit(Weave)` to start and stop the runtime. Forgetting this will give you nil pointer exceptions on `spawn`. The thread that calls `init` will become the root thread.
- `syncRoot(Weave)` is a global barrier. The root thread will not continue beyond it until all tasks in the runtime are finished.
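As a minimal sketch (the task proc is illustrative), `syncRoot` lets the root thread await fire-and-forget tasks for which no `Flowvar` was kept:

```nim
import weave

proc sayHello(id: int) =
  echo "hello from task ", id

proc main() =
  init(Weave)          # this thread becomes the root thread
  for i in 0 ..< 4:
    spawn sayHello(i)  # fire-and-forget: no Flowvar is awaited
  syncRoot(Weave)      # global barrier: wait until all tasks are finished
  exit(Weave)

main()
```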
Weave worker thread
A worker thread is automatically created per (logical) core on the machine. The root thread is also a worker thread. Worker threads are tuned to maximize throughput of computational tasks.
- `spawn fnCall(args)` which spawns a function that may run on another thread and gives you an awaitable `Flowvar` handle.
- `newFlowEvent`, `trigger`, `spawnOnEvent` and `spawnOnEvents` (experimental) to delay a task until some dependencies are met. This allows expressing precise data dependencies and producer-consumer relationships.
- `sync(Flowvar)` will await a `Flowvar` and block until you receive a result.
- `isReady(Flowvar)` will check if `sync` would actually block or return the result immediately.
- `syncScope` is a scope barrier. The thread will not move beyond the scope until all tasks and parallel loops spawned within it, and their descendants, are finished. `syncScope` is composable: it can be called by any thread and can be nested. It has the syntax of a block statement:

  ```nim
  syncScope():
    parallelFor i in 0 ..< N:
      captures: {a, b}
      parallelFor j in 0 ..< N:
        captures: {i, a, b}
        spawn foo()
  ```

  In this example, the thread encountering `syncScope` will create all the tasks for parallel loop `i`, will spawn `foo()` and then wait at the end of the scope. A thread blocked at the end of its scope is not idle: it still helps process all the work that exists and that may be created by the current tasks.
- `parallelFor`, `parallelForStrided`, `parallelForStaged`, `parallelForStagedStrided` are described above and in the experimental section.
- `loadBalance(Weave)` gives the runtime the opportunity to distribute work. Insert this within long computations: due to Weave's design, the busy workers are also in charge of load balancing. This is done automatically when using `parallelFor`.
- `isSpawned(Flowvar)` allows you to build speculative algorithms where a thread is spawned only if certain conditions are valid. See the `nqueens` benchmark for an example.
- `getThreadId(Weave)` returns a unique thread ID. The thread ID is in the range `0 ..< number of threads`.
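A sketch of the speculative pattern enabled by `isSpawned` (the depth threshold is an illustrative heuristic, not Weave's recommendation): a `Flowvar` is declared but only conditionally spawned into, and `isSpawned` later tells us whether there is anything to `sync`.

```nim
import weave

proc search(depth: int): int =
  ## Count the leaves of a binary tree of height `depth`,
  ## spawning a task only for large-enough subtrees.
  if depth == 0:
    return 1
  var lazy: Flowvar[int]
  if depth > 3:                  # illustrative threshold: only parallelize big subtrees
    lazy = spawn search(depth - 1)
  let right = search(depth - 1)
  if lazy.isSpawned():
    result = sync(lazy) + right  # a task was really spawned: await it
  else:
    result = search(depth - 1) + right

proc main() =
  init(Weave)
  let leaves = search(10)
  exit(Weave)
  echo leaves                    # a tree of height 10 has 2^10 = 1024 leaves

main()
```

This avoids paying task-creation overhead near the leaves of the recursion, where the work per task is tiny.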
The maximum number of worker threads can be configured with the environment variable `WEAVE_NUM_THREADS`
and defaults to your number of logical cores (including HyperThreading).
Weave uses Nim's `countProcessors()` from `std/cpuinfo`.
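For example, assuming a compiled program `./myapp` (the name is illustrative), the worker count can be overridden at launch:

```shell
# Use 4 Weave workers regardless of the machine's logical core count
WEAVE_NUM_THREADS=4 ./myapp
```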
Foreign thread & Background service (experimental)
Weave can also be run as a background service and process jobs similar to the Executor concept in C++.
Jobs will be processed in FIFO order.
Experimental: the distinction between `spawn`/`sync` on a Weave thread and `submit`/`waitFor` on a foreign thread may be removed in the future.
A background service can be started with either:
- `thr.runInBackground(Weave)`, or
- `thr.runInBackground(Weave, signalShutdown: ptr Atomic[bool])`

with `thr` an uninitialized `Thread[void]` or `Thread[ptr Atomic[bool]]`.
Then the f
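A hedged sketch of submitting a job from a foreign thread. The routines `setupSubmitterThread`, `waitUntilReady` and `teardownSubmitterThread` come from Weave's jobs API but are not described above, so treat their exact names and the shutdown sequence as assumptions:

```nim
import weave

proc double(x: int): int = 2 * x

proc main() =
  var thr: Thread[void]
  thr.runInBackground(Weave)    # Weave now runs as a background service

  setupSubmitterThread(Weave)   # assumption: register this foreign thread as a submitter
  waitUntilReady(Weave)         # assumption: block until the runtime is up

  let fut = submit double(21)   # jobs are processed in FIFO order
  echo waitFor(fut)             # blocks until the job's result is available

  teardownSubmitterThread(Weave)
  exit(Weave)                   # assumption: shuts down the background runtime

main()
```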
