Paralight
A lightweight parallelism library for indexed structures in Rust
Paralight: a lightweight parallelism library for indexed structures
This library allows you to distribute computation over indexed sources
(slices, ranges, Vec, etc.) among multiple
threads. It aims to uphold the highest standards of documentation, testing and
safety; see the FAQ below.
It is designed to be as lightweight as possible, following the principles outlined in the blog post Optimization adventures: making a parallel Rust workload 10x faster with (or without) Rayon. Benchmarks on a real-world use case can be seen here.
use paralight::prelude::*;

// Create a thread pool with the given parameters.
let mut thread_pool = ThreadPoolBuilder {
    num_threads: ThreadCount::AvailableParallelism,
    range_strategy: RangeStrategy::WorkStealing,
    cpu_pinning: CpuPinningPolicy::No,
}
.build();

// Compute the sum of a slice.
let input = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10];
let sum = input
    .par_iter()
    .with_thread_pool(&mut thread_pool)
    .sum::<i32>();
assert_eq!(sum, 5 * 11);

// Add slices together.
let mut output = [0; 10];
let left = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10];
let right = [11, 12, 13, 14, 15, 16, 17, 18, 19, 20];
(output.par_iter_mut(), left.par_iter(), right.par_iter())
    .zip_eq()
    .with_thread_pool(&mut thread_pool)
    .for_each(|(out, &a, &b)| *out = a + b);
assert_eq!(output, [12, 14, 16, 18, 20, 22, 24, 26, 28, 30]);
Paralight supports various indexed sources out-of-the-box (slices,
ranges, etc.), and can be extended to other types via the
ParallelSource trait, together with the conversion
traits (IntoParallelSource,
IntoParallelRefSource and
IntoParallelRefMutSource).
Thread pool configuration
The ThreadPoolBuilder provides an explicit way
to configure your thread pool, giving you fine-grained control over performance
for your workload. There is no default, which is deliberate because the suitable
parameters depend on your workload.
Number of worker threads
Paralight allows you to specify the number of worker threads to spawn in a
thread pool with the ThreadCount enum:
- `AvailableParallelism` uses the number of threads returned by the standard library's `available_parallelism()` function,
- `Count(_)` uses the specified number of threads, which must be non-zero.
For convenience, ThreadCount implements the
TryFrom<usize> trait to create a
Count(_) instance, validating that the given
number of threads is not zero.
Recommendation: It depends. While
AvailableParallelism may be a
good default, it usually returns twice the number of CPU cores (at least on
Intel) to account for
hyper-threading. Whether this
is optimal or not depends on your workload, for example whether it's compute
bound or memory bound, whether a single thread can saturate the resources of one
core or not, etc. Generally, the long list of caveats mentioned in the
documentation of available_parallelism()
applies.
On some workloads, hyper-threading doesn't provide a performance boost over
using only one thread per core, because two hyper-threads would compete on
resources on the core they share (e.g. memory caches). In this case, using half
of what available_parallelism() returns
can reduce contention and perform better.
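As a rough illustration of the "half of `available_parallelism()`" idea, here is a minimal sketch using only the standard library. The helper names `one_thread_per_core` and `suggested_thread_count` are hypothetical, and the factor of two assumes two hyper-threads per physical core, which may not hold on your hardware.

```rust
use std::num::NonZeroUsize;
use std::thread::available_parallelism;

/// Halves a logical CPU count, never going below one thread.
/// Assumption: two hyper-threads share each physical core.
fn one_thread_per_core(logical_cpus: NonZeroUsize) -> NonZeroUsize {
    NonZeroUsize::new(logical_cpus.get() / 2).unwrap_or(NonZeroUsize::MIN)
}

/// Queries the standard library and halves the result,
/// falling back to a single thread if the query fails.
fn suggested_thread_count() -> NonZeroUsize {
    available_parallelism()
        .map(one_thread_per_core)
        .unwrap_or(NonZeroUsize::MIN)
}
```

The resulting value could then be converted with the `TryFrom<usize>` implementation of `ThreadCount` mentioned above. Whether halving actually helps is something to measure on your own workload.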
If your program is not running alone on your machine but is competing with other programs, using too many threads can also be detrimental to the overall performance of your system.
Work-stealing strategy
Paralight offers two strategies in the RangeStrategy
enum to distribute computation among threads:
- `Fixed` splits the input evenly and hands out a fixed sequential range of items to each worker thread,
- `WorkStealing` starts with the fixed distribution, but lets each worker thread steal items from others once it is done processing its items.
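To make the fixed distribution concrete, the sketch below splits an index range into contiguous, nearly equal sub-ranges, one per thread. It is an illustration only: `fixed_ranges` is a hypothetical helper, and Paralight's internal splitting may differ in its details.

```rust
use std::ops::Range;

/// Splits `0..len` into `num_threads` contiguous ranges whose lengths
/// differ by at most one item. Hypothetical helper, not Paralight's code.
fn fixed_ranges(len: usize, num_threads: usize) -> Vec<Range<usize>> {
    assert!(num_threads > 0, "at least one thread is required");
    (0..num_threads)
        // Thread `i` gets the items from `i * len / n` up to `(i + 1) * len / n`.
        .map(|i| (i * len / num_threads)..((i + 1) * len / num_threads))
        .collect()
}
```

With 10 items and 3 threads, this yields the ranges `0..3`, `3..6` and `6..10`. In work-stealing mode, a thread that finishes its range early would additionally take items from the ranges of the others.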
Recommendation: If your pipeline performs roughly the same amount of
work for each item, you should probably use the
Fixed strategy, to avoid paying the
synchronization cost of work-stealing. This is especially true if the amount of
work per item is small (e.g. some simple arithmetic operations). If the amount
of work per item is highly variable and/or large (e.g. parsing strings,
processing files), you should probably use the
WorkStealing strategy.
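The trade-off can be illustrated with a toy cost model. This is a sketch under simplifying assumptions: per-item costs are known up front, synchronization is free, and stealing is approximated by an idealized scheduler in which the first idle thread takes the next item. Both function names are hypothetical and neither reflects Paralight's actual implementation.

```rust
/// Time until the slowest thread finishes when each thread processes a
/// fixed contiguous range of items.
fn fixed_makespan(costs: &[u64], num_threads: usize) -> u64 {
    assert!(num_threads > 0, "at least one thread is required");
    let n = costs.len();
    (0..num_threads)
        .map(|i| {
            let range = (i * n / num_threads)..((i + 1) * n / num_threads);
            costs[range].iter().sum::<u64>()
        })
        .max()
        .unwrap_or(0)
}

/// Idealized dynamic scheduling: each item goes to the thread that becomes
/// idle first (a stand-in for work-stealing, ignoring its overhead).
fn dynamic_makespan(costs: &[u64], num_threads: usize) -> u64 {
    assert!(num_threads > 0, "at least one thread is required");
    let mut finish_times = vec![0u64; num_threads];
    for &cost in costs {
        // Pick the thread with the earliest finish time so far.
        let idle = (0..num_threads)
            .min_by_key(|&i| finish_times[i])
            .unwrap();
        finish_times[idle] += cost;
    }
    finish_times.into_iter().max().unwrap_or(0)
}
```

With uniform costs both models finish at the same time, so the extra synchronization of work-stealing buys nothing. With one expensive item among cheap ones, the fixed split leaves one thread with most of the work while the other sits idle.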
Note: In work-stealing mode, each thread processes an arbitrary subset of
items in arbitrary order, meaning that a reduction operation must be both
commutative and
associative to yield a
deterministic result (in contrast to the standard library's
Iterator trait that processes items in sequential
order). Fortunately, a lot of common operations are commutative and associative,
but be mindful of this.
use paralight::prelude::*;

let mut thread_pool = ThreadPoolBuilder {
    num_threads: ThreadCount::AvailableParallelism,
    range_strategy: RangeStrategy::WorkStealing,
    cpu_pinning: CpuPinningPolicy::No,
}
.build();

let s = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
    .par_iter()
    .with_thread_pool(&mut thread_pool)
    .map(|c: &char| c.to_string())
    .reduce(String::new, |mut a: String, b: String| {
        a.push_str(&b);
        a
    });
// ⚠️ There is no guarantee that this check passes. In practice, `s` contains any permutation
// of the input, such as "fgdebachij".
assert_eq!(s, "abcdefghij");

// This makes sure the example panics anyway if the permutation is (by luck) the identity.
panic!("Congratulations, you won the lottery and the assertion passed this time!");
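The effect shown above can be reproduced without any threads. The following standard-library sketch combines the same partial results in two different visiting orders: string concatenation (associative but not commutative) gives different outputs, while summing lengths (associative and commutative) does not. The helper names are hypothetical.

```rust
/// Concatenates the chunks in the given visiting order.
/// Concatenation is associative but not commutative.
fn concat_in_order(chunks: &[&str], order: &[usize]) -> String {
    order.iter().map(|&i| chunks[i]).collect()
}

/// Sums the chunk lengths in the given visiting order.
/// Addition is both associative and commutative.
fn total_len_in_order(chunks: &[&str], order: &[usize]) -> usize {
    order.iter().map(|&i| chunks[i].len()).sum()
}
```

For the chunks `["abc", "de", "fghij"]`, visiting them in order `[0, 1, 2]` or `[2, 0, 1]` yields two different concatenations but the same total length of 10.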
CPU pinning
Paralight allows pinning each worker thread to one CPU, on platforms that
support it. For now, this is implemented for platforms whose
target_os
is among android, dragonfly, freebsd, linux (platforms that support
libc::sched_setaffinity() via the nix crate)
and windows (using
SetThreadAffinityMask()
via the windows-sys crate).
Paralight offers three policies in the
CpuPinningPolicy enum:
- `No` doesn't pin worker threads to CPUs,
- `IfSupported` attempts to pin each worker thread to a distinct CPU on supported platforms, but proceeds without pinning if running on an unsupported platform or if the pinning function fails,
- `Always` pins each worker thread to a distinct CPU, panicking if the platform isn't supported or if the pinning function returns an error.
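The difference between the three policies comes down to how a pinning failure is handled. The sketch below models that decision with a hypothetical `PinningPolicy` enum and a `try_pin` closure standing in for the platform-specific call; it mirrors the behavior described above but is not Paralight's own code.

```rust
/// Hypothetical mirror of the three policies described above.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum PinningPolicy {
    No,
    IfSupported,
    Always,
}

/// Applies `policy`, where `try_pin` is a stand-in for the platform-specific
/// pinning call. Returns whether the thread ended up pinned.
fn apply_policy(
    policy: PinningPolicy,
    try_pin: impl FnOnce() -> Result<(), String>,
) -> bool {
    match policy {
        // Never attempt to pin.
        PinningPolicy::No => false,
        // Best effort: a failure is silently tolerated.
        PinningPolicy::IfSupported => try_pin().is_ok(),
        // Strict: a failure aborts with a panic.
        PinningPolicy::Always => match try_pin() {
            Ok(()) => true,
            Err(e) => panic!("failed to pin worker thread: {e}"),
        },
    }
}
```

A strict policy is useful when pinning is a correctness or benchmarking requirement, while the best-effort variant suits code that must also run on unsupported platforms.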
Recommendation: Whether CPU pinning is useful or detrimental depends on your
workload. If you're processing the same data over and over again (e.g. calling
par_iter() multiple times on the same
data), CPU pinning can help ensure that each subset of the data is always
processed on the same CPU core and stays fresh in the lower-level per-core
caches, speeding up memory accesses. This however depends on the amount of data:
if it's too large, it may not fit in per-core caches anyway.
If your program is not running alone on your machine but is competing with other
programs, CPU pinning may be detrimental, as a worker thread will be blocked
whenever its required core is used by another program, even if another core is
free and other worker threads are done (especially with the
Fixed strategy). This of course depends on
how the scheduler works on your OS.
Using a thread pool
To create parallel pipelines, be mindful that the [`with
