moodycamel::ConcurrentQueue<T>

An industrial-strength lock-free queue for C++.

Note: If all you need is a single-producer, single-consumer queue, I have [one of those too][spsc].

Features

Knock-your-socks-off [blazing fast performance][benchmarks].
Single-header implementation. Just drop it in your project.
Fully thread-safe lock-free queue. Use concurrently from any number of threads.
C++11 implementation -- elements are moved (instead of copied) where possible.
Templated, obviating the need to deal exclusively with pointers -- memory is managed for you.
No artificial limitations on element types or maximum count.
Memory can be allocated once up-front, or dynamically as needed.
Fully portable (no assembly; all is done through standard C++11 primitives).
Supports super-fast bulk operations.
Includes a low-overhead blocking version (BlockingConcurrentQueue).
Exception safe.

Reasons to use

There are not that many full-fledged lock-free queues for C++. Boost has one, but it's limited to objects with trivial assignment operators and trivial destructors, for example. Intel's TBB queue isn't lock-free, and requires trivial constructors too. There're many academic papers that implement lock-free queues in C++, but usable source code is hard to find, and tests even more so.

This queue not only has less limitations than others (for the most part), but [it's also faster][benchmarks]. It's been fairly well-tested, and offers advanced features like bulk enqueueing/dequeueing (which, with my new design, is much faster than one element at a time, approaching and even surpassing the speed of a non-concurrent queue even under heavy contention).

In short, there was a lock-free queue shaped hole in the C++ open-source universe, and I set out to fill it with the fastest, most complete, and well-tested design and implementation I could. The result is moodycamel::ConcurrentQueue :-)

Reasons not to use

The fastest synchronization of all is the kind that never takes place. Fundamentally, concurrent data structures require some synchronization, and that takes time. Every effort was made, of course, to minimize the overhead, but if you can avoid sharing data between threads, do so!

Why use concurrent data structures at all, then? Because they're gosh darn convenient! (And, indeed, sometimes sharing data concurrently is unavoidable.)

My queue is not linearizable (see the next section on high-level design). The foundations of its design assume that producers are independent; if this is not the case, and your producers co-ordinate amongst themselves in some fashion, be aware that the elements won't necessarily come out of the queue in the same order they were put in relative to the ordering formed by that co-ordination (but they will still come out in the order they were put in by any individual producer). If this affects your use case, you may be better off with another implementation; either way, it's an important limitation to be aware of.

My queue is also not NUMA aware, and does a lot of memory re-use internally, meaning it probably doesn't scale particularly well on NUMA architectures; however, I don't know of any other lock-free queue that is NUMA aware (except for [SALSA][salsa], which is very cool, but has no publicly available implementation that I know of).

Finally, the queue is not sequentially consistent; there is a happens-before relationship between when an element is put in the queue and when it comes out, but other things (such as pumping the queue until it's empty) require more thought to get right in all eventualities, because explicit memory ordering may have to be done to get the desired effect. In other words, it can sometimes be difficult to use the queue correctly. This is why it's a good idea to follow the [samples][samples.md] where possible. On the other hand, the upside of this lack of sequential consistency is better performance.

High-level design

Elements are stored internally using contiguous blocks instead of linked lists for better performance. The queue is made up of a collection of sub-queues, one for each producer. When a consumer wants to dequeue an element, it checks all the sub-queues until it finds one that's not empty. All of this is largely transparent to the user of the queue, however -- it mostly just works<sup>TM</sup>.

One particular consequence of this design, however, (which seems to be non-intuitive) is that if two producers enqueue at the same time, there is no defined ordering between the elements when they're later dequeued. Normally this is fine, because even with a fully linearizable queue there'd be a race between the producer threads and so you couldn't rely on the ordering anyway. However, if for some reason you do extra explicit synchronization between the two producer threads yourself, thus defining a total order between enqueue operations, you might expect that the elements would come out in the same total order, which is a guarantee my queue does not offer. At that point, though, there semantically aren't really two separate producers, but rather one that happens to be spread across multiple threads. In this case, you can still establish a total ordering with my queue by creating a single producer token, and using that from both threads to enqueue (taking care to synchronize access to the token, of course, but there was already extra synchronization involved anyway).

I've written a more detailed [overview of the internal design][blog], as well as [the full nitty-gritty details of the design][design], on my blog. Finally, the [source][source] itself is available for perusal for those interested in its implementation.

Basic use

The entire queue's implementation is contained in one header, [concurrentqueue.h][concurrentqueue.h]. Simply download and include that to use the queue. The blocking version is in a separate header, [blockingconcurrentqueue.h][blockingconcurrentqueue.h], that depends on [concurrentqueue.h][concurrentqueue.h] and [lightweightsemaphore.h][lightweightsemaphore.h]. The implementation makes use of certain key C++11 features, so it requires a relatively recent compiler (e.g. VS2012+ or g++ 4.8; note that g++ 4.6 has a known bug with std::atomic and is thus not supported). The algorithm implementations themselves are platform independent.

Use it like you would any other templated queue, with the exception that you can use it from many threads at once :-)

Simple example:

#include "concurrentqueue.h"

moodycamel::ConcurrentQueue<int> q;
q.enqueue(25);

int item;
bool found = q.try_dequeue(item);
assert(found && item == 25);

Description of basic methods:

ConcurrentQueue(size_t initialSizeEstimate) Constructor which optionally accepts an estimate of the number of elements the queue will hold
enqueue(T&& item) Enqueues one item, allocating extra space if necessary
try_enqueue(T&& item) Enqueues one item, but only if enough memory is already allocated
try_dequeue(T& item) Dequeues one item, returning true if an item was found or false if the queue appeared empty

Note that it is up to the user to ensure that the queue object is completely constructed before being used by any other threads (this includes making the memory effects of construction visible, possibly via a memory barrier). Similarly, it's important that all threads have finished using the queue (and the memory effects have fully propagated) before it is destructed.

There's usually two versions of each method, one "explicit" version that takes a user-allocated per-producer or per-consumer token, and one "implicit" version that works without tokens. Using the explicit methods is almost always faster (though not necessarily by a huge factor). Apart from performance, the primary distinction between them is their sub-queue allocation behaviour for enqueue operations: Using the implicit enqueue methods causes an automatically-allocated thread-local producer sub-queue to be allocated. Explicit producers, on the other hand, are tied directly to their tokens' lifetimes (but are recycled internally).

In order to avoid the number of sub-queues growing without bound, implicit producers are marked for reuse once their thread exits. However, this is not supported on all platforms. If using the queue from short-lived threads, it is recommended to use explicit producer tokens instead.

Full API (pseudocode):

# Allocates more memory if necessary
enqueue(item) : bool
enqueue(prod_token, item) : bool
enqueue_bulk(item_first, count) : bool
enqueue_bulk(prod_token, item_first, count) : bool

# Fails if not enough memory to enqueue
try_enqueue(item) : bool
try_enqueue(prod_token, item) : bool
try_enqueue_bulk(item_first, count) : bool
try_enqueue_bulk(prod_token, item_first, count) : bool

# Attempts to dequeue from the queue (never allocates)
try_dequeue(item&) : bool
try_dequeue(cons_token, item&) : bool
try_dequeue_bulk(item_first, max) : size_t
try_dequeue_bulk(cons_token, item_first, max) : size_t

# If you happen to know which producer you want to dequeue from
try_dequeue_from_producer(prod_token, item&) : bool
try_dequeue_bulk_from_producer(prod_token, item_first, max) : size_t

# A not-necessarily-accurate count of the total number of elements
size_approx() : size_t

Blocking version

As mentioned above, a full blocking wrapper of the queue is provided that adds wait_dequeue and wait_dequeue_bulk methods in addition to the regular interface. This wrapper is extremely low-overhead, but slightly less fast than the non-blocking queue (due to the necessary bookkeeping involving a lightweight semaphore).

There are also timed versions that allow a timeout to be specifi

Concurrentqueue

Install / Use

README