PyParallel
Experimental multicore fork of Python 3
Introduction
--
PyParallel is an experimental, proof-of-concept fork of Python 3.3.5 designed to optimally exploit contemporary hardware: multiple CPU cores, fast SSDs, NUMA architectures, and fast I/O channels (10GbE, Thunderbolt, etc.). It presents a solution for removing the limitation of the Python Global Interpreter Lock (GIL) without needing to actually remove it at all.
The code changes required to the interpreter are relatively unobtrusive, all existing semantics such as reference counting and garbage collection remain unchanged, the new mental model required in order to write PyParallel-safe code is very simple (don't persist parallel objects), the single-thread overhead is negligible, and, most desirably, performance scales linearly with cores.
Disclaimer
--
PyParallel is, first and foremost, an experiment. It is not currently suitable for production. It is a product of trial-and-error, intended to shape the discussions surrounding the next generation of Python. We attempt to juggle the difficult task of setting the stage for Python to flourish over the next 25 years, without discarding all the progress we made in the last 25.
PyParallel was created by an existing Python committer with the intention of eventually merging it back into the mainline. It is not a hostile fork. There are many details that still need to be ironed out. It will need to prove itself as an independent project first before it could be considered for inclusion back in the main source tree. We anticipate this being at least 5 years out, and think Python 4.x would be a more realistic target than Python 3.x.
5 years sounds like a long time, however, it will come and go, just like any other. We may as well start the ball rolling now. There's nothing wrong with slow and steady as long as you're heading in the right direction. And it's not like we're getting fewer cores each year.
Expectations need to be set reasonably, and we encourage the Python community to bias toward yes rather than no, with a view toward the long-term benefits of such a project. Early adopters and technical evaluators will need thick skin, a pioneering spirit, and a hearty sense of adventure. You will definitely hit a __debugbreak() or two if you're doing things right. But hey, you'll be able to melt all your cores in the process, and that's kinda fun.
We encourage existing committers to play around and experiment, to fork and to send pull requests. One of the benefits of PyParallel at the moment is the freedom to experiment without the constraints that come with normal mainline development, where much more discipline is required per commit. It provides a nice change of pace and helps get the creative juices flowing. There is also a lot of low-hanging fruit, ripe for picking by Computer Science and Software Engineering students that want to get their feet wet with open source.
Catalyst
--
PyParallel and asyncio share the same origin. They were both products of an innocuous e-mail to python-ideas in September 2012 titled asyncore: included batteries don't fit. The general discussion centered around providing better asynchronous I/O primitives in Python 3.4. PyParallel took the wildly ambitious (and at the time, somewhat ridiculous) path of trying to solve both asynchronous I/O and the parallel problem at the same time. The efforts paid off, as we consider the whole experiment to be a success (at least in terms of its original goals), but it is a much longer term project, alluded to above.
Note: the parallel facilities provided by PyParallel are actually complementary to the single-threaded event loop facilities provided by asyncio. In fact, we envision hybrid solutions emerging that use asyncio to drive the parallel facilities, with the main thread dispatching requests to parallel servers behind the scenes, acting as the coordinator for parallel computation.
Overview
--
We expose a new parallel module to Python user code which must be used in order to leverage the new parallel execution facilities. Specifically, users implement completion-oriented protocol classes, then register them with PyParallel TCP/IP client or server objects.
```python
import parallel

class Hello:
    def connection_made(self, transport, data):
        return b'Hello, World!\r\n'

    def data_received(self, transport, data):
        return b'You said: ' + data + b'\r\n'

server = parallel.server('0.0.0.0', 8080)
parallel.register(transport=server, protocol=Hello)
parallel.run()
```
The protocol callbacks are automatically executed in parallel. This is achieved by creating parallel contexts for each client socket that connects to a server. The parallel context owns the underlying socket object, and all memory allocations required during callback execution are served from the context's heap, a simple block allocator. If a callback needs to send data back to the client, it must return a sendable object: bytes, bytearray, or str. (That is, callbacks do not explicitly call read() and write() methods against the transport directly.) If the contents of a file need to be returned, transport.sendfile() can be used. Byte ranges can also be efficiently returned via transport.ranged_sendfile(). Both of these methods serve file content directly from the kernel's cache via TransmitFile.
GIL Semantics Unchanged
--
The semantic behavior of the "main thread" (the current thread holding the GIL) is unchanged. Instead, we introduce the notion of a parallel thread (or parallel context), and parallel objects, which are PyObjects allocated from parallel contexts. We provide an efficient way to detect if we're in a parallel thread via the Py_PXCTX() macro, as well as a way to detect if a PyObject was allocated from a parallel thread via Py_ISPX(ob). Using only these two facilities, we are able to intercept all thread-sensitive parts of the interpreter and redirect to our new parallel alternative if necessary. (The GIL entry and exit macros, Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS respectively, also get ignored in parallel contexts.)
The One New Restriction: Don't Persist Parallel Objects
--
We introduce one new restriction that will affect existing Python code (and C extensions): don't persist parallel objects. More explicitly, don't cache objects that were created during parallel callback processing.
For the CPython interpreter internals (in C), this means avoiding the following: freelist manipulation, first-use static PyObject initialization, and unicode interning. On the Python side, this means avoiding mutation of global state (or, more specifically, avoiding mutation of Python objects that were allocated from the main thread; don't append to a main thread list or assign to a main thread dict from a parallel thread).
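A sketch of what the restriction looks like in protocol code (class names are our own invention; only the class definitions are shown here, so this also runs under stock CPython — under PyParallel they would be registered via parallel.register()):

```python
cache = []   # a main-thread object, created at module load time

class Unsafe:
    def data_received(self, transport, data):
        # DON'T: appending to a main-thread list from a parallel
        # context persists a parallel object past the callback.
        cache.append(data)
        return b'cached\r\n'

class Safe:
    def data_received(self, transport, data):
        # DO: read main-thread state freely, allocate whatever is
        # needed, and return a sendable object; nothing allocated
        # here escapes the callback's parallel context.
        return b'You said: ' + data + b'\r\n'
```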
Reference Counting and Garbage Collection in Parallel Contexts
--
We approached the problem of reference counting and garbage collection within parallel contexts using a rigorous engineering methodology that can be summed up as follows: let's not do it, and see what happens. Nothing happened, so we don't do it.
Instead, we manage object lifetime and memory allocation in parallel contexts by exploiting the temporal and predictable nature of the protocol callbacks, which map closely to TCP/IP states (connection_made(), data_received(), send_complete(), etc).
A snapshot is taken prior to invoking the callback and then rolled back upon completion. Object lifetime is therefore governed by the duration of the callback; all objects allocated during the processing of a HTTP request, for example, including the final bytes object we send as a response, will immediately cease to exist the moment our TCP/IP stack informs us the send completed. (Specifically, we perform the rollback activity upon receipt of the completion notification.)
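The snapshot/rollback idea can be sketched in a few lines of Python. This is a toy model only — the real machinery is a C block allocator inside the interpreter, not a Python list:

```python
class ContextHeap:
    # Toy model of a parallel context's block allocator (illustrative
    # only; PyParallel's real snapshot/rollback machinery is in C).
    def __init__(self):
        self._objects = []

    def alloc(self, obj):
        # Every object created during a callback lands in this heap.
        self._objects.append(obj)
        return obj

    def snapshot(self):
        # Taken just before the callback is invoked: record the
        # heap's high-water mark.
        return len(self._objects)

    def rollback(self, mark):
        # Performed when the send-completion notification arrives:
        # everything allocated during the callback ceases to exist.
        del self._objects[mark:]

heap = ContextHeap()
mark = heap.snapshot()
heap.alloc(b'HTTP/1.1 200 OK\r\n')   # response built inside the callback
heap.rollback(mark)                   # send completed; heap is empty again
```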
This is effective for the same reason generational garbage collectors are effective: most objects are short lived. For stateless, idempotent protocols, like HTTP, all objects are short lived. For stateful protocols, scalar objects (ints, floats, strings, bytes) can be assigned to self (a special per-connection instance of the protocol class), which will trigger a copy of the object from an alternate heap (still associated with the parallel context). (This is described in more detail here.) The lifetime of these objects will last as long as the TCP/IP connection persists, or until a previous value is overwritten by a new value, whichever comes first.
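The copy-on-assign behavior of the per-connection self can be modeled with a small Python sketch (class and attribute names are invented; the actual mechanism is implemented in C inside the interpreter):

```python
import copy

class ConnState:
    # Toy model of PyParallel's per-connection `self`: values assigned
    # to it are *copied* into a longer-lived heap so they survive the
    # end-of-callback rollback. Illustrative only.
    def __init__(self):
        object.__setattr__(self, '_heap', {})

    def __setattr__(self, name, value):
        # Copying detaches the value from the callback's heap; a later
        # assignment simply overwrites (and discards) the old copy.
        self._heap[name] = copy.copy(value)

    def __getattr__(self, name):
        try:
            return self._heap[name]
        except KeyError:
            raise AttributeError(name)

conn = ConnState()
conn.user = b'alice'   # copied out of the callback's heap
conn.user = b'bob'     # overwrites the previous copy
```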
Thus, PyParallel requires no changes to existing reference counting and garbage collection semantics or APIs. Py_INCREF(op) and Py_DECREF(op) get ignored in parallel contexts, and GC-specific calls like PyObject_GC_New() simply get re-routed to our custom parallel object allocator in the same fashion as PyObject_New(). This obviates the need for fine-grain, per-object locking, as well as the need for a thread-safe, concurrent garbage collector.
This is significant when you factor in how Python's scoping works at a language level: Python code executing in a parallel thread can freely access any non-local variables created by the "main thread". That is, it has the exact same scoping and variable name resolution rules as any other Python code. This facilitates loading large data structures from the main thread and then freely accessing them from parallel callbacks.
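For example (hypothetical names; a plain dict stands in for a large structure loaded by the main thread), a callback can resolve main-thread names exactly as ordinary Python code would, so long as it never mutates them:

```python
# Loaded once by the main thread, before parallel.run() is called.
titles = {
    b'python': b'https://en.wikipedia.org/wiki/Python_(programming_language)\r\n',
}

class Search:
    def data_received(self, transport, data):
        # Normal Python scoping: `titles` is visible here even though
        # this method runs in a parallel context. Read-only access only.
        return titles.get(data.strip(), b'not found\r\n')
```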
We demonstrate this with our simple Wikipedia "instant search" server, which loads a trie with 27 million entries, each one mapping a title to a 64-
