Usched
To experiment with ligh-weight user threads based on stack copying.
Install / Use
/learn @nikitadanilov/UschedREADME
usched: stackswap coroutines
[Blog entry.]
Update: with the new ll.c scheduler, usched beats C++ coroutines.
This repository contains a simple experimental implementation of coroutines alternative to well-known "stackless" and "stackful" methods.
The term "coroutine" gradually grew to mean a mechanism where a computation, which in this context means a chain of nested function calls, can "block" or "yield" so that the top-most caller can proceed and the computation can later be resumed at the blocking point with the chain of intermediate function activation frames preserved.
Prototypical uses of coroutines are lightweight support for potentially blocking operations (user interaction, IO, networking) and generators, which produce multiple values (see same fringe problem).
There are two common coroutine implementation methods:
-
a stackful coroutine runs on a separate stack. When a stackful coroutine blocks, it performs a usual context switch. Historically "coroutines" meant stackful coroutines. Stackful coroutines are basically little more than usual threads, and so they can be kernel (supported by the operating system) or user-space (implemented by a user-space library, also known as green threads), preemptive or cooperative.
-
a stackless coroutine does not use any stack when blocked. In a typical implementation instead of using a normal function activation frame on the stack, the coroutine uses a special activation frame allocated in the heap so that it can outlive its caller. Using heap-allocated frame to store all local variable lends itself naturally to compiler support, but some people are known to implement stackless coroutines manually via a combination of pre-processing, library and tricks much worse than Duff's device.
Stackful and stateless are by no means the only possibilities. One of the earliest languages to feature generators CLU (distribution) ran generators on the caller's stack.
usched is in some sense intermediate between stackful and stackless: its coroutines do not use stack when blocked, nor do they allocate individual activation frames in the heap.
The following is copied with some abbreviations from usched.c.
Overview
usched: A simple dispatcher for cooperative user-space threads.
A typical implementation of user-space threads allocates a separate stack for
each thread when the thread is created and then dispatches threads (as
decided by the scheduler) through some context switching mechanism, for
example, longjmp().
In usched all threads (represented by struct ustack) are executed on the same
"native" stack. When a thread is about to block (usched_block()), a memory
buffer for the stack used by this thread is allocated and the stack is copied
to the buffer. After that the part of the stack used by the blocking thread
is discarded (by longjmp()-ing to the base of the stack) and a new thread is
selected. The stack of the selected thread is restored from its buffer and
the thread is resumed by longjmp()-ing to the usched_block() that blocked it.
The focus of this implementation is simplicity: the total size of usched.[ch]
is less than 120LOC, as measured by SLOCCount.
Advantages:
-
no need to allocate maximal possible stack at thread initialisation: stack buffer is allocated as needed. It is also possible to free the buffer when the thread is resumed (not currently implemented);
-
thread that doesn't block has 0 overhead: it is executed as a native function call (through a function pointer) without any context switching;
-
because the threads are executed on the stack of the same native underlying thread, native synchronisation primitives (mutices, etc.) work, although the threads share underlying TLS. Of course one cannot use native primitives to synchronise between usched threads running on the same native thread.
Disadvantages:
-
stack copying introduces overhead (
memcpy()) in each context switch; -
because stacks are moved around, addresses on a thread stack are only valid while the thread is running. This invalidates certain common programming idioms: other threads and heap cannot store pointers to the stacks, at least to the stacks of the blocked threads. Note that Go language, and probably other run-times, maintains a similar invariant.
Usage
usched is only a dispatcher and not a scheduler: it blocks and resumes threads but
-
it does not keep track of threads (specifically allocation and freeing of struct ustack instances is done elsewhere),
-
it implements no scheduling policies.
These things are left to the user, together with stack buffers allocation and freeing. The user supplies 3 call-backs:
-
usched::s_next(): the scheduling function. This call-backs returns the next thread to execute. This can be either a new (never before executed) thread initialised withustack_init(), or it can be a blocked thread. The user must keep track of blocked and runnable threads, presumably by providing wrappers toustack_init()andustack_block()that would record thread state changes. It is up tousched::s_next()to block and wait for events if there are no runnable threads and all threads are waiting for something; -
usched::s_alloc(): allocates new stack buffer of at least the specified size. The user have full control over stack buffer allocation. It is possible to pre-allocate the buffer when the thread is initialised (reducing the cost ofusched_block()), it is possible to cache buffers, etc.; -
usched::s_free(): frees the previously allocated stack buffer.
rr.h and rr.c provide a simple "round-robin" scheduler implementing all the call-backs. Use it carefully, it was only tested with rmain.c benchmark.
Pictures!
The following diagrams show stack management by usched. The stack grows from right to left.
At the entrance to the dispatcher loop. usched_run(S):
usched_run()
----------------------------------------------+--------------+-------+
| buf | anchor | ... |
----------------------------------------------+--------------+-------+
^
|
sp = S->s_buf
A new (never before executed) thread U is selected by S->s_next(), launch()
calls the thread startup function U->u_f():
U->u_f() launch() usched_run()
-----------------------------+---------+-----+--------------+-------+
| | pad | buf | anchor | ... |
-----------------------------+---------+-----+--------------+-------+
^ ^
| |
sp U->u_bottom
Thread executes as usual on the stack, until it blocks by calling
usched_block():
usched_block() bar() U->u_f() launch() usched_run()
----------+------+-----+-----+---------+-----+--------------+-------+
| here | ... | | | pad | buf | anchor | ... |
----------+------+-----+-----+---------+-----+--------------+-------+
^ ^ ^
| +-- sp = U->u_cont |
| U->u_bottom
U->u_top
The stack from U->u_top to U->u_bottom is copied into the stack buffer
U->u_stack, and control returns to usched_run() by longjmp(S->s_buf):
usched_run()
----------------------------------------------+--------------+-------+
| buf | anchor | ... |
----------------------------------------------+--------------+-------+
^
|
sp = S->s_buf
Next, suppose S->s_next() selects a previously blocked thread V ready to be
resumed. usched_run() calls cont(V).
cont() usched_run()
----------------------------------------+-----+--------------+-------+
| | buf | anchor | ... |
----------------------------------------+-----+--------------+-------+
^
|
sp
cont() copies the stack from the buffer to [V->u_top, V->u_bottom]
range. It's important that this memcpy() operation does not overwrite
cont()'s own stack frame, this is why pad[] array is needed in launch(): it
advances V->u_bottom and gives cont() some space to operate.
usched_block() foo() V->u_f() cont() usched_run()
---------+------+-----+-----+--------+--+-----+--------------+-------+
| here | ... | | | | | buf | anchor | ... |
---------+------+-----+-----+--------+--+-----+--------------+-------+
Related Skills
node-connect
349.9kDiagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps
frontend-design
109.8kCreate distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.
openai-whisper-api
349.9kTranscribe audio via OpenAI Audio Transcriptions API (Whisper).
qqbot-media
349.9kQQBot 富媒体收发能力。使用 <qqmedia> 标签,系统根据文件扩展名自动识别类型(图片/语音/视频/文件)。
