PANDEMONIUM
A Linux kernel scheduler built on sched_ext in Rust and C23 that dynamically learns task behaviour.
Written in Rust and C23, PANDEMONIUM is a Linux kernel scheduler built for sched_ext. Utilizing BPF patterns, PANDEMONIUM classifies every task by its behavior--wakeup frequency, context switch rate, runtime, sleep patterns--and adapts scheduling decisions in real time. Features include a single-threaded adaptive control loop (zero mutexes), three-tier behavioral dispatch, overflow sojourn rescue, longrun detection with deficit tightening, dual burst detection (CUSUM + wakeup rate), L2 cache affinity placement, sleep-informed batch tuning, CoDel-inspired batch sojourn rescue, classification-gated DSQ routing, workload regime detection, a vtime ceiling, hard starvation rescue, and a persistent process database that learns task classifications across lifetimes.
PANDEMONIUM is included in the sched-ext/scx project alongside scx_rusty, scx_lavd, scx_layered, scx_cosmos, and the rest of the sched_ext family. Thank you to Piotr Gorski and the sched-ext team. PANDEMONIUM is made possible by contributions from the sched_ext, CachyOS, Gentoo, OpenSUSE and Arch communities within the Linux ecosystem.
Performance
Benchmarked on 12 AMD Zen CPUs, kernel 6.18.13-arch1-1, clang 21.1.6, 3 iterations.
Burst P99 (fork/exec storm under CPU saturation)
| Cores | EEVDF    | PANDEMONIUM (BPF) | PANDEMONIUM (ADAPTIVE) | scx_bpfland |
|-------|----------|-------------------|------------------------|-------------|
| 2     | 2,773us  | 1,228us           | 821us                  | 3,181us     |
| 4     | 2,883us  | 1,285us           | 1,594us                | 2,005us     |
| 8     | 2,886us  | 1,092us           | 1,239us                | 2,006us     |
| 12    | 2,280us  | 1,231us           | 1,021us                | 2,007us     |
Both modes beat EEVDF and scx_bpfland at every core count. Overflow sojourn rescue and dual burst detection (CUSUM + wakeup rate) keep burst response sub-2ms.
P99 Wakeup Latency (interactive probe under CPU saturation)
| Cores | EEVDF    | PANDEMONIUM (BPF) | PANDEMONIUM (ADAPTIVE) | scx_bpfland |
|-------|----------|-------------------|------------------------|-------------|
| 2     | 1,953us  | 1,922us           | 929us                  | 3,060us     |
| 4     | 1,358us  | 1,012us           | 1,690us                | 2,011us     |
| 8     | 1,711us  | 1,041us           | 830us                  | 2,005us     |
| 12    | 718us    | 909us             | 1,284us                | 2,007us     |

ADAPTIVE wins at 2C and 8C, BPF beats EEVDF at 4C and 8C, and EEVDF leads at 12C. Both modes beat scx_bpfland at every core count.
Longrun P99 (interactive latency with sustained CPU-bound long-runners)
| Cores | EEVDF    | PANDEMONIUM (BPF) | PANDEMONIUM (ADAPTIVE) | scx_bpfland |
|-------|----------|-------------------|------------------------|-------------|
| 2     | 2,295us  | 440us             | 568us                  | 2,020us     |
| 4     | 1,395us  | 912us             | 1,338us                | 2,001us     |
| 8     | 557us    | 1,009us           | 1,741us                | 2,005us     |
| 12    | 418us    | 993us             | 1,957us                | 1,999us     |

Longrun detection tightens the deficit ratio to 1:1 under sustained batch pressure. BPF mode stays sub-1ms at 2C, 4C, and 12C.
Mixed Latency P99 (interactive + batch concurrent)
| Cores | EEVDF    | PANDEMONIUM (BPF) | PANDEMONIUM (ADAPTIVE) | scx_bpfland |
|-------|----------|-------------------|------------------------|-------------|
| 2     | 2,759us  | 520us             | 1,026us                | 2,167us     |
| 4     | 974us    | 1,075us           | 1,386us                | 1,999us     |
| 8     | 2,183us  | 968us             | 1,217us                | 2,004us     |
| 12    | 2,100us  | 999us             | 1,737us                | 2,000us     |
BPF mode sub-1ms at 2C, 8C, and 12C under mixed interactive+batch workloads.
Throughput (kernel build, vs EEVDF baseline, 3 iterations)
| Cores | PANDEMONIUM (BPF) | PANDEMONIUM (ADAPTIVE) | scx_bpfland |
|-------|-------------------|------------------------|-------------|
| 2     | +2.6%             | +18.8%                 | +5.2%       |
| 4     | +3.5%             | +2.7%                  | +0.4%       |
| 8     | +3.9%             | +2.9%                  | +3.8%       |
| 12    | +1.7%             | +0.6%                  | +1.9%       |
Deadline Jitter (16.6ms frame target, miss ratio)
| Cores | EEVDF   | PANDEMONIUM (BPF) | PANDEMONIUM (ADAPTIVE) | scx_bpfland |
|-------|---------|-------------------|------------------------|-------------|
| 8     | 11.3%   | 8.1%              | 4.3%                   | 56.5%       |
| 12    | 11.9%   | 4.4%              | 6.8%                   | 56.3%       |

At 8C, PANDEMONIUM ADAPTIVE misses 4.3% of frame deadlines vs EEVDF's 11.3%. At 12C, BPF mode misses 4.4% vs EEVDF's 11.9%. scx_bpfland misses about 56% at both core counts.
Key Features
Dispatch Order
- Overflow sojourn rescue (aging overflow DSQ tasks > threshold, deficit-gated)
- Per-CPU DSQ (direct placement from enqueue, zero contention, counts toward deficit)
- Deficit counter (DRR: force batch if budget exhausted + starving; longrun tightens budget)
- Hard starvation rescue (core-count-scaled absolute safety net)
- Node interactive overflow (LAT_CRITICAL + INTERACTIVE, vtime-ordered)
- Batch sojourn rescue (CoDel: rescue if oldest batch > threshold)
- Node batch overflow (normal fallback for batch tasks)
- Cross-node steal (interactive + batch per remote node)
- KEEP_RUNNING if prev still wants CPU and nothing queued
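The list above is a strict priority order: dispatch takes the first source that has work. A minimal Rust sketch of that scan follows. The `Step` enum, `DISPATCH_ORDER`, and `first_ready` are hypothetical names for illustration only; the real logic lives in the BPF `dispatch()` callback, and the readiness conditions are abstracted into a caller-supplied closure.

```rust
/// Dispatch sources in the priority order listed above (names are illustrative).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Step {
    OverflowSojournRescue,
    PerCpuDsq,
    DeficitForcedBatch,
    HardStarvationRescue,
    NodeInteractiveOverflow,
    BatchSojournRescue,
    NodeBatchOverflow,
    CrossNodeSteal,
    KeepRunning,
}

pub const DISPATCH_ORDER: [Step; 9] = [
    Step::OverflowSojournRescue,
    Step::PerCpuDsq,
    Step::DeficitForcedBatch,
    Step::HardStarvationRescue,
    Step::NodeInteractiveOverflow,
    Step::BatchSojournRescue,
    Step::NodeBatchOverflow,
    Step::CrossNodeSteal,
    Step::KeepRunning,
];

/// Return the first step whose readiness predicate holds, or None if the CPU
/// has nothing to run. `ready` stands in for the per-step conditions.
pub fn first_ready(ready: impl Fn(Step) -> bool) -> Option<Step> {
    DISPATCH_ORDER.iter().copied().find(|&s| ready(s))
}
```

Because the scan stops at the first hit, anything placed early in the order (the rescue steps) can only be starved by conditions in its own predicate, not by later queues.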
Three-Tier Enqueue
- Idle CPU Fast Path: `select_cpu()` places wakeups directly to per-CPU DSQ (depth-gated: 1 slot at <4 CPUs, 2 at 4+), kicks with `SCX_KICK_IDLE`
- Node-Local Placement with L2 Affinity: `enqueue()` tries L2 sibling first (INTERACTIVE/BATCH with affinity_mode > 0), then falls back to any idle CPU within the NUMA node, always dispatching to the per-node shared DSQ. LAT_CRITICAL and kernel threads (PF_KTHREAD) skip affinity for fastest-available placement
- Wakeup Preemption: All wakeups get node DSQ dispatch with `SCX_KICK_PREEMPT`. A task waking from sleep has external input to deliver regardless of behavioral tier. The classifier operates on historical behavior; the wakeup is the real-time latency signal. LAT_CRITICAL also gets preemption on requeue (compositor guarantee). Batch requeues skip to overflow DSQ
- NUMA-Scoped Overflow: Per-node overflow DSQ with classification-gated routing. Immature INTERACTIVE tasks (`ewma_age < 2`) route to batch DSQ until EWMA classifies them. LAT_CRITICAL tasks are never redirected
- Event-Driven Preemption: `tick()` checks `interactive_waiting` flag and preempts batch tasks above `preempt_thresh_ns`. During `burst_mode`, preempt threshold drops to 0 (immediate preemption). Zero polling -- no BPF timer
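The gating rules in this list are small enough to sketch as pure functions. The Rust below is an illustration only, not the scheduler's BPF code: the `Tier` and `OverflowDsq` enums and all three function names are assumptions made for the example, while the thresholds (1 or 2 slots, `ewma_age < 2`, threshold 0 in burst mode) come from the text.

```rust
/// Behavioral tiers named in the text.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Tier {
    LatCritical,
    Interactive,
    Batch,
}

/// The two per-node overflow queues (hypothetical names).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum OverflowDsq {
    Interactive,
    Batch,
}

/// Per-CPU DSQ depth gate: 1 slot below 4 CPUs, 2 slots at 4 or more.
pub fn percpu_depth_limit(nr_cpus: u32) -> u32 {
    if nr_cpus < 4 { 1 } else { 2 }
}

/// Classification-gated overflow routing. INTERACTIVE tasks whose EWMA has
/// not matured (ewma_age < 2) go to the batch queue; LAT_CRITICAL is never
/// redirected.
pub fn overflow_route(tier: Tier, ewma_age: u32) -> OverflowDsq {
    match tier {
        Tier::LatCritical => OverflowDsq::Interactive,
        Tier::Interactive if ewma_age < 2 => OverflowDsq::Batch,
        Tier::Interactive => OverflowDsq::Interactive,
        Tier::Batch => OverflowDsq::Batch,
    }
}

/// Preempt threshold as seen by tick(): 0 (immediate) while burst_mode is set.
pub fn effective_preempt_thresh_ns(preempt_thresh_ns: u64, burst_mode: bool) -> u64 {
    if burst_mode { 0 } else { preempt_thresh_ns }
}
```

For example, `overflow_route(Tier::Interactive, 1)` yields the batch queue, while the same task at `ewma_age == 2` is trusted and routed to the interactive queue.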
Overflow Sojourn Rescue
Per-CPU DSQ dominance under sustained load makes all downstream anti-starvation logic unreachable -- 90%+ of dispatches serve per-CPU DSQ while overflow tasks age indefinitely. Dispatch Step 0 checks both overflow DSQs for tasks aging past overflow_sojourn_rescue_ns (core-count-scaled: 2ms per core, clamped 4-10ms) and serves them before per-CPU DSQ. CAS-based timestamp management prevents races across CPUs.
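As a rough illustration of the threshold scaling and the CAS-guarded timestamp, here is a Rust sketch using `std::sync::atomic`. It assumes a single timestamp per overflow DSQ where 0 means "empty"; that encoding and the function names `mark_nonempty` and `needs_rescue` are hypothetical, and the real BPF code may manage its timestamps differently.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

const NSEC_PER_MSEC: u64 = 1_000_000;

/// 2ms per core, clamped to the 4-10ms range stated above.
pub fn overflow_sojourn_rescue_ns(nr_cpus: u64) -> u64 {
    (2 * NSEC_PER_MSEC * nr_cpus).clamp(4 * NSEC_PER_MSEC, 10 * NSEC_PER_MSEC)
}

/// Record when the overflow DSQ became non-empty. The compare-and-swap lets
/// exactly one CPU set the stamp, so racing enqueuers cannot overwrite an
/// older timestamp. Assumes 0 is reserved as the "empty" marker.
pub fn mark_nonempty(stamp: &AtomicU64, now_ns: u64) -> bool {
    stamp
        .compare_exchange(0, now_ns, Ordering::AcqRel, Ordering::Acquire)
        .is_ok()
}

/// True when the queue has been non-empty for longer than the threshold.
pub fn needs_rescue(stamp: &AtomicU64, now_ns: u64, thresh_ns: u64) -> bool {
    let since = stamp.load(Ordering::Acquire);
    since != 0 && now_ns.saturating_sub(since) > thresh_ns
}
```

With this formula a 2-core machine gets a 4ms threshold, a 4-core machine 8ms, and anything with 5 or more cores sits at the 10ms ceiling.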
Longrun Detection
Tracks sustained batch DSQ pressure. When batch DSQ is non-empty for >2 seconds, longrun_mode activates:
- Deficit ratio tightens from `nr_cpu_ids * ratio` to `nr_cpu_ids * 1`, quadrupling batch dispatch share
- `task_slice()` uses `burst_slice_ns` (1ms) instead of regime slice (up to 4ms)
- Rust adaptive layer: sleep-informed batch adjustment skipped, affinity forced to WEAK (spread batch across CPUs)
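The first two effects can be sketched in a few lines of Rust. This is an illustration under assumptions, not the scheduler's code: the function names and the `Option`-based "non-empty since" timestamp are hypothetical, and `ratio` is left as a parameter (the "quadrupling" wording suggests a normal ratio of 4, but the text does not state it).

```rust
const NSEC_PER_MSEC: u64 = 1_000_000;

/// burst_slice_ns as stated in the text (1ms).
pub const BURST_SLICE_NS: u64 = NSEC_PER_MSEC;

/// Longrun activates after the batch DSQ has been non-empty for more than 2s.
pub const LONGRUN_THRESH_NS: u64 = 2_000 * NSEC_PER_MSEC;

/// `batch_nonempty_since_ns` is None while the batch DSQ is empty.
pub fn longrun_active(batch_nonempty_since_ns: Option<u64>, now_ns: u64) -> bool {
    match batch_nonempty_since_ns {
        Some(since) => now_ns.saturating_sub(since) > LONGRUN_THRESH_NS,
        None => false,
    }
}

/// Interactive dispatches allowed before one batch dispatch is forced.
/// Longrun collapses the multiplier to 1.
pub fn interactive_budget(nr_cpu_ids: u64, ratio: u64, longrun: bool) -> u64 {
    nr_cpu_ids * if longrun { 1 } else { ratio }
}

/// Slice selection: the 1ms burst slice replaces the regime slice in longrun.
pub fn task_slice_ns(regime_slice_ns: u64, longrun: bool) -> u64 {
    if longrun { BURST_SLICE_NS } else { regime_slice_ns }
}
```

On 8 CPUs with an assumed ratio of 4, the budget drops from 32 interactive dispatches per forced batch dispatch to 8, which is where the fourfold increase in batch share comes from.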
Dual Burst Detection
- CUSUM: Statistical change-point detection (Page, 1954) monitors total enqueue rate. Samples every 64th enqueue, EWMA baseline with 25% slack. Effective for BPF mode (1ms slices) where enqueue rate spikes during fork storms
- Wakeup Rate Counter: Absolute threshold -- `nr_cpu_ids * 2` wakeups per tick = fork storm. No calibration needed, works immediately on first tick. Effective for adaptive mode (4ms slices) where CUSUM is rate-bounded
- Either firing activates `burst_mode`: preempt threshold drops to 0, task_slice uses `burst_slice_ns`
- Split DSQ routing always active. Burst handled via slice reduction and preempt override, not DSQ reorganization
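A one-sided CUSUM with an EWMA baseline can be sketched as below. Only the 25% slack and the `nr_cpu_ids * 2` wakeup threshold come from the text; the decision threshold, the 7/8 baseline weight, freezing the baseline while an alarm is raised, and the `>=` comparison are all assumptions chosen for the example, as are the names `Cusum` and `wakeup_storm`.

```rust
/// One-sided CUSUM detector over sampled enqueue rates (integer arithmetic).
pub struct Cusum {
    baseline: u64,  // EWMA of the observed rate
    sum: u64,       // cumulative excess over baseline + slack
    threshold: u64, // hypothetical decision threshold
}

impl Cusum {
    pub fn new(initial_baseline: u64, threshold: u64) -> Self {
        Cusum { baseline: initial_baseline, sum: 0, threshold }
    }

    /// Feed one rate sample; returns true when a change point is declared.
    pub fn update(&mut self, rate: u64) -> bool {
        // Samples within 25% of the baseline add nothing to the sum.
        let allowed = self.baseline + self.baseline / 4;
        // S = max(0, S + rate - allowed)
        self.sum = (self.sum + rate).saturating_sub(allowed);
        let fired = self.sum > self.threshold;
        if !fired {
            // Assumption: track the baseline only while quiet, so a burst
            // does not drag the baseline up and mask itself.
            self.baseline = (self.baseline * 7 + rate) / 8;
        }
        fired
    }
}

/// Absolute wakeup-rate check: nr_cpu_ids * 2 wakeups in one tick.
pub fn wakeup_storm(wakeups_this_tick: u64, nr_cpu_ids: u64) -> bool {
    wakeups_this_tick >= nr_cpu_ids * 2
}

/// Either detector firing activates burst mode.
pub fn burst_mode(cusum_fired: bool, wakeup_fired: bool) -> bool {
    cusum_fired || wakeup_fired
}
```

The two detectors complement each other: CUSUM needs a few samples to build a baseline but reacts to relative change, while the wakeup counter needs no history at all.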
Vtime Ceiling
High-vtime daemons sort to the tail of the batch DSQ while fresh burst tasks take the head. Sojourn rescue dispatches from the head, so daemons starve. The ceiling caps batch deadline at vtime_now + 30ms, keeping every task within 6 sojourn cycles of the head. Gated at >=8 cores -- at 2-4 cores the batch DSQ is shallow enough that sojourn rescue reaches every task naturally.
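The ceiling itself is a single `min`. A minimal Rust sketch, assuming vtime is tracked in nanoseconds and using a hypothetical function name `batch_deadline`:

```rust
const NSEC_PER_MSEC: u64 = 1_000_000;

/// Ceiling distance stated in the text.
const VTIME_CEILING_NS: u64 = 30 * NSEC_PER_MSEC;

/// Cap a batch task's deadline at vtime_now + 30ms, but only on machines
/// with 8 or more cores; smaller machines keep the raw deadline.
pub fn batch_deadline(deadline: u64, vtime_now: u64, nr_cpus: u32) -> u64 {
    if nr_cpus >= 8 {
        deadline.min(vtime_now + VTIME_CEILING_NS)
    } else {
        deadline
    }
}
```

A daemon whose accumulated vtime would place it 500ms behind the head is pulled in to 30ms on an 8-core machine, while the same task on a 4-core machine is left untouched.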
Hard Starvation Rescue
Absolute safety net. Computed as the minimum of a term that rises linearly with core count and a term that falls inversely with it: min(25ms * nr_cpus, 500ms / max(1, nr_cpus/4)), clamped to 20-500ms. Short at low core counts (fast starvation: 2C = 50ms) and at high core counts (dispatch contention: 128C = 20ms), peaks in the middle (8C = 200ms). Fires before the interactive DSQ and guarantees batch service regardless of interactive pressure.
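The formula translates directly into Rust. This sketch assumes `nr_cpus/4` is integer division (the text does not say), and the function name `hard_starvation_ns` is made up for the example.

```rust
const NSEC_PER_MSEC: u64 = 1_000_000;

/// min(25ms * nr_cpus, 500ms / max(1, nr_cpus/4)), clamped to 20-500ms.
/// Assumes integer division for nr_cpus/4.
pub fn hard_starvation_ns(nr_cpus: u64) -> u64 {
    let rising = 25 * NSEC_PER_MSEC * nr_cpus;
    let falling = 500 * NSEC_PER_MSEC / (nr_cpus / 4).max(1);
    rising
        .min(falling)
        .clamp(20 * NSEC_PER_MSEC, 500 * NSEC_PER_MSEC)
}
```

Under that assumption the function reproduces the three values quoted above: 50ms at 2 cores, 200ms at 8 cores, and 20ms at 128 cores (where the raw value of about 15.6ms is lifted to the floor).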
Batch DSQ Separation
Batch tasks enqueue to dedicated per-node batch overflow DSQs instead of sharing vtime-ordered DSQs with interactive tasks. Separate DSQs give dispatch explicit control: interactive overflow first, then sojourn rescue, then batch fallback.
CoDel Sojourn Rescue
batch_enqueue_ns records when the batch DSQ transitions from empty to non-empty. dispatch() rescues batch tasks waiting longer than sojourn_thresh_ns. The threshold is set by the Rust adaptive layer from observed dispatch rate: target = 4x dispatch interval, EWMA-smoothed (7/8 old + 1/8 new), clamped to core-count-aware floor/ceiling.
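One plausible reading of the threshold update, sketched in Rust: compute the 4x target, blend it into the previous threshold with 7/8 and 1/8 weights, then clamp. Whether the EWMA is applied to the threshold or to the dispatch interval is not stated, so this ordering is an assumption, as are the function names and the floor/ceiling parameters (the text gives no concrete values for them).

```rust
/// Update sojourn_thresh_ns from the latest observed dispatch interval.
/// `floor_ns` and `ceil_ns` stand in for the core-count-aware bounds and
/// must satisfy floor_ns <= ceil_ns (u64::clamp panics otherwise).
pub fn update_sojourn_thresh_ns(
    prev_thresh_ns: u64,
    dispatch_interval_ns: u64,
    floor_ns: u64,
    ceil_ns: u64,
) -> u64 {
    let target = 4 * dispatch_interval_ns;
    // EWMA: 7/8 old + 1/8 new
    let smoothed = (prev_thresh_ns * 7 + target) / 8;
    smoothed.clamp(floor_ns, ceil_ns)
}

/// Rescue check: the batch DSQ has been non-empty (since batch_enqueue_ns)
/// for longer than the threshold. Assumes 0 means "empty".
pub fn batch_needs_rescue(batch_enqueue_ns: u64, now_ns: u64, sojourn_thresh_ns: u64) -> bool {
    batch_enqueue_ns != 0 && now_ns.saturating_sub(batch_enqueue_ns) > sojourn_thresh_ns
}
```

With a previous threshold of 8ms and a 1ms dispatch interval, the target is 4ms and the smoothed threshold moves to 7.5ms, so a single outlier sample shifts the threshold by only one eighth of the gap.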
Deficit Counter (DRR)
After interactive_budget consecutive interactive dispatches without batch service, forces one batch dispatch. Bud