PANDEMONIUM

A Linux kernel scheduler built on sched_ext in Rust and C23 that dynamically learns task behaviour.

Built in Rust and C23, PANDEMONIUM is a Linux kernel scheduler for sched_ext. Using BPF patterns, PANDEMONIUM classifies every task by its behavior (wakeup frequency, context switch rate, runtime, sleep patterns) and adapts scheduling decisions in real time. Features include a single-thread adaptive control loop (zero mutexes), three-tier behavioral dispatch, overflow sojourn rescue, longrun detection with deficit tightening, dual burst detection (CUSUM + wakeup rate), L2 cache affinity placement, sleep-informed batch tuning, CoDel-inspired sojourn rescue, classification-gated DSQ routing, workload regime detection, a vtime ceiling, hard starvation rescue, and a persistent process database that learns task classifications across lifetimes.

PANDEMONIUM is included in the sched-ext/scx project alongside scx_rusty, scx_lavd, scx_layered, scx_cosmos, and the rest of the sched_ext family. Thank you to Piotr Gorski and the sched-ext team. PANDEMONIUM is made possible by contributions from the sched_ext, CachyOS, Gentoo, OpenSUSE and Arch communities within the Linux ecosystem.

Performance

Benchmarked on 12 AMD Zen CPUs, kernel 6.18.13-arch1-1, clang 21.1.6, 3 iterations.

Burst P99 (fork/exec storm under CPU saturation)

| Cores | EEVDF | PANDEMONIUM (BPF) | PANDEMONIUM (ADAPTIVE) | scx_bpfland |
|-------|----------|-------------------|------------------------|-------------|
| 2 | 2,773us | 1,228us | 821us | 3,181us |
| 4 | 2,883us | 1,285us | 1,594us | 2,005us |
| 8 | 2,886us | 1,092us | 1,239us | 2,006us |
| 12 | 2,280us | 1,231us | 1,021us | 2,007us |

Both modes beat EEVDF and scx_bpfland at every core count. Overflow sojourn rescue and dual burst detection (CUSUM + wakeup rate) keep burst response sub-2ms.

P99 Wakeup Latency (interactive probe under CPU saturation)

| Cores | EEVDF | PANDEMONIUM (BPF) | PANDEMONIUM (ADAPTIVE) | scx_bpfland |
|-------|----------|-------------------|------------------------|-------------|
| 2 | 1,953us | 1,922us | 929us | 3,060us |
| 4 | 1,358us | 1,012us | 1,690us | 2,011us |
| 8 | 1,711us | 1,041us | 830us | 2,005us |
| 12 | 718us | 909us | 1,284us | 2,007us |

ADAPTIVE has the lowest latency at 2C and 8C, and BPF beats EEVDF at 2C, 4C, and 8C. EEVDF is lowest at 12C. Both modes beat scx_bpfland at every core count.

Longrun P99 (interactive latency with sustained CPU-bound long-runners)

| Cores | EEVDF | PANDEMONIUM (BPF) | PANDEMONIUM (ADAPTIVE) | scx_bpfland |
|-------|----------|-------------------|------------------------|-------------|
| 2 | 2,295us | 440us | 568us | 2,020us |
| 4 | 1,395us | 912us | 1,338us | 2,001us |
| 8 | 557us | 1,009us | 1,741us | 2,005us |
| 12 | 418us | 993us | 1,957us | 1,999us |

Longrun detection tightens the deficit ratio to 1:1 under sustained batch pressure. BPF mode stays sub-1ms at 2C, 4C, and 12C.

Mixed Latency P99 (interactive + batch concurrent)

| Cores | EEVDF | PANDEMONIUM (BPF) | PANDEMONIUM (ADAPTIVE) | scx_bpfland |
|-------|----------|-------------------|------------------------|-------------|
| 2 | 2,759us | 520us | 1,026us | 2,167us |
| 4 | 974us | 1,075us | 1,386us | 1,999us |
| 8 | 2,183us | 968us | 1,217us | 2,004us |
| 12 | 2,100us | 999us | 1,737us | 2,000us |

BPF mode stays sub-1ms at 2C, 8C, and 12C under mixed interactive+batch workloads.

Throughput (kernel build, vs EEVDF baseline, 3 iterations)

| Cores | PANDEMONIUM (BPF) | PANDEMONIUM (ADAPTIVE) | scx_bpfland |
|-------|-------------------|------------------------|-------------|
| 2 | +2.6% | +18.8% | +5.2% |
| 4 | +3.5% | +2.7% | +0.4% |
| 8 | +3.9% | +2.9% | +3.8% |
| 12 | +1.7% | +0.6% | +1.9% |

Deadline Jitter (16.6ms frame target, miss ratio)

| Cores | EEVDF | PANDEMONIUM (BPF) | PANDEMONIUM (ADAPTIVE) | scx_bpfland |
|-------|---------|-------------------|------------------------|-------------|
| 8 | 11.3% | 8.1% | 4.3% | 56.5% |
| 12 | 11.9% | 4.4% | 6.8% | 56.3% |

PANDEMONIUM ADAPTIVE misses 4.3% of frame deadlines at 8C against EEVDF's 11.3%, and BPF mode misses 4.4% at 12C against EEVDF's 11.9%. scx_bpfland misses about 56% at both core counts.

Key Features

Dispatch Order

  1. Overflow sojourn rescue (aging overflow DSQ tasks > threshold, deficit-gated)
  2. Per-CPU DSQ (direct placement from enqueue, zero contention, counts toward deficit)
  3. Deficit counter (DRR: force batch if budget exhausted + starving; longrun tightens budget)
  4. Hard starvation rescue (core-count-scaled absolute safety net)
  5. Node interactive overflow (LAT_CRITICAL + INTERACTIVE, vtime-ordered)
  6. Batch sojourn rescue (CoDel: rescue if oldest batch > threshold)
  7. Node batch overflow (normal fallback for batch tasks)
  8. Cross-node steal (interactive + batch per remote node)
  9. KEEP_RUNNING if prev still wants CPU and nothing queued
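
The ordering above can be read as a first-match priority chain. The following is a minimal sketch of that idea only, not the scheduler's BPF code: the `Source` and `Outcome` enums, the `pick_dispatch` function, and the `eligible` callback (which stands in for every per-step condition such as thresholds and deficit gating) are hypothetical names introduced for illustration.

```rust
/// Dispatch sources, in the priority order listed above (names are illustrative).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Source {
    OverflowSojournRescue,
    PerCpuDsq,
    DeficitForcedBatch,
    HardStarvationRescue,
    NodeInteractiveOverflow,
    BatchSojournRescue,
    NodeBatchOverflow,
    CrossNodeSteal,
}

pub const DISPATCH_ORDER: [Source; 8] = [
    Source::OverflowSojournRescue,
    Source::PerCpuDsq,
    Source::DeficitForcedBatch,
    Source::HardStarvationRescue,
    Source::NodeInteractiveOverflow,
    Source::BatchSojournRescue,
    Source::NodeBatchOverflow,
    Source::CrossNodeSteal,
];

/// Result of one dispatch pass.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    Run(Source),
    KeepRunning,
    Idle,
}

/// Walk the sources in order and take the first eligible one.
/// `eligible` is an assumed callback that hides each step's own gating logic.
pub fn pick_dispatch<F: Fn(Source) -> bool>(eligible: F, prev_wants_cpu: bool) -> Outcome {
    for src in DISPATCH_ORDER.iter().copied() {
        if eligible(src) {
            return Outcome::Run(src);
        }
    }
    // Nothing queued anywhere: keep the previous task if it still wants the CPU.
    if prev_wants_cpu {
        Outcome::KeepRunning
    } else {
        Outcome::Idle
    }
}
```

With both the per-CPU DSQ and the node batch overflow eligible, `pick_dispatch` returns `Outcome::Run(Source::PerCpuDsq)`, because the per-CPU DSQ sits earlier in the order.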

Three-Tier Enqueue

  • Idle CPU Fast Path: select_cpu() places wakeups directly to per-CPU DSQ (depth-gated: 1 slot at <4 CPUs, 2 at 4+), kicks with SCX_KICK_IDLE
  • Node-Local Placement with L2 Affinity: enqueue() tries L2 sibling first (INTERACTIVE/BATCH with affinity_mode > 0), then falls back to any idle CPU within the NUMA node, always dispatching to the per-node shared DSQ. LAT_CRITICAL and kernel threads (PF_KTHREAD) skip affinity for fastest-available placement
  • Wakeup Preemption: All wakeups get node DSQ dispatch with SCX_KICK_PREEMPT. A task waking from sleep has external input to deliver regardless of behavioral tier. The classifier operates on historical behavior; the wakeup is the real-time latency signal. LAT_CRITICAL also gets preemption on requeue (compositor guarantee). Batch requeues skip to overflow DSQ
  • NUMA-Scoped Overflow: Per-node overflow DSQ with classification-gated routing. Immature INTERACTIVE tasks (ewma_age < 2) route to batch DSQ until EWMA classifies them. LAT_CRITICAL tasks are never redirected
  • Event-Driven Preemption: tick() checks interactive_waiting flag and preempts batch tasks above preempt_thresh_ns. During burst_mode, preempt threshold drops to 0 (immediate preemption). Zero polling -- no BPF timer
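
Two of the gates described above are simple enough to sketch directly: the per-CPU DSQ depth gate and the classification-gated overflow routing. This is an illustrative sketch under assumptions; the `Tier` and `OverflowDsq` enums and both function names are hypothetical, and only the thresholds (4 CPUs, `ewma_age < 2`) come from the text.

```rust
/// Behavioral tiers named in the text.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tier {
    LatCritical,
    Interactive,
    Batch,
}

/// Which per-node overflow DSQ a task is routed to (illustrative type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OverflowDsq {
    Interactive,
    Batch,
}

/// Depth gate for the idle-CPU fast path: 1 slot below 4 CPUs, 2 at 4 or more.
pub fn per_cpu_depth_limit(nr_cpus: u32) -> u32 {
    if nr_cpus < 4 {
        1
    } else {
        2
    }
}

/// Classification-gated routing: immature INTERACTIVE tasks go to the batch DSQ
/// until the EWMA has seen enough samples; LAT_CRITICAL is never redirected.
pub fn overflow_route(tier: Tier, ewma_age: u32) -> OverflowDsq {
    match tier {
        Tier::LatCritical => OverflowDsq::Interactive,
        Tier::Interactive if ewma_age < 2 => OverflowDsq::Batch,
        Tier::Interactive => OverflowDsq::Interactive,
        Tier::Batch => OverflowDsq::Batch,
    }
}
```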

Overflow Sojourn Rescue

Under sustained load, per-CPU DSQ dominance makes all downstream anti-starvation logic unreachable: 90%+ of dispatches serve the per-CPU DSQ while overflow tasks age indefinitely. The first dispatch step checks both overflow DSQs for tasks aging past overflow_sojourn_rescue_ns (core-count-scaled: 2ms per core, clamped to 4-10ms) and serves them before the per-CPU DSQ. CAS-based timestamp management prevents races across CPUs.
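
The core-count scaling of the rescue threshold can be worked through with a small sketch. The function names and the `oldest_enqueue_ns` parameter are assumptions made for illustration; the 2ms-per-core scaling and the 4-10ms clamp are taken from the description above. The CAS-based timestamp handling is not modeled here.

```rust
const OVERFLOW_NSEC_PER_MSEC: u64 = 1_000_000;

/// 2ms per core, clamped to the 4-10ms range.
pub fn overflow_sojourn_rescue_ns(nr_cpus: u64) -> u64 {
    (2 * OVERFLOW_NSEC_PER_MSEC * nr_cpus)
        .clamp(4 * OVERFLOW_NSEC_PER_MSEC, 10 * OVERFLOW_NSEC_PER_MSEC)
}

/// True when the oldest overflow task has aged past the threshold.
/// `oldest_enqueue_ns` is a hypothetical timestamp of the oldest queued task.
pub fn needs_overflow_rescue(now_ns: u64, oldest_enqueue_ns: u64, nr_cpus: u64) -> bool {
    now_ns.saturating_sub(oldest_enqueue_ns) > overflow_sojourn_rescue_ns(nr_cpus)
}
```

Under these assumptions the threshold is 4ms at 2 cores, 8ms at 4 cores, and saturates at 10ms from 5 cores upward.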

Longrun Detection

Tracks sustained batch DSQ pressure. When batch DSQ is non-empty for >2 seconds, longrun_mode activates:

  • Deficit ratio tightens from nr_cpu_ids * ratio to nr_cpu_ids * 1, quadrupling batch dispatch share
  • task_slice() uses burst_slice_ns (1ms) instead of regime slice (up to 4ms)
  • Rust adaptive layer: sleep-informed batch adjustment skipped, affinity forced to WEAK (spread batch across CPUs)
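
A minimal sketch of the activation rule and the budget tightening described above. The `LongrunTracker` type, its `observe` method, and the `interactive_budget` helper are hypothetical; the 2-second window and the `nr_cpu_ids * ratio` to `nr_cpu_ids * 1` change come from the text. The `ratio` of 4 used in the note below is an assumption inferred from the word "quadrupling".

```rust
const LONGRUN_THRESH_NS: u64 = 2_000_000_000; // 2 seconds

/// Tracks how long the batch DSQ has been continuously non-empty.
#[derive(Debug, Default)]
pub struct LongrunTracker {
    /// When the batch DSQ last went from empty to non-empty, if it is still non-empty.
    nonempty_since_ns: Option<u64>,
}

impl LongrunTracker {
    /// Feed one observation of the batch DSQ depth; returns whether longrun_mode is active.
    pub fn observe(&mut self, now_ns: u64, batch_queued: u64) -> bool {
        if batch_queued == 0 {
            // Pressure released: reset the window.
            self.nonempty_since_ns = None;
            return false;
        }
        let since = *self.nonempty_since_ns.get_or_insert(now_ns);
        now_ns.saturating_sub(since) > LONGRUN_THRESH_NS
    }
}

/// Interactive dispatch budget before a forced batch dispatch.
pub fn interactive_budget(nr_cpu_ids: u64, ratio: u64, longrun: bool) -> u64 {
    nr_cpu_ids * if longrun { 1 } else { ratio }
}
```

With 8 CPUs and an assumed ratio of 4, the budget drops from 32 to 8 interactive dispatches per forced batch dispatch once longrun_mode is active.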

Dual Burst Detection

  • CUSUM: Statistical change-point detection (Page, 1954) monitors total enqueue rate. Samples every 64th enqueue, EWMA baseline with 25% slack. Effective for BPF mode (1ms slices) where enqueue rate spikes during fork storms
  • Wakeup Rate Counter: Absolute threshold -- nr_cpu_ids * 2 wakeups per tick = fork storm. No calibration needed, works immediately on first tick. Effective for adaptive mode (4ms slices) where CUSUM is rate-bounded
  • Either firing activates burst_mode: preempt threshold drops to 0, task_slice uses burst_slice_ns
  • Split DSQ routing always active. Burst handled via slice reduction and preempt override, not DSQ reorganization
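
The two detectors can be sketched as a one-sided CUSUM over sampled enqueue rates plus an absolute wakeup threshold. This is an illustrative sketch, not the project's implementation: the `Cusum` type, its `threshold` trip level, the 7/8 + 1/8 EWMA weights for the baseline, and the use of `>=` in the wakeup check are all assumptions. Only the 25% slack and the `nr_cpu_ids * 2` threshold come from the text.

```rust
/// One-sided CUSUM change-point detector over an enqueue-rate sample stream.
pub struct Cusum {
    baseline: u64,  // EWMA of the sampled rate
    sum: u64,       // accumulated excess above baseline + slack
    threshold: u64, // hypothetical trip level
}

impl Cusum {
    pub fn new(baseline: u64, threshold: u64) -> Self {
        Self { baseline, sum: 0, threshold }
    }

    /// Feed one rate sample; returns true when the accumulated excess trips the detector.
    pub fn update(&mut self, rate: u64) -> bool {
        let allowed = self.baseline + self.baseline / 4; // baseline plus 25% slack
        if rate > allowed {
            self.sum += rate - allowed;
        } else {
            // Decay toward zero, never below it (one-sided CUSUM).
            self.sum = self.sum.saturating_sub(allowed - rate);
        }
        // Assumed EWMA weights: 7/8 old + 1/8 new.
        self.baseline = (self.baseline * 7 + rate) / 8;
        self.sum > self.threshold
    }
}

/// Absolute wakeup-rate check: nr_cpu_ids * 2 wakeups in one tick counts as a fork storm.
pub fn wakeup_storm(wakeups_this_tick: u64, nr_cpu_ids: u64) -> bool {
    wakeups_this_tick >= nr_cpu_ids * 2
}

/// Either detector firing activates burst_mode.
pub fn burst_mode(cusum_fired: bool, wakeup_fired: bool) -> bool {
    cusum_fired || wakeup_fired
}
```

The CUSUM needs several samples above the slack band before it trips, while the wakeup counter can fire on the very first tick, which matches the division of labor described above.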

Vtime Ceiling

High-vtime daemons sort to the tail of the batch DSQ while fresh burst tasks take the head. Sojourn rescue dispatches from the head, so daemons starve. The ceiling caps batch deadline at vtime_now + 30ms, keeping every task within 6 sojourn cycles of the head. Gated at >=8 cores -- at 2-4 cores the batch DSQ is shallow enough that sojourn rescue reaches every task naturally.
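
The ceiling itself is a single clamp. A minimal sketch, assuming a hypothetical `batch_deadline` helper that takes the task's vtime, the current global vtime, and the CPU count; the 30ms cap and the 8-core gate are from the text:

```rust
const VTIME_CEILING_NS: u64 = 30_000_000; // 30ms

/// Cap a batch task's deadline at vtime_now + 30ms on machines with 8 or more cores.
pub fn batch_deadline(task_vtime: u64, vtime_now: u64, nr_cpus: u32) -> u64 {
    if nr_cpus >= 8 {
        task_vtime.min(vtime_now + VTIME_CEILING_NS)
    } else {
        // Below 8 cores the ceiling is not applied.
        task_vtime
    }
}
```

A daemon with a vtime of 100ms when `vtime_now` is 10ms would sort at 40ms on an 8-core machine and keep its 100ms position on a 4-core machine.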

Hard Starvation Rescue

Absolute safety net. The timeout is the minimum of a linearly rising term and an inversely falling term: min(25ms * nr_cpus, 500ms / max(1, nr_cpus/4)), clamped to 20-500ms. It is short at low core counts (fast starvation: 2C = 50ms) and at high core counts (dispatch contention: 128C = 20ms), and longer in the mid-range (8C = 200ms). It fires before the interactive DSQ and guarantees batch service regardless of interactive pressure.
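
The formula can be checked against the quoted values with a short sketch. The function name is hypothetical, and treating `nr_cpus/4` as integer division is an assumption; the constants are from the text.

```rust
const STARVE_NSEC_PER_MSEC: u64 = 1_000_000;

/// min(25ms * nr_cpus, 500ms / max(1, nr_cpus / 4)), clamped to 20-500ms.
pub fn hard_starvation_ns(nr_cpus: u64) -> u64 {
    let rising = 25 * STARVE_NSEC_PER_MSEC * nr_cpus;
    // Integer division assumed for nr_cpus / 4.
    let falling = 500 * STARVE_NSEC_PER_MSEC / (nr_cpus / 4).max(1);
    rising
        .min(falling)
        .clamp(20 * STARVE_NSEC_PER_MSEC, 500 * STARVE_NSEC_PER_MSEC)
}
```

Under that assumption the sketch reproduces the three quoted points: 50ms at 2 cores, 200ms at 8 cores, and 20ms at 128 cores (where the raw value of about 15.6ms is raised to the 20ms floor).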

Batch DSQ Separation

Batch tasks enqueue to dedicated per-node batch overflow DSQs instead of sharing vtime-ordered DSQs with interactive tasks. Separate DSQs give dispatch explicit control: interactive overflow first, then sojourn rescue, then batch fallback.

CoDel Sojourn Rescue

batch_enqueue_ns records when the batch DSQ transitions from empty to non-empty. dispatch() rescues batch tasks waiting longer than sojourn_thresh_ns. The threshold is set by the Rust adaptive layer from observed dispatch rate: target = 4x dispatch interval, EWMA-smoothed (7/8 old + 1/8 new), clamped to core-count-aware floor/ceiling.
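
The threshold update described above is a clamped EWMA toward four times the observed dispatch interval. A minimal sketch, with hypothetical function names and with the floor and ceiling passed in as parameters because the text does not give their values:

```rust
/// One EWMA step of the sojourn threshold: 7/8 old + 1/8 of (4 x dispatch interval),
/// clamped to [floor_ns, ceil_ns]. Requires floor_ns <= ceil_ns.
pub fn next_sojourn_thresh_ns(
    prev_ns: u64,
    dispatch_interval_ns: u64,
    floor_ns: u64,
    ceil_ns: u64,
) -> u64 {
    let target = 4 * dispatch_interval_ns;
    let smoothed = (7 * prev_ns + target) / 8;
    smoothed.clamp(floor_ns, ceil_ns)
}

/// True when the batch DSQ has been non-empty for longer than the threshold.
pub fn batch_needs_rescue(now_ns: u64, batch_enqueue_ns: u64, sojourn_thresh_ns: u64) -> bool {
    now_ns.saturating_sub(batch_enqueue_ns) > sojourn_thresh_ns
}
```

Starting from an 8ms threshold with a 1ms dispatch interval, one step gives (7 * 8ms + 4ms) / 8 = 7.5ms, so the threshold drifts toward the 4ms target rather than jumping to it.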

Deficit Counter (DRR)

After interactive_budget consecutive interactive dispatches without batch service, the dispatcher forces one batch dispatch.
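
The forcing rule can be sketched as a small counter. This is an illustration under assumptions: the `DeficitCounter` and `Pick` types are hypothetical, and the sketch models only the budget rule stated above, not the starvation check or longrun tightening mentioned in the dispatch order.

```rust
/// What the counter decides to dispatch next (illustrative type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Pick {
    Interactive,
    Batch,
    Idle,
}

/// Counts consecutive interactive dispatches and forces batch service when the budget runs out.
pub struct DeficitCounter {
    budget: u64,
    consecutive_interactive: u64,
}

impl DeficitCounter {
    pub fn new(budget: u64) -> Self {
        Self { budget, consecutive_interactive: 0 }
    }

    pub fn pick(&mut self, interactive_ready: bool, batch_ready: bool) -> Pick {
        let exhausted = self.consecutive_interactive >= self.budget;
        if batch_ready && (exhausted || !interactive_ready) {
            // Any batch service resets the run of interactive dispatches.
            self.consecutive_interactive = 0;
            Pick::Batch
        } else if interactive_ready {
            self.consecutive_interactive += 1;
            Pick::Interactive
        } else {
            Pick::Idle
        }
    }
}
```

With a budget of 2 and both queues always ready, the sequence is interactive, interactive, batch, then interactive again.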
