Forkrun

NUMA-Aware Contention-Free Dynamically-Auto-Tuning Bash-Native Streaming Parallelization Engine

Install / Use

/learn @jkool702/Forkrun

README

forkrun — NUMA-Aware Contention-Free Streaming Parallelization

forkrun is a self-tuning, drop-in replacement for GNU Parallel and xargs -P that accelerates shell-based data preparation by 50×–400× on modern CPUs and scales linearly on NUMA architectures.

forkrun achieves:

200,000+ batch dispatches/sec (vs ~500 for GNU Parallel)
~95–99% CPU utilization across all cores (vs ~6% for GNU Parallel)
Near-zero cross-socket memory traffic (NUMA-aware “born-local” design)

forkrun is built for high-frequency, low-latency workloads on deep NUMA hardware — a regime where existing tools leave most cores idle due to IPC overhead and cross-socket data migration.

🚀 Quick Start (Installation & Usage)

forkrun is distributed as a single bash file with an embedded, self-extracting compiled C extension. There are no external dependencies (no Perl, no Python).

Download and source it directly:

source <(curl -sL https://raw.githubusercontent.com/jkool702/forkrun/main/frun.bash)

(Note: sourcing the script sets up the required C loadable builtins in your shell environment.)
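If you would rather review the script before it touches your shell, one option is to download it to a file first and source it from disk. This is only a sketch: the helper name `fetch_frun` and the destination path are hypothetical, and it assumes `curl` and `sha256sum` are installed.

```bash
# fetch_frun URL DEST: download a script to DEST instead of sourcing the stream,
# then print its SHA-256 so the copy you reviewed can be identified later.
# (Hypothetical helper, not part of forkrun.)
fetch_frun() {
    local url=$1 dest=$2
    curl -fsSL "$url" -o "$dest" || return 1
    sha256sum "$dest"
}

# Usage (not run automatically):
#   fetch_frun https://raw.githubusercontent.com/jkool702/forkrun/main/frun.bash ./frun.bash
#   less ./frun.bash      # review the script before trusting it
#   source ./frun.bash
```

Sourcing from a local file gives the same result as the one-liner above, but leaves a copy on disk that can be inspected and hashed.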

Once sourced, frun acts as a drop-in parallelizer:

frun my_bash_func < inputs.txt             # parallelize custom bash functions natively!
cat file_list | frun -k sed 's/old/new/'   # pipe-based input, ordered output
frun -k -s sort < records.tsv              # stdin-passthrough, ordered output
frun -s -I 'gzip -c >{ID}.gz' < raw_logs   # stdin-passthrough, unique output names
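As a rough illustration of the first form, here is what a custom function might look like. The function name `count_chars` is hypothetical, and the sketch assumes that frun hands each batch of input lines to the function as separate, quoted positional arguments (as the "array + fully-quoted args" default in the benchmarks suggests). If frun has not been sourced, the snippet falls back to calling the function directly.

```bash
# Hypothetical worker function: prints the length of each argument, a tab,
# and the argument itself. Assumes one input line per positional argument.
count_chars() {
    local line
    for line in "$@"; do
        printf '%d\t%s\n' "${#line}" "$line"
    done
}

if type -t frun >/dev/null; then
    # frun is available in this shell: parallelize, keeping input order (-k).
    printf '%s\n' alpha beta gamma | frun -k count_chars
else
    # frun not sourced: run the same function serially for comparison.
    count_chars alpha beta gamma
fi
```

Writing the function to loop over `"$@"` means it behaves the same whether it receives one line or a whole batch.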

Verifiable Builds: The embedded C-extension is compiled and injected transparently via GitHub Actions. You can trace the git blame of the Base64 blob directly to the public CI workflow run that compiled forkrun_ring.c, guaranteeing the binary contains no hidden malicious code.

⚡ Benchmarks (14-core/28-thread i9-7940x, 100M lines)

| Workload | forkrun | GNU Parallel | Speedup | Notes |
|---|---|---|---|---|
| Default (array + fully-quoted args, no-op) | 24 M lines/s | 58 k lines/s | ~415× | forkrun default mode |
| Ordered output (-k, no-op) | 24.5 M lines/s | 57 k lines/s | ~430× | ordering is free in forkrun |
| echo (line args) | 22.6 M lines/s | ~55 k lines/s | ~410× | typical shell command |
| printf '%s\n' (I/O heavy) | 12.8 M lines/s | ~58 k lines/s | ~220× | formatting + output |
| -s stdin passthrough (no-op) | 893 M lines/s | 6.05 M lines/s (--pipe) | ~148× | streaming / splice |
| -b 524288 byte batches (no-op) | 1.54 B lines/s | 6.02 M lines/s (--pipe) | ~256× | kernel-limited |
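The speedup column is the ratio of the two throughput figures in each row. A small helper (the name `speedup` is hypothetical) makes it easy to recompute; the results agree with the table's approximate values to within rounding.

```bash
# speedup FAST SLOW: print FAST/SLOW rounded to the nearest integer.
speedup() {
    awk -v fast="$1" -v slow="$2" 'BEGIN { printf "%.0f\n", fast / slow }'
}

speedup 24000000 58000         # default mode           -> 414
speedup 893000000 6050000      # -s stdin passthrough   -> 148
speedup 1540000000 6020000     # -b 524288 byte batches -> 256
```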

Average CPU utilization across ~400 benchmarks

forkrun: 95% (27.1 / 28 logical cores). There is no centralized dispatcher, so all 27.1 cores do actual work.
GNU Parallel: 6% doing actual work (1.68 / 28 logical cores). One further full core is used strictly for dispatching work, for 2.68 cores busy in total.

🧠 How It Works: The Physics of forkrun

Traditional tools like GNU Parallel use heavy regex parsing and IPC dispatch loops that bottleneck multi-socket servers. forkrun operates completely differently. The pipeline has four stages, each designed to preserve physical locality:

Ingest (Born-Local NUMA): Data is splice()'d from stdin into a shared memfd. This is PFS-friendly (avoids Lustre/NFS metadata storms). On multi-socket systems, set_mempolicy(MPOL_BIND) places each chunk's pages on a target NUMA node before any worker touches them. This placement is driven by real-time backpressure from the per-node indexers, making NUMA distribution completely self-load-balancing.
Index: Per-node indexers (pinned to their socket) find record boundaries using AVX2/NEON SIMD scanning at memory bandwidth. They dynamically batch based on runtime conditions, then publish offset markers into a per-node lock-free ring buffer.
Claim (Contention-Free): Workers claim batches via a single atomic_fetch_add — no CAS retry loops, no locks, no contention. Overshoots are handled by depositing remainders into an escrow pipe for idle workers to steal.
Reclaim: A background fallow thread punches holes behind completed work via fallocate() with FALLOC_FL_PUNCH_HOLE, bounding memory usage without breaking the offset coordinate system.
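The index arithmetic behind the Claim stage can be pictured with a toy model. The sketch below is a single-process simulation in plain Bash, not forkrun's C code: nothing in it is atomic, the names `cursor`, `published`, `fetch_add`, and `claim` are hypothetical, and an overshoot is simply clipped rather than deposited into an escrow pipe.

```bash
# Toy model of fetch-add claiming. NOT atomic and NOT forkrun's implementation;
# it only shows how one add hands each worker a disjoint range.
cursor=0        # next unclaimed record index (stands in for the shared counter)
published=10    # number of records published so far (hypothetical value)

# fetch_add N: store the old cursor in CLAIM_START, then advance cursor by N.
# Must be called in the current shell (not in $(...)) so the update persists.
fetch_add() {
    CLAIM_START=$cursor
    cursor=$(( cursor + $1 ))
}

# claim WORKER BATCH: claim up to BATCH records and print the claimed range.
# Returns 1 if the claim started past the published records.
claim() {
    local worker=$1 batch=$2 start end
    fetch_add "$batch"
    start=$CLAIM_START
    end=$(( start + batch ))
    if (( start >= published )); then
        echo "worker $worker: nothing to claim"
        return 1
    fi
    # Overshoot: this toy model clips the range to what has been published.
    (( end > published )) && end=$published
    echo "worker $worker: records [$start, $end)"
}

claim A 4            # worker A: records [0, 4)
claim B 4            # worker B: records [4, 8)
claim C 4            # worker C: records [8, 10)  (overshoot clipped)
claim D 4 || true    # worker D: nothing to claim
```

Because every claim advances the counter unconditionally, no claimant ever has to retry, which is the property the README contrasts with CAS loops.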

Adaptive tuning is fully automatic. A PID-based controller discovers the optimal batch size in O(log L) steps and continuously adjusts based on input rate, consumption rate, and worker starvation — with no user -n or -j configuration required.
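To get a feel for logarithmic convergence, consider a deliberately simplified growth rule. The sketch below is not forkrun's controller, whose internals are not described here; it assumes, purely for illustration, that a batch size starts at 1 and doubles each step, and it counts the steps needed to reach a hypothetical target size.

```bash
# steps_to_reach TARGET: count the doublings needed to grow a batch size
# from 1 to at least TARGET. Illustrative only; doubling is an assumption.
steps_to_reach() {
    local target=$1 size=1 steps=0
    while (( size < target )); do
        (( size *= 2, steps += 1 ))
    done
    echo "$steps"
}

steps_to_reach 1000       # 10 steps (2^10 = 1024)
steps_to_reach 1000000    # 20 steps (2^20 = 1048576)
```

A thousand-fold increase in the target adds only ten steps, which is why a logarithmic search can settle quickly even when the ideal batch size is large.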

🛠 Requirements & Dependencies

forkrun is designed to run anywhere with zero friction:

Required: Bash ≥ 4.0 (Bash 5.1+ highly recommended for array performance), Linux Kernel ≥ 3.17 (for memfd).
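A quick way to check both minimums before sourcing forkrun is sketched below. The helper `version_ge` is hypothetical (not part of forkrun) and compares dotted numeric versions field by field; the kernel check assumes a Linux-style `uname -r` string such as `5.15.0-91-generic`.

```bash
# version_ge A B: succeed if dotted numeric version A is >= B.
# (Hypothetical helper; missing fields in A count as 0.)
version_ge() {
    local IFS=. i
    local -a a=($1) b=($2)
    for (( i = 0; i < ${#b[@]}; i++ )); do
        (( 10#${a[i]:-0} > 10#${b[i]} )) && return 0
        (( 10#${a[i]:-0} < 10#${b[i]} )) && return 1
    done
    return 0
}

bash_ver="${BASH_VERSINFO[0]}.${BASH_VERSINFO[1]}"
kernel_ver=$(uname -r)
kernel_ver=${kernel_ver%%[!0-9.]*}    # drop suffixes such as "-91-generic"

if version_ge "$bash_ver" 4.0; then
    echo "bash $bash_ver: ok"
else
    echo "bash $bash_ver: too old (need >= 4.0)"
fi

if version_ge "$kernel_ver" 3.17; then
    echo "kernel $kernel_ver: ok"
else
    echo "kernel $kernel_ver: too old (need >= 3.17 for memfd)"
fi
```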

🏛️ Legacy Version (v2)

With the release of v3.0.0, forkrun has transitioned to a high-performance C-ring architecture (frun.bash). The older pure-Bash, coproc-based v2 (forkrun.bash) remains available in the legacy/ directory as a fully functional, high-performance alternative, but v3 is highly recommended for all modern workloads. forkrun v1 is not recommended for use.

🛣 Roadmap

forkrun currently guarantees correctness under the assumption that at least one worker per NUMA node remains alive until its assigned work completes — a safe assumption for local shell operations on healthy compute nodes.

Priorities for the development roadmap include:

Failure isolation and per-batch retries to handle transient worker crashes.
Resume-after-interruption state saving to gracefully handle preempted cluster/Slurm jobs.
Deeper integration with facility workload managers.

(If forkrun is saving your institution compute-hours, please consider sponsoring its development to accelerate these features!)

jkool702

GitHub Stars: 329

Category: Content

Updated: 44m ago

Forks: 7

jkool702/forkrun

Languages

Shell

Security Score

100/100

Audited on Apr 1, 2026

No findings