Kbox
Boot a real Linux kernel as an in-process library (LKL) and route intercepted syscalls to it via seccomp
kbox
kbox boots a real Linux kernel as an in-process library (LKL) and routes intercepted syscalls to it. Three interception tiers are available: seccomp-unotify (most compatible), SIGSYS trap (lower latency), and binary rewriting (near-native for process-info syscalls). The default auto mode selects the fastest tier that works for a given workload. kbox provides a rootless chroot/proot alternative with kernel-level syscall accuracy, and serves as a high-observability execution substrate for AI agent tool calls.
Why kbox
Running Linux userspace programs in a rootless, unprivileged environment requires intercepting their syscalls and providing a convincing kernel interface. Existing tools fall short:
- chroot requires root privileges (or user namespaces, which are unavailable on many systems including Termux and locked-down shared hosts).
- proot uses ptrace for syscall interception. ptrace is slow (two context switches per syscall), cannot faithfully emulate all syscalls, breaks under complex multi-threaded workloads, and its path translation is vulnerable to TOCTOU races.
- User Mode Linux (UML) runs as a separate supervisor/guest process tree with ptrace-based syscall routing, imposing overhead and complexity that LKL avoids by running in-process.
- gVisor implements a userspace kernel from scratch: millions of lines reimplementing Linux semantics, inevitably diverging from the real kernel on edge cases.
kbox takes a fundamentally different approach: boot the actual Linux kernel as an in-process library and route intercepted syscalls to it. The kernel that handles your open() is the same kernel that runs on servers in production. No reimplementation, no approximation.
The interception mechanism matters too. kbox offers three tiers, each trading isolation for speed:
- Seccomp-unotify (Tier 3): syscall notifications delivered to a separate supervisor process via SECCOMP_RET_USER_NOTIF. Strongest isolation, lowest overhead for file I/O. The supervisor dispatches to LKL and injects results back via two ioctl round-trips per syscall.
- SIGSYS trap (Tier 1): in-process signal handler intercepts syscalls via SECCOMP_RET_TRAP. No cross-process round-trip, but the signal frame build/restore and a service-thread hand-off (eventfd + futex) add overhead. Best for metadata operations on aarch64 where the USER_NOTIF round-trip cost is proportionally higher.
- Binary rewriting (Tier 2): syscall instructions patched to call a trampoline at load time. On aarch64, SVC #0 is replaced with a B branch into a per-site trampoline that calls the dispatch function directly on the guest thread, with zero signal overhead, zero context switches, and zero FS base switching. Stat from the LKL inode cache completes in-process without any kernel round-trip. On x86_64, only 8-byte wrapper sites (mov $NR, %eax; syscall; ret) are patched; bare 2-byte syscall instructions cannot currently be rewritten in-place (the only same-width replacement, call *%rax, would jump to the syscall number in RAX), so unpatched sites fall through to the SIGSYS trap path. Process-info syscalls (getpid, gettid) at wrapper sites return virtualized values inline at native speed.
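The x86_64 wrapper-site rule above can be illustrated with a small sketch. It is not kbox's code: the function names are hypothetical, the int3 padding after the jump is an assumption, and the byte match is naive (the real scanner is described as instruction-boundary-aware). It only shows why an 8-byte `mov $NR, %eax; syscall; ret` site has room for a 5-byte `jmp rel32`, while a bare 2-byte `syscall` does not.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1 if buf[off..off+8) holds the wrapper
 * "mov $NR, %eax; syscall; ret" (B8 imm32 / 0F 05 / C3) and stores NR. */
static int match_wrapper_site(const uint8_t *buf, size_t len, size_t off,
                              uint32_t *nr)
{
    assert(buf != NULL && nr != NULL);
    if (off > len || len - off < 8)
        return 0; /* too short: a bare 2-byte syscall never qualifies */
    const uint8_t *p = buf + off;
    if (p[0] != 0xB8 || p[5] != 0x0F || p[6] != 0x05 || p[7] != 0xC3)
        return 0;
    /* imm32 is stored little-endian in the instruction stream */
    *nr = (uint32_t)p[1] | ((uint32_t)p[2] << 8) |
          ((uint32_t)p[3] << 16) | ((uint32_t)p[4] << 24);
    return 1;
}

/* Hypothetical helper: overwrite an 8-byte wrapper site with
 * "jmp rel32" (E9 + 4 bytes). The 3 leftover bytes are filled with
 * int3 (CC); that padding choice is an assumption of this sketch.
 * Returns 0 on success, -1 if the trampoline is out of rel32 range. */
static int patch_wrapper_site(uint8_t *site, uint64_t site_addr,
                              uint64_t tramp_addr)
{
    assert(site != NULL);
    /* rel32 is relative to the end of the 5-byte jmp instruction */
    int64_t rel = (int64_t)(tramp_addr - (site_addr + 5));
    if (rel < INT32_MIN || rel > INT32_MAX)
        return -1;
    uint32_t r = (uint32_t)(int32_t)rel;
    site[0] = 0xE9;
    site[1] = (uint8_t)(r & 0xFF);
    site[2] = (uint8_t)((r >> 8) & 0xFF);
    site[3] = (uint8_t)((r >> 16) & 0xFF);
    site[4] = (uint8_t)((r >> 24) & 0xFF);
    memset(site + 5, 0xCC, 3);
    return 0;
}
```

A site that fails `match_wrapper_site` is simply left alone in this sketch, which corresponds to the fall-through to the SIGSYS trap path described above.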
The default --syscall-mode=auto selects the fastest tier for each command. Non-shell direct binaries use rewrite/trap on both x86_64 and aarch64 (faster open+close and lseek+read via the local fast-path that bypasses the service thread for 50+ LKL-free syscalls). Shell invocations and networking commands use seccomp (fork/exec coherence and the SLIRP poll loop require the supervisor). The selection is based on binary analysis: the main executable is scanned for fork/clone wrapper sites, and binaries that can fork fall back to seccomp. A guest-thread local fast-path (kbox_dispatch_try_local_fast_path) handles brk, futex, epoll, poll, mmap, munmap, and other host-kernel operations with zero IPC overhead. An FD-local stat cache avoids repeated LKL inode lookups for fstat on the same file descriptor. (Note: ASAN builds pin auto to seccomp; the trap path's guest-stack switch is incompatible with sanitizer memory tracking.)
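The auto-mode rules in the paragraph above amount to a short decision function. The sketch below restates them under stated assumptions: the enum, struct, and function names are hypothetical, and the inputs (shell invocation, networking, fork/clone wrapper sites, ASAN build) are assumed to have been computed elsewhere. It does not reflect how kbox structures this logic internally.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical tier names for illustration only. */
enum tier { TIER_SECCOMP, TIER_TRAP, TIER_REWRITE };

/* Hypothetical summary of what auto mode knows about a launch. */
struct launch_info {
    bool is_shell;       /* command is a shell invocation */
    bool uses_net;       /* networking was requested */
    bool has_fork_sites; /* main executable has fork/clone wrapper sites */
    bool asan_build;     /* built with AddressSanitizer */
};

/* Restates the prose rules: anything that needs the supervisor (or an
 * ASAN build) gets seccomp; everything else gets the rewrite tier, whose
 * unpatched sites fall through to the trap path. */
static enum tier auto_select(const struct launch_info *li)
{
    assert(li != NULL);
    if (li->asan_build)
        return TIER_SECCOMP; /* guest-stack switch vs. sanitizer tracking */
    if (li->is_shell || li->uses_net)
        return TIER_SECCOMP; /* fork/exec coherence, SLIRP poll loop */
    if (li->has_fork_sites)
        return TIER_SECCOMP; /* binary can fork */
    return TIER_REWRITE;
}
```

Ordering the checks from most to least restrictive keeps each rule independent: a direct binary only reaches the fast tiers once every reason to require the supervisor has been ruled out.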
The result: programs get real VFS, real ext4, real procfs, at near-native syscall speed, without root privileges, containers, VMs, or ptrace.
How it works
Seccomp mode (--syscall-mode=seccomp, shell commands in auto)

           ┌────────────────┐
           │  guest child   │  (seccomp BPF: USER_NOTIF)
           └──────┬─────────┘
                  │ syscall notification
           ┌──────▼──────────┐          ┌──────────────────┐
           │  supervisor     │────────▶ │  web observatory │
           │  (dispatch)     │ counters │  (HTTP + SSE)    │
           └────┬───────┬────┘  events  └────────┬─────────┘
      LKL path  │       │ host path              │
    ┌───────────▼──┐ ┌──▼──────────┐             ▼
    │  LKL kernel  │ │ host kernel │      ┌──────────────┐
    │  (in-proc)   │ │             │      │  web browser │
    └──────────────┘ └─────────────┘      └──────────────┘
Trap mode (--syscall-mode=trap, direct binaries in auto)

    ┌─────────────────────────────────────────┐
    │             single process              │
    │ ┌─────────────┐   ┌──────────────────┐  │
    │ │ guest code  │──▶│ SIGSYS handler   │  │
    │ │ (loaded ELF)│   │ (dispatch thread)│  │
    │ └─────────────┘   └───┬────────┬─────┘  │
    │             LKL path  │        │ host   │
    │         ┌─────────────▼──┐ ┌───▼─────┐  │
    │         │  LKL kernel    │ │  host   │  │
    │         │  (in-proc)     │ │  kernel │  │
    │         └────────────────┘ └─────────┘  │
    └─────────────────────────────────────────┘
- The supervisor opens a rootfs disk image and registers it as an LKL block device.
- LKL boots a real Linux kernel inside the process (no VM, no separate process tree).
- The filesystem is mounted via LKL, and the supervisor sets the guest's virtual root via LKL's internal chroot.
- The launch path depends on the syscall mode:
  - Seccomp: a child process is forked with a BPF filter that delivers syscalls as user notifications. The supervisor receives each notification, dispatches to LKL or the host kernel, and injects results back.
  - Trap: the guest binary is loaded into the current process via a userspace ELF loader. A BPF filter traps guest-range syscalls via SECCOMP_RET_TRAP, delivering SIGSYS. A service thread runs the dispatch; the signal handler captures the request and spins until the result is ready. No cross-process round-trip.
  - Rewrite: same as trap, but additionally patches syscall instructions to branch directly into dispatch trampolines, eliminating the SIGSYS signal overhead entirely for patched sites. On aarch64, SVC #0 (4 bytes, fixed-width) is replaced with a B branch to a per-site trampoline past the segment end; veneer pages with LDR+BR indirect stubs bridge sites beyond ±128MB. The trampoline saves registers, calls the C dispatch function on the guest thread, and returns. No signal frame, no service thread, no context switch. On x86_64, only 8-byte wrapper sites (mov $NR, %eax; syscall; ret) can be safely patched (to jmp rel32 targeting a wrapper trampoline); bare 2-byte syscall/sysenter instructions cannot be rewritten in-place because the replacement call *%rax would jump to the syscall number, not a code address. Unpatched x86_64 sites fall through to the SIGSYS trap path. An instruction-boundary-aware length decoder (x86-decode.c) ensures the scanner never matches 0F 05 bytes that appear inside longer instructions (immediates, displacements). Site-aware classification labels each site as WRAPPER (eligible for inline virtualized getpid=1, gettid=1) or COMPLEX (must use full dispatch). W^X enforcement blocks simultaneous PROT_WRITE|PROT_EXEC in guest memory.
  - Auto (default): selects the fastest tier per command. Non-shell direct binaries whose main executable has no fork/clone wrapper sites use rewrite/trap on both x86_64 and aarch64. On aarch64, rewrite delivers ~7x faster stat (~3us vs 22us seccomp) via in-process LKL inode cache. On x86_64, trap delivers faster lseek+read (~1.4x) and open+close (~1.1x) via the guest-thread local fast-path (50+ CONTINUE syscalls bypass the service thread entirely). Shell invocations and --net commands always use seccomp (fork coherence and SLIRP poll loop). If the selected tier fails at install time, auto falls through to the next tier. ASAN builds pin auto to seccomp (guest-stack switch incompatible with sanitizer tracking).
Syscall routing
Every intercepted syscall is dispatched to one of three dispositions:
- LKL forward (~74 handlers): filesystem operations (open, read, write, stat, getdents, mkdir, unlink, rename), metadata (chmod, chown, utimensat), identity (getuid, setuid, getgroups), and networking (socket, connect). In seccomp mode, the supervisor reads arguments from tracee memory via process_vm_readv and writes results via process_vm_writev. In trap/rewrite mode, guest memory is accessed directly via memcpy (same address space) with sigsetjmp-based fault recovery that returns -EFAULT for unmapped pointers. An FD-local stat cache (16 entries, round-robin) avoids repeated LKL inode lookups for fstat.
- Host CONTINUE (~50 entries): scheduling (sched_yield, sched_setscheduler), signals (rt_sigaction, kill, tgkill), memory management (mmap, mprotect, brk, munmap, mremap), I/O multiplexing (epoll, poll, select), threading (futex, clone, set_tid_address, rseq), time (nanosleep, clock_gettime), and more. In seccomp mode, the kernel replays the syscall. In trap/rewrite mode, a guest-thread local
