Regexr
Regex engine specialized in tokenization and LLM text processing
Install / Use
/learn @ml-rust/RegexrREADME

regexr
A specialized Rust regex engine designed for LLM tokenization and complex pattern matching.
Originally created as the regex backend for splintr, an LLM tokenizer. Passes compliance tests for industry-standard tokenizer patterns (OpenAI's
cl100k_base, Meta's Llama 3).Please report issues on the Issue Tracker.
🎯 When to use regexr
This is a specialized tool, not a general-purpose replacement.
The Rust ecosystem already has the excellent, battle-tested regex crate. For 99% of use cases, you should use that.
Only use regexr if you specifically need:
- Lookarounds: You need
(?=...),(?<=...), or(?!\S)without a C build dependency.- Why not
regex? It intentionally omits lookarounds to guarantee linear time. - Why not
fancy-regex? No JIT compilation.
- Why not
- JIT Compilation: You want native code generation for hot regex patterns.
- Why not
regex/fancy-regex? Neither offers JIT compilation. - Why not
pcre2? Requires installing a system C library (libpcre2).
- Why not
- No System Dependencies: Just
cargo add regexr— no C toolchain or system libraries required. All code (including JIT) compiles withcargo build. - WASM Support: Compiles to
wasm32-unknown-unknownwith--no-default-features. Same API and features across native and WASM targets — JIT/SIMD are disabled automatically, but lookarounds, backreferences, and ReDoS protection all work. No need to swap crates per platform. - Bounded Execution: ReDoS protection that memoizes states (guaranteeing completion) rather than just aborting after a timeout (like
pcre2).
The Problem Solved
Developers building LLM tokenizers (like GPT-4 or Llama 3) currently face a dilemma in Rust:
regexcrate: Fast, safe, but lacks lookarounds and JIT compilation.fancy-regex: Supports lookarounds, but lacks JIT compilation.pcre2: Supports everything including JIT, but requires system C library installation and cross-compilation setup.
regexr bridges this gap. It provides Lookarounds + JIT compilation + Backreferences with no system dependencies — just cargo add and go.
Installation
Add this to your Cargo.toml:
[dependencies]
regexr = "*"
For JIT compilation support:
[dependencies]
regexr = { version = "*", features = ["full"] }
Usage
Basic matching
use regexr::Regex;
let re = Regex::new(r"\w+").unwrap();
assert!(re.is_match("hello"));
// Find first match
if let Some(m) = re.find("hello world") {
println!("Found: {}", m.as_str()); // "hello"
}
// Find all matches
for m in re.find_iter("hello world") {
println!("{}", m.as_str());
}
Capture groups
use regexr::Regex;
let re = Regex::new(r"(\w+)@(\w+)\.(\w+)").unwrap();
let caps = re.captures("user@example.com").unwrap();
println!("{}", &caps[0]); // "user@example.com"
println!("{}", &caps[1]); // "user"
println!("{}", &caps[2]); // "example"
println!("{}", &caps[3]); // "com"
Named captures
use regexr::Regex;
let re = Regex::new(r"(?P<user>\w+)@(?P<domain>\w+\.\w+)").unwrap();
let caps = re.captures("user@example.com").unwrap();
println!("{}", &caps["user"]); // "user"
println!("{}", &caps["domain"]); // "example.com"
JIT compilation
Enable JIT for patterns that will be matched many times:
use regexr::RegexBuilder;
let re = RegexBuilder::new(r"\w+")
.jit(true)
.build()
.unwrap();
assert!(re.is_match("hello"));
Prefix optimization for tokenizers
For patterns with many literal alternatives (e.g., keyword matching in tokenizers):
use regexr::RegexBuilder;
let re = RegexBuilder::new(r"(function|for|while|if|else|return)")
.optimize_prefixes(true)
.build()
.unwrap();
assert!(re.is_match("function"));
Text replacement
use regexr::Regex;
let re = Regex::new(r"\d+").unwrap();
// Replace first match
let result = re.replace("abc 123 def", "NUM");
assert_eq!(result, "abc NUM def");
// Replace all matches
let result = re.replace_all("abc 123 def 456", "NUM");
assert_eq!(result, "abc NUM def NUM");
Feature Flags
simd(default): Enables SIMD-accelerated literal searchjit: Enables JIT compilation (x86-64 and ARM64)full: Enables both JIT and SIMD
Platform Support
| Platform | JIT Support | SIMD Support | | --------------------------- | ----------- | ------------ | | Linux x86-64 | ✓ | ✓ (AVX2) | | Linux ARM64 | ✓ | ✗ | | macOS x86-64 | ✓ | ✓ (AVX2) | | macOS ARM64 (Apple Silicon) | ✓ | ✗ | | Windows x86-64 | ✓ | ✓ (AVX2) | | WASM (wasm32) | ✗ | ✗ | | Other | ✗ | ✗ |
Build without default features for a minimal installation (also works for WASM):
cargo build --no-default-features # Minimal (PikeVM + LazyDFA only)
cargo build --no-default-features --target wasm32-unknown-unknown # WASM target
Build with all optimizations:
cargo build --features "full"
Engine Selection
The library automatically selects the best execution engine based on pattern characteristics:
Non-JIT mode (default):
- ShiftOr: Small patterns (≤64 states) without anchors/word boundaries
- EagerDfa: Patterns with word boundaries or anchors
- LazyDfa: General patterns with on-demand state construction
- BacktrackingVm: Patterns with backreferences
- PikeVm: Patterns with lookaround or non-greedy quantifiers
JIT mode (with jit feature):
- BacktrackingJit: Patterns with backreferences
- TaggedNfa: Patterns with lookaround or non-greedy quantifiers
- JitShiftOr: Small patterns with alternations
- DFA JIT: General patterns, benefits from SIMD prefiltering
See docs/architecture.md for details on the engine selection logic.
Performance
Speedup relative to regex crate (higher is better):

Highlights (speedup vs regex crate):
| Benchmark | regexr | regexr-jit | pcre2-jit |
| ---------------- | -------------- | -------------- | ----------- |
| log_parsing | 0.80-0.84x | 3.91-4.09x | 3.57-3.71x |
| url_extraction | 0.81-0.83x | 1.95-1.99x | 2.10-2.13x |
| unicode_letters | 1.24x | 1.43-1.44x | 1.65-1.72x |
| html_tags | 0.82-0.87x | 1.33-1.43x | 0.80-0.85x |
| word_boundary | 1.19-1.24x | 1.15-1.19x | 0.72-0.74x |
| email_validation | 0.99-1.00x | 1.00-1.11x | 0.94-1.00x |
| alternation | 0.88-1.01x | 0.88-1.01x | 0.12-0.15x |
regexr-jitexcels at log parsing (4x faster thanregex)regexr(non-JIT) matchesregexperformance on most patterns- Both outperform
fancy-regexandpcre2(non-JIT) consistently
Documentation
- Architecture Overview - Engine architecture and selection logic
- Features - Detailed feature documentation
Citation
If you use regexr in your research, please cite:
@software{regexr2025,
author = {Syah, Farhan},
title = {regexr: A Rust Regex Engine with JIT Compilation for LLM Tokenization},
year = {2025},
url = {https://github.com/ml-rust/regexr},
note = {Rust regex engine with lookaround support and JIT compilation}
}
Related Skills
next
A beautifully designed, floating Pomodoro timer that respects your workspace.
product-manager-skills
50PM skill for Claude Code, Codex, Cursor, and Windsurf: diagnose SaaS metrics, critique PRDs, plan roadmaps, run discovery, and coach PM career transitions.
devplan-mcp-server
3MCP server for generating development plans, project roadmaps, and task breakdowns for Claude Code. Turn project ideas into paint-by-numbers implementation plans.
