Regex2dfa
Convert regular expressions to minimized DFAs in AT&T FST format.
Install / Use
/learn @kpdyer/Regex2dfaREADME
regex2dfa
Convert regular expressions to minimized DFAs in AT&T FST format.
Pure Python. Zero dependencies. Results cached in memory.
Installation
pip install regex2dfa
Or install from source:
git clone https://github.com/kpdyer/regex2dfa.git
cd regex2dfa
pip install -e .
Quick Start
from regex2dfa import regex2dfa
dfa = regex2dfa("(a|b)+")
print(dfa)
Output:
0 1 97 97
0 1 98 98
1 1 97 97
1 1 98 98
1
Caching
Results are automatically cached in memory. Repeated calls with the same regex return instantly:
from regex2dfa import regex2dfa, cache_info, clear_cache
# First call computes the DFA
regex2dfa("(a|b)+")
# Second call returns from cache (instant)
regex2dfa("(a|b)+")
# Check cache statistics
print(cache_info())
# CacheInfo(hits=1, misses=1, maxsize=1024, currsize=1)
# Clear cache if needed
clear_cache()
Object-Oriented Interface
from regex2dfa import Regex2DFA
converter = Regex2DFA("(a|b)+")
# Access intermediate representations
print(f"NFA states: {len(converter.nfa.states)}")
print(f"DFA states: {len(converter.dfa.states)}")
print(f"Minimized DFA states: {len(converter.minimized_dfa.states)}")
# Get AT&T format
print(converter.to_att())
Low-Level API
from regex2dfa import parse_regex, build_nfa, nfa_to_dfa, minimize_dfa, format_att
# Step-by-step pipeline
postfix = parse_regex("(a|b)+")
nfa = build_nfa(postfix)
dfa = nfa_to_dfa(nfa)
min_dfa = minimize_dfa(dfa)
output = format_att(min_dfa)
AT&T FST Format
The output uses tab-separated fields:
- Transitions:
src dst input output - Final states:
state_id
Labels are ASCII byte values (97 = 'a', 98 = 'b').
Supported Regex Syntax
| Feature | Syntax | Description |
|---------|--------|-------------|
| Literal | a | Matches character |
| Any char | . | Matches any byte (0-255) |
| Char class | [abc] | Matches any listed char |
| Negated class | [^abc] | Matches any char not listed |
| Range | [a-z] | Matches range |
| Zero or more | * | Kleene star |
| One or more | + | Kleene plus |
| Optional | ? | Zero or one |
| Alternation | \| | Either side |
| Grouping | (...) | Group expressions |
| Anchors | ^ $ | Start/end (stripped, implicit) |
| Hex escape | \x00 | Byte by hex value |
| Any byte | \C | Any byte (0-255) |
| Digits | \d | [0-9] |
| Word chars | \w | [a-zA-Z0-9_] |
| Whitespace | \s | Space, tab, newline, carriage return |
How It Works
- Parser — Tokenizes regex and converts to postfix notation (shunting-yard algorithm)
- Thompson's Construction — Builds NFA from postfix tokens
- Subset Construction — Converts NFA to DFA (powerset method)
- Hopcroft's Algorithm — Minimizes DFA (partition refinement)
Project Structure
regex2dfa/
├── __init__.py # Public API
├── core.py # Main regex2dfa() function with caching
├── parser.py # Regex tokenizer + postfix conversion
├── nfa.py # NFA data structures
├── thompson.py # Thompson's construction (Regex → NFA)
├── dfa.py # DFA data structures
├── subset.py # Subset construction (NFA → DFA)
├── hopcroft.py # DFA minimization
└── formatter.py # AT&T FST output
Tests
pytest tests/ -v
License
MIT
