dregex - Deterministic Regular Expression Engine

dregex is a Java library that implements a regular expression engine using deterministic finite automata (DFA). It supports some Perl-style features yet retains linear matching time. Additionally, it can perform set operations (union, intersection, and difference).

Rationale

History

"Regular expressions" are a mathematically defined concept, invented by Stephen Kleene in 1956. In their most minimalistic (and original) version, these expressions define languages using just literal characters, alternation ("|"), and repetition ("*"). They have well-understood properties; most importantly, they can be matched against text using another well-understood abstract device: a Deterministic Finite Automaton (DFA). DFAs have the important property of running on arbitrary text of length n in O(n) time and using O(1) space. Presented with a regular expression and a candidate text, a DFA decides whether the text matches the expression.

Some years later, regular expressions entered actual usage when Ken Thompson included them as a feature in the editor QED. After that, they became a standard in the UNIX world, appearing in a variety of programs, notably the still-used grep tool.

Around 1994, Henry Spencer wrote a grep implementation that used a Non-Deterministic Finite Automaton (NFA). Using a NFA simplifies the compilation of expressions (building the automata) at the cost of complicating the execution. More importantly, NFA-based engines (specifically, those using recursive backtracking) make it easy to add new features, many of which are impossible to do with a state-machine (DFA) based implementation. That's precisely what started to happen when the library was used to implement regular expressions in Perl 5 in 1994.

After the release of Perl 5, the extended feature set was quickly embraced by users, and Perl-style regular expression implementations became part of the standard libraries of virtually all popular programming languages and also standard infrastructure components, like the Apache and Nginx web servers.

As mentioned, Perl-style expressions don't give any execution time guarantee. On the other hand, there are some features of Perl regular expressions that are impossible to express in a DFA, most notable backreferences (i.e., forcing a match of same text more than once). Other features, while not infeasible, significantly complicate a DFA solution, such as search and capturing groups.

Proposal

Regular expressions were born as a very specific tool and, almost by accident, grew to be one of the most versatile (and sometimes abused) tools in the world of software. However, there is a fundamental trade-off between the two prototypical implementations that is often ignored. Unbounded execution time is undesirable for many (if not all) interactive uses, as problems can happen when either the regular expression or the text are supplied by the user. For examples, see:

For cases when advanced Perl-style features are not needed and predictable performance is desired, using DFA-based matching is a compelling alternative, unfortunately made difficult by the scarcity of proper implementations in most languages. dregex offers an alternative for the Java ecosystem.

Supported regex flavor

Unless specified, the supported regular expression flavor attempts to be compatible with the Java flavor.

Supported features

Literals
Escaped characters
Standard regular language constructs: |, ?, * and +, and parenthesis grouping.
General quantifiers: {n}, {n,m}, {n,}
Dot wildcards
Simple character classes: [abc]
Simple negated character classes: [^abc]
Ranges in character classes: [a-z]
Special character classes, including negated versions: \w, \s, \d, \W, \S, \D
POSIX character classes (using the syntax and definition of java.util.regex.Pattern), e.g., \p{Lower}, \p{Punct}, \P{Lower}
Unicode character classes (using the syntax and definition of java.util.regex.Pattern):
- Blocks, e.g., \p{InGreek}, \p{block=Greek}, \p{blk=Greek}
- Scripts, e.g., \p{IsLatin}, \p{script=Latin}, \p{sc=Latin}
- General categories, e.g., \p{Lu}, \p{IsLu}, \p{general_category=Lu}, \p{gc=Lu}
- Binary properties, e.g., \p{IsAlphabetic}
Java-defined character classes (using the syntax and definition of java.util.regex.Pattern), e.g., \p{javaLowerCase}, \p{javaWhitespace}
Special character classes inside regular character classes: [\d\s], [\D]
Unicode line breaks: \R
Block quotes: \Q...\E
Lookaround (lookahead and lookbehind; both positive and negative) (see note below)

The definition of Unicode character classes (Unicode blocks, scripts, general categories and binary properties) is tied to the library and not the JVM). The version currently supported is Unicode 16.0.0.

Compile flags

With one exception, all compile flags defined by java.util.regex.Pattern are supported, including in embedded form:

LITERAL
COMMENTS (also in embedded form: (?x)). Note: The flag intentionally behaves ignoring exactly the same set of white space characters as the standard Java implementation, that is, only ASCII white space, not Unicode.
DOTALL (also in embedded form: (?s))
UNIX_LINES (also in embedded form: (?d))
UNICODE_CHARACTER_CLASS (also in embedded form: (?U))
CASE_INSENSITIVE (also in embedded form: (?i))
UNICODE_CASE (also in embedded form: (?u))
CANON_EQ

There is a special limitation with respect to embedded flags: they are only supported at the beginning of the pattern.

Not supported

Searching (the engine matches only against the full input string)
Capturing groups
Backreferences
Anchors (ˆ and $), as they are redundant because the expressions only operate over the complete text.
Reluctant (+?, *?, ??, {...}?) and possessive (++, *+, ?+, {...}+) quantifiers , because they are meaningless for a pure-matching engine. By definition, they only affect capturing groups, not whether an expression matches.
Compile flag MULTILINE, because it is meaningless for a pure-matching engine, which works always in multi-line mode.

Note: for the safety, the presence of unsupported features in a regular expression will cause it to fail the compilation (except for unnamed capturing groups, as they have no syntax: they are just a pair of parenthesis).

Set operations

In addition to regular matching, dregex fully supports set operations on regular expressions. These operations work on regular expressions themselves, not on input strings. It possible to do union, intersection and difference:

List<Regex> regexes = Regex.compile(List.of("[a-z]+", "[A-Z]+", "[a-z]+|[A-Z]+"));
Regex lower = regexes.get(0);
Regex upper = regexes.get(1);
Regex both = regexes.get(2);
System.out.println(lower.doIntersect(upper)); // false
System.out.println(both.equiv(lower.union(upper))); // true

List<Regex> regexes = Regex.compile(List.of("[a-z]+|[A-Z]+", "[A-Z]+"));
Regex all = regexes.get(0);
Regex upper = regexes.get(1);
Regex lower = all.diff(upper);
System.out.println(lower.matches("aaa")); // true
System.out.println(lower.matches("Aaa")); // false

The motivating use case was detecting non-intersecting expressions. Once it can be establ

Dregex

Install / Use

README