lex
A C++ library for Lua style pattern matching.
The standard library of the Lua programming language includes a minimalistic pattern matching capability with features you won't find in common regular expression implementations. This library provides an easy way to integrate Lua style pattern matching into your C++ application.
The library is tested by a simple test program that includes tests that are ported from the Lua test suite.
There are additional tests to verify parts that are specific for the implementation of this library.
You can also use the test program to toy with the library.
This project includes a makefile to build the test program but doesn't include makefiles or project files to build the library. Integration in your own build environment should be easy since the library consists of only 2 source files (lex.h and lex.cpp) without any external dependencies.
Features
The features listed here are specific to this implementation; they do not describe the capabilities of Lua's pattern matching itself.
- Uses only the C++ standard library, no external dependencies.
- Full compatibility with the match patterns as implemented in Lua 5.4.
- Support for a wide range of string types such as std::string, std::string_view, character arrays and pointers. All of these can be based on the character types as defined by C++17 and C++20.
- Match a string with a pattern.
- Substitute a matching pattern by a replacement pattern or the result of a function that is called for each match.
- Iterate over a string with a pattern.
Requirements
- A compiler that supports at least C++17.
Examples
Match a word at the beginning of a string
// 1 Our input string to match.
std::string str = "Hello world!";
// 2 Match a word at the beginning of a string.
auto result = pg::lex::match( str, "^%a+" );
// 3 Validate if there was a match.
assert( result );
// 4 Print the match.
std::cout << "Match: " << result.at( 0 ) << '\n';
Output:
Match: Hello
Match with a capture
// 1 Our input string.
std::wstring str = L"Hello PG1003!";
// 2 Match and capture 'PG'.
auto result = pg::lex::match( str, "(%a+)%d+" );
// 3 Validate if there was a match.
assert( result );
// 4 Print the capture; a wide string needs a wide output stream.
std::wcout << "Capture: " << result.at( 0 ) << '\n';
Output:
Capture: PG
Iterate with a pattern
// 1 Our input string.
auto str = "foo = 42; bar= 1337; baz = PG =1003 ;";
// 2 Iterate over all key/value pairs.
for( auto match : pg::lex::gmatch( str, "(%a+)%s*=%s*(%d+)%s*;" ) )
{
// 3 The match should have 2 captures.
assert( match.size() == 2 );
// 4 Print the key/value pairs.
std::cout << "Key: " << match.at( 0 ) << ", Value: " << match.at( 1 ) << '\n';
}
Output:
Key: foo, Value: 42
Key: bar, Value: 1337
Key: PG, Value: 1003
Substitution with a replacement pattern
// 1 Our input string.
auto str = "foo =\t42; bar= 1337; pg =1003 ;";
// 2 Match pattern.
auto pat = "(%a+)%s*=%s*(%d+)%s*;";
// 3 Replacement pattern.
auto repl = "%1=%2;";
// 4 Do the global substitution.
auto result = pg::lex::gsub( str, pat, repl );
// 5 Print result.
std::cout << result << '\n';
Output:
foo=42; bar=1337; pg=1003;
Substitution with a function
// 1 Our input string.
auto str = "one two three four";
// 2 Function that generates the replacement.
std::string function( const pg::lex::match_result & mr )
{
if( mr.at( 0 ) == "one" )
{
return "PG";
}
return "1003";
}
// 3 Do the substitution for the first 2 matches and print the result.
std::cout << pg::lex::gsub( str, "%s*%w+", function, 2 ) << '\n';
Output:
PG1003 three four
Documentation
Lex errors
This library throws exceptions of the pg::lex::lex_error type, which is derived from std::runtime_error.
The pattern matching errors that Lua emits are also thrown by this library.
You can get the exception description and number by calling the what() and code() member functions.
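The sketch below illustrates the catch order that such an exception hierarchy allows. It does not use the library: `sketch_error`, `parse_pattern` and `describe_failure` are hypothetical stand-ins, and the `int` return type of `code()` is an assumption; the real pg::lex::lex_error may expose a different type.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for pg::lex::lex_error. The real class lives in
// lex.h; the int returned by code() is an assumption made for this sketch.
class sketch_error : public std::runtime_error
{
public:
    sketch_error( int code, const std::string & description )
        : std::runtime_error( description )
        , m_code( code )
    {}

    int code() const noexcept { return m_code; }

private:
    int m_code;
};

// Hypothetical operation that rejects a pattern ending in a lone '%'.
void parse_pattern( const std::string & pattern )
{
    if( !pattern.empty() && pattern.back() == '%' )
    {
        throw sketch_error( 1, "malformed pattern (ends with '%')" );
    }
}

// Catches the specific type first to read code(), then falls back to the
// std::runtime_error base class, which only offers what().
std::string describe_failure( const std::string & pattern )
{
    try
    {
        parse_pattern( pattern );
        return "ok";
    }
    catch( const sketch_error & e )
    {
        return std::to_string( e.code() ) + ": " + e.what();
    }
    catch( const std::runtime_error & e )
    {
        return e.what();
    }
}
```

Because the exception type derives from std::runtime_error, existing handlers for std::runtime_error or std::exception keep working; only code that needs the error number has to name the specific type.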
Match result
A successful match returns a match result that contains one or more captures, plus the position and size of the matched substring.
The number of captures depends on the pattern: a match result holds the captures defined in the pattern, or a single capture for the whole match when the pattern doesn't define any.
For convenience a match result has an operator bool that returns true when it contains at least one capture.
A capture refers to a part of the input string. This means that the input string must remain valid for as long as you read the capture.
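The lifetime rule can be illustrated with plain std::string_view, without the library. In this sketch `first_word` and `first_word_copy` are hypothetical helpers; they only mimic the way a capture points into the input string rather than owning its characters.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical helper, not part of the library: returns a view of the first
// run of non-space characters. Like a capture, the view owns no characters;
// it points into the buffer behind 'input'.
std::string_view first_word( std::string_view input )
{
    const auto begin = input.find_first_not_of( ' ' );
    if( begin == std::string_view::npos )
    {
        return {};
    }
    const auto end = input.find( ' ', begin );
    return input.substr( begin, end == std::string_view::npos ? end : end - begin );
}

// Copies the viewed characters, so the result stays usable after the input
// string has been destroyed or modified.
std::string first_word_copy( std::string_view input )
{
    return std::string( first_word( input ) );
}
```

A view returned by `first_word` is only safe to read while the owning string is alive and unchanged. When a result has to outlive the input, copying it into a std::string, as `first_word_copy` does, is the usual way out.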
A match result is templated on the character type of the input string.
The following predefined match result types are available in the pg::lex namespace:
| Character type of input string | Match result type |
|--------------------------------|-------------------|
| char | match_result |
| wchar_t | wmatch_result |
| char8_t (C++20) | u8match_result |
| char16_t | u16match_result |
| char32_t | u32match_result |
You can iterate over the captures of a match result; it has begin and end member functions that return a random access iterator.
A match result iterator returns a string view when it is dereferenced.
Match results can also be used with range based for-loops.
Match
The pg::lex::match( string, pattern ) function searches for a pattern in a string and returns a match result.
An empty match result is returned when no match was found.
Iteration
To iterate over matches in a string, you create a context by calling the pg::lex::gmatch function.
A context is a pg::lex::gmatch_context object with a reference to an input string and a pattern.
You get a pg::lex::gmatch_iterator by calling the pg::lex::begin and pg::lex::end functions with a context as parameter.
A pg::lex::gmatch_iterator behaves like a forward iterator; it can only advance with the ++ operator.
Gmatch iterators return match results when you dereference them.
The pg::lex::begin function creates an iterator and searches for the first match in the input string.
The returned iterator is equal to the iterator returned by pg::lex::end when no match was found.
A pg::lex::gmatch_context is compatible with range-based for-loops, as shown in the Iterate with a pattern example.
For readability, prefer calling the pg::lex::gmatch function over instantiating a pg::lex::gmatch_context object yourself.
Substitute
There are two overloaded functions that substitute a matched substring with a replacement.
The pg::lex::gsub( string, pattern, replacement, count = -1 ) overload replaces the match with a replacement pattern.
The second overload, pg::lex::gsub( string, pattern, function, count = -1 ), replaces the match with the result of the function that is called for each match.
The function must accept a match result as parameter that is templated on the same character type as the input string.
The count parameter limits the number of substitutions; a negative value means the number of substitutions is unlimited.
The string.gsub function in Lua also supports tables as lookup for replacements.
This library does not support this Lua feature.
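A table lookup can be approximated with the function overload. The sketch below shows only the lookup part and does not use the library: `replacement_table` and `lookup_replacement` are hypothetical names. It follows Lua's rule that a match without a table entry is kept unchanged.

```cpp
#include <cassert>
#include <string>
#include <string_view>
#include <unordered_map>

// Hypothetical alias for the lookup table; Lua would use a table here.
using replacement_table = std::unordered_map< std::string, std::string >;

// Hypothetical helper, not part of the library: returns the table entry for
// the matched text, or the matched text itself when there is no entry.
std::string lookup_replacement( const replacement_table & table,
                                std::string_view         whole_match )
{
    // A std::string key is built because std::unordered_map has no
    // heterogeneous lookup in C++17.
    const auto it = table.find( std::string( whole_match ) );
    return it != table.end() ? it->second : std::string( whole_match );
}
```

In a callback passed to pg::lex::gsub, the key would presumably come from the match result, for example mr.at( 0 ) as in the Substitution with a function example.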
Replacement pattern
A replacement pattern is a string that contains a replacement text which can include captures of the match result.
References to captures are written as %d, where d is a number between 1 and 9 that refers to the first up to the ninth capture.
%0 stands for the whole match.
The whole match is handled as one capture when a pattern doesn't specify any captures.
auto a = pg::lex::gsub( "hello world", "(%w+)", "%1 %1" );
assert( a == "hello hello world world" );
auto b = pg::lex::gsub("hello world", "%w+", "%0 %0", 1); // Whole match
auto c = pg::lex::gsub("hello world", "%w+", "%1 %1", 1); // Same since there are no captures
assert( b == "hello hello world" );
assert( b == c );
auto d = pg::lex::gsub("hello world from Lua", "(%w+)%s*(%w+)", "%2 %1");
assert( d == "world hello Lua from" );
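To make the expansion rules concrete, here is a minimal sketch of how a replacement pattern could be expanded for a single match. It is not the library's implementation: `expand_replacement` is a hypothetical helper that takes the whole match and the captures as plain string views. The handling of `%%` as a literal percent sign follows Lua's string.gsub; whether this library treats it the same way is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper, not part of the library: expands the references in a
// replacement pattern for one match.
std::string expand_replacement( std::string_view                        repl,
                                std::string_view                        whole_match,
                                const std::vector< std::string_view > & captures )
{
    std::string out;
    for( std::size_t i = 0; i < repl.size(); ++i )
    {
        if( repl[ i ] != '%' )
        {
            out += repl[ i ];               // Ordinary character, copy as-is
            continue;
        }
        if( ++i == repl.size() )
        {
            throw std::runtime_error( "replacement pattern ends with '%'" );
        }
        const char d = repl[ i ];
        if( d == '%' )
        {
            out += '%';                     // "%%" is a literal percent sign (Lua rule)
        }
        else if( d == '0' )
        {
            out += whole_match;             // %0 is the whole match
        }
        else if( d >= '1' && d <= '9' )
        {
            const auto index = static_cast< std::size_t >( d - '1' );
            if( captures.empty() && index == 0 )
            {
                out += whole_match;         // No captures: %1 is the whole match
            }
            else if( index < captures.size() )
            {
                out += captures[ index ];
            }
            else
            {
                throw std::runtime_error( "invalid capture index in replacement pattern" );
            }
        }
        else
        {
            throw std::runtime_error( "invalid use of '%' in replacement pattern" );
        }
    }
    return out;
}
```

For example, `expand_replacement( "%2 %1", "hello world", { "hello", "world" } )` yields `world hello`, which corresponds to the first substitution in the last gsub example above.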
Patterns
For convenience this paragraph is copied from the patterns paragraph in the Lua reference manual and adjusted to match the usage of this library.
Patterns are described by regular strings, which are interpreted when matching, iterating and substituting. This section describes the syntax and the meaning (that is, what they match) of these strings.
Character class
A character class is used to represent a set of characters. The following combinations are allowed in describing a character class:
- x (where x is not one of the magic characters ^$()%.[]*+-?) represents the character x itself.
- . (a dot) represents all characters.
- %a represents all letters.
- %c represents all control characters.
- %d represents all digits.
- %g represents all printable characters except space.
- %l represents all lowercase letters.
- %p represents all punctuation characters.
- %s represents all space characters.
- %u represents all uppercase letters.
- %w represents all alphanumeric characters.
- %x represents all hexadecimal digits.
- %x (where x is any non-alphanumeric character) represents the character x. This is the standard way to escape the magic characters. Any non-alphanumeric character (including all punctuation characters, even the non-magical) can be preceded by a % to represent itself in a pattern.
- [set] represents the class which is the union of all characters in set. A range of characters can be specified by separating the end characters of the range, in ascending order, with a -. All classes %x described above can also be used as components in set. All other characters in set repr
