Stringozzi
C++ parsing expression grammar (PEG) library for parsing, searching, and validating strings, letting you write regex-like expressions directly in C++ code
Serving efficiently, Served with :sparkling_heart:
C++ Parsing Expression Grammar (PEG) library for parsing, searching, and validating strings, similar to RegEx
What's new
Version 2.0.0 comes with a completely revamped and rewritten core that balances flexibility and performance.
Motivation
The idea for this project came from the need for a library that lets the user build ABNF-like rules to parse text-based network messages (e.g. HTTP/SIP), which have complex string patterns and structures, easily and efficiently.
Philosophy
- To be open source and friendly to use in commercial products
- To be efficient, especially in parsing, without losing flexibility
- To be cross-platform (Windows and Linux for now!)
- To support as wide a range of C++ standards as possible
Why Stringozzi ?
- Broad C++ standard compliance (this version is compliant with C++98 and upward)
- 3x-10x* faster than std::regex (VS/GCC)
- Lean and mean: use only the features you need (Validate / Search distinction, and matching results can be disabled)
- The grammar is checked by the C++ compiler rather than the engine
- New features not found in regex, such as support for recursive expressions, character chain detection, conditional parsing, and (A)BNF-like rules
- Seamless support for UTF and Unicode strings, with mix and match
- Support for long expressions and text (which is not the case in regex for some reason)
- Packed with useful built-in expressions (IPv4, URI, ServerName)
Why not ?
- You don't like pasta or Italian cuisine :smile:
- The expressions you use are too short
- Your program loads its validation expressions from an external source (such as a database or text files)
The magic you can do
An expression like this
^([XYZ]{5}X?\s+\<(AB){3}\>)*$
is equivalent to
Rule z = *(5 * (In("XYZ")) > ~Is('X') > +WhiteSpace() > Enclosed(3 * (Is("AB")), "<", ">")) > End();
...
StringozziA(z).Test("ZXYZZ <ABABAB>")
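For comparison, the regex side of this equivalence can be checked with the standard library alone. The sketch below is not part of Stringozzi; the helper name `matches_reference_pattern` is hypothetical, and the pattern writes `<` and `>` unescaped because they are ordinary characters in the ECMAScript grammar that std::regex uses by default.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: validates a whole string against the reference
// pattern using std::regex (C++11). '<' and '>' need no escaping in the
// default ECMAScript grammar, so they are written literally.
bool matches_reference_pattern(const std::string& text) {
    static const std::regex pattern(R"(^([XYZ]{5}X?\s+<(AB){3}>)*$)");
    // regex_match succeeds only if the entire input matches the pattern.
    return std::regex_match(text, pattern);
}
```

With this helper, `matches_reference_pattern("ZXYZZ <ABABAB>")` returns true, while an input with only four leading letters such as `"ZXYZ <ABABAB>"` is rejected.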
Or structured ones like these
const Rule Verb = (Is("GET") | Is("POST"));
const Rule URI = Is("http://") > *(Any() & !WhiteSpace());
const Rule RequestLine = Verb > URI > Is("HTTP/2.0") > EndOfLine();
const Rule Header = *(Any() & !WhiteSpace()) > Is(":") > *(Any() & !WhiteSpace());
const Rule Request = RequestLine > *(Header) > EndOfLine() > Content;
...
Getting Started
Cloning the repository
cd stringozzi
git clone "https://github.com/osamasalem/stringozzi.git" .
Building
For Windows
build.bat
For Linux
./build.sh
First Steps
#include <Stringozzi.h>
int main(int argc, char** argv)
{
    const Rule r = Is("Student No#:") > (Range(1,3) * Digit());
    Actions::Test(r, "Student No#: 434");
}
Rules and Operators
| Rule | Description |
|-|-|
| Any() | Any character |
| Digit() | Any numeric character between '0' and '9'|
| Alphabet() | Any character between 'A' and 'Z'|
| Alphanumeric() | Digit or Alphabet|
| WhiteSpace() | Any space character |
| Beginning() | Matches the beginning of text |
| End() | Matches the end of text (i.e. '\0') |
| a>b | Sequence: parses with rule a first, then rule b |
| !a | Negation: succeeds when rule a does not match; does not move the parsing pointer |
| a&b | Boolean "And": the token must match both rule a and rule b, and the most relevant match is applied |
| a\|b | Boolean "Or" with short circuit: the first rule that matches wins and the pointer moves to the end of that match, so always list the most specific rule first and the most general last, or use the greedy Or operator |
| a\|\|b | Greedy "Or": the token may match either rule; all rules are always checked and the most relevant (matching) one is taken, at the expense of performance |
| *a | Zero or more: succeeds whether a matches no instance or multiple instances. Take care when using this rule, it can match anything |
| [num]*a | Exactly num matches of a |
| [num]+a | One or more (at least one match, with an optional maximum num): matches a single instance or multiple instances |
| ~a | Optional rule a: parses the token whenever possible |
| Between(a,b) | a and b are chars, this matches any character in the specified range|
| In(str) | Membership rule: str is a string pointer; matches any single character that belongs to the set str |
| Out(str) | Any character outside the set str |
| Is(tok) | Equality rule: tok is either a char or a string pointer; matches a single character or a run of consecutive characters |
| rule >> str | Stores the string matched by rule in the matches table under the name given in str |
| Skip(rule) | Skips characters until rule matches; always returns true |
| Until(rule) | Skips characters until rule matches; requires the next token to match rule |
| LookAhead(rule) | Peeks at the next token and checks whether it matches rule; does not move the parsing pointer |
| LookBack(rule) | Peeks at the previous token and checks whether it matches rule; does not move the parsing pointer |
| CaseSensitive() | this will set case sensitive mode in parsing process |
| CaseInsensitive() | this will set case insensitive mode in parsing process |
| SetVar([varname], [value]) | Always matches. Sets a flag/variable to the specified value; if no value is supplied, the default is 1 |
| DelVar([varname]) | Always matches. Removes/unsets the flag/variable |
| If(varname,[value]) | Checks whether the stored variable varname equals the specified value; if no value is specified, the default is 1 |
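The difference between the short-circuit `a|b` and the greedy `a||b` rows above can be illustrated without the library. The following is a minimal sketch under stated assumptions: the `Parser` type and the helpers `literal`, `sequence`, `ordered_choice`, `longest_choice`, and `matches_all` are hypothetical names invented for illustration, and "most relevant" is interpreted here as "longest match", which the table does not confirm. It is not how Stringozzi is implemented.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical model: a parser takes the input and a start offset and
// returns the number of characters consumed, or -1 on failure.
using Parser = std::function<long(const std::string&, std::size_t)>;

// Matches the literal string `lit` at the current position.
Parser literal(const std::string& lit) {
    return [lit](const std::string& in, std::size_t pos) -> long {
        return in.compare(pos, lit.size(), lit) == 0
                   ? static_cast<long>(lit.size())
                   : -1;
    };
}

// Sequence (in the spirit of a > b): run a, then b where a stopped.
Parser sequence(Parser a, Parser b) {
    return [a, b](const std::string& in, std::size_t pos) -> long {
        long n = a(in, pos);
        if (n < 0) return -1;
        long m = b(in, pos + static_cast<std::size_t>(n));
        return m < 0 ? -1 : n + m;
    };
}

// Short-circuit choice (in the spirit of a | b): the first alternative
// that matches wins, and b is never tried afterwards.
Parser ordered_choice(Parser a, Parser b) {
    return [a, b](const std::string& in, std::size_t pos) -> long {
        long n = a(in, pos);
        return n >= 0 ? n : b(in, pos);
    };
}

// Assumed stand-in for a || b: try both alternatives, keep the longer one.
Parser longest_choice(Parser a, Parser b) {
    return [a, b](const std::string& in, std::size_t pos) -> long {
        return std::max(a(in, pos), b(in, pos));
    };
}

// True only if the parser consumes the whole input from offset 0.
bool matches_all(const Parser& p, const std::string& in) {
    return p(in, 0) == static_cast<long>(in.size());
}
```

In this model, `sequence(ordered_choice(literal("V"), literal("Via")), literal(":"))` accepts `"V:"` but rejects `"Via:"`, because the choice commits to `V` and the following `:` then meets `i`. Swapping the alternatives so that `Via` comes first, or using `longest_choice`, accepts both inputs; the second option pays for evaluating every alternative.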
Flags definition
| Flag | Description |
|-|-|
| SPEG_CASEINSENSITIVE | Specify if matching process is case (in)sensitive |
| SPEG_MATCHNAMED | Records all named matches produced by Extract or the >> operator. Clearing this flag bypasses marking these matches |
| SPEG_MATCHUNNAMED | Stores all successful matches. Clearing this flag bypasses marking these matches |
| SPEG_IGNORESPACES | Matches successive tokens whether or not there are spaces between them. The Whitespace match pattern does not work in this mode |
Guides and Use Cases
Use structured Rules
It is possible to use the Rules inside each other like this
const Rule Digit = Between('0','9');
const Rule SmallAlphabet = Between('a','z');
const Rule CapitalAlphabet = Between('A','Z');
const Rule Alphabet = CapitalAlphabet | SmallAlphabet;
const Rule Alphanumeric = Digit | Alphabet;
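The same nesting idea can be sketched in plain C++ with composable predicates. This is only an analogy for how small rules build larger ones: `CharRule`, `between`, `either`, and `make_alphanumeric` are hypothetical names, not Stringozzi types or functions.

```cpp
#include <functional>

// Hypothetical single-character rule: true if the character is accepted.
using CharRule = std::function<bool(char)>;

// Accepts any character in the inclusive range [lo, hi].
CharRule between(char lo, char hi) {
    return [lo, hi](char c) { return lo <= c && c <= hi; };
}

// Accepts a character if either of the two rules accepts it.
CharRule either(CharRule a, CharRule b) {
    return [a, b](char c) { return a(c) || b(c); };
}

// Builds the alphanumeric rule from smaller rules, mirroring the
// Digit / SmallAlphabet / CapitalAlphabet composition above.
CharRule make_alphanumeric() {
    const CharRule digit = between('0', '9');
    const CharRule small_alphabet = between('a', 'z');
    const CharRule capital_alphabet = between('A', 'Z');
    const CharRule alphabet = either(capital_alphabet, small_alphabet);
    return either(digit, alphabet);
}
```

The rule returned by `make_alphanumeric()` accepts `'7'`, `'q'`, and `'Q'` and rejects `'-'`. Each named piece stays reusable on its own.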
Basic operations
The basic operations you can perform with Stringozzi:
- Test: validates the input string against the given rule
Actions::Test(In("ABC"), "A")
- FastMatch: like Test, but returns the related matches
MatchesA m;
Actions::FastMatch(In("ABC") >> "Match" , "A", m)
- Search: scans the string until the rule matches
bool b = Actions::Search(Is('b'), "abc"); // true
const char* ptr = Actions::SearchAndGetPtr(Is('b'), "abc"); // "bc"
int idx = Actions::SearchAndGetIndex(Is('b'), "abc"); // 1
- Match: the combination of Search + FastMatch. It scans the string until the rule matches, then extracts the matches
MatchesA m;
Actions::Match(In("ABC") >> "Match" , "-----A", m)
- Replace: scans the string until the rule matches, then replaces the matched text with the specified text
Actions::Replace(Is("ABC"), "1234567ABC890ABC", "X", 0, 1); // "1234567X890ABC"
- Split: scans the string until the rule matches, then splits the string at each match and collects the pieces
vector<string> vec;
Actions::Split(Is("<=>"), "1234567<=>ABC", vec, 0, true, 1); // ["1234567","ABC"]
Using Matches.. (Not :fire: ones :wink:)
There are two types of expression matches
- Named : where you set the name of the match in the rule
- Anonymous : Any other tokens matched by expression elements
You can extract matches this way and inspect the resulting matches:
MatchesA m;
StringozziA str(Is('K') >> "MYMATCH"); //a rule to match letter K and store it as "MYMATCH"
str.Match("K", m); // Match against string "K"
//number of total matches entries
m.NumberOfMatches(); // == 2 => "K" and "<UNNAMED>"
// number of MYMATCH entries
m.NumberOfMatches("MYMATCH"); // == 1
m.NumberOfMatches("NOTFOUND"); // == 0
m.NumberOfMatches("<UNNAMED>"); // == 1
m.Get("MYMATCH",0); // "K"
m.Get("MYMATCH",1); // <NULL>
Case sensitivity
The default mode for Stringozzi is case sensitive. You can request case insensitivity in two ways. Either pass SPEG_CASEINSENSITIVE to the operation
Actions::Test(In("ABC"), "a", SPEG_CASEINSENSITIVE);
Or change the mode inside the rule itself
Actions::Test(CaseInsensitive() > In("ABC") > CaseSensitive(), "a");
Or Vs. Greedy Or
Consider this example
Rule r = (Is("V") | Is("Via")) > Is(':') ; // will not work for Via
This works for V: but not for Via:, because the rule always matches the first letter V; on the other hand, the remaining ia: and : will not match, due to the fact that the OR operator short-circuits to the first option and
