Ctpg
Compile Time Parser Generator is a C++ single header library which takes a language description as a C++ code and turns it into a LR1 table parser with a deterministic finite automaton lexical analyzer, all in compile time.
Install / Use
/learn @peter-winter/CtpgREADME
CTPG
C++ Compile Time Parser Generator
C++ single header library which takes a language description as a C++ code and turns it into a LR1 table parser with a deterministic finite automaton lexical analyzer, all in compile time. What's more, the generated parser is actually itself capable of parsing in compile time. All it needs is a C++17 compiler!
Contents
- Installation
- Compiler support
- Usage
- Explanation
- Compile Time Parsing
- LR(1) Parser
- Functors - advanced
- External context
- Various features
- Regular expressions
- Diagnostics
- Resulting binary size
Installation
Option 1.
It is a single header library. You can just copy the include/ctpg/ctpg.hpp header wherever you want.
Option 2.
Use CMake 3.14+ to build the library:
$ git clone https://github.com/peter-winter/ctpg
$ cmake -S ctpg -B ctpg/build -DCMAKE_BUILD_TYPE=Release
$ cmake --build ctpg/build
$ (cd ctpg/build ; ctest) # optional
$ cmake --install ctpg/build [--prefix /usr/local]
CTPG can be used via the standard find_package interface. Just link to ctpg::ctpg!
There is a minimal, self-contained example in examples/language/CMakeLists.txt.
Option 3.
The ctpg-git AUR package is available for Arch-based distribution users.
- Manual install:
git clone https://aur.archlinux.org/ctpg-git.git
cd ctpg-git
makepkg -si
- Using yay:
yay -S ctpg-git
Compiler support
Tested on:
- GCC 10.3+, 11.x, 12.x
- Clang 12.x, 13.x, 14.x
- MSVC 19.30+
Usage
Following code demonstrates a simple parser which takes a comma separated list of integer numbers as argument and prints a sum of them.
readme-example.cpp
#include <ctpg/ctpg.hpp>
#include <iostream>
#include <charconv>
using namespace ctpg;
using namespace ctpg::buffers;
constexpr nterm<int> list("list");
constexpr char number_pattern[] = "[1-9][0-9]*";
constexpr regex_term<number_pattern> number("number");
int to_int(std::string_view sv)
{
int i = 0;
std::from_chars(sv.data(), sv.data() + sv.size(), i);
return i;
}
constexpr parser p(
list,
terms(',', number),
nterms(list),
rules(
list(number) >=
to_int,
list(list, ',', number)
>= [](int sum, char, const auto& n){ return sum + to_int(n); }
)
);
int main(int argc, char* argv[])
{
if (argc < 2)
return -1;
auto res = p.parse(string_buffer(argv[1]), std::cerr);
bool success = res.has_value();
if (success)
std::cout << res.value() << std::endl;
return success ? 0 : -1;
}
Compile and run:
g++ readme-example.cpp -std=c++17 -o example && example "10, 20, 30"
You should see the output : 60. If incorrect text supplied as an argument:
g++ readme-example.cpp -std=c++17 -o example && example "1, 2, 3x"
you should see:
[1:8] PARSE: Unexpected character: x
Explanation
Header
#include <ctpg/ctpg.hpp>
Namespaces
Namespace ctpg is the top namespace. There are couple of feature namespaces like buffers
using namespace ctpg;
using namespace ctpg::buffers;
Terminal symbols
Terminal symbols (short: terms) are symbols used in grammar definition that are atomic blocks. Examples of the terms from a C++ language are: identifier, '+' operator, various keywords etc.
To define a term use the one of char_term, string_term and regex_term classes.
Here is the example of a regex_term with a common integer number regex pattern.
constexpr char number_pattern[] = "[1-9][0-9]*";
constexpr regex_term<number_pattern> number("number");
The constructor argument ("number") indicates a debug name and can be omitted, however it is not advised.
Names are handy to diagnose problems with the grammar. If omitted, the name will be set to the pattern string.
Note: the pattern needs to have a static linkage to be allowed as a template parameter. This is C++17 limitation, and CTPG does not support C++20 features yet.
Other types of terms
char_term is used when we need to match things like a + or , operator.
string_term is used when we need to match a whole string, like a language keyword.
Nonterminal symbols
Nonterminal symbols (short: nonterms) are essentially all non atomic symbols in the grammar. In C++ language these are things like: expression, class definition, function declaration etc.
To define a nonterm use the nterm class.
constexpr nterm<int> list("list");
The constructor argument ("list") is a debug name as well, like in the case of regex_term. The difference is in nterms names are neccessary, because they serve as unique identifiers as well. Therefore it is a requirement that nonterm names are unique.
Template parameter <int> in this case is a value type. More on this concept later.
Parser definition
The parser class together with its template deduction guides allows to define parsers using 4 arguments:
- Grammar root - symbol which is a top level nonterm for a grammar.
- List of all terms
- List of all nonterms
- List of rules
The parser object should be declared as constexpr, which makes all the neccessary calculations of the LR(1) table parser done in compile time.
Let's break down the arguments:
constexpr parser p(
list,
terms(',', number),
nterms(list),
Grammar root.
When the root symbol gets matched (in this case list) the parse is successful.
Term list.
List of terms enclosed in a terms call. In our case there are two: number and a ,.
Note: the
,term is not defined earlier in the code. It is an implicitchar_term. The code implicitly converts the char to thechar_termclass. Thereforechar_terms(as well asstring_terms) are allowed not to be defined in advance. Their debug names are assigned to the them by default to a char (or a string) they represent.
Nonterm list.
List of terms enclosed in a nterms call. In our case, just a single list nonterm is enough.
Rules
List of rules enclosed in a rules call.
Each rule is in the form of:
nonterm(symbols...) >= functor
The nonterm part is what's called a left side of the rule. The symbols are called the right side.
The right side can contain any number of nterm objects as well as terms (regex_terms, char_terms or string_terms).
Terms can be in their implicit form, like , in the example. Implicit string_terms are in form of "strings".
rules(
list(number)
>= to_int
list(list, ',', number)
>= [](int sum, char, const auto& n)
{ return sum + to_int(n); }
)
The first rule list(number) indicates that the list nonterm can be parsed using a single number regex term.
The second rule uses what's know as a left recurrence. In other words, a list can be parsed as a list followed by a , and a number.
Functors
The functors are any callables that can accept the exact number of arguments as there are symbols on the right side and return a value type of the left side. Each nth argument needs to accept a value of a value type of the nth right side symbol.
So in the case of the first to_int functor, it is required to accept a value type of regex_term and return an int.
The second functor is a lambda which accepts 3 arguments: an int for the list, a char for the , and auto for whatever is passed as
a value type for the regex_term.
Note: Functors are called in a way that allows taking advantage of move semantics, so defining it's arguments as a move reference is encouraged.
Value types for terms
Terms unlike nonterms (which have their value types defined as a template parameter to the nterm definition),
have their value types predefined to either a term_value<char> for a char_term, and a term_value<std::string_view>
for both regex_term and string_term.
The term_value class template is a simple wrapper that is implicitly convertible to it's template parameter (either a char or std::string_view).
That's why when providing functors we can simply declare arguments as either a char or a std::string_view.
In our case the to_int functor has a std::string_view argument, which accepts a term_value<std::string_view> just fine.
Of course an auto in case of lambda will always do the trick.
T
