CTPG

C++ Compile Time Parser Generator

C++ single header library which takes a language description as a C++ code and turns it into a LR1 table parser with a deterministic finite automaton lexical analyzer, all in compile time. What's more, the generated parser is actually itself capable of parsing in compile time. All it needs is a C++17 compiler!

Installation
Compiler support
Usage
Explanation
Compile Time Parsing
LR(1) Parser
Functors - advanced
- Functor helpers
- Default functors
External context
Various features
Regular expressions
Diagnostics
Resulting binary size

Installation

Option 1.

It is a single header library. You can just copy the include/ctpg/ctpg.hpp header wherever you want.

Option 2.

Use CMake 3.14+ to build the library:

$ git clone https://github.com/peter-winter/ctpg
$ cmake -S ctpg -B ctpg/build -DCMAKE_BUILD_TYPE=Release
$ cmake --build ctpg/build
$ (cd ctpg/build ; ctest)  # optional
$ cmake --install ctpg/build [--prefix /usr/local]

CTPG can be used via the standard find_package interface. Just link to ctpg::ctpg! There is a minimal, self-contained example in examples/language/CMakeLists.txt.

Option 3.

The ctpg-git AUR package is available for Arch-based distribution users.

Manual install:

git clone https://aur.archlinux.org/ctpg-git.git
cd ctpg-git
makepkg -si

Using yay: yay -S ctpg-git

Compiler support

Tested on:

GCC 10.3+, 11.x, 12.x
Clang 12.x, 13.x, 14.x
MSVC 19.30+

Usage

Following code demonstrates a simple parser which takes a comma separated list of integer numbers as argument and prints a sum of them.

readme-example.cpp

#include <ctpg/ctpg.hpp>
#include <iostream>
#include <charconv>

using namespace ctpg;
using namespace ctpg::buffers;

constexpr nterm<int> list("list");

constexpr char number_pattern[] = "[1-9][0-9]*";
constexpr regex_term<number_pattern> number("number");

int to_int(std::string_view sv)
{
    int i = 0;
    std::from_chars(sv.data(), sv.data() + sv.size(), i);
    return i;
}

constexpr parser p(
    list,
    terms(',', number),
    nterms(list),
    rules(
        list(number) >=
            to_int,
        list(list, ',', number)
            >= [](int sum, char, const auto& n){ return sum + to_int(n); }
    )
);

int main(int argc, char* argv[])
{
    if (argc < 2)
        return -1;
    auto res = p.parse(string_buffer(argv[1]), std::cerr);
    bool success = res.has_value();
    if (success)
        std::cout << res.value() << std::endl;
    return success ? 0 : -1;
}

Compile and run:

g++ readme-example.cpp -std=c++17 -o example && example "10, 20, 30"

You should see the output : 60. If incorrect text supplied as an argument:

g++ readme-example.cpp -std=c++17 -o example && example "1, 2, 3x"

you should see:

[1:8] PARSE: Unexpected character: x

Explanation

Header

#include <ctpg/ctpg.hpp>

Namespaces

Namespace ctpg is the top namespace. There are couple of feature namespaces like buffers

using namespace ctpg;
using namespace ctpg::buffers;

Terminal symbols

Terminal symbols (short: terms) are symbols used in grammar definition that are atomic blocks. Examples of the terms from a C++ language are: identifier, '+' operator, various keywords etc.

To define a term use the one of char_term, string_term and regex_term classes.

Here is the example of a regex_term with a common integer number regex pattern.

constexpr char number_pattern[] = "[1-9][0-9]*";
constexpr regex_term<number_pattern> number("number");

The constructor argument ("number") indicates a debug name and can be omitted, however it is not advised. Names are handy to diagnose problems with the grammar. If omitted, the name will be set to the pattern string.

Note: the pattern needs to have a static linkage to be allowed as a template parameter. This is C++17 limitation, and CTPG does not support C++20 features yet.

Other types of terms

char_term is used when we need to match things like a + or , operator. string_term is used when we need to match a whole string, like a language keyword.

Nonterminal symbols

Nonterminal symbols (short: nonterms) are essentially all non atomic symbols in the grammar. In C++ language these are things like: expression, class definition, function declaration etc.

To define a nonterm use the nterm class.

constexpr nterm<int> list("list");

The constructor argument ("list") is a debug name as well, like in the case of regex_term. The difference is in nterms names are neccessary, because they serve as unique identifiers as well. Therefore it is a requirement that nonterm names are unique.

Template parameter <int> in this case is a value type. More on this concept later.

Parser definition

The parser class together with its template deduction guides allows to define parsers using 4 arguments:

Grammar root - symbol which is a top level nonterm for a grammar.
List of all terms
List of all nonterms
List of rules

The parser object should be declared as constexpr, which makes all the neccessary calculations of the LR(1) table parser done in compile time.

Let's break down the arguments:

constexpr parser p(
    list,
    terms(',', number),         
    nterms(list),

Grammar root.

When the root symbol gets matched (in this case list) the parse is successful.

Term list.

List of terms enclosed in a terms call. In our case there are two: number and a ,.

Note: the , term is not defined earlier in the code. It is an implicit char_term. The code implicitly converts the char to the char_term class. Therefore char_terms (as well as string_terms) are allowed not to be defined in advance. Their debug names are assigned to the them by default to a char (or a string) they represent.

Nonterm list.

List of terms enclosed in a nterms call. In our case, just a single list nonterm is enough.

Rules

List of rules enclosed in a rules call. Each rule is in the form of: nonterm(symbols...) >= functor The nonterm part is what's called a left side of the rule. The symbols are called the right side.

The right side can contain any number of nterm objects as well as terms (regex_terms, char_terms or string_terms). Terms can be in their implicit form, like , in the example. Implicit string_terms are in form of "strings".

    rules(
        list(number)
            >= to_int
        list(list, ',', number)
            >= [](int sum, char, const auto& n)
            { return sum + to_int(n); }
    )

The first rule list(number) indicates that the list nonterm can be parsed using a single number regex term.

The second rule uses what's know as a left recurrence. In other words, a list can be parsed as a list followed by a , and a number.

Functors

The functors are any callables that can accept the exact number of arguments as there are symbols on the right side and return a value type of the left side. Each nth argument needs to accept a value of a value type of the nth right side symbol.

So in the case of the first to_int functor, it is required to accept a value type of regex_term and return an int.

The second functor is a lambda which accepts 3 arguments: an int for the list, a char for the , and auto for whatever is passed as a value type for the regex_term.

Note: Functors are called in a way that allows taking advantage of move semantics, so defining it's arguments as a move reference is encouraged.

Value types for terms

Terms unlike nonterms (which have their value types defined as a template parameter to the nterm definition), have their value types predefined to either a term_value<char> for a char_term, and a term_value<std::string_view> for both regex_term and string_term.

The term_value class template is a simple wrapper that is implicitly convertible to it's template parameter (either a char or std::string_view). That's why when providing functors we can simply declare arguments as either a char or a std::string_view. In our case the to_int functor has a std::string_view argument, which accepts a term_value<std::string_view> just fine. Of course an auto in case of lambda will always do the trick.

Ctpg

Install / Use

README