Lpegrex
Parse programming language syntax into an AST with ease using PEGs (an LPeg extension).
LPegRex
LPegRex is a re-implementation of the LPeg/LPegLabel re module with some extensions that make it
easy to parse language grammars into an AST (abstract syntax tree)
while maintaining readability.
LPegRex stands for LPeg Regular Expression eXtended.
Goals
The goal of this library is to extend the LPeg re module with some minor additions that make it easy to parse a whole programming language grammar into an abstract syntax tree using a single, simple, compact and clear PEG grammar.
For instance, one goal of the project is to parse Lua 5.4 source files with complete syntax into an abstract syntax tree in under 100 lines of clear PEG grammar rules, while generating output suitable for analysis by a compiler. This goal was accomplished; see the Lua example section below.
The new extensions should not break any existing re syntax.
The project was meant to be incorporated later into the Nelua programming language compiler. This goal was accomplished, and LPegRex is the new parsing engine for the Nelua compiler.
Additional Features
- New predefined patterns for control characters (%ca %cb %ct %cn %cv %cf %cr).
- New predefined patterns for UTF-8 (%utf8 %utf8seq %ascii).
- New predefined pattern for spaces independent of locale (%sp).
- New syntax for capturing arbitrary values while matching empty strings (e.g. $true).
- New syntax for optional captures (e.g. patt~?).
- New syntax for throwing label errors on failure of expected matches (e.g. @rule).
- New syntax for rules that capture AST nodes (e.g. NodeName <== patt).
- New syntax for rules that capture tables (e.g. MyList <-| patt).
- New syntax for matching unique tokens with automatic skipping (e.g. `,`).
- New syntax for matching unique keywords with automatic skipping (e.g. `for`).
- Auto generate the KEYWORD rule based on the keywords used in the grammar.
- Auto generate the TOKEN rule based on the tokens used in the grammar.
- Use the supplied NAME_SUFFIX rule for generating each keyword rule.
- Use the supplied SKIP rule for generating each keyword or token rule.
- Capture nodes with initial and final positions.
- Support using the - character in rule names.
- Predefine some useful auxiliary functions:
  - tonil: Substitute captures by nil.
  - totrue: Substitute captures by true.
  - tofalse: Substitute captures by false.
  - toemptytable: Substitute captures by {}.
  - tonumber: Substitute a string capture by its corresponding number.
  - tochar: Substitute a numeric code capture by its corresponding character byte.
  - toutf8char: Substitute a numeric code capture by its corresponding UTF-8 byte sequence.
  - foldleft: Fold tables to the left (use only with ~>).
  - foldright: Fold tables to the right (use only with ->).
  - rfoldleft: Fold tables to the left in reverse order (use only with ->).
  - rfoldright: Fold tables to the right in reverse order (use only with ~>).
Quick References
For reference on how to use re and its syntax,
please check its manual first.
Here is a quick reference of the new syntax additions:
| Purpose | Example Syntax | Equivalent Re Syntax |
|-|-|-|
| Rule | name <-- patt | name <- patt |
| Capture node rule | Node <== patt | Node <- {\| {:pos:{}:} {:tag:''->'Node':} patt {:endpos:{}:} \|} |
| Capture tagged node rule | name : Node <== patt | name <- {\| {:pos:{}:} {:tag:''->'Node':} patt {:endpos:{}:} \|} |
| Capture table rule | name <-\| patt | name <- {\| patt \|} |
| Match keyword | `keyword` | 'keyword' !NAME_SUFFIX SKIP |
| Match token | `.` `..` | !('..' SKIP) '.' SKIP '..' SKIP |
| Capture token or keyword | {`,`} | {','} SKIP |
| Optional capture | patt~? | patt / ''->tofalse |
| Match control character | %cn | %nl |
| Arbitrary capture | $'string' | ''->'string' |
| Expected match | @'string' @rule | 'string'^Expected_string rule^Expected_rule |
As you can see, the additional syntax is mostly sugar for common capture patterns used when defining programming language grammars.
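As a rough illustration of how several of these additions combine, the sketch below compiles a toy grammar for a bracketed list of integers. The grammar and its rule names (List, Number) are made up for this example. It assumes the lpegrex module is installed and defines SKIP as %s* so that it can also match nothing.

```lua
local lpegrex = require 'lpegrex'

-- Hypothetical toy grammar (not from the LPegRex sources):
--   List   uses <== to capture a node tagged 'List',
--   `[` `,` `]` are tokens that skip trailing spaces via SKIP,
--   @Number and @`]` are expected matches that throw a label on failure.
local patt = lpegrex.compile([[
List   <== `[` (Number (`,` @Number)*)? @`]`
Number <-- {%d+} -> tonumber SKIP
SKIP   <-- %s*
]])

-- Well-formed input: the result is a table carrying tag, pos and endpos
-- fields plus the captured numbers in its array part.
local ast = patt:match('[1, 2, 3]')
print(ast.tag, ast[1], ast[2], ast[3])

-- Malformed input: match returns nil, an error label and a position.
local bad, errlabel, errpos = patt:match('[1, x]')
print(bad, errlabel, errpos)
```

For the malformed input the label should follow the Expected_name convention from the table above, since the failure happens at the @Number expectation.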
Folding auxiliary functions
Often we need to reduce a list of captured AST nodes into a single captured AST node (e.g. when reducing a call chain). Here we call this operation folding. The following table demonstrates the four ways to fold a list of nodes:
| Purpose | Example Input | Corresponding Output | Syntax |
|-|-|-|-|
| Fold tables to the left | {1}, {2}, {3} | {{{1}, 2}, 3} | patt ~> foldleft |
| Fold tables to the right | {1}, {2}, {3} | {1, {2, {3}}} | patt -> foldright |
| Fold tables to the left in reverse order | {1}, {2}, {3} | {{{3}, 2}, 1} | patt -> rfoldleft |
| Fold tables to the right in reverse order | {1}, {2}, {3} | {3, {2, {1}}} | patt ~> rfoldright |
Where the pattern patt captures a list of tables with at least one capture.
Note that depending on the fold operation you must use its correct arrow (-> or ~>).
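The difference between the two arrows can be sketched in plain Lua. With ~> the function is applied pairwise to an accumulator and the next capture, while with -> it receives all captures at once. The functions below are stand-ins written for this illustration (the names fold_left, fold_right and so on are not LPegRex's own), and they only reproduce the input/output pairs from the table.

```lua
-- Pairwise shape, as called by a fold capture (~>): f(accumulator, next).
local function fold_left(lhs, rhs)
  table.insert(rhs, 1, lhs)   -- nest the accumulator as first child of next
  return rhs
end

local function rfold_right(rhs, lhs)
  lhs[#lhs + 1] = rhs         -- nest the accumulator as last child of next
  return lhs
end

-- Variadic shape, as called by a function capture (->): f(all captures...).
local function fold_right(first, ...)
  local lhs = first
  for i = 1, select('#', ...) do
    local rhs = select(i, ...)
    lhs[#lhs + 1] = rhs       -- each table becomes last child of the previous
    lhs = rhs
  end
  return first
end

local function rfold_left(...)
  local caps = {...}
  local acc = caps[#caps]
  for i = #caps - 1, 1, -1 do
    table.insert(caps[i], 1, acc)  -- walk backwards, nesting as first child
    acc = caps[i]
  end
  return acc
end

-- Helper that mimics how ~> feeds captures to a pairwise function.
local function fold_pairwise(f, first, ...)
  local acc = first
  for i = 1, select('#', ...) do
    acc = f(acc, (select(i, ...)))
  end
  return acc
end

-- Helper that renders nested array tables for printing.
local function show(v)
  if type(v) ~= 'table' then return tostring(v) end
  local parts = {}
  for i, item in ipairs(v) do parts[i] = show(item) end
  return '{' .. table.concat(parts, ', ') .. '}'
end

print(show(fold_pairwise(fold_left, {1}, {2}, {3})))   --> {{{1}, 2}, 3}
print(show(fold_right({1}, {2}, {3})))                 --> {1, {2, {3}}}
print(show(rfold_left({1}, {2}, {3})))                 --> {{{3}, 2}, 1}
print(show(fold_pairwise(rfold_right, {1}, {2}, {3}))) --> {3, {2, {1}}}
```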
Capture auxiliary syntax
Sometimes it is useful to match an empty string and capture an arbitrary value. The following table shows auxiliary syntax to help with that:
| Syntax | Captured Lua Value |
|-|-|
| $nil | nil |
| $true | true |
| $false | false |
| $name | defs[name] |
| ${} | {} |
| $16 | 16 |
| $'string' | "string" |
| p~? | p captures if it matches, otherwise false |
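A small sketch of two of these forms working together, assuming the lpegrex module is installed. The Decl rule is a made-up example: it captures an optional const prefix with ~?, then a name, then a constant string produced by $'int' without consuming input.

```lua
local lpegrex = require 'lpegrex'

-- Hypothetical toy rule (not from the LPegRex sources). <-| collects the
-- captures of the rule into a table.
local patt = lpegrex.compile([[
Decl <-| {'const'}~? ' '* {%a+} $'int'
]])

local with_const = patt:match('const x')
local without_const = patt:match('x')

-- With the prefix present the first capture is the matched text; without
-- it, ~? is expected to yield false in that slot.
print(with_const[1], with_const[2], with_const[3])
print(without_const[1], without_const[2], without_const[3])
```

Capturing false instead of nothing keeps the later captures at fixed positions in the table, which tends to make the resulting nodes easier to consume.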
Capture auxiliary functions
Sometimes it is useful to substitute a list of captures by a Lua value. The following table shows auxiliary functions to help with that:
| Purpose | Syntax | Captured Value |
|-|-|-|
| Substitute captures by nil | p -> tonil | nil |
| Substitute captures by false | p -> tofalse | false |
| Substitute captures by true | p -> totrue | true |
| Substitute captures by {} | p -> toemptytable | {} |
| Substitute a capture by a number | p -> tonumber | Corresponding number of the captured string |
| Substitute a capture by a character byte | p -> tochar | Corresponding byte of the captured number |
| Substitute a capture by UTF-8 byte sequence | p -> toutf8char | Corresponding UTF-8 bytes of the captured number |
Captured node fields
By default, when capturing a node with the <== syntax, LPegRex will set the following 3 fields:
- tag: Name of the node (its type)
- pos: Initial position of the node match
- endpos: Final position of the node match (usually includes following SKIP)
The user can rename these fields or disable them by
setting the corresponding name in the defs.__options table when compiling the grammar,
for example:
local rex = require 'lpegrex'
-- mygrammar is a grammar string defined elsewhere
local mypatt = rex.compile(mygrammar, {__options = {
  tag = 'name',   -- 'tag' field renamed to 'name'
  pos = 'init',   -- 'pos' field renamed to 'init'
  endpos = false, -- don't capture node final position
}})
The fields pos and endpos are useful to generate error messages with precise location
when analyzing the AST and the tag field is used to distinguish the node type.
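To illustrate how pos can feed an error message during analysis, the sketch below converts a node position into a line and column using plain Lua. The helper name pos_to_linecol and the hand-built node are assumptions made for this example. LPegRex itself provides lpegrex.calcline for gathering line information.

```lua
-- Hypothetical helper (not part of LPegRex): convert a 1-based byte
-- position into a line number and a column number.
local function pos_to_linecol(source, pos)
  local line, linestart = 1, 1
  -- '()\n' captures the position of each newline character.
  for nlpos in source:gmatch('()\n') do
    if nlpos >= pos then break end
    line = line + 1
    linestart = nlpos + 1
  end
  return line, pos - linestart + 1
end

local source = 'a = 1\nb = oops\n'
-- Hand-built stand-in for a node captured with <== (default field names).
local node = {tag = 'Id', pos = 11, endpos = 15, 'oops'}

local line, col = pos_to_linecol(source, node.pos)
print(('%d:%d: unexpected %s node'):format(line, col, node.tag))
--> 2:5: unexpected Id node
```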
Captured node action
If defs.__options.tag is a function, it is called on node creation and the user is responsible for
setting the tag field and returning the node. This flexibility exists in case
specific actions need to be executed on node creation, for example:
local mypatt = rex.compile(mygrammar, {__options = {
  tag = function(tag, node)
    print('new node', tag)
    node.tag = tag
    return node
  end
}})
Note that when this function is called the node children may be incomplete in case the node is being folded.
Matching keywords and tokens
When using the backtick syntax (e.g. `something`),
LPegRex registers its contents as a keyword if it begins with a letter (or _),
or as a token if it contains only punctuation characters (except _).
Both keywords and tokens always match the SKIP rule immediately afterwards to
skip spaces, so the SKIP rule must always be defined when using the backtick syntax.
Token matches are always unique when tokens share common characters, that is,
if both the . and .. tokens are defined, the rule `.` will match
a single . but not the .. token.
If a token is found, the rule TOKEN is automatically generated;
this rule matches any token plus SKIP.
If a keyword is found,
the rule NAME_SUFFIX also needs to be defined; it is used
to differentiate keywords from identifier names.
In most cases the user will need to define something like:
NAME_SUFFIX <- [_%w]+
SKIP <- %s*
You may want to edit the SKIP rule to consider comments if your grammar supports them.
Tokens and keywords do not capture the SKIP rule when using the syntax {`keyword`}.
Capturing identifier names
Often we need to create a rule that captures identifier names while ignoring grammar keywords. Let us call this rule NAME.
To assist with this, the KEYWORD rule is automatically generated from all keywords defined in
the grammar. The user can then use it to define the NAME rule, in most cases something like:
NAME <-- !KEYWORD {NAME_PREFIX NAME_SUFFIX?} SKIP
NAME_PREFIX <-- [_%a]
NAME_SUFFIX <-- [_%w]+
SKIP <- %s*
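The sketch below puts the NAME rule to work in a tiny grammar, assuming the lpegrex module is installed. The Local rule is a made-up example that expects the keyword local followed by an identifier. SKIP is written as %s* here so that it can also match nothing, for instance at the end of the input.

```lua
local lpegrex = require 'lpegrex'

-- Hypothetical toy grammar (not from the LPegRex sources). The backtick
-- keyword `local` causes the KEYWORD rule to be generated, which NAME then
-- uses to reject keywords as identifiers.
local patt = lpegrex.compile([[
Local       <== `local` @NAME
NAME        <-- !KEYWORD {NAME_PREFIX NAME_SUFFIX?} SKIP
NAME_PREFIX <-- [_%a]
NAME_SUFFIX <-- [_%w]+
SKIP        <-- %s*
]])

-- An ordinary identifier is captured as the node's first child.
local node = patt:match('local answer')
print(node.tag, node[1])

-- An identifier that merely starts with a keyword is still a valid name.
local node2 = patt:match('local locals')
print(node2.tag, node2[1])

-- A keyword in name position fails the @NAME expectation.
local failed, errlabel = patt:match('local local')
print(failed, errlabel)
```

The NAME_SUFFIX rule does double duty here: it forms the tail of identifiers and it is also what stops the keyword local from matching the start of locals.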
Handling syntax errors
Any rule name, keyword, token or string pattern can be preceded by the token @,
marking it as an expected match. If the match is not fulfilled, an error
label is thrown using the name Expected_name, where name is the
token, keyword or rule name.
Once an error label is found, the user can generate pretty syntax error
messages using the function lpegrex.calcline to gather line information,
for example:
local patt = lpegrex.compile(PEG)
local ast, errlabel, errpos = patt:match(source)
if not ast then
l
