EBNF

[EBNF][] parser and generic parser generator.

Description

This is a [Ruby][] implementation of an [EBNF][] and [BNF][] parser and parser generator.

[PEG][] / [Packrat][] Parser

In the primary mode, it supports a Parsing Expression Grammar ([PEG][]) parser generator. This performs more minimal transformations on the parsed grammar to extract sub-productions, which allows each component of a rule to generate its own parsing event.

The resulting {EBNF::PEG::Rule} objects then parse each associated rule according to the operator semantics and use a [Packrat][] memoizer to reduce extra work when backtracking.

These rules are driven using the {EBNF::PEG::Parser} module which calls invokes the starting rule and ensures that all input is consumed.

LL(1) Parser

In another mode, it parses [EBNF][] grammars to [BNF][], generates [First/Follow][] and Branch tables for [LL(1)][] grammars, which can be used with the stream [Tokenizer][] and [LL(1) Parser][].

As LL(1) grammars operate using alt and seq primitives, allowing for a match on alternative productions or a sequence of productions, generating a parser requires turning the [EBNF][] rules into [BNF][]:

Transform a ::= b? into a ::= _empty | b
Transform a ::= b+ into a ::= b b*
Transform a ::= b* into a ::= _empty | (b a)
Transform a ::= op1 (op2) into two rules:
```
  a     ::= op1 _a_1
  _a_1_ ::= op2
```

Of note in this implementation is that the tokenizer and parser are streaming, so that they can process inputs of arbitrary size.

The exception operator (A - B) is only supported on terminals.

See {EBNF::LL1} and {EBNF::LL1::Parser} for further information.

Usage

Parsing an EBNF Grammar

require 'ebnf'

grammar = EBNF.parse(File.open('./etc/ebnf.ebnf'))

Output rules and terminals as [S-Expressions][S-Expression], [Turtle][], HTML or [BNF][]

puts grammar.to_sxp
puts grammar.to_ttl
puts grammar.to_html
puts grammar.to_s

Transform [EBNF][] to [PEG][] (generates sub-rules for embedded expressions) and the RULES table as Ruby for parsing grammars:

grammar.make_peg
grammar.to_ruby

Transform [EBNF][] to [BNF][] (generates sub-rules using alt or seq from plus, star or opt)

grammar.make_bnf

Generate [First/Follow][] rules for BNF grammars (using "ebnf" as the starting production):

grammar.first_follow(:ebnf)

Generate Terminal, [First/Follow][], Cleanup and Branch tables as Ruby for parsing grammars:

grammar.build_tables
grammar.to_ruby

Generate formatted grammar using HTML (requires [Haml][Haml] gem):

grammar.to_html

Parsing an ISO/IEC 14977 Grammar

The EBNF gem can also parse [ISO/IEC 14977][] Grammars (ISOEBNF) to [S-Expressions][S-Expression].

grammar = EBNF.parse(File.open('./etc/iso-ebnf.isoebnf'), format: :isoebnf)

Parsing an ABNF Grammar

The EBNF gem can also parse [ABNF] Grammars to [S-Expressions][S-Expression].

grammar = EBNF.parse(File.open('./etc/abnf.abnf'), format: :abnf)

Parser Debugging

Inevitably while implementing a parser for some specific grammar, a developer will need greater insight into the operation of the parser. While this can involve sorting through a tremendous amount of data, the parser can be provided a [Logger][] instance which will output messages at varying levels of detail to document the state of the parser at any given point. Most useful is likely the INFO level of debugging, but even more detail is revealed using the DEBUG level. WARN and ERROR statements will typically also be provided as part of an exception if parsing fails, but can be shown in the context of other parsing state with appropriate indentation as part of the logger.

Writing Grammars

The {EBNF::Writer} class can be used to write parsed grammars out, either as formatted text, or HTML. Because grammars are written from the Abstract Syntax Tree, represented as [S-Expressions][S-Expression], this provides a means of transforming between grammar formats (e.g., W3C [EBNF][] to [ABNF][]), although with some potential loss in semantic fidelity (case-insensitive string matching vs. case-sensitive matching).

The formatted HTML results are designed to be appropriate for including in specifications.

Parser Errors

On a parsing failure, an exception is raised with information that may be useful in determining the source of the error.

EBNF Grammar

The [EBNF][] variant used here is based on W3C [EBNF][] (see EBNF grammar) as defined in the XML 1.0 recommendation, with minor extensions:

Note that the grammar includes an optional [number] in front of rule names, which can be in conflict with the RANGE terminal. It is typically not a problem, but if it comes up, try parsing with the native parser, add comments or sequences to disambiguate. EBNF does not have beginning of line checks as all whitespace is treated the same, so the common practice of identifying each rule inherently leads to such ambiguity.

The character set for EBNF is UTF-8.

The general form of a rule is:

symbol ::= expression

which can also be proceeded by an optional number enclosed in square brackets to identify the rule number:

[1] symbol ::= expression

(Note, introduces an ambiguity if the previous rule ends in a range or enum and the current rule has no number. The parsers dynamically determine the terminal rules for the LHS (the identifier, symbol, and ::=) and RANGE).

Symbols are written in CAPITAL CASE if they are the start symbol of a regular language (terminals), otherwise with they are treated as non-terminal rules. Literal strings are quoted.

Within the expression on the right-hand side of a rule, the following expressions are used to match strings of one or more characters:

<table> <tr><td><code>#xN</code></td> <td>where <code>N</code> is a hexadecimal integer, the expression matches the character whose number (code point) in ISO/IEC 10646 is <code>N</code>. The number of leading zeros in the <code>#xN</code> form is insignificant.</td></tr> <tr><td><code>[a-zA-Z], [#xN-#xN]</code> <td>matches any Char or HEX with a value in the range(s) indicated (inclusive).</td></tr> <tr><td><code>[abc], [#xN#xN#xN]</code></td> <td>matches any UTF-8 R\_CHAR or HEX with a value among the characters enumerated. The last component may be '-'. Enumerations and ranges may be mixed in one set of brackets.</td></tr> <tr><td><code>[^a-z], [^#xN-#xN]</code></td> <td>matches any UTF-8 Char or HEX a value outside the range indicated.</td></tr> <tr><td><code>[^abc], [^#xN#xN#xN]</code></td> <td>matches any UTF-8 R\_CHAR or HEX with a value not among the characters given. The last component may be '-'. Enumerations and ranges of excluded values may be mixed in one set of brackets.</td></tr> <tr><td><code>"string"</code></td> <td>matches a literal string matching that given inside the double quotes case insensitively.</td></tr> <tr><td><code>'string'</code></td> <td>matches a literal string matching that given inside the single quotes.</td></tr> <tr><td><code>A (B | C)</code></td> <td><code>(B | C)</code> is treated as a unit and may be combined as described in this list.</td></tr> <tr><td><code>A?</code></td> <td>matches A or nothing; optional A.</td></tr> <tr><td><code>A B</code></td> <td>matches <code>A</code> followed by <code>B</code>. This operator has higher precedence than alternation; thus <code>A B | C D</code> is identical to <code>(A B) | (C D)</code>.</td></tr> <tr><td><code>A | B</code></td> <td>matches <code>A</code> or <code>B</code>.</td></tr> <tr><td><code>A - B</code></td> <td>matches any string that matches <code>A</code> but does not match <code>B</code>. (Only supported on Terminals in LL(1) BNF).</td></tr> <tr><td><code>A+</code></td> <td>matches one or more occurrences of <code>A</code>. Concatenation has higher precedence than alternation; thus <code>A+ | B+</code> is identical to <code>(A+) | (B+)</code>.</td></tr> <tr><td><code>A*</code></td> <td>matches zero or more occurrences of <code>A</code>. Concatenation has higher precedence than alternation; thus <code>A* | B*</code> is identical to <code>(A*) | (B*)</code>.</td></tr> <tr><td><code>@pass " "*</code></td> <td>Defines consumed whitespace in the document. Any whitespace found between non-terminal rules is consumed and ignored.</td></tr> <tr><td><code>@terminals</code></td> <td>Introduces terminal rules. All rules defined after this point are treated as terminals.</td></tr> </table>

Comments include // and # through end of line (other than hex character) and /* ... */ (* ... *) which may cross lines
All rules MAY start with an number, contained within square brackets. For example [1] rule, where the value within the brackets is a symbol ([a-z] | [A-Z] | [0-9] | "_" | ".")+, which is not retained after parsing
Symbols MAY be enclosed in angle brackets '<' and >, which are dropped when parsing.
@terminals causes following rules to be treated as terminals. Any terminal which is all upper-case (egTERMINAL), or any rules with expressions that match characters (#xN, [a-z], [^a-z], [abc], [^abc], "string", 'string', or A - B), are also treated as terminals.
@pass defines the expression used to detect whitespace, which is removed in pro

Ebnf

Install / Use

README