Parboiled2

A macro-based PEG parser generator for Scala 2.10+

Generate Convert Improve

Install / Use

/learn @sirthias/Parboiled2

About this skill

Quality Score

0/100

README

parboiled2 |--| A Macro-Based PEG Parser Generator for Scala 2.12+

.. contents:: Contents of this Document

Introduction

parboiled2 is a Scala 2.12+ library enabling lightweight and easy-to-use, yet powerful, fast and elegant parsing of arbitrary input text. It implements a macro-based parser generator for Parsing Expression Grammars_ (PEGs), which runs at compile time and translates a grammar rule definition (written in an internal Scala DSL) into corresponding JVM bytecode.

PEGs are an alternative to Context-Free Grammars_ (CFGs) for formally specifying syntax, they make a good replacement for regular expressions and have some advantages over the "traditional" way of building parsers via CFGs (like not needing a separate lexer/scanner phase).

parboiled2 is the successor of parboiled 1.x_ , which provides a similar capability (for Scala as well as Java) but does not actually generate a parser. Rather parboiled 1.x_ interprets a rule tree structure (which is also created via an internal DSL) against the input, which results in a much lower parsing performance. For more info on how parboiled 1.x_ and parboiled2 compare see parboiled2 vs. parboiled 1.x. You might also be interested in reading about parboiled2 vs. Scala Parser Combinators and parboiled2 vs. Regular Expressions_.

.. _PEG: .. _Parsing Expression Grammars: https://en.wikipedia.org/wiki/Parsing_expression_grammar .. _Context-Free Grammars: https://en.wikipedia.org/wiki/Context-free_grammar .. _parboiled 1.x: http://parboiled.org

Features

Concise, flexible and type-safe DSL for expressing parsing logic
Full expressive power of Parsing Expression Grammars_, for effectively dealing with most real-world parsing needs
Excellent reporting of parse errors
Parsing performance comparable to hand-written parsers
Easy to learn and use (just one parsing phase (no lexer code required), rather small API)
Light-weight enough to serve as a replacement for regular expressions (also strictly more powerful than regexes)

Installation

The artifacts for parboiled2 live on Maven Central_ and can be tied into your SBT-based Scala project like this:

.. code:: Scala

libraryDependencies += "org.parboiled" %% "parboiled" % "2.5.1"

The latest released version is 2.5.1. It is available for Scala 2.12, 2.13 and 3, Scala.js and Scala Native.

parboiled2 has no external dependencies. (It used to depend on shapeless_, but the few bits it was using have been internalized at some point).

Once on your classpath you can use this single import to bring everything you need into scope:

.. code:: Scala

import org.parboiled2._

.. _Maven Central: https://search.maven.org/search?q=g:org.parboiled .. _shapeless: https://github.com/milessabin/shapeless

Example

This is what a simple parboiled2 parser looks like:

.. code:: Scala

import org.parboiled2._

class Calculator(val input: ParserInput) extends Parser {
  def InputLine = rule { Expression ~ EOI }

  def Expression: Rule1[Int] = rule {
    Term ~ zeroOrMore(
      '+' ~ Term ~> ((_: Int) + _)
    | '-' ~ Term ~> ((_: Int) - _))
  }

  def Term = rule {
    Factor ~ zeroOrMore(
      '*' ~ Factor ~> ((_: Int) * _)
    | '/' ~ Factor ~> ((_: Int) / _))
  }

  def Factor = rule { Number | Parens }

  def Parens = rule { '(' ~ Expression ~ ')' }

  def Number = rule { capture(Digits) ~> (_.toInt) }

  def Digits = rule { oneOrMore(CharPredicate.Digit) }
}

new Calculator("1+1").InputLine.run() // evaluates to `scala.util.Success(2)`

This implements a parser for simple integer expressions like 1+(2-3*4)/5 and runs the actual calculation in-phase with the parser. If you'd like to see it run and try it out yourself check out Running the Examples_.

Quick Start

A parboiled2 parser is a class deriving from org.parboiled2.Parser, which defines one abstract member:

.. code:: Scala

def input: ParserInput

holding the input for the parsing run. Usually it is best implemented as a val parameter in the constructor (as shown in the Example_ above). As you can see from this design you need to (re-)create a new parser instance for every parsing run (parser instances are very lightweight).

The "productions" (or "rules") of your grammar are then defined as simple methods, which in most cases consist of a single call to the rule macro whose argument is a DSL expression_ defining what input the rule is to match and what actions_ to perform.

In order to run your parser against a given input you create a new instance and call run() on the top-level rule, e.g:

.. code:: Scala

val parser = new MyParser(input)
parser.topLevelRule.run() // by default returns a ``scala.util.Try``

For more info on what options you have with regard to accessing the results of a parsing run check out the section on Access to Parser Results_.

.. DSL expression: The Rule DSL .. actions: Parser Actions

How the Parser matches Input

PEG_ parsers are quite easy to understand as they work just like most people without a lot of background in parsing theory would build a parser "by hand": recursive-descent with backtracking. They have only one parsing phase (not two, like most parsers produced by traditional parser generators like ANTLR_), do not require any look-ahead and perform quite well in most real-world scenarios (although they can exhibit exponential runtime for certain pathological languages and inputs).

A PEG_ parser consists of a number of rules that logically form a "tree", with one "root" rule at the top calling zero or more lower-level rules, which can each call other rules and so on. Since rules can also call themselves or any of their parents the rule "tree" is not really a tree but rather a potentially cyclic directed graph, but in most cases the tree structure dominates, which is why its useful to think of it as a tree with potential cycles.

When a rule is executed against the current position in an input buffer it applies its specific matching logic to the input, which can either succeed or fail. In the success case the parser advances the input position (the cursor) and potentially executes the next rule. Otherwise, when the rule fails, the cursor is reset and the parser backtracks in search of another parsing alternative that might succeed.

For example consider this simple parboiled2 rule:

.. code:: Scala

def foo = rule { 'a' ~ ('b' ~ 'c' | 'b' ~ 'd') }

When this rule is confronted with the input abd the parser matches the input in these steps:

Rule foo starts executing, which calls its first sub-rule 'a'. The cursor is at position 0.
Rule 'a' is executed against input position 0, matches (succeeds) and the cursor is advanced to position 1.
Rule 'b' ~ 'c' | 'b' ~ 'd' starts executing, which calls its first sub-rule 'b' ~ 'c'.
Rule 'b' ~ 'c' starts executing, which calls its first sub-rule 'b'.
Rule 'b' is executed against input position 1, matches (succeeds) and the cursor is advanced to position 2.
Rule 'c' is executed against input position 2 and mismatches (fails).
Rule 'b' ~ 'c' | 'b' ~ 'd' notices that its first sub-rule has failed, resets the cursor to position 1 and calls its 2nd sub-rule 'b' ~ 'd'.
Rule 'b' ~ 'd' starts executing, which calls its first sub-rule 'b'.
Rule 'b' is executed against input position 1, matches and the cursor is advanced to position 2.
Rule 'd' is executed against input position 2, matches and the cursor is advanced to position 3.
Rule 'b' ~ 'd' completes successfully, as its last sub-rule has succeeded.
Rule 'b' ~ 'c' | 'b' ~ 'd' completes successfully, as one of its sub-rules has succeeded.
Rule foo completes execution successfully, as its last sub-rule has succeeded. The whole input "abd" was matched and the cursor is left at position 3 (after the last-matched character).

.. _ANTLR: https://www.antlr.org/

The Rule DSL

In order to work with parboiled2 effectively you should understand the core concepts behind its rule DSL, mainly the "Value Stack" and how parboiled2 encodes value stack operations in the Scala type system.

Rule Types and the Value Stack

Apart from the input buffer and the cursor the parser manages another important structure: the "Value Stack". The value stack is a simple stack construct that serves as temporary storage for your Parser Actions. In many cases it is used for constructing an AST during the parsing run but it can also be used for "in-phase" computations (like in the Example_ above) or for any other purpose.

When a rule of a parboiled2 parser executes it performs any combination of the following three things:

match input, i.e. advance the input cursor
operate on the value stack, i.e. pop values off and/or push values to the value stack
perform side-effects

Matching input is done by calling Basic Character Matching_ rules, which do nothing but match input and advance the cursor. Value stack operations (and other potential side-effects) are performed by Parser Actions_.

It is important to understand that rules in parboiled2 (i.e. the rule methods in your parser class) do not directly return some custom value as a method result. Instead, all their consuming and producing values happens as side-effects to the value stack. Thereby the way that a rule interacts with value stack is encoded in the rule's type.

This is the general definition of a parboiled2 rule:

.. code:: Scala

class Rule[-I <: HList, +O <: HList]

This can look scary at first but is really quite simple. An HList is defined by shapeless_ and is essentially a type of list whose element number and element types are statical

Related Skills

node-connect

346.8k

Diagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps

frontend-design

107.6k

Create distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.

openai-whisper-api

346.8k

Transcribe audio via OpenAI Audio Transcriptions API (Whisper).

qqbot-media

346.8k

QQBot 富媒体收发能力。使用 <qqmedia> 标签，系统根据文件扩展名自动识别类型（图片/语音/视频/文件）。