Whittle: A little LALR(1) Parser in Pure Ruby — Not a Generator

Whittle is a LALR(1) parser. It's very small, easy to understand, and what's most important, it's 100% ruby. You write parsers by specifying sequences of allowable rules (which refer to other rules, or even to themselves). For each rule in your grammar, you provide a block that is invoked when the grammar is recognized.

TL;DR (Skip to 'Summary & FAQ')

If you're not familiar with parsing, you should find Whittle to be a very friendly little parser.

It is related, somewhat, to yacc and bison, which belong to the class of parsers known as LALR(1): Left-Right, using 1 Lookahead token. This class of parsers is both easy to work with and particularly powerful (ruby itself is parsed using a LALR(1) parser). Since the algorithm is based around a theory that never has to backtrack (that is, each token read takes the parse forward, with just a single lookup in a parse table), parse time is also fast. Parse time is governed by the size of the input, not by the size of the grammar.

Whittle provides meaningful error reporting (line number, expected tokens, received token) and even lets you hook into the error handling logic if you need to write some sort of crazy madman-forgiving parser.

If you've had issues with other parsers hitting "stack level too deep" errors, you should find that Whittle does not suffer from the same issues, since it uses a state-switching algorithm (a pushdown automaton to be precise), rather than simply having one parse function call another and so on. Whittle also supports the following concepts:

Left/right recursion
Left/right associativity
Operator precedences
Skipping of silent tokens in the input (e.g. whitespace/comments)

Installation

Via rubygems:

gem install whittle

Or in your Gemfile, if you're using bundler:

gem 'whittle'

The Basics

Parsers using Whittle do not generate ruby code from a grammar file. This may strike users of other LALR(1) parsers as odd, but c'mon, we're using Ruby, right?

I'll avoid discussing the algorithm until we get into the really advanced stuff, but you will need to understand a few fundamental ideas before we begin.

There are two types of rule that make up a complete parser: terminal, and nonterminal - A terminal rule is quite simply a chunk of the input string, like '42', or 'function' - A nonterminal rule is a rule that makes reference to other rules (both terminal and nonterminal)
The input to be parsed always conforms to just one rule at the topmost level. This is known as the "start rule" and describes the structure of the program as a whole.

The easiest way to understand how the parser works is just to learn by example, so let's see an example.

require 'whittle'

class Mathematician < Whittle::Parser
  rule("+")

  rule(:int => /[0-9]+/).as { |num| Integer(num) }

  rule(:expr) do |r|
    r[:int, "+", :int].as { |a, _, b| a + b }
  end

  start(:expr)
end

mathematician = Mathematician.new
mathematician.parse("1+2")
# => 3

Let's break this down a bit. As you can see, the whole thing is really just rule used in different ways. We also have to set the start rule that we can use to describe an entire program, which in this case is the :expr rule that can add two numbers together.

There are two terminal rules ("+" and :int) and one nonterminal (:expr) in the above grammar. Each rule can have a block attached to it. The block is invoked with the result evaluating each of its inputs via their own blocks (in a depth-first manner). The default action if no block is given, is to return whatever the leftmost input to the rule happens to be. We use #as to provide an action that actually does something meaningful with the inputs.

We can optionally use the Hash notation to map a name with a pattern (or a fixed string) when we declare terminal rules too, as we have done with the :int rule above.

As the input string is parsed, it must match the start rule :expr.

Let's step through the parse for the above input "1+2".

When the parser starts, it looks at the start rule :expr and decides what tokens would be valid if they were encountered.
Since :expr starts with :int, the only thing that would be valid is anything matching /[0-9]+/.
When the parser reads the "1", it recognizes it as an :int, evaluates its block (thus casting it to an Integer), and moves it aside (puts it on the stack, to be precise).
Now it advances through the rule for :expr and decides the only valid input would be a "+"
Upon reading the "+", the rule for "+" is invoked (which does nothing) and the "+" is put on the stack, along with the :int we already have.
Now the parser's only valid input is another :int, which it gets from the "2", casting it to an Integer according to its block, and putting it on the stack.
Finally, upon having read the sequence :int, "+", :int, our block attached to that particular rule is invoked to return a result by summing the 1 and the 2 to make 3. Magic!

This was a simple parse. At each point there was only one valid input. As we'll see, parses can be arbitrarily complex, without increasing the amount of work needed to process the input string.

Nonterminal rules can have more than one valid sequence

Our mathematician class above is not much of a mathematician. It can only add numbers together. Surely subtraction, division and multiplication should be possible too?

It turns out that this is really simple to do. Just add multiple possibilities to the same rule.

require 'whittle'

class Mathematician < Whittle::Parser
  rule("+")
  rule("-")
  rule("*")
  rule("/")

  rule(:int => /[0-9]+/).as { |num| Integer(num) }

  rule(:expr) do |r|
    r[:int, "+", :int].as { |a, _, b| a + b }
    r[:int, "-", :int].as { |a, _, b| a - b }
    r[:int, "*", :int].as { |a, _, b| a * b }
    r[:int, "/", :int].as { |a, _, b| a / b }
  end

  start(:expr)
end

mathematician = Mathematician.new

mathematician.parse("1+2")
# => 3

mathematician.parse("1-2")
# => -1

mathematician.parse("2*3")
# => 6

mathematician.parse("4/2")
# => 2

Now you're probably beginning to see how matching just one rule for the entire input is not a problem. To think about a more real world example, you can describe most programming languages as a series of statements and constructs.

Rules can refer to themselves

But our mathematician is still not very bright. It can only work with two operands. What about more complex expressions?

require 'whittle'

class Mathematician < Whittle::Parser
  rule("+")
  rule("-")
  rule("*")
  rule("/")

  rule(:int => /[0-9]+/).as { |num| Integer(num) }

  rule(:expr) do |r|
    r[:expr, "+", :expr].as { |a, _, b| a + b }
    r[:expr, "-", :expr].as { |a, _, b| a - b }
    r[:expr, "*", :expr].as { |a, _, b| a * b }
    r[:expr, "/", :expr].as { |a, _, b| a / b }
    r[:int]
  end

  start(:expr)
end

mathematician = Mathematician.new
mathematician.parse("1+5-2")
# => 4

Adding a rule of just :int to the :expr rule means that any integer is also a valid :expr. It is now possible to say that any :expr can be added to, multiplied by, divided by or subtracted from another :expr. It is this ability to self-reference that makes LALR(1) parsers so powerful and easy to use. Note that because the result each input to any given rule is computed before being passed as arguments to the block, each :expr in the calculations above will always be a number, since each :expr returns a number. The recursion in these rules is practically limitless. You can write "1+2-3*4+775/3" and it's still an :expr.

Specifying the associativity

If we poke around for more than a few seconds, we'll soon realize that our mathematician makes some silly mistakes. Let's see what happens when we do the following:

mathematician.parse("6-3-1")
# => 4

Oops. That's not correct. Shouldn't the answer be 2?

Our grammar is ambiguous. The input string could be interpreted as either:

6-(3-1)

Or as:

(6-3)-1

Basic arithmetic takes the latter approach, but the parser's default approach is to go the other way. We refer to these two alternatives as being left associative (the second example) and right associative (the first example). By default, operators are right associative, which means as much input will be read as possible before beginning to compute a result.

We can correct this by tagging our operators as left associative.

require 'whittle'

class Mathematician < Whittle::Parser
  rule("+") % :left
  rule("-") % :left
  rule("*") % :left
  rule("/") % :left

  rule(:int => /[0-9]+/).as { |num| Integer(num) }

  rule(:expr) do |r|
    r[:expr, "+", :expr].as { |a, _, b| a + b }
    r[:expr, "-", :expr].as { |a, _, b| a - b }
    r[:expr, "*", :expr].as { |a, _, b| a * b }
    r[:expr, "/", :expr].as { |a, _, b| a / b }
    r[:int]
  end

  start(:expr)
end

mathematician = Mathematician.new
mathematician.parse("6-3-1")
# => 2

Attaching a percent sign followed by either :left or :right changes the associativity of a terminal rule. We now get the correct result.

Specifying the operator precedence

Basic arithmetic is easy peasy, right? Well, despite fixing the associativity, we find we still have a problem:

mathematician.parse("1+2*3")
# => 9

Hmm. The expression has been interpreted as (1+2)*3. It turns out arithmetic is not as simple as one might think ;) The parser does not (yet) know that the multiplication operator has a higher precedence than the addition operator. We need to indicate this in the grammar.

require 'whittle'

class Mathematician < Whittle::Parser
  rule("+") % :left ^ 1
  rule("-") % :left ^ 1
  rule("*") % :left ^ 2
  rule("/") % :left ^ 2

  rule(:int => /[0-9]+/).as { |num|

Whittle

Install / Use

README