PEGKit

PEGKit is a 'Parsing Expression Grammar' toolkit for iOS and OS X written by Todd Ditchendorf in Objective-C and released under the MIT Open Source License.

Always use the Xcode Workspace PEGKit.xcworkspace, NOT the Xcode Project.

This project includes TDTemplateEngine as a Git Submodule. So proper cloning of this project requires the --recursive argument:

git clone --recursive git@github.com:itod/pegkit.git

PEGKit is heavily influenced by ANTLR by Terence Parr and "Building Parsers with Java" by Steven John Metsker.

The PEGKit Framework offers 2 basic services of general interest to Cocoa developers:

String Tokenization via the Objective-C PKTokenizer and PKToken classes.
Objective-C parser generation via grammars - Generate source code for an Objective-C parser class from simple, intuitive, and powerful BNF-style grammars (similar to yacc or ANTLR). While parsing, the generated parser will provide callbacks to your Objective-C delegate.

The PEGKit source code is available on Github.

A tutorial for using PEGKit in your iOS applications is available on GitHub.

History

PEGKit is a re-write of an earlier framework by the same author called ParseKit. ParseKit should generally be considered deprecated, and PEGKit should probably be used for all future development.

ParseKit produces dynamic, non-deterministic parsers at runtime. The parsers produced by ParseKit exhibit poor (exponential) performance characteristics -- although they have some interesting properties which are useful in very rare circumstances.
PEGKit produces static ObjC source code for deterministic (PEG) memoizing parsers at design time which you can then compile into your project. The parsers produced by PEGKit exhibit good (linear) performance characteristics.

Tokenization

Basic Usage of PKTokenizer

PEGKit provides general-purpose string tokenization services through the PKTokenizer and PKToken classes. Cocoa developers will be familiar with the NSScanner class provided by the Foundation Framework which provides a similar service. However, the PKTokenizer class is much easier to use for many common tokenization tasks, and offers powerful configuration options if the default tokenization behavior doesn't match your needs.

<table border="1" cellpadding="5" cellspacing="0" width="100%"> <tr> <th><tt>PKTokenizer</tt></th> </tr> <tr> <td> <tt>+ (id)tokenizerWithString:(NSString *)s;</tt> <tt>- (PKToken *)nextToken;</tt> <tt>…</tt> </td> </tr> </table>

To use PKTokenizer, provide it with an NSString object and retrieve a series of PKToken objects as you repeatedly call the -nextToken method. The EOFToken singleton signals the end.

NSString *s = @"2 != -47. /* comment */ Blast-off!! 'Woo-hoo!' //comment";

PKTokenizer *t = [PKTokenizer tokenizerWithString:s];
PKToken *eof = [PKToken EOFToken];
PKToken *tok = nil;

while (eof != (tok = [t nextToken])) {
    NSLog(@"(%@) (%.1f) : %@", tok.stringValue, tok.floatValue, [tok debugDescription]);
}

Outputs:

(2) (2.0) : <Number «2»>

(!=) (0.0) : <Symbol «!=»>

(-47) (-47.0) : <Number «-47»>

(.) (0.0) : <Symbol «.»>

(Blast-off) (0.0) : <Word «Blast-off»>

(!) (0.0) : <Symbol «!»>

(!) (0.0) : <Symbol «!»>

('Woo-hoo!') (0.0) : <Quoted String «'Woo-hoo!'»>

Each PKToken object returned has a stringValue, a floatValue and a tokenType. The tokenType is and enum value type called PKTokenType with possible values of:

PKTokenTypeWord
PKTokenTypeNumber
PKTokenTypeQuotedString
PKTokenTypeSymbol
PKTokenTypeWhitespace
PKTokenTypeComment
PKTokenTypeDelimitedString

PKTokens also have corresponding BOOL properties for convenience (isWord, isNumber, etc.)

<table border="1" cellpadding="5" cellspacing="0" width="100%"> <tr> <th><tt>PKToken</tt></th> </tr> <tr> <td> <tt>+ (PKToken *)EOFToken;</tt> <tt>@property (readonly) PKTokenType tokenType;</tt> <tt>@property (readonly) CGFloat floatValue;</tt> <tt>@property (readonly, copy) NSString *stringValue;</tt> <tt>@property (readonly) BOOL isNumber;</tt> <tt>@property (readonly) BOOL isSymbol;</tt> <tt>@property (readonly) BOOL isWord;</tt> <tt>@property (readonly) BOOL isQuotedString;</tt> <tt>@property (readonly) BOOL isWhitespace;</tt> <tt>@property (readonly) BOOL isComment;</tt> <tt>@property (readonly) BOOL isDelimitedString;</tt> <tt>…</tt> </td> </tr> </table>

Default Behavior of PKTokenizer

The default behavior of PKTokenizer is correct for most common situations and will fit many tokenization needs without additional configuration.

Number

Sequences of digits («2» «42» «1054») are recognized as Number tokens. Floating point numbers containing a dot («3.14») are recognized as single Number tokens as you'd expect (rather than two Number tokens separated by a «.» Symbol token). By default, PKTokenizer will recognize a «-» symbol followed immediately by digits («-47») as a number token with a negative value. However, «+» characters are always seen as the beginning of a Symbol token by default, even when followed immediately by digits, so "explicitly-positive" Number tokens are not recognized by default (this behavior can be configured, see below).

Symbol

Most symbol characters («.» «!») are recognized as single-character Symbol tokens (even when sequential such as «!»``«!»). However, notice that PKTokenizer recognizes common multi-character symbols («!=») as a single Symbol token by default. In fact, PKTokenizer can be configured to recognize any given string as a multi-character symbol. Alternatively, it can be configured to always recognize each symbol character as an individual Symbol token (no multi- character symbols). The default multi-character symbols recognized by PKTokenizer are: «<=», «>=», «!=», «==».

Word

«Blast-off» is recognized as a single Word token despite containing a symbol character («-») that would normally signal the start of a new Symbol token. By default, PKTokenzier allows Word tokens to contain (but not start with) several symbol and number characters: «-», «_», «'», «0»-«9». The consequence of this behavior is that PKTokenizer will recognize the following strings as individual Word tokens by default: «it's», «first_name», «sat-yr-9» «Rodham-Clinton». Again, you can configure PKTokenizer to alter this default behavior.

Quoted String

PKTokenizer produces Quoted String tokens for substrings enclosed in quote delimiter characters. The default delimiters are single- or double-quotes («'» or «"»). The quote delimiter characters may be changed (see below), but must be a single character. Note that the stringValue of Quoted String tokens include the quote delimiter characters («'Woo-hoo!'»).

Whitespace

By default, whitespace characters are silently consumed by PKTokenizer, and Whitespace tokens are never emitted. However, you can configure which characters are considered whitespace characters or even ask PKTokenizer to return Whitespace tokens containing the literal whitespace stringValues by setting: t.whitespaceState.reportsWhitespaceTokens = YES.

Comment

By default, PKTokenizer recognizes C-style («//») and C++-style («/*» «*/») comments and silently removes the associated comments from the output rather than producing Comment tokens. See below for steps to either change comment delimiting markers, report Comment tokens, or to turn off comments recognition altogether.

Delimited String

The Delimited String token type is a powerful feature of PEGKit which can be used much like a regular expression. Use the Delimited String token type to ask PKTokenizer to recognize tokens with arbitrary start and end symbol strings much like a Quoted String but with more power:

The start and end symbols may be multi-char (e.g. «<#» «#>»)
The start and end symbols need not match (e.g. «<?=» «?>»)
The characters allowed within the delimited string may be specified using an NSCharacterSet

Customizing PKTokenizer behavior

There are two basic types of decisions PKTokenizer must make when tokenizing strings:

Which token type should be created for a given start character?
Which characters are allowed within the curren

Pegkit

Install / Use

README