PEGKit
PEGKit is a 'Parsing Expression Grammar' toolkit for iOS and OS X written by Todd Ditchendorf in Objective-C and released under the MIT Open Source License.
Always use the Xcode Workspace PEGKit.xcworkspace, NOT the Xcode Project.
This project includes TDTemplateEngine as a Git submodule, so cloning it properly requires the --recursive argument:
git clone --recursive git@github.com:itod/pegkit.git
PEGKit is heavily influenced by ANTLR by Terence Parr and "Building Parsers with Java" by Steven John Metsker.
The PEGKit Framework offers 2 basic services of general interest to Cocoa developers:
- String Tokenization via the Objective-C PKTokenizer and PKToken classes.
- Objective-C parser generation via grammars: generate source code for an Objective-C parser class from simple, intuitive, and powerful BNF-style grammars (similar to yacc or ANTLR). While parsing, the generated parser will provide callbacks to your Objective-C delegate.
The PEGKit source code is available on GitHub.
A tutorial for using PEGKit in your iOS applications is available on GitHub.
History
PEGKit is a re-write of an earlier framework by the same author called ParseKit. ParseKit should generally be considered deprecated, and PEGKit should probably be used for all future development.
- ParseKit produces dynamic, non-deterministic parsers at runtime. The parsers produced by ParseKit exhibit poor (exponential) performance characteristics -- although they have some interesting properties which are useful in very rare circumstances.
- PEGKit produces static ObjC source code for deterministic (PEG) memoizing parsers at design time which you can then compile into your project. The parsers produced by PEGKit exhibit good (linear) performance characteristics.
Documentation
<a name="tokenization"></a>
Tokenization
<a name="basic-tokenizer-usage"></a>
Basic Usage of PKTokenizer
PEGKit provides general-purpose string tokenization services through the PKTokenizer and PKToken classes. Cocoa developers will be familiar with the NSScanner class provided by the Foundation Framework which provides a similar service. However, the PKTokenizer class is much easier to use for many common tokenization tasks, and offers powerful configuration options if the default tokenization behavior doesn't match your needs.
To use PKTokenizer, provide it with an NSString object and retrieve a series of PKToken objects as you repeatedly call the -nextToken method. The EOFToken singleton signals the end.
NSString *s = @"2 != -47. /* comment */ Blast-off!! 'Woo-hoo!' //comment";
PKTokenizer *t = [PKTokenizer tokenizerWithString:s];
PKToken *eof = [PKToken EOFToken];
PKToken *tok = nil;
while (eof != (tok = [t nextToken])) {
    NSLog(@"(%@) (%.1f) : %@", tok.stringValue, tok.floatValue, [tok debugDescription]);
}
Outputs:
(2) (2.0) : <Number «2»>
(!=) (0.0) : <Symbol «!=»>
(-47) (-47.0) : <Number «-47»>
(.) (0.0) : <Symbol «.»>
(Blast-off) (0.0) : <Word «Blast-off»>
(!) (0.0) : <Symbol «!»>
(!) (0.0) : <Symbol «!»>
('Woo-hoo!') (0.0) : <Quoted String «'Woo-hoo!'»>
Each PKToken object returned has a stringValue, a floatValue and a tokenType. The tokenType is an enum value of type PKTokenType, with possible values of:
- PKTokenTypeWord
- PKTokenTypeNumber
- PKTokenTypeQuotedString
- PKTokenTypeSymbol
- PKTokenTypeWhitespace
- PKTokenTypeComment
- PKTokenTypeDelimitedString
PKToken objects also have corresponding BOOL properties for convenience (isWord, isNumber, etc.).
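As a minimal sketch of the convenience properties, the helper below collects the stringValue of every Word token in a string. `PKDemoWordsInString` is a hypothetical name used only for illustration, and the `<PEGKit/PEGKit.h>` import is an assumption about how the framework is linked into your project.

```objc
#import <Foundation/Foundation.h>
#import <PEGKit/PEGKit.h> // assumption: umbrella header of the linked framework

// Hypothetical helper (not part of PEGKit): returns the stringValue
// of every Word token found in `s`.
static NSArray *PKDemoWordsInString(NSString *s) {
    NSMutableArray *words = [NSMutableArray array];
    PKTokenizer *t = [PKTokenizer tokenizerWithString:s];
    PKToken *eof = [PKToken EOFToken];
    PKToken *tok = nil;
    while (eof != (tok = [t nextToken])) {
        // isWord is the BOOL counterpart of tokenType == PKTokenTypeWord
        if (tok.isWord) {
            [words addObject:tok.stringValue];
        }
    }
    return words;
}

int main(void) {
    @autoreleasepool {
        NSLog(@"%@", PKDemoWordsInString(@"2 != -47. Blast-off!! 'Woo-hoo!'"));
    }
    return 0;
}
```

Filtering on the BOOL properties keeps call sites short; switch on tok.tokenType instead when several token types need distinct handling.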
<a name="default-tokenizer-behavior"></a>
Default Behavior of PKTokenizer
The default behavior of PKTokenizer is correct for most common situations and will fit many tokenization needs without additional configuration.
Number
Sequences of digits («2» «42» «1054») are recognized as Number tokens. Floating point numbers containing a dot («3.14») are recognized as single Number tokens as you'd expect (rather than two Number tokens separated by a «.» Symbol token). By default, PKTokenizer will recognize a «-» symbol followed immediately by digits («-47») as a number token with a negative value. However, «+» characters are always seen as the beginning of a Symbol token by default, even when followed immediately by digits, so "explicitly-positive" Number tokens are not recognized by default (this behavior can be configured, see below).
Symbol
Most symbol characters («.» «!») are recognized as single-character Symbol tokens (even when sequential such as «!»«!»). However, notice that PKTokenizer recognizes common multi-character symbols («!=») as a single Symbol token by default. In fact, PKTokenizer can be configured to recognize any given string as a multi-character symbol. Alternatively, it can be configured to always recognize each symbol character as an individual Symbol token (no multi-character symbols). The default multi-character symbols recognized by PKTokenizer are: «<=», «>=», «!=», «==».
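The sketch below shows one way the multi-character symbol configuration might look. It assumes the tokenizer's symbolState exposes -add: and -remove: as the predecessor ParseKit did, so check the PKSymbolState header in your checkout before relying on it. `PKDemoSymbolStrings` is a hypothetical helper name.

```objc
#import <Foundation/Foundation.h>
#import <PEGKit/PEGKit.h> // assumption: umbrella header of the linked framework

// Hypothetical helper (not part of PEGKit): tokenizes `s` with a custom
// set of multi-character symbols and returns each token's stringValue.
static NSArray *PKDemoSymbolStrings(NSString *s) {
    PKTokenizer *t = [PKTokenizer tokenizerWithString:s];

    // Assumed API: treat «:=» as one Symbol token.
    [t.symbolState add:@":="];
    // Assumed API: stop treating «!=» as one Symbol token.
    [t.symbolState remove:@"!="];

    NSMutableArray *out = [NSMutableArray array];
    PKToken *eof = [PKToken EOFToken];
    PKToken *tok = nil;
    while (eof != (tok = [t nextToken])) {
        [out addObject:tok.stringValue];
    }
    return out;
}

int main(void) {
    @autoreleasepool {
        NSLog(@"%@", PKDemoSymbolStrings(@"x := y != z"));
    }
    return 0;
}
```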
Word
«Blast-off» is recognized as a single Word token despite containing a symbol character («-») that would normally signal the start of a new Symbol token. By default, PKTokenizer allows Word tokens to contain (but not start with) several symbol and number characters: «-», «_», «'», «0»-«9». The consequence of this behavior is that PKTokenizer will recognize the following strings as individual Word tokens by default: «it's», «first_name», «sat-yr-9», «Rodham-Clinton». Again, you can configure PKTokenizer to alter this default behavior.
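One way to alter this default is sketched below. It assumes wordState offers -setWordChars:from:to: (the method name used in ParseKit) and uses it to stop «-» from continuing a Word token. `PKDemoNoHyphenWords` is a hypothetical helper name.

```objc
#import <Foundation/Foundation.h>
#import <PEGKit/PEGKit.h> // assumption: umbrella header of the linked framework

// Hypothetical helper (not part of PEGKit): tokenizes `s` with «-»
// disallowed inside Word tokens and returns each token's stringValue.
static NSArray *PKDemoNoHyphenWords(NSString *s) {
    PKTokenizer *t = [PKTokenizer tokenizerWithString:s];

    // Assumed API: mark the single-character range '-'..'-' as non-word.
    [t.wordState setWordChars:NO from:'-' to:'-'];

    NSMutableArray *out = [NSMutableArray array];
    PKToken *eof = [PKToken EOFToken];
    PKToken *tok = nil;
    while (eof != (tok = [t nextToken])) {
        [out addObject:tok.stringValue];
    }
    return out;
}

int main(void) {
    @autoreleasepool {
        // With the change above, «Blast-off» should no longer be one Word token.
        NSLog(@"%@", PKDemoNoHyphenWords(@"Blast-off"));
    }
    return 0;
}
```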
Quoted String
PKTokenizer produces Quoted String tokens for substrings enclosed in quote delimiter characters. The default delimiters are single- or double-quotes («'» or «"»). The quote delimiter characters may be changed (see below), but each must be a single character. Note that the stringValue of a Quoted String token includes the quote delimiter characters («'Woo-hoo!'»).
Whitespace
By default, whitespace characters are silently consumed by PKTokenizer, and Whitespace tokens are never emitted. However, you can configure which characters are considered whitespace characters or even ask PKTokenizer to return Whitespace tokens containing the literal whitespace stringValues by setting: t.whitespaceState.reportsWhitespaceTokens = YES.
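A short sketch of the reportsWhitespaceTokens setting mentioned above follows. `PKDemoTokensWithWhitespace` is a hypothetical helper name, and the import line assumes the framework is linked under its usual name.

```objc
#import <Foundation/Foundation.h>
#import <PEGKit/PEGKit.h> // assumption: umbrella header of the linked framework

// Hypothetical helper (not part of PEGKit): returns every token in `s`,
// including Whitespace tokens.
static NSArray *PKDemoTokensWithWhitespace(NSString *s) {
    PKTokenizer *t = [PKTokenizer tokenizerWithString:s];
    t.whitespaceState.reportsWhitespaceTokens = YES;

    NSMutableArray *out = [NSMutableArray array];
    PKToken *eof = [PKToken EOFToken];
    PKToken *tok = nil;
    while (eof != (tok = [t nextToken])) {
        [out addObject:tok];
    }
    return out;
}

int main(void) {
    @autoreleasepool {
        for (PKToken *tok in PKDemoTokensWithWhitespace(@"a b")) {
            NSLog(@"%@ whitespace=%d", [tok debugDescription], tok.isWhitespace);
        }
    }
    return 0;
}
```

Reporting whitespace is useful when the original spacing has to be preserved, for example when re-emitting the source text after a transformation.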
Comment
By default, PKTokenizer recognizes C++-style («//») and C-style («/*» «*/») comments and silently removes the associated comments from the output rather than producing Comment tokens. See below for steps to either change comment delimiting markers, report Comment tokens, or to turn off comment recognition altogether.
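Reporting Comment tokens might look like the sketch below. It assumes commentState has a reportsCommentTokens property, mirroring the whitespace setting and the ParseKit API, so verify it against the PKCommentState header. `PKDemoCommentStrings` is a hypothetical helper name.

```objc
#import <Foundation/Foundation.h>
#import <PEGKit/PEGKit.h> // assumption: umbrella header of the linked framework

// Hypothetical helper (not part of PEGKit): returns the stringValue of
// every Comment token found in `s`.
static NSArray *PKDemoCommentStrings(NSString *s) {
    PKTokenizer *t = [PKTokenizer tokenizerWithString:s];

    // Assumed API: emit Comment tokens instead of discarding comments.
    t.commentState.reportsCommentTokens = YES;

    NSMutableArray *out = [NSMutableArray array];
    PKToken *eof = [PKToken EOFToken];
    PKToken *tok = nil;
    while (eof != (tok = [t nextToken])) {
        if (tok.isComment) {
            [out addObject:tok.stringValue];
        }
    }
    return out;
}

int main(void) {
    @autoreleasepool {
        NSLog(@"%@", PKDemoCommentStrings(@"1 /* two */ 3"));
    }
    return 0;
}
```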
Delimited String
The Delimited String token type is a powerful feature of PEGKit which can be used much like a regular expression. Use the Delimited String token type to ask PKTokenizer to recognize tokens with arbitrary start and end symbol strings much like a Quoted String but with more power:
- The start and end symbols may be multi-char (e.g. «<#» «#>»)
- The start and end symbols need not match (e.g. «<?=» «?>»)
- The characters allowed within the delimited string may be specified using an NSCharacterSet
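The three properties above can be sketched as follows. The method names -setTokenizerState:from:to: and -addStartMarker:endMarker:allowedCharacterSet: are assumptions carried over from ParseKit, as is the reading that a nil character set places no restriction on the characters between the markers. `PKDemoDelimitedStrings` is a hypothetical helper name.

```objc
#import <Foundation/Foundation.h>
#import <PEGKit/PEGKit.h> // assumption: umbrella header of the linked framework

// Hypothetical helper (not part of PEGKit): returns the stringValue of
// every Delimited String token bounded by «<?=» and «?>» in `s`.
static NSArray *PKDemoDelimitedStrings(NSString *s) {
    PKTokenizer *t = [PKTokenizer tokenizerWithString:s];

    // Assumed API: hand tokens starting with '<' to the delimit state first.
    [t setTokenizerState:t.delimitState from:'<' to:'<'];

    // Assumed API: multi-char, non-matching markers; nil is assumed to
    // mean that any character may appear between them.
    NSCharacterSet *allowed = nil;
    [t.delimitState addStartMarker:@"<?=" endMarker:@"?>" allowedCharacterSet:allowed];

    NSMutableArray *out = [NSMutableArray array];
    PKToken *eof = [PKToken EOFToken];
    PKToken *tok = nil;
    while (eof != (tok = [t nextToken])) {
        if (tok.isDelimitedString) {
            [out addObject:tok.stringValue];
        }
    }
    return out;
}

int main(void) {
    @autoreleasepool {
        NSLog(@"%@", PKDemoDelimitedStrings(@"<?= 'foo' ?>"));
    }
    return 0;
}
```

Passing a non-nil NSCharacterSet (for example, alphanumerics only) would narrow what may appear between the markers, which is where the comparison to a regular expression comes from.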
<a name="custom-tokenizer-behavior"></a>
Customizing PKTokenizer behavior
There are two basic types of decisions PKTokenizer must make when tokenizing strings:
- Which token type should be created for a given start character?
- Which characters are allowed within the curren
