gotreesitter
Pure-Go tree-sitter runtime. No CGo, no C toolchain. Cross-compiles to any GOOS/GOARCH target Go supports, including wasip1.
go get github.com/odvcencio/gotreesitter
gotreesitter loads the same parse-table format that tree-sitter's C runtime uses. Grammar tables are extracted from upstream parser.c files by ts2go, compressed into binary blobs, and deserialized on first use. 206 grammars ship in the registry.
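The "deserialized on first use" behavior can be illustrated with a small standard-library sketch. This is not gotreesitter's actual loader: the `table` type, the gzip+gob encoding, and the `lazyTable` helper are all assumptions chosen only to show the lazy, decode-once pattern.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/gob"
	"fmt"
	"sync"
)

// table is a hypothetical stand-in for a grammar's parse table.
type table struct {
	Name   string
	States []uint16
}

// lazyTable holds a compressed blob and decodes it at most once.
type lazyTable struct {
	blob []byte
	once sync.Once
	tab  *table
	err  error
}

// get decompresses and decodes the blob on the first call and
// returns the cached result on every later call.
func (l *lazyTable) get() (*table, error) {
	l.once.Do(func() {
		zr, err := gzip.NewReader(bytes.NewReader(l.blob))
		if err != nil {
			l.err = err
			return
		}
		defer zr.Close()
		var t table
		if err := gob.NewDecoder(zr).Decode(&t); err != nil {
			l.err = err
			return
		}
		l.tab = &t
	})
	return l.tab, l.err
}

// encode builds a compressed blob, as a build-time tool might.
func encode(t table) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if err := gob.NewEncoder(zw).Encode(t); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	blob, err := encode(table{Name: "go", States: []uint16{1, 2, 3}})
	if err != nil {
		panic(err)
	}
	lt := &lazyTable{blob: blob}
	first, _ := lt.get()
	second, _ := lt.get() // served from the cache, no second decode
	fmt.Println(first.Name, len(first.States), first == second)
}
```

With this shape, a program that links many grammars but parses only one language pays the decode cost for that one grammar only.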
Motivation
Every Go tree-sitter binding in the ecosystem depends on CGo:
- Cross-compilation requires a C cross-toolchain per target. GOOS=wasip1, GOARCH=arm64 from a Linux host, or any Windows build without MSYS2/MinGW, will not link.
- CI images must carry gcc and the grammar's C sources. go install fails for downstream users who don't have a C compiler.
- The Go race detector, coverage instrumentation, and fuzzer cannot see across the CGo boundary. Bugs in the C runtime or in FFI marshaling are invisible to go test -race.
gotreesitter eliminates the C dependency entirely. The parser, lexer, query engine, incremental reparsing, arena allocator, external scanners, and tree cursor are all implemented in Go. The only input is the grammar blob.
Quick start
package main

import (
	"fmt"

	"github.com/odvcencio/gotreesitter"
	"github.com/odvcencio/gotreesitter/grammars"
)

func main() {
	src := []byte(`package main
func main() {}
`)
	lang := grammars.GoLanguage()
	parser := gotreesitter.NewParser(lang)
	// The second return value is discarded here for brevity; check it in real code.
	tree, _ := parser.Parse(src)
	fmt.Println(tree.RootNode())
}
grammars.DetectLanguage("main.go") resolves a filename to the appropriate LangEntry.
Queries
q, _ := gotreesitter.NewQuery(`(function_declaration name: (identifier) @fn)`, lang)
cursor := q.Exec(tree.RootNode(), lang, src)
for {
match, ok := cursor.NextMatch()
if !ok {
break
}
for _, cap := range match.Captures {
fmt.Println(cap.Node.Text(src))
}
}
The query engine supports the full S-expression pattern language: structural quantifiers (?, *, +), alternation ([...]), field constraints, negated fields (!field), anchors (.), and all standard predicates. See Query API.
Typed query codegen
Generate type-safe Go wrappers from .scm query files:
go run ./cmd/tsquery -input queries/go_functions.scm -lang go -output go_functions_query.go -package queries
Given a query like (function_declaration name: (identifier) @name body: (block) @body), tsquery generates:
// Generated by tsquery:
type FunctionDeclarationMatch struct {
	Name *gotreesitter.Node
	Body *gotreesitter.Node
}

// Usage:
q, _ := queries.NewGoFunctionsQuery(lang)
cursor := q.Exec(tree.RootNode(), lang, src)
for {
	match, ok := cursor.Next()
	if !ok {
		break
	}
	fmt.Println(match.Name.Text(src))
}
Multi-pattern queries generate one struct per pattern with MatchPatternN conversion helpers.
Multi-language documents (injection parsing)
Parse documents with embedded languages (HTML+JS+CSS, Markdown+code fences, Vue/Svelte templates):
ip := gotreesitter.NewInjectionParser()
ip.RegisterLanguage("html", htmlLang)
ip.RegisterLanguage("javascript", jsLang)
ip.RegisterLanguage("css", cssLang)
ip.RegisterInjectionQuery("html", injectionQuery)
result, _ := ip.Parse(source, "html")
for _, inj := range result.Injections {
fmt.Printf("%s: %d ranges\n", inj.Language, len(inj.Ranges))
// inj.Tree is the child language's parse tree
}
Supports static (#set! injection.language "javascript") and dynamic (@injection.language capture) language detection, recursive nested injections, and incremental reparse with child tree reuse.
Source rewriting
Collect source-level edits and apply atomically, producing InputEdit records for incremental reparse:
rw := gotreesitter.NewRewriter(src)
rw.Replace(funcNameNode, []byte("newName"))
rw.InsertBefore(bodyNode, []byte("// added\n"))
rw.Delete(unusedNode)
newSrc, _ := rw.ApplyToTree(tree)
newTree, _ := parser.ParseIncremental(newSrc, tree)
Apply() returns both the new source bytes and the []InputEdit records. ApplyToTree() is a convenience that calls tree.Edit() for each edit and returns source ready for ParseIncremental.
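To make the "collect, then apply atomically" idea concrete, here is a minimal standard-library sketch. It is not the Rewriter's implementation: the `edit` and `editRecord` types and the `apply` function are hypothetical, and only the byte fields of an InputEdit are modeled. The sketch validates every edit before touching the source, then applies edits back to front so that the offsets of edits still pending stay valid in original-source coordinates.

```go
package main

import (
	"fmt"
	"sort"
)

// edit replaces src[start:end] with text (hypothetical type).
type edit struct {
	start, end int
	text       []byte
}

// editRecord mirrors only the byte fields of an InputEdit.
type editRecord struct {
	StartByte, OldEndByte, NewEndByte int
}

// apply rejects out-of-range or overlapping edits up front, then
// applies the rest from the highest offset down on a copy of src.
func apply(src []byte, edits []edit) ([]byte, []editRecord, error) {
	sorted := append([]edit(nil), edits...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].start > sorted[j].start
	})
	for i, e := range sorted {
		if e.start < 0 || e.end < e.start || e.end > len(src) {
			return nil, nil, fmt.Errorf("edit at byte %d is out of range", e.start)
		}
		if i > 0 && e.end > sorted[i-1].start {
			return nil, nil, fmt.Errorf("edits overlap near byte %d", sorted[i-1].start)
		}
	}
	out := append([]byte(nil), src...)
	records := make([]editRecord, 0, len(sorted))
	for _, e := range sorted {
		tail := append([]byte(nil), out[e.end:]...) // copy before overwriting
		out = append(append(out[:e.start], e.text...), tail...)
		records = append(records, editRecord{
			StartByte:  e.start,
			OldEndByte: e.end,
			NewEndByte: e.start + len(e.text),
		})
	}
	return out, records, nil
}

func main() {
	src := []byte("func old() { unused() }")
	out, recs, err := apply(src, []edit{
		{start: 5, end: 8, text: []byte("newName")}, // replace "old"
		{start: 13, end: 22},                        // delete "unused() "
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s\n%+v\n", out, recs)
}
```

Because validation happens before any mutation and the work is done on a copy, a rejected batch leaves the caller's source untouched.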
Incremental reparsing
tree, _ := parser.Parse(src)
// User types "x" at byte offset 42
src = append(src[:42], append([]byte("x"), src[42:]...)...)
tree.Edit(gotreesitter.InputEdit{
StartByte: 42,
OldEndByte: 42,
NewEndByte: 43,
StartPoint: gotreesitter.Point{Row: 3, Column: 10},
OldEndPoint: gotreesitter.Point{Row: 3, Column: 10},
NewEndPoint: gotreesitter.Point{Row: 3, Column: 11},
})
tree2, _ := parser.ParseIncremental(src, tree)
ParseIncremental walks the old tree's spine, identifies the edit region, and reuses unchanged subtrees by reference. Only the invalidated span is re-lexed and re-parsed. Both leaf and non-leaf subtrees are eligible for reuse; non-leaf reuse is driven by pre-goto state tracking on interior nodes, so the parser can skip entire subtrees without re-deriving their contents.
When no edit has occurred, ParseIncremental detects the nil-edit on a pointer check and returns in single-digit nanoseconds with zero allocations.
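The InputEdit in the example above uses hand-written row and column values. In practice these are usually derived from byte offsets. The sketch below shows one way to do that with the standard library only. The `point` and `inputEdit` types are local stand-ins that mirror the field names used above, and `pointAt` and `insertEdit` are hypothetical helpers. It assumes columns are counted in bytes from the start of the line, as in upstream tree-sitter.

```go
package main

import (
	"bytes"
	"fmt"
)

// point mirrors the Row/Column shape used in the example above.
type point struct{ Row, Column uint32 }

// inputEdit mirrors the InputEdit fields used in the example above.
type inputEdit struct {
	StartByte, OldEndByte, NewEndByte   uint32
	StartPoint, OldEndPoint, NewEndPoint point
}

// pointAt converts a byte offset into a zero-based (row, column) pair.
func pointAt(src []byte, offset int) point {
	prefix := src[:offset]
	row := bytes.Count(prefix, []byte("\n"))
	lineStart := bytes.LastIndexByte(prefix, '\n') + 1 // 0 when no newline precedes
	return point{Row: uint32(row), Column: uint32(offset - lineStart)}
}

// insertEdit inserts text at offset and returns the new source
// together with the edit record describing the change.
func insertEdit(src []byte, offset int, text []byte) ([]byte, inputEdit) {
	out := make([]byte, 0, len(src)+len(text))
	out = append(out, src[:offset]...)
	out = append(out, text...)
	out = append(out, src[offset:]...)

	start := pointAt(src, offset)
	newEnd := offset + len(text)
	return out, inputEdit{
		StartByte:   uint32(offset),
		OldEndByte:  uint32(offset), // pure insertion: nothing removed
		NewEndByte:  uint32(newEnd),
		StartPoint:  start,
		OldEndPoint: start,
		NewEndPoint: pointAt(out, newEnd),
	}
}

func main() {
	src := []byte("a\nbc\ndef\n")
	out, e := insertEdit(src, 6, []byte("x"))
	fmt.Printf("%q\n%+v\n", out, e)
}
```

Building the new source in a fresh slice also avoids aliasing the old buffer, which matters if the old tree is still being read elsewhere.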
Tree cursor
TreeCursor maintains an explicit (node, childIndex) frame stack. Parent, child, and sibling movement are O(1) with zero allocations — sibling traversal indexes directly into the parent's children[] slice.
c := gotreesitter.NewTreeCursorFromTree(tree)
c.GotoFirstChild()
c.GotoChildByFieldName("body")
for ok := c.GotoFirstNamedChild(); ok; ok = c.GotoNextNamedSibling() {
fmt.Printf("%s at %d\n", c.CurrentNodeType(), c.CurrentNode().StartByte())
}
idx := c.GotoFirstChildForByte(128)
Movement methods: GotoFirstChild, GotoLastChild, GotoNextSibling, GotoPrevSibling, GotoParent, named-only variants (GotoFirstNamedChild, etc.), field-based (GotoChildByFieldName, GotoChildByFieldID), and position-based (GotoFirstChildForByte, GotoFirstChildForPoint).
Cursors hold direct pointers into tree nodes. Recreate after Tree.Release(), Tree.Edit(...), or incremental reparse.
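The frame-stack design can be sketched in a few lines of standard-library Go. This is an illustration of the technique described above, not the TreeCursor source: the `node`, `frame`, and `cursor` types and their lowercase methods are hypothetical. Each frame records a node plus its index in the parent's children slice, so moving to a sibling is a single slice index.

```go
package main

import "fmt"

// node is a hypothetical minimal tree node.
type node struct {
	kind     string
	children []*node
}

// frame pairs a node with its index in the parent's children slice.
type frame struct {
	n          *node
	childIndex int // -1 for the root
}

type cursor struct{ stack []frame }

func newCursor(root *node) *cursor {
	return &cursor{stack: []frame{{n: root, childIndex: -1}}}
}

func (c *cursor) current() *node { return c.stack[len(c.stack)-1].n }

// gotoFirstChild pushes a frame for the first child, if any.
func (c *cursor) gotoFirstChild() bool {
	cur := c.current()
	if len(cur.children) == 0 {
		return false
	}
	c.stack = append(c.stack, frame{n: cur.children[0], childIndex: 0})
	return true
}

// gotoNextSibling rewrites the top frame in place using the parent's slice.
func (c *cursor) gotoNextSibling() bool {
	if len(c.stack) < 2 {
		return false
	}
	top := &c.stack[len(c.stack)-1]
	parent := c.stack[len(c.stack)-2].n
	next := top.childIndex + 1
	if next >= len(parent.children) {
		return false
	}
	top.n, top.childIndex = parent.children[next], next
	return true
}

// gotoParent pops the top frame.
func (c *cursor) gotoParent() bool {
	if len(c.stack) < 2 {
		return false
	}
	c.stack = c.stack[:len(c.stack)-1]
	return true
}

// preorder walks the whole tree with the cursor and collects node kinds.
func preorder(root *node) []string {
	var kinds []string
	c := newCursor(root)
	for {
		kinds = append(kinds, c.current().kind)
		if c.gotoFirstChild() {
			continue
		}
		for !c.gotoNextSibling() {
			if !c.gotoParent() {
				return kinds
			}
		}
	}
}

func main() {
	tree := &node{kind: "root", children: []*node{
		{kind: "a", children: []*node{{kind: "b"}, {kind: "c"}}},
		{kind: "d"},
	}}
	fmt.Println(preorder(tree))
}
```

In this sketch the stack slice grows only when the cursor descends past its previous maximum depth, so steady-state movement does not allocate. The frames hold raw node pointers, which is also why a cursor of this kind goes stale once the underlying tree is edited or released.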
Highlighting
hl, _ := gotreesitter.NewHighlighter(lang, highlightQuery)
ranges := hl.Highlight(src)
for _, r := range ranges {
fmt.Printf("%s: %q\n", r.Capture, src[r.StartByte:r.EndByte])
}
Tagging
entry := grammars.DetectLanguage("main.go")
lang := entry.Language()
tagger, _ := gotreesitter.NewTagger(lang, entry.TagsQuery)
tags := tagger.Tag(src)
for _, tag := range tags {
fmt.Printf("%s %s at %d:%d\n", tag.Kind, tag.Name,
tag.NameRange.StartPoint.Row, tag.NameRange.StartPoint.Column)
}
Benchmarks
All measurements below use the same workload: a generated Go source file with 500 functions (19294 bytes).
Numbers are medians from 10 runs on:
goos: linux
goarch: amd64
cpu: Intel(R) Core(TM) Ultra 9 285
| Runtime | Full parse | Incremental (1-byte edit) | Incremental (no edit) |
|---|---:|---:|---:|
| Native C (pure C runtime) | 1.76 ms | 102.3 μs | 101.7 μs |
| CGo binding (C runtime via cgo) | ~2.0 ms | ~130 μs | — |
| gotreesitter (pure Go) | 4.20 ms | 1.49 μs | 2.18 ns |
On this workload:
- Full parse is ~2.4x slower than native C.
- Incremental single-byte edits are ~69x faster than native C (~87x faster than CGo).
- No-edit reparses are ~46,600x faster than native C, zero allocations.
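The ratios above follow directly from the median ns/op figures reported for this workload. A short check, with the medians copied in as constants and `ratio` as a hypothetical helper name:

```go
package main

import "fmt"

// ratio reports how many times larger a is than b.
func ratio(a, b float64) float64 { return a / b }

func main() {
	// Median ns/op values for the 500-function workload.
	const (
		nativeFull   = 1764436.0
		nativeIncr   = 102336.0
		nativeNoEdit = 101740.0
		cgoIncr      = 130000.0 // approximate figure
		goFull       = 4197811.0
		goIncr       = 1490.0
		goNoEdit     = 2.181
	)
	fmt.Printf("full parse: %.1fx slower than native C\n", ratio(goFull, nativeFull))
	fmt.Printf("1-byte edit: %.0fx faster than native C\n", ratio(nativeIncr, goIncr))
	fmt.Printf("1-byte edit: %.0fx faster than CGo\n", ratio(cgoIncr, goIncr))
	fmt.Printf("no edit: %.0fx faster than native C\n", ratio(nativeNoEdit, goNoEdit))
}
```

The CGo comparison inherits the approximation in the CGo median, so it is best read as a rough figure.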
# Pure Go (this repo):
GOMAXPROCS=1 go test . -run '^$' \
-bench 'BenchmarkGoParseFullDFA|BenchmarkGoParseIncrementalSingleByteEditDFA|BenchmarkGoParseIncrementalNoEditDFA' \
-benchmem -count=10 -benchtime=1s
# CGo binding benchmarks:
cd cgo_harness
GOMAXPROCS=1 go test . -run '^$' -tags treesitter_c_bench \
-bench 'BenchmarkCTreeSitterGoParseFull|BenchmarkCTreeSitterGoParseIncrementalSingleByteEdit|BenchmarkCTreeSitterGoParseIncrementalNoEdit' \
-benchmem -count=10 -benchtime=750ms
# Native C benchmarks (no Go, direct C binary):
./pure_c/run_go_benchmark.sh 500 2000 20000
| Benchmark | Median ns/op | B/op | allocs/op |
|---|---:|---:|---:|
| Native C full parse | 1,764,436 | — | — |
| Native C incremental (1-byte edit) | 102,336 | — | — |
| Native C incremental (no edit) | 101,740 | — | — |
| CTreeSitterGoParseFull | ~1,990,000 | 600 | 6 |
| CTreeSitterGoParseIncrementalSingleByteEdit | ~130,000 | 648 | 7 |
| GoParseFullDFA | 4,197,811 | 585 | 7 |
| GoParseIncrementalSingleByteEditDFA | 1,490 | 1,584 | 9 |
| GoParseIncrementalNoEditDFA | 2.181 | 0 | 0 |
Benchmark matrix
For repeatable multi-workload tracking:
go run ./cmd/benchmatrix --count 10
Emits bench_out/matrix.json (machine-readable), bench_out/matrix.md (summary), and raw logs under bench_out/raw/.
Supported languages
206 grammars ship in the registry. All 206 produce error-free parse trees on smoke samples. Run go run ./cmd/parity_report for current status.
- 116 external scanners (hand-written Go implementations of upstream C scanners)
- 7 hand-written Go token sources (authzed, c, cpp, go, java, json, lua)
- Remaining languages use the DFA lexer generated from grammar tables
Parse quality
Each LangEntry carries a Quality field:
| Quality | Meaning |
|---|---|
| full | All scanner and lexer components present. Parser has full access to the grammar. |
| partial | Missing external scanner. DFA lexer handles what it can; external tokens are skipped. |
| none | Cannot parse. |
full means the parser has every component the grammar requires. It does not guarantee error-free trees on all inputs — grammars with high GLR ambiguity may produce syntax errors on very large or deeply nested constructs due to parser safety limits (iteration cap, stack depth cap, node count cap). These limits scale with input size. Check `tree.RootNode().HasError(
