gotreesitter

Pure-Go tree-sitter runtime. No CGo, no C toolchain. Cross-compiles to any GOOS/GOARCH target Go supports, including wasip1.

go get github.com/odvcencio/gotreesitter

gotreesitter loads the same parse-table format that tree-sitter's C runtime uses. Grammar tables are extracted from upstream parser.c files by ts2go, compressed into binary blobs, and deserialized on first use. 206 grammars ship in the registry.

Motivation

Every Go tree-sitter binding in the ecosystem depends on CGo:

Cross-compilation requires a C cross-toolchain per target. GOOS=wasip1, GOARCH=arm64 from a Linux host, or any Windows build without MSYS2/MinGW, will not link.
CI images must carry gcc and the grammar's C sources. go install fails for downstream users who don't have a C compiler.
The Go race detector, coverage instrumentation, and fuzzer cannot see across the CGo boundary. Bugs in the C runtime or in FFI marshaling are invisible to go test -race.

gotreesitter eliminates the C dependency entirely. The parser, lexer, query engine, incremental reparsing, arena allocator, external scanners, and tree cursor are all implemented in Go. The only input is the grammar blob.

Quick start

import (
    "fmt"

    "github.com/odvcencio/gotreesitter"
    "github.com/odvcencio/gotreesitter/grammars"
)

func main() {
    src := []byte(`package main

func main() {}
`)

    lang := grammars.GoLanguage()
    parser := gotreesitter.NewParser(lang)

    tree, _ := parser.Parse(src)
    fmt.Println(tree.RootNode())
}

grammars.DetectLanguage("main.go") resolves a filename to the appropriate LangEntry.

Queries

q, _ := gotreesitter.NewQuery(`(function_declaration name: (identifier) @fn)`, lang)
cursor := q.Exec(tree.RootNode(), lang, src)

for {
    match, ok := cursor.NextMatch()
    if !ok {
        break
    }
    for _, cap := range match.Captures {
        fmt.Println(cap.Node.Text(src))
    }
}

The query engine supports the full S-expression pattern language: structural quantifiers (?, *, +), alternation ([...]), field constraints, negated fields, anchor (!), and all standard predicates. See Query API.

Typed query codegen

Generate type-safe Go wrappers from .scm query files:

go run ./cmd/tsquery -input queries/go_functions.scm -lang go -output go_functions_query.go -package queries

Given a query like (function_declaration name: (identifier) @name body: (block) @body), tsquery generates:

type FunctionDeclarationMatch struct {
    Name *gotreesitter.Node
    Body *gotreesitter.Node
}

q, _ := queries.NewGoFunctionsQuery(lang)
cursor := q.Exec(tree.RootNode(), lang, src)
for {
    match, ok := cursor.Next()
    if !ok { break }
    fmt.Println(match.Name.Text(src))
}

Multi-pattern queries generate one struct per pattern with MatchPatternN conversion helpers.

Multi-language documents (injection parsing)

Parse documents with embedded languages (HTML+JS+CSS, Markdown+code fences, Vue/Svelte templates):

ip := gotreesitter.NewInjectionParser()
ip.RegisterLanguage("html", htmlLang)
ip.RegisterLanguage("javascript", jsLang)
ip.RegisterLanguage("css", cssLang)
ip.RegisterInjectionQuery("html", injectionQuery)

result, _ := ip.Parse(source, "html")

for _, inj := range result.Injections {
    fmt.Printf("%s: %d ranges\n", inj.Language, len(inj.Ranges))
    // inj.Tree is the child language's parse tree
}

Supports static (#set! injection.language "javascript") and dynamic (@injection.language capture) language detection, recursive nested injections, and incremental reparse with child tree reuse.

Source rewriting

Collect source-level edits and apply atomically, producing InputEdit records for incremental reparse:

rw := gotreesitter.NewRewriter(src)
rw.Replace(funcNameNode, []byte("newName"))
rw.InsertBefore(bodyNode, []byte("// added\n"))
rw.Delete(unusedNode)

newSrc, _ := rw.ApplyToTree(tree)
newTree, _ := parser.ParseIncremental(newSrc, tree)

Apply() returns both the new source bytes and the []InputEdit records. ApplyToTree() is a convenience that calls tree.Edit() for each edit and returns source ready for ParseIncremental.

Incremental reparsing

tree, _ := parser.Parse(src)

// User types "x" at byte offset 42
src = append(src[:42], append([]byte("x"), src[42:]...)...)

tree.Edit(gotreesitter.InputEdit{
    StartByte:   42,
    OldEndByte:  42,
    NewEndByte:  43,
    StartPoint:  gotreesitter.Point{Row: 3, Column: 10},
    OldEndPoint: gotreesitter.Point{Row: 3, Column: 10},
    NewEndPoint: gotreesitter.Point{Row: 3, Column: 11},
})

tree2, _ := parser.ParseIncremental(src, tree)

ParseIncremental walks the old tree's spine, identifies the edit region, and reuses unchanged subtrees by reference. Only the invalidated span is re-lexed and re-parsed. Both leaf and non-leaf subtrees are eligible for reuse; non-leaf reuse is driven by pre-goto state tracking on interior nodes, so the parser can skip entire subtrees without re-deriving their contents.

When no edit has occurred, ParseIncremental detects the nil-edit on a pointer check and returns in single-digit nanoseconds with zero allocations.

Tree cursor

TreeCursor maintains an explicit (node, childIndex) frame stack. Parent, child, and sibling movement are O(1) with zero allocations — sibling traversal indexes directly into the parent's children[] slice.

c := gotreesitter.NewTreeCursorFromTree(tree)

c.GotoFirstChild()
c.GotoChildByFieldName("body")

for ok := c.GotoFirstNamedChild(); ok; ok = c.GotoNextNamedSibling() {
    fmt.Printf("%s at %d\n", c.CurrentNodeType(), c.CurrentNode().StartByte())
}

idx := c.GotoFirstChildForByte(128)

Movement methods: GotoFirstChild, GotoLastChild, GotoNextSibling, GotoPrevSibling, GotoParent, named-only variants (GotoFirstNamedChild, etc.), field-based (GotoChildByFieldName, GotoChildByFieldID), and position-based (GotoFirstChildForByte, GotoFirstChildForPoint).

Cursors hold direct pointers into tree nodes. Recreate after Tree.Release(), Tree.Edit(...), or incremental reparse.

Highlighting

hl, _ := gotreesitter.NewHighlighter(lang, highlightQuery)
ranges := hl.Highlight(src)

for _, r := range ranges {
    fmt.Printf("%s: %q\n", r.Capture, src[r.StartByte:r.EndByte])
}

Tagging

entry := grammars.DetectLanguage("main.go")
lang := entry.Language()

tagger, _ := gotreesitter.NewTagger(lang, entry.TagsQuery)
tags := tagger.Tag(src)

for _, tag := range tags {
    fmt.Printf("%s %s at %d:%d\n", tag.Kind, tag.Name,
        tag.NameRange.StartPoint.Row, tag.NameRange.StartPoint.Column)
}

Benchmarks

All measurements below use the same workload: a generated Go source file with 500 functions (19294 bytes). Numbers are medians from 10 runs on:

goos: linux
goarch: amd64
cpu: Intel(R) Core(TM) Ultra 9 285

| Runtime | Full parse | Incremental (1-byte edit) | Incremental (no edit) | |---|---:|---:|---:| | Native C (pure C runtime) | 1.76 ms | 102.3 μs | 101.7 μs | | CGo binding (C runtime via cgo) | ~2.0 ms | ~130 μs | — | | gotreesitter (pure Go) | 4.20 ms | 1.49 μs | 2.18 ns |

On this workload:

Full parse is ~2.4x slower than native C.
Incremental single-byte edits are ~69x faster than native C (~87x faster than CGo).
No-edit reparses are ~46,600x faster than native C, zero allocations.

<details> <summary>Raw benchmark output</summary>

# Pure Go (this repo):
GOMAXPROCS=1 go test . -run '^$' \
  -bench 'BenchmarkGoParseFullDFA|BenchmarkGoParseIncrementalSingleByteEditDFA|BenchmarkGoParseIncrementalNoEditDFA' \
  -benchmem -count=10 -benchtime=1s

# CGo binding benchmarks:
cd cgo_harness
GOMAXPROCS=1 go test . -run '^$' -tags treesitter_c_bench \
  -bench 'BenchmarkCTreeSitterGoParseFull|BenchmarkCTreeSitterGoParseIncrementalSingleByteEdit|BenchmarkCTreeSitterGoParseIncrementalNoEdit' \
  -benchmem -count=10 -benchtime=750ms

# Native C benchmarks (no Go, direct C binary):
./pure_c/run_go_benchmark.sh 500 2000 20000

| Benchmark | Median ns/op | B/op | allocs/op | |---|---:|---:|---:| | Native C full parse | 1,764,436 | — | — | | Native C incremental (1-byte edit) | 102,336 | — | — | | Native C incremental (no edit) | 101,740 | — | — | | CTreeSitterGoParseFull | ~1,990,000 | 600 | 6 | | CTreeSitterGoParseIncrementalSingleByteEdit | ~130,000 | 648 | 7 | | GoParseFullDFA | 4,197,811 | 585 | 7 | | GoParseIncrementalSingleByteEditDFA | 1,490 | 1,584 | 9 | | GoParseIncrementalNoEditDFA | 2.181 | 0 | 0 |

</details>

Benchmark matrix

For repeatable multi-workload tracking:

go run ./cmd/benchmatrix --count 10

Emits bench_out/matrix.json (machine-readable), bench_out/matrix.md (summary), and raw logs under bench_out/raw/.

Supported languages

206 grammars ship in the registry. All 206 produce error-free parse trees on smoke samples. Run go run ./cmd/parity_report for current status.

116 external scanners (hand-written Go implementations of upstream C scanners)
7 hand-written Go token sources (authzed, c, cpp, go, java, json, lua)
Remaining languages use the DFA lexer generated from grammar tables

Parse quality

Each LangEntry carries a Quality field:

| Quality | Meaning | |---|---| | full | All scanner and lexer components present. Parser has full access to the grammar. | | partial | Missing external scanner. DFA lexer handles what it can; external tokens are skipped. | | none | Cannot parse. |

full means the parser has every component the grammar requires. It does not guarantee error-free trees on all inputs — grammars with high GLR ambiguity may produce syntax errors on very large or deeply nested constructs due to parser safety limits (iteration cap, stack depth cap, node count cap). These limits scale with input size. Check `tree.RootNode().HasError(

Gotreesitter

Install / Use

README