zpdf (alpha stage - early version)

A PDF text extraction library written in Zig.

Features

Memory-mapped file reading, zero-copy where possible
Streaming text extraction with efficient arena allocation
Multiple decompression filters: FlateDecode, ASCII85, ASCIIHex, LZW, RunLength
Font encoding support: WinAnsi, MacRoman, ToUnicode CMap
XRef table and stream parsing (PDF 1.5+)
Configurable error handling (strict or permissive)
Structure tree extraction for tagged PDFs (PDF/UA)
Geometric (Y→X) reading order for non-tagged PDFs
Markdown export for structured PDFs

Benchmark

Text extraction performance on Apple M4 Pro (reading order):

| Document | Pages | zpdf | MuPDF | Speedup |
|----------|------:|-----:|------:|--------:|
| Intel SDM | 5,252 | 582ms | 2,152ms | 3.7x |
| Pandas Docs | 3,743 | 640ms | 1,130ms | 1.8x |
| C++ Standard | 2,134 | 438ms | 1,007ms | 2.3x |
| PDF Reference 1.7 | 1,310 | 236ms | 1,481ms | 6.3x |

Build with zig build -Doptimize=ReleaseFast for best performance.

Requirements

Zig 0.15.2 or later

Building

zig build              # Build library and CLI
zig build test         # Run tests

Usage

Library

const std = @import("std");
const zpdf = @import("zpdf");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    const doc = try zpdf.Document.open(allocator, "file.pdf");
    defer doc.close();

    var buf: [4096]u8 = undefined;
    var bw = std.fs.File.stdout().writer(&buf);
    const writer = &bw.interface;
    defer writer.flush() catch {};

    for (0..doc.pageCount()) |page_num| {
        try doc.extractText(page_num, writer);
    }
}

CLI

zpdf extract document.pdf              # Extract all pages (uses structure tree for reading order)
zpdf extract -p 1-10 document.pdf      # Extract pages 1-10
zpdf extract -o out.txt document.pdf   # Output to file
zpdf info document.pdf                 # Show document info
zpdf bench document.pdf                # Run benchmark

Python

import zpdf

with zpdf.Document("file.pdf") as doc:
    print(doc.page_count)

    # Single page
    text = doc.extract_page(0)

    # All pages (accuracy mode is default)
    all_text = doc.extract_all()

    # Fast mode (higher throughput, stream-order extraction)
    fast_text = doc.extract_all(mode="fast")

    # Page info
    info = doc.get_page_info(0)
    print(f"{info.width}x{info.height}")

# Zero-copy memory open (unsafe semantics for other language bindings)
with zpdf.Document.open_memory_unsafe(open("file.pdf", "rb").read()) as doc:
    print(doc.page_count)

Build the shared library first:

zig build -Doptimize=ReleaseFast
PYTHONPATH=python python3 examples/basic.py

Project Structure

src/
├── root.zig         # Document API and core types
├── main.zig         # CLI entry point
├── capi.zig         # C ABI exports for FFI
├── wapi.zig         # WASM API exports
├── parser.zig       # PDF object parser
├── xref.zig         # XRef table/stream parsing
├── pagetree.zig     # Page tree resolution
├── decompress.zig   # Stream decompression filters
├── encoding.zig     # Font encoding and CMap parsing
├── agl.zig          # Adobe Glyph List mappings
├── cff.zig          # CFF/Type1 font parsing
├── interpreter.zig  # Content stream interpreter
├── structtree.zig   # Structure tree parser (PDF/UA)
├── layout.zig       # Text layout and bounding boxes
├── markdown.zig     # Markdown export
└── simd.zig         # SIMD-accelerated parsing

python/zpdf/         # Python bindings (cffi)
examples/            # Usage examples

Reading Order

zpdf extracts text in logical reading order using a three-tier approach:

Structure Tree (preferred): Uses the PDF's semantic structure for tagged/accessible PDFs (PDF/UA). Correctly handles multi-column layouts, sidebars, tables, and captions.
Geometric Sort (fallback): When no structure tree exists, sorts text spans by Y→X position to approximate visual reading order.
Stream Order (last resort): When bounding box extraction fails, falls back to raw PDF content stream order.

| Method | Pros | Cons |
|--------|------|------|
| Structure tree | Correct semantic order, handles complex layouts | Only works on tagged PDFs |
| Geometric sort | Works on any PDF, respects visual layout | May fail on complex multi-column layouts |
| Stream order | Always works | May not match visual order |
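
As a rough illustration of the geometric fallback described above, the following Python sketch sorts spans top to bottom, then left to right. It is not zpdf's implementation: the `Span` type, the `geometric_order` function, and the `line_tolerance` parameter are hypothetical names chosen for this example, and it assumes PDF user-space coordinates, where y grows upward from the bottom of the page.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical text span: baseline origin (x, y) plus its text."""
    x: float
    y: float
    text: str


def geometric_order(spans, line_tolerance=2.0):
    """Return spans in approximate Y->X reading order.

    Assumption: y increases upward (PDF user space), so a larger y
    means higher on the page. Spans whose y lies within
    `line_tolerance` of the first span of the current line are
    treated as belonging to that line.
    """
    # Pass 1: top to bottom.
    top_down = sorted(spans, key=lambda s: -s.y)

    # Pass 2: group spans with nearly equal y into lines.
    lines = []
    for span in top_down:
        if lines and abs(lines[-1][0].y - span.y) <= line_tolerance:
            lines[-1].append(span)
        else:
            lines.append([span])

    # Pass 3: left to right within each line.
    return [s for line in lines for s in sorted(line, key=lambda s: s.x)]


example = [
    Span(300.0, 700.0, "world"),
    Span(72.0, 700.5, "Hello"),
    Span(72.0, 680.0, "next line"),
]
print([s.text for s in geometric_order(example)])
# prints ['Hello', 'world', 'next line']
```

Because every span at a similar height is merged into one line, two side-by-side columns are interleaved row by row. This is the multi-column weakness listed for the geometric sort in the table above.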

Comparison

| Feature | zpdf | pdfium | MuPDF |
|---------|------|--------|-------|
| Text Extraction | | | |
| Stream order | Yes | Yes | Yes |
| Tagged/structure tree | Yes | No | Yes |
| Visual reading order | No | No | Yes |
| Word bounding boxes | Yes | Yes | Yes |
| Font Support | | | |
| WinAnsi/MacRoman | Yes | Yes | Yes |
| ToUnicode CMap | Yes | Yes | Yes |
| CID fonts (Type0) | Partial* | Yes | Yes |
| Compression | | | |
| FlateDecode, LZW, ASCII85/Hex | Yes | Yes | Yes |
| JBIG2, JPEG2000 | No | Yes | Yes |
| Other | | | |
| Encrypted PDFs | No | Yes | Yes |
| Rendering | No | Yes | Yes |

*CID fonts: Works when CMap is embedded directly.

Use zpdf when: Batch processing, tagged PDFs (PDF/UA), simple text extraction, Zig integration.

Use pdfium when: Browser integration, full PDF support, proven stability.

Use MuPDF when: Complex visual layouts, rendering needed.

License

CC0 - Public Domain
