
Kreuzberg

A polyglot document intelligence framework with a Rust core. Extract text, metadata, and structured information from PDFs, Office documents, images, and other formats (91+ in total). Available for Rust, Python, Ruby, Java, Go, PHP, Elixir, C#, R, C, and TypeScript (Node/Bun/Wasm/Deno), or use it via the CLI, REST API, or MCP server.



Extract text and metadata from 91+ file formats, generate embeddings, and post-process results at native speed, with no GPU required.

Key Features

  • Extensible architecture – Plugin system for custom OCR backends, validators, post-processors, and document extractors
  • Polyglot – Native bindings for Rust, Python, TypeScript/Node.js, Ruby, Go, Java, C#, PHP, Elixir, R, and C
  • 91+ file formats – PDF, Office documents, images, HTML, XML, emails, archives, academic formats across 8 categories
  • OCR support – Tesseract (all bindings, including Tesseract-WASM for browsers), PaddleOCR (all native bindings), EasyOCR (Python), extensible via plugin API
  • High performance – Rust core with native PDFium, SIMD optimizations and full parallelism
  • Flexible deployment – Use as library, CLI tool, REST API server, or MCP server
  • Memory efficient – Streaming parsers for multi-GB files


Installation

Each language binding provides comprehensive documentation with examples and best practices. Choose your platform to get started:

Scripting Languages:

  • Python – PyPI package, async/sync APIs, OCR backends (Tesseract, PaddleOCR, EasyOCR)
  • Ruby – RubyGems package, idiomatic Ruby API, native bindings
  • PHP – Composer package, modern PHP 8.4+ support, type-safe API, async extraction
  • Elixir – Hex package, OTP integration, concurrent processing
  • R – r-universe package, idiomatic R API, extendr bindings

JavaScript/TypeScript:

  • @kreuzberg/node – Native NAPI-RS bindings for Node.js/Bun, fastest performance
  • @kreuzberg/wasm – WebAssembly for browsers/Deno/Cloudflare Workers, full feature parity (PDF, Excel, OCR, archives)

Compiled Languages:

  • Go – Go module with FFI bindings, context-aware async
  • Java – Maven Central, Foreign Function & Memory API
  • C# – NuGet package, .NET 6.0+, full async/await support

Native:

  • Rust – Core library, flexible feature flags, zero-copy APIs
  • C (FFI) – C header + shared library, pkg-config/CMake support, cross-platform

Containers:

  • Docker – Official images with API, CLI, and MCP server modes (Core: ~1.0-1.3GB, Full: ~1.0-1.3GB with OCR + legacy format support)

Command-Line:

  • CLI – Cross-platform binary, batch processing, MCP server mode

Language bindings ship precompiled binaries for x86_64 and aarch64 on Linux and for ARM64 (Apple Silicon) on macOS. See Platform Support for the full matrix, including Windows.

Platform Support

Precompiled binary availability for each language binding and platform:

| Language | Linux x86_64 | Linux aarch64 | macOS ARM64 | Windows x64 |
|----------|:------------:|:-------------:|:-----------:|:-----------:|
| Python | ✅ | ✅ | ✅ | ✅ |
| Node.js | ✅ | ✅ | ✅ | ✅ |
| WASM | ✅ | ✅ | ✅ | ✅ |
| Ruby | ✅ | ✅ | ✅ | - |
| R | ✅ | ✅ | ✅ | ✅ |
| Elixir | ✅ | ✅ | ✅ | ✅ |
| Go | ✅ | ✅ | ✅ | ✅ |
| Java | ✅ | ✅ | ✅ | ✅ |
| C# | ✅ | ✅ | ✅ | ✅ |
| PHP | ✅ | ✅ | ✅ | ✅ |
| Rust | ✅ | ✅ | ✅ | ✅ |
| C (FFI) | ✅ | ✅ | ✅ | ✅ |
| CLI | ✅ | ✅ | ✅ | ✅ |
| Docker | ✅ | ✅ | ✅ | - |

Note: ✅ = Precompiled binaries available with instant installation. WASM runs in any environment with WebAssembly support (browsers, Deno, Bun, Cloudflare Workers). All platforms are tested in CI. macOS support is Apple Silicon only.
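The support matrix above can be checked programmatically before installing. The following is a minimal sketch using only the Python standard library; the names `PREBUILT`, `normalize`, and `has_prebuilt` are hypothetical helpers written for this example and are not part of Kreuzberg.

```python
import platform

# Targets transcribed from the Platform Support table.
# Each entry is an (operating system, CPU architecture) pair.
ALL_TARGETS = frozenset({
    ("linux", "x86_64"),
    ("linux", "aarch64"),
    ("macos", "aarch64"),
    ("windows", "x86_64"),
})
NO_WINDOWS = ALL_TARGETS - {("windows", "x86_64")}

# Hypothetical lookup table: binding name -> targets with precompiled binaries.
PREBUILT = {
    name: ALL_TARGETS
    for name in (
        "Python", "Node.js", "WASM", "R", "Elixir", "Go",
        "Java", "C#", "PHP", "Rust", "C (FFI)", "CLI",
    )
}
PREBUILT["Ruby"] = NO_WINDOWS
PREBUILT["Docker"] = NO_WINDOWS


def normalize(system, machine):
    """Map platform.system()/platform.machine() values onto the table's names."""
    os_names = {"linux": "linux", "darwin": "macos", "windows": "windows"}
    arch_names = {
        "x86_64": "x86_64", "amd64": "x86_64",
        "aarch64": "aarch64", "arm64": "aarch64",
    }
    system, machine = system.lower(), machine.lower()
    return os_names.get(system, system), arch_names.get(machine, machine)


def has_prebuilt(binding, system=None, machine=None):
    """Return True if the table lists a precompiled binary for this target."""
    target = normalize(system or platform.system(), machine or platform.machine())
    return target in PREBUILT.get(binding, frozenset())


if __name__ == "__main__":
    print("Python binding prebuilt here:", has_prebuilt("Python"))
```

Calling `has_prebuilt("Ruby", "Windows", "AMD64")` returns `False`, matching the `-` entry in the table, and an Intel Mac (`"Darwin"`, `"x86_64"`) is reported as unsupported because macOS support is Apple Silicon only.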

Embeddings Support (Optional)

To use embeddings functionality:

  1. Install ONNX Runtime 1.24+:

  2. Use embeddings in your code - see Embeddings Guide

Note: Kreuzberg requires ONNX Runtime version 1.24+ for embeddings. All other Kreuzberg features work without ONNX Runtime.
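The 1.24+ requirement can be verified before enabling embeddings. This is a minimal sketch, assuming ONNX Runtime was installed as the `onnxruntime` Python distribution; a system-level install (for example, a shared library from a package manager) would not be visible to this check. The helper names are hypothetical and not part of Kreuzberg.

```python
import importlib.metadata

# Minimum version stated in this README.
MIN_ONNXRUNTIME = (1, 24)


def parse_version(text):
    """Return the leading numeric components, e.g. '1.24.1' -> (1, 24, 1)."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(version_text, minimum=MIN_ONNXRUNTIME):
    """Compare a version string against the minimum using tuple ordering."""
    return parse_version(version_text) >= minimum


def installed_onnxruntime_ok():
    """Assumption: the runtime is installed as the 'onnxruntime' Python package."""
    try:
        version = importlib.metadata.version("onnxruntime")
    except importlib.metadata.PackageNotFoundError:
        return False
    return meets_minimum(version)


if __name__ == "__main__":
    print("ONNX Runtime 1.24+ available:", installed_onnxruntime_ok())
```

Comparing parsed tuples rather than raw strings avoids the common mistake of treating `"1.9"` as newer than `"1.24"`.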

Supported Formats

91+ file formats across 8 major categories with intelligent format detection and comprehensive metadata extraction.

Office Documents

| Category | Formats | Capabilities |
|----------|---------|--------------|
| Word Processing | .docx, .docm, .dotx, .dotm, .dot, .odt, .pages | Full text, tables, lists, images, metadata, styles |
| Spreadsheets | .xlsx, .xlsm, .xlsb, .xls, .xla, .xlam, .xltm, .xltx, .xlt, .ods, .numbers | Sheet data, formulas, cell metadata, charts |
| Presentations | .pptx, .pptm, .ppsx, .potx, .potm, .pot, .key | Slides, speaker notes, images, metadata |
| PDF | .pdf | Text, tables, images, metadata, OCR support |
| eBooks | .epub, .fb2 | Chapters, metadata, embedded resources |
| Database | .dbf | Table data extraction, field type support |
| Hangul | .hwp, .hwpx | Korean document format, text extraction |

Images (OCR-Enabled)

| Category | Formats | Features |
|----------|---------|----------|
| Raster | .png, .jpg, .jpeg, .gif, .webp, .bmp, .tiff, .tif | OCR, table detection, EXIF metadata, dimensions, color space |
| Advanced | .jp2, .jpx, .jpm, .mj2, .jbig2, .jb2, .pnm, .pbm, .pgm, .ppm | Pure Rust decoders (JPEG 2000, JBIG2), OCR, table detection |
| Vector | .svg | DOM parsing, embedded text, graphics metadata |
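The Office Documents and Images tables above can be read as a lookup from file extension to category. The sketch below builds that lookup with the Python standard library, which can be handy for pre-filtering files before extraction. It is only an illustration of the tables: it does not reproduce Kreuzberg's own format detection, and `CATEGORIES` and `category_for` are hypothetical names.

```python
from pathlib import Path
from typing import Optional

# Extensions transcribed from the Office Documents and Images tables.
CATEGORIES = {
    "Word Processing": [".docx", ".docm", ".dotx", ".dotm", ".dot", ".odt", ".pages"],
    "Spreadsheets": [".xlsx", ".xlsm", ".xlsb", ".xls", ".xla", ".xlam",
                     ".xltm", ".xltx", ".xlt", ".ods", ".numbers"],
    "Presentations": [".pptx", ".pptm", ".ppsx", ".potx", ".potm", ".pot", ".key"],
    "PDF": [".pdf"],
    "eBooks": [".epub", ".fb2"],
    "Database": [".dbf"],
    "Hangul": [".hwp", ".hwpx"],
    "Raster": [".png", ".jpg", ".jpeg", ".gif", ".webp", ".bmp", ".tiff", ".tif"],
    "Advanced": [".jp2", ".jpx", ".jpm", ".mj2", ".jbig2", ".jb2",
                 ".pnm", ".pbm", ".pgm", ".ppm"],
    "Vector": [".svg"],
}

# Invert the mapping so each extension points at its category.
EXTENSION_TO_CATEGORY = {
    ext: category for category, exts in CATEGORIES.items() for ext in exts
}


def category_for(filename: str) -> Optional[str]:
    """Return the table category for a file name, or None if it is not listed."""
    suffix = Path(filename).suffix.lower()
    return EXTENSION_TO_CATEGORY.get(suffix)


if __name__ == "__main__":
    for name in ("Report.PDF", "scan.tiff", "notes.txt"):
        print(name, "->", category_for(name))
```

Only the final suffix is considered and matching is case-insensitive, so `Report.PDF` maps to the PDF category while an unlisted extension such as `.txt` yields `None`.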

Web & Data

| Category | Formats | Features |
|----------|---------|----------|
| Markup | .html, .htm, .xhtml, .xml, .svg | DOM parsing, metadata (Open Graph, Twitter Card), link extraction |
| Structured Data | .json, .yaml, .yml, .toml, .csv, .tsv | Schema detect
