Transmutation
Transmutation is a Rust-based document conversion module designed to transform various file formats into optimized text and image outputs suitable for LLM processing and vector embeddings. Built as a core component of the HiveLLM Vectorizer ecosystem, it leverages [Docling](https://github.com/docling-project) for advanced document understanding.
High-performance document conversion engine for AI/LLM embeddings
Transmutation is a pure Rust document conversion engine designed to transform various file formats into optimized text and image outputs suitable for LLM processing and vector embeddings. Built as a core component of the HiveLLM Vectorizer ecosystem, Transmutation is a high-performance alternative to Docling, offering superior speed, lower memory usage, and zero runtime dependencies.
🎯 Project Goals
- Pure Rust implementation - No Python dependencies, maximum performance
- Convert documents to LLM-friendly formats (Markdown, Images, JSON)
- Optimize output for embedding generation (text and multimodal)
- Maintain maximum quality with minimum size
- Compete with Docling - 98x faster on average in the benchmarks below, with lower memory usage and simpler deployment
- Seamless integration with HiveLLM Vectorizer
📊 Benchmark Results
Transmutation vs Docling (Fast Mode - Pure Rust):
| Metric | Paper 1 (15 pages) | Paper 2 (25 pages) | Average |
|--------|--------------------|--------------------|---------|
| Similarity | 76.36% | 84.44% | 80.40% |
| Speed | 108x faster | 88x faster | 98x faster |
| Time (Docling) | 31.36s | 40.56s | ~35s |
| Time (Transmutation) | 0.29s | 0.46s | ~0.37s |
- ✅ 80% similarity - Acceptable for most use cases
- ✅ 98x faster - Near-instant conversion
- ✅ Pure Rust - No Python/ML dependencies
- ✅ Low memory - 50 MB footprint
- 🎯 Goal: 95% similarity (Precision Mode with C++ FFI - in development)
See BENCHMARK_COMPARISON.md for detailed results.
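As a quick check on how the speed figures relate to the timings, the sketch below recomputes them from the table. It assumes "Nx faster" means Docling time divided by Transmutation time and that the average is the plain mean of the two per-paper factors; the function names are illustrative and not part of the crate.

```rust
/// Speedup factor: how many times faster the candidate is than the baseline.
fn speedup(baseline_secs: f64, candidate_secs: f64) -> f64 {
    baseline_secs / candidate_secs
}

/// Arithmetic mean of a slice of values.
fn mean(values: &[f64]) -> f64 {
    values.iter().sum::<f64>() / values.len() as f64
}

fn main() {
    // (Docling seconds, Transmutation seconds), copied from the benchmark table.
    let runs = [(31.36, 0.29), (40.56, 0.46)];

    let factors: Vec<f64> = runs.iter().map(|&(d, t)| speedup(d, t)).collect();
    for (i, factor) in factors.iter().enumerate() {
        println!("Paper {}: {:.0}x faster", i + 1, factor);
    }
    println!("Average: {:.0}x faster", mean(&factors));
}
```

Under those assumptions the rounded results match the 108x, 88x, and 98x figures reported in the table.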
📋 Supported Formats
Document Formats
| Input Format | Output Options | Status | Modes |
|-------------|----------------|---------|-------|
| PDF | Image per page, Markdown (per page/full), JSON | ✅ Production | Fast, Precision, FFI |
| DOCX | Image per page, Markdown (per page/full), JSON | ✅ Production | Pure Rust + LibreOffice |
| XLSX | Markdown tables, CSV, JSON | ✅ Production | Pure Rust (148 pg/s) |
| PPTX | Image per slide, Markdown per slide | ✅ Production | Pure Rust (1639 pg/s) |
| HTML | Markdown, JSON | ✅ Production | Pure Rust (2110 pg/s) |
| XML | Markdown, JSON | ✅ Production | Pure Rust (2353 pg/s) |
| TXT | Markdown, JSON | ✅ Production | Pure Rust (2805 pg/s) |
| CSV/TSV | Markdown tables, JSON | ✅ Production | Pure Rust (2647 pg/s) |
| RTF | Markdown, JSON | ⚠️ Beta | Pure Rust (simplified parser) |
| ODT | Markdown, JSON | ⚠️ Beta | Pure Rust (ZIP + XML) |
| MD | Markdown (normalized), JSON | 🔄 Planned | - |
Image Formats (OCR)
| Input Format | Output Options | OCR Engine | Status |
|-------------|----------------|------------|---------|
| JPG/JPEG | Markdown (OCR), JSON | Tesseract | ✅ Production |
| PNG | Markdown (OCR), JSON | Tesseract | ✅ Production |
| TIFF/TIF | Markdown (OCR), JSON | Tesseract | ✅ Production |
| BMP | Markdown (OCR), JSON | Tesseract | ✅ Production |
| GIF | Markdown (OCR), JSON | Tesseract | ✅ Production |
| WEBP | Markdown (OCR), JSON | Tesseract | ✅ Production |
Audio/Video Formats
| Input Format | Output Options | Engine | Status |
|-------------|----------------|---------|---------|
| MP3 | Markdown (transcription), JSON | Whisper | ✅ Production |
| WAV | Markdown (transcription), JSON | Whisper | ✅ Production |
| M4A | Markdown (transcription), JSON | Whisper | ✅ Production |
| FLAC | Markdown (transcription), JSON | Whisper | ✅ Production |
| OGG | Markdown (transcription), JSON | Whisper | ✅ Production |
| MP4 | Markdown (transcription), JSON | FFmpeg + Whisper | ✅ Production |
| AVI | Markdown (transcription), JSON | FFmpeg + Whisper | ✅ Production |
| MKV | Markdown (transcription), JSON | FFmpeg + Whisper | ✅ Production |
| MOV | Markdown (transcription), JSON | FFmpeg + Whisper | ✅ Production |
| WEBM | Markdown (transcription), JSON | FFmpeg + Whisper | ✅ Production |
Archive Formats
| Input Format | Output Options | Status | Performance |
|-------------|----------------|---------|-------------|
| ZIP | File listing, statistics, Markdown index, JSON | ✅ Production | Pure Rust (1864 pg/s) |
| TAR/GZ | Extract and process contents | 🔄 Planned | - |
| 7Z | Extract and process contents | 🔄 Planned | - |
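To make the format tables concrete, here is a minimal sketch of how an input path could be routed to a processing family by file extension. The `Route` enum and `route_for` function are hypothetical names invented for this illustration; the crate's real dispatch logic may look quite different. Formats listed as Planned (MD, TAR/GZ, 7Z) are treated as unsupported.

```rust
use std::path::Path;

/// Processing family for an input file, following the format tables.
/// Illustrative only: this enum is not part of the transmutation crate.
#[derive(Debug, PartialEq)]
enum Route {
    Document,
    ImageOcr, // Tesseract
    Audio,    // Whisper
    Video,    // FFmpeg + Whisper
    Archive,
    Unsupported,
}

/// Pick a route from the file extension, ignoring case.
fn route_for(path: &str) -> Route {
    let ext = Path::new(path)
        .extension()
        .and_then(|e| e.to_str())
        .map(|e| e.to_ascii_lowercase())
        .unwrap_or_default();

    match ext.as_str() {
        "pdf" | "docx" | "xlsx" | "pptx" | "html" | "xml" | "txt" | "csv" | "tsv" | "rtf"
        | "odt" => Route::Document,
        "jpg" | "jpeg" | "png" | "tiff" | "tif" | "bmp" | "gif" | "webp" => Route::ImageOcr,
        "mp3" | "wav" | "m4a" | "flac" | "ogg" => Route::Audio,
        "mp4" | "avi" | "mkv" | "mov" | "webm" => Route::Video,
        "zip" => Route::Archive,
        // Planned formats and anything unknown fall through here.
        _ => Route::Unsupported,
    }
}

fn main() {
    for file in ["paper.PDF", "scan.jpg", "talk.mp4", "bundle.zip", "notes.md"] {
        println!("{file}: {:?}", route_for(file));
    }
}
```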
🚀 Quick Start
Installation
Windows MSI Installer:
# Download from releases or build:
.\build-msi.ps1
msiexec /i target\wix\transmutation-0.3.0-x86_64.msi
See docs/MSI_BUILD.md for details.
Cargo:
# Add to Cargo.toml
[dependencies]
transmutation = "0.2"
# Core features (always enabled, no flags needed):
# - PDF, HTML, XML, ZIP, TXT, CSV, TSV, RTF, ODT
# With Office formats (default)
[dependencies.transmutation]
version = "0.2"
features = ["office"] # DOCX, XLSX, PPTX
# With optional features (these require external tools), use this features line instead:
features = ["office", "pdf-to-image", "tesseract", "audio"]
External Dependencies
Transmutation is mostly pure Rust, and the core features require no external tools:
| Feature | Requires | Status |
|---------|----------|---------|
| Core (PDF, HTML, XML, ZIP, TXT, CSV, TSV, RTF, ODT) | ✅ None | Always enabled |
| office (DOCX, XLSX, PPTX - Text) | ✅ None | Pure Rust (default) |
| pdf-to-image | ⚠️ poppler-utils | Optional |
| office + images | ⚠️ LibreOffice | Optional |
| image-ocr | ⚠️ Tesseract OCR | Optional |
| audio | ⚠️ Whisper CLI | Optional |
| video | ⚠️ FFmpeg + Whisper | Optional |
| archives-extended (TAR, GZ, 7Z) | ⚠️ tar, flate2 crates | Optional |
During compilation, build.rs will automatically detect missing dependencies and provide installation instructions:
cargo build --features "pdf-to-image"
# If pdftoppm is missing, you'll see:
⚠️ Optional External Dependencies Missing
❌ pdftoppm (poppler-utils): PDF → Image conversion
Install: sudo apt-get install poppler-utils
📖 Quick install (all dependencies):
./install/install-deps-linux.sh
Installation scripts are provided for all platforms:
- Linux: ./install/install-deps-linux.sh
- macOS: ./install/install-deps-macos.sh
- Windows: .\install\install-deps-windows.ps1 (or .bat)
See install/README.md for detailed instructions.
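The build-time detection described above can be approximated with a short build-script-style check. This is a sketch under assumptions, not the project's actual build.rs: `tool_available` is a hypothetical helper, the check only tests whether the executable can be started, and the warning text is modeled on the sample output. The `cargo:warning=` prefix is the standard way for a Cargo build script to surface a message.

```rust
use std::process::{Command, Stdio};

/// Returns true if `tool` could be started from PATH.
/// The exit code is ignored; only a failed spawn counts as "missing".
fn tool_available(tool: &str) -> bool {
    Command::new(tool)
        .arg("-v")
        .stdin(Stdio::null())
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .status()
        .is_ok()
}

fn main() {
    // (executable, package, purpose). Only the poppler entry comes from the
    // README; further tools would be added here in the same shape.
    let tools = [("pdftoppm", "poppler-utils", "PDF → Image conversion")];

    for (exe, package, purpose) in tools {
        if !tool_available(exe) {
            // Cargo prints lines with this prefix as build warnings.
            println!("cargo:warning={exe} ({package}) is missing: {purpose}");
            println!("cargo:warning=Install: sudo apt-get install {package}");
        }
    }
}
```

Emitting a warning rather than failing the build fits the idea that these tools are optional.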
📖 Usage Guide
CLI Usage
Basic Conversion:
# Convert PDF to Markdown
transmutation convert document.pdf -o output.md
# Convert DOCX to Markdown with images
transmutation convert report.docx -o output.md --extract-images
# Convert with precision mode (77% similarity)
transmutation convert paper.pdf -o output.md --precision
# Convert multiple files
transmutation batch *.pdf -o output/ --parallel 4
Format-Specific Examples:
# PDF → Markdown (split by pages)
transmutation convert document.pdf -o output/ --split-pages
# DOCX → Markdown + Images
transmutation convert report.docx -o output.md --images
# XLSX → CSV
transmutation convert data.xlsx -o output.csv --format csv
# PPTX → Markdown (one file per slide)
transmutation convert slides.pptx -o output/ --split-slides
# Image OCR → Markdown
transmutation convert scan.jpg -o output.md --ocr --lang eng
# ZIP → Extract and convert all
transmutation convert archive.zip -o output/ --recursive
Advanced Options:
# Optimize for LLM embeddings
transmutation convert document.pdf \
  --optimize-llm \
  --max-chunk-size 512 \
  --remove-headers \
  --normalize-whitespace
# High-quality image extraction
transmutation convert document.pdf \
  --extract-images \
  --dpi 300 \
  --image-quality high
# Batch processing with progress
transmutation batch papers/*.pdf \
  -o converted/ \
  --parallel 8 \
  --progress \
  --format markdown
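To illustrate what options such as --normalize-whitespace and --max-chunk-size could mean in practice, here is a standalone sketch. It is not the tool's implementation: both function names are hypothetical, and it assumes the chunk size is measured in characters, which the README does not specify (it may well be tokens).

```rust
/// Collapse every run of whitespace (including newlines) into one space
/// and trim both ends.
fn normalize_whitespace(text: &str) -> String {
    text.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Split text into chunks of at most `max_chars` characters, breaking only
/// between words. A single word longer than `max_chars` becomes its own
/// oversized chunk rather than being cut.
fn chunk_words(text: &str, max_chars: usize) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut current = String::new();

    for word in text.split_whitespace() {
        let word_len = word.chars().count();
        let needed = if current.is_empty() {
            word_len
        } else {
            current.chars().count() + 1 + word_len
        };
        // Close the current chunk if adding this word would overflow it.
        if needed > max_chars && !current.is_empty() {
            chunks.push(std::mem::take(&mut current));
        }
        if !current.is_empty() {
            current.push(' ');
        }
        current.push_str(word);
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}

fn main() {
    let raw = "  Transmutation   converts\n\n documents\tinto  Markdown. ";
    let clean = normalize_whitespace(raw);
    for (i, chunk) in chunk_words(&clean, 24).iter().enumerate() {
        println!("chunk {i}: {chunk:?}");
    }
}
```

Note that this normalization also flattens paragraph breaks; a real pipeline would likely preserve them for better chunk boundaries.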
Library Usage (Rust)
Basic Conversion:
use transmutation::{Converter, OutputFormat, ConversionOptions};
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Initialize converter
    let converter = Converter::new()?;
    // Convert PDF to Markdown
    let result = converter
        .convert("document.pdf")
        .to(OutputFormat::Markdown)
        .with_options(ConversionOptions {
            split_pages: true,
            optimize_for_llm: true,
            ..Default::default()
        })
        .execute()
        .await?;
    // Save output
    result.save("output/document.md").await?;
    println!("Converted {} pages", result.page_count());
    Ok(())
}
Batch Processing
use transmutation::{Converter, BatchProcessor, OutputFormat};
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let converter = Converter::new()?;
    let batch = BatchProcessor::new(converter);
    // Process multiple files
    let results = batch
        .add_files(&["doc1.pdf", "doc2.docx", "doc3.pptx"])
        .to(OutputFormat::Markdown)
        .parallel(4)
        .execute()
        .await?;
    for (file, result) in results {
        println!("{}: {} -> {}", file, result.input_size(), result.output_size());
    }
    Ok(())
}
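For readers curious what a setting like `.parallel(4)` might do conceptually, the following sketch runs a job over a list of items with a fixed number of worker threads using only the standard library. It is an illustration, not how `BatchProcessor` is implemented: `process_in_parallel` is a hypothetical helper, and the "conversion" here is a placeholder closure.

```rust
use std::sync::Mutex;
use std::thread;

/// Run `job` over `items` using up to `workers` threads.
/// Results are returned in the same order as the inputs.
fn process_in_parallel<T, R, F>(items: Vec<T>, workers: usize, job: F) -> Vec<R>
where
    T: Send,
    R: Send,
    F: Fn(T) -> R + Sync,
{
    // Shared work queue: each worker pulls the next (index, item) pair.
    let queue = Mutex::new(items.into_iter().enumerate());
    let results = Mutex::new(Vec::new());

    thread::scope(|s| {
        for _ in 0..workers.max(1) {
            s.spawn(|| loop {
                // The lock is released at the end of this statement,
                // so the job itself runs without holding the queue.
                let next = queue.lock().unwrap().next();
                match next {
                    Some((index, item)) => {
                        let output = job(item);
                        results.lock().unwrap().push((index, output));
                    }
                    None => break,
                }
            });
        }
    });

    // Restore input order, since workers finish in arbitrary order.
    let mut collected = results.into_inner().unwrap();
    collected.sort_by_key(|(index, _)| *index);
    collected.into_iter().map(|(_, output)| output).collect()
}

fn main() {
    let files = vec!["doc1.pdf", "doc2.docx", "doc3.pptx"];
    // Placeholder for a real conversion: report the file name length.
    let sizes = process_in_parallel(files.clone(), 4, |name| name.len());
    for (file, size) in files.iter().zip(sizes) {
        println!("{file}: {size}");
    }
}
```

A shared queue keeps all workers busy even when some files take much longer than others, which is common with mixed document types.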
Vectorizer Integration
use transmutation::{Converter, OutputFormat};
use vectorizer::VectorizerClient;
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
let converter = Converter::new()?;
let vectorizer = VectorizerClient::new("http://localhost:15002").await?;
// Convert and embed in one pipeline
