Transmutation

High-performance document conversion engine for AI/LLM embeddings

Transmutation is a pure Rust document conversion engine that transforms a range of file formats into optimized text and image outputs for LLM processing and vector embeddings. It is a core component of the HiveLLM Vectorizer ecosystem and a high-performance alternative to Docling, with faster conversion, lower memory usage, and no runtime dependencies for its core features.

🎯 Project Goals

Pure Rust implementation - No Python dependencies, maximum performance
Convert documents to LLM-friendly formats (Markdown, Images, JSON)
Optimize output for embedding generation (text and multimodal)
Maintain maximum quality with minimum size
Competitor to Docling - 98x faster on average in the benchmark below, more efficient, and easier to deploy
Seamless integration with HiveLLM Vectorizer

📊 Benchmark Results

Transmutation vs Docling (Fast Mode - Pure Rust):

| Metric | Paper 1 (15 pages) | Paper 2 (25 pages) | Average |
|--------|--------------------|--------------------|---------|
| Similarity | 76.36% | 84.44% | 80.40% |
| Speed | 108x faster | 88x faster | 98x faster |
| Time (Docling) | 31.36s | 40.56s | ~35s |
| Time (Transmutation) | 0.29s | 0.46s | ~0.37s |

✅ 80% similarity - Acceptable for most use cases
✅ 98x faster - Near-instant conversion
✅ Pure Rust - No Python/ML dependencies
✅ Low memory - 50 MB footprint
🎯 Goal: 95% similarity (Precision Mode with C++ FFI - in development)

See BENCHMARK_COMPARISON.md for detailed results.
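The speed figures follow directly from the timings in the table: each factor is the Docling time divided by the Transmutation time, and the average is the mean of the two factors. The sketch below reproduces that arithmetic; the function names `speedup` and `mean` are illustrative and are not part of Transmutation.

```rust
/// Speedup factor: how many times faster the candidate is than the baseline.
fn speedup(baseline_secs: f64, candidate_secs: f64) -> f64 {
    baseline_secs / candidate_secs
}

/// Arithmetic mean of a slice; returns 0.0 for an empty slice.
fn mean(values: &[f64]) -> f64 {
    if values.is_empty() {
        return 0.0;
    }
    values.iter().sum::<f64>() / values.len() as f64
}

fn main() {
    // (Docling seconds, Transmutation seconds), copied from the benchmark table.
    let timings = [(31.36, 0.29), (40.56, 0.46)];
    let factors: Vec<f64> = timings.iter().map(|&(d, t)| speedup(d, t)).collect();

    for (i, factor) in factors.iter().enumerate() {
        println!("Paper {}: {:.0}x faster", i + 1, factor);
    }
    println!("Average: {:.0}x faster", mean(&factors));
}
```

Note that the 98x figure is the mean of the per-paper factors, not the ratio of the averaged times.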

📋 Supported Formats

Document Formats

| Input Format | Output Options | Status | Modes |
|-------------|----------------|---------|-------|
| PDF | Image per page, Markdown (per page/full), JSON | ✅ Production | Fast, Precision, FFI |
| DOCX | Image per page, Markdown (per page/full), JSON | ✅ Production | Pure Rust + LibreOffice |
| XLSX | Markdown tables, CSV, JSON | ✅ Production | Pure Rust (148 pg/s) |
| PPTX | Image per slide, Markdown per slide | ✅ Production | Pure Rust (1639 pg/s) |
| HTML | Markdown, JSON | ✅ Production | Pure Rust (2110 pg/s) |
| XML | Markdown, JSON | ✅ Production | Pure Rust (2353 pg/s) |
| TXT | Markdown, JSON | ✅ Production | Pure Rust (2805 pg/s) |
| CSV/TSV | Markdown tables, JSON | ✅ Production | Pure Rust (2647 pg/s) |
| RTF | Markdown, JSON | ⚠️ Beta | Pure Rust (simplified parser) |
| ODT | Markdown, JSON | ⚠️ Beta | Pure Rust (ZIP + XML) |
| MD | Markdown (normalized), JSON | 🔄 Planned | - |

Image Formats (OCR)

| Input Format | Output Options | OCR Engine | Status |
|-------------|----------------|------------|---------|
| JPG/JPEG | Markdown (OCR), JSON | Tesseract | ✅ Production |
| PNG | Markdown (OCR), JSON | Tesseract | ✅ Production |
| TIFF/TIF | Markdown (OCR), JSON | Tesseract | ✅ Production |
| BMP | Markdown (OCR), JSON | Tesseract | ✅ Production |
| GIF | Markdown (OCR), JSON | Tesseract | ✅ Production |
| WEBP | Markdown (OCR), JSON | Tesseract | ✅ Production |

Audio/Video Formats

| Input Format | Output Options | Engine | Status |
|-------------|----------------|---------|---------|
| MP3 | Markdown (transcription), JSON | Whisper | ✅ Production |
| WAV | Markdown (transcription), JSON | Whisper | ✅ Production |
| M4A | Markdown (transcription), JSON | Whisper | ✅ Production |
| FLAC | Markdown (transcription), JSON | Whisper | ✅ Production |
| OGG | Markdown (transcription), JSON | Whisper | ✅ Production |
| MP4 | Markdown (transcription), JSON | FFmpeg + Whisper | ✅ Production |
| AVI | Markdown (transcription), JSON | FFmpeg + Whisper | ✅ Production |
| MKV | Markdown (transcription), JSON | FFmpeg + Whisper | ✅ Production |
| MOV | Markdown (transcription), JSON | FFmpeg + Whisper | ✅ Production |
| WEBM | Markdown (transcription), JSON | FFmpeg + Whisper | ✅ Production |

Archive Formats

| Input Format | Output Options | Status | Performance |
|-------------|----------------|---------|-------------|
| ZIP | File listing, statistics, Markdown index, JSON | ✅ Production | Pure Rust (1864 pg/s) |
| TAR/GZ | Extract and process contents | 🔄 Planned | - |
| 7Z | Extract and process contents | 🔄 Planned | - |

🚀 Quick Start

Installation

Windows MSI Installer:

# Download from releases or build:
.\build-msi.ps1
msiexec /i target\wix\transmutation-0.3.0-x86_64.msi

See docs/MSI_BUILD.md for details.

Cargo:

# Add to Cargo.toml
[dependencies]
transmutation = "0.2"

# Core features (always enabled, no flags needed):
# - PDF, HTML, XML, ZIP, TXT, CSV, TSV, RTF, ODT

# With Office formats (default)
[dependencies.transmutation]
version = "0.2"
features = ["office"]  # DOCX, XLSX, PPTX

# With optional features (these require external tools); use in place of the features line above
features = ["office", "pdf-to-image", "tesseract", "audio"]

External Dependencies

Transmutation is mostly pure Rust. Core features require no external dependencies; optional features rely on the external tools listed below:

| Feature | Requires | Status |
|---------|----------|---------|
| Core (PDF, HTML, XML, ZIP, TXT, CSV, TSV, RTF, ODT) | ✅ None | Always enabled |
| office (DOCX, XLSX, PPTX - Text) | ✅ None | Pure Rust (default) |
| pdf-to-image | ⚠️ poppler-utils | Optional |
| office + images | ⚠️ LibreOffice | Optional |
| image-ocr | ⚠️ Tesseract OCR | Optional |
| audio | ⚠️ Whisper CLI | Optional |
| video | ⚠️ FFmpeg + Whisper | Optional |
| archives-extended (TAR, GZ, 7Z) | ⚠️ tar, flate2 crates | Optional |

During compilation, build.rs will automatically detect missing dependencies and provide installation instructions:

cargo build --features "pdf-to-image"

# If pdftoppm is missing, you'll see:
⚠️  Optional External Dependencies Missing

  ❌ pdftoppm (poppler-utils): PDF → Image conversion
     Install: sudo apt-get install poppler-utils

📖 Quick install (all dependencies):
   ./install/install-deps-linux.sh

Installation scripts are provided for all platforms:

Linux: ./install/install-deps-linux.sh
macOS: ./install/install-deps-macos.sh
Windows: .\install\install-deps-windows.ps1 (or .bat)

See install/README.md for detailed instructions.
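For readers curious how a build-time check like the one described above can work, here is a minimal sketch of a build script that probes for `pdftoppm` and emits a Cargo warning if it is missing. This is not Transmutation's actual build.rs: the helper names `tool_available` and `missing_tool_warning` and the message wording are assumptions made for illustration. It relies only on documented Cargo behavior: a `cargo:warning=` line printed by a build script is shown as a warning, and each enabled feature is exposed as a `CARGO_FEATURE_<NAME>` environment variable.

```rust
use std::process::{Command, Stdio};

/// Returns true if `tool` can be spawned from PATH.
/// The exit status is ignored: a tool that starts and exits non-zero still exists.
fn tool_available(tool: &str) -> bool {
    Command::new(tool)
        .arg("-v")
        .stdin(Stdio::null())
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .status()
        .is_ok()
}

/// Formats a line that Cargo displays as a build warning when a build script prints it.
/// The wording is hypothetical, not Transmutation's real message.
fn missing_tool_warning(tool: &str, purpose: &str, install_hint: &str) -> String {
    format!(
        "cargo:warning={} not found ({}). Install: {}",
        tool, purpose, install_hint
    )
}

fn main() {
    // Cargo sets CARGO_FEATURE_PDF_TO_IMAGE when the `pdf-to-image` feature is enabled.
    let feature_enabled = std::env::var_os("CARGO_FEATURE_PDF_TO_IMAGE").is_some();

    if feature_enabled && !tool_available("pdftoppm") {
        println!(
            "{}",
            missing_tool_warning(
                "pdftoppm",
                "PDF to image conversion",
                "sudo apt-get install poppler-utils",
            )
        );
    }
}
```

Because the check only warns, the build still succeeds when the tool is missing, which matches the "optional" status of these dependencies.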

📖 Usage Guide

CLI Usage

Basic Conversion:

# Convert PDF to Markdown
transmutation convert document.pdf -o output.md

# Convert DOCX to Markdown with images
transmutation convert report.docx -o output.md --extract-images

# Convert with precision mode (77% similarity)
transmutation convert paper.pdf -o output.md --precision

# Convert multiple files
transmutation batch *.pdf -o output/ --parallel 4

Format-Specific Examples:

# PDF → Markdown (split by pages)
transmutation convert document.pdf -o output/ --split-pages

# DOCX → Markdown + Images
transmutation convert report.docx -o output.md --images

# XLSX → CSV
transmutation convert data.xlsx -o output.csv --format csv

# PPTX → Markdown (one file per slide)
transmutation convert slides.pptx -o output/ --split-slides

# Image OCR → Markdown
transmutation convert scan.jpg -o output.md --ocr --lang eng

# ZIP → Extract and convert all
transmutation convert archive.zip -o output/ --recursive

Advanced Options:

# Optimize for LLM embeddings
transmutation convert document.pdf \
  --optimize-llm \
  --max-chunk-size 512 \
  --remove-headers \
  --normalize-whitespace

# High-quality image extraction
transmutation convert document.pdf \
  --extract-images \
  --dpi 300 \
  --image-quality high

# Batch processing with progress
transmutation batch papers/*.pdf \
  -o converted/ \
  --parallel 8 \
  --progress \
  --format markdown
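To give a rough idea of what options such as `--normalize-whitespace` and `--max-chunk-size` might do to the text before embedding, the sketch below collapses whitespace and splits the result into bounded chunks. It is an illustration only and not Transmutation's implementation: the README does not say which unit `--max-chunk-size` counts, so this sketch assumes whitespace-separated words, and both function names are hypothetical.

```rust
/// Collapses every run of whitespace (spaces, tabs, newlines) into a single
/// space and trims both ends.
fn normalize_whitespace(text: &str) -> String {
    text.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Splits text into chunks of at most `max_words` words each.
/// Assumption: the chunk size is measured in words, not tokens or bytes.
fn chunk_by_words(text: &str, max_words: usize) -> Vec<String> {
    assert!(max_words > 0, "max_words must be positive");
    let words: Vec<&str> = text.split_whitespace().collect();
    words.chunks(max_words).map(|chunk| chunk.join(" ")).collect()
}

fn main() {
    let raw = "Page   1\n\n  Transmutation converts\tdocuments   for embeddings. ";
    let clean = normalize_whitespace(raw);
    println!("normalized: {}", clean);

    for (i, chunk) in chunk_by_words(&clean, 3).iter().enumerate() {
        println!("chunk {}: {}", i, chunk);
    }
}
```

A production chunker would more likely count model tokens and respect sentence or section boundaries; word counting keeps the example dependency-free.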

Library Usage (Rust)

Basic Conversion:

use transmutation::{Converter, OutputFormat, ConversionOptions};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Initialize converter
    let converter = Converter::new()?;
    
    // Convert PDF to Markdown
    let result = converter
        .convert("document.pdf")
        .to(OutputFormat::Markdown)
        .with_options(ConversionOptions {
            split_pages: true,
            optimize_for_llm: true,
            ..Default::default()
        })
        .execute()
        .await?;
    
    // Save output
    result.save("output/document.md").await?;
    
    println!("Converted {} pages", result.page_count());
    Ok(())
}

Batch Processing

use transmutation::{Converter, BatchProcessor, OutputFormat};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let converter = Converter::new()?;
    let batch = BatchProcessor::new(converter);
    
    // Process multiple files
    let results = batch
        .add_files(&["doc1.pdf", "doc2.docx", "doc3.pptx"])
        .to(OutputFormat::Markdown)
        .parallel(4)
        .execute()
        .await?;
    
    for (file, result) in results {
        println!("{}: {} -> {}", file, result.input_size(), result.output_size());
    }
    
    Ok(())
}
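The README does not describe how `.parallel(4)` or `--parallel` schedules work internally. As a rough mental model, the sketch below bounds concurrency using only the standard library: inputs are split into at most `workers` slices and each slice is handled on its own scoped thread. `run_parallel` is a hypothetical helper, and the real library, being async, may well use a different strategy.

```rust
use std::thread;

/// Applies `job` to every input using at most `workers` threads and
/// returns the results in input order.
fn run_parallel<T, R, F>(inputs: &[T], workers: usize, job: F) -> Vec<R>
where
    T: Sync,
    R: Send,
    F: Fn(&T) -> R + Sync,
{
    let workers = workers.max(1);
    // Ceiling division so every input lands in exactly one slice.
    let per_worker = ((inputs.len() + workers - 1) / workers).max(1);
    let job = &job;

    thread::scope(|scope| {
        // Spawn all workers first so they run concurrently.
        let handles: Vec<_> = inputs
            .chunks(per_worker)
            .map(|slice| scope.spawn(move || slice.iter().map(job).collect::<Vec<R>>()))
            .collect();

        // Join in spawn order, which preserves input order.
        handles
            .into_iter()
            .flat_map(|handle| handle.join().expect("worker thread panicked"))
            .collect()
    })
}

fn main() {
    let files = ["doc1.pdf", "doc2.docx", "doc3.pptx"];
    // Stand-in job: a real pipeline would run a conversion here.
    let name_lengths = run_parallel(&files, 4, |name| name.len());

    for (file, len) in files.iter().zip(&name_lengths) {
        println!("{}: {}", file, len);
    }
}
```

Static slicing is the simplest scheme; a work queue would balance better when some documents take much longer than others.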

Vectorizer Integration

use transmutation::{Converter, OutputFormat};
use vectorizer::VectorizerClient;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let converter = Converter::new()?;
    let vectorizer = VectorizerClient::new("http://localhost:15002").await?;
    
    // Convert and embed in one pipeline
