# gdelta
A fast delta compression algorithm for similar data chunks, implemented in pure Rust.
## Overview
gdelta is a Rust implementation of the GDelta algorithm by Haoliang Tan. It provides efficient delta encoding for
similar data chunks (typically 4KB - 64KB) commonly found in deduplication systems.
Key Features:
- Fast delta encoding and decoding with optional SIMD optimization
- Memory-safe implementation in pure Rust
- Simple, ergonomic API with full-featured CLI tool
- No unsafe code
- Thoroughly tested with comprehensive benchmarking suite
Performance:
Synthetic benchmarks (2,097 tests):
- Encoding: 496 MB/s average throughput
  - Small chunks (16KB): up to 1.08 GiB/s
  - Large data (256KB+): 370-400 MiB/s sustained
- Decoding: 4.4 GB/s average (9x faster than encoding)
  - Peak: 10.6 GiB/s on binary data
- Compression: 68% space saved (raw delta)
  - With zstd: 75% space saved, 305 MB/s
  - With lz4: 71% space saved, 430 MB/s
Real-world git repositories (1.36M comparisons):
- mdn/content: 97.0% space saved, 4µs encode, 0µs decode
- tokio: 94.5% space saved, 5µs encode, 0µs decode
Benchmarked on an AMD Ryzen 7 7800X3D (8 cores / 16 threads) running Fedora Linux 42. See PERFORMANCE.md for detailed analysis, including comparisons with xpatch, vcdiff, qbsdiff, and zstd_dict.
## Installation

### As a Library

Add this to your `Cargo.toml`:

```toml
[dependencies]
gdelta = "0.2"
```
### As a CLI Tool

Install using cargo:

```bash
cargo install gdelta --features cli
```

Or build from source:

```bash
git clone https://github.com/ImGajeed76/gdelta
cd gdelta
cargo build --release --features cli
# Binary at: target/release/gdelta
```
## Usage

### Library API

```rust
use gdelta::{encode, decode};

// Create base data and modified version
let base_data = b"Hello, World! This is some base data.";
let new_data = b"Hello, Rust! This is some modified data.";

// Encode the delta
let delta = encode(new_data, base_data)?;
println!("Delta size: {} bytes", delta.len());

// Decode to recover the new data
let recovered = decode(&delta, base_data)?;
assert_eq!(recovered, new_data);
```
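Deltas can also be chained across successive versions, storing one full base plus a series of small patches. The sketch below assumes the same `encode`/`decode` API shown above; the version contents are illustrative.

```rust
use gdelta::{encode, decode};

fn main() {
    // Three successive versions of the same small document.
    let v1 = b"config: version=1\nfeature_x = off\n".to_vec();
    let v2 = b"config: version=2\nfeature_x = on\n".to_vec();
    let v3 = b"config: version=3\nfeature_x = on\nfeature_y = on\n".to_vec();

    // Store only v1 plus two small deltas, each encoded against the
    // previous version.
    let d12 = encode(&v2, &v1).expect("encode v1 -> v2");
    let d23 = encode(&v3, &v2).expect("encode v2 -> v3");

    // Reconstruct v3 by applying the deltas in order.
    let r2 = decode(&d12, &v1).expect("decode v2");
    let r3 = decode(&d23, &r2).expect("decode v3");
    assert_eq!(r3, v3);
    println!("delta sizes: {} and {} bytes", d12.len(), d23.len());
}
```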
### CLI Tool

Use `gdelta help` to see the most up-to-date options and descriptions.
The CLI provides a simple interface for creating and applying delta patches:
Create a delta patch:

```bash
# Basic usage
gdelta encode old_file.bin new_file.bin -o patch.delta

# With compression (recommended)
gdelta encode old_file.bin new_file.bin -o patch.delta -c zstd
gdelta encode old_file.bin new_file.bin -o patch.delta -c lz4

# With verification
gdelta encode old_file.bin new_file.bin -o patch.delta --verify
```

Apply a delta patch:

```bash
# Auto-detects compression format
gdelta decode old_file.bin patch.delta -o new_file.bin

# Force specific format (if magic bytes conflict)
gdelta decode old_file.bin patch.delta -o new_file.bin --format zstd
```
Options:

- `-c, --compress <FORMAT>` - Compression: none, zstd, lz4 (default: none)
- `-v, --verify` - Verify delta after creation (encode only)
- `-y, --yes` - Skip memory warning prompts
- `-f, --force` - Overwrite existing files
- `-q, --quiet` - Suppress output except errors
Example workflow:

```bash
# Create compressed delta
gdelta encode database-v1.db database-v2.db -o update.delta -c zstd

# Later, apply the update
gdelta decode database-v1.db update.delta -o database-v2-restored.db

# Verify the result
diff database-v2.db database-v2-restored.db
```
Memory Management:
The CLI monitors memory usage and warns when operations might use >80% of available RAM. This is important because gdelta loads entire files into memory.
```text
# For large files, the tool will prompt:
⚠ Memory warning: This operation requires ~12.4 GB
  Available: 8.2 GB free (16 GB total)
  Continue? [y/N]:
```

Use `-y` to skip prompts in automated scripts.
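As a rough illustration of why the check matters: both inputs (and the output being produced) are held in RAM, so the required memory scales with the file sizes. The snippet below is a hypothetical pre-flight estimate, not the CLI's actual logic; the file names and sizing heuristic are assumptions.

```rust
use std::fs;

// Hypothetical pre-flight estimate: base + new held in memory, plus room
// for the delta itself (worst case on the order of the new file's size).
fn estimated_bytes_needed(base: &str, new: &str) -> std::io::Result<u64> {
    let base_len = fs::metadata(base)?.len();
    let new_len = fs::metadata(new)?.len();
    Ok(base_len + new_len + new_len)
}

fn main() -> std::io::Result<()> {
    let needed = estimated_bytes_needed("old_file.bin", "new_file.bin")?;
    println!("This operation requires roughly {:.1} GB", needed as f64 / 1e9);
    Ok(())
}
```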
## How It Works
GDelta uses:
- GEAR Rolling Hash - Fast fingerprinting for chunk boundaries
- Variable-Length Integer Encoding - Efficient space utilization
- Copy/Literal Instructions - Minimal delta representation
- Prefix/Suffix Matching - Optimized for common data patterns
The algorithm identifies matching regions between base and new data, then encodes only the differences as a series of copy and literal instructions.
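For intuition, here is a simplified sketch of a copy/literal instruction stream and how applying one works. This illustrates the idea only; it is not gdelta's actual wire format, which uses variable-length integers and a more compact instruction layout.

```rust
// Simplified copy/literal delta, for illustration only.
enum Instruction {
    /// Copy `len` bytes from the base buffer starting at `offset`.
    Copy { offset: usize, len: usize },
    /// Append bytes that have no match in the base.
    Literal(Vec<u8>),
}

// Applying a delta is a single pass over the instruction stream.
fn apply(base: &[u8], delta: &[Instruction]) -> Vec<u8> {
    let mut out = Vec::new();
    for ins in delta {
        match ins {
            Instruction::Copy { offset, len } => {
                out.extend_from_slice(&base[*offset..*offset + *len]);
            }
            Instruction::Literal(bytes) => out.extend_from_slice(bytes),
        }
    }
    out
}

fn main() {
    let base = b"Hello, World! This is some base data.";
    // Matching regions are copied from the base; only "Rust" and
    // "modified data." are stored as literals.
    let delta = vec![
        Instruction::Copy { offset: 0, len: 7 },   // "Hello, "
        Instruction::Literal(b"Rust".to_vec()),
        Instruction::Copy { offset: 12, len: 15 }, // "! This is some "
        Instruction::Literal(b"modified data.".to_vec()),
    ];
    assert_eq!(
        apply(base, &delta),
        b"Hello, Rust! This is some modified data.".to_vec()
    );
}
```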
## Algorithm Parameters
The implementation uses optimized default parameters:
- Chunk size: 300 KB
- Word size: 8 bytes
- Base sample rate: 3
- Features: Skip optimization, reverse matching
These parameters are tuned for typical deduplication workloads.
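For intuition about the GEAR rolling hash mentioned above, the sketch below shows how such a hash advances one byte at a time with a shift and a table lookup, and how a bit mask selects sparse candidate match positions. The table, seed, and mask are illustrative stand-ins, not gdelta's actual constants or its sample-rate logic.

```rust
// Illustrative GEAR-style rolling hash, not gdelta's actual implementation.
// The hash is updated per byte: shift left, then add a per-byte value from
// a 256-entry table. Low bits of the hash pick candidate positions.
fn gear_table() -> [u64; 256] {
    // A real implementation ships a precomputed random table; a tiny
    // xorshift-style generator stands in for it here.
    let mut table = [0u64; 256];
    let mut state = 0x9E37_79B9_7F4A_7C15u64; // arbitrary seed
    for entry in table.iter_mut() {
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        *entry = state;
    }
    table
}

fn main() {
    let table = gear_table();
    let data = b"Hello, World! This is some base data.";
    let mask = 0x3Fu64; // illustrative sampling mask (~1 candidate per 64 bytes)

    let mut hash: u64 = 0;
    for (i, &byte) in data.iter().enumerate() {
        hash = (hash << 1).wrapping_add(table[byte as usize]);
        if (hash & mask) == 0 {
            println!("candidate fingerprint at offset {}", i);
        }
    }
}
```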
## Comparison with Other Algorithms

### Synthetic Workloads (2,097 tests)

| Algorithm   | Speed    | Compression | Memory | Use Case                      |
|-------------|----------|-------------|--------|-------------------------------|
| gdelta      | 496 MB/s | 68%         | Low    | Maximum speed                 |
| gdelta+lz4  | 430 MB/s | 71%         | Low    | Fast with compression         |
| gdelta+zstd | 305 MB/s | 75%         | Low    | Balanced speed/compression    |
| xpatch      | 306 MB/s | 75%         | Low    | Automatic algorithm selection |
| vcdiff      | 94 MB/s  | 64%         | Medium | Standard delta format         |
| qbsdiff     | 22 MB/s  | 84%         | Medium | Maximum compression           |
| zstd_dict   | 14 MB/s  | 55%         | Medium | Dictionary-based compression  |
### Git Repository Workloads (1.36M comparisons)
mdn/content repository (306K comparisons):
| Algorithm     | Compression | Encode Time | Notes            |
|---------------|-------------|-------------|------------------|
| xpatch (tags) | 97.5% saved | 319 µs      | Best compression |
| xpatch (seq)  | 97.5% saved | 14 µs       | Fast alternative |
| gdelta        | 97.0% saved | 4 µs        | Fastest          |
| vcdiff        | 95.6% saved | 49 µs       | Standard format  |
tokio repository (33K comparisons):
| Algorithm     | Compression | Encode Time | Notes            |
|---------------|-------------|-------------|------------------|
| xpatch (tags) | 97.9% saved | 306 µs      | Best compression |
| xpatch (seq)  | 95.6% saved | 24 µs       | Balanced         |
| gdelta        | 94.5% saved | 5 µs        | Fastest          |
| vcdiff        | 93.1% saved | 28 µs       | Standard format  |
Git benchmark data from xpatch test results. See PERFORMANCE.md for detailed analysis.
### Key Observations
Synthetic benchmarks:
- qbsdiff: best compression (84%), slowest speed (22 MB/s)
- gdelta: fastest speed (496 MB/s), competitive compression (68%)
- xpatch/gdelta+zstd: balanced (75% compression, 305 MB/s)
- Algorithm choice impacts compression by 29 percentage points
Git repositories:
- All algorithms achieve 93-98% compression on real file evolution
- xpatch with tag optimization: best compression (97.5-97.9%)
- gdelta: fastest (4-5µs), competitive compression (94.5-97.0%)
- Differences measured in microseconds, all are fast
- For versioned text with micro edits (docs, configs, source code), xpatch is generally the better fit — tag optimization finds better base versions across history, yielding up to 97.9% compression with acceptable overhead (~20µs per comparison)
Workload matters: Performance characteristics vary significantly between synthetic edits and real file evolution patterns.
## Use Cases
High-Speed Applications:
- Deduplication systems: gdelta (496 MB/s)
- Real-time processing: gdelta (4.4 GB/s decode)
- Database replication: gdelta or gdelta+lz4
Version Control Systems:
- Recommended for versioned text: xpatch with tags (97.5-97.9% on real repos)
- Speed-critical: gdelta (4-5µs per operation, 94.5-97.0% on real repos)
- Balanced: xpatch sequential mode or gdelta+lz4
Backup & Storage:
- Space-critical: qbsdiff (84% saved)
- Balanced: gdelta+zstd (75% saved, 305 MB/s)
- Fast backups: gdelta or gdelta+lz4
Network Synchronization:
- Low bandwidth: qbsdiff or xpatch
- Low latency: gdelta (4.4 GB/s decode)
- Balanced: gdelta+zstd
Standards Compliance:
- RFC 3284 (VCDIFF): vcdiff implementation
## Development

### Running Tests

```bash
# Run unit tests
cargo test

# Run integration tests
cargo test --test '*'

# Run CLI test suite
./test_gdelta.sh

# Run simple benchmarks (quick verification)
cargo bench --bench simple

# Run comprehensive benchmarks (15-30 min)
cargo bench --bench comprehensive

# Run comprehensive benchmarks with custom filters
BENCH_FORMATS=json,csv BENCH_ALGOS=gdelta,xpatch cargo bench --bench comprehensive
```
### Benchmark Modes
The comprehensive benchmark supports two modes:
```bash
# Quick mode (default): smaller sample size, faster
BENCH_MODE=quick cargo bench --bench comprehensive

# Full mode: larger sample size, more accurate
BENCH_MODE=full cargo bench --bench comprehensive
```

Results are saved to `target/benchmark_report_<timestamp>.md` and `.json`.
## Credits
This is a Rust implementation of the GDelta algorithm by Haoliang Tan.
Original repositories/resources:
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Built with ❤️ by Oliver Seifert | [Blog Post](https://oseifert.ch/blog/devto-i-got-obsessed-with-2-byt