Chunkrs
Streaming Content-Defined Chunking (CDC) using the FastCDC algorithm, with a modernized API that supports sync and async I/O. Prioritizes correctness, determinism, and composability. Works with any async backend, such as Tokio, async-std, and smol.
chunkrs
Deterministic, streaming Content-Defined Chunking (CDC) for Rust
chunkrs is a high-performance, portable infrastructure library for FastCDC chunking and cryptographic hashing.
Bytes in → Chunks & hashes out.
Zero-copy streaming. Async-agnostic. Excellent for any chunking and hashing use case.
Features
- Streaming API: push()/finish() pattern for processing data in any batch size
- Deterministic-by-design: Identical bytes produce identical chunk boundaries and hashes, regardless of batching or execution timing
- Zero-copy: Efficient Bytes slicing from input with minimal allocations
- FastCDC algorithm: Byte-by-byte gear hash rolling with configurable min/avg/max sizes
- BLAKE3 identity: Cryptographic chunk hashing (optional, feature-gated)
- Strictly safe: #![forbid(unsafe_code)] - zero unsafe code throughout
- Minimal API: Only 6 public types accessible from crate root - Chunker, Chunk, ChunkHash, ChunkConfig, HashConfig, ChunkError
- Well-tested: Comprehensive unit tests, integration tests, and fuzzing
API Changes from v0.8 to v0.9
Breaking Change: v0.9 simplifies the API by removing I/O-specific functionality and focusing on pure streaming CDC.
What Changed
| v0.8 API | v0.9 API |
|----------|----------|
| Chunker::chunk_file() | Removed - use Chunker::push() with your file reader |
| Chunker::chunk_bytes() | Removed - use Chunker::push() directly |
| Chunker::chunk_async() | Removed - async support is an application-layer concern |
| chunker.push(bytes) | ✅ Kept - core streaming API |
| chunker.finish() | ✅ Kept - finalize stream |
Benefits of the New Design
- Simpler: One API (push()) for all data sources
- Flexible: Works with any byte source (files, network, memory)
- Composable: Easily integrates with existing I/O code
- Explicit: I/O strategy is controlled by your application
- Smaller: Smaller dependency footprint (no tokio requirement)
Features Removed
The following features were intentionally removed to simplify the crate:
- ❌ File I/O helpers (read files yourself)
- ❌ Async streaming adapters (use your async runtime)
- ❌ Thread-local buffer pools (caller manages memory)
- ❌ Iterator-based APIs (use a push()/finish() loop)
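With the file helpers gone, reading is the caller's job. The following is a minimal sketch of such a read loop using only the standard library. feed_reader is a hypothetical helper, not part of chunkrs: it reads fixed-size batches from any std::io::Read and hands each one to a callback. In real code the callback body is where chunker.push() would be called, with chunker.finish() once the loop returns.

```rust
use std::io::{self, Read};

/// Hypothetical helper (not part of chunkrs): reads `reader` in batches of at
/// most `batch_size` bytes and passes each non-empty batch to `sink`.
/// Returns the total number of bytes read.
fn feed_reader<R: Read>(
    mut reader: R,
    batch_size: usize,
    mut sink: impl FnMut(&[u8]),
) -> io::Result<u64> {
    assert!(batch_size > 0, "batch_size must be non-zero");
    let mut buf = vec![0u8; batch_size];
    let mut total = 0u64;
    loop {
        match reader.read(&mut buf) {
            Ok(0) => return Ok(total), // end of stream
            Ok(n) => {
                sink(&buf[..n]); // in real code: call chunker.push() here
                total += n as u64;
            }
            // A read interrupted by a signal is safe to retry.
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
}
```

Because chunk boundaries do not depend on batching, batch_size here only affects I/O efficiency, not the resulting chunks.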
Architecture
chunkrs processes one logical byte stream at a time with byte-by-byte serial CDC:
┌───────────────┐     ┌──────────────────┐      ┌──────────────────┐
│ Input Bytes   │     │ Push-based       │      │ Serial CDC State │
│ (any source)  │────▶│ Streaming API    │────▶ │ (FastCDC rolling │
│               │     │ push()/finish()  │      │ hash, byte-by-   │
└───────────────┘     └──────────────────┘      │ byte)            │
                                                └──────────────────┘
        ┌─────────────┐      ┌───────────────────┐
        │             │      │ Chunk {           │
  ──▶   │ Chunk       │────▶ │   data: Bytes,    │
        │ Stream      │      │   offset: u64,    │
        │             │      │   hash: ChunkHash │
        └─────────────┘      │ }                 │
                             └───────────────────┘
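To make the "Serial CDC State" box concrete, here is a small standard-library sketch of byte-by-byte gear-hash boundary detection with min/avg/max limits. It is an illustration only, not chunkrs's code. It assumes a single low-bit mask and derives gear values from a mixing function instead of a fixed table, and it leaves out FastCDC refinements such as normalized chunking. The names gear and cut_points are hypothetical.

```rust
/// Illustrative per-byte "gear" value. Real implementations typically use a
/// fixed 256-entry table of random u64s; this mixer is an assumption of the sketch.
fn gear(byte: u8) -> u64 {
    let mut x = (byte as u64).wrapping_add(0x9E37_79B9_7F4A_7C15);
    x = (x ^ (x >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    x = (x ^ (x >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    x ^ (x >> 31)
}

/// Returns the end offset (exclusive) of every chunk in `data`.
/// Requires `avg` to be a power of two and `0 < min <= avg <= max`.
fn cut_points(data: &[u8], min: usize, avg: usize, max: usize) -> Vec<usize> {
    assert!(avg.is_power_of_two() && 0 < min && min <= avg && avg <= max);
    let mask = avg as u64 - 1; // a cut fires on average once every `avg` bytes
    let mut cuts = Vec::new();
    let mut hash = 0u64;
    let mut len = 0usize; // bytes in the chunk currently being built

    for (i, &b) in data.iter().enumerate() {
        // Gear rolling hash: shift out old influence, mix in the new byte.
        hash = (hash << 1).wrapping_add(gear(b));
        len += 1;

        let content_cut = len >= min && hash & mask == 0;
        if content_cut || len == max {
            cuts.push(i + 1);
            hash = 0;
            len = 0;
        }
    }
    if len > 0 {
        cuts.push(data.len()); // trailing partial chunk
    }
    cuts
}
```

Every chunk except possibly the last is between min and max bytes long, and the cut positions depend only on the bytes themselves.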
When to Use chunkrs
| Scenario | Recommendation |
|----------|---------------|
| Delta sync (rsync-style) | ✅ Perfect fit |
| Backup tools | ✅ Ideal for single-stream chunking |
| Deduplication (CAS) | ✅ Use with your own index |
| NVMe Gen4/5 saturation | ✅ 3–5 GB/s per core |
| Distributed dedup | ✅ Stateless, easy to distribute |
| Any other CDC use case | ✅ Likely fits |
Quick Start
[dependencies]
chunkrs = "0.9"
use bytes::Bytes;
use chunkrs::{ChunkConfig, Chunker};
fn main() {
    let mut chunker = Chunker::new(ChunkConfig::default());
    let mut pending = Bytes::new();
    // Feed data in any size (streaming)
    for part in [Bytes::from(&b"first part"[..]),
                 Bytes::from(&b"second part"[..])] {
        // Pass each batch by value, as in the Streaming API example
        let (chunks, leftover) = chunker.push(part);
        // Process complete chunks...
        for chunk in chunks {
            println!("offset: {:?}, len: {}, hash: {:?}",
                     chunk.offset, chunk.len(), chunk.hash);
        }
        // Keep the leftover returned by push()
        pending = leftover;
    }
    // Finalize stream
    if let Some(final_chunk) = chunker.finish() {
        println!("Final chunk: offset: {:?}, len: {}, hash: {:?}",
                 final_chunk.offset, final_chunk.len(), final_chunk.hash);
    }
}
What's in a Chunk:
Each Chunk contains:
- data: Bytes - the actual chunk payload (zero-copy reference when possible)
- offset: Option<u64> - byte position in the original stream
- hash: Option<ChunkHash> - BLAKE3 hash for content identity (if enabled)
API Overview
Flat API Design
chunkrs uses a flat API design for simplicity and clarity. All types are accessible directly from the crate root:
use chunkrs::{Chunker, Chunk, ChunkHash, ChunkConfig, HashConfig, ChunkError};
No duplicate paths like chunkrs::chunk::Chunk - only chunkrs::Chunk.
Core Types
| Type | Description |
|------|-------------|
| Chunker | Stateful CDC engine with streaming push()/finish() API |
| Chunk | Content-addressed block with Bytes payload and optional BLAKE3 hash |
| ChunkHash | 32-byte BLAKE3 hash identifying chunk content |
| ChunkConfig | Min/avg/max chunk sizes and hash configuration |
| HashConfig | Hash algorithm configuration (BLAKE3) |
| ChunkError | Error enum for chunking operations (InvalidConfig) |
Streaming API
The Chunker provides a streaming API:
use chunkrs::{Chunker, ChunkConfig};
use bytes::Bytes;
let mut chunker = Chunker::new(ChunkConfig::default());
let mut pending = Bytes::new();
// Feed data in any size (1 byte to megabytes)
let (chunks, leftover) = chunker.push(Bytes::from(&b"data"[..]));
// Process complete chunks immediately
for chunk in chunks {
    // chunk.data: Bytes - the chunk payload
    // chunk.offset: Option<u64> - position in original stream
    // chunk.hash: Option<ChunkHash> - BLAKE3 hash (if enabled)
}
// Feed leftover back in next push
pending = leftover;
// When stream ends, get final chunk
if let Some(final_chunk) = chunker.finish() {
    // Process final chunk
}
Determinism
The same input produces identical chunks regardless of how data is fed:
use bytes::Bytes;
use chunkrs::{ChunkConfig, Chunker};
let data: Vec<u8> = vec![0u8; 10000];
// All at once
let mut chunker1 = Chunker::new(ChunkConfig::default());
let (chunks1, _) = chunker1.push(Bytes::from(data.clone()));
let final1 = chunker1.finish();
// In 100-byte batches
let mut chunker2 = Chunker::new(ChunkConfig::default());
let mut all_chunks2 = Vec::new();
for batch in data.chunks(100) {
    // Bytes::from only accepts 'static slices, so copy the borrowed batch
    let (chunks, _) = chunker2.push(Bytes::copy_from_slice(batch));
    all_chunks2.extend(chunks);
}
let final2 = chunker2.finish();
// Same chunks, same hashes; this assertion checks the chunk count
assert_eq!(chunks1.len() + final1.is_some() as usize,
           all_chunks2.len() + final2.is_some() as usize);
Configuration
Chunk Sizes
Choose based on your deduplication granularity needs:
use chunkrs::ChunkConfig;
// Small files / high dedup (8 KiB average)
let small = ChunkConfig::new(2 * 1024, 8 * 1024, 32 * 1024)?;
// Default (16 KiB average) - good general purpose
let default = ChunkConfig::default();
// Large files / high throughput (256 KiB average)
let large = ChunkConfig::new(64 * 1024, 256 * 1024, 1024 * 1024)?;
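A rough way to compare these settings is to estimate how many chunks a stream will produce and how much space their hashes take in an index. The sketch below is plain arithmetic, not a chunkrs API. index_estimate is a hypothetical name, and it assumes chunk sizes average out to the configured average, which real data only approximates. The 32-byte hash size matches ChunkHash.

```rust
/// Rough estimate for a stream of `stream_len` bytes:
/// (expected number of chunks, bytes needed to store one 32-byte hash per chunk).
/// Assumes chunks average `avg_size` bytes; actual counts vary with content.
fn index_estimate(stream_len: u64, avg_size: u64) -> (u64, u64) {
    assert!(avg_size > 0, "avg_size must be non-zero");
    let chunks = (stream_len + avg_size - 1) / avg_size; // round up
    (chunks, chunks * 32)
}
```

For a 1 GiB stream this gives about 131,072 chunks (4 MiB of hashes) at an 8 KiB average, 65,536 chunks (2 MiB) at 16 KiB, and 4,096 chunks (128 KiB) at 256 KiB. Smaller chunks find more duplicates at the cost of a larger index.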
Hash Configuration
use chunkrs::{ChunkConfig, HashConfig};
// With BLAKE3 (default)
let with_hash = ChunkConfig::default();
// Boundary detection only (faster, no content identity)
let no_hash = ChunkConfig::default().with_hash_config(HashConfig::disabled());
Performance
Throughput targets on modern hardware:
| Storage | Single-core CDC | Bottleneck |
|---------|----------------|------------|
| NVMe Gen4 | ~3–5 GB/s | CPU (hashing) |
| NVMe Gen5 | ~3–5 GB/s | CDC algorithm |
| SATA SSD | ~500 MB/s | Storage |
| 10 Gbps LAN | ~1.2 GB/s | Network |
| HDD | ~200 MB/s | Seek latency |
Memory usage:
- Per stream: O(pending_bytes) - typically minimal as pending is flushed on boundaries
- Zero-copy: Chunk data references input Bytes without copying
- Caller controls memory management (buffer pools, reuse, etc.)
To saturate NVMe Gen5:
Process multiple files concurrently by running multiple Chunker instances. Do not attempt to parallelize within a single file—this destroys deduplication ratios.
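One way to structure this, sketched with only the standard library: give each stream its own worker and keep the processing inside a stream strictly serial. chunk_streams is a hypothetical helper, not part of chunkrs. The chunk_one closure stands in for "create a Chunker, push the whole stream, call finish", and the one-thread-per-stream policy is an assumption of the sketch (a real tool would likely bound concurrency to the core count).

```rust
use std::thread;

/// Runs `chunk_one` over every stream on its own scoped thread and returns the
/// per-stream results in input order. Each stream is processed serially from
/// start to end; only separate streams run in parallel.
fn chunk_streams<T, F>(streams: &[Vec<u8>], chunk_one: F) -> Vec<T>
where
    T: Send,
    F: Fn(&[u8]) -> T + Sync,
{
    // Share the closure by reference so every thread can call it.
    let chunk_one = &chunk_one;
    thread::scope(|scope| {
        // Spawn all workers first so they run concurrently...
        let handles: Vec<_> = streams
            .iter()
            .map(|s| scope.spawn(move || chunk_one(s.as_slice())))
            .collect();
        // ...then collect results in the original order.
        handles
            .into_iter()
            .map(|h| h.join().expect("chunking thread panicked"))
            .collect()
    })
}
```

Since no state is shared between streams, the output for each file is the same as if it had been chunked alone.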
Determinism Guarantees
chunkrs guarantees exact determinism:
- Boundary determinism: Identical byte streams produce identical chunk boundaries at identical byte positions
- Hash determinism: Identical byte streams produce identical ChunkHash (BLAKE3) values
- Batch independence: Results are identical regardless of input batch sizes (1 byte vs 1MB vs streaming)
- Serial consistency: Rolling hash state is strictly maintained across all push() calls
What this means: You can re-chunk a file on Tuesday with different batch sizes and get bit-identical chunks to Monday's run. This is essential for delta sync correctness.
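Batch independence follows from serial consistency: if the rolling hash and the current chunk length are kept between calls, a batch edge is invisible to the boundary decision. The toy detector below illustrates that idea with the standard library only. ToyChunker, gear, and boundaries are hypothetical names, the gear mixer and single mask are simplifications, and none of this is chunkrs's actual implementation.

```rust
/// Illustrative per-byte gear value (an assumption of this sketch).
fn gear(byte: u8) -> u64 {
    let mut x = (byte as u64).wrapping_add(0x9E37_79B9_7F4A_7C15);
    x = (x ^ (x >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    x = (x ^ (x >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    x ^ (x >> 31)
}

/// Toy streaming boundary detector. All rolling state lives in the struct,
/// so it survives from one `push` call to the next.
struct ToyChunker {
    hash: u64,
    len: u64,    // bytes in the chunk currently being built
    offset: u64, // bytes consumed from the stream so far
    min: u64,
    mask: u64,
    max: u64,
}

impl ToyChunker {
    fn new(min: u64, avg: u64, max: u64) -> Self {
        assert!(avg.is_power_of_two() && 0 < min && min <= avg && avg <= max);
        Self { hash: 0, len: 0, offset: 0, min, mask: avg - 1, max }
    }

    /// Consumes one batch; returns absolute end offsets of chunks it completed.
    fn push(&mut self, batch: &[u8]) -> Vec<u64> {
        let mut cuts = Vec::new();
        for &b in batch {
            self.hash = (self.hash << 1).wrapping_add(gear(b));
            self.len += 1;
            self.offset += 1;
            let content_cut = self.len >= self.min && self.hash & self.mask == 0;
            if content_cut || self.len == self.max {
                cuts.push(self.offset);
                self.hash = 0;
                self.len = 0;
            }
        }
        cuts
    }

    /// Ends the stream; returns the end offset of a trailing partial chunk, if any.
    fn finish(self) -> Option<u64> {
        (self.len > 0).then_some(self.offset)
    }
}

/// All chunk end offsets when `data` is fed in batches of `batch` bytes (batch > 0).
fn boundaries(data: &[u8], batch: usize) -> Vec<u64> {
    let mut chunker = ToyChunker::new(64, 256, 1024);
    let mut out = Vec::new();
    for part in data.chunks(batch) {
        out.extend(chunker.push(part));
    }
    out.extend(chunker.finish());
    out
}
```

Calling boundaries with batch sizes of 1, 100, or the whole input yields the same offsets, because no decision ever depends on where a batch ends.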
Safety & Correctness
- No unsafe code: #![forbid(unsafe_code)]
- Comprehensive testing: Unit tests, doc tests, and property-based tests ensure:
- Determinism invariants
- Batch equivalence (chunking whole vs chunked yields same results)
- No panics on edge cases (empty files, single byte, max-size boundari
