# dataprof

Library and CLI for profiling tabular data.
dataprof is a Rust library and CLI for profiling tabular data. It computes column-level statistics, detects data types and patterns, and evaluates data quality against the ISO 8000/25012 standard -- all with bounded memory usage that lets you profile datasets far larger than your available RAM.
> [!NOTE]
> This is a work in progress: although the project is over six months old, the API is not yet stable and may change in future versions. Please report any issues or suggestions you may have.
## Highlights
- Rust core -- fast columnar and streaming engines
- ISO 8000/25012 quality assessment -- five dimensions: Completeness, Consistency, Uniqueness, Accuracy, Timeliness
- Multi-format -- CSV (auto-delimiter detection), JSON, JSONL, Parquet, databases, DataFrames, Arrow
- True streaming -- bounded-memory profiling with online algorithms (Incremental engine)
- Three interfaces -- CLI binary, Rust library, Python package
- Async-ready -- `async`/`await` API for embedding in web services and streaming pipelines
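The "true streaming" highlight rests on online (single-pass) algorithms: statistics are updated row by row, so memory stays constant no matter how large the file is. A minimal Python sketch of the idea using Welford's online mean/variance -- purely illustrative, not dataprof's Rust implementation:

```python
class OnlineStats:
    """Single-pass mean/variance via Welford's algorithm: O(1) memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        return self.m2 / self.n if self.n else 0.0


stats = OnlineStats()
for value in [1.0, 2.0, 3.0, 4.0]:  # imagine these streaming in from disk
    stats.update(value)
print(stats.mean)      # 2.5
print(stats.variance)  # 1.25
```

Because each `update` touches only three scalars, the same loop profiles a 4-row list or a 400 GB stream with identical memory footprint -- the property that lets dataprof handle datasets larger than RAM.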
## Quick Start

### CLI

```bash
cargo install dataprof
dataprof analyze data.csv --detailed
dataprof schema data.csv
dataprof count data.parquet
```
### Rust

```rust
use dataprof::Profiler;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let report = Profiler::new().analyze_file("data.csv")?;
    println!("Rows: {}", report.execution.rows_processed);
    println!("Quality: {:.1}%", report.quality_score().unwrap_or(0.0));
    for col in &report.column_profiles {
        println!("  {} ({:?}): {} nulls", col.name, col.data_type, col.null_count);
    }
    Ok(())
}
```
### Python

```python
import dataprof

report = dataprof.profile("data.csv")
print(f"{report.rows} rows, {report.columns} columns")
print(f"Quality score: {report.quality_score}")
for col in report.column_profiles.values():
    print(f"  {col.name} ({col.data_type}): {col.null_percentage:.1f}% null")
```
## Installation

### CLI binary

```bash
cargo install dataprof                      # default (CLI only)
cargo install dataprof --features full-cli  # CLI + all formats + databases
```
### Rust library

```toml
[dependencies]
dataprof = "0.6"  # core library (no CLI deps)

# or, with async streaming support:
dataprof = { version = "0.6", features = ["async-streaming"] }
```
### Python package

```bash
uv pip install dataprof
# or
pip install dataprof
```
## Feature Flags

| Feature | Description |
|---|---|
| `cli` (default) | CLI binary with clap, colored output, progress bars |
| `minimal` | CSV-only, no CLI -- fastest compile |
| `async-streaming` | Async profiling engine with tokio |
| `parquet-async` | Profile Parquet files over HTTP |
| `database` | Database profiling (connection handling, retry, SSL) |
| `postgres` | PostgreSQL connector (includes `database`) |
| `mysql` | MySQL/MariaDB connector (includes `database`) |
| `sqlite` | SQLite connector (includes `database`) |
| `all-db` | All three database connectors |
| `datafusion` | DataFusion SQL engine integration |
| `python` | Python bindings via PyO3 |
| `python-async` | Async Python API (includes `python` + `async-streaming`) |
| `full-cli` | CLI + Parquet + all databases |
| `production` | PostgreSQL + MySQL (common deployment) |
## Supported Formats
| Format | Engine | Notes |
|---|---|---|
| CSV | Incremental, Columnar | Auto-detects , ; \| \t delimiters |
| JSON | Incremental | Array-of-objects |
| JSONL / NDJSON | Incremental | One object per line |
| Parquet | Columnar | Reads metadata for schema/count without scanning rows |
| Database query | Async | PostgreSQL, MySQL, SQLite via connection string |
| pandas / polars DataFrame | Columnar | Python API only |
| Arrow RecordBatch | Columnar | Via PyCapsule (zero-copy) or Rust API |
| Async byte stream | Incremental | Any AsyncRead source (HTTP, WebSocket, etc.) |
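The delimiter auto-detection noted in the CSV row can be approximated with the standard library's `csv.Sniffer`. This is only a sketch of the idea in Python; dataprof's actual detection logic lives in its Rust engines and is not shown here:

```python
import csv

def detect_delimiter(sample: str) -> str:
    """Guess the delimiter from a sample of the file's first lines."""
    dialect = csv.Sniffer().sniff(sample, delimiters=",;|\t")
    return dialect.delimiter

print(detect_delimiter("a;b;c\n1;2;3\n"))       # ;
print(repr(detect_delimiter("a\tb\tc\n1\t2\t3\n")))  # '\t'
```

Restricting the candidate set (here `, ; | \t`, the same four the table lists) keeps detection cheap: only the first few kilobytes of the file need to be sniffed before streaming begins.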
## Quality Metrics
dataprof evaluates data quality against the five dimensions defined in ISO 8000-8 and ISO/IEC 25012:
| Dimension | What it measures |
|---|---|
| Completeness | Missing values ratio, complete records ratio, fully-null columns |
| Consistency | Data type consistency, format violations, encoding issues |
| Uniqueness | Duplicate rows, key uniqueness, high-cardinality warnings |
| Accuracy | Outlier ratio, range violations, negative values in positive-only columns |
| Timeliness | Future dates, stale data ratio, temporal ordering violations |
An overall quality score (0 -- 100) is computed as a weighted average of dimension scores.
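As a toy illustration of such a weighted average (the weights below are invented for the example; dataprof's actual weights are not documented in this README):

```python
def quality_score(dimensions: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-dimension scores, each on a 0-100 scale."""
    total_weight = sum(weights.values())
    return sum(dimensions[d] * weights[d] for d in dimensions) / total_weight

# Hypothetical weights and scores -- purely illustrative, not dataprof's defaults.
weights = {"completeness": 0.3, "consistency": 0.2, "uniqueness": 0.2,
           "accuracy": 0.2, "timeliness": 0.1}
scores = {"completeness": 95.0, "consistency": 88.0, "uniqueness": 100.0,
          "accuracy": 70.0, "timeliness": 90.0}
print(round(quality_score(scores, weights), 1))  # 89.1
```

Dividing by the total weight keeps the result on the same 0-100 scale as the inputs even if the weights do not sum to exactly 1.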
## Documentation

- CLI Usage Guide -- every subcommand and flag
- Python API Guide -- `profile()`, report types, async, databases
- Getting Started -- tutorial from zero to profiling
- Examples Cookbook -- copy-pasteable recipes (CLI, Python, Rust)
- Database Connectors -- PostgreSQL, MySQL, SQLite setup
- Contributing
- Changelog
## Academic Work
dataprof is the subject of a peer-reviewed paper submitted to IEEE ScalCom 2026:
A. Bozzo, "A Compiled Paradigm for Scalable and Sustainable Edge AI: Out-of-Core Execution and SIMD Acceleration in Telemetry Profiling," IEEE ScalCom 2026 (under review). [Repository & reproducible benchmarks]
The paper benchmarks dataprof against YData Profiling, Polars, and pandas across execution efficiency, memory scalability, energy consumption, and zero-copy interoperability in constrained Edge AI environments.
### BibTeX
```bibtex
@inproceedings{bozzo2026compiled,
  author={Bozzo, Andrea},
  title={A Compiled Paradigm for Scalable and Sustainable Edge AI: Out-of-Core Execution and SIMD Acceleration in Telemetry Profiling},
  booktitle={2026 IEEE International Conference on Scalable Computing and Communications (ScalCom)},
  year={2026},
  note={Under review}
}
```
## License
Dual-licensed under either the MIT License or the Apache License, Version 2.0, at your option.
