Precizer

precizer is a high-performance CLI for verifying data integrity across large file trees. It performs byte-for-byte comparisons to identify mismatches after synchronization, backed by a checksum database and persisted state for resumable long-running jobs. It turns “seems fine” into “verified”.


Precizer: data integrity verification for file systems of any scale

A Tiny, High-Performance File Integrity and Comparison Tool

“A truly great application will always fit on a floppy disk. Hopefully, someone out there still remembers what those were… But it’s not about the floppies, it’s about quality software!”© :-D

Comprehensive hybrid test suite:

In-process integration tests
Out-of-process CLI system tests

TL;DR

Overview

precizer is a lightweight, high-performance CLI tool written in pure C. It’s designed for file integrity verification and comparison, making it especially useful for validating synchronization results. The program walks directory trees and builds a database of files and their checksums for fast, repeatable comparisons.

Built for embedded systems and large-scale clustered environments, precizer detects synchronization drift by comparing files and checksums across sources. It can also analyze historical changes by comparing databases captured from the same source at different points in time.

Basic Example

Consider two machines with large mounted volumes at /mnt1 and /mnt2, respectively, that are supposed to contain identical data. The goal is to verify, byte by byte, whether the contents are truly identical or whether discrepancies exist.

Run precizer on the first machine (e.g., hostname host1):

precizer --progress /mnt1

This command traverses the directory tree under /mnt1, creating a database file host1.db in the current directory. The --progress flag provides real-time progress updates, displaying the total traversed space and the number of processed files.

Run precizer on the second machine (e.g., hostname host2):

precizer --progress /mnt2

This will generate a database file host2.db in the current directory.

Copy host1.db and host2.db to one of the machines and run the following command to compare them:

precizer --compare host1.db host2.db

The output will display:

Files that exist on host1 but are missing on host2, and vice versa.
Files present on both hosts but with different checksums.
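
The comparison step can be pictured as set operations over two path-to-checksum mappings. The following is a minimal sketch of that idea, not precizer's actual implementation: the function name compare_snapshots, the dictionary layout, and the placeholder checksum strings are all assumptions made for illustration.

```python
def compare_snapshots(a, b):
    """Compare two {relative_path: checksum} mappings (hypothetical layout)."""
    # Paths recorded on one side only
    only_in_a = sorted(a.keys() - b.keys())
    only_in_b = sorted(b.keys() - a.keys())
    # Paths recorded on both sides whose checksums disagree
    differing = sorted(p for p in a.keys() & b.keys() if a[p] != b[p])
    return only_in_a, only_in_b, differing


# Placeholder checksums; a real database would hold full digests
host1 = {"abc/def/aaa.txt": "c1", "abc/only1.txt": "c2", "same.bin": "c3"}
host2 = {"abc/def/aaa.txt": "XX", "abc/only2.txt": "c4", "same.bin": "c3"}

only1, only2, diff = compare_snapshots(host1, host2)
print("missing on host2:", only1)
print("missing on host1:", only2)
print("checksum mismatch:", diff)
```

The three result lists correspond to the categories in the output described above: files missing on either side, and files present on both sides with different checksums.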

Relative Paths for Consistent Comparison

precizer stores only relative file paths in its database. For example, a file located at:

/mnt1/abc/def/aaa.txt

will be stored as:

abc/def/aaa.txt

without the /mnt1 prefix. Similarly, the corresponding file on /mnt2:

/mnt2/abc/def/aaa.txt

will also be stored as:

abc/def/aaa.txt

This ensures that files residing under different mount points or sources can still be matched by their relative paths and compared by their checksums.
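
The effect of storing relative paths can be illustrated with a short sketch. It assumes nothing about precizer's internals: the helper relative_files and the temporary directories standing in for /mnt1 and /mnt2 are hypothetical.

```python
import os
import tempfile


def relative_files(root):
    """Return the set of file paths under root, expressed relative to root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Drop the root prefix and normalize separators to '/'
            found.add(os.path.relpath(full, root).replace(os.sep, "/"))
    return found


with tempfile.TemporaryDirectory() as tmp:
    keys = {}
    for mount in ("mnt1", "mnt2"):  # stand-ins for /mnt1 and /mnt2
        root = os.path.join(tmp, mount)
        os.makedirs(os.path.join(root, "abc", "def"))
        with open(os.path.join(root, "abc", "def", "aaa.txt"), "w") as fh:
            fh.write("data")
        keys[mount] = relative_files(root)

print(keys["mnt1"] == keys["mnt2"])
```

Both trees yield the same key, abc/def/aaa.txt, so entries from the two sources line up regardless of where each tree is mounted.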

DOWNLOAD

Download the latest executables from https://github.com/precizer/precizer/releases/latest/ for:

Linux x86_64: precizer_linux_x86_64_portable.zip
Linux ARM (aarch64): precizer_linux_aarch64_portable.zip
macOS arm64: precizer_macos_arm64.zip

The release packages contain portable executables in a zip archive.

Download, unzip, and run

A universal approach to automating upgrades to newer versions

# Automation for downloading and unarchiving new versions

# Download
wget -O precizer.zip -q "https://github.com/precizer/precizer/releases/latest/download/precizer_$(uname -s | tr '[:upper:]' '[:lower:]' | sed 's/darwin/macos/')_$(uname -m | sed 's/amd64/x86_64/')$( [ "$(uname -s)" = "Linux" ] && echo '_portable' ).zip"

# Extract the archive
unzip -jqo precizer.zip '*/precizer' -d ./

# Run
./precizer --version
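
To make the URL construction in the download command easier to follow, here is a sketch of the same name-mapping logic. The function asset_name is hypothetical and only mirrors the transformations visible in the shell pipeline above (lowercasing, darwin to macos, amd64 to x86_64, and the _portable suffix on Linux); it does not query GitHub.

```python
def asset_name(system, machine):
    """Build the release archive name from `uname -s` and `uname -m` style values."""
    os_part = system.lower().replace("darwin", "macos")
    arch_part = machine.replace("amd64", "x86_64")
    # Only the Linux archives carry the "_portable" suffix
    suffix = "_portable" if system == "Linux" else ""
    return f"precizer_{os_part}_{arch_part}{suffix}.zip"


for system, machine in (("Linux", "x86_64"), ("Linux", "aarch64"), ("Darwin", "arm64")):
    print(asset_name(system, machine))
```

The three printed names match the archives listed in the DOWNLOAD section.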

Technical details of the portable build

The Linux build is a single statically linked ELF executable that is not tied to any specific distribution. It runs immediately on almost any Linux distro and does not require external shared libraries.
The binary is produced by GitHub CI/CD, then compressed with UPX (the executable packer). The self-extracting compressed binary is then placed into a ZIP archive for convenient download. The file can be extracted from the archive and run directly.
Static linking is not supported on macOS, so running the downloaded application requires the following libraries to be available on the system: sqlite3, pcre2, argp and fts.

CHANGELOG

A list of changes by version is available in a separate file: CHANGELOG

TECHNICAL DETAILS

Consider a scenario where a primary storage system has a backup copy. For example, this could be a data center storage system and its Disaster Recovery copy.

Synchronization from the primary storage to the backup occurs periodically, but due to the massive data volumes, synchronization is most likely not performed byte-by-byte but rather by detecting metadata changes within the file system. In such cases, file size and modification time are taken into account, but the actual content is not verified byte by byte.

This approach makes sense: even though the primary data center and the Disaster Recovery site usually have high-speed communication channels, a full byte-by-byte synchronization would take an unreasonably long time.
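
The gap left by metadata-only detection can be demonstrated with a small sketch. It builds two files with identical size and modification time but different content; the helper names same_metadata and sha512_of, the file names, and the timestamp are assumptions chosen for illustration.

```python
import hashlib
import os
import tempfile


def same_metadata(p, q):
    """Metadata-style check: equal size and equal modification time (whole seconds)."""
    sp, sq = os.stat(p), os.stat(q)
    return sp.st_size == sq.st_size and int(sp.st_mtime) == int(sq.st_mtime)


def sha512_of(path, chunk_size=1 << 20):
    """Content check: SHA-512 over the whole file, read in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    primary = os.path.join(tmp, "primary.bin")
    backup = os.path.join(tmp, "backup.bin")
    with open(primary, "wb") as fh:
        fh.write(b"A" * 1024)
    with open(backup, "wb") as fh:
        fh.write(b"A" * 1023 + b"B")  # same size, last byte differs
    # Force identical timestamps on both files
    for path in (primary, backup):
        os.utime(path, (1_700_000_000, 1_700_000_000))

    metadata_match = same_metadata(primary, backup)
    content_match = sha512_of(primary) == sha512_of(backup)

print("metadata says identical:", metadata_match)
print("content is identical:", content_match)
```

The metadata check reports the files as identical while the checksums differ, which is exactly the kind of drift that size and modification time alone cannot reveal.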

Tools like rsync support both types of synchronization (metadata-based and byte-by-byte), but they have one major drawback: state is not preserved between sessions.

The following scenario illustrates the issue:

Given: Server "A" and Server "B" (Primary Data Center and Disaster Recovery)
Some files have been modified on Server "A".
The rsync algorithm detects them based on changes in size and modification time and synchronizes them to Server "B".
Multiple connection failures occur during synchronization between the Primary Data Center and the Disaster Recovery site.
To verify data integrity (i.e., ensuring that files on "A" and "B" are identical byte by byte), rsync is often used with byte-by-byte comparison. The process works as follows:
- rsync is launched on Server "A" with the --checksum option, attempting to compute checksums sequentially on both "A" and "B" in a single session.
- This process takes an extremely long time for large-scale storage systems.
- Since rsync does not save computed checksums between sessions, it introduces several technical challenges:
  - If the connection drops, rsync terminates the session, and the next run must start from scratch. Given the huge data volumes, a full byte-by-byte integrity verification becomes practically impossible.
- Storage subsystem failures can also lead to binary inconsistencies. In such cases, file system metadata cannot reliably determine whether file contents on "A" and "B" are truly identical.
- Over time, errors accumulate, increasing the risk of maintaining an inconsistent Disaster Recovery copy of system "A" on system "B", rendering the entire Disaster Recovery effort useless. Standard utilities do not detect these inconsistencies, and technical personnel may be completely unaware of data integrity problems in the Disaster Recovery storage.
To overcome these limitations, precizer was developed. The program identifies exactly which files differ between "A" and "B" so that they can be resynchronized with the necessary corrections. The tool operates at maximum speed (pushing hardware performance to its limits) because it is written in pure C and utilizes high-performance algorithms optimized for efficiency. The program is designed to handle both small files and petabyte-scale data volumes, with no upper limits*.
The name precizer comes from the word precision, implying something that enhances accuracy.
The program precisely analyzes directory contents, including subdirectories, computing checksums for every encountered file while storing metadata in an SQLite database (a regular binary file).
precizer is fault-tolerant and can resume execution from the point of interruption. For example, if the program is terminated via Ctrl+C while analyzing a petabyte-scale file, it will NOT restart from the beginning but continue exactly where it left off using previously recorded data in the database. This significantly saves resources, time, and effort for system administrators.
The program can be interrupted at any time using any method, and this is completely safe for both the scanned data and the database created by precizer.
If the program is intentionally or accidentally stopped, there is no need to worry about losing progress. All results are fully preserved and can be used in subsequent runs.
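
The value of persisting checksums between runs can be sketched as follows. This is a simplified illustration, not precizer's actual design: the table name files, its columns, and the stop_after parameter (which simulates an interruption) are hypothetical. The sketch also resumes only at file granularity, whereas the text above describes resuming within a single large file.

```python
import hashlib
import os
import sqlite3
import tempfile


def sha512_of(path, chunk_size=1 << 20):
    digest = hashlib.sha512()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def scan(root, db_path, stop_after=None):
    """Hash files under root, skipping those already recorded in the database.

    Returns how many files were hashed during this run.
    """
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, sha512 TEXT NOT NULL)"
    )
    hashed = 0
    try:
        for dirpath, dirnames, filenames in os.walk(root):
            dirnames.sort()  # deterministic traversal order
            for name in sorted(filenames):
                full = os.path.join(dirpath, name)
                rel = os.path.relpath(full, root)
                if con.execute("SELECT 1 FROM files WHERE path = ?", (rel,)).fetchone():
                    continue  # already hashed in an earlier run
                if stop_after is not None and hashed >= stop_after:
                    return hashed  # simulated interruption
                con.execute("INSERT INTO files VALUES (?, ?)", (rel, sha512_of(full)))
                con.commit()  # persist each result so it survives an interruption
                hashed += 1
        return hashed
    finally:
        con.close()


with tempfile.TemporaryDirectory() as tmp:
    root = os.path.join(tmp, "data")
    os.makedirs(root)
    for i in range(5):
        with open(os.path.join(root, f"file{i}.txt"), "w") as fh:
            fh.write(f"content {i}")
    db_path = os.path.join(tmp, "state.db")  # kept outside the scanned tree

    first_run = scan(root, db_path, stop_after=2)  # interrupted after two files
    second_run = scan(root, db_path)               # picks up the remaining three
    third_run = scan(root, db_path)                # nothing left to do

print(first_run, second_run, third_run)
```

Because each result is committed as soon as it is computed, an interrupted run loses at most the file that was in progress, and the next run only processes what is still missing.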
Checksum calculations rely on the cryptographic SHA512 hash algorithm, which is reliable, fast, and provides very strong practical collision resistance. If two lar