
PyFileFixity

📂🛡️Suite of tools for file fixity (data protection for long term storage⌛) using redundant error correcting codes, hash auditing and duplications with majority vote, all in pure Python🐍


pyFileFixity

|PyPI-Status| |PyPI-Versions| |PyPI-Downloads|

|Build-Status| |Coverage|

|LICENCE|

pyFileFixity provides a suite of open source, cross-platform, easy-to-use and easy-to-maintain (readable code) tools to protect and manage data for long-term storage/archival, and to test the performance of any data protection algorithm.

The project is written in pure Python to meet these criteria, although Cython-compiled extensions are available for core routines to speed up encoding/decoding; a pure-Python implementation is always available as a specification, to allow long-term replication.

Here is an example of what pyFileFixity can do:

|Example|

On the left, this is the original image.

At the center is the same image but with a few symbols corrupted (only 3 in the header and 2 in the rest of the file, i.e., 5 bytes corrupted in total, out of the 19KB total file size). Only a few corrupted bytes are enough to make the image look totally unrecoverable, and we were even lucky: the image could be entirely unreadable if any of the "magic bytes" had been corrupted!

On the right, the corrupted image was repaired using the pff header command of pyFileFixity. This repaired only the image header (i.e., the first part of the file), so only the first 3 corrupted bytes were repaired, not the 2 bytes in the rest of the file, and yet the image looks indistinguishable from the untampered original! Best of all, this only cost the generation of an "ecc repair file" for the header, whose size is a constant 3.3KB per file, regardless of the protected file's size!

This works because most file formats store the most important information needed to read them at the beginning of the file, also called the file's header, so repairing this part almost always ensures the file remains readable (even if the rest of the file is still corrupted, as long as the header is safe, you can open it). This works especially well for images, compressed files, and formatted documents such as DOCX and ODT.
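The importance of headers can be seen with "magic bytes": the fixed signatures most formats place at the very start of a file. The signatures below are well-known constants, but the checker itself is just an illustrative sketch, not part of pyFileFixity's API:

```python
# Why headers matter: readers identify a file by its leading "magic bytes".
# If those are intact, the file can at least be opened; if not, many programs
# refuse to read it at all. (Illustrative sketch, not pyFileFixity code.)

MAGIC_SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"\xff\xd8\xff": "JPEG image",
    b"PK\x03\x04": "ZIP container (also DOCX/ODT)",
    b"%PDF": "PDF document",
}

def identify_header(data: bytes) -> str:
    """Return the format whose magic bytes match the start of `data`."""
    for magic, name in MAGIC_SIGNATURES.items():
        if data.startswith(magic):
            return name
    return "unknown (header possibly corrupted)"

# A pristine PNG header is recognized...
print(identify_header(b"\x89PNG\r\n\x1a\n" + b"\x00" * 16))  # PNG image
# ...but flipping a single magic byte makes the file unidentifiable.
print(identify_header(b"\x88PNG\r\n\x1a\n" + b"\x00" * 16))  # unknown (...)
```

Corrupting a single one of these leading bytes is enough to make the file type unrecognizable, which is why protecting the header alone already buys a lot of resilience.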

Of course, you can also protect the whole file, not only the header, using pyFileFixity's pff whole command. You can also detect any corruption using pff hash.


.. contents:: Table of contents
   :backlinks: top

Quickstart

Runs on Python 3 up to Python 3.12-dev. PyPy 3 is also supported.

  • To install or update on Python 3:

pip install --upgrade pyfilefixity

  • For Python 2.7, the latest working version was v3.0.2:

pip install --upgrade pyfilefixity==3.0.2 reedsolo==1.7.0 unireedsolomon==1.0.5

  • Once installed, the suite of tools can be accessed through a centralized interface script called pff, which provides several subcommands. To list them:

pff --help

You should see:

::

usage: pff [-h]
           {hash,rfigc,header,header_ecc,hecc,whole,structural_adaptive_ecc,saecc,protect,repair,recover,repair_ecc,recc,dup,replication_repair,restest,resilience_tester,filetamper,speedtest,ecc_speedtest}
           ...

positional arguments:
  {hash,rfigc,header,header_ecc,hecc,whole,structural_adaptive_ecc,saecc,protect,repair,recover,repair_ecc,recc,dup,replication_repair,restest,resilience_tester,filetamper,speedtest,ecc_speedtest}
    hash (rfigc)        Check files integrity fast by hash, size, modification date or by data structure integrity.
    header (header_ecc, hecc)
                        Protect/repair files headers with error correction codes
    whole (structural_adaptive_ecc, saecc, protect, repair)
                        Protect/repair whole files with error correction codes
    recover (repair_ecc, recc)
                        Utility to try to recover damaged ecc files using a failsafe mechanism, a sort of recovery
                        mode (note: this does NOT recover your files, only the ecc files, which may then be used to
                        recover your files!)
    dup (replication_repair)
                        Repair files from multiple copies of various storage mediums using a majority vote
    restest (resilience_tester)
                        Run tests to quantify robustness of a file protection scheme (can be used on any, not just
                        pyFileFixity)
    filetamper          Tamper files using various schemes
    speedtest (ecc_speedtest)
                        Run error correction encoding and decoding speedtests

options:
  -h, --help            show this help message and exit
  • Every subcommand provides its own more detailed help instructions, e.g., for the hash subcommand:

pff hash --help

  • To generate a monitoring database (to later check very quickly which files are corrupted; note it cannot repair anything except filesystem metadata):

pff hash -i "your_folder" -d "dbhash.csv" -g -f -l "log.txt"

Note: this also works for a single file, just replace "your_folder" with "your_file.ext".

  • To update this monitoring database (checks for new files, but does not remove entries for files that no longer exist; replace --append with --remove for the latter):

pff hash -i "your_folder" -d "dbhash.csv" --update --append

  • Later, to check which files were corrupted:

pff hash -i "your_folder" -d "dbhash.csv" -l log.txt -s -e errors.csv

  • To use this monitoring database to recover filesystem metadata such as file names and directory layout, by filescraping from file contents:

pff hash -i "your_folder" -d "dbhash.csv" -l "log.txt" -o "output_folder" --filescraping_recovery

  • To protect file headers with a file called hecc.txt:

pff header -i "your_folder" -d "hecc.txt" -l "log.txt" -g -f --ecc_algo 3

  • To repair file headers and store the repaired files in output_folder:

pff header -i "your_folder" -d "hecc.txt" -o "output_folder" -l "log.txt" -c -v --ecc_algo 3

  • To protect whole files with a file called ecc.txt:

pff whole -i "your_folder" -d "ecc.txt" -l "log.txt" -g -f -v --ecc_algo 3

  • To repair whole files:

pff whole -i "your_folder" -d "ecc.txt" -o "output_folder" -l "log.txt" -c -v --ecc_algo 3

Note that header and whole can also detect corrupted files, and even which blocks inside a file are corrupted, but they are much slower than hash.

  • To try to recover a damaged ecc file ecc.txt using an index file ecc.txt.idx (the index file is generated automatically alongside ecc.txt):

pff recover -i "ecc.txt" --index "ecc.txt.idx" -o "ecc_repaired.txt" -l "log.txt" -v -f

  • To try to recover a damaged ecc file ecc.txt without an index file (you can tweak the -t parameter from 0.0 to 1.0, 1.0 producing many false positives):

pff recover -i "ecc.txt" -o "ecc_repaired.txt" -l "log.txt" -v -f -t 0.4

  • To repair your files using multiple duplicated copies that you have stored on different mediums:

pff dup -i "path/to/dir1" "path/to/dir2" "path/to/dir3" -o "path/to/output" --report "rlog.csv" -f -v

  • If you have previously generated a rfigc database, you can use it to enhance the replication repair:

pff dup -i "path/to/dir1" "path/to/dir2" "path/to/dir3" -o "path/to/output" -d "dbhash.csv" --report "rlog.csv" -f -v

  • To run tests on your recovery tools, you can make a Makefile-like configuration file and use the Resiliency Tester submodule:

pff restest -i "your_folder" -o "test_folder" -c "resiliency_tester_config.txt" -m 3 -l "testlog.txt" -f

  • Internally, pff restest uses pff filetamper to tamper files with various schemes, but you can also use pff filetamper directly.

  • To run speedtests of encoding/decoding error correction codes on your machine:

pff speedtest

  • In case the pff command does not work, it can be replaced with python -m pyFileFixity.pff.
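The hash-database workflow above (generate once, check later) can be sketched in plain Python. This is a simplified illustration of the principle, using SHA-256 and a bare CSV; it is not pyFileFixity's actual database format or code, and the function names are hypothetical:

```python
# Simplified sketch of hash-based integrity monitoring, in the spirit of
# `pff hash`: record each file's size and hash once, then re-scan later and
# report any file whose current state no longer matches the record.
import csv
import hashlib
import os

def sha256_of(path: str) -> str:
    """Hash a file incrementally so large files don't need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def generate_db(folder: str, db_path: str) -> None:
    """Walk `folder` and record each file's relative path, size and SHA-256."""
    with open(db_path, "w", newline="") as db:
        writer = csv.writer(db)
        for root, _, files in os.walk(folder):
            for name in files:
                full = os.path.join(root, name)
                rel = os.path.relpath(full, folder)
                writer.writerow([rel, os.path.getsize(full), sha256_of(full)])

def check_db(folder: str, db_path: str) -> list[str]:
    """Return relative paths that are missing or whose size/hash changed."""
    corrupted = []
    with open(db_path, newline="") as db:
        for rel, size, digest in csv.reader(db):
            full = os.path.join(folder, rel)
            if (not os.path.exists(full)
                    or os.path.getsize(full) != int(size)
                    or sha256_of(full) != digest):
                corrupted.append(rel)
    return corrupted
```

Note that, just like pff hash, such a scheme can only detect corruption, not repair file contents; repair requires the redundancy added by the header, whole, or dup tools.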

The problem of long term storage

Why does data get corrupted over time? One sole reason: entropy. Entropy is the universal tendency of systems to become less ordered over time. Data corruption is exactly that: disorder in the order of bits. In other words: the Universe hates your data.

Long term storage is thus a very difficult topic: it's like fighting death (in this case, the death of data). Because of entropy, data will eventually fade away through various silent errors such as bit rot or corruption from cosmic rays. pyFileFixity aims to provide tools to detect any data corruption, and also to fight it by providing repair tools.

The only solution is to apply a long-known engineering principle, the same one that makes bridges and planes safe: add redundancy.

There are only two ways to add redundancy:

  • the simple way is to duplicate the object (also called replication), but for data storage this eats up a lot of space and is not optimal. However, if storage is cheap, this is a good solution, as it is much faster than encoding with error correction codes. For replication to work, at least 3 duplicates are necessary at all times, so that if one fails, it can be replaced as soon as possible. As sailors say: "Either bring 1 compass or 3 compasses, but never two, because then you won't know which one is correct if one fails." Indeed, with 3 duplicates, if you frequently monitor their integrity (e.g., with hashes), then when one fails you can simply do a majority vote: the bit value given by 2 of the 3 duplicates is probably correct.
  • the second way, the optimal tool ever invented to recover from data corruption, is error correction codes (forward error correction): a way to smartly produce redundant codes from your data so that you can later repair it using these additional pieces of information. An ECC generates n blocks for a file cut into k blocks (with k < n), and can then rebuild the whole file from (at least) any k blocks among the n available. In other words, you can correct up to (n-k) erasures (blocks lost at known positions).
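The majority vote described in the first bullet, which is the same principle pff dup applies, can be sketched bytewise in a few lines (a simplified illustration, not pyFileFixity's actual code):

```python
# Bytewise majority vote across 3+ equal-length copies of a file's content:
# for each byte position, keep the value that most copies agree on, so a
# corruption in any single copy is outvoted by the other two.

def majority_vote(copies: list[bytes]) -> bytes:
    """Repair content by keeping, per position, the most frequent byte."""
    assert len(copies) >= 3 and len(set(map(len, copies))) == 1
    repaired = bytearray()
    for column in zip(*copies):  # one tuple of bytes per position
        repaired.append(max(set(column), key=column.count))
    return bytes(repaired)

original = b"long term storage"
copy1 = b"long texm storage"   # corrupted at offset 7
copy2 = b"long term st0rage"   # corrupted at offset 12
copy3 = b"Long term storage"   # corrupted at offset 0
print(majority_vote([copy1, copy2, copy3]) == original)  # True
```

Since each position is corrupted in at most one of the three copies, every vote is won by the correct byte. This is also why two copies are not enough: a 1-1 tie gives no way to tell which copy is right.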
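The (n, k) idea behind error correction codes can be illustrated with the simplest possible erasure code, XOR parity (as used in RAID-5): k data blocks plus one parity block (n = k + 1) let you rebuild any single missing block. Reed-Solomon codes, which pyFileFixity relies on, generalize this to recover any n - k missing blocks; the sketch below is only this toy XOR case, not pyFileFixity's code:

```python
# Toy erasure code: XOR parity over k data blocks. Losing any ONE block
# (an "erasure": a loss at a known position) is recoverable by XOR-ing
# all surviving blocks together, because x ^ x = 0 cancels everything
# except the missing block.
from functools import reduce

def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

data = [b"AAAA", b"BBBB", b"CCCC"]   # k = 3 data blocks
parity = xor_blocks(data)             # 1 redundant block, so n = 4

# Simulate losing data[1] (a known position)...
surviving = [data[0], data[2], parity]
# ...and rebuild it from everything that survived.
recovered = xor_blocks(surviving)
print(recovered == data[1])  # True
```

One parity block can only repair one erasure; real Reed-Solomon codes compute n - k independent redundant blocks instead of one, which is what lets them survive multiple simultaneous losses.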