fclones
Efficient duplicate file finder and remover
This is the repo for command line fclones and its core libraries. For the desktop frontend, see fclones-gui.
fclones is a command line utility that identifies groups of identical files and gets rid
of the file copies you no longer need. It comes with plenty of configuration options for controlling
the search scope and offers many ways of removing duplicates. For maximum flexibility,
it integrates well with other Unix utilities like find and it speaks JSON, so you have a lot
of control over the search and cleanup process.
fclones treats your data seriously. You can inspect and modify the list of duplicate files before removing them.
There is also a --dry-run option that can tell you exactly what changes on the file system would be made.
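As a sketch of a safe workflow, assuming a report file `dupes.txt` produced earlier by `fclones group`:

```shell
# Preview the removal: prints the operations that would be performed,
# but leaves the file system untouched.
fclones remove --dry-run <dupes.txt
```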
fclones has been implemented in Rust with a strong focus on high performance on modern hardware.
It employs several optimization techniques not present in many other programs.
It adapts to the type of storage device, orders file operations by physical data placement on HDDs,
scans the directory tree in parallel and uses prefix compression of paths to reduce memory consumption when working
with millions of files. It is also page-cache friendly and does not push your data out of the cache.
As a result, fclones easily outperforms many other popular duplicate finders by a wide margin
on either SSD or HDD storage.
fclones is available on a wide variety of operating systems, but it works best on Linux.
Features
- Identifying groups of identical files
  - finding duplicate files
  - finding files with more than N replicas
  - finding unique files
  - finding files with fewer than N replicas
- Advanced file selection for reducing the amount of data to process
  - scanning multiple directory roots
  - can work with a list of files piped directly from standard input
  - recursive/non-recursive file selection
  - recursion depth limit
  - filtering names and paths by extended Unix globs
  - filtering names and paths by regular expressions
  - filtering by min/max file size
  - proper handling of symlinks and hardlinks
- Removing redundant data
  - removing, moving or replacing files with soft or hard links
  - removing redundant file data using native copy-on-write (reflink) support on some file systems
  - selecting files for removal by path or name patterns
  - prioritizing files to remove by creation, modification or last access time, or by nesting level
- High performance
  - parallel processing in all I/O- and CPU-heavy stages
  - automatic tuning of parallelism and access strategy based on device type (SSD vs HDD)
  - low memory footprint thanks to a heavily optimized path representation
  - variety of fast non-cryptographic and cryptographic hash functions up to 512 bits wide
  - doesn't push data out of the page cache (Linux only)
  - optional persistent caching of file hashes
  - accurate progress reporting
- Variety of output formats for easy further processing of results
  - standard text format
    - groups separated by group headers with file size and hash
    - one path per line in a group
  - optional `fdupes` compatibility (no headers, no indent, groups separated by blank lines)
  - machine-readable formats: `CSV`, `JSON`
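As a sketch, the report format can be chosen with the `--format` option (recent versions accept `default`, `fdupes`, `csv` and `json`; the exact set of values may differ between releases):

```shell
# machine-readable JSON report, e.g. for consumption by scripts
fclones group . --format json >dupes.json

# fdupes-style output: no headers, groups separated by blank lines
fclones group . --format fdupes
```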
Limitations
Copy-on-write file data deduplication (reflink) is not supported on Windows.
Some optimisations are not available on platforms other than Linux:
- ordering of file accesses by physical placement
- page-cache drop-behind
Demo
Let's first create some files:
```
$ mkdir test
$ cd test
$ echo foo >foo1.txt
$ echo foo >foo2.txt
$ echo foo >foo3.txt
$ echo bar >bar1.txt
$ echo bar >bar2.txt
```
Now let's identify the duplicates:
```
$ fclones group . >dupes.txt
[2021-06-05 18:21:33.358] fclones: info: Started grouping
[2021-06-05 18:21:33.738] fclones: info: Scanned 7 file entries
[2021-06-05 18:21:33.738] fclones: info: Found 5 (20 B) files matching selection criteria
[2021-06-05 18:21:33.738] fclones: info: Found 4 (16 B) candidates after grouping by size
[2021-06-05 18:21:33.738] fclones: info: Found 4 (16 B) candidates after grouping by paths and file identifiers
[2021-06-05 18:21:33.739] fclones: info: Found 3 (12 B) candidates after grouping by prefix
[2021-06-05 18:21:33.740] fclones: info: Found 3 (12 B) candidates after grouping by suffix
[2021-06-05 18:21:33.741] fclones: info: Found 3 (12 B) redundant files
```
```
$ cat dupes.txt
# Report by fclones 0.12.0
# Timestamp: 2021-06-05 18:21:33.741 +0200
# Command: fclones group .
# Found 2 file groups
# 12 B (12 B) in 3 redundant files can be removed
7d6ebf613bf94dfd976d169ff6ae02c3, 4 B (4 B) * 2:
    /tmp/test/bar1.txt
    /tmp/test/bar2.txt
6109f093b3fd5eb1060989c990d1226f, 4 B (4 B) * 3:
    /tmp/test/foo1.txt
    /tmp/test/foo2.txt
    /tmp/test/foo3.txt
```
Finally we can replace the duplicates by soft links:
```
$ fclones link --soft <dupes.txt
[2021-06-05 18:25:42.488] fclones: info: Started deduplicating
[2021-06-05 18:25:42.493] fclones: info: Processed 3 files and reclaimed 12 B space
$ ls -l
total 12
-rw-rw-r-- 1 pkolaczk pkolaczk   4 cze  5 18:19 bar1.txt
lrwxrwxrwx 1 pkolaczk pkolaczk  18 cze  5 18:25 bar2.txt -> /tmp/test/bar1.txt
-rw-rw-r-- 1 pkolaczk pkolaczk 382 cze  5 18:21 dupes.txt
-rw-rw-r-- 1 pkolaczk pkolaczk   4 cze  5 18:19 foo1.txt
lrwxrwxrwx 1 pkolaczk pkolaczk  18 cze  5 18:25 foo2.txt -> /tmp/test/foo1.txt
lrwxrwxrwx 1 pkolaczk pkolaczk  18 cze  5 18:25 foo3.txt -> /tmp/test/foo1.txt
```
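If links are not wanted, the same report also drives removal; a sketch reusing the `dupes.txt` report from above (one file per group is kept):

```shell
# delete all redundant copies listed in the report,
# keeping one file in each group
fclones remove <dupes.txt
```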
Installation
The code has been thoroughly tested on Ubuntu Linux 21.10. Other systems, such as Windows or macOS, and other architectures may work. Help with testing and/or porting to other platforms is welcome. Please report successes as well as failures.
Official Packages
Snap store (Linux):

```
snap install fclones
```

Homebrew (macOS and Linux):

```
brew install fclones
```
Installation packages and binaries for some platforms are also attached directly to Releases.
Third-party Packages
Building from Source
Install the Rust toolchain and then run:

```
cargo install fclones
```

The build will write the binary to `.cargo/bin/fclones`.
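If `~/.cargo/bin` is not already on your `PATH`, add it in your shell configuration; for bash-like shells:

```shell
# make cargo-installed binaries, including fclones, visible to the shell
export PATH="$HOME/.cargo/bin:$PATH"
```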
Shell completions
fclones supports shell completions but you have to set it up manually at the moment,
which can be done by adding the script printed by the fclones complete subcommand to your shell configuration.
All shells supported by clap_complete are supported.
At the time of writing this includes:
- Bash: add `eval "$(fclones complete bash)"` to your `~/.bashrc`
- Zsh: add `source <(fclones complete zsh)` to your `~/.zshrc`
- Fish: add `fclones complete fish | source` to your `~/.config/fish/config.fish`
- Elvish
- PowerShell
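Alternatively, the completion script can be installed once as a file instead of being regenerated on every shell start; a sketch for bash (the target path is an assumption, adjust it to your distribution):

```shell
# write the bash completion script to the user completion directory,
# where bash-completion typically picks it up automatically
mkdir -p ~/.local/share/bash-completion/completions
fclones complete bash >~/.local/share/bash-completion/completions/fclones
```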
Usage
fclones offers separate commands for finding and removing files. This way, you can inspect
the list of found files before applying any modifications to the file system.
- `group` – identifies groups of identical files and prints them to the standard output
- `remove` – removes redundant files identified earlier by `group`
- `link` – replaces redundant files with links (default: hard links)
- `dedupe` – does not remove any files, but deduplicates file data by using the native copy-on-write (reflink) capabilities of the file system
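Because `group` writes its report to standard output and the other commands read it from standard input, the steps compose into a single pipeline; a minimal sketch:

```shell
# find duplicates under the current directory and replace them with
# hard links in one go, with no intermediate report file
fclones group . | fclones link
```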
Finding Files
Find duplicate, unique, under-replicated or over-replicated files in the current directory, including subdirectories:
```
fclones group .
fclones group . --unique
fclones group . --rf-under 3
fclones group . --rf-over 3
```
You can search in multiple directories:
```
fclones group dir1 dir2 dir3
```
By default, hidden files and files matching patterns listed in .gitignore and .fdignore are
ignored. To search all files, use:
```
fclones group --no-ignore --hidden dir
```
Limit the recursion depth:
```
fclones group . --depth 1   # scan only files in the current dir, skip subdirs
fclones group * --depth 0   # similar as above in shells that expand `*`
```
Caution: Versions up to 0.10 did not descend into directories by default.
In those old versions, add the `-R` flag to enable recursive directory walking.
Finding files that match across two directory trees, without matching identical files within each tree:
```
fclones group --isolate dir1 dir2
```
Finding duplicate files of size at least 100 MB:
```
fclones group . -s 100M
```
Filter by file name or path pattern:
```
fclones group . --name '*.jpg' '*.png'
```
Run fclones on files selected by find (note: this is likely slower than built-in filtering):
```
find . -name '*.c' | fclones group --stdin --depth 0
```
Follow symbolic links, but don't e
