fclones
Efficient duplicate file finder and remover
This is the repo for command line fclones and its core libraries. For the desktop frontend, see fclones-gui.
fclones is a command line utility that identifies groups of identical files and gets rid
of the file copies you no longer need. It comes with plenty of configuration options for controlling
the search scope and offers many ways of removing duplicates. For maximum flexibility,
it integrates well with other Unix utilities like find and it speaks JSON, so you have a lot
of control over the search and cleanup process.
fclones treats your data seriously. You can inspect and modify the list of duplicate files before removing them.
There is also a --dry-run option that can tell you exactly what changes on the file system would be made.
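As a sketch of a safe workflow, assuming a report file `dupes.txt` produced earlier by `fclones group`:

```shell
# Preview the removal: prints the operations that would be performed,
# but leaves the file system untouched.
fclones remove --dry-run <dupes.txt
```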
fclones has been implemented in Rust with a strong focus on high performance on modern hardware.
It employs several optimization techniques not present in many other programs.
It adapts to the type of storage device, orders file operations by physical data placement on HDDs,
scans the directory tree in parallel and uses prefix compression of paths to reduce memory consumption when working
with millions of files. It is also page-cache friendly and does not push your data out of the cache.
As a result, fclones easily outperforms many other popular duplicate finders by a wide margin
on either SSD or HDD storage.
fclones is available on a wide variety of operating systems, but it works best on Linux.
Features
- Identifying groups of identical files
  - finding duplicate files
  - finding files with more than N replicas
  - finding unique files
  - finding files with fewer than N replicas
- Advanced file selection for reducing the amount of data to process
  - scanning multiple directory roots
  - can work with a list of files piped directly from standard input
  - recursive/non-recursive file selection
  - recursion depth limit
  - filtering names and paths by extended Unix globs
  - filtering names and paths by regular expressions
  - filtering by min/max file size
  - proper handling of symlinks and hardlinks
- Removing redundant data
  - removing, moving or replacing files with soft or hard links
  - removing redundant file data using native copy-on-write (reflink) support on some file systems
  - selecting files for removal by path or name patterns
  - prioritizing files to remove by creation, modification or last access time, or by nesting level
- High performance
  - parallel processing in all I/O- and CPU-heavy stages
  - automatic tuning of parallelism and access strategy based on device type (SSD vs HDD)
  - low memory footprint thanks to a heavily optimized path representation
  - variety of fast non-cryptographic and cryptographic hash functions up to 512 bits wide
  - doesn't push data out of the page cache (Linux only)
  - optional persistent caching of file hashes
  - accurate progress reporting
- Variety of output formats for easy further processing of results
  - standard text format
    - groups separated by group headers with file size and hash
    - one path per line in a group
  - optional `fdupes` compatibility (no headers, no indent, groups separated by blank lines)
  - machine-readable formats: `CSV`, `JSON`
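As a sketch, the report format can be chosen with the `--format` option (recent versions accept `default`, `fdupes`, `csv` and `json`; the exact set of values may differ between releases):

```shell
# machine-readable JSON report, e.g. for consumption by scripts
fclones group . --format json >dupes.json

# fdupes-style output: no headers, groups separated by blank lines
fclones group . --format fdupes
```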
Limitations
Copy-on-write file data deduplication (reflink) is not supported on Windows.
Some optimisations are not available on platforms other than Linux:
- ordering of file accesses by physical placement
- page-cache drop-behind
Demo
Let's first create some files:
```
$ mkdir test
$ cd test
$ echo foo >foo1.txt
$ echo foo >foo2.txt
$ echo foo >foo3.txt
$ echo bar >bar1.txt
$ echo bar >bar2.txt
```
Now let's identify the duplicates:
```
$ fclones group . >dupes.txt
[2021-06-05 18:21:33.358] fclones: info: Started grouping
[2021-06-05 18:21:33.738] fclones: info: Scanned 7 file entries
[2021-06-05 18:21:33.738] fclones: info: Found 5 (20 B) files matching selection criteria
[2021-06-05 18:21:33.738] fclones: info: Found 4 (16 B) candidates after grouping by size
[2021-06-05 18:21:33.738] fclones: info: Found 4 (16 B) candidates after grouping by paths and file identifiers
[2021-06-05 18:21:33.739] fclones: info: Found 3 (12 B) candidates after grouping by prefix
[2021-06-05 18:21:33.740] fclones: info: Found 3 (12 B) candidates after grouping by suffix
[2021-06-05 18:21:33.741] fclones: info: Found 3 (12 B) redundant files
```
```
$ cat dupes.txt
# Report by fclones 0.12.0
# Timestamp: 2021-06-05 18:21:33.741 +0200
# Command: fclones group .
# Found 2 file groups
# 12 B (12 B) in 3 redundant files can be removed
7d6ebf613bf94dfd976d169ff6ae02c3, 4 B (4 B) * 2:
    /tmp/test/bar1.txt
    /tmp/test/bar2.txt
6109f093b3fd5eb1060989c990d1226f, 4 B (4 B) * 3:
    /tmp/test/foo1.txt
    /tmp/test/foo2.txt
    /tmp/test/foo3.txt
```
Finally we can replace the duplicates by soft links:
```
$ fclones link --soft <dupes.txt
[2021-06-05 18:25:42.488] fclones: info: Started deduplicating
[2021-06-05 18:25:42.493] fclones: info: Processed 3 files and reclaimed 12 B space
$ ls -l
total 12
-rw-rw-r-- 1 pkolaczk pkolaczk   4 cze  5 18:19 bar1.txt
lrwxrwxrwx 1 pkolaczk pkolaczk  18 cze  5 18:25 bar2.txt -> /tmp/test/bar1.txt
-rw-rw-r-- 1 pkolaczk pkolaczk 382 cze  5 18:21 dupes.txt
-rw-rw-r-- 1 pkolaczk pkolaczk   4 cze  5 18:19 foo1.txt
lrwxrwxrwx 1 pkolaczk pkolaczk  18 cze  5 18:25 foo2.txt -> /tmp/test/foo1.txt
lrwxrwxrwx 1 pkolaczk pkolaczk  18 cze  5 18:25 foo3.txt -> /tmp/test/foo1.txt
```
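If links are not wanted, the same report also drives removal; a sketch reusing the `dupes.txt` report from above (one file per group is kept):

```shell
# delete all redundant copies listed in the report,
# keeping one file in each group
fclones remove <dupes.txt
```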
Installation
The code has been thoroughly tested on Ubuntu Linux 21.10. Other systems, such as Windows or macOS, and other architectures may work. Help with testing and/or porting to other platforms is welcome. Please report successes as well as failures.
Official Packages
Snap store (Linux):

```
snap install fclones
```

Homebrew (macOS and Linux):

```
brew install fclones
```
Installation packages and binaries for some platforms are also attached directly to Releases.
Third-party Packages
Building from Source
Install the Rust toolchain and then run:

```
cargo install fclones
```

The build will write the binary to `.cargo/bin/fclones`.
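If `~/.cargo/bin` is not already on your `PATH`, add it in your shell configuration; for bash-like shells:

```shell
# make cargo-installed binaries, including fclones, visible to the shell
export PATH="$HOME/.cargo/bin:$PATH"
```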
Shell completions
fclones supports shell completions but you have to set it up manually at the moment,
which can be done by adding the script printed by the fclones complete subcommand to your shell configuration.
All shells supported by clap_complete are supported.
At the time of writing this includes:
- Bash: add `eval "$(fclones complete bash)"` to your `~/.bashrc`
- Zsh: add `source <(fclones complete zsh)` to your `~/.zshrc`
- Fish: add `fclones complete fish | source` to your `~/.config/fish/config.fish`
- Elvish
- PowerShell
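Alternatively, the completion script can be installed once as a file instead of being regenerated on every shell start; a sketch for bash (the target path is an assumption, adjust it to your distribution):

```shell
# write the bash completion script to the user completion directory,
# where bash-completion typically picks it up automatically
mkdir -p ~/.local/share/bash-completion/completions
fclones complete bash >~/.local/share/bash-completion/completions/fclones
```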
Usage
fclones offers separate commands for finding and removing files. This way, you can inspect
the list of found files before applying any modifications to the file system.
- `group` – identifies groups of identical files and prints them to the standard output
- `remove` – removes redundant files identified earlier by `group`
- `link` – replaces redundant files with links (default: hard links)
- `dedupe` – does not remove any files, but deduplicates file data by using the native copy-on-write (reflink) capabilities of the file system
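Because `group` writes its report to standard output and the other commands read it from standard input, the steps compose into a single pipeline; a minimal sketch:

```shell
# find duplicates under the current directory and replace them with
# hard links in one go, with no intermediate report file
fclones group . | fclones link
```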
Finding Files
Find duplicate, unique, under-replicated or over-replicated files in the current directory, including subdirectories:
```
fclones group .
fclones group . --unique
fclones group . --rf-under 3
fclones group . --rf-over 3
```
You can search in multiple directories:
```
fclones group dir1 dir2 dir3
```
By default, hidden files and files matching patterns listed in .gitignore and .fdignore are
ignored. To search all files, use:
```
fclones group --no-ignore --hidden dir
```
Limit the recursion depth:
```
fclones group . --depth 1   # scan only files in the current dir, skip subdirs
fclones group * --depth 0   # similar as above in shells that expand `*`
```
Caution: Versions up to 0.10 did not descend into directories by default.
In those old versions, add the `-R` flag to enable recursive directory walking.
Finding files that match across two directory trees, without matching identical files within each tree:
```
fclones group --isolate dir1 dir2
```
Finding duplicate files of size at least 100 MB:
```
fclones group . -s 100M
```
Filter by file name or path pattern:
```
fclones group . --name '*.jpg' '*.png'
```
Run fclones on files selected by find (note: this is likely slower than built-in filtering):
```
find . -name '*.c' | fclones group --stdin --depth 0
```
Follow symbolic links, but don't e
