Dcd
Duplicate code detector with fuzzy matching, gap tolerance, and interactive codebase visualization
Duplicate Code Detector (dcd)
A tool, similar to Simian, that identifies duplicate code within a project. Unlike Simian, it is released under a free software license.
Licensed under GNU Affero General Public License 3.0.
Support
Using dcd commercially? If you want priority support for dcd, you can purchase a year's worth at https://boyter.gumroad.com/l/wajuc, which entitles you to priority direct email support from the developer.
Install
Go Get
If you are comfortable using Go and have Go 1.19 or later installed:
go install github.com/boyter/dcd@latest
Manual
Binaries for GNU/Linux and macOS, covering i386, x86_64, and ARM64 machines, are available from the releases page.
Pitch
Why use dcd?
- It's reasonably fast and works with large projects
- Works very well across multiple platforms without slowdown (GNU/Linux, macOS)
- Supports fuzzy matching to catch near-duplicate lines
- Supports gap tolerance to find duplicate blocks even when lines have been inserted, deleted, or modified
- Can compare a single file against the rest of a codebase
- Can generate PBM scatter plot visualizations of the comparison matrix between two files
- Supports ignoring marked blocks of code (e.g. generated code) via configurable markers
Usage
Command line usage of dcd is designed to be as simple as possible.
Full details can be found in dcd --help or dcd -h. Note that the output below reflects the state of master, not a release.
$ dcd -h
dcd
Version 1.1.0
Ben Boyter <ben@boyter.org>
Usage:
dcd [flags]
Flags:
--duplicates-both-ways report duplicates from both file perspectives (default reports each pair once)
-x, --exclude-pattern strings file and directory locations matching case sensitive patterns will be ignored [comma separated list: e.g. vendor,_test.go]
--file string compare a single file against the rest of the codebase
--format string output format: text (default), json, or html
-f, --fuzz uint8 fuzzy value where higher numbers allow increasingly fuzzy lines to match, values 0-255 where 0 indicates exact match
-g, --gap-tolerance int allow gaps of up to N lines when matching duplicate blocks (0 = no gaps allowed)
-h, --help help for dcd
--ignore-blocks-end string marker string to stop ignoring lines (e.g. duplicate-enable)
--ignore-blocks-start string marker string to start ignoring lines (e.g. duplicate-disable)
--max-hole-size int allow up to N consecutive modified lines (holes) within a duplicate diagonal (0 = no holes allowed)
-i, --include-ext strings limit to file extensions [comma separated list: e.g. go,java,js]
-m, --match-length int min match length (default 6)
--max-gap-bridges int maximum number of gap bridges allowed per duplicate match (default 1)
--max-read-size-bytes int number of bytes to read into a file with the remaining content ignored (default 10000000)
--min-line-length int number of bytes per average line for file to be considered minified (default 255)
--no-gitignore disables .gitignore file logic
--no-ignore disables .ignore file logic
--pbm-file-a string first file to compare for PBM scatter plot output
--pbm-file-b string second file to compare for PBM scatter plot output
--pbm-output string output path for PBM scatter plot file
--process-same-file find duplicate blocks within the same file
-v, --verbose verbose output
--version version for dcd
Basic usage
Running dcd with no arguments scans the current directory for duplicate code blocks:
$ dcd
Found duplicate lines in processor/cocomo_test.go:
lines 0-8 match 0-8 in processor/workers_tokei_test.go (length 8)
Found duplicate lines in processor/detector_test.go:
lines 0-8 match 0-8 in processor/processor_test.go (length 8)
Found duplicate lines in processor/filereader.go:
lines 0-7 match 0-7 in processor/workers.go (length 7)
Found 98634 duplicate lines in 140 files
You can also pass a directory path: dcd /path/to/project.
Fuzzy matching
By default, dcd requires exact line matches. The --fuzz (-f) flag enables fuzzy matching using simhash distance, allowing lines that are similar but not identical to be treated as matches.
The value ranges from 0 to 255, where 0 means exact match and higher values allow increasingly fuzzy matches. Low values (1-3) catch minor differences like variable renames or whitespace changes. Higher values catch more significant changes but may produce false positives.
# Find near-duplicate code with slight differences
$ dcd -f 2
# More permissive fuzzy matching
$ dcd -f 5
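The idea of comparing lines by simhash distance can be sketched in Go as follows. This is a minimal illustration, not dcd's actual implementation: the 3-character shingles, the FNV-1a hash, the 64-bit width, the normalization step, and the names normalize, simhash64, hamming, and linesMatch are all assumptions made for the example. It only shows how a fuzz threshold could act as a maximum Hamming distance between two line hashes.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/bits"
	"strings"
)

// normalize lowercases a line and strips all whitespace. This is an
// assumption about how lines are prepared before hashing.
func normalize(line string) string {
	return strings.ToLower(strings.Join(strings.Fields(line), ""))
}

// simhash64 builds a 64-bit simhash from the 3-character shingles of a line.
// The shingle size and the FNV-1a hash are illustrative choices only.
// Lines shorter than 3 characters have no shingles and hash to 0.
func simhash64(line string) uint64 {
	r := []rune(normalize(line))
	var votes [64]int
	for i := 0; i+3 <= len(r); i++ {
		h := fnv.New64a()
		h.Write([]byte(string(r[i : i+3])))
		v := h.Sum64()
		// Each shingle votes for or against every bit position.
		for b := 0; b < 64; b++ {
			if v&(uint64(1)<<b) != 0 {
				votes[b]++
			} else {
				votes[b]--
			}
		}
	}
	var out uint64
	for b := 0; b < 64; b++ {
		if votes[b] > 0 {
			out |= uint64(1) << b
		}
	}
	return out
}

// hamming counts the bits that differ between two hashes.
func hamming(x, y uint64) int {
	return bits.OnesCount64(x ^ y)
}

// linesMatch reports whether two lines are within fuzz differing bits.
func linesMatch(a, b string, fuzz uint8) bool {
	return hamming(simhash64(a), simhash64(b)) <= int(fuzz)
}

func main() {
	a := "total := price * quantity"
	b := "total := cost * quantity"
	fmt.Println("distance:", hamming(simhash64(a), simhash64(b)))
	fmt.Println("match at fuzz 0:", linesMatch(a, b, 0))
	fmt.Println("match at fuzz 255:", linesMatch(a, b, 255))
}
```

In this sketch a fuzz value of 0 requires identical hashes, and larger values tolerate more differing bits, which mirrors the trade-off between catching small edits and producing false positives.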
Gap tolerance
The --gap-tolerance (-g) flag allows dcd to bridge over small gaps in otherwise matching blocks. This catches duplicate blocks where a few lines have been inserted, deleted, or modified in one copy.
When set to N, the algorithm searches up to N positions ahead in both source and target to find the next matching line, bridging over the gap. The --match-length requirement still applies to the number of actual matching lines, regardless of any gaps bridged.
# Allow gaps of up to 2 lines within duplicate blocks
$ dcd -g 2
# Allow larger gaps with multiple bridges
$ dcd -g 3 --max-gap-bridges 3
The --max-gap-bridges flag (default 1) controls how many gaps can be bridged within a single duplicate block. Increasing it makes matching more permissive but also noisier.
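The look-ahead behaviour described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not dcd's code: lines are compared with plain string equality instead of hashes, the nearest matching pair is chosen by the smallest combined skip, and extendMatch and findBridge are hypothetical names.

```go
package main

import "fmt"

// findBridge looks up to gap positions ahead in both a and b for the
// nearest pair of equal lines. It returns how far to advance in each.
func findBridge(a, b []string, i, j, gap int) (di, dj int, ok bool) {
	best := -1
	for x := 0; x <= gap; x++ {
		for y := 0; y <= gap; y++ {
			if x == 0 && y == 0 {
				continue // the current pair is already known to differ
			}
			if i+x >= len(a) || j+y >= len(b) {
				continue
			}
			if a[i+x] == b[j+y] && (best == -1 || x+y < best) {
				best, di, dj, ok = x+y, x, y, true
			}
		}
	}
	return di, dj, ok
}

// extendMatch walks forward from (i, j), counting matching lines and
// bridging at most maxBridges gaps of up to gap lines each.
func extendMatch(a, b []string, i, j, gap, maxBridges int) (matched, bridges int) {
	for i < len(a) && j < len(b) {
		if a[i] == b[j] {
			matched++
			i++
			j++
			continue
		}
		if bridges >= maxBridges {
			break
		}
		di, dj, ok := findBridge(a, b, i, j, gap)
		if !ok {
			break
		}
		i += di
		j += dj
		bridges++
	}
	return matched, bridges
}

func main() {
	source := []string{"a", "b", "c", "inserted", "d", "e"}
	target := []string{"a", "b", "c", "d", "e"}
	matched, bridges := extendMatch(source, target, 0, 0, 2, 1)
	fmt.Println(matched, bridges) // prints "5 1"
}
```

Only lines that actually match are counted, so a minimum match length would be checked against matched rather than against the span of lines covered.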
Hole tolerance
The --max-hole-size flag allows dcd to skip over modified lines within a diagonal match — lines that stayed in the same position but were changed. This is directly inspired by the Ducasse et al. paper, where holes in diagonal patterns represent in-place modifications.
# Allow up to 2 consecutive modified lines within a match
$ dcd --max-hole-size 2
Holes differ from gaps:
- Holes (--max-hole-size): lines modified in place; the diagonal continues straight but some cells don't match
- Gaps (--gap-tolerance): lines inserted or deleted; the diagonal shifts to a new position
All three mechanisms are orthogonal and compose together: --fuzz controls line-level similarity, --max-hole-size handles in-place modifications, and --gap-tolerance handles insertions/deletions.
# Maximum duplicate detection: fuzzy lines, holes, and gap bridging
$ dcd -f 2 --max-hole-size 2 -g 2
When holes or gaps are present, the output includes counts:
Found duplicate lines in fileA.go:
lines 10-25 match 30-46 in fileB.go (matching lines 14, holes 2)
lines 50-68 match 80-100 in fileB.go (matching lines 15, holes 1, gaps 3)
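Hole handling along a single diagonal can be sketched as below. This is an illustration under stated assumptions rather than dcd's implementation: walkDiagonal is a hypothetical name, lines are compared by string equality, a hole is counted per modified line, and mismatches at the very end of a diagonal are not counted because no matching line follows them.

```go
package main

import "fmt"

// walkDiagonal follows the diagonal starting at (i, j), tolerating runs of
// up to maxHole consecutive non-matching lines between matching lines.
func walkDiagonal(a, b []string, i, j, maxHole int) (matched, holes int) {
	pending := 0 // mismatches seen since the last matching line
	for i < len(a) && j < len(b) {
		if a[i] == b[j] {
			matched++
			holes += pending // the run was enclosed by matches, so it is a hole
			pending = 0
		} else {
			pending++
			if pending > maxHole {
				break // run too long: the diagonal ends before it
			}
		}
		i++
		j++
	}
	return matched, holes
}

func main() {
	fileA := []string{"a", "b", "x := 1", "c", "d"}
	fileB := []string{"a", "b", "x := 2", "c", "d"}
	matched, holes := walkDiagonal(fileA, fileB, 0, 0, 1)
	fmt.Println(matched, holes) // prints "4 1"
}
```

Unlike a gap bridge, both positions always advance together here, so the match never leaves its diagonal.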
Single file comparison
The --file flag compares a single file against the rest of the codebase, useful for checking whether a specific file contains code duplicated elsewhere:
$ dcd --file src/utils.go
PBM scatter plot
The --pbm-file-a, --pbm-file-b, and --pbm-output flags generate a PBM (Portable Bitmap) scatter plot of the comparison matrix between two files. This is directly inspired by the scatter plot visualization described in the Ducasse et al. paper — diagonals represent copied code, holes represent in-place modifications, and broken diagonals represent insertions/deletions.
All three flags must be specified together. When set, normal duplicate scanning is skipped and only the PBM file is produced.
# Compare two files and generate a scatter plot
$ dcd --pbm-file-a src/utils.go --pbm-file-b src/helpers.go --pbm-output scatter.pbm
# Self-comparison shows the main diagonal plus any internal duplication
$ dcd --pbm-file-a processor.go --pbm-file-b processor.go --pbm-output self.pbm
# Combine with fuzzy matching for a denser visualization
$ dcd --pbm-file-a fileA.go --pbm-file-b fileB.go --pbm-output fuzzy.pbm -f 2
The output is a P1 ASCII PBM file where each pixel represents a line pair: black (1) means the lines match, white (0) means they don't. The image can be viewed with any image viewer that supports PBM (GIMP, feh, ImageMagick's display, etc.).
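As a rough sketch of that format, the following Go program renders a comparison matrix as plain (P1) PBM text. It is not dcd's code: renderPBM is a hypothetical name, lines are compared by string equality, and mapping file A to columns and file B to rows is an assumption. The sketch also does not wrap long rows, although the plain PBM format recommends lines of at most 70 characters.

```go
package main

import (
	"fmt"
	"strings"
)

// renderPBM returns a plain PBM (P1) image with one pixel per line pair:
// 1 where line x of a equals line y of b, and 0 otherwise.
func renderPBM(a, b []string) string {
	var sb strings.Builder
	// Header: magic number, then width (lines in a) and height (lines in b).
	fmt.Fprintf(&sb, "P1\n%d %d\n", len(a), len(b))
	for _, lineB := range b {
		cells := make([]string, len(a))
		for x, lineA := range a {
			if lineA == lineB {
				cells[x] = "1"
			} else {
				cells[x] = "0"
			}
		}
		sb.WriteString(strings.Join(cells, " "))
		sb.WriteByte('\n')
	}
	return sb.String()
}

func main() {
	// Self-comparison: the main diagonal plus one internal duplicate line.
	lines := []string{"a", "b", "a"}
	fmt.Print(renderPBM(lines, lines))
}
```

For the three-line self-comparison in main, the rows are 1 0 1, 0 1 0, and 1 0 1: the main diagonal is set, and the repeated line adds the two off-diagonal pixels.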
[Figure: three example scatter plots: duplicate code (self-comparison), totally different files, and some duplicate/copied code.]
Same-file duplicates
By default, dcd only compares different files. Use --process-same-file to also find duplicate blocks within the same file:
$ dcd --process-same-file
Ignoring blocks of code
The --ignore-blocks-start and --ignore-blocks-end flags let you mark regions of code that should be excluded from duplicate detection. This is similar to Simian's ignore feature. Lines between (and including) the start and end markers are zeroed out so the duplicate detector skips them.
Both flags must be specified together. The markers are matched case-insensitively against the normalized (lowercased, whitespace-stripped) line content using substring matching.
# Ignore lines between "duplicate-disable" and "duplicate-enable" markers
$ dcd --ignore-blocks-start duplicate-disable --ignore-blocks-end duplicate-ena
