Dcd

Duplicate code detector with fuzzy matching, gap tolerance, and interactive codebase visualization


Duplicate Code Detector (dcd)

dcd is a tool, similar to Simian, that identifies duplicate code within a project. It is, however, released under a free software license.

Licensed under GNU Affero General Public License 3.0.

Support

Using dcd commercially? If you want priority support for dcd, you can purchase a year's worth at https://boyter.gumroad.com/l/wajuc, which entitles you to priority direct email support from the developer.

Install

Go Get

If you are comfortable using Go and have >= 1.19 installed:

go install github.com/boyter/dcd@latest

Manual

Binaries for GNU/Linux and macOS on i386, x86_64, and ARM64 machines are available from the releases page.

Pitch

Why use dcd?

It's reasonably fast and works with large projects
Works very well across multiple platforms without slowdown (GNU/Linux, macOS)
Supports fuzzy matching to catch near-duplicate lines
Supports gap tolerance to find duplicate blocks even when lines have been inserted, deleted, or modified
Can compare a single file against the rest of a codebase
Can generate PBM scatter plot visualizations of the comparison matrix between two files
Supports ignoring marked blocks of code (e.g. generated code) via configurable markers

Usage

Command line usage of dcd is designed to be as simple as possible. Full details can be found in dcd --help or dcd -h. Note that the output below reflects the state of master, not a release.

$ dcd -h
dcd
Version 1.1.0
Ben Boyter <ben@boyter.org>

Usage:
  dcd [flags]

Flags:
      --duplicates-both-ways         report duplicates from both file perspectives (default reports each pair once)
  -x, --exclude-pattern strings      file and directory locations matching case sensitive patterns will be ignored [comma separated list: e.g. vendor,_test.go]
      --file string                  compare a single file against the rest of the codebase
      --format string                output format: text (default), json, or html
  -f, --fuzz uint8                   fuzzy value where higher numbers allow increasingly fuzzy lines to match, values 0-255 where 0 indicates exact match
  -g, --gap-tolerance int            allow gaps of up to N lines when matching duplicate blocks (0 = no gaps allowed)
  -h, --help                         help for dcd
      --ignore-blocks-end string     marker string to stop ignoring lines (e.g. duplicate-enable)
      --ignore-blocks-start string   marker string to start ignoring lines (e.g. duplicate-disable)
      --max-hole-size int            allow up to N consecutive modified lines (holes) within a duplicate diagonal (0 = no holes allowed)
  -i, --include-ext strings          limit to file extensions [comma separated list: e.g. go,java,js]
  -m, --match-length int             min match length (default 6)
      --max-gap-bridges int          maximum number of gap bridges allowed per duplicate match (default 1)
      --max-read-size-bytes int      number of bytes to read into a file with the remaining content ignored (default 10000000)
      --min-line-length int          number of bytes per average line for file to be considered minified (default 255)
      --no-gitignore                 disables .gitignore file logic
      --no-ignore                    disables .ignore file logic
      --pbm-file-a string            first file to compare for PBM scatter plot output
      --pbm-file-b string            second file to compare for PBM scatter plot output
      --pbm-output string            output path for PBM scatter plot file
      --process-same-file            find duplicate blocks within the same file
  -v, --verbose                      verbose output
      --version                      version for dcd

Basic usage

Running dcd with no arguments scans the current directory for duplicate code blocks:

$ dcd
Found duplicate lines in processor/cocomo_test.go:
 lines 0-8 match 0-8 in processor/workers_tokei_test.go (length 8)
Found duplicate lines in processor/detector_test.go:
 lines 0-8 match 0-8 in processor/processor_test.go (length 8)
Found duplicate lines in processor/filereader.go:
 lines 0-7 match 0-7 in processor/workers.go (length 7)

Found 98634 duplicate lines in 140 files

You can also pass a directory path: dcd /path/to/project.

Fuzzy matching

By default, dcd requires exact line matches. The --fuzz (-f) flag enables fuzzy matching using simhash distance, allowing lines that are similar but not identical to be treated as matches.

The value ranges from 0 to 255, where 0 means exact match and higher values allow increasingly fuzzy matches. Low values (1-3) catch minor differences like variable renames or whitespace changes. Higher values catch more significant changes but may produce false positives.

# Find near-duplicate code with slight differences
$ dcd -f 2

# More permissive fuzzy matching
$ dcd -f 5
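
As a rough illustration of line-level fuzzy matching, the Go sketch below computes a 64-bit simhash per line and treats two lines as a match when the Hamming distance between the hashes is at most the fuzz value. The tokenization (lowercased, whitespace-split), the FNV hash, the 64-bit width, and the names simhash64, hammingDistance, and linesMatch are assumptions made for this example; dcd's actual implementation may differ.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/bits"
	"strings"
)

// simhash64 builds a 64-bit simhash from the lowercased,
// whitespace-separated tokens of a line (an assumed tokenization).
func simhash64(line string) uint64 {
	var counts [64]int
	for _, tok := range strings.Fields(strings.ToLower(line)) {
		h := fnv.New64a()
		h.Write([]byte(tok))
		sum := h.Sum64()
		// Each token votes +1 or -1 on every bit position.
		for i := 0; i < 64; i++ {
			if sum&(1<<uint(i)) != 0 {
				counts[i]++
			} else {
				counts[i]--
			}
		}
	}
	var out uint64
	for i, c := range counts {
		if c > 0 {
			out |= 1 << uint(i)
		}
	}
	return out
}

// hammingDistance counts the bits that differ between two hashes.
func hammingDistance(a, b uint64) int {
	return bits.OnesCount64(a ^ b)
}

// linesMatch reports whether two lines are within the given fuzz distance.
// A fuzz of 0 only accepts lines whose hashes are identical.
func linesMatch(a, b string, fuzz uint8) bool {
	return hammingDistance(simhash64(a), simhash64(b)) <= int(fuzz)
}

func main() {
	a := "total := price * quantity"
	b := "total := cost * quantity"
	fmt.Println("distance:", hammingDistance(simhash64(a), simhash64(b)))
	fmt.Println("identical lines match at fuzz 0:", linesMatch(a, a, 0))
}
```

Similar lines share most tokens, so most bit votes agree and the distance stays small; a larger fuzz value accepts more differing bits.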

Gap tolerance

The --gap-tolerance (-g) flag allows dcd to bridge over small gaps in otherwise matching blocks. This catches duplicate blocks where a few lines have been inserted, deleted, or modified in one copy.

When set to N, the algorithm searches up to N positions ahead in both source and target to find the next matching line, bridging over the gap. The --match-length requirement still applies to the number of actual matching lines, regardless of any gaps bridged.

# Allow gaps of up to 2 lines within duplicate blocks
$ dcd -g 2

# Allow larger gaps with multiple bridges
$ dcd -g 3 --max-gap-bridges 3

The --max-gap-bridges flag (default 1) controls how many gaps can be bridged within a single duplicate block. Increasing this allows noisier but more permissive matching.
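
The look-ahead described above can be sketched roughly as follows. This is a simplified model rather than dcd's code: it compares lines by exact equality, and the function names extend and nextMatch are hypothetical. It counts only lines that actually match, so a minimum match length would be checked against matched, not against the span covered.

```go
package main

import "fmt"

// extend walks two line slices from (i, j), counting matching lines and
// bridging at most maxBridges gaps of up to gap lines on either side.
func extend(src, dst []string, i, j, gap, maxBridges int) (matched, bridges int) {
	for i < len(src) && j < len(dst) {
		if src[i] == dst[j] {
			matched++
			i++
			j++
			continue
		}
		if bridges >= maxBridges {
			break
		}
		di, dj, ok := nextMatch(src, dst, i, j, gap)
		if !ok {
			break
		}
		i, j = i+di, j+dj
		bridges++
	}
	return matched, bridges
}

// nextMatch searches up to gap positions ahead in both slices and returns
// the offsets of the nearest pair of equal lines, smallest total skip first.
func nextMatch(src, dst []string, i, j, gap int) (di, dj int, ok bool) {
	for total := 1; total <= 2*gap; total++ {
		for di = 0; di <= gap && di <= total; di++ {
			dj = total - di
			if dj > gap || i+di >= len(src) || j+dj >= len(dst) {
				continue
			}
			if src[i+di] == dst[j+dj] {
				return di, dj, true
			}
		}
	}
	return 0, 0, false
}

func main() {
	src := []string{"a", "b", "c", "d", "e", "f"}
	dst := []string{"a", "b", "INSERTED", "c", "d", "e", "f"}
	matched, bridges := extend(src, dst, 0, 0, 2, 1)
	fmt.Println(matched, bridges) // prints: 6 1
}
```

With a gap of 0 the same call stops at the inserted line and reports only two matching lines.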

Hole tolerance

The --max-hole-size flag allows dcd to skip over modified lines within a diagonal match — lines that stayed in the same position but were changed. This is directly inspired by the Ducasse et al. paper, where holes in diagonal patterns represent in-place modifications.

# Allow up to 2 consecutive modified lines within a match
$ dcd --max-hole-size 2

Holes differ from gaps:

Holes (--max-hole-size): lines modified in place — the diagonal continues straight but some cells don't match
Gaps (--gap-tolerance): lines inserted or deleted — the diagonal shifts to a new position

All three mechanisms are orthogonal and compose together: --fuzz controls line-level similarity, --max-hole-size handles in-place modifications, and --gap-tolerance handles insertions/deletions.

# Maximum duplicate detection: fuzzy lines, holes, and gap bridging
$ dcd -f 2 --max-hole-size 2 -g 2

When holes or gaps are present, the output includes counts:

Found duplicate lines in fileA.go:
 lines 10-25 match 30-46 in fileB.go (matching lines 14, holes 2)
 lines 50-68 match 80-100 in fileB.go (matching lines 15, holes 1, gaps 3)
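
To make the difference from gaps concrete, here is a minimal sketch of hole tolerance along a single diagonal. It assumes exact line equality and a hypothetical helper named diagonalWithHoles; it is meant to illustrate the idea, not to reproduce dcd's algorithm.

```go
package main

import "fmt"

// diagonalWithHoles follows the diagonal starting at (i, j) and tolerates
// runs of up to maxHole consecutive non-matching lines. It returns the
// number of matching lines, the number of hole lines enclosed by matches,
// and the length of the span up to the last matching line.
func diagonalWithHoles(src, dst []string, i, j, maxHole int) (matched, holes, length int) {
	run := 0 // current run of consecutive mismatches
	for k := 0; i+k < len(src) && j+k < len(dst); k++ {
		if src[i+k] == dst[j+k] {
			if matched > 0 {
				holes += run // the run is enclosed by matches, so it is a hole
			}
			run = 0
			matched++
			length = k + 1
			continue
		}
		run++
		if run > maxHole {
			break // too many modified lines in a row: the diagonal ends
		}
	}
	return matched, holes, length
}

func main() {
	src := []string{"a", "b", "c", "d", "e"}
	dst := []string{"a", "b", "CHANGED", "d", "e"}
	fmt.Println(diagonalWithHoles(src, dst, 0, 0, 1)) // prints: 4 1 5
}
```

Both indices always advance together here, which is what keeps the match on one diagonal; a gap bridge, by contrast, advances them by different amounts.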

Single file comparison

The --file flag compares a single file against the rest of the codebase, useful for checking whether a specific file contains code duplicated elsewhere:

$ dcd --file src/utils.go

PBM scatter plot

The --pbm-file-a, --pbm-file-b, and --pbm-output flags generate a PBM (Portable Bitmap) scatter plot of the comparison matrix between two files. This is directly inspired by the scatter plot visualization described in the Ducasse et al. paper — diagonals represent copied code, holes represent in-place modifications, and broken diagonals represent insertions/deletions.

All three flags must be specified together. When set, normal duplicate scanning is skipped and only the PBM file is produced.

# Compare two files and generate a scatter plot
$ dcd --pbm-file-a src/utils.go --pbm-file-b src/helpers.go --pbm-output scatter.pbm

# Self-comparison shows the main diagonal plus any internal duplication
$ dcd --pbm-file-a processor.go --pbm-file-b processor.go --pbm-output self.pbm

# Combine with fuzzy matching for a denser visualization
$ dcd --pbm-file-a fileA.go --pbm-file-b fileB.go --pbm-output fuzzy.pbm -f 2

The output is a P1 ASCII PBM file where each pixel represents a line pair: black (1) means the lines match, white (0) means they don't. The image can be viewed with any image viewer that supports PBM (GIMP, feh, ImageMagick's display, etc.).
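
For readers unfamiliar with the format, the following sketch shows how a line-by-line comparison matrix could be rendered as a P1 PBM. The function name pbm, the exact-equality comparison, and the choice of placing the first file on the vertical axis are assumptions for illustration; the sketch also ignores the PBM recommendation to keep lines under 70 characters.

```go
package main

import (
	"fmt"
	"strings"
)

// pbm renders the comparison matrix of two line slices as a plain (P1) PBM.
// Row y corresponds to a[y], column x to b[x]; 1 (black) marks equal lines.
func pbm(a, b []string) string {
	var sb strings.Builder
	// Header: magic number, then width and height in pixels.
	fmt.Fprintf(&sb, "P1\n%d %d\n", len(b), len(a))
	for _, la := range a {
		for x, lb := range b {
			if x > 0 {
				sb.WriteByte(' ')
			}
			if la == lb {
				sb.WriteByte('1')
			} else {
				sb.WriteByte('0')
			}
		}
		sb.WriteByte('\n')
	}
	return sb.String()
}

func main() {
	lines := []string{"x", "y", "z"}
	// Comparing a file with itself yields the main diagonal.
	fmt.Print(pbm(lines, lines))
}
```

For the three distinct lines above, the output is a 3 by 3 image whose only black pixels lie on the main diagonal.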

Same-file duplicates

By default, dcd only compares different files. Use --process-same-file to also find duplicate blocks within the same file:

$ dcd --process-same-file

Ignoring blocks of code

The --ignore-blocks-start and --ignore-blocks-end flags let you mark regions of code that should be excluded from duplicate detection. This is similar to Simian's ignore feature. Lines between (and including) the start and end markers are zeroed out so the duplicate detector skips them.

Both flags must be specified together. The markers are matched case-insensitively against the normalized (lowercased, whitespace-stripped) line content using substring matching.

# Ignore lines between "duplicate-disable" and "duplicate-enable" markers
$ dcd --ignore-blocks-start duplicate-disable --ignore-blocks-end duplicate-enable
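
The marker handling described above can be approximated by the sketch below. The helper names normalize and blankIgnored are hypothetical, and the sketch assumes that "zeroed out" means replacing each ignored line with an empty string; dcd's internal representation may be different.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// normalize lowercases a line and strips all whitespace.
func normalize(line string) string {
	return strings.Map(func(r rune) rune {
		if unicode.IsSpace(r) {
			return -1 // drop the rune
		}
		return unicode.ToLower(r)
	}, line)
}

// blankIgnored returns a copy of lines in which every line between the
// start and end markers (inclusive) is replaced by an empty string.
// Markers are found by substring match on the normalized line.
func blankIgnored(lines []string, start, end string) []string {
	start, end = normalize(start), normalize(end)
	out := make([]string, len(lines))
	ignoring := false
	for i, line := range lines {
		n := normalize(line)
		if !ignoring && strings.Contains(n, start) {
			ignoring = true
		}
		if ignoring {
			out[i] = ""
			if strings.Contains(n, end) {
				ignoring = false
			}
			continue
		}
		out[i] = line
	}
	return out
}

func main() {
	lines := []string{
		"func a() {}",
		"// Duplicate-Disable",
		"generated := true",
		"// duplicate-enable",
		"func b() {}",
	}
	for _, l := range blankIgnored(lines, "duplicate-disable", "duplicate-enable") {
		fmt.Printf("%q\n", l)
	}
}
```

Because matching is done on the normalized line, a marker written as `// Duplicate-Disable` in a comment is still recognized.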
