
CodeSim: Source Code Clone Detection Using Unsupervised Similarity Measures

<div align="center">


A comprehensive benchmark of 23 unsupervised similarity measures for detecting code clones in Java

📖 Paper · 📊 Dataset · 🔬 Results · 🚀 Quick Start

</div>

📖 Overview

CodeSim is a research toolkit that implements and benchmarks 23 different unsupervised similarity measures for detecting code clones in Java source code. This work addresses the critical challenge of identifying duplicated or similar code segments, which is essential for:

  • 🔍 Software Maintenance: Identifying redundant code for refactoring
  • 🎓 Plagiarism Detection: Academic integrity in programming courses
  • 🐛 Bug Detection: Finding similar code patterns that may contain similar bugs
  • 📊 Code Quality: Measuring code reusability and maintainability

Why CodeSim?

Unlike supervised machine learning approaches that require labeled training data, CodeSim explores unsupervised methods that can detect clones without prior training. This makes them particularly useful for:

  • New programming languages or frameworks
  • Organizations without labeled clone datasets
  • Real-time clone detection scenarios
  • Resource-constrained environments

Key Features

  • 🎯 23 Different Approaches: From simple text-based to advanced semantic methods
  • 📊 Comprehensive Benchmark: Detailed performance metrics on IR-Plag dataset
  • ⏱️ Performance Metrics: Accuracy, precision, recall, F-measure, and execution time
  • 🔧 Ready-to-Use Scripts: Each method in a standalone Python script
  • 📈 Automatic Threshold Detection: Optimizes similarity thresholds for each method
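
The threshold-detection idea in the last bullet can be sketched in a few lines: sweep candidate thresholds over the similarity scores and keep the one that maximizes the F-measure. This is an illustrative sketch only, not the repository's implementation (function and parameter names are assumptions):

```python
def best_threshold(scores, labels, steps=100):
    """Sweep thresholds in [0, 1] and return (threshold, F-measure)
    for the cutoff that best separates clones from non-clones."""
    best_t, best_f = 0.0, -1.0
    for i in range(steps + 1):
        t = i / steps
        # Pairs scoring at or above t are predicted clones.
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f
```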

🔬 Research Context

This work is based on the research paper:

Martinez-Gil, J. (2024). Source Code Clone Detection Using Unsupervised Similarity Measures. In: Bludau, P., Ramler, R., Winkler, D., Bergsmann, J. (eds) Software Quality as a Foundation for Security. SWQD 2024. Lecture Notes in Business Information Processing, vol 505. Springer, Cham.

📄 Read the full paper | 📑 arXiv preprint

📊 Performance Results

Summary Table

| Approach | Script | Accuracy | Precision | Recall | F-Measure | Time (ms) |
|----------|--------|----------|-----------|--------|-----------|-----------|
| 🥇 Output Analysis | java-sim-exec-opt.py | 0.94 | 0.85 | 0.97 | 0.90 | 1,381,335 |
| 🥈 Jaccard | java-sim-jaccard-opt.py | 0.86 | 0.81 | 0.94 | 0.87 | 2,066 ⚡ |
| 🥉 Winnow | java-sim-winn-opt.py | 0.86 | 0.81 | 0.98 | 0.88 | 77,161 |
| N-Grams | java-sim-ngrams-opt.py | 0.85 | 0.84 | 0.29 | 0.43 | 66,635 |
| Winnow II | java-sim-winn2-opt.py | 0.83 | 0.80 | 0.94 | 0.87 | 104,033 |
| Rabin-Karp | java-sim-rk-opt.py | 0.81 | 0.79 | 0.99 | 0.88 | 225,219 |
| AST | java-sim-ast-opt.py | 0.77 | 0.77 | 0.78 | 0.78 | 80,907 |
| Function Calls | java-sim-fcall-opt.py | 0.78 | 0.78 | 0.91 | 0.84 | 30,304 |
| Graph Matching | java-sim-graph-opt.py | 0.78 | 0.80 | 0.52 | 0.63 | 65,077 |

<details> <summary>📋 <b>View All 23 Methods (Click to expand)</b></summary>

| Approach | Script | Accuracy | Precision | Recall | F-Measure | Time (ms) |
|----------|--------|----------|-----------|--------|-----------|-----------|
| Output Analysis | java-sim-exec-opt.py | 0.94 | 0.85 | 0.97 | 0.90 | 1,381,335 |
| Jaccard | java-sim-jaccard-opt.py | 0.86 | 0.81 | 0.94 | 0.87 | 2,066 |
| Winnow | java-sim-winn-opt.py | 0.86 | 0.81 | 0.98 | 0.88 | 77,161 |
| N-Grams | java-sim-ngrams-opt.py | 0.85 | 0.84 | 0.29 | 0.43 | 66,635 |
| Winnow II | java-sim-winn2-opt.py | 0.83 | 0.80 | 0.94 | 0.87 | 104,033 |
| Rabin-Karp | java-sim-rk-opt.py | 0.81 | 0.79 | 0.99 | 0.88 | 225,219 |
| Function Calls | java-sim-fcall-opt.py | 0.78 | 0.78 | 0.91 | 0.84 | 30,304 |
| Graph Matching | java-sim-graph-opt.py | 0.78 | 0.80 | 0.52 | 0.63 | 65,077 |
| AST | java-sim-ast-opt.py | 0.77 | 0.77 | 0.78 | 0.78 | 80,907 |
| Bag-of-Words | java-sim-bow-opt.py | 0.77 | 0.79 | 0.66 | 0.72 | 57,445 |
| Bag-of-Words II | java-sim-bow2-opt.py | 0.77 | 0.77 | 1.00 | 0.87 | 59,962 |
| Comment Similarity | java-sim-comments-opt.py | 0.77 | 0.77 | 1.00 | 0.87 | 983,231 |
| Fuzzy Matching | java-sim-fuzz-opt.py | 0.77 | 0.77 | 1.00 | 0.87 | 12,779 |
| Perceptual Hash | java-sim-image-opt.py | 0.77 | 0.77 | 0.85 | 0.81 | 38,153 |
| Levenshtein | java-sim-lev-opt.py | 0.77 | 0.80 | 0.66 | 0.72 | 10,280 |
| Metrics Comparison | java-sim-metrics-opt.py | 0.77 | 0.77 | 1.00 | 0.87 | 60,509 |
| TF-IDF | java-sim-tdf-opt.py | 0.77 | 0.77 | 0.99 | 0.87 | 68,587 |
| Semantic Clone | java-sim-semclone-opt.py | 0.77 | 0.79 | 0.68 | 0.73 | 41,544 |
| Semdiff | java-sim-semdiff-opt.py | 0.77 | 0.79 | 0.38 | 0.51 | 26,351 |
| PDG | java-sim-pdg-opt.py | 0.65 | 0.85 | 0.39 | 0.53 | 40,519 |
| Rolling Hash | java-sim-hash-opt.py | 0.59 | 0.93 | 0.18 | 0.30 | 959,158 |
| CodeBERT* | java-sim-codebert-opt.py | 0.54 | 0.75 | 0.34 | 0.47 | 868,756 |
| LCS | java-sim-lcs-opt.py | 0.48 | 0.74 | 0.06 | 0.11 | 7,269 |

*CodeBERT is used without recalibration

</details>

🎯 Key Findings

  1. 🏆 Best Overall: Output Analysis achieves 94% accuracy with 0.90 F-measure
  2. ⚡ Fastest & Effective: Jaccard with only ~2 seconds (2,066 ms) of execution time and 0.87 F-measure
  3. 🎯 High Precision: Rolling Hash achieves 93% precision (but low recall)
  4. ⚖️ Best Balance: Winnow provides excellent speed-accuracy tradeoff

Performance Visualization

```
Accuracy vs. execution time (best performers toward the top-left)

 0.9 ─┤                                Output Analysis ●
      │  Jaccard ● (best tradeoff)
      │             Winnow ●
 0.8 ─┤
      │
 0.7 ─┤
      └──────────────────────────────────────────────────
         Fast (~2 s)                      Slow (> 20 min)
```

🚀 Quick Start

Prerequisites

  • Python 3.8+
  • Java 8+ (for processing Java source files)
  • pip package manager

Installation

```bash
# Clone the repository
git clone https://github.com/jorge-martinez-gil/codesim.git
cd codesim

# Install Python dependencies
pip install -r requirements.txt
```

Dataset Setup

The repository uses the IR-Plag dataset, which is already included in the IR-Plag-Dataset/ directory.

Dataset Credits: Created by Oscar Karnalim
Source: https://github.com/oscarkarnalim/sourcecodeplagiarismdataset

Running Individual Methods

Each similarity measure has its own script. Run any method:

```bash
# Example 1: Run Jaccard similarity (fastest method)
python java-sim-jaccard-opt.py

# Example 2: Run Output Analysis (best accuracy)
python java-sim-exec-opt.py

# Example 3: Run AST-based comparison
python java-sim-ast-opt.py
```

Running All Methods (Benchmark)

```bash
# Run the comprehensive benchmark
python main.py

# The script will:
# 1. Process all Java files in IR-Plag-Dataset
# 2. Apply each similarity measure
# 3. Find optimal thresholds
# 4. Generate a performance report
```

Using as a Library

```python
# Note: the script filenames use hyphens (e.g. java-sim-jaccard-opt.py),
# which are not valid in Python module names. Rename the script, or load
# it via importlib, before importing it as shown here.
from java_sim_jaccard_opt import compute_jaccard_similarity

# Compare two Java code snippets
code1 = """
public class Example {
    public int add(int a, int b) {
        return a + b;
    }
}
"""

code2 = """
public class Sample {
    public int sum(int x, int y) {
        return x + y;
    }
}
"""

similarity_score = compute_jaccard_similarity(code1, code2)
print(f"Similarity: {similarity_score:.2f}")

# Determine whether the pair is a clone (using the optimized threshold)
is_clone = similarity_score > 0.75  # Threshold from optimization
print(f"Is clone: {is_clone}")
```

🔍 Similarity Measures Explained

1. Text-Based Methods

These methods treat code as text and compare character/token sequences.

Levenshtein Distance (java-sim-lev-opt.py)

  • What it does: Counts minimum edits to transform one code into another
  • Best for: Detecting minor code modifications
  • Speed: ⚡⚡⚡ Very fast
  • Accuracy: Moderate
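
As a concrete illustration of edit distance, here is a minimal dynamic-programming sketch with a length-normalized similarity score. This is independent of the repository's java-sim-lev-opt.py and is for illustration only:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to turn string `a` into string `b`."""
    prev = list(range(len(b) + 1))  # distances for the empty prefix of `a`
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def lev_similarity(a: str, b: str) -> float:
    """Normalize the distance into a 0..1 similarity."""
    return 1 - levenshtein(a, b) / max(len(a), len(b), 1)
```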

Longest Common Subsequence (java-sim-lcs-opt.py)

  • What it does: Finds longest matching sequence of characters
  • Best for: Identifying core similarities despite reordering
  • Speed: ⚡⚡ Fast
  • Accuracy: Low (not recommended)

Fuzzy Matching (java-sim-fuzz-opt.py)

  • What it does: Uses fuzzy string matching algorithms
  • Best for: Handling typos and minor variations
  • Speed: ⚡⚡⚡ Very fast
  • Accuracy: Good
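
Python's standard library provides one such fuzzy ratio via difflib. This is a minimal sketch; the repository's java-sim-fuzz-opt.py may rely on a different fuzzy-matching library:

```python
from difflib import SequenceMatcher

def fuzzy_similarity(a: str, b: str) -> float:
    """Ratio of matched characters: 2*M / (len(a) + len(b)),
    where M is the total size of matching blocks."""
    return SequenceMatcher(None, a, b).ratio()
```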

2. Token-Based Methods

These methods analyze code at the token/word level.

Bag-of-Words (java-sim-bow-opt.py, java-sim-bow2-opt.py)

  • What it does: Treats code as collection of tokens
  • Best for: Comparing overall code vocabulary
  • Speed: ⚡⚡⚡ Very fast
  • Accuracy: Good
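
A token-count comparison of this kind can be sketched as a cosine similarity over token-frequency vectors. Illustrative only; the simple `\w+` tokenizer here is an assumption, not the repository's tokenization:

```python
from collections import Counter
import math
import re

def bow_cosine(code1: str, code2: str) -> float:
    """Cosine similarity of token-count vectors: order is ignored,
    but token multiplicity counts."""
    t1 = Counter(re.findall(r"\w+", code1))
    t2 = Counter(re.findall(r"\w+", code2))
    dot = sum(t1[w] * t2[w] for w in t1)
    norm = (math.sqrt(sum(c * c for c in t1.values()))
            * math.sqrt(sum(c * c for c in t2.values())))
    return dot / norm if norm else 0.0
```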

Jaccard Similarity (java-sim-jaccard-opt.py) ⭐

  • What it does: Compares token set overlap
  • Best for: Fast, accurate clone detection
  • Speed: ⚡⚡⚡⚡ Extremely fast (2,066 ms, the fastest in the benchmark)
  • Accuracy: Excellent (0.87 F-measure)
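
The token-set overlap behind Jaccard can be sketched in a few lines (illustrative; the repository's tokenization may differ):

```python
import re

def jaccard_similarity(code1: str, code2: str) -> float:
    """Jaccard index over token *sets*: |intersection| / |union|.
    Unlike bag-of-words, token multiplicity is ignored."""
    s1 = set(re.findall(r"\w+", code1))
    s2 = set(re.findall(r"\w+", code2))
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 1.0  # two empty inputs count as identical
```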

TF-IDF (java-sim-tdf-opt.py)
