CodeSim: Source Code Clone Detection Using Unsupervised Similarity Measures
<div align="center">

A comprehensive benchmark of 23 unsupervised similarity measures for detecting code clones in Java

📖 Paper • 📊 Dataset • 🔬 Results • 🚀 Quick Start

</div>

📖 Overview
CodeSim is a research toolkit that implements and benchmarks 23 different unsupervised similarity measures for detecting code clones in Java source code. This work addresses the critical challenge of identifying duplicated or similar code segments, which is essential for:
- 🔍 Software Maintenance: Identifying redundant code for refactoring
- 🎓 Plagiarism Detection: Academic integrity in programming courses
- 🐛 Bug Detection: Finding similar code patterns that may contain similar bugs
- 📊 Code Quality: Measuring code reusability and maintainability
Why CodeSim?
Unlike supervised machine learning approaches that require labeled training data, CodeSim explores unsupervised methods that can detect clones without prior training. This makes them particularly useful for:
- New programming languages or frameworks
- Organizations without labeled clone datasets
- Real-time clone detection scenarios
- Resource-constrained environments
Key Features
- 🎯 23 Different Approaches: From simple text-based to advanced semantic methods
- 📊 Comprehensive Benchmark: Detailed performance metrics on the IR-Plag dataset
- ⚡ Performance Metrics: Accuracy, precision, recall, F-measure, and execution time
- 🔧 Ready-to-Use Scripts: Each method in a standalone Python script
- 📈 Automatic Threshold Detection: Optimizes similarity thresholds for each method
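The four quality metrics listed above can be computed from binary clone/non-clone decisions. The following is a minimal sketch, not the repository's evaluation code; the function name `classification_metrics` and the toy labels are assumptions made for illustration.

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[bool], y_pred: Sequence[bool]) -> Dict[str, float]:
    """Compute accuracy, precision, recall and F-measure for clone decisions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # clones correctly flagged
    tn = sum(1 for t, p in pairs if not t and not p)  # non-clones correctly passed
    fp = sum(1 for t, p in pairs if not t and p)      # non-clones wrongly flagged
    fn = sum(1 for t, p in pairs if t and not p)      # clones that were missed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    )
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Toy example (hypothetical labels, not taken from IR-Plag)
truth = [True, True, True, False, False]
predicted = [True, True, False, True, False]
metrics = classification_metrics(truth, predicted)
print(metrics)
```

A measure with high precision but low recall flags few pairs and is usually right when it does; the F-measure penalizes that imbalance because it is the harmonic mean of the two.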
🔬 Research Context
This work is based on the research paper:
Martinez-Gil, J. (2024). Source Code Clone Detection Using Unsupervised Similarity Measures. In: Bludau, P., Ramler, R., Winkler, D., Bergsmann, J. (eds) Software Quality as a Foundation for Security. SWQD 2024. Lecture Notes in Business Information Processing, vol 505. Springer, Cham.
📄 Read the full paper | 📑 arXiv preprint
📊 Performance Results
Summary Table
| Approach | Script | Accuracy | Precision | Recall | F-Measure | Time (ms) |
|----------|--------|----------|-----------|--------|-----------|-----------|
| 🥇 Output Analysis | java-sim-exec-opt.py | 0.94 | 0.85 | 0.97 | 0.90 | 1,381,335 |
| 🥈 Jaccard | java-sim-jaccard-opt.py | 0.86 | 0.81 | 0.94 | 0.87 | 2,066 ⚡ |
| 🥉 Winnow | java-sim-winn-opt.py | 0.86 | 0.81 | 0.98 | 0.88 | 77,161 |
| N-Grams | java-sim-ngrams-opt.py | 0.85 | 0.84 | 0.29 | 0.43 | 66,635 |
| Winnow II | java-sim-winn2-opt.py | 0.83 | 0.80 | 0.94 | 0.87 | 104,033 |
| Rabin-Karp | java-sim-rk-opt.py | 0.81 | 0.79 | 0.99 | 0.88 | 225,219 |
| AST | java-sim-ast-opt.py | 0.77 | 0.77 | 0.78 | 0.78 | 80,907 |
| Function Calls | java-sim-fcall-opt.py | 0.78 | 0.78 | 0.91 | 0.84 | 30,304 |
| Graph Matching | java-sim-graph-opt.py | 0.78 | 0.80 | 0.52 | 0.63 | 65,077 |

<details>

| Approach | Script | Accuracy | Precision | Recall | F-Measure | Time (ms) |
|----------|--------|----------|-----------|--------|-----------|-----------|
| Output Analysis | java-sim-exec-opt.py | 0.94 | 0.85 | 0.97 | 0.90 | 1,381,335 |
| Jaccard | java-sim-jaccard-opt.py | 0.86 | 0.81 | 0.94 | 0.87 | 2,066 |
| Winnow | java-sim-winn-opt.py | 0.86 | 0.81 | 0.98 | 0.88 | 77,161 |
| N-Grams | java-sim-ngrams-opt.py | 0.85 | 0.84 | 0.29 | 0.43 | 66,635 |
| Winnow II | java-sim-winn2-opt.py | 0.83 | 0.80 | 0.94 | 0.87 | 104,033 |
| Rabin-Karp | java-sim-rk-opt.py | 0.81 | 0.79 | 0.99 | 0.88 | 225,219 |
| Function Calls | java-sim-fcall-opt.py | 0.78 | 0.78 | 0.91 | 0.84 | 30,304 |
| Graph Matching | java-sim-graph-opt.py | 0.78 | 0.80 | 0.52 | 0.63 | 65,077 |
| AST | java-sim-ast-opt.py | 0.77 | 0.77 | 0.78 | 0.78 | 80,907 |
| Bag-of-Words | java-sim-bow-opt.py | 0.77 | 0.79 | 0.66 | 0.72 | 57,445 |
| Bag-of-Words II | java-sim-bow2-opt.py | 0.77 | 0.77 | 1.00 | 0.87 | 59,962 |
| Comment Similarity | java-sim-comments-opt.py | 0.77 | 0.77 | 1.00 | 0.87 | 983,231 |
| Fuzzy Matching | java-sim-fuzz-opt.py | 0.77 | 0.77 | 1.00 | 0.87 | 12,779 |
| Perceptual Hash | java-sim-image-opt.py | 0.77 | 0.77 | 0.85 | 0.81 | 38,153 |
| Levenshtein | java-sim-lev-opt.py | 0.77 | 0.80 | 0.66 | 0.72 | 10,280 |
| Metrics Comparison | java-sim-metrics-opt.py | 0.77 | 0.77 | 1.00 | 0.87 | 60,509 |
| TF-IDF | java-sim-tdf-opt.py | 0.77 | 0.77 | 0.99 | 0.87 | 68,587 |
| Semantic Clone | java-sim-semclone-opt.py | 0.77 | 0.79 | 0.68 | 0.73 | 41,544 |
| Semdiff | java-sim-semdiff-opt.py | 0.77 | 0.79 | 0.38 | 0.51 | 26,351 |
| PDG | java-sim-pdg-opt.py | 0.65 | 0.85 | 0.39 | 0.53 | 40,519 |
| Rolling Hash | java-sim-hash-opt.py | 0.59 | 0.93 | 0.18 | 0.30 | 959,158 |
| CodeBERT* | java-sim-codebert-opt.py | 0.54 | 0.75 | 0.34 | 0.47 | 868,756 |
| LCS | java-sim-lcs-opt.py | 0.48 | 0.74 | 0.06 | 0.11 | 7,269 |

\*CodeBERT is used without recalibration.

</details>

🎯 Key Findings
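- 🏆 Best Overall: Output Analysis achieves 0.94 accuracy and 0.90 F-measure, but it is also the slowest measure (1,381,335 ms)
- ⚡ Fastest & Effective: Jaccard has the shortest execution time (2,066 ms, about 2 s) with 0.87 F-measure
- 🎯 High Precision: Rolling Hash achieves 0.93 precision, the highest of all measures, but only 0.18 recall
- ⚖️ Best Balance: Winnow combines 0.86 accuracy and 0.98 recall with a moderate execution time (77,161 ms)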
Performance Visualization
[Chart: Accuracy (y-axis, 0.7 to 0.9) vs Execution Time (x-axis, fast to slow); best performers are in the top-left. Plotted points: Output Analysis, Winnow, and Jaccard (marked BEST).]
🚀 Quick Start
Prerequisites
- Python 3.8+
- Java 8+ (for processing Java source files)
- pip package manager
Installation
```bash
# Clone the repository
git clone https://github.com/jorge-martinez-gil/codesim.git
cd codesim

# Install Python dependencies
pip install -r requirements.txt
```
Dataset Setup
The repository uses the IR-Plag dataset, which is already included in the IR-Plag-Dataset/ directory.
Dataset Credits: Created by Oscar Karnalim
Source: https://github.com/oscarkarnalim/sourcecodeplagiarismdataset
Running Individual Methods
Each similarity measure has its own script. Run any method:
```bash
# Example 1: Run Jaccard similarity (fastest method)
python java-sim-jaccard-opt.py

# Example 2: Run Output Analysis (best accuracy)
python java-sim-exec-opt.py

# Example 3: Run AST-based comparison
python java-sim-ast-opt.py
```
Running All Methods (Benchmark)
```bash
# Run comprehensive benchmark
python main.py

# The script will:
# 1. Process all Java files in IR-Plag-Dataset
# 2. Apply each similarity measure
# 3. Find optimal thresholds
# 4. Generate performance report
```
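Step 3 (finding optimal thresholds) can be pictured as a simple grid search. The sketch below assumes that the threshold is chosen to maximize accuracy and that a pair counts as a clone when its score is strictly above the threshold; the repository's actual criterion may differ. The name `best_threshold` and the toy data are hypothetical.

```python
from typing import Sequence, Tuple


def best_threshold(
    scores: Sequence[float], labels: Sequence[bool], steps: int = 100
) -> Tuple[float, float]:
    """Try evenly spaced thresholds in [0, 1]; return (threshold, accuracy)."""
    best_t, best_acc = 0.0, -1.0
    for i in range(steps + 1):
        t = i / steps
        # A pair is predicted to be a clone when its score exceeds the threshold
        correct = sum(1 for s, y in zip(scores, labels) if (s > t) == bool(y))
        acc = correct / len(labels)
        if acc > best_acc:  # keep the lowest threshold among ties
            best_t, best_acc = t, acc
    return best_t, best_acc


# Toy similarity scores and ground-truth labels (hypothetical values)
scores = [0.9, 0.8, 0.4, 0.3]
labels = [True, True, False, False]
threshold, accuracy = best_threshold(scores, labels)
print(f"threshold={threshold:.2f}, accuracy={accuracy:.2f}")
```

Swapping accuracy for F-measure inside the loop would give a threshold that favors recall on imbalanced data, which is one reason the selection criterion matters when comparing measures.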
Using as a Library
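```python
import importlib

# The script file name contains hyphens, so a plain `import` statement cannot
# load it; importlib takes the module name as a string instead.
# Run this from the repository root so the script is on the import path.
jaccard = importlib.import_module("java-sim-jaccard-opt")
compute_jaccard_similarity = jaccard.compute_jaccard_similarity

# Compare two Java code snippets
code1 = """
public class Example {
    public int add(int a, int b) {
        return a + b;
    }
}
"""

code2 = """
public class Sample {
    public int sum(int x, int y) {
        return x + y;
    }
}
"""

similarity_score = compute_jaccard_similarity(code1, code2)
print(f"Similarity: {similarity_score:.2f}")

# Determine if it's a clone (using optimal threshold)
is_clone = similarity_score > 0.75  # Threshold from optimization
print(f"Is clone: {is_clone}")
```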
🔍 Similarity Measures Explained
1. Text-Based Methods
These methods treat code as text and compare character/token sequences.
Levenshtein Distance (java-sim-lev-opt.py)
- What it does: Counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one source file into the other
- Best for: Detecting minor code modifications
- Speed: ⚡⚡⚡ Very fast
- Accuracy: Moderate
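As an illustration of the idea, here is a minimal dynamic-programming sketch of edit distance, normalized into a similarity in [0, 1]. It is not the code from `java-sim-lev-opt.py`; the function names and the normalization by the longer string are assumptions.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ch_a
                    current[j - 1] + 1,      # insert ch_b
                    previous[j - 1] + cost,  # substitute (or keep if equal)
                )
            )
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Map the distance to [0, 1], where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein_distance(a, b) / longest


print(levenshtein_distance("kitten", "sitting"))  # 3
print(levenshtein_similarity("return a + b;", "return x + y;"))
```

The table keeps only two rows at a time, so memory grows with the shorter input while time grows with the product of both lengths.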
Longest Common Subsequence (java-sim-lcs-opt.py)
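- What it does: Finds the longest sequence of characters that appears in both files in the same order, though not necessarily contiguously
- Best for: Identifying shared content despite inserted or deleted code (matches must keep their relative order, so reordered code scores lower)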
- Speed: ⚡⚡ Fast
- Accuracy: Low (not recommended)
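A minimal sketch of the longest-common-subsequence idea follows. It is not the code from `java-sim-lcs-opt.py`; the function names and the normalization `2 * LCS / (len(a) + len(b))` are assumptions chosen for illustration.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest subsequence common to a and b (order preserved)."""
    previous = [0] * (len(b) + 1)
    for ch_a in a:
        current = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                current.append(previous[j - 1] + 1)  # extend a common subsequence
            else:
                current.append(max(previous[j], current[j - 1]))  # skip one char
        previous = current
    return previous[-1]


def lcs_similarity(a: str, b: str) -> float:
    """Normalize the LCS length into [0, 1]; two empty strings count as identical."""
    total = len(a) + len(b)
    return 1.0 if total == 0 else 2 * lcs_length(a, b) / total


print(lcs_length("ABCBDAB", "BDCABA"))     # 4
print(lcs_similarity("abcdef", "defabc"))  # 0.5
```

The second call swaps two halves of the same string and the score drops to 0.5, because a subsequence has to keep the original order of its characters.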
Fuzzy Matching (java-sim-fuzz-opt.py)
- What it does: Uses fuzzy string matching algorithms
- Best for: Handling typos and minor variations
- Speed: ⚡⚡⚡ Very fast
- Accuracy: Good
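This section does not say which fuzzy matching library `java-sim-fuzz-opt.py` relies on. As a stand-in, the sketch below uses `difflib.SequenceMatcher` from the Python standard library, which returns a ratio in [0, 1] based on matching blocks. The whitespace normalization step and the function name `fuzzy_similarity` are assumptions.

```python
from difflib import SequenceMatcher


def fuzzy_similarity(code_a: str, code_b: str) -> float:
    """Approximate string similarity in [0, 1] after collapsing whitespace."""
    # Collapse runs of whitespace so formatting differences do not dominate
    norm_a = " ".join(code_a.split())
    norm_b = " ".join(code_b.split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()


original = "int total = a + b;"
variant = "int   total =\n    a + b ;"
print(f"{fuzzy_similarity(original, variant):.2f}")
```

`ratio()` is defined as twice the number of matched characters divided by the combined length of both strings, so small typos or renamed variables lower the score gradually instead of producing a hard mismatch.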
2. Token-Based Methods
These methods analyze code at the token/word level.
Bag-of-Words (java-sim-bow-opt.py, java-sim-bow2-opt.py)
- What it does: Treats code as collection of tokens
- Best for: Comparing overall code vocabulary
- Speed: ⚡⚡⚡ Very fast
- Accuracy: Good
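One common way to compare two bags of tokens is cosine similarity over token counts. The sketch below assumes that approach and a simple `\w+` tokenizer; neither is confirmed by this section, which also does not explain how the two Bag-of-Words scripts differ. The function names are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(code: str) -> Counter:
    """Count word-like tokens (identifiers, keywords, numbers), ignoring order."""
    return Counter(re.findall(r"\w+", code))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # at least one snippet has no tokens
    return dot / (norm_a * norm_b)


snippet_a = "int sum = a + b; return sum;"
snippet_b = "return sum; int sum = b + a;"
print(f"{cosine_similarity(bag_of_words(snippet_a), bag_of_words(snippet_b)):.2f}")
```

Because token order is discarded, the two snippets above score as identical even though their statements are swapped, which helps explain why bag-style measures tend toward high recall.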
Jaccard Similarity (java-sim-jaccard-opt.py) ⭐
- What it does: Compares token set overlap
- Best for: Fast, accurate clone detection
- Speed: ⚡⚡⚡⚡ Extremely fast (2,066 ms, the shortest execution time in the benchmark)
- Accuracy: Excellent (0.87 F-measure)
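Token set overlap can be sketched in a few lines: the Jaccard index is the size of the intersection divided by the size of the union. The tokenizer below (identifiers, numbers, and single punctuation characters) and the name `jaccard_similarity` are assumptions; the repository's script may tokenize differently.

```python
import re


def tokenize(code: str) -> set:
    """Split code into a set of identifiers, numbers and punctuation characters."""
    return set(re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code))


def jaccard_similarity(code_a: str, code_b: str) -> float:
    """|A ∩ B| / |A ∪ B| over the two token sets."""
    tokens_a, tokens_b = tokenize(code_a), tokenize(code_b)
    if not tokens_a and not tokens_b:
        return 1.0  # two empty snippets are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


print(jaccard_similarity("a b c", "b c d"))  # 0.5
print(jaccard_similarity("return a + b;", "return x + y;"))
```

Set operations take time roughly linear in the number of tokens, which is consistent with Jaccard having the shortest execution time in the results table.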
TF-IDF (`java-s
