
CodeSim: Source Code Clone Detection Using Unsupervised Similarity Measures

<div align="center">


A comprehensive benchmark of 23 unsupervised similarity measures for detecting code clones in Java

📖 Paper · 📊 Dataset · 🔬 Results · 🚀 Quick Start

</div>

📖 Overview

CodeSim is a research toolkit that implements and benchmarks 23 different unsupervised similarity measures for detecting code clones in Java source code. This work addresses the critical challenge of identifying duplicated or similar code segments, which is essential for:

  • 🔍 Software Maintenance: Identifying redundant code for refactoring
  • 🎓 Plagiarism Detection: Academic integrity in programming courses
  • 🐛 Bug Detection: Finding similar code patterns that may contain similar bugs
  • 📊 Code Quality: Measuring code reusability and maintainability

Why CodeSim?

Unlike supervised machine learning approaches that require labeled training data, CodeSim explores unsupervised methods that can detect clones without prior training. This makes them particularly useful for:

  • New programming languages or frameworks
  • Organizations without labeled clone datasets
  • Real-time clone detection scenarios
  • Resource-constrained environments

Key Features

  • 🎯 23 Different Approaches: From simple text-based to advanced semantic methods
  • 📊 Comprehensive Benchmark: Detailed performance metrics on IR-Plag dataset
  • ⏱️ Performance Metrics: Accuracy, precision, recall, F-measure, and execution time
  • 🔧 Ready-to-Use Scripts: Each method in a standalone Python script
  • 📈 Automatic Threshold Detection: Optimizes similarity thresholds for each method
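
The threshold-detection idea in the last bullet can be sketched in a few lines: sweep candidate thresholds over the similarity scores and keep the one that maximizes the F-measure. This is an illustrative sketch only, not the repository's implementation (function and parameter names are assumptions):

```python
def best_threshold(scores, labels, steps=100):
    """Sweep thresholds in [0, 1] and return (threshold, F-measure)
    for the cutoff that best separates clones from non-clones."""
    best_t, best_f = 0.0, -1.0
    for i in range(steps + 1):
        t = i / steps
        # Pairs scoring at or above t are predicted clones.
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f
```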

🔬 Research Context

This work is based on the research paper:

Martinez-Gil, J. (2024). Source Code Clone Detection Using Unsupervised Similarity Measures. In: Bludau, P., Ramler, R., Winkler, D., Bergsmann, J. (eds) Software Quality as a Foundation for Security. SWQD 2024. Lecture Notes in Business Information Processing, vol 505. Springer, Cham.

📄 Read the full paper | 📑 arXiv preprint

📊 Performance Results

Summary Table

| Approach | Script | Accuracy | Precision | Recall | F-Measure | Time (ms) |
|----------|--------|----------|-----------|--------|-----------|-----------|
| 🥇 Output Analysis | java-sim-exec-opt.py | 0.94 | 0.85 | 0.97 | 0.90 | 1,381,335 |
| 🥈 Jaccard | java-sim-jaccard-opt.py | 0.86 | 0.81 | 0.94 | 0.87 | 2,066 ⚡ |
| 🥉 Winnow | java-sim-winn-opt.py | 0.86 | 0.81 | 0.98 | 0.88 | 77,161 |
| N-Grams | java-sim-ngrams-opt.py | 0.85 | 0.84 | 0.29 | 0.43 | 66,635 |
| Winnow II | java-sim-winn2-opt.py | 0.83 | 0.80 | 0.94 | 0.87 | 104,033 |
| Rabin-Karp | java-sim-rk-opt.py | 0.81 | 0.79 | 0.99 | 0.88 | 225,219 |
| AST | java-sim-ast-opt.py | 0.77 | 0.77 | 0.78 | 0.78 | 80,907 |
| Function Calls | java-sim-fcall-opt.py | 0.78 | 0.78 | 0.91 | 0.84 | 30,304 |
| Graph Matching | java-sim-graph-opt.py | 0.78 | 0.80 | 0.52 | 0.63 | 65,077 |

<details> <summary>📋 <b>View All 23 Methods (Click to expand)</b></summary>

| Approach | Script | Accuracy | Precision | Recall | F-Measure | Time (ms) |
|----------|--------|----------|-----------|--------|-----------|-----------|
| Output Analysis | java-sim-exec-opt.py | 0.94 | 0.85 | 0.97 | 0.90 | 1,381,335 |
| Jaccard | java-sim-jaccard-opt.py | 0.86 | 0.81 | 0.94 | 0.87 | 2,066 |
| Winnow | java-sim-winn-opt.py | 0.86 | 0.81 | 0.98 | 0.88 | 77,161 |
| N-Grams | java-sim-ngrams-opt.py | 0.85 | 0.84 | 0.29 | 0.43 | 66,635 |
| Winnow II | java-sim-winn2-opt.py | 0.83 | 0.80 | 0.94 | 0.87 | 104,033 |
| Rabin-Karp | java-sim-rk-opt.py | 0.81 | 0.79 | 0.99 | 0.88 | 225,219 |
| Function Calls | java-sim-fcall-opt.py | 0.78 | 0.78 | 0.91 | 0.84 | 30,304 |
| Graph Matching | java-sim-graph-opt.py | 0.78 | 0.80 | 0.52 | 0.63 | 65,077 |
| AST | java-sim-ast-opt.py | 0.77 | 0.77 | 0.78 | 0.78 | 80,907 |
| Bag-of-Words | java-sim-bow-opt.py | 0.77 | 0.79 | 0.66 | 0.72 | 57,445 |
| Bag-of-Words II | java-sim-bow2-opt.py | 0.77 | 0.77 | 1.00 | 0.87 | 59,962 |
| Comment Similarity | java-sim-comments-opt.py | 0.77 | 0.77 | 1.00 | 0.87 | 983,231 |
| Fuzzy Matching | java-sim-fuzz-opt.py | 0.77 | 0.77 | 1.00 | 0.87 | 12,779 |
| Perceptual Hash | java-sim-image-opt.py | 0.77 | 0.77 | 0.85 | 0.81 | 38,153 |
| Levenshtein | java-sim-lev-opt.py | 0.77 | 0.80 | 0.66 | 0.72 | 10,280 |
| Metrics Comparison | java-sim-metrics-opt.py | 0.77 | 0.77 | 1.00 | 0.87 | 60,509 |
| TF-IDF | java-sim-tdf-opt.py | 0.77 | 0.77 | 0.99 | 0.87 | 68,587 |
| Semantic Clone | java-sim-semclone-opt.py | 0.77 | 0.79 | 0.68 | 0.73 | 41,544 |
| Semdiff | java-sim-semdiff-opt.py | 0.77 | 0.79 | 0.38 | 0.51 | 26,351 |
| PDG | java-sim-pdg-opt.py | 0.65 | 0.85 | 0.39 | 0.53 | 40,519 |
| Rolling Hash | java-sim-hash-opt.py | 0.59 | 0.93 | 0.18 | 0.30 | 959,158 |
| CodeBERT* | java-sim-codebert-opt.py | 0.54 | 0.75 | 0.34 | 0.47 | 868,756 |
| LCS | java-sim-lcs-opt.py | 0.48 | 0.74 | 0.06 | 0.11 | 7,269 |

*CodeBERT is used without recalibration

</details>

🎯 Key Findings

  1. 🏆 Best Overall: Output Analysis achieves 94% accuracy with 0.90 F-measure
  2. ⚡ Fastest & Effective: Jaccard with only ~2 seconds (2,066 ms) of execution time and 0.87 F-measure
  3. 🎯 High Precision: Rolling Hash achieves 93% precision (but low recall)
  4. ⚖️ Best Balance: Winnow provides excellent speed-accuracy tradeoff

Performance Visualization

```
Accuracy vs. execution time (best performers toward the top-left)

 0.9 ─┤                                Output Analysis ●
      │  Jaccard ● (best tradeoff)
      │             Winnow ●
 0.8 ─┤
      │
 0.7 ─┤
      └──────────────────────────────────────────────────
         Fast (~2 s)                      Slow (> 20 min)
```

🚀 Quick Start

Prerequisites

  • Python 3.8+
  • Java 8+ (for processing Java source files)
  • pip package manager

Installation

```bash
# Clone the repository
git clone https://github.com/jorge-martinez-gil/codesim.git
cd codesim

# Install Python dependencies
pip install -r requirements.txt
```

Dataset Setup

The repository uses the IR-Plag dataset, which is already included in the IR-Plag-Dataset/ directory.

Dataset Credits: Created by Oscar Karnalim
Source: https://github.com/oscarkarnalim/sourcecodeplagiarismdataset

Running Individual Methods

Each similarity measure has its own script. Run any method:

```bash
# Example 1: Run Jaccard similarity (fastest method)
python java-sim-jaccard-opt.py

# Example 2: Run Output Analysis (best accuracy)
python java-sim-exec-opt.py

# Example 3: Run AST-based comparison
python java-sim-ast-opt.py
```

Running All Methods (Benchmark)

```bash
# Run the comprehensive benchmark
python main.py

# The script will:
# 1. Process all Java files in IR-Plag-Dataset
# 2. Apply each similarity measure
# 3. Find optimal thresholds
# 4. Generate a performance report
```

Using as a Library

```python
# Note: the script filenames use hyphens (e.g. java-sim-jaccard-opt.py),
# which are not valid in Python module names. Rename the script, or load
# it via importlib, before importing it as shown here.
from java_sim_jaccard_opt import compute_jaccard_similarity

# Compare two Java code snippets
code1 = """
public class Example {
    public int add(int a, int b) {
        return a + b;
    }
}
"""

code2 = """
public class Sample {
    public int sum(int x, int y) {
        return x + y;
    }
}
"""

similarity_score = compute_jaccard_similarity(code1, code2)
print(f"Similarity: {similarity_score:.2f}")

# Determine whether the pair is a clone (using the optimized threshold)
is_clone = similarity_score > 0.75  # Threshold from optimization
print(f"Is clone: {is_clone}")
```

🔍 Similarity Measures Explained

1. Text-Based Methods

These methods treat code as text and compare character/token sequences.

Levenshtein Distance (java-sim-lev-opt.py)

  • What it does: Counts minimum edits to transform one code into another
  • Best for: Detecting minor code modifications
  • Speed: ⚡⚡⚡ Very fast
  • Accuracy: Moderate
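
As a concrete illustration of edit distance, here is a minimal dynamic-programming sketch with a length-normalized similarity score. This is independent of the repository's java-sim-lev-opt.py and is for illustration only:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to turn string `a` into string `b`."""
    prev = list(range(len(b) + 1))  # distances for the empty prefix of `a`
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def lev_similarity(a: str, b: str) -> float:
    """Normalize the distance into a 0..1 similarity."""
    return 1 - levenshtein(a, b) / max(len(a), len(b), 1)
```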

Longest Common Subsequence (java-sim-lcs-opt.py)

  • What it does: Finds longest matching sequence of characters
  • Best for: Identifying core similarities despite reordering
  • Speed: ⚡⚡ Fast
  • Accuracy: Low (not recommended)

Fuzzy Matching (java-sim-fuzz-opt.py)

  • What it does: Uses fuzzy string matching algorithms
  • Best for: Handling typos and minor variations
  • Speed: ⚡⚡⚡ Very fast
  • Accuracy: Good
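
Python's standard library provides one such fuzzy ratio via difflib. This is a minimal sketch; the repository's java-sim-fuzz-opt.py may rely on a different fuzzy-matching library:

```python
from difflib import SequenceMatcher

def fuzzy_similarity(a: str, b: str) -> float:
    """Ratio of matched characters: 2*M / (len(a) + len(b)),
    where M is the total size of matching blocks."""
    return SequenceMatcher(None, a, b).ratio()
```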

2. Token-Based Methods

These methods analyze code at the token/word level.

Bag-of-Words (java-sim-bow-opt.py, java-sim-bow2-opt.py)

  • What it does: Treats code as collection of tokens
  • Best for: Comparing overall code vocabulary
  • Speed: ⚡⚡⚡ Very fast
  • Accuracy: Good
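
A token-count comparison of this kind can be sketched as a cosine similarity over token-frequency vectors. Illustrative only; the simple `\w+` tokenizer here is an assumption, not the repository's tokenization:

```python
from collections import Counter
import math
import re

def bow_cosine(code1: str, code2: str) -> float:
    """Cosine similarity of token-count vectors: order is ignored,
    but token multiplicity counts."""
    t1 = Counter(re.findall(r"\w+", code1))
    t2 = Counter(re.findall(r"\w+", code2))
    dot = sum(t1[w] * t2[w] for w in t1)
    norm = (math.sqrt(sum(c * c for c in t1.values()))
            * math.sqrt(sum(c * c for c in t2.values())))
    return dot / norm if norm else 0.0
```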

Jaccard Similarity (java-sim-jaccard-opt.py) ⭐

  • What it does: Compares token set overlap
  • Best for: Fast, accurate clone detection
  • Speed: ⚡⚡⚡⚡ Extremely fast (2,066 ms, the fastest in the benchmark)
  • Accuracy: Excellent (0.87 F-measure)
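
The token-set overlap behind Jaccard can be sketched in a few lines (illustrative; the repository's tokenization may differ):

```python
import re

def jaccard_similarity(code1: str, code2: str) -> float:
    """Jaccard index over token *sets*: |intersection| / |union|.
    Unlike bag-of-words, token multiplicity is ignored."""
    s1 = set(re.findall(r"\w+", code1))
    s2 = set(re.findall(r"\w+", code2))
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 1.0  # two empty inputs count as identical
```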

TF-IDF (java-sim-tdf-opt.py)
