CodeDistillery

A highly parallel software repository mining framework.

What?

CodeDistillery is a framework for mining source code changes from version control systems.

What makes CodeDistillery a framework rather than a tool is its support for pluggable source code mining mechanisms, combined with the underlying infrastructure to apply these mechanisms efficiently to the entire revision histories of numerous software repositories.

Why?

Mining a single repository with a hundred or so commits can be done without considering scale. That is no longer the case when the task involves dozens of repositories or more, each with thousands of revisions or more. Workloads of this size are quite standard in current empirical software engineering studies, so we thought the tool we had built for the task could make life easier for others doing similar work.

How?

We use Apache Spark as the distributed compute engine and JGit as the data access layer. The mining workload is structured to be highly parallel, and Spark is used to distribute and process it.

In light of the above, CodeDistillery would not have been possible (or at the very least, would have been much harder to build) without awesome people building awesome open source software, in particular the open source projects we built upon extensively: Apache Spark, JGit and ChangeDistiller.
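The idea of a highly parallel mining workload can be illustrated without Spark. The sketch below is not CodeDistillery's implementation: it swaps Spark for plain Scala `Future`s, and it assumes that the unit of work is a single revision of a single repository, which is only one plausible reading of the description above. `ParallelMiningSketch`, `WorkItem` and `mine` are hypothetical names introduced for illustration.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelMiningSketch {

  // Hypothetical unit of work: one revision of one repository.
  final case class WorkItem(repo: String, revision: String)

  // Flatten every (repository -> revisions) entry into one list of independent work items.
  def workItems(histories: Map[String, Seq[String]]): Seq[WorkItem] =
    histories.toSeq.flatMap { case (repo, revisions) =>
      revisions.map(revision => WorkItem(repo, revision))
    }

  // Placeholder for the real mining step; here it only produces a label.
  def mine(item: WorkItem): String = s"${item.repo}@${item.revision}"

  // Process all work items concurrently and collect the results.
  def mineAll(histories: Map[String, Seq[String]]): Set[String] = {
    val futures = workItems(histories).map(item => Future(mine(item)))
    Await.result(Future.sequence(futures), 1.minute).toSet
  }
}
```

Because the work items are independent of one another, a small repository does not leave workers idle while a large one is still being processed. A distributed engine such as Spark applies the same principle across machines rather than threads.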

Getting Started

git clone https://github.com/staslev/CodeDistillery  
cd CodeDistillery  
mvn clean install

The project was developed and tested using JDK 8. If you have multiple JDKs installed, make sure Maven uses JDK 8 when executing mvn clean install. This can be done by setting JAVA_HOME to your JDK 8 installation first, for example:

export JAVA_HOME=/my_jdk1.8/Contents/Home && mvn clean install

Setting up Maven dependencies

<dependencies>  
  <dependency> 
    <groupId>com.staslev.codedistillery</groupId>   
    <artifactId>distillery-core</artifactId>
    <version>0.5-SNAPSHOT</version>
  </dependency>
  <dependency>  
    <groupId>com.staslev.codedistillery</groupId>
    <artifactId>change-distiller-uzh</artifactId>
    <version>0.5-SNAPSHOT</version>
  </dependency>
</dependencies>

Usage

We demonstrate CodeDistillery by providing out-of-the-box support for mining fine-grained Java source code changes from Git repositories.

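import java.nio.file.Paths

// Imports of the CodeDistillery classes used below are omitted here.

object Main {

  def main(args: Array[String]): Unit = {

    // Plug in Git access, the source code change distiller and the CSV encoder.
    val codeDistillery =
      new CodeDistillery(
        vcsFactory = GitRepo.apply,
        distillerFactory = UzhSourceCodeChangeDistiller.apply,
        encoderFactory = () => UzhSourceCodeChangeCSVEncoder)
      with CrossRepoRevisionParallelism

    val repoPath = Paths.get("/path/to/my/repo")
    val output = Paths.get("/path/to/write/output")
    val branch = "master"

    // Run on a local Spark instance.
    import LocalSparkParallelism.spark

    // Each entry pairs a repository path with the branch to mine.
    codeDistillery.distill(Set((repoPath, branch)), output)
  }
}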
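Since `distill` takes a set of (repository path, branch) pairs, mining many repositories comes down to building that set. The sketch below shows one way to do so for a directory of cloned repositories. `RepoTargets` and `under` are hypothetical helpers, not part of CodeDistillery, and the sketch assumes that every repository should be mined on the same branch.

```scala
import java.nio.file.{Files, Path}

object RepoTargets {

  // Pair every Git working copy found directly under `root` with `branch`.
  // A directory counts as a working copy here if it contains a `.git` entry.
  def under(root: Path, branch: String): Set[(Path, String)] = {
    // listFiles returns null when `root` is not a readable directory.
    val children = Option(root.toFile.listFiles()).map(_.toSeq).getOrElse(Seq.empty)
    children
      .filter(dir => dir.isDirectory && Files.exists(dir.toPath.resolve(".git")))
      .map(dir => (dir.toPath, branch))
      .toSet
  }
}
```

The result has the same shape as the `Set((repoPath, branch))` argument in the example above, so it could be passed to `codeDistillery.distill` in its place.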

Output

The output is a CSV file with a # delimiter, consisting of the following fields (in respective order):

Project name
Commit hash
Author name
Author email
Fine-grained change type
Unique name of changed entity
Significance level
Parent entity type
Unique name of parent entity
Root entity type
Unique name of root entity
Commit message
Filename
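A row in this format can be read back by splitting on `#`. The following is a minimal sketch that maps the thirteen fields above, in order, onto a case class. `RawChangeRow` and `Change` are hypothetical names. The sketch also assumes that no field contains an unescaped `#`; how CodeDistillery handles such characters (for example in commit messages) is not stated here, so rows with a different field count are rejected rather than guessed at.

```scala
object RawChangeRow {

  // One row of the raw output, with fields in the order listed above.
  final case class Change(
      project: String,
      commitHash: String,
      authorName: String,
      authorEmail: String,
      changeType: String,
      changedEntity: String,
      significance: String,
      parentEntityType: String,
      parentEntity: String,
      rootEntityType: String,
      rootEntity: String,
      commitMessage: String,
      filename: String)

  // Returns None when the line does not split into exactly 13 fields,
  // e.g. because a free-text field itself contains '#'.
  def parse(line: String): Option[Change] =
    line.split("#", -1) match { // limit -1 keeps trailing empty fields
      case Array(p, c, an, ae, ct, ce, s, pt, pe, rt, re, m, f) =>
        Some(Change(p, c, an, ae, ct, ce, s, pt, pe, rt, re, m, f))
      case _ => None
    }
}
```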

Obtaining commit level datasets

The output from the previous stage is a dataset of raw fine-grained source code changes as distilled from a software repository. It is often useful to aggregate this raw dataset into commit level statistics. A commit level dataset can be obtained by performing the following:

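import java.nio.file.Paths

// Two raw datasets produced by the previous stage, and the destination of the aggregate.
val input1 :: input2 :: output :: Nil =
  List("/path/to/input1", "/path/to/input2", "/path/to/output")
    .map(Paths.get(_))

import LocalSparkParallelism.spark

PerCommit.aggregate(Set(input1, input2), output)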

The output is a CSV file with a # delimiter, consisting of the following fields (in respective order):

Project name
Commit hash
Author name
Author email
Date
Non test versatility
Commit message
Test cases added
Test cases removed
Test cases changed
Test suites added
Test suites removed
Test suites affected
Has issue ref
Non test files in commit
Total files in commit
Commit message length

These fields are followed by the fine-grained source code change type frequencies: one column per change type, sorted lexicographically by change type.
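To make the layout of the frequency columns concrete, the sketch below computes them with plain Scala collections from (commit hash, change type) pairs. It is an illustration and not the `PerCommit` implementation. `ChangeTypeFrequencies` is a hypothetical name, and the sketch derives the columns from the change types present in the input, whereas the real header may cover a fixed set of change types.

```scala
object ChangeTypeFrequencies {

  // For each commit, count occurrences of every change type and lay the counts out
  // in lexicographic order of change type, so that all rows share the same columns.
  def perCommit(changes: Seq[(String, String)]): (Seq[String], Map[String, Seq[Int]]) = {
    // Column order: the distinct change types, sorted lexicographically.
    val columns = changes.map(_._2).distinct.sorted
    val rows = changes.groupBy(_._1).map { case (commit, commitChanges) =>
      val counts = commitChanges.groupBy(_._2).map { case (changeType, occurrences) =>
        changeType -> occurrences.size
      }
      // Change types that do not occur in a commit get a frequency of 0.
      commit -> columns.map(changeType => counts.getOrElse(changeType, 0))
    }
    (columns, rows)
  }
}
```

Sorting the column names once and reusing that order for every row keeps each frequency in the same position across commits, which is what lets a single header line describe the whole file.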

The complete list of columns (a.k.a. header line) can be obtained using: PerCommit.headerLine.
