Gitminer
Java based tools for extracting information from GitHub and git repositories into a graph database
Install / Use
/learn @pridkett/GitminerREADME
GitMiner
Copyright (c) 2011-2012 by IBM and the University of Nebraska-Lincoln
By Patrick Wagstrom <patrick@wagstrom.net> and Corey Jergenson <corey.jergensen@gmail.com>
This project is from a joint research project between IBM Research and the University of Nebraska-Lincoln. Under the terms of that agreement all output from this project must be distributed under the terms of the [Apache License][license].
Compilation
This project uses Apache Maven to manage all dependencies and versioning. The simplest way to get going is to run the following command:
mvn clean compile package assembly:single
Configuration
GitMiner has many different properties that can be set to alter the behavior
of the program. The following is a simple configuration file that will be
enough to get you going. Copy this data to a file and name it something like
configuration.properties in the root directory of your GitMiner install.
net.wagstrom.research.github.login=YOURGITHUBLOGIN
net.wagstrom.research.github.password=YOURGITHUBPASSWORD
net.wagstrom.research.github.email=YOUREMAILADDRESS
net.wagstrom.research.github.dbengine=neo4j
net.wagstrom.research.github.dburl=graph.db
net.wagstrom.research.github.projects=pridkett/gitminer
Alternatively, you may authenticate by using an OAuth token to be configured as follows in lieu of giving login and password.
net.wagstrom.research.github.token=YOUROAUTHTOKEN
See http://developer.github.com/v3/oauth/ for more information.
If you plan on using the Git Repository loading functionality then you'll need to set the following options that are specific to that functionality. In the future these options will be merged together, but for right now you'll just need to repeat them.
edu.unl.cse.git.dbengine=neo4j
edu.unl.cse.git.dburl=graph.db
edu.unl.cse.git.repositories=pridkett/gitminer
Execution
Execution of GitMiner is a two step process that consists of first using the GitHub API to download project data and then later using git directly to process project source code commits. The configuration file created in the last step has all the settings you'll need for stages.
To begin, run GitMiner so it downloads data from using the GitHub API:
./gitminer.sh -c configuration.properties
Next, use the repository loader functions of GitMiner to download the source code history for the projects.
./repo-loader.sh -c configuration.properties
Configuration Parameters
For the most part we have attempted to provide sensible defaults for configuration parameters, however some parameters must have their values set for the tool to function.
-
name:
net.wagstrom.research.github.login<br> default: no default<br> description: this parameter must be set. On October 14, 2012 [GitHub changed the way their API works][gh-api-limit] and reduced the number of anonymous API requests to 60/hour. With this parameter set you can get as many as 5000 requests an hour. Without this parameter GitMiner will just refuse to run. -
name:
net.wagstrom.research.github.password<br> default: no default<br> description: the companion tonet.wagstrom.research.github.login. -
name:
net.wagstrom.research.github.token<br> default: no default<br> description: this can be set instead of login and password to authenticate with GitHub using the given OAuth token as documented on http://developer.github.com/v3/oauth/. -
name:
net.wagstrom.research.github.email<br> default: no default<br> description: this is your email address. GitHub has requested that all clients using the API provide additional mechanisms to identify themselves via the user-agent. One of the ways that GitMiner accomplishes this is by putting your email address in the user-agent string. Please be nice and set this value accordingly. -
name:
net.wagstrom.research.github.projects<br> default: no default<br> description: a comma separated list of projects to begin spidering. For examplerails/rails,pridkett/gitminer,tinkerpop/blueprints. -
name:
net.wagstrom.research.github.users<br> default: no default<br> description: a comma separated list of users to spider. For examplepridkett,jurgns,dhh. -
name:
net.wagstrom.research.github.organizations<br> default: no default<br> description: a comma separated list of organizations to spider. For example,37signals,tinkerpop. -
name:
net.wagstrom.research.github.refreshTime<br> default:0.0<br> description: minimum number of days since the last update to download information about a user or other element again. For most purposes you can probably set this much higher. This will GREATLY speed up your crawls if you set it to a high value. -
name:
net.wagstrom.research.github.apiThrottle.maxCalls.v3<br> default4980<br> description: The maximum number of calls via the GitHub v3 API in a given time period. Use this to rate limit under what the API says. I typically set this value to4980or something like that to avoid problems when I hit API limits. If the value is0then this is ignored. -
name:
net.wagstrom.research.github.apiThrottle.maxCallsInterval.v3<br> default:3600<br> description: Time period (in seconds) to make the maximum number of calls using the v3 GitHub API. Previously some APIs allowed 60calls/min and others 5000/hr, but the API didn't set this. Now it seems to always be 5000/hr, so this is generally set to3600. -
name:
net.wagstrom.research.github.miner.repositories<br> default:true<br> description: atrue/falseparameter on whether or not to download data for the projects specified innet.wagstrom.research.github.usersproperty. -
name:
net.wagstrom.research.github.miner.repositories.collaborators<br> default:true<br> description: atrue/falseparameter on whether or not to downlaod data for the collaborators listed for each project. -
name:
net.wagstrom.research.github.miner.repositories.contributors<br> default:true<br> description: atrue/falseparameter on whether or not to download data for the contributors listed for each project. -
name:
net.wagstrom.research.github.miner.repositories.watchers<br> default:true<br> description: atrue/falseparameter for whether or not to download data for the watchers listed for each project. -
name:
net.wagstrom.research.github.miner.repositories.forks<br> default:true<br> description: atrue/falseparameter for whether or not to download data about forks for each project. -
name:
net.wagstrom.research.github.miner.repositories.issues<br> default:true<br> description: atrue/falseparameter for whether or not to download data about issues for each project. -
name:
net.wagstrom.research.github.miner.repositories.pullrequests<br> default:true<br> description: atrue/falseparameter for whether or not to download data about pull requests for each project. -
name:
net.wagstrom.research.github.miner.repositories.users<br> default:true<br> description: atrue/falseparameter for whether or not to download all the data about all the users for each project. FIXME: I'm not certain off the top of my head about what interplay this has with other settings above. -
name:
net.wagstrom.research.github.miner.users.events<br> default:true<br> description: atrue/falseparameter for whether or not to download the public events stream for each user mined. -
name:
net.wagstrom.research.github.miner.users.gists<br> default:true<br> description: atrue/falseparameter for whether or not to download the set of gists for each user mined. -
name:
net.wagstrom.research.github.miner.users<br> default:true<br> description: atrue/falseparameter for whether or not to download any information for the users listed innet.wagstrom.research.github.users. -
name:
net.wagstrom.research.github.miner.organizations<br> default:true<br> description: atrue/falseparameter for whether or not to download any information for the organizations listed innet.wagstrom.research.github.organizations. -
name:
net.wagstrom.research.github.miner.gists<br> default:true<br> description: atrue/falseparameter for whether or not to download any gists at all. -
name:
net.wagstrom.research.github.dbengine<br> default:neo4j<br> description: the name of the [Blueprints][blueprints] database backend to use. Right now this has only been tested onneo4j,orientdb, andtinkergraph. This feature is dependent on the features present in [govscigraph][govscigraph]. -
name:
net.wagstrom.research.github.dburl<br> default:github.db<br> description: the URL of the database to save to. For neo4j this is simply the directory where the database exists.
Java Options
In some cases, for some repositories, substantial java memory is required.
In these cases, setting the java memory as follows seems to work.
*this fixes ISSUE 34
export JAVA_OPTIONS="-Xms12g -Xmx12g"
sample data
If you'd like to see the output of gitminer without having to execute it, we have made two full datasets available on github.
- [gitminer-data-rails][gitminer-data-rails]: a scrape of the rails ecosystem from the month of May 2012
- [gitminer-data-tinkerpop][gitminer-data-tinkerpop]: a scrape of the projects in the tinkerpop stack from the month of May 2012
[gh-api-limit]:
Related Skills
feishu-drive
339.5k|
things-mac
339.5kManage Things 3 via the `things` CLI on macOS (add/update projects+todos via URL scheme; read/search/list from the local Things database)
clawhub
339.5kUse the ClawHub CLI to search, install, update, and publish agent skills from clawhub.com
yu-ai-agent
2.0k编程导航 2025 年 AI 开发实战新项目,基于 Spring Boot 3 + Java 21 + Spring AI 构建 AI 恋爱大师应用和 ReAct 模式自主规划智能体YuManus,覆盖 AI 大模型接入、Spring AI 核心特性、Prompt 工程和优化、RAG 检索增强、向量数据库、Tool Calling 工具调用、MCP 模型上下文协议、AI Agent 开发(Manas Java 实现)、Cursor AI 工具等核心知识。用一套教程将程序员必知必会的 AI 技术一网打尽,帮你成为 AI 时代企业的香饽饽,给你的简历和求职大幅增加竞争力。
