
Ghs

GitHub Search: Platform used to crawl, store and present projects from GitHub, as well as any statistics related to them


GitHub Search

This project is made of two components:

  1. A Spring Boot powered back-end, responsible for:
    1. Continuously crawling GitHub API endpoints for repository information, and storing it in a central database;
    2. Acting as an API for providing access to the stored data.
  2. A Bootstrap-styled and jQuery-powered web user interface, serving as an accessible front for the API.

Running Locally

Prerequisites

| Dependency | Version Requirement |
|------------|--------------------:|
| Java       |                  17 |
| Maven      |                 3.9 |
| MySQL      |                 8.3 |
| Flyway     |               10.13 |
| cloc[^1]   |                2.00 |
| Git[^1]    |                2.43 |

[^1]: Only required in versions prior to 1.7.0

Database

Before choosing whether to start with a clean slate or pre-populated database, make sure the following requirements are met:

  1. The database timezone is set to +00:00. You can verify this via:

    SELECT @@global.time_zone, @@session.time_zone;
    
  2. The event scheduler is turned ON. You can verify this via:

    SELECT @@global.event_scheduler;
    
  3. The log_bin_trust_function_creators variable is set to 1, so that stored functions can be created while binary logging is enabled. You can verify this via:

    SELECT @@global.log_bin_trust_function_creators;
    
  4. The gse database exists. To create it:

    CREATE DATABASE gse CHARACTER SET utf8 COLLATE utf8_bin;
    
  5. The gseadmin user exists. To create one, run:

    CREATE USER IF NOT EXISTS 'gseadmin'@'%' IDENTIFIED BY 'Lugano2020';
    GRANT ALL ON gse.* TO 'gseadmin'@'%';
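
The three verification queries above can be folded into a single scripted check. The following is a minimal sketch and not part of the project: the helper names `fetch_prereqs` and `check_prereqs` are hypothetical, and connecting as `root` is an assumption about your local setup.

```shell
#!/bin/sh

# Hypothetical helper: validates the values returned by the verification queries.
# Arguments: global time zone, event scheduler state, log_bin_trust_function_creators.
check_prereqs() {
    status=0
    [ "$1" = "+00:00" ] || { echo "time_zone is '$1', expected '+00:00'"; status=1; }
    [ "$2" = "ON" ] || { echo "event_scheduler is '$2', expected 'ON'"; status=1; }
    [ "$3" = "1" ] || { echo "log_bin_trust_function_creators is '$3', expected '1'"; status=1; }
    return "$status"
}

# Hypothetical helper: fetches all three values in one round trip.
# -N drops the header row, -B prints tab-separated output, -e runs the statement.
# The root account is an assumption; use any account allowed to read global variables.
fetch_prereqs() {
    mysql -u root -p -N -B -e \
        'SELECT @@global.time_zone, @@global.event_scheduler, @@global.log_bin_trust_function_creators;'
}
```

With a running MySQL server, `check_prereqs $(fetch_prereqs)` prints one line per unmet requirement and returns a non-zero status if any check fails.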
    

If you prefer to begin with an empty database, there is nothing more for you to do. The required tables will be generated through Flyway migrations during the initial startup of the server.

However, if you would like your local database to be pre-populated with the data we've collected, you can use the compressed SQL dump we offer. We host this dump, along with the four previous iterations, on Dropbox. After choosing and downloading a database dump, you can import the data by executing:

gzcat < gse.sql.gz | mysql -u gseadmin -pLugano2020 gse
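
The `gzcat` command may not be available on every system, Linux distributions in particular. As a hedged alternative, the sketch below uses `gzip -dc`, which behaves the same way on Linux and macOS. The helper names `decompress_dump` and `import_dump` are hypothetical and not part of the project.

```shell
#!/bin/sh

# Hypothetical helper: streams a gzip-compressed dump to stdout.
# `gzip -dc` decompresses to stdout and leaves the archive untouched.
decompress_dump() {
    gzip -dc "$1"
}

# Hypothetical helper: imports a dump into the gse database.
# -p without a value makes the client prompt for the password,
# so it does not end up in the shell history.
import_dump() {
    decompress_dump "$1" | mysql -u gseadmin -p gse
}
```

Calling `import_dump gse.sql.gz` should then be equivalent to the command above.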

Server

Before attempting to run the server, you should generate your own GitHub personal access token (PAT). The crawler relies on the GraphQL API, which is inaccessible without authentication. To access the information provided by the GitHub API, the token must include the repo scope.
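
Before handing the token to the crawler, it can be useful to confirm that GitHub accepts it. The sketch below sends a minimal `viewer` query to the public GraphQL endpoint. It assumes `curl` is installed, and the helper names `viewer_query` and `check_token` are hypothetical, not part of the project.

```shell
#!/bin/sh

# Endpoint that GitHub documents for GraphQL requests.
GITHUB_GRAPHQL_URL="https://api.github.com/graphql"

# Hypothetical helper: builds the JSON body for a minimal "who am I" query.
viewer_query() {
    printf '%s' '{"query":"query { viewer { login } }"}'
}

# Hypothetical helper: sends the query with the token given as the first argument.
# -s silences progress output; -f makes curl exit non-zero on HTTP errors such as 401.
# -d implies a POST request.
check_token() {
    curl -sf -H "Authorization: bearer $1" -d "$(viewer_query)" "$GITHUB_GRAPHQL_URL"
}
```

Running `check_token` with a valid token should return a JSON response containing the login of the account that owns the token, while a rejected token makes the function return a non-zero status.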

Once that is done, you can run the server locally using Maven:

mvn spring-boot:run

If you want to make use of the token when crawling, specify it in the run arguments:

mvn spring-boot:run -Dspring-boot.run.arguments=--ghs.github.tokens=<your_access_token>

Alternatively, you can compile and run the JAR directly:

mvn clean package
ln target/ghs-application-*.jar target/ghs-application.jar
java -Dghs.github.tokens=<your_access_token> -jar target/ghs-application.jar
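
Because the back-end is a Spring Boot application, its properties can presumably also be supplied through environment variables. Spring Boot documents the mapping as: replace dots with underscores, remove dashes, and convert to uppercase. The sketch below applies that rule; the helper name `to_env_name` is hypothetical, and whether the project relies on this mechanism is an assumption.

```shell
#!/bin/sh

# Hypothetical helper: converts a property name to the environment-variable
# form that Spring Boot's relaxed binding recognizes.
to_env_name() {
    printf '%s\n' "$1" \
        | sed -e 's/-//g' -e 's/\./_/g' \
        | tr '[:lower:]' '[:upper:]'
}

to_env_name ghs.github.tokens          # prints GHS_GITHUB_TOKENS
to_env_name ghs.crawler.minimum-stars  # prints GHS_CRAWLER_MINIMUMSTARS
```

Exporting `GHS_GITHUB_TOKENS` before running `java -jar target/ghs-application.jar` should therefore have the same effect as the `-Dghs.github.tokens` flag, while keeping the token out of the command line.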

The application supports the following project-specific arguments, all of which are defined in application.properties:

| Variable Name | Type | Default Value | Description |
|---|---|---|---|
| ghs.github.tokens | List<String> | | List of GitHub personal access tokens (PATs) that will be used for mining the GitHub API. Must not contain blank strings. |
| ghs.github.api-version | String | 2022-11-28 | GitHub API version used across various operations. |
| ghs.git.username | String | | Git account login used to interact with the version control system. |
| ghs.git.password | String | | Password used to authenticate the specified Git account. |
| ghs.git.config | Map<String,String> | See application.properties | Git configurations specific to the application[^2]. |
| ghs.git.folder-prefix | String | ghs-clone- | Prefix used for the temporary directories into which analyzed repositories are cloned. Must not be blank. |
| ghs.git.ls-remote-timeout-duration | Duration | 1m | Maximum time allowed for listing remotes of Git repositories. |
| ghs.git.clone-timeout-duration | Duration | 5m | Maximum time allowed for cloning Git repositories. |
| ghs.cloc.max-file-size | DataSize | 25MB | Maximum file size threshold for analysis with cloc. |
| ghs.cloc.timeout-duration | Duration | 5m | Maximum time allowed for a cloc command to execute. |
| ghs.crawler.enabled | Boolean | true | Specifies if the repository crawling job is enabled. |
| ghs.crawler.minimum-stars | int | 10 | Inclusive lower bound for the number of stars a project needs to have in order to be picked up by the crawler. Must not be negative. |
| ghs.crawler.languages | List<String> | See application.properties | List of language names that will be targeted during crawling. Must not contain blank strings. T
