# GHS

GitHub Search: a platform used to crawl, store, and present projects from GitHub, as well as any statistics related to them.
This project is made of two components:

- A Spring Boot powered back-end, responsible for:
  - Continuously crawling GitHub API endpoints for repository information, and storing it in a central database;
  - Acting as an API for providing access to the stored data.
- A Bootstrap-styled and jQuery-powered web user interface, serving as an accessible front end for the API.
## Running Locally

### Prerequisites
| Dependency | Version Requirement |
|------------|--------------------:|
| Java       | 17                  |
| Maven      | 3.9                 |
| MySQL      | 8.3                 |
| Flyway     | 10.13               |
| cloc[^1]   | 2.00                |
| Git[^1]    | 2.43                |
[^1]: Only required in versions prior to 1.7.0
### Database
Before choosing whether to start with a clean slate or pre-populated database, make sure the following requirements are met:
- The database timezone is set to `+00:00`. You can verify this via:

  ```sql
  SELECT @@global.time_zone, @@session.time_zone;
  ```

- The event scheduler is turned `ON`. You can verify this via:

  ```sql
  SELECT @@global.event_scheduler;
  ```

- Binary logging during the creation of stored functions is set to `1`. You can verify this via:

  ```sql
  SELECT @@global.log_bin_trust_function_creators;
  ```

- The `gse` database exists. To create it:

  ```sql
  CREATE DATABASE gse CHARACTER SET utf8 COLLATE utf8_bin;
  ```

- The `gseadmin` user exists. To create one, run:

  ```sql
  CREATE USER IF NOT EXISTS 'gseadmin'@'%' IDENTIFIED BY 'Lugano2020';
  GRANT ALL ON gse.* TO 'gseadmin'@'%';
  ```
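If any of the checks above report a different value, the settings can be adjusted at runtime by a user with sufficient privileges. Note that `SET GLOBAL` changes do not survive a server restart unless they are also written to the MySQL configuration file:

```sql
SET GLOBAL time_zone = '+00:00';
SET GLOBAL event_scheduler = ON;
SET GLOBAL log_bin_trust_function_creators = 1;
```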
If you prefer to begin with an empty database, there is nothing more for you to do. The required tables will be generated through Flyway migrations during the initial startup of the server. However, if you would like your local database to be pre-populated with the data we've collected, you can use the compressed SQL dump we offer. We host this dump, along with the four previous iterations, on Dropbox. After choosing and downloading a database dump, you can import the data by executing:
```shell
gzcat < gse.sql.gz | mysql -u gseadmin -pLugano2020 gse
```
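`gzcat` is the BSD/macOS name for the tool; most Linux distributions ship the equivalent `zcat` (or `gunzip -c`). The decompress-and-pipe step can be sanity-checked without a running database by substituting a throwaway file for the real dump:

```shell
# Stand-in for gse.sql.gz: a minimal one-statement SQL file, gzipped
printf 'SELECT 1;\n' | gzip > demo.sql.gz
# Same decompression step as the import command, minus the mysql pipe
zcat demo.sql.gz
rm demo.sql.gz
```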
### Server
Before attempting to run the server, you should generate your own GitHub personal access token (PAT).
The crawler relies on the GraphQL API, which is inaccessible without authentication.
To access the information provided by the GitHub API, the token must include the repo scope.
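Before starting the server, you can verify that the token is valid with a direct call to the GraphQL endpoint. This requires network access and a real token in place of the placeholder, so it is shown for illustration only:

```shell
curl -s https://api.github.com/graphql \
  -H "Authorization: bearer <your_access_token>" \
  -d '{"query": "query { viewer { login } }"}'
```

A valid token should return the login of the account it belongs to; an invalid or missing token yields an authentication error instead.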
Once that is done, you can run the server locally using Maven:
```shell
mvn spring-boot:run
```
If you want to make use of the token when crawling, specify it in the run arguments:
```shell
mvn spring-boot:run -Dspring-boot.run.arguments=--ghs.github.tokens=<your_access_token>
```
Alternatively, you can compile and run the JAR directly:
```shell
mvn clean package
ln target/ghs-application-*.jar target/ghs-application.jar
java -Dghs.github.tokens=<your_access_token> -jar target/ghs-application.jar
```
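Since this is a standard Spring Boot application, its relaxed binding rules also apply: a property such as `ghs.github.tokens` can be supplied through the environment variable `GHS_GITHUB_TOKENS` instead of a `-D` flag. A sketch, with the token value still a placeholder:

```shell
export GHS_GITHUB_TOKENS=<your_access_token>
java -jar target/ghs-application.jar
```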
Here is a list of project-specific arguments supported by the application, which you can find in `application.properties`:
| Variable Name | Type | Default Value | Description |
|--------------------------------------|--------------------------|-------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| ghs.github.tokens | List<String> | | List of GitHub personal access tokens (PATs) that will be used for mining the GitHub API. Must not contain blank strings. |
| ghs.github.api-version | String | 2022-11-28 | GitHub API version used across various operations. |
| ghs.git.username | String | | Git account login used to interact with the version control system. |
| ghs.git.password | String | | Password used to authenticate the specified Git account. |
| ghs.git.config | Map<String,String> | See application.properties | Git configurations specific to the application[^2]. |
| ghs.git.folder-prefix | String | ghs-clone- | Prefix used for the temporary directories into which analyzed repositories are cloned. Must not be blank. |
| ghs.git.ls-remote-timeout-duration | Duration | 1m | Maximum time allowed for listing remotes of Git repositories. |
| ghs.git.clone-timeout-duration | Duration | 5m | Maximum time allowed for cloning Git repositories. |
| ghs.cloc.max-file-size | DataSize | 25MB | Maximum file size threshold for analysis with cloc. |
| ghs.cloc.timeout-duration | Duration | 5m | Maximum time allowed for a cloc command to execute. |
| ghs.crawler.enabled | Boolean | true | Specifies if the repository crawling job is enabled. |
| ghs.crawler.minimum-stars | int | 10 | Inclusive lower bound for the number of stars a project needs to have in order to be picked up by the crawler. Must not be negative. |
| ghs.crawler.languages                 | List&lt;String&gt;       | See application.properties                                                | List of language names that will be targeted during crawling. Must not contain blank strings. |
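Rather than passing everything on the command line, these same properties can be overridden in a Spring profile-specific file. A minimal sketch, assuming a hypothetical `local` profile and placeholder values:

```properties
# src/main/resources/application-local.properties (hypothetical profile file)
# activated with: mvn spring-boot:run -Dspring-boot.run.profiles=local
ghs.github.tokens=<your_access_token>
ghs.crawler.minimum-stars=25
ghs.crawler.enabled=true
```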
