GithubCrawler
Crawl GitHub data with or without the GitHub API
Github_Crawler
Crawl GitHub data with or without the GitHub API, then store it in a MySQL database.
You need to install:
- Python 2.7 (preferably via Anaconda)
- MySQL; you can install and configure it by following this tutorial: http://www.liaoxuefeng.com/wiki/001374738125095c955c1e6d8bb493182103fac9270762a000/001391435131816c6a377e100ec4d43b3fc9145f3bb8056000
File introduction
- CrawlerWithoutAPI.py: a demo that crawls GitHub without the API; give it the URL of a repo and it returns the result
- GithubCrawler.py: crawls GitHub using the API; searches all Java projects and extracts the README, description, topics, and all dependency files (.gradle and .pom)
- MysqlOption.py: MySQL operations: create the database and table, insert, and search
- CleanUtils.py: tools for cleaning and extraction
- token_key: your GitHub API token
- data_prepare.py: prepares data from the database for deep learning or data analysis
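As a rough illustration of how the token_key file might be consumed, here is a minimal sketch. The helper names `load_token` and `auth_headers` are hypothetical and are not taken from the project; the sketch only assumes that the file holds a single personal access token and that the token is sent in the `Authorization` header, which the GitHub API accepts.

```python
import io


def load_token(path="token_key"):
    """Read the access token from a file, ignoring surrounding whitespace."""
    with io.open(path, encoding="utf-8") as handle:
        # strip() makes a trailing newline in the file harmless
        return handle.read().strip()


def auth_headers(token):
    """Build the HTTP header that authenticates a GitHub API request."""
    return {"Authorization": "token " + token}
```

The returned dictionary can be passed as the headers of whatever HTTP client the crawler uses.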
User guide
Follow these steps:
- generate your GitHub access token at https://github.com/settings/tokens
- create a new file named token_key in the project path, then copy and paste your personal access token into it (no need to add \n)
- modify MysqlOption.py and set your MySQL USER and PASSWORD
- run MysqlOption.py to create the database and the new table
- modify GithubCrawl.py: set START_FROM_TIME = YOUR START, set END_TO_TIME = YOUR END, and set START_FROM_PAGE = YOUR START. The search matches about 3.2 million projects, but a single query returns at most 1000 results in total and at most 100 results per page. So the results are first split by repo creation time, so that each time period returns fewer than 1000 results; then, within each time period, the results (at most 1000) are split into pages so that all of them can be fetched
- happily run GithubCrawl.py
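The splitting strategy described in the steps above can be sketched as follows. This is an illustration under stated assumptions, not the code in the project: `split_window`, `pages_for`, `search_params`, and the injected `count_fn` are hypothetical names, and bisecting the time range is just one way to get every period under the 1000-result cap. The `q`, `per_page`, and `page` parameters and the `created:START..END` qualifier belong to the GitHub repository search API.

```python
from datetime import timedelta

MAX_RESULTS_PER_QUERY = 1000  # a single search query exposes at most 1000 results
MAX_PER_PAGE = 100            # a single page holds at most 100 results


def split_window(start, end, count_fn, limit=MAX_RESULTS_PER_QUERY):
    """Bisect the range start..end until each window has at most `limit` results.

    `count_fn(start, end)` is a caller-supplied (hypothetical) function that
    returns how many repos were created in that range.
    """
    total = count_fn(start, end)
    # Stop when the window is small enough, or too short to split further.
    if total <= limit or end - start <= timedelta(seconds=1):
        return [(start, end, total)]
    middle = start + (end - start) // 2
    return (split_window(start, middle, count_fn, limit)
            + split_window(middle, end, count_fn, limit))


def pages_for(total, per_page=MAX_PER_PAGE):
    """Return the 1-based page numbers needed to fetch `total` results."""
    capped = min(total, MAX_RESULTS_PER_QUERY)
    page_count = (capped + per_page - 1) // per_page  # ceiling division
    return list(range(1, page_count + 1))


def search_params(start, end, page, per_page=MAX_PER_PAGE):
    """Build query parameters for one page of one time window."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    query = "language:java created:{0}..{1}".format(
        start.strftime(fmt), end.strftime(fmt))
    return {"q": query, "per_page": per_page, "page": page}
```

In a real run, `count_fn` would issue a search request and read the total count from the response. Because neighbouring windows share a boundary timestamp, a repo created exactly on a boundary may be returned twice, so de-duplicating on insert is a sensible precaution.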
If you find this project helpful, please consider giving the repo a STAR :)
