Gogetcrawl
Extract web archive data using Wayback Machine and Common Crawl
Go Get Crawl
gogetcrawl is a tool and Go package that helps you download URLs and files from popular web archives such as Common Crawl and the Wayback Machine. You can use it as a command-line tool or import it as a package into your Go project.
Installation
Source
go install github.com/karust/gogetcrawl@latest
Docker
docker build -t gogetcrawl .
docker run gogetcrawl --help
Binary
Check out the latest release here.
Usage
Docker
docker run uranusq/gogetcrawl url *.tutorialspoint.com/* --ext pdf --limit 5
Docker compose
docker-compose up --build
CLI usage
- See commands and flags:
gogetcrawl -h
Get URLs
- You can query several domains at once; the flags are applied to each of them. By default, all results are printed to your terminal (use --collapse to get unique results):
gogetcrawl url *.example.com *.tutorialspoint.com/* --collapse
- To limit the number of results, write the output to a file, and use only Wayback as a source:
gogetcrawl url *.tutorialspoint.com/* --limit 10 --sources wb -o ./urls.txt
- Set date range:
gogetcrawl url *.tutorialspoint.com/* --limit 10 --from 20140131 --to 20231231
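The --from and --to values above look like YYYYMMDD dates. As a minimal sketch, assuming that format, a Go program that wraps the CLI could check a range before passing it on. The names validateRange and archiveDateLayout are hypothetical and not part of gogetcrawl:

```go
package main

import (
	"fmt"
	"time"
)

// archiveDateLayout is Go's reference layout for YYYYMMDD. That --from and
// --to use this shape is inferred from the example values, not confirmed.
const archiveDateLayout = "20060102"

// validateRange (hypothetical helper) parses both bounds and rejects
// malformed dates and ranges whose start is after their end.
func validateRange(from, to string) error {
	start, err := time.Parse(archiveDateLayout, from)
	if err != nil {
		return fmt.Errorf("invalid from date %q: %w", from, err)
	}
	end, err := time.Parse(archiveDateLayout, to)
	if err != nil {
		return fmt.Errorf("invalid to date %q: %w", to, err)
	}
	if start.After(end) {
		return fmt.Errorf("from %s is after to %s", from, to)
	}
	return nil
}

func main() {
	// The range from the CLI example above, then the same range reversed.
	fmt.Println(validateRange("20140131", "20231231"))
	fmt.Println(validateRange("20231231", "20140131"))
}
```

Parsing with time.Parse also catches impossible calendar dates, which a plain length or digit check would miss.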
Download files
- Download 5 PDF files to the ./test directory with 3 workers:
gogetcrawl download *.cia.gov/* --limit 5 -w 3 -d ./test -f "mimetype:application/pdf"
Package usage
go get github.com/karust/gogetcrawl
For both Wayback and Common Crawl, you can interact with the archives either concurrently or non-concurrently:
Wayback
- Get URLs:
package main

import (
	"fmt"

	"github.com/karust/gogetcrawl/common"
	"github.com/karust/gogetcrawl/wayback"
)

func main() {
	// Get only 10 status:200 pages
	config := common.RequestConfig{
		URL:     "*.example.com/*",
		Filters: []string{"statuscode:200"},
		Limit:   10,
	}
	// Set request timeout and retries
	wb, _ := wayback.New(15, 2)
	// Use config to obtain all CDX server responses
	results, _ := wb.GetPages(config)
	for _, r := range results {
		fmt.Println(r.Urlkey, r.Original, r.MimeType)
	}
}
- Get files:
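// Get all status:200 HTML files
config := common.RequestConfig{
	URL:     "*.tutorialspoint.com/*",
	Filters: []string{"statuscode:200", "mimetype:text/html"},
}
wb, _ := wayback.New(15, 2)
results, _ := wb.GetPages(config)
// Indexing an empty slice panics, so stop if nothing matched
if len(results) == 0 {
	fmt.Println("no captures matched the filters")
	return
}
// Get first file from CDX response
file, err := wb.GetFile(results[0])
if err != nil {
	fmt.Println("GetFile failed:", err)
	return
}
fmt.Println(string(file))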
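The Wayback examples above describe a request by URL pattern, filters, and limit, and mention CDX server responses. To make that mapping concrete, here is a rough sketch of how such a config could translate into a query against the public Wayback CDX endpoint. The cdxQuery type and both helper functions are hypothetical stand-ins; whether gogetcrawl assembles its requests this way is an assumption:

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// cdxEndpoint is the public Wayback CDX search endpoint.
const cdxEndpoint = "https://web.archive.org/cdx/search/cdx"

// cdxQuery is a hypothetical stand-in for common.RequestConfig; it mirrors
// only the three fields used in the examples above.
type cdxQuery struct {
	URL     string
	Filters []string
	Limit   int
}

// cdxParams maps the query onto the url, filter, limit and output
// parameters of the CDX API. Each filter becomes its own "filter" value.
func cdxParams(q cdxQuery) url.Values {
	params := url.Values{}
	params.Set("url", q.URL)
	params.Set("output", "json")
	for _, f := range q.Filters {
		params.Add("filter", f)
	}
	if q.Limit > 0 {
		params.Set("limit", strconv.Itoa(q.Limit))
	}
	return params
}

// cdxURL returns the full, percent-encoded request URL.
func cdxURL(q cdxQuery) string {
	return cdxEndpoint + "?" + cdxParams(q).Encode()
}

func main() {
	q := cdxQuery{
		URL:     "*.example.com/*",
		Filters: []string{"statuscode:200"},
		Limit:   10,
	}
	fmt.Println(cdxURL(q))
}
```

Building the query with url.Values keeps the wildcard and colon characters correctly escaped, which is easy to get wrong with string concatenation.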
CommonCrawl
To use Common Crawl, replace the wayback package with commoncrawl. The example below queries Common Crawl concurrently:
- Get URLs:
cc, _ := commoncrawl.New(30, 3)
config1 := common.RequestConfig{
	URL:     "*.tutorialspoint.com/*",
	Filters: []string{"statuscode:200", "mimetype:text/html"},
	Limit:   6,
}
config2 := common.RequestConfig{
	URL:     "example.com/*",
	Filters: []string{"statuscode:200", "mimetype:text/html"},
	Limit:   6,
}
resultsChan := make(chan []*common.CdxResponse)
errorsChan := make(chan error)
go func() {
	cc.FetchPages(config1, resultsChan, errorsChan)
}()
go func() {
	cc.FetchPages(config2, resultsChan, errorsChan)
}()
// Print result batches and errors as they arrive
for {
	select {
	case err := <-errorsChan:
		fmt.Printf("FetchPages goroutine failed: %v\n", err)
	case res, ok := <-resultsChan:
		if ok {
			fmt.Println(res)
		}
	}
}
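The receive loop above has no exit condition, so a real program needs some way to stop once every fetch has finished. One common pattern is sketched below, assuming each fetch call returns when it is done: wait for all goroutines with a sync.WaitGroup, close both channels, and drain until both are closed. The fetchFunc type and the two fake fetchers are hypothetical stand-ins for FetchPages so that the sketch runs without network access:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// fetchFunc mimics the calling shape of FetchPages: batches go to results,
// failures go to errs. Hypothetical; it uses []string instead of CDX records.
type fetchFunc func(results chan<- []string, errs chan<- error)

// collect runs every fetcher in its own goroutine, closes both channels
// once all fetchers have returned, and drains until both are closed.
func collect(fetchers ...fetchFunc) ([]string, []error) {
	results := make(chan []string)
	errs := make(chan error)

	var wg sync.WaitGroup
	for _, f := range fetchers {
		wg.Add(1)
		go func(f fetchFunc) {
			defer wg.Done()
			f(results, errs)
		}(f)
	}
	go func() {
		wg.Wait()
		close(results)
		close(errs)
	}()

	// Receive-only copies that can be set to nil without touching the
	// channels the goroutines above still hold.
	var resCh <-chan []string = results
	var errCh <-chan error = errs

	var all []string
	var failed []error
	for resCh != nil || errCh != nil {
		select {
		case batch, ok := <-resCh:
			if !ok {
				resCh = nil // a nil channel is never selected again
				continue
			}
			all = append(all, batch...)
		case err, ok := <-errCh:
			if !ok {
				errCh = nil
				continue
			}
			failed = append(failed, err)
		}
	}
	return all, failed
}

var errDemo = errors.New("demo fetch failed")

// okFetcher sends one batch of two fake URLs.
func okFetcher(results chan<- []string, errs chan<- error) {
	results <- []string{"example.com/a", "example.com/b"}
}

// failingFetcher reports a single error and no results.
func failingFetcher(results chan<- []string, errs chan<- error) {
	errs <- errDemo
}

func main() {
	urls, failed := collect(okFetcher, failingFetcher)
	fmt.Println(len(urls), "URLs,", len(failed), "errors")
}
```

Closing the channels from a single goroutine after wg.Wait avoids closing a channel that another fetcher is still sending on.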
- Get files:
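config := common.RequestConfig{
	URL:     "kamaloff.ru/*",
	Filters: []string{"statuscode:200", "mimetype:text/html"},
}
cc, _ := commoncrawl.New(15, 2)
results, _ := cc.GetPages(config)
// results[0] panics if nothing matched; check len(results) in real code
file, err := cc.GetFile(results[0])
if err != nil {
	fmt.Println("GetFile failed:", err)
	return
}
fmt.Println(string(file))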
Bugs + Features
If you run into a bug or have a feature request, feel free to open an issue.
