Goodbots
Verify IP addresses of respectful crawlers like Googlebot by reverse dns and forward dns lookups
goodbots - trust but verify
goodbots verifies the IP addresses of respectful crawlers like Googlebot by performing reverse dns and forward dns
lookups.
Latest release: v0.0.3-cetoddle
- Given an IP address (ex. 66.249.87.225)
- It performs a reverse dns lookup to get a hostname (ex. crawl-66-249-87-225.googlebot.com)
- Then does a forward dns lookup on the hostname to get an IP (ex. 66.249.87.225)
- It compares the 1st IP to the 2nd IP
- If they match, goodbots outputs the IP and hostname
The Job-to-be-Done (#jtbd)
In search engine optimization (SEO), it is common to analyze a site's access logs (aka bot logs). Often there are various requests by spoofed user-agents pretending to be official search engine crawlers like Googlebot. In order to have an accurate understanding of the site's crawl rate, we want to verify the IP address of the various crawlers.
What's new in v0.0.3-cetoddle
- Significant performance improvements for bulk DNS verification
- Configurable concurrency with -c (default now 50)
- New -mode flag for goodbots vs resolve
- Safer output writing with buffered single-writer channel architecture
- Per-lookup context timeouts
- Last-line input handling fix for files without trailing newline
- Expanded test coverage for concurrency, timeout behavior, and DNS server selection
Getting Started
How to install/build goodbots
Clone the repo:
git clone git@github.com:eywu/goodbots.git
Change to the /cmd/goodbots directory:
cd goodbots/cmd/goodbots
Build the executable from main.go:
go build
How to use goodbots
Once you've built the binary as described above, you can feed IPs to goodbots via standard input.
Test a single IP
echo "203.208.60.1" | ./goodbots
Test a range of IPs with prips command line tool
prips 203.208.40.1 203.208.80.1 | ./goodbots
Test a list of IPs from a text or csv file
./goodbots < ip-list.txt
Note: the CSV or text file must contain exactly one IP per line and nothing else.
Example:
66.249.87.224
203.208.23.146
203.208.23.126
203.208.60.227
Saving the results
goodbots prints to standard-out with tab (\t) delimiters, so you can capture the output with an output redirect.
Example Output
203.208.60.1 crawl-203-208-60-1.googlebot.com
66.249.85.123 google-proxy-66-249-85-123.google.com
66.249.87.12 rate-limited-proxy-66-249-87-12.google.com
66.249.85.224 google-proxy-66-249-85-224.google.com
Save the verified bot IPs from a file named ip-list.txt to a file named saved-results.tsv:
./goodbots < ip-list.txt > saved-results.tsv
DNS Resolvers
goodbots uses a deterministic round-robin strategy across public DNS resolvers for each lookup. This provides stable load balancing while keeping throughput high on large batches.
It uses these DNS providers:
- Cloudflare Public DNS
  - 1.1.1.1
  - 1.0.0.1
- Google Public DNS
  - 8.8.8.8
  - 8.8.4.4
- OpenDNS
  - 208.67.222.222
  - 208.67.220.220
- Quad9 DNS (⛔ not supported yet)
  - ~~9.9.9.9~~
  - ~~149.112.112.112~~
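A round-robin selection over the supported resolvers could look something like the sketch below. The names nextServer and resolverFor are hypothetical, the 2-second dial timeout is an assumed value, and the real goodbots code may be organized differently. An atomic counter keeps the rotation safe when many goroutines ask for a server at once.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"sync/atomic"
	"time"
)

// servers lists the supported resolvers, each with the standard DNS port.
var servers = []string{
	"1.1.1.1:53", "1.0.0.1:53", // Cloudflare
	"8.8.8.8:53", "8.8.4.4:53", // Google
	"208.67.222.222:53", "208.67.220.220:53", // OpenDNS
}

// counter is shared by all goroutines; each nextServer call advances it.
var counter uint64

// nextServer walks the server list in a fixed, repeating order.
func nextServer() string {
	n := atomic.AddUint64(&counter, 1) - 1
	return servers[n%uint64(len(servers))]
}

// resolverFor builds a resolver that sends its queries to a single server.
func resolverFor(server string) *net.Resolver {
	return &net.Resolver{
		PreferGo: true, // ask for the pure-Go resolver, which uses Dial
		Dial: func(ctx context.Context, network, _ string) (net.Conn, error) {
			d := net.Dialer{Timeout: 2 * time.Second} // assumed timeout
			return d.DialContext(ctx, network, server)
		},
	}
}

func main() {
	// Print two full rotations' worth of picks minus a few, to show the wrap.
	for i := 0; i < 8; i++ {
		fmt.Println(nextServer())
	}
}
```

Each lookup would then call resolverFor(nextServer()) and use the returned resolver's LookupAddr and LookupHost methods.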
Supported Crawlers
Domain name verification is currently a little imprecise: goodbots only checks that the domain name matches and does NOT match the TLD.
Future improvements will test for more precise domains based on each crawler's specifications.
- googlebot
  - .googlebot.
  - .google.
- msnbot
  - .msn.
- bingbot
  - .msn.
- pinterest
  - .pinterest.
- yandex
  - .yandex.
- baidu
  - .baidu.
- coccoc
  - .coccoc.
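As a rough sketch of what matching the domain name without the TLD can mean, the check below looks for the fragments listed above anywhere in the hostname. The function name isCrawlerHost and the substring approach are assumptions made for illustration, not a description of the actual goodbots code.

```go
package main

import (
	"fmt"
	"strings"
)

// fragments are the domain pieces from the supported crawler list.
var fragments = []string{
	".googlebot.", ".google.", ".msn.", ".pinterest.",
	".yandex.", ".baidu.", ".coccoc.",
}

// isCrawlerHost reports whether host contains any known crawler fragment.
// The TLD is deliberately ignored, which is what makes the check imprecise.
func isCrawlerHost(host string) bool {
	host = strings.ToLower(host)
	for _, f := range fragments {
		if strings.Contains(host, f) {
			return true
		}
	}
	return false
}

func main() {
	for _, h := range []string{
		"crawl-66-249-87-225.googlebot.com.",
		"mail.esai.com.",
		"crawl.googlebot.evil.example.", // hypothetical hostname
	} {
		fmt.Printf("%s\t%t\n", h, isCrawlerHost(h))
	}
}
```

The last hostname in the example is made up, and it shows the imprecision: it passes this check even though it is not under googlebot.com. A stricter variant could compare against full suffixes such as .googlebot.com. instead.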
Make it go faster!
Use the -c flag to control concurrency (default: 50).
./goodbots -c 100 < ip-list.txt
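To give a feel for what a concurrency setting like -c controls, here is a minimal worker-pool sketch with a single goroutine collecting results. The process function is a hypothetical name and the work callback is a stand-in for the DNS verification step; the real goodbots internals may differ.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"sync"
)

// process runs work on every input with at most n goroutines at once and
// funnels all results through one channel, so only the caller's goroutine
// ever collects (and later writes) the output.
func process(inputs []string, n int, work func(string) string) []string {
	jobs := make(chan string)
	results := make(chan string, n) // buffered so workers rarely block

	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for in := range jobs {
				results <- work(in)
			}
		}()
	}

	// Feed the jobs, then close results once every worker has finished.
	go func() {
		for _, in := range inputs {
			jobs <- in
		}
		close(jobs)
		wg.Wait()
		close(results)
	}()

	var out []string
	for r := range results {
		out = append(out, r)
	}
	return out
}

func main() {
	ips := []string{"66.249.87.224", "203.208.23.146", "203.208.60.227"}
	// Stand-in for the DNS verification step; it only formats a TSV row.
	work := func(ip string) string { return ip + "\tplaceholder-hostname" }

	w := bufio.NewWriter(os.Stdout)
	defer w.Flush()
	for _, row := range process(ips, 2, work) {
		fmt.Fprintln(w, row)
	}
}
```

Because workers finish at different times, the rows come back in no particular order. Raising n helps only while the DNS resolvers keep up with the extra parallel queries.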
Other usage of goodbots
You can switch between verification and raw reverse-DNS resolution with the -mode flag:
- -mode goodbots (default): reverse + forward validation
- -mode resolve: reverse-only hostname resolution (includes errors in TSV output)
Examples:
# verify bot IPs (default mode)
./goodbots -mode goodbots < ip-list.txt
# resolve hostnames only
./goodbots -mode resolve < ip-list.txt
➜ goodbots git:(main) ✗ prips -i 50 66.100.0.0 66.200.0.0 | ./goodbots
66.100.0.50 (error) lookup 50.0.100.66.in-addr.arpa. on 192.168.1.1:53: no such host
...
66.100.1.144 (error) lookup 144.1.100.66.in-addr.arpa. on 192.168.1.1:53: no such host
66.100.0.150 WebGods
66.100.0.250 (error) lookup 250.0.100.66.in-addr.arpa. on 192.168.1.1:53: no such host
...
66.100.4.76 (error) lookup 76.4.100.66.in-addr.arpa. on 192.168.1.1:53: no such host
66.100.4.126 mail.esai.com
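For readers curious how flags such as -c and -mode are typically handled in Go, the sketch below parses them with the standard flag package. The flag names and defaults come from this README; the config type, the parseArgs function, and the validation rules are assumptions made for the example.

```go
package main

import (
	"flag"
	"fmt"
	"io"
	"os"
)

// config holds the parsed command-line options (hypothetical type).
type config struct {
	concurrency int
	mode        string
}

// parseArgs parses -c and -mode and rejects values this sketch cannot handle.
func parseArgs(args []string) (config, error) {
	var cfg config
	fs := flag.NewFlagSet("goodbots", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep usage text quiet; errors are returned
	fs.IntVar(&cfg.concurrency, "c", 50, "number of concurrent lookups")
	fs.StringVar(&cfg.mode, "mode", "goodbots", "goodbots or resolve")
	if err := fs.Parse(args); err != nil {
		return cfg, err
	}
	if cfg.mode != "goodbots" && cfg.mode != "resolve" {
		return cfg, fmt.Errorf("unknown mode %q", cfg.mode)
	}
	if cfg.concurrency < 1 {
		return cfg, fmt.Errorf("concurrency must be at least 1, got %d", cfg.concurrency)
	}
	return cfg, nil
}

func main() {
	cfg, err := parseArgs(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Printf("mode=%s concurrency=%d\n", cfg.mode, cfg.concurrency)
}
```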
Other Resources
- Google Documentation on Verifying Googlebot
- Google published IP ranges for Google API + services h/t Michael Stapelberg
- DuckDuckGo published IPs
- Facebook published IP ranges
- Pinterest: Verify pinterestbot
- Apple: Verify applebot
- Internet Archive: Verify archive.org_bot
- Baidu: Verify baiduspider
- Yandex: Verify Yandex crawlers
- Cốc Cốc: Verify coccocbot
- Yahoo: Verify slurp
Written in Golang
Gopher courtesy of Gopherize.me
