
TorCrawl.py

Crawl and extract (regular or onion) webpages through the TOR network


<!-- Title: TorCrawl.py Description: A Python script designed for anonymous web scraping via the Tor network. Author: MikeMeliz --> <div align="center"> <img width="50%" alt="TorCrawl.py Logo" src=".github/torcrawl.svg">

TorCrawl.py is a Python script designed for anonymous web scraping via the Tor network.

<p>It combines ease of use with the robust privacy features of Tor, allowing for secure and untraceable data collection. Ideal for both novice and experienced programmers, this tool is essential for responsible data gathering in the digital age.</p>

[![Release][release-version-shield]][releases-link] [![Last Commit][last-commit-shield]][commit-link] ![Python][python-version-shield] [![Quality Gate Status][quality-gate-shield]][quality-gate-link] [![license][license-shield]][license-link]

</div>

What makes it simple and easy to use?

If you are a terminal maniac, you know that things have to be simple and clear. Passing the output to other tools is essential, and accuracy is key.

With a single argument, you can read an .onion webpage or a regular one through the TOR network, and by using pipes you can pass the output to any other tool you prefer.

$ torcrawl -u http://www.github.com/ | grep 'google-analytics'
    <meta name="google-analytics" content="UA-XXXXXX- ">
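The same pipe can also be driven from a script. The following is a minimal sketch, assuming `torcrawl` is installed and on the PATH as in the command above. The helper names `build_command`, `grep_lines` and `extract_and_grep` are hypothetical and not part of TorCrawl.py.

```python
import subprocess


def build_command(url: str) -> list[str]:
    """Build the argument list for the single-argument call shown above."""
    return ["torcrawl", "-u", url]


def grep_lines(text: str, needle: str) -> list[str]:
    """Return the lines of `text` that contain `needle`, like a plain grep."""
    return [line for line in text.splitlines() if needle in line]


def extract_and_grep(url: str, needle: str) -> list[str]:
    """Run torcrawl (assumed installed, with Tor running) and filter its stdout."""
    result = subprocess.run(
        build_command(url), capture_output=True, text=True, check=True
    )
    return grep_lines(result.stdout, needle)
```

Calling `extract_and_grep("http://www.github.com/", "google-analytics")` would mirror the shell pipeline, provided the Tor service is reachable.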

If you want to crawl the links of a webpage, use -c and BAM, you get all the internal links in a file. You can even use -d to set how many levels deep to crawl them, and -p to wait some seconds before the next crawl.

$ torcrawl -v -u http://www.github.com/ -c -d 2 -p 2
# TOR is ready!
# URL: http://www.github.com/
# Your IP: XXX.XXX.XXX.XXX
# Crawler started from http://www.github.com/ with 2 depth crawl and 2 second(s) delay:
# Step 1 completed with: 11 results
# Step 2 completed with: 112 results
# File created on /path/to/project/links.txt
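To make the roles of depth and pause concrete, here is a minimal sketch of a level-by-level crawl. It is not TorCrawl.py's implementation: the `crawl` function and its `get_links` callback are hypothetical, and how the real tool counts steps or de-duplicates links may differ.

```python
import time
from typing import Callable, Iterable


def crawl(start: str,
          get_links: Callable[[str], Iterable[str]],
          depth: int = 1,
          pause: float = 0.0) -> list[str]:
    """Collect links level by level, up to `depth` levels away from `start`."""
    seen = {start}
    found: list[str] = []
    frontier = [start]
    for _ in range(depth):
        next_frontier = []
        for page in frontier:
            for link in get_links(page):
                if link not in seen:
                    seen.add(link)
                    found.append(link)
                    next_frontier.append(link)
            time.sleep(pause)  # wait between page requests, similar in spirit to -p
        frontier = next_frontier
    return found
```

With `depth=1` only the links on the start page are collected. With `depth=2` the links found on those pages are collected as well.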

[!TIP]
Crawling is not illegal, but violating copyright is. It's always best to double-check a website's T&C before you start crawling it. Some websites set up a robots.txt file to tell crawlers not to visit certain pages. <br>This crawler will allow you to go around this, but we always recommend respecting robots.txt.
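One way to respect robots.txt before crawling is to check each URL against the site's rules with Python's standard `urllib.robotparser`. This is a minimal sketch and not a TorCrawl.py feature. The function name `allowed_by_robots` is hypothetical, and the rules are passed in as text so that the robots.txt file can be fetched by whatever route you prefer (for example through the same proxy as the crawl).

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if `robots_txt` permits `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = "User-agent: *\nDisallow: /private/\n"
    print(allowed_by_robots(rules, "http://example.com/private/page"))
```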

<hr>

Installation

Easy Installation:

  • from [PyPi][pypi-package]:<br> pip install torcrawl
  • with homebrew:<br> Coming soon...

Manual Installation:

  1. Clone this repository:<br> git clone https://github.com/MikeMeliz/TorCrawl.py.git
  2. Install dependencies:<br> pip install -r requirements.txt
  3. Install and Start TOR Service:
    1. Debian/Ubuntu: <br> apt-get install tor<br> service tor start
    2. Windows: Download [tor.exe][tor-download], and:<br> tor.exe --service install<br> tor.exe --service start
    3. MacOS: <br> brew install tor<br> brew services start tor
    4. For different distros, visit:<br> [TOR Setup Documentation][tor-docs]
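After starting the service, it can help to confirm that something is listening on the SOCKS5 port. The sketch below assumes the defaults listed in the Arguments table (127.0.0.1 and port 9050). The function name `port_is_open` is hypothetical, and the check only shows that the port accepts TCP connections, not that the listener is actually Tor.

```python
import socket


def port_is_open(host: str = "127.0.0.1", port: int = 9050,
                 timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print("SOCKS port reachable:", port_is_open())
```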

Arguments

| Arg | Long | Description |
|-----|------|-------------|
| General: | | |
| -h | --help | Help message |
| -v | --verbose | Show more information about the progress |
| -u | --url *.onion | URL of Webpage to crawl or extract |
| -w | --without | Without using TOR Network |
| -rua | --random-ua | Enable random user-agent rotation for requests |
| -rpr | --random-proxy | Enable random proxy rotation from res/proxies.txt (requires -w flag, one proxy per line, format: host:port) |
| -px | --proxy | IP address for SOCKS5 proxy (Default: 127.0.0.1 for using TOR) |
| -pr | --proxyport | Port for SOCKS5 proxy (Default: 9050) |
| -f | --folder | The directory which will contain the generated files |
| -V | --version | Show version and exit |
| Extract: | | |
| -e | --extract | Extract page's code to terminal or file (Default: Terminal) |
| -i | --input filename | Input file with URL(s) (separated by line) |
| -o | --output [filename] | Output page(s) to file(s) (for one page) |
| -y | --yara | Perform yara keyword search:<br>h = search entire html object,<br>t = search only text |
| Crawl: | | |
| -c | --crawl | Crawl website (Default output on website/links.txt) |
| -d | --depth | Set depth of crawler's travel (Default: 1) |
| -p | --pause | Seconds of pause between requests (Default: 0) |
| -j | --json | Export crawl findings to JSON in addition to txt outputs |
| -x | --xml | Export crawl findings to XML in addition to txt outputs |
| -DB | --database | Export crawl findings and link graph to SQLite database |
| -vis | --visualization | Generate HTML visualization from SQLite database (requires -DB) |
| -l | --log | Log file with visited URLs and their response code |
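The proxy list used by -rpr is described as one proxy per line in host:port format. A minimal sketch of validating such a file before a run could look like this. The function `parse_proxies` is hypothetical, and skipping blank lines and rejecting malformed entries are assumptions rather than documented TorCrawl.py behaviour.

```python
def parse_proxies(text: str) -> list[tuple[str, int]]:
    """Parse 'host:port' lines into (host, port) pairs, skipping blank lines."""
    proxies = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        host, sep, port = line.rpartition(":")
        if not sep or not host or not port.isdigit():
            raise ValueError(f"expected host:port, got {line!r}")
        proxies.append((host, int(port)))
    return proxies
```

Running it over the contents of res/proxies.txt surfaces a badly formatted line early instead of in the middle of a crawl.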

Usage & Examples

As Extractor:

To just extract a single webpage to terminal:

$ python torcrawl.py -u http://www.github.com
<!DOCTYPE html>
...
</html>

Extract into a file (github.htm) without the use of TOR:

$ python torcrawl.py -w -u http://www.github.com -o github.htm
## File created on /script/path/github.htm

Extract to terminal and find only the line with google-analytics:

$ python torcrawl.py -u http://www.github.com | grep 'google-analytics'
    <meta name="google-analytics" content="UA-*******-*">

Extract to file and find only the line with google-analytics using yara:

$ python torcrawl.py -v -w -u https://github.com -e -y h
...

Note: Update res/keyword.yar to search for other keywords. Use -y h for raw html searching and -y t for text search only.

Extract a set of webpages (imported from file) to terminal:

$ python torcrawl.py -i links.txt
...

As Crawler:

Crawl the links of the webpage without using TOR and show verbose output (really helpful):

$ python torcrawl.py -v -w -u http://www.github.com/ -c
## URL: http://www.github.com/
## Your IP: *.*.*.*
## Crawler Started from http://www.github.com/ with step 1 and wait 0
## Step 1 completed with: 11 results
## File created on /script/path/links.txt

Crawl the webpage with depth 2 (2 clicks) and a 5-second wait before crawling the next page:

$ python torcrawl.py -v -u http://www.github.com/ -c -d 2 -p 5
## TOR is ready!
## URL: http://www.github.com/
## Your IP: *.*.*.*
## Crawler Started from http://www.github.com with step 2 and wait 5
## Step 1 completed with: 11 results
## Step 2 completed with: 112 results
## File created on /script/path/links.txt

As Both:

You can crawl a page and also extract the webpages into a folder with a single command:

$ python torcrawl.py -v -u http://www.github.com/ -c -d 2 -p 5 -e
## TOR is ready!
## URL: http://www.github.com/
## Your IP: *.*.*.*
## Crawler Started from http://www.github.com with step 1 and wait 5
## Step 1 completed with: 11 results
## File created on /script/path/FolderName/index.htm
## File created on /script/path/FolderName/projects.html
## ...

Note: The default (and only for now) file for the crawler's links is links.txt. Also, to extract right after the crawl you have to pass the -e argument.

Following the same logic, you can pipe all these pages to grep (for example) and search for specific text:

$ python torcrawl.py -u http://www.github.com/ -c -e | grep '</html>'
</html>
</html>
...

As Both + Keyword Search:

You can crawl a page, perform a keyword search and extract the webpages that match the findings into a folder with a single command:

$ python torcrawl.py -v -u http://www.github.com/ -c -d 2 -p 5 -e -y h
## TOR is ready!
## URL: http://www.github.com/
## Your IP: *.*.*.*
## Crawler Started from http://www.github.com with step 1 and wait 5
## Step 1 completed with: 11 results
## File created on /script/path/FolderName/index.htm
## File created on /script/path/FolderName/projects.html
## ...

Note: Update res/keyword.yar to search for other keywords. Use -y h for raw html searching and -y t for text search only.
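The difference between the h and t modes can be illustrated with a small sketch. It uses plain substring matching and the standard `html.parser` module as a stand-in for YARA rules, so it is not how TorCrawl.py performs the search. The names `text_only` and `keyword_found` are hypothetical.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect only the text nodes of an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


def text_only(html: str) -> str:
    """Return the text content of `html` with tags and attributes removed."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return " ".join(c.strip() for c in collector.chunks if c.strip())


def keyword_found(html: str, keyword: str, mode: str) -> bool:
    """Mode 'h' searches the whole HTML, mode 't' searches only the text."""
    if mode == "h":
        haystack = html
    elif mode == "t":
        haystack = text_only(html)
    else:
        raise ValueError("mode must be 'h' or 't'")
    return keyword.lower() in haystack.lower()
```

A keyword that appears only inside a tag attribute, such as a meta name, matches in h mode but not in t mode.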

Demo

![TorCrawl-Demo][demo]

Contribution

Feel free to contribute to this project! Just fork it, make any changes on your fork, and open a pull request against the current branch!
