# CrazyDhtSpider

A PHP DHT crawler based on Swoole, with extremely high throughput.
## 🌐 Language Switch

- [English README](README.md) | [中文 README](README_CN.md)
This project is modified from [phpDhtSpider](https://github.com/cuijun123/phpDhtSpider).
## 📋 Project Overview

A distributed DHT network crawler built on PHP + Swoole, designed to efficiently collect and process DHT network data.
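For context, the wire format a DHT crawler speaks is BitTorrent's KRPC protocol (BEP 5): bencoded dictionaries exchanged over UDP. The sketch below (Python, illustrative only, not code from this project) shows how a `get_peers` query, the message such a crawler uses to discover peers for an infohash, is built:

```python
# Minimal bencoder plus a KRPC "get_peers" query, per BEP 5.
# Illustrative sketch only; this is not code from crazyDhtSpider.
import os

def bencode(value) -> bytes:
    """Encode ints, bytes/str, lists, and dicts in bencode format."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, str):
        return bencode(value.encode())
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        # Bencoded dicts must have keys in sorted order.
        items = sorted((bencode(k), bencode(v)) for k, v in value.items())
        return b"d" + b"".join(k + v for k, v in items) + b"e"
    raise TypeError("cannot bencode %r" % type(value))

node_id = os.urandom(20)    # this crawler node's random 160-bit ID
info_hash = os.urandom(20)  # the torrent infohash being looked up
query = {
    "t": b"aa",             # transaction ID, echoed back in the response
    "y": b"q",              # message type: query
    "q": b"get_peers",      # query name
    "a": {"id": node_id, "info_hash": info_hash},
}
packet = bencode(query)     # send this over a UDP socket to a DHT node
```

Responses come back in the same bencoded form, carrying either peers for the infohash or closer DHT nodes to query next, which is how the crawler's node table grows.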
## 🚀 Quick Start

### Environment Requirements

- The server's open-file limit has been raised: `ulimit -n 65535`
- The firewall has opened the required ports
- The Swoole executable has been placed in the project root directory
### Installation Steps

1. Clone the repository

   ```bash
   git clone https://github.com/ixiaofeng/crazyDhtSpider.git
   cd crazyDhtSpider
   ```

2. Download the Swoole executable

   - Visit https://www.swoole.com/
   - Download the Swoole executable for your platform
   - Place it in the project root, at the same level as `dht_client` and `dht_server`

3. Configure the project

   - Edit `dht_client/config.php` and `dht_server/config.php` as needed
   - Ensure the database connection settings are correct
### Run the Crawler

#### dht_client (crawler client)

1. Set server limits

   ```bash
   ulimit -n 65535
   ```

2. Open firewall ports

   ```bash
   # Ubuntu/Debian example
   ufw allow 6882/udp

   # CentOS/RHEL example
   firewall-cmd --permanent --add-port=6882/udp
   firewall-cmd --reload
   ```

3. Start the client

   ```bash
   ./swoole-cli dht_client/client.php
   ```

4. Stop the client

   ```bash
   # Find the process
   ps aux | grep php_dht_client_master

   # Terminate the process (use the PID found above)
   kill -2 <process_id>
   ```
#### dht_server (data receiving server)

1. Set server limits

   ```bash
   ulimit -n 65535
   ```

2. Open firewall ports (if the server and client run on different machines)

   ```bash
   # Ubuntu/Debian example
   ufw allow 2345/udp

   # CentOS/RHEL example
   firewall-cmd --permanent --add-port=2345/udp
   firewall-cmd --reload
   ```

3. Start the server and client

   ```bash
   # Start the server
   ./swoole-cli dht_server/server.php

   # Start the client (in another terminal or in the background)
   ./swoole-cli dht_client/client.php
   ```

4. Stop the server

   ```bash
   # Find the process
   ps aux | grep php_dht_server_master

   # Terminate the process (use the PID found above)
   kill -2 <process_id>
   ```
## 📁 Project Structure

```
crazyDhtSpider/
├── dht_client/           # Crawler client directory
│   ├── client.php        # Client main script
│   ├── config.php        # Client configuration
│   └── inc/              # Client include files
├── dht_server/           # Data server directory
│   ├── server.php        # Server main script
│   ├── config.php        # Server configuration
│   └── inc/              # Server include files
├── import_infohash.php   # Script to import infohashes into Redis
├── README.md             # English documentation
└── README_CN.md          # Chinese documentation
```
## ⚙️ Configuration Guide

### dht_client/config.php

#### Download Mode Configuration

```php
// Download mode configuration
'enable_remote_download' => false, // Enable remote download forwarding
'enable_local_download'  => true,  // Enable local download
'only_remote_requests'   => false, // Only handle download requests from other servers
```

Download mode combinations (`enable_remote_download`, `enable_local_download`, `only_remote_requests`):

- **Default full mode** (`false, true, false`): download locally, run the DHT crawler, and handle both local and remote download requests
- **Remote forwarding mode** (`true, false, false`): forward all download tasks to a remote server; locally run only the DHT crawler
- **Dual download mode** (`true, true, false`): prefer remote download forwarding, with local download as a fallback
- **Pure crawler mode** (`false, false, false`): run only the DHT crawler and handle no download requests
- **Dedicated download server mode** (`false, true, true`): only handle download requests from other servers; do not run the DHT crawler
- **Restricted mode** (`any, any, true`): whenever `only_remote_requests` is `true`, download forwarding is automatically disabled and only remote download requests are handled
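The way the three flags combine can be summarized as a small decision function. This is an illustrative sketch of the documented rules, written in Python for brevity, not the project's actual PHP logic:

```python
# Sketch of how the three flags in dht_client/config.php map to the
# documented download modes. Not code from crazyDhtSpider itself.
def download_mode(enable_remote_download: bool,
                  enable_local_download: bool,
                  only_remote_requests: bool) -> str:
    if only_remote_requests:
        # only_remote_requests=True wins over everything else: forwarding
        # is disabled and only remote download requests are handled.
        return "dedicated download server"
    if enable_remote_download and enable_local_download:
        return "dual download"
    if enable_remote_download:
        return "remote forwarding"
    if enable_local_download:
        return "default full"
    return "pure crawler"
```

Note that `only_remote_requests` overrides the other two flags, which is why the "restricted mode" row above lists them as `any, any`.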
### dht_server/config.php

The main settings cover the Swoole server configuration and the database connection information; adjust them to match your environment.
## 📊 Performance Optimization Suggestions

1. **Server requirements**
   - A VPS with sufficient bandwidth (unmetered traffic recommended)
   - At least 1 GB of memory to handle moderate traffic
   - SSD storage for better database performance

2. **Database optimization**
   - Partition tables as the data volume grows
   - Add appropriate indexes on frequently queried fields
   - Consider read/write splitting for high-traffic scenarios

3. **Scaling suggestions**
   - Deploy multiple client instances on different servers
   - Use load balancing for the server components
   - Monitor system resources and adjust as needed
## 🚨 Common Issues

1. **Cannot collect data**
   - Ensure the firewall has opened UDP port 6882
   - Check the open-file limit with `ulimit -n`

2. **Large error logs**
   - Error logs are normal and do not affect functionality
   - Use a scheduled task to clean up large log files

3. **Initial data collection is slow**
   - This is normal while the crawler is building its node database
   - Performance will gradually improve as more nodes are discovered
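The scheduled log cleanup mentioned above can be as simple as a periodic job that truncates a log once it grows past a size limit. A hypothetical helper (the path and limit below are examples, not values the project defines):

```python
# Hypothetical log-cleanup helper for a scheduled task (e.g. cron).
# The default limit and any log path you pass are examples only.
import os

def truncate_if_large(path: str, max_bytes: int = 100 * 1024 * 1024) -> bool:
    """Empty the file when it exceeds max_bytes; return True if truncated."""
    try:
        if os.path.getsize(path) > max_bytes:
            with open(path, "w"):  # opening in "w" mode empties the file
                pass
            return True
    except FileNotFoundError:
        pass  # nothing to clean up yet
    return False
```

Run it from cron (or any scheduler) against the crawler's log files; truncating in place avoids breaking a process that still holds the file open, unlike deleting the file.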
## 📝 Notes

- Error logs will be generated during operation; this is normal and does not affect functionality.
- For production deployments, it is recommended to enable background daemon mode (`daemonize => true`).
- Monitor database performance and implement partitioning if necessary.
- This tool is for learning and research purposes only. The author is not responsible for any disputes or legal issues arising from its use.
## 🤝 Contribution

Contributions are welcome! Please feel free to submit pull requests.

## 📄 License

This project is open source under the MIT License.
