ProxyPool
A proxy pool based on Scrapy and Redis
This tool is under development and the README may be outdated.
A Python implementation of a proxy pool.
ProxyPool builds a proxy pool with Scrapy and Redis. It automatically adds newly available proxies to the pool and maintains the pool by deleting unusable ones.
This tool currently gets available proxies from 4 sources; more sources will be added in the future.
Compatibility
This tool has been tested on macOS Sierra 10.12.4 and Ubuntu 16.04 LTS successfully.
System Requirements:
- UNIX-like systems (macOS, Ubuntu, etc.)
Fundamental Requirements:
- Redis 3.2.8
- Python 3.0+
Python package requirements:
- Scrapy 1.3.3
- redis 2.10.5
- Flask 0.12
I have not tested other versions of the above packages, but they should work fine for most users.
Features
- Automatically add new available proxies
- Automatically delete unusable proxies
- Add support for new sites by writing a crawl rule rather than code, which improves scalability
How-to
This tool requires Redis; please make sure the Redis service (port 6379) is running.
To start the tool, simply:
$ ./start.sh
It will start the crawling service, pool maintenance service, maintenance schedule service, rule maintenance service, and the web console.
To monitor the tool, go to:
To stop the tool, simply:
$ sudo ./stop.sh
To add support for crawling more sites for proxies, this tool provides a generic crawling structure that should work for most free-proxy sites:
- Start the tool
- Open Web console(default port:5000)
- Switch to Rule management page
- Click New rule button
- Finish the form and submit
- `rule_name` will be used to distinguish different rules.
- `url_fmt` will be used to generate the pages to crawl; free-proxy sites often use a URL scheme like `xxx.com/yy/5`.
- `row_xpath` will be used to extract a data row from the page content.
- `host_xpath` will be used to extract the proxy IP from a data row extracted earlier.
- `port_xpath` will be used to extract the proxy port.
- `addr_xpath` will be used to extract the proxy address.
- `mode_xpath` will be used to extract the proxy mode.
- `proto_xpath` will be used to extract the proxy protocol.
- `vt_xpath` will be used to extract the proxy validation time.
- `max_page` will be used to control the number of pages crawled.
- The above xpaths can be set to `null` to get a default `unknown` value.
- Once the form is submitted the rule will be applied automatically and start a new crawling process.
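To make the form fields concrete, here is what a rule might look like for a hypothetical free-proxy site (the domain, XPaths, and values below are made up for illustration, not a real provider):

```python
# Hypothetical values for the "New rule" form; the site and XPaths
# are illustrative only.
rule = {
    'rule_name': 'example_proxy_site',
    'url_fmt': 'http://example.com/free/{}',  # {} is replaced by the page number
    'row_xpath': '//table[@id="proxies"]//tr',
    'host_xpath': 'td[1]/text()',
    'port_xpath': 'td[2]/text()',
    'addr_xpath': 'td[3]/text()',
    'mode_xpath': None,   # null -> the field gets the default "unknown" value
    'proto_xpath': 'td[4]/text()',
    'vt_xpath': None,
    'max_page': 5,
}

# url_fmt and max_page together determine the pages to crawl:
pages = [rule['url_fmt'].format(n) for n in range(1, rule['max_page'] + 1)]
```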
Data in Redis
All proxy information is stored in Redis.
Rule(hset)
key | description
:---|:---
name | ..
url_fmt | format: http://www.kuaidaili.com/free/intr/{}
row_xpath | format: //div[@id="list"]/table//tr
host_xpath | format: td[1]/text()
port_xpath | ..
addr_xpath | ..
mode_xpath | ..
proto_xpath | ..
vt_xpath | ..
max_page | an int
proxy_info(hset)
key | description
:---|:---
proxy | full proxy address, format: 127.0.0.1:80
ip | proxy ip, format: 127.0.0.1
port | proxy port, format: 80
addr | where the proxy is located
mode | anonymous or not
protocol | HTTP or HTTPS
validation_time | source website checking time
failed_times | recent failure count
latency | proxy latency to the source website
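A sketch of how one proxy_info entry fits together (the values are illustrative, not from a real proxy); note that the `proxy` field is simply `ip:port`:

```python
# Illustrative proxy_info entry; all values are made up.
info = {
    'ip': '127.0.0.1',
    'port': '80',
    'addr': 'local',
    'mode': 'anonymous',
    'protocol': 'HTTP',
    'validation_time': '2017-04-01 12:00',
    'failed_times': '0',
    'latency': '0.35',
}
# The full proxy address is derived from ip and port.
info['proxy'] = '{}:{}'.format(info['ip'], info['port'])
# With a live server this would be stored as a Redis hash, e.g.:
# redis.Redis().hmset('proxy_info:' + info['proxy'], info)
```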
rookies_proxies(set)
New proxies that have not been tested yet are stored here. A new proxy is moved to available_proxies after a successful test, or deleted once the maximum number of retries is reached.
available_proxies(set)
Available proxies are stored here; each proxy is periodically re-tested to check that it is still available.
availables_checking(zset)
Test queue for available proxies; each proxy's score is a timestamp that indicates its priority.
rookies_checking(zset)
New proxies test queue, similar to availables_checking.
Jobs(list)
FIFO queue, format: cmd|rule_name. It tells the rule maintenance service how to handle a rule-specific spider's actions such as start, pause, stop, and delete.
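A minimal sketch of producing and parsing these payloads (the helper names are mine, not part of the tool):

```python
def make_job(cmd, rule_name):
    """Encode a job for the Jobs list, e.g. 'start|example_rule'."""
    return '{}|{}'.format(cmd, rule_name)


def parse_job(payload):
    """Split a Jobs payload back into (cmd, rule_name)."""
    cmd, rule_name = payload.split('|', 1)
    return cmd, rule_name


# With a Redis client, the producer would LPUSH and the service RPOP:
# conn.lpush('Jobs', make_job('start', 'example_rule'))
job = make_job('pause', 'example_rule')
```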
How it works
Getting new proxies
- Crawl pages
- Extract `ProxyItem` from the content
- Use a pipeline to store `ProxyItem` in Redis
Proxy maintain
New proxies:
- Iterate over each new proxy
    - Available: move to available_proxies
    - Unavailable: delete the proxy
Proxies in pool:
- Iterate over each proxy
    - Available: reset the retry count and wait for the next test
    - Unavailable
        - Maximum retry count not reached: wait for the next test
        - Maximum retry count reached: delete the proxy
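The maintenance rules above can be sketched as a single decision function (a sketch with made-up names, not the tool's actual API):

```python
def next_action(in_pool, test_passed, failed_times, max_retries=3):
    """Decide what the maintainer does with a proxy after one test.

    in_pool=False means a rookie proxy; max_retries is an assumed default.
    """
    if test_passed:
        # A passing rookie is promoted; a pool proxy has its retries reset.
        return 'move_to_available' if not in_pool else 'reset_and_wait'
    if failed_times + 1 >= max_retries:
        return 'delete'   # maximum retry count reached
    return 'wait'         # wait for the next test
```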
Rule maintain
- Listen to the FIFO queue Jobs in Redis
- Fetch action_type and rule_name
    - pause
        - Pause the engine of the crawler that uses the rule rule_name and set the rule status to `paused`
    - stop
        - If a working crawler is using the rule
            - Stop the engine gracefully
            - Set the rule status to `waiting`
            - Add a callback to set the status to `stopped` when the engine has stopped
        - If no crawler is using the rule
            - Set the rule status to `stopped` immediately
    - start
        - If a working crawler is using the rule, the status is not `waiting`, and the engine is paused
            - Unpause the engine and set the rule status to `started`
        - If no crawler is using the rule
            - Load the rule info from Redis and instantiate a new rule object
            - Instantiate a new crawler with the rule
            - Add a callback to set the status to `finished` when the crawler finishes
            - Set the rule status to `started`
    - reload
        - If a working crawler is using the rule and the status is not `waiting`
            - Re-assign the rule to the crawler
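The branching above can be summarized as a small dispatcher (a sketch; the function name and transition labels are mine, not the tool's API):

```python
def handle_job(cmd, rule_in_use, status, engine_paused=False):
    """Return the list of transitions for one Jobs command."""
    if cmd == 'pause' and rule_in_use:
        return ['pause_engine', 'status:paused']
    if cmd == 'stop':
        if rule_in_use:
            # Graceful stop: wait for the engine, then mark stopped.
            return ['stop_engine', 'status:waiting', 'callback:stopped']
        return ['status:stopped']
    if cmd == 'start':
        if rule_in_use and status != 'waiting' and engine_paused:
            return ['unpause_engine', 'status:started']
        if not rule_in_use:
            return ['load_rule', 'new_crawler', 'callback:finished',
                    'status:started']
    if cmd == 'reload' and rule_in_use and status != 'waiting':
        return ['reassign_rule']
    return []
```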
Schedule proxies checking
- Iterate over proxies in the different statuses (rookie, available, lost)
- Fetch the zrank from Redis
    - If zrank is None, which means there is no checking schedule for the proxy
        - Add a new checking schedule
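A minimal sketch of this scheduling step, using a plain dict to stand in for the Redis zset (a ZRANK of None maps to a missing key; the delay value is an assumption):

```python
import time


def schedule_check(zset, proxy, delay=60):
    """Add a checking schedule for `proxy` if it has none yet.

    `zset` is a dict mapping proxy -> score (timestamp), standing in
    for a Redis sorted set.
    """
    if proxy not in zset:                  # ZRANK would return None in Redis
        zset[proxy] = time.time() + delay  # ZADD with a future timestamp
        return True
    return False


queue = {}
schedule_check(queue, '127.0.0.1:80')
```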
Retrieve an available proxy for other applications
To retrieve a currently available proxy, just get one from available_proxies with any Redis client.
A Scrapy middleware example:

```python
import redis
from random import choice


class RandomProxyMiddleware:
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        s.conn = redis.Redis(decode_responses=True)
        return s

    def process_request(self, request, spider):
        # Only pick proxies that carry a scheme, e.g. "http://1.2.3.4:80",
        # to avoid looping forever when none match.
        proxies = [p for p in self.conn.smembers('available_proxies')
                   if p.startswith('http')]
        if proxies:
            request.meta['proxy'] = choice(proxies)
```
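To enable such a middleware in a Scrapy project, register it in settings.py; the module path below is an assumption about where you saved the class, and 550 is an arbitrary priority:

```python
# settings.py -- 'myproject.middlewares' is a hypothetical module path.
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RandomProxyMiddleware': 550,
}
```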
JSON API (default port: 5000):
