ProxyPool
A proxy pool based on Scrapy and Redis
This tool is under development and the README may be outdated.
A Python implementation of a proxy pool.
ProxyPool builds a proxy pool with Scrapy and Redis. It automatically adds newly available proxies to the pool and maintains the pool by deleting unusable ones.
This tool currently gets available proxies from 4 sources; more sources will be added in the future.
Compatibility
This tool has been tested on macOS Sierra 10.12.4 and Ubuntu 16.04 LTS successfully.
System Requirements:
- UNIX-like systems (macOS, Ubuntu, etc.)
Fundamental Requirements:
- Redis 3.2.8
- Python 3.0+
Python package requirements:
- Scrapy 1.3.3
- redis 2.10.5
- Flask 0.12
I have not tested other versions of the above packages, but they should work fine for most users.
Features
- Automatically add new available proxies
- Automatically delete unusable proxies
- Add support for new sites by writing a crawl rule rather than code, which improves scalability
How-to
This tool requires Redis; please make sure the Redis service (port 6379) is running.
To start the tool, simply:
$ ./start.sh
It will start the crawling service, pool maintenance service, maintenance schedule service, rule maintenance service, and the web console.
To monitor the tool, go to:
To stop the tool, simply:
$ sudo ./stop.sh
To add support for crawling more sites for proxies, this tool provides a generic crawling structure that should work for most free-proxy sites:
- Start the tool
- Open Web console(default port:5000)
- Switch to Rule management page
- Click New rule button
- Finish the form and submit
- `rule_name` will be used to distinguish different rules.
- `url_fmt` will be used to generate the pages to crawl; free-proxy sites often use a URL scheme like `xxx.com/yy/5`.
- `row_xpath` will be used to extract a data row from the page content.
- `host_xpath` will be used to extract the proxy IP from a data row extracted earlier.
- `port_xpath` will be used to extract the proxy port.
- `addr_xpath` will be used to extract the proxy address.
- `mode_xpath` will be used to extract the proxy mode.
- `proto_xpath` will be used to extract the proxy protocol.
- `vt_xpath` will be used to extract the proxy validation time.
- `max_page` will be used to control the number of pages crawled.
- The above xpaths can be set to `null` to get a default `unknown` value.
- Once the form is submitted the rule will be applied automatically and start a new crawling process.
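To make the form fields concrete, here is what a rule might look like for a hypothetical free-proxy site (the domain, XPaths, and values below are made up for illustration, not a real provider):

```python
# Hypothetical values for the "New rule" form; the site and XPaths
# are illustrative only.
rule = {
    'rule_name': 'example_proxy_site',
    'url_fmt': 'http://example.com/free/{}',  # {} is replaced by the page number
    'row_xpath': '//table[@id="proxies"]//tr',
    'host_xpath': 'td[1]/text()',
    'port_xpath': 'td[2]/text()',
    'addr_xpath': 'td[3]/text()',
    'mode_xpath': None,   # null -> the field gets the default "unknown" value
    'proto_xpath': 'td[4]/text()',
    'vt_xpath': None,
    'max_page': 5,
}

# url_fmt and max_page together determine the pages to crawl:
pages = [rule['url_fmt'].format(n) for n in range(1, rule['max_page'] + 1)]
```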
Data in Redis
All proxy information is stored in Redis.
Rule(hset)
key | description
:---|:---
name | ..
url_fmt | format: http://www.kuaidaili.com/free/intr/{}
row_xpath | format: //div[@id="list"]/table//tr
host_xpath | format: td[1]/text()
port_xpath | ..
addr_xpath | ..
mode_xpath | ..
proto_xpath | ..
vt_xpath | ..
max_page | an int
proxy_info(hset)
key | description
:---|:---
proxy | full proxy address, format: 127.0.0.1:80
ip | proxy ip, format: 127.0.0.1
port | proxy port, format: 80
addr | where the proxy is located
mode | anonymous or not
protocol | HTTP or HTTPS
validation_time | source website checking time
failed_times | recent failure count
latency | proxy latency to the source website
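A sketch of how one proxy_info entry fits together (the values are illustrative, not from a real proxy); note that the `proxy` field is simply `ip:port`:

```python
# Illustrative proxy_info entry; all values are made up.
info = {
    'ip': '127.0.0.1',
    'port': '80',
    'addr': 'local',
    'mode': 'anonymous',
    'protocol': 'HTTP',
    'validation_time': '2017-04-01 12:00',
    'failed_times': '0',
    'latency': '0.35',
}
# The full proxy address is derived from ip and port.
info['proxy'] = '{}:{}'.format(info['ip'], info['port'])
# With a live server this would be stored as a Redis hash, e.g.:
# redis.Redis().hmset('proxy_info:' + info['proxy'], info)
```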
rookies_proxies(set)
New proxies that have not been tested yet are stored here. A new proxy is moved to available_proxies after a successful test, or deleted once the maximum number of retries is reached.
available_proxies(set)
Available proxies are stored here; each proxy is periodically re-tested to check that it is still available.
availables_checking(zset)
Test queue for available proxies; each proxy's score is a timestamp that indicates its priority.
rookies_checking(zset)
New proxies test queue, similar to availables_checking.
Jobs(list)
FIFO queue, format: cmd|rule_name. It tells the rule maintenance service how to handle a rule-specific spider's actions such as start, pause, stop, and delete.
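A minimal sketch of producing and parsing these payloads (the helper names are mine, not part of the tool):

```python
def make_job(cmd, rule_name):
    """Encode a job for the Jobs list, e.g. 'start|example_rule'."""
    return '{}|{}'.format(cmd, rule_name)


def parse_job(payload):
    """Split a Jobs payload back into (cmd, rule_name)."""
    cmd, rule_name = payload.split('|', 1)
    return cmd, rule_name


# With a Redis client, the producer would LPUSH and the service RPOP:
# conn.lpush('Jobs', make_job('start', 'example_rule'))
job = make_job('pause', 'example_rule')
```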
How it works
Getting new proxies
- Crawl pages
- Extract `ProxyItem` from the content
- Use a pipeline to store `ProxyItem` in Redis
Proxy maintain
New proxies:
- Iterate over each new proxy
    - Available: move to available_proxies
    - Unavailable: delete the proxy
Proxies in pool:
- Iterate over each proxy
    - Available: reset the retry count and wait for the next test
    - Unavailable
        - Maximum retry count not reached: wait for the next test
        - Maximum retry count reached: delete the proxy
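The maintenance rules above can be sketched as a single decision function (a sketch with made-up names, not the tool's actual API):

```python
def next_action(in_pool, test_passed, failed_times, max_retries=3):
    """Decide what the maintainer does with a proxy after one test.

    in_pool=False means a rookie proxy; max_retries is an assumed default.
    """
    if test_passed:
        # A passing rookie is promoted; a pool proxy has its retries reset.
        return 'move_to_available' if not in_pool else 'reset_and_wait'
    if failed_times + 1 >= max_retries:
        return 'delete'   # maximum retry count reached
    return 'wait'         # wait for the next test
```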
Rule maintain
- Listen to the FIFO queue Jobs in Redis
- Fetch action_type and rule_name
    - pause
        - Pause the engine of the crawler that uses the rule rule_name and set the rule status to `paused`
    - stop
        - If a working crawler is using the rule
            - Stop the engine gracefully
            - Set the rule status to `waiting`
            - Add a callback to set the status to `stopped` when the engine has stopped
        - If no crawler is using the rule
            - Set the rule status to `stopped` immediately
    - start
        - If a working crawler is using the rule, the status is not `waiting`, and the engine is paused
            - Unpause the engine and set the rule status to `started`
        - If no crawler is using the rule
            - Load the rule info from Redis and instantiate a new rule object
            - Instantiate a new crawler with the rule
            - Add a callback to set the status to `finished` when the crawler finishes
            - Set the rule status to `started`
    - reload
        - If a working crawler is using the rule and the status is not `waiting`
            - Re-assign the rule to the crawler
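The branching above can be summarized as a small dispatcher (a sketch; the function name and transition labels are mine, not the tool's API):

```python
def handle_job(cmd, rule_in_use, status, engine_paused=False):
    """Return the list of transitions for one Jobs command."""
    if cmd == 'pause' and rule_in_use:
        return ['pause_engine', 'status:paused']
    if cmd == 'stop':
        if rule_in_use:
            # Graceful stop: wait for the engine, then mark stopped.
            return ['stop_engine', 'status:waiting', 'callback:stopped']
        return ['status:stopped']
    if cmd == 'start':
        if rule_in_use and status != 'waiting' and engine_paused:
            return ['unpause_engine', 'status:started']
        if not rule_in_use:
            return ['load_rule', 'new_crawler', 'callback:finished',
                    'status:started']
    if cmd == 'reload' and rule_in_use and status != 'waiting':
        return ['reassign_rule']
    return []
```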
Schedule proxies checking
- Iterate over proxies in the different statuses (rookie, available, lost)
- Fetch the zrank from Redis
    - If zrank is None, which means there is no checking schedule for the proxy
        - Add a new checking schedule
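A minimal sketch of this scheduling step, using a plain dict to stand in for the Redis zset (a ZRANK of None maps to a missing key; the delay value is an assumption):

```python
import time


def schedule_check(zset, proxy, delay=60):
    """Add a checking schedule for `proxy` if it has none yet.

    `zset` is a dict mapping proxy -> score (timestamp), standing in
    for a Redis sorted set.
    """
    if proxy not in zset:                  # ZRANK would return None in Redis
        zset[proxy] = time.time() + delay  # ZADD with a future timestamp
        return True
    return False


queue = {}
schedule_check(queue, '127.0.0.1:80')
```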
Retrieve an available proxy for other applications
To retrieve a currently available proxy, just get one from available_proxies with any Redis client.
A Scrapy middleware example:

```python
import redis
from random import choice


class RandomProxyMiddleware:
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        s.conn = redis.Redis(decode_responses=True)
        return s

    def process_request(self, request, spider):
        # Only pick proxies that carry a scheme, e.g. "http://1.2.3.4:80",
        # to avoid looping forever when none match.
        proxies = [p for p in self.conn.smembers('available_proxies')
                   if p.startswith('http')]
        if proxies:
            request.meta['proxy'] = choice(proxies)
```
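To enable such a middleware in a Scrapy project, register it in settings.py; the module path below is an assumption about where you saved the class, and 550 is an arbitrary priority:

```python
# settings.py -- 'myproject.middlewares' is a hypothetical module path.
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RandomProxyMiddleware': 550,
}
```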
JSON API (default port: 5000):
