Spider

🕷 A website spider application based on a proxy pool (supports HTTP & WebSocket)


<p align="center"> <a href="https://wyydsb.xin" target="_blank" rel="noopener noreferrer"> <img width="100" src="https://cdn.nlark.com/yuque/0/2018/jpeg/104214/1540358574166-46cbbfd2-69fa-4406-aba9-784bf65efdf9.jpeg" alt="Spider logo"></a></p> <h1 align="center">Spider Man</h1>


<div align="center"><strong>Highly Available Proxy IP Pool, Highly Concurrent Request Builder, Some Applications</strong></div>

Navigation

| site                 | document           | Last Modified Time |
| -------------------- | ------------------ | ------------------ |
| some proxy sites, etc. | Proxy pool       | 20-06-01           |
| music.163.com        | Netease            | 18-10-21           |
| -                    | Press Test System  | 18-11-10           |
| news.baidu.com       | News               | 19-01-25           |
| note.youdao.com      | Youdao Note        | 20-01-04           |
| jianshu.com/csdn.net | Blog               | 20-01-04           |
| elective.pku.edu.cn  | Brush Class        | 19-10-11           |
| zimuzu.tv            | Zimuzu             | 19-04-13           |
| bilibili.com         | Bilibili           | 20-06-06           |
| exam.shaoq.com       | Shaoq              | 19-03-21           |
| data.eastmoney.com   | Eastmoney          | 19-03-29           |
| hotel.ctrip.com      | Ctrip Hotel Detail | 19-10-11           |
| douban.com           | DouBan             | 19-05-07           |
| 66ip.cn              | 66ip               | 19-05-07           |

Keywords

  • Big data storage
  • High-concurrency requests
  • WebSocket support
  • Techniques for font anti-scraping ("font cheat")
  • Techniques for compiling JavaScript
  • Some applications

Quick Start

Docker support is on the way.

$ git clone https://github.com/iofu728/spider.git
$ cd spider
$ pip install -r requirement.txt

# load proxy pool
$ python proxy/getproxy.py                             # to load proxy resources

To use the proxy pool:

# using proxied requests
from proxy.getproxy import GetFreeProxy
proxy_req = GetFreeProxy().proxy_req
proxy_req(url: str, types: int, data=None, test_func=None, header=None)

# using basic requests
from util.util import basic_req
basic_req(url: str, types: int, proxies=None, data=None, header=None, need_cookie: bool = False)

Structure

.
├── LICENSE
├── README.md
├── bilibili
│   ├── analysis.py                // data analysis
│   ├── bilibili.py                // bilibili basic
│   └── bsocket.py                 // bilibili websocket
├── blog
│   └── titleviews.py              // Zhihu && CSDN && jianshu
├── brushclass
│   └── brushclass.py              // PKU elective
├── buildmd
│   └── buildmd.py                 // Youdao Note
├── eastmoney
│   └── eastmoney.py               // font analysis
├── exam
│   ├── shaoq.js                   // jsdom
│   └── shaoq.py                   // compile js shaoq
├── log
├── netease
│   ├── netease_music_base.py
│   ├── netease_music_db.py        // Netease Music
│   └── table.sql
├── news
│   └── news.py                    // Google && Baidu
├── press
│   └── press.py                   // Press test
├── proxy
│   ├── getproxy.py                // Proxy pool
│   └── table.sql
├── requirement.txt
├── utils
│   ├── db.py
│   └── utils.py
└── zimuzu
    └── zimuzu.py                  // zimuzu

Proxy pool

The proxy pool is the heart of this project.

  • Highly Available Proxy IP Pool
    • Fetches data from free proxy websites such as Gatherproxy, Goubanjia, and xici
    • Parses Goubanjia's obfuscated port data
    • Quickly verifies IP availability
    • Cooperates with Requests to assign proxy IPs automatically, with a retry mechanism and a write-failures-to-DB mechanism
    • two models for the proxy shell
      • model 1: load the Gatherproxy list && update the proxy list file (requires access over the GFW; put your gatherproxy.com credentials in proxy/data/passage, username on one line and password on the next)
      • model 0: update the proxy pool DB && test availability
    • one common proxy API
      • from proxy.getproxy import GetFreeProxy
      • proxy_req = GetFreeProxy().proxy_req
      • proxy_req(url: str, types: int, data=None, test_func=None, header=None)
    • also one common basic request API
      • from util import basic_req
      • basic_req(url: str, types: int, proxies=None, data=None, header=None)
    • if you want to spider through the proxy pool:
      • Accessing the proxy websites requires getting over the GFW, so model 1 may be unable to download the proxy file.
      • In that case, download the proxy txt manually from http://gatherproxy.com
      • cp download_file proxy/data/gatherproxy
      • python proxy/getproxy.py --model=0
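The assign-retry-evict behaviour described above can be sketched as a tiny in-memory pool. `ProxyPool`, `get`, and `report_failure` are hypothetical names for illustration, not the repo's API:

```python
import random

class ProxyPool:
    """Minimal sketch of proxy rotation: hand out random live proxies
    and evict ones that fail too often (hypothetical, not the repo's API)."""

    def __init__(self, proxies, max_fail=3):
        self.fail_count = {p: 0 for p in proxies}
        self.max_fail = max_fail

    def get(self):
        # only proxies below the failure threshold are eligible
        alive = [p for p, n in self.fail_count.items() if n < self.max_fail]
        return random.choice(alive) if alive else None

    def report_failure(self, proxy):
        # the real project also writes failed proxies back to the DB
        self.fail_count[proxy] = self.fail_count.get(proxy, 0) + 1

pool = ProxyPool(['1.2.3.4:80', '5.6.7.8:8080'], max_fail=2)
for _ in range(2):
    pool.report_failure('1.2.3.4:80')   # evicted after two failures
assert pool.get() == '5.6.7.8:8080'
```

A retrying request wrapper would call `get()` before each attempt and `report_failure()` whenever the response fails its test function.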

Netease

Netease Music song playlist crawl - netease/netease_music_db.py

  • Problem: storing large amounts of data

  • Pipeline: classify -> playlist id -> song detail

  • V1: write to a file; single-run version; no proxy; no progress-recording mechanism

  • V1.5: a small number of proxy IPs

  • V2: proxy IP pool, progress recording, writes to MySQL

    • Optimized DB writes with LOAD DATA / REPLACE INTO
  • Netease Music Spider for DB

  • Netease Music Spider
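The REPLACE INTO optimization above means re-crawled playlists overwrite stale rows instead of duplicating them. The project writes to MySQL; the sketch below uses sqlite3 (which accepts the same REPLACE INTO syntax) only to stay self-contained, and the table layout is hypothetical:

```python
import sqlite3

# In-memory stand-in for the project's MySQL table (schema is illustrative).
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE songs (song_id INTEGER PRIMARY KEY, title TEXT, play_count INTEGER)')

# first crawl: batch insert
rows = [(1, 'song A', 10), (2, 'song B', 20)]
conn.executemany('REPLACE INTO songs VALUES (?, ?, ?)', rows)

# re-crawl: song 1's play count changed; REPLACE INTO overwrites the row
conn.executemany('REPLACE INTO songs VALUES (?, ?, ?)', [(1, 'song A', 99)])

print(conn.execute('SELECT play_count FROM songs WHERE song_id = 1').fetchone()[0])  # 99
print(conn.execute('SELECT COUNT(*) FROM songs').fetchone()[0])                      # 2
```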

Press Test System

Press Test System - press/press.py

  • Problem: high-concurrency requests
  • Uses the highly available proxy IP pool to impersonate many users.
  • Applies uneven pressure to a web service.
  • To do: uniform pressure distribution
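The fan-out pattern can be sketched with a thread pool. `fake_request` is a stand-in for a real proxied HTTP call (the project would route each call through its proxy pool); names here are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor
from collections import Counter
import random
import threading

proxies = ['ip-1', 'ip-2', 'ip-3']   # placeholder proxy identities
hits = Counter()
lock = threading.Lock()

def fake_request(_):
    # pretend to be a different user on each request by picking a random proxy
    proxy = random.choice(proxies)
    with lock:
        hits[proxy] += 1

# dispatch 100 "requests" across 8 concurrent workers
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(fake_request, range(100)))

print(sum(hits.values()))  # 100
```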

News

Google & Baidu news crawl - news/news.py

  • Fetches news from search engines through the proxy engine
  • One mode: careful DOM analysis
  • The other mode: rough analysis of Chinese words
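The "careful DOM analysis" mode can be sketched with the stdlib parser; the `<h3>`-based rule below is an illustrative assumption about the result page, not the project's actual selector:

```python
from html.parser import HTMLParser

class HeadlineParser(HTMLParser):
    """Collect text found inside <h3> tags (a common headline container)."""

    def __init__(self):
        super().__init__()
        self.in_h3 = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag == 'h3':
            self.in_h3 = True

    def handle_endtag(self, tag):
        if tag == 'h3':
            self.in_h3 = False

    def handle_data(self, data):
        if self.in_h3 and data.strip():
            self.headlines.append(data.strip())

html = '<div><h3>First headline</h3><p>snippet</p><h3>Second headline</h3></div>'
parser = HeadlineParser()
parser.feed(html)
print(parser.headlines)  # ['First headline', 'Second headline']
```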

Youdao Note

Youdao Note documents crawl - buildmd/buildmd.py

  • Loads data from Youdao Cloud Notes
  • Applies a series of rules to convert the data to .md
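A "series of rules" pipeline can be sketched as ordered (pattern, replacement) pairs; the rules shown are illustrative, not the actual ones in buildmd.py:

```python
import re

# Each rule rewrites one HTML construct into Markdown; rules run in order.
RULES = [
    (re.compile(r'<h2>(.*?)</h2>'), r'## \1'),
    (re.compile(r'<b>(.*?)</b>'), r'**\1**'),
    (re.compile(r'<a href="(.*?)">(.*?)</a>'), r'[\2](\1)'),
    (re.compile(r'</?p>'), ''),
]

def to_markdown(html: str) -> str:
    for pattern, repl in RULES:
        html = pattern.sub(repl, html)
    return html

print(to_markdown('<h2>Title</h2><p><b>bold</b> and <a href="http://x.com">link</a></p>'))
# ## Title**bold** and [link](http://x.com)
```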

blog

CSDN && Zhihu && jianshu view-count crawl - blog/titleviews.py

$ python blog/titleviews.py --model=1 >> log 2>&1   # model = 1: load the gather model
$ python blog/titleviews.py --model=0 >> log 2>&1   # model = 0: update the gather model

Brush Class

PKU Class brush - brushclass/brushclass.py

  • When your desired class has open places, it will send you an email notification.
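The notification step can be sketched by building the message with the stdlib; actually sending it (e.g. via `smtplib.SMTP_SSL(...).send_message(msg)`) is omitted, and the course name and addresses are placeholders:

```python
from email.message import EmailMessage

def build_notification(course: str, to_addr: str) -> EmailMessage:
    """Build the email that would be sent when a slot opens (sketch)."""
    msg = EmailMessage()
    msg['Subject'] = f'[brushclass] {course} has open places!'
    msg['From'] = 'spider@example.com'
    msg['To'] = to_addr
    msg.set_content(f'A place just opened in {course}. Go elect it now.')
    return msg

msg = build_notification('Operating Systems', 'me@example.com')
print(msg['Subject'])  # [brushclass] Operating Systems has open places!
```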

zimuzu

ZiMuZu download list crawl - zimuzu/zimuzu.py

  • Use it when you want to download many episodes of a show, such as Season 22 or Season 21.
  • Clicking them one by one is tedious, so zimuzu.py is all you need.
  • All you have to do is wait for the program to finish.
  • Then copy the Thunder URLs it collects to download the episodes.
  • Winter is coming; you may need it to rewatch Game of Thrones.

Bilibili

Get av data over HTTP - bilibili/bilibili.py

  • homepage rank -> check tids -> fetch data every 2 min (while on the rank, plus one day)
  • monitor every ranked av -> star count & basic data

Get av data over WebSocket - bilibili/bsocket.py

  • based on WebSocket
  • byte-level protocol analysis
  • heartbeat keep-alive
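The byte-analysis step can be sketched as packing and unpacking the frame header. The 16-byte big-endian layout (total length, header length, version, operation, sequence) follows common write-ups of Bilibili's live WebSocket protocol; verify it against bsocket.py before relying on it:

```python
import json
import struct

# assumed header layout: total length, header length, version, operation, sequence
HEADER = struct.Struct('>IHHII')
OP_HEARTBEAT = 2  # operation code commonly documented for heartbeat frames

def pack_frame(op: int, body: dict, seq: int = 1) -> bytes:
    payload = json.dumps(body).encode('utf-8')
    return HEADER.pack(HEADER.size + len(payload), HEADER.size, 1, op, seq) + payload

def unpack_frame(frame: bytes):
    total, hlen, ver, op, seq = HEADER.unpack_from(frame)
    return op, json.loads(frame[hlen:total].decode('utf-8'))

frame = pack_frame(OP_HEARTBEAT, {})
op, body = unpack_frame(frame)
print(op, body)  # 2 {}
```

A heartbeat loop would send such a frame every ~30 s to keep the connection alive.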

Get comment data over HTTP - bilibili/bilibili.py

  • Loads comments from /x/v2/reply

  • UnicodeEncodeError: 'ascii' codec can't encode characters in position 7-10: ordinal not in range(128)

    • Read and write in UTF-8:
    • with codecs.open(filename, 'r/w', encoding='utf-8')
  • Some Bilibili URLs return 404, e.g. http://api.bilibili.com/x/relation/stat?jsonp=jsonp&callback=__jp11&vmid=

    basic_req automatically adds Host to the headers, but this URL cannot be requested with a 'Host' header.
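The UTF-8 fix above in a complete round trip (the file path and comment text are placeholders):

```python
import codecs
import os
import tempfile

# Non-ASCII comment text that would trigger UnicodeEncodeError under the
# default 'ascii' codec in some environments.
comment = '弹幕 danmaku ★'
path = os.path.join(tempfile.mkdtemp(), 'comments.txt')

# open both ends with an explicit utf-8 codec
with codecs.open(path, 'w', encoding='utf-8') as f:
    f.write(comment)
with codecs.open(path, 'r', encoding='utf-8') as f:
    assert f.read() == comment
```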

shaoq

Get text data by compiling JavaScript - exam/shaoq.py

  • Idea

    1. get the cookie
    2. request the image
    3. send the request after 5.5 s
    4. compile the JavaScript code -> get the CSS
    5. analyze the CSS
  • Requirements

    pip3 install PyExecJS
    yarn add jsdom   # or: npm install jsdom (PS: local, not global)
    
  • Can't get the real HTML

    • The wait time must be 5.5 s.

    • So use threading or await asyncio.gather to request the image concurrently.

    • See: Coroutines and Tasks

  • Error: Cannot find module 'jsdom'

    jsdom must be installed locally, not globally.

  • remove a subtree & edit a subtree & re.findall

    subtree.extract()
    subtree.string = new_string
    parent_tree.find_all(re.compile('...'))
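Step 5 above ("analyze the CSS") can be sketched as follows: the compiled JavaScript emits rules mapping class names to the characters actually rendered, and parsing those rules recovers the hidden text. The rule format here is illustrative, not shaoq.com's exact output:

```python
import re

# Example of CSS emitted by the compiled JS (format is an assumption).
CSS = '.c1::before { content:"3" } .c2::before { content:"." } .c3::before { content:"7" }'

def css_char_map(css: str) -> dict:
    # map each class name to the character its ::before rule renders
    return dict(re.findall(r'\.(\w+)::before\s*{\s*content:"(.*?)"\s*}', css))

chars = css_char_map(CSS)
print(''.join(chars[c] for c in ['c1', 'c2', 'c3']))  # 3.7
```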
    

Eastmoney

Font analysis - eastmoney/eastmoney.py
