Spider
A website spider application based on a proxy pool (supports HTTP & WebSocket).
Install / Use
Navigation
| site | document | Last Modified time |
| -------------------- | ------------------ | ------------------ |
| some proxy site, etc. | Proxy pool | 20-06-01 |
| music.163.com | Netease | 18-10-21 |
| - | Press Test System | 18-11-10 |
| news.baidu.com | News | 19-01-25 |
| note.youdao.com | Youdao Note | 20-01-04 |
| jianshu.com/csdn.net | blog | 20-01-04 |
| elective.pku.edu.cn | Brush Class | 19-10-11 |
| zimuzu.tv | zimuzu | 19-04-13 |
| bilibili.com | Bilibili | 20-06-06 |
| exam.shaoq.com | shaoq | 19-03-21 |
| data.eastmoney.com | Eastmoney | 19-03-29 |
| hotel.ctrip.com | Ctrip Hotel Detail | 19-10-11 |
| douban.com | DouBan | 19-05-07 |
| 66ip.cn | 66ip | 19-05-07 |
Keywords
- Big data storage
- High-concurrency requests
- WebSocket support
- Methods for defeating font-based anti-crawling
- Methods for compiling JavaScript
- Some applications
Quick Start
Docker support is on the way.

```bash
$ git clone https://github.com/iofu728/spider.git
$ cd spider
$ pip install -r requirement.txt

# load the proxy pool
$ python proxy/getproxy.py
```
To use the proxy pool:

```python
''' using proxy requests '''
from proxy.getproxy import GetFreeProxy
proxy_req = GetFreeProxy().proxy_req
proxy_req(url: str, types: int, data=None, test_func=None, header=None)

''' using basic requests '''
from util.util import basic_req
basic_req(url: str, types: int, proxies=None, data=None, header=None, need_cookie: bool = False)
```
Structure
```
.
├── LICENSE
├── README.md
├── bilibili
│   ├── analysis.py          // data analysis
│   ├── bilibili.py          // bilibili basic
│   └── bsocket.py           // bilibili websocket
├── blog
│   └── titleviews.py        // Zhihu && CSDN && jianshu
├── brushclass
│   └── brushclass.py        // PKU elective
├── buildmd
│   └── buildmd.py           // Youdao Note
├── eastmoney
│   └── eastmoney.py         // font analysis
├── exam
│   ├── shaoq.js             // jsdom
│   └── shaoq.py             // compile js shaoq
├── log
├── netease
│   ├── netease_music_base.py
│   ├── netease_music_db.py  // Netease Music
│   └── table.sql
├── news
│   └── news.py              // Google && Baidu
├── press
│   └── press.py             // Press test
├── proxy
│   ├── getproxy.py          // Proxy pool
│   └── table.sql
├── requirement.txt
├── utils
│   ├── db.py
│   └── utils.py
└── zimuzu
    └── zimuzu.py            // zimuzu
```
Proxy pool
The proxy pool is the heart of this project.
- Highly available proxy IP pool
- Obtains data from free proxy websites such as Gatherproxy, Goubanjia, and Xici
  - analyzes Goubanjia's port-obfuscation data
- Quickly verifies IP availability
- Cooperates with Requests to automatically assign proxy IPs, with a retry mechanism and a write-failures-to-DB mechanism
- Two modes for the proxy shell:
  - model 0: update the proxy pool DB && test availability
  - model 1: load the Gatherproxy list && update the proxy list file (requires getting over the GFW; put your http://gatherproxy.com username and password into `proxy/data/passage`, one line for the username, one line for the password)
- One common proxy API:

```python
from proxy.getproxy import GetFreeProxy
proxy_req = GetFreeProxy().proxy_req
proxy_req(url: str, types: int, data=None, test_func=None, header=None)
```

- And one common basic request API:

```python
from util import basic_req
basic_req(url: str, types: int, proxies=None, data=None, header=None)
```

- If you want to spider via the proxy:
  - Because accessing the proxy website requires getting over the GFW, you may not be able to use model 1 to download the proxy file; instead:
    - download the proxy txt from http://gatherproxy.com
    - `cp download_file proxy/data/gatherproxy`
    - `python proxy/getproxy.py --model=0`
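The rotate-and-retry behaviour described above can be sketched in a self-contained way. Names like `rotating_req` and `fetch` are illustrative stand-ins, not the project's actual API:

```python
import random

def rotating_req(url, proxies, fetch, max_retry=3):
    """Try a request through randomly chosen proxies, retrying on failure.

    `fetch(url, proxy)` is any callable that returns a response or raises.
    Proxies that failed are returned so the caller can mark them bad in
    the DB (the "fail to write DB" mechanism).
    """
    failed = []
    for _ in range(max_retry):
        proxy = random.choice(proxies)
        try:
            return fetch(url, proxy), failed
        except Exception:
            failed.append(proxy)  # candidate for removal from the pool
    return None, failed

# usage with a fake fetch that only succeeds through one proxy
good = "1.2.3.4:80"
def fake_fetch(url, proxy):
    if proxy != good:
        raise IOError("bad proxy")
    return "<html>ok</html>"

resp, bad = rotating_req("http://example.com", [good], fake_fetch)
```

The real pool also re-verifies proxies periodically; this sketch only shows the assign/retry/collect-failures cycle.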
Netease
Netease Music playlist crawl - `netease/netease_music_db.py`

- Problem: big data storage
  - classify -> playlist id -> song detail
- V1: write to file; one-shot run; no proxy, no progress-recording mechanism
- V1.5: a small amount of proxy IPs
- V2: proxy IP pool, progress recording, writing to MySQL
  - Optimized DB writes with `LOAD DATA` / `REPLACE INTO`
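The `REPLACE INTO` optimisation can be illustrated with the stdlib `sqlite3` module (which supports the same statement); the table name and columns here are hypothetical, not the project's `table.sql`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE song (song_id INTEGER PRIMARY KEY, plays INTEGER)")

def upsert_songs(conn, rows):
    # REPLACE INTO makes re-runs idempotent: recrawled rows overwrite
    # old ones instead of raising a duplicate-key error, which suits a
    # crawler that records progress and may resume from anywhere.
    conn.executemany("REPLACE INTO song (song_id, plays) VALUES (?, ?)", rows)
    conn.commit()

upsert_songs(conn, [(1, 10), (2, 20)])
upsert_songs(conn, [(1, 15)])  # recrawl: overwrites song 1, no error
```

`LOAD DATA` plays the same role for bulk file imports in MySQL; the idempotency argument is identical.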
Press Test System
Press test system - `press/press.py`

- Problem: high-concurrency requests
- Uses the highly available proxy IP pool to pretend to be real users
- Applies uneven pressure to a web service
- To do: uniform pressure
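The "uniform pressure" to-do amounts to pacing requests at fixed intervals instead of firing them in bursts. A minimal scheduler sketch, where `send` is a stand-in for the real request call:

```python
import time

def press_uniform(send, total, duration):
    """Spread `total` calls of `send` evenly over `duration` seconds."""
    interval = duration / total
    start = time.monotonic()
    stamps = []
    for i in range(total):
        # sleep until this request's scheduled slot
        delay = start + i * interval - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        stamps.append(time.monotonic())
        send(i)
    return stamps

hits = []
stamps = press_uniform(hits.append, total=5, duration=0.5)
```

In a real press test each `send` would run in its own thread or task so slow responses don't delay the schedule; this sketch only shows the pacing.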
News
Google & Baidu news crawl - `news/news.py`

- Gets news from search engines through the proxy engine
- One mode: careful DOM analysis; the other mode: rough analysis of Chinese words
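The "rough analysis" mode can be approximated by pulling runs of CJK characters straight out of the raw HTML, skipping DOM parsing entirely. A sketch, not the project's actual parser:

```python
import re

# runs of 2+ characters in the CJK Unified Ideographs block
CJK = re.compile(r"[\u4e00-\u9fff]{2,}")

def rough_words(html):
    """Extract candidate Chinese phrases, ignoring tags and markup."""
    return CJK.findall(html)

words = rough_words("<a href='x'>今日要闻</a><p>百度 news 新闻头条</p>")
```

This trades precision for robustness: it survives markup changes that would break a careful DOM selector, at the cost of picking up navigation text too.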
Youdao Note
Youdao Note documents crawl - `buildmd/buildmd.py`

- Loads data from youdaoyun
- Applies a series of rules to convert the data to `.md`
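A "series of rules" conversion can be modelled as an ordered list of regex substitutions; the rules below are illustrative, not buildmd's actual ones:

```python
import re

# each rule: (pattern, replacement), applied in order; order matters,
# since the tag-stripping rule must run last
RULES = [
    (re.compile(r"<h2>(.*?)</h2>"), r"## \1\n"),
    (re.compile(r"<b>(.*?)</b>"), r"**\1**"),
    (re.compile(r"<[^>]+>"), ""),  # strip any remaining tags
]

def to_md(text):
    for pattern, repl in RULES:
        text = pattern.sub(repl, text)
    return text

md = to_md("<h2>Title</h2><p>some <b>bold</b> text</p>")
```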
blog
CSDN && Zhihu && Jianshu view-count crawl - `blog/titleviews.py`

```bash
$ python blog/titleviews.py --model=1 >> log 2>&1 # model=1: load gather model
$ python blog/titleviews.py --model=0 >> log 2>&1 # model=0: update gather model
```
Brush Class
PKU class brush - `brushclass/brushclass.py`

- When your expected class has open places, it sends you an email.
zimuzu
ZiMuZu download-list crawl - `zimuzu/zimuzu.py`

- When you want to download many episodes of a show (e.g. Season 22, Season 21), clicking them one by one is boring; `zimuzu.py` is all you need.
- All you have to do is wait for the program to finish, then copy the Thunder URLs to download the episodes.
- Now that Winter is coming, you may need it to rewatch <Game of Thrones>.
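For reference, Thunder links are commonly described as `thunder://` followed by base64 of the plain URL wrapped in `AA`…`ZZ`. Treating that layout as an assumption, encoding and decoding is a few lines:

```python
import base64

def url_to_thunder(url):
    """Wrap a plain URL as a thunder:// link (assumed AA...ZZ scheme)."""
    return "thunder://" + base64.b64encode(("AA" + url + "ZZ").encode()).decode()

def thunder_to_url(thunder):
    """Invert url_to_thunder: strip the scheme, decode, drop the wrapper."""
    payload = base64.b64decode(thunder[len("thunder://"):]).decode()
    return payload[2:-2]

link = url_to_thunder("http://example.com/ep01.mp4")
```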
Bilibili
Get av data by HTTP - `bilibili/bilibili.py`

- homepage rank -> check tids -> check data every 2 min (while on the rank + one day)
- monitor every ranked av -> star num & basic data

Get av data by WebSocket - `bilibili/bsocket.py`

- based on WebSocket
- byte analysis
- heartbeat
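The "byte analysis" boils down to length-prefixed binary framing. As a hedged sketch (the exact header layout and op codes of Bilibili's live protocol are assumptions here, drawn from common community documentation, not verified against `bsocket.py`):

```python
import struct

# big-endian header: total length, header length, version, op, sequence
HEADER = struct.Struct(">IHHII")

OP_HEARTBEAT = 2  # op codes are illustrative assumptions

def pack_frame(op, body=b"", seq=1, version=1):
    """Prefix a body with the 16-byte header so frames self-delimit."""
    header = HEADER.pack(HEADER.size + len(body), HEADER.size, version, op, seq)
    return header + body

def unpack_frames(buf):
    """Split a byte stream into (op, body) frames via the length prefix."""
    frames = []
    while buf:
        total, hlen, _ver, op, _seq = HEADER.unpack_from(buf)
        frames.append((op, buf[hlen:total]))
        buf = buf[total:]
    return frames

# a heartbeat frame followed by a data frame, concatenated as on the wire
stream = pack_frame(OP_HEARTBEAT) + pack_frame(5, b'{"code":0}')
frames = unpack_frames(stream)
```

The heartbeat is just a periodic empty frame like the first one above; the server answers with a frame carrying the current viewer count.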
Get comment data by HTTP - `bilibili/bilibili.py`

- loads comments from `/x/v2/reply`
- `UnicodeEncodeError: 'ascii' codec can't encode characters in position 7-10: ordinal not in range(128)`
  - read/write in utf-8: `with codecs.open(filename, 'r/w', encoding='utf-8')`
- Some Bilibili URLs return 404, like `http://api.bilibili.com/x/relation/stat?jsonp=jsonp&callback=__jp11&vmid=`: `basic_req` automatically adds `Host` to the headers, but this URL can't be requested with a `Host` header.
shaoq
Get text data by compiling JavaScript - `exam/shaoq.py`

- Idea:
  - get the cookie
  - request the image
  - request again after 5.5 s
  - compile the JavaScript code -> get the CSS
  - analyze the CSS
- Requirements:

```bash
pip3 install PyExecJS
yarn add jsdom # npm install jsdom; PS: install locally, not globally
```

- Can't get the true HTML:
  - the wait time must be 5.5 s,
  - so you can use `threading` or `await asyncio.gather` to request the image concurrently.
- `Error: Cannot find module 'jsdom'`:
  - jsdom must be installed locally, not globally.
- Remove a subtree, edit a subtree & `re.findall`:

```python
subtree.extract()
subtree.string = new_string
parent_tree.find_all(re.compile('''))
```
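Since the mandatory 5.5 s wait dominates each request, overlapping the image requests with `await asyncio.gather` (as the note above suggests) turns N sequential waits into roughly one. A sketch with simulated requests (a short sleep stands in for the real fetch plus wait):

```python
import asyncio
import time

async def fetch_image(i):
    # stand-in for the real image request plus its mandatory wait
    await asyncio.sleep(0.2)
    return f"img-{i}"

async def main():
    # all five waits overlap, so total time is ~ one wait, not five
    return await asyncio.gather(*(fetch_image(i) for i in range(5)))

start = time.monotonic()
images = asyncio.run(main())
elapsed = time.monotonic() - start
```

`gather` preserves argument order in its result list, so the images come back in the order they were requested even though they complete concurrently.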
