Spider
A website spider application based on a proxy pool (supports HTTP & WebSocket).
Install / Use
Navigation
| site | document | Last Modified time |
| -------------------- | ------------------ | ------------------ |
| some proxy site, etc. | Proxy pool | 20-06-01 |
| music.163.com | Netease | 18-10-21 |
| - | Press Test System | 18-11-10 |
| news.baidu.com | News | 19-01-25 |
| note.youdao.com | Youdao Note | 20-01-04 |
| jianshu.com/csdn.net | blog | 20-01-04 |
| elective.pku.edu.cn | Brush Class | 19-10-11 |
| zimuzu.tv | zimuzu | 19-04-13 |
| bilibili.com | Bilibili | 20-06-06 |
| exam.shaoq.com | shaoq | 19-03-21 |
| data.eastmoney.com | Eastmoney | 19-03-29 |
| hotel.ctrip.com | Ctrip Hotel Detail | 19-10-11 |
| douban.com | DouBan | 19-05-07 |
| 66ip.cn | 66ip | 19-05-07 |
Keywords
- Big data storage
- High-concurrency requests
- WebSocket support
- Methods for defeating font-based anti-crawling
- Methods for compiling JavaScript
- Some applications
Quick Start
Docker support is on the way.

```bash
$ git clone https://github.com/iofu728/spider.git
$ cd spider
$ pip install -r requirement.txt

# load the proxy pool
$ python proxy/getproxy.py
```
To use the proxy pool:

```python
''' using proxy requests '''
from proxy.getproxy import GetFreeProxy
proxy_req = GetFreeProxy().proxy_req
proxy_req(url: str, types: int, data=None, test_func=None, header=None)

''' using basic requests '''
from util.util import basic_req
basic_req(url: str, types: int, proxies=None, data=None, header=None, need_cookie: bool = False)
```
Structure
```
.
├── LICENSE
├── README.md
├── bilibili
│   ├── analysis.py          // data analysis
│   ├── bilibili.py          // bilibili basic
│   └── bsocket.py           // bilibili websocket
├── blog
│   └── titleviews.py        // Zhihu && CSDN && jianshu
├── brushclass
│   └── brushclass.py        // PKU elective
├── buildmd
│   └── buildmd.py           // Youdao Note
├── eastmoney
│   └── eastmoney.py         // font analysis
├── exam
│   ├── shaoq.js             // jsdom
│   └── shaoq.py             // compile js shaoq
├── log
├── netease
│   ├── netease_music_base.py
│   ├── netease_music_db.py  // Netease Music
│   └── table.sql
├── news
│   └── news.py              // Google && Baidu
├── press
│   └── press.py             // Press test
├── proxy
│   ├── getproxy.py          // Proxy pool
│   └── table.sql
├── requirement.txt
├── utils
│   ├── db.py
│   └── utils.py
└── zimuzu
    └── zimuzu.py            // zimuzu
```
Proxy pool
The proxy pool is the heart of this project.
- Highly available proxy IP pool
- Obtains data from free proxy websites such as Gatherproxy, Goubanjia, and Xici
  - analyzes Goubanjia's port-obfuscation data
- Quickly verifies IP availability
- Cooperates with Requests to automatically assign proxy IPs, with a retry mechanism and a write-failures-to-DB mechanism
- Two modes for the proxy shell:
  - model 0: update the proxy pool DB && test availability
  - model 1: load the Gatherproxy list && update the proxy list file (requires getting over the GFW; put your http://gatherproxy.com username and password into `proxy/data/passage`, one line for the username, one line for the password)
- One common proxy API:

```python
from proxy.getproxy import GetFreeProxy
proxy_req = GetFreeProxy().proxy_req
proxy_req(url: str, types: int, data=None, test_func=None, header=None)
```

- And one common basic request API:

```python
from util import basic_req
basic_req(url: str, types: int, proxies=None, data=None, header=None)
```

- If you want to spider via the proxy:
  - Because accessing the proxy website requires getting over the GFW, you may not be able to use model 1 to download the proxy file; instead:
    - download the proxy txt from http://gatherproxy.com
    - `cp download_file proxy/data/gatherproxy`
    - `python proxy/getproxy.py --model=0`
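The rotate-and-retry behaviour described above can be sketched in a self-contained way. Names like `rotating_req` and `fetch` are illustrative stand-ins, not the project's actual API:

```python
import random

def rotating_req(url, proxies, fetch, max_retry=3):
    """Try a request through randomly chosen proxies, retrying on failure.

    `fetch(url, proxy)` is any callable that returns a response or raises.
    Proxies that failed are returned so the caller can mark them bad in
    the DB (the "fail to write DB" mechanism).
    """
    failed = []
    for _ in range(max_retry):
        proxy = random.choice(proxies)
        try:
            return fetch(url, proxy), failed
        except Exception:
            failed.append(proxy)  # candidate for removal from the pool
    return None, failed

# usage with a fake fetch that only succeeds through one proxy
good = "1.2.3.4:80"
def fake_fetch(url, proxy):
    if proxy != good:
        raise IOError("bad proxy")
    return "<html>ok</html>"

resp, bad = rotating_req("http://example.com", [good], fake_fetch)
```

The real pool also re-verifies proxies periodically; this sketch only shows the assign/retry/collect-failures cycle.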
Netease
Netease Music playlist crawl - `netease/netease_music_db.py`

- Problem: big data storage
  - classify -> playlist id -> song detail
- V1: write to file; one-shot run; no proxy, no progress-recording mechanism
- V1.5: a small amount of proxy IPs
- V2: proxy IP pool, progress recording, writing to MySQL
  - Optimized DB writes with `LOAD DATA` / `REPLACE INTO`
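The `REPLACE INTO` optimisation can be illustrated with the stdlib `sqlite3` module (which supports the same statement); the table name and columns here are hypothetical, not the project's `table.sql`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE song (song_id INTEGER PRIMARY KEY, plays INTEGER)")

def upsert_songs(conn, rows):
    # REPLACE INTO makes re-runs idempotent: recrawled rows overwrite
    # old ones instead of raising a duplicate-key error, which suits a
    # crawler that records progress and may resume from anywhere.
    conn.executemany("REPLACE INTO song (song_id, plays) VALUES (?, ?)", rows)
    conn.commit()

upsert_songs(conn, [(1, 10), (2, 20)])
upsert_songs(conn, [(1, 15)])  # recrawl: overwrites song 1, no error
```

`LOAD DATA` plays the same role for bulk file imports in MySQL; the idempotency argument is identical.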
Press Test System
Press test system - `press/press.py`

- Problem: high-concurrency requests
- Uses the highly available proxy IP pool to pretend to be real users
- Applies uneven pressure to a web service
- To do: uniform pressure
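The "uniform pressure" to-do amounts to pacing requests at fixed intervals instead of firing them in bursts. A minimal scheduler sketch, where `send` is a stand-in for the real request call:

```python
import time

def press_uniform(send, total, duration):
    """Spread `total` calls of `send` evenly over `duration` seconds."""
    interval = duration / total
    start = time.monotonic()
    stamps = []
    for i in range(total):
        # sleep until this request's scheduled slot
        delay = start + i * interval - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        stamps.append(time.monotonic())
        send(i)
    return stamps

hits = []
stamps = press_uniform(hits.append, total=5, duration=0.5)
```

In a real press test each `send` would run in its own thread or task so slow responses don't delay the schedule; this sketch only shows the pacing.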
News
Google & Baidu news crawl - `news/news.py`

- Gets news from search engines through the proxy engine
- One mode: careful DOM analysis; the other mode: rough analysis of Chinese words
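The "rough analysis" mode can be approximated by pulling runs of CJK characters straight out of the raw HTML, skipping DOM parsing entirely. A sketch, not the project's actual parser:

```python
import re

# runs of 2+ characters in the CJK Unified Ideographs block
CJK = re.compile(r"[\u4e00-\u9fff]{2,}")

def rough_words(html):
    """Extract candidate Chinese phrases, ignoring tags and markup."""
    return CJK.findall(html)

words = rough_words("<a href='x'>今日要闻</a><p>百度 news 新闻头条</p>")
```

This trades precision for robustness: it survives markup changes that would break a careful DOM selector, at the cost of picking up navigation text too.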
Youdao Note
Youdao Note documents crawl - `buildmd/buildmd.py`

- Loads data from youdaoyun
- Applies a series of rules to convert the data to `.md`
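A "series of rules" conversion can be modelled as an ordered list of regex substitutions; the rules below are illustrative, not buildmd's actual ones:

```python
import re

# each rule: (pattern, replacement), applied in order; order matters,
# since the tag-stripping rule must run last
RULES = [
    (re.compile(r"<h2>(.*?)</h2>"), r"## \1\n"),
    (re.compile(r"<b>(.*?)</b>"), r"**\1**"),
    (re.compile(r"<[^>]+>"), ""),  # strip any remaining tags
]

def to_md(text):
    for pattern, repl in RULES:
        text = pattern.sub(repl, text)
    return text

md = to_md("<h2>Title</h2><p>some <b>bold</b> text</p>")
```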
blog
CSDN && Zhihu && Jianshu view-count crawl - `blog/titleviews.py`

```bash
$ python blog/titleviews.py --model=1 >> log 2>&1 # model=1: load gather model
$ python blog/titleviews.py --model=0 >> log 2>&1 # model=0: update gather model
```
Brush Class
PKU class brush - `brushclass/brushclass.py`

- When your expected class has open places, it sends you an email.
zimuzu
ZiMuZu download-list crawl - `zimuzu/zimuzu.py`

- When you want to download many episodes of a show (e.g. Season 22, Season 21), clicking them one by one is boring; `zimuzu.py` is all you need.
- All you have to do is wait for the program to finish, then copy the Thunder URLs to download the episodes.
- Now that Winter is coming, you may need it to rewatch <Game of Thrones>.
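For reference, Thunder links are commonly described as `thunder://` followed by base64 of the plain URL wrapped in `AA`…`ZZ`. Treating that layout as an assumption, encoding and decoding is a few lines:

```python
import base64

def url_to_thunder(url):
    """Wrap a plain URL as a thunder:// link (assumed AA...ZZ scheme)."""
    return "thunder://" + base64.b64encode(("AA" + url + "ZZ").encode()).decode()

def thunder_to_url(thunder):
    """Invert url_to_thunder: strip the scheme, decode, drop the wrapper."""
    payload = base64.b64decode(thunder[len("thunder://"):]).decode()
    return payload[2:-2]

link = url_to_thunder("http://example.com/ep01.mp4")
```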
Bilibili
Get av data by HTTP - `bilibili/bilibili.py`

- homepage rank -> check tids -> check data every 2 min (while on the rank + one day)
- monitor every ranked av -> star num & basic data

Get av data by WebSocket - `bilibili/bsocket.py`

- based on WebSocket
- byte analysis
- heartbeat
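The "byte analysis" boils down to length-prefixed binary framing. As a hedged sketch (the exact header layout and op codes of Bilibili's live protocol are assumptions here, drawn from common community documentation, not verified against `bsocket.py`):

```python
import struct

# big-endian header: total length, header length, version, op, sequence
HEADER = struct.Struct(">IHHII")

OP_HEARTBEAT = 2  # op codes are illustrative assumptions

def pack_frame(op, body=b"", seq=1, version=1):
    """Prefix a body with the 16-byte header so frames self-delimit."""
    header = HEADER.pack(HEADER.size + len(body), HEADER.size, version, op, seq)
    return header + body

def unpack_frames(buf):
    """Split a byte stream into (op, body) frames via the length prefix."""
    frames = []
    while buf:
        total, hlen, _ver, op, _seq = HEADER.unpack_from(buf)
        frames.append((op, buf[hlen:total]))
        buf = buf[total:]
    return frames

# a heartbeat frame followed by a data frame, concatenated as on the wire
stream = pack_frame(OP_HEARTBEAT) + pack_frame(5, b'{"code":0}')
frames = unpack_frames(stream)
```

The heartbeat is just a periodic empty frame like the first one above; the server answers with a frame carrying the current viewer count.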
Get comment data by HTTP - `bilibili/bilibili.py`

- loads comments from `/x/v2/reply`
- `UnicodeEncodeError: 'ascii' codec can't encode characters in position 7-10: ordinal not in range(128)`
  - read/write in utf-8: `with codecs.open(filename, 'r/w', encoding='utf-8')`
- Some Bilibili URLs return 404, like `http://api.bilibili.com/x/relation/stat?jsonp=jsonp&callback=__jp11&vmid=`: `basic_req` automatically adds `Host` to the headers, but this URL can't be requested with a `Host` header.
shaoq
Get text data by compiling JavaScript - `exam/shaoq.py`

- Idea:
  - get the cookie
  - request the image
  - request again after 5.5 s
  - compile the JavaScript code -> get the CSS
  - analyze the CSS
- Requirements:

```bash
pip3 install PyExecJS
yarn add jsdom # npm install jsdom; PS: install locally, not globally
```

- Can't get the true HTML:
  - the wait time must be 5.5 s,
  - so you can use `threading` or `await asyncio.gather` to request the image concurrently.
- `Error: Cannot find module 'jsdom'`:
  - jsdom must be installed locally, not globally.
- Remove a subtree, edit a subtree & `re.findall`:

```python
subtree.extract()
subtree.string = new_string
parent_tree.find_all(re.compile('''))
```
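Since the mandatory 5.5 s wait dominates each request, overlapping the image requests with `await asyncio.gather` (as the note above suggests) turns N sequential waits into roughly one. A sketch with simulated requests (a short sleep stands in for the real fetch plus wait):

```python
import asyncio
import time

async def fetch_image(i):
    # stand-in for the real image request plus its mandatory wait
    await asyncio.sleep(0.2)
    return f"img-{i}"

async def main():
    # all five waits overlap, so total time is ~ one wait, not five
    return await asyncio.gather(*(fetch_image(i) for i in range(5)))

start = time.monotonic()
images = asyncio.run(main())
elapsed = time.monotonic() - start
```

`gather` preserves argument order in its result list, so the images come back in the order they were requested even though they complete concurrently.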
