ComicCrawler
An image crawler written in Python.
Comic Crawler
.. image:: https://travis-ci.org/eight04/ComicCrawler.svg?branch=master
   :target: https://travis-ci.org/eight04/ComicCrawler
Comic Crawler is a Python script for scraping images. It comes with a simple download manager, a library feature, and is easy to extend.
Download and install (Windows)
Comic Crawler is on
`PyPI <https://pypi.python.org/pypi/comiccrawler/>`__. Once Python is
installed, you can install it directly with the pip command.
Install Python
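You need Python 3.11 or later. The installer can be downloaded from the
`official website <https://www.python.org/>`__.
During installation, remember to select "Add python.exe to path" so that the pip command is available.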
Install Deno
~~~~~~~~~~~~
Comic Crawler uses Deno to analyze sites that require running JavaScript:
https://docs.deno.com/runtime/manual/getting_started/installation

On Windows 10 (1709) or later, you can install it by entering the following command in cmd:

::

    winget install deno
Install Comic Crawler
Enter the following command in cmd:

::

    pip install comiccrawler

To update:

::

    pip install comiccrawler --upgrade --upgrade-strategy eager

Finally, enter the following command in cmd to run Comic Crawler:

::

    comiccrawler gui
Supported domains
.. DOMAINS ..
163.bilibili.com 8comic.com 99.hhxxee.com ac.qq.com beta.sankakucomplex.com chan.sankakucomplex.com comic.acgn.cc comic.sfacg.com comicbus.com coomer.su copymanga.com danbooru.donmai.us deviantart.com e-hentai.org exhentai.org fanbox.cc fantia.jp gelbooru.com hk.dm5.com ikanman.com imgbox.com jpg4.su kemono.party kemono.su konachan.com linevoom.line.me m.dmzj.com m.manhuabei.com m.wuyouhui.net manga.bilibili.com manhua.dmzj.com manhuagui.com nijie.info pixabay.com raw.senmanga.com seemh.com seiga.nicovideo.jp smp.yoedge.com tel.dm5.com tsundora.com tuchong.com tumblr.com tw.weibo.com twitter.com wix.com www.177pic.info www.1manhua.net www.33am.cn www.36rm.cn www.99comic.com www.aacomic.com www.artstation.com www.buka.cn www.cartoonmad.com www.chuixue.com www.chuixue.net www.cocomanhua.com www.colamanga.com www.comicabc.com www.comicvip.com www.dm5.com www.dmzj.com www.facebook.com www.flickr.com www.gufengmh.com www.gufengmh8.com www.hhcomic.cc www.hheess.com www.hhmmoo.com www.hhssee.com www.hhxiee.com www.iibq.com www.instagram.com www.mangacopy.com www.manhuadui.com www.manhuaren.com www.mh160.com www.mhgui.com www.ohmanhua.com www.pixiv.net www.sankakucomplex.com www.setnmh.com www.tohomh.com www.tohomh123.com www.xznj120.com x.com yande.re
.. END DOMAINS
Usage
As a CLI tool:

::

    Usage:
      comiccrawler [--profile=<profile>] (
        domains |
        download <url> [--dest=<save_path>] |
        gui
      )
      comiccrawler (--help | --version)

    Commands:
      domains             List the supported domains
      download            Download the given URL
      gui                 Launch the main window

    Options:
      --profile           Folder where the profile is stored (default: "~/comiccrawler")
      --dest              Download directory (default: ".")
      --help              Show the help message
      --version           Show the version

or you can use it in your python script:
.. code:: python

    from comiccrawler.mission import Mission
    from comiccrawler.analyzer import Analyzer
    from comiccrawler.crawler import download

    # create a mission
    m = Mission(url="http://example.com")
    Analyzer(m).analyze()

    # select the episodes you want
    for ep in m.episodes:
        if ep.title != "chapter 123":
            ep.skip = True

    # download to savepath
    download(m, "path/to/save")

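When more than one episode should be kept, the selection loop can be wrapped in a small helper. The following is only a sketch: ``keep_only`` is a hypothetical function, not part of Comic Crawler, and ``SimpleNamespace`` objects stand in for real episodes so the snippet runs without the library. It assumes nothing about episodes beyond the ``title`` and ``skip`` attributes used in the example above.

.. code:: python

    from types import SimpleNamespace


    def keep_only(episodes, wanted_titles):
        """Mark every episode whose title is not wanted as skipped.

        Returns the episodes that are still selected.
        """
        wanted = set(wanted_titles)
        for ep in episodes:
            if ep.title not in wanted:
                ep.skip = True
        return [ep for ep in episodes if not ep.skip]


    # Stand-ins for the episodes of an analyzed mission.
    episodes = [
        SimpleNamespace(title=f"chapter {n}", skip=False)
        for n in (121, 122, 123)
    ]
    kept = keep_only(episodes, ["chapter 122", "chapter 123"])
    print([ep.title for ep in kept])  # ['chapter 122', 'chapter 123']

With a real mission, the same helper would presumably be called as ``keep_only(m.episodes, [...])`` between ``Analyzer(m).analyze()`` and ``download(...)``.
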
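Graphical interface

.. figure:: http://i.imgur.com/ZzF0YFx.png
   :alt: Main window

- Paste a URL into the text field, then click "Add link" or press Enter.
- If the clipboard contains a supported URL and the text field is empty, the program pastes it automatically.
- Right-click a mission to add it to the library. Missions in the library are checked for updates every time the program starts.
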
Configuration file
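.. code:: ini

    [DEFAULT]
    ; Program to run after a download finishes. {target} is replaced with the absolute path of the mission folder
    runafterdownload = 7z a "{target}.zip" "{target}"

    ; Check the library for updates on startup
    libraryautocheck = true

    ; Interval between update checks (in hours)
    autocheck_interval = 24

    ; Download destination folder. Relative paths are resolved against the folder containing the configuration file.
    savepath = download

    ; Enable grabber debugging
    errorlog = false

    ; Autosave every 5 minutes
    autosave = 5

    ; Save files under their original download filename instead of the page number
    ; Using this option is strongly discouraged, see https://github.com/eight04/ComicCrawler/issues/90
    originalfilename = false

    ; Automatically reformat the numbers in episode titles, e.g. to zero-pad them
    ; Example: 第1集 -> 第001集 (episode 1 -> episode 001)
    ; For the format syntax, see https://docs.python.org/3/library/string.html#format-specification-mini-language
    ; Note: this setting affects every number in the filename, including mixed alphanumeric IDs such as those from instagram
    titlenumberformat = {:03d}

    ; Use an http/https proxy for connections
    proxy = 127.0.0.1:1080

    ; Select all episodes by default when adding a new mission
    selectall = true

    ; Do not create a subfolder for each episode; put all images in the mission folder
    noepfolder = true

    ; Action to take when a duplicate mission is added
    ; update: check for updates
    ; reselect_episodes: reselect the episodes
    mission_conflict_action = update

    ; Whether to verify encrypted (SSL) connections. Defaults to true
    verify = false

    ; Read cookies from a browser, using yt-dlp's cookies-from-browser
    ; https://github.com/yt-dlp/yt-dlp/blob/e5d4f11104ce7ea1717a90eea82c0f7d230ea5d5/yt_dlp/cookies.py#L109
    browser = firefox

    ; Name of the browser profile
    browser_profile = act3nn7e.default

    ; Number of missions downloaded concurrently. Note: multiple missions from the same site cannot be
    ; downloaded concurrently, so this number only matters for missions from different sites
    max_threads = 3
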
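The ``titlenumberformat`` value is a Python format specification applied to the numbers in a title. A minimal sketch of that idea follows; ``format_title_numbers`` is a hypothetical helper written for illustration and is not taken from Comic Crawler's source, so the actual implementation may differ in detail.

.. code:: python

    import re


    def format_title_numbers(title, number_format="{:03d}"):
        """Reformat every run of digits in ``title`` with ``number_format``."""
        return re.sub(
            r"\d+",
            lambda match: number_format.format(int(match.group())),
            title,
        )


    print(format_title_numbers("第1集"))      # 第001集
    print(format_title_numbers("abc12 ep3"))  # abc012 ep003

The second call illustrates the caveat from the comment above: every number is rewritten, including digits that belong to an alphanumeric ID.
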
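- The configuration file is located at ``~\comiccrawler\setting.ini``. You can pass the ``--profile`` option at launch to change the default location. (On Windows, ``~`` is expanded to ``%HOME%`` or ``%USERPROFILE%``.)
- Run ``comiccrawler gui`` once and then close it; the configuration file is generated automatically. If an update to Comic Crawler adds new settings, they are written to the configuration file when the program exits.
- Individual sites have their own settings, usually for login-related information.

  - Settings starting with ``curl`` take the curl command for the corresponding URL. For twitter, for example: https://github.com/eight04/ComicCrawler/issues/241#issuecomment-904411605
  - Settings starting with ``cookie`` take the corresponding cookie.

- Settings take effect after a restart. If Comic Crawler is running, you can click "Reload config file" to load the new settings.

  .. warning::

     If you edit and save the configuration file while Comic Crawler is running and then exit Comic Crawler, your changes are lost, because Comic Crawler writes its settings back to the configuration file before exiting.

- Per-site settings do not affect each other. For example, if ``[DEFAULT]`` sets ``savepath = a`` and ``[Pixiv]`` sets ``savepath = b``, everything downloaded from pixiv is saved to folder b, while other sites use the default and are saved to folder a.
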
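The fallback from a site section to ``[DEFAULT]`` matches how Python's standard ``configparser`` treats the ``DEFAULT`` section. The sketch below assumes the settings file is read with ``configparser`` (the module example refers to a ConfigParser section object); the ``[Example]`` section is a made-up site used only to show the fallback.

.. code:: python

    import configparser

    SETTINGS = """
    [DEFAULT]
    savepath = a

    [Pixiv]
    savepath = b

    [Example]
    """

    config = configparser.ConfigParser()
    # Strip the indentation of the embedded text before parsing.
    config.read_string("\n".join(line.strip() for line in SETTINGS.splitlines()))

    print(config["Pixiv"]["savepath"])    # b, overridden in the site section
    print(config["Example"]["savepath"])  # a, inherited from [DEFAULT]
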
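Sites that require login

Once ``browser`` and ``browser_profile`` are set in the configuration file, Comic Crawler can read cookies from the browser and log in automatically. However, recent versions of Chrome have strengthened their cookie protection:

- https://github.com/yt-dlp/yt-dlp/issues/7271
- https://github.com/yt-dlp/yt-dlp/issues/10927

As a result, only Firefox currently works properly.

Some sites accept a ``cookie`` or ``curl`` setting in the configuration file, but these settings will be phased out over time in favor of automatic login via browser cookies.
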
Module example
Starting from version 2016.4.21, you can add your own module to ~/comiccrawler/mods/module_name.py.
.. code:: python

    #! python3
    """
    This is an example to show how to write a comiccrawler module.
    """

    import re
    from urllib.parse import urljoin
    from comiccrawler.episode import Episode

    # The header used in grabber method. Optional.
    header = {}

    # The cookies. Optional.
    cookie = {}

    # Match domain. Sub-domains are supported, which means "example.com" will
    # match "*.example.com"
    domain = ["www.example.com", "comic.example.com"]

    # Module name
    name = "Example"

    # With noepfolder = True, Comic Crawler won't generate a subfolder for each
    # episode. Optional, default to False.
    noepfolder = False

    # If False then set up the referer header automatically to mimic browser behavior.
    # If True then disable this behavior.
    # Default: False
    no_referer = True

    # Wait 5 seconds before downloading another image. Optional, default to 0.
    rest = 5

    # Wait 5 seconds before analyzing the next page in the analyzer. Optional,
    # default to 0.
    rest_analyze = 5

    # User settings which could be modified from setting.ini. The keys are
    # case-sensitive.
    #
    # After loading the module, the config dictionary would be converted into
    # a ConfigParser section data object so you can e.g. call
    # config.getboolean("use_largest_image") directly.
    #
    # Optional.
    config = {
        # The config value can only be str
        "use_largest_image": "true",

        # These special configs starting with `cookie_` will be automatically
        # used when grabbing html or image.
        "cookie_user": "user-default-value",
        "cookie_hash": "hash-default-value"
    }

    def load_config():
        """This function will be called each time the config reloads. Optional.
        """
        pass

    def get_title(html, url):
        """Return mission title.

        The title would be used in saving filepath, so be sure to avoid
        duplicated title.
        """
        return re.search("<h1 id='title'>(.+?)</h1>", html).group(1)

    def get_episodes(html, url):
        """Return episode list.

        The episode list should be sorted by date, oldest first.
        If it is a multi-page list, specify the URL of the next page in
        get_next_page. Comic Crawler would grab the next page and call this
        function again.

        The `Episode` object accepts an `image` property which can be a list of `Image`.
        However, unlike `get_images`, the `Episode` object is JSON-stringified and saved
        to the disk, therefore you must only use JSON-compatible types i.e. no `Image.get_url`.
        """
        match_list = re.findall("<a href='(.+?)'>(.+?)</a>", html)
        return [Episode(title, urljoin(url, ep_url))
                for ep_url, title in match_list]

    def get_images(html, url):
        """Get the URL of all images.

        The return value could be:

        - A list of images.
        - A generator yielding images.
        - An image, when there is only one image on the current page.

        Comic Crawler treats the following types as an image:

        - str - the URL of the image
        - callable - return a URL when called
        - comiccrawler.core.Image - use it to provide customized filename.

        While receiving the value, it is converted to an Image instance. See ``comiccrawler.core.Image.create()``.

        If the episode has multiple pages, use get_next_page to change page.

        Use generators with caution! If the generator raises any error between
        two images, the next call to the generator will always result in
        StopIteration, which means that Comic Crawler will think it had crawled
        all images and navigate to the next page. If you have to call grabhtml()
        for each image (i.e. it may raise HTTPError), use a list of
        callbacks instead!
        """
        return re.findall("<img src='(.+?)'>", html)

    def get_next_page
