Telescrape
Telegram scraper for downloading messages, comments.bot/comments.app comments, users and photos
Telegram Scraper
This Telegram scraper collects Telegram messages, comments (comments.bot/comments.app) and media files. It was originally built for this story on behalf of Addendum.
Contributors: @PeterWalchhofer, @vali101, @fin
Notes
This scraper was written before Telegram introduced its native comment feature for broadcasting channels. Nowadays, the comments.bot/comments.app extensions are rarely used. Feel free to open a PR to support the native comment feature. @vali101 rewrote the scraper for his bachelor thesis here, with a focus on snowball sampling. His code may help with extending the scraper or with finding interesting channels in the first place. Check it out!
Setup
Requirements:
- Google Chrome (in theory you could also use Firefox if you install the necessary driver manually)
- Python 3
- Telegram Account (Phone Number)
- A lot of storage if downloading media
- Time: the scraper processes around 3,000 messages and comments per minute.
Getting Started
- Install dependencies: make install OR just install requirements.txt (using venv is recommended)
- Create your own channels.csv as explained in the next section
- Put the phone number of the linked Telegram account into the config.yaml
- Get your API key here and put it inside the config.yaml
- Run sh scrape.sh to start the scraper OR run python app.py from the channelscraper directory
- The outputs will be stored in the /output directory.
Input Data
You need to create your own channels.csv and put it in the /input folder.
Only the Link and Broadcast columns are relevant for scraping. The CSV should have the form described below. There is also an example CSV in the folder.
Kategorie | Name | Link | @ | Broadcast
--- | --- | --- | --- | ---
Gruppe Typ XY | Example Channel | https://t.me/example_channel | example_channel | TRUE
- Kategorie (optional): Metadata to annotate the channel
- Name (optional): Display name (not the identifier)
- Link: Link to the channel
- @ (optional): Identifier name
- Broadcast: True if the channel is a broadcasting channel, else False. Broadcasting channels are large one-to-many channels in which only the owners can write messages.
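As a rough illustration of the column layout above, the following sketch reads such a file and keeps only the two columns that matter for scraping. It is not the scraper's own loader: the function name load_channels, the comma delimiter and the inline sample rows are assumptions made for this example.

```python
import csv
import io

# Hypothetical sample mirroring the table above. A real run would open
# the channels.csv file in the input folder instead.
# Assumption: the file is comma-delimited.
SAMPLE = """Kategorie,Name,Link,@,Broadcast
Gruppe Typ XY,Example Channel,https://t.me/example_channel,example_channel,TRUE
,,https://t.me/example_group,,FALSE
"""


def load_channels(handle):
    """Return (link, is_broadcast) pairs, ignoring the optional columns."""
    channels = []
    for row in csv.DictReader(handle):
        link = (row.get("Link") or "").strip()
        if not link:
            continue  # a row without a link cannot be scraped
        # Accept TRUE/True/true; anything else counts as not broadcasting.
        is_broadcast = (row.get("Broadcast") or "").strip().lower() == "true"
        channels.append((link, is_broadcast))
    return channels


channels = load_channels(io.StringIO(SAMPLE))
```

The optional columns may be left empty, as in the second sample row, because only Link and Broadcast are read.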
Configuration
The Scraper can be further configured via the channelscraper/config.yaml.
Features
Messages
The Scraper extracts all messages from a channel. It is also possible to scrape only those messages that were written in the last x days. This can be set in the config.yaml.
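The "last x days" restriction boils down to a cutoff comparison on the message date. A minimal sketch, assuming timezone-aware datetimes; the helper name is_recent and its parameters are hypothetical and do not correspond to actual config.yaml keys:

```python
from datetime import datetime, timedelta, timezone


def is_recent(message_date, days, now=None):
    """Return True if message_date lies within the last `days` days.

    Assumption: message_date is timezone-aware. `now` can be passed
    explicitly to make the check reproducible.
    """
    now = now or datetime.now(timezone.utc)
    return message_date >= now - timedelta(days=days)
```

When messages are iterated from newest to oldest, the loop can stop at the first message for which the check fails.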
Comment Bots
- Many broadcasting channels use comment bots to collect feedback from the audience. These comments can be scraped as well. Currently the comments.app and comments.bot bots are supported.
- Be careful: the date format differs from the one returned by the Telegram API and has to be parsed manually (e.g. Dec 09).
- The comment pages have a "Load more comments" button, which has to be clicked using JavaScript. Selenium is used to interact with the Chrome driver, which is installed automatically.
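The manual date parsing mentioned above could look roughly like the sketch below. It is not the scraper's actual parser. It assumes that the comment date carries no year (as in "Dec 09"), that the most recent past occurrence of that day is meant, and that month abbreviations are English; the name parse_comment_date is made up for this example.

```python
from datetime import date, datetime


def parse_comment_date(text, today):
    """Turn a year-less date such as 'Dec 09' into a full date.

    Assumption: the comment was written on the most recent occurrence
    of that day that is not later than `today`.
    """
    # Parse against a leap year so that "Feb 29" is accepted.
    # Note: %b matches English month names only under an English/C locale.
    parsed = datetime.strptime(f"{text} 2000", "%b %d %Y")
    for year in range(today.year, today.year - 8, -1):
        try:
            candidate = date(year, parsed.month, parsed.day)
        except ValueError:
            continue  # Feb 29 does not exist in this year
        if candidate <= today:
            return candidate
    raise ValueError(f"could not resolve a year for {text!r}")
```

For a scrape running in mid-January, "Dec 09" resolves to December of the previous year rather than to a date in the future.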
Comments.app
- Unique username extraction works most of the time. However, it is not possible if the user has deleted their account.
Comments.bot
- Unique usernames are not extracted, because we found no way to obtain them without querying the API at a high cost.
- Only the display name is persisted.
Further remarks
- Messages and comments are persisted in the same csv-file. To tell them apart use the isComment column. Additionally, the ID includes a period in the format msgId.commId (e.g. 101.3)
- We allocated ids to the comments manually. The telegram message ids are unique within a channel.
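When post-processing the output, the combined ID can be split back into its two parts. The sketch below is an illustration only and the helper name split_row_id is hypothetical. It treats the ID as text, because a value such as 101.10 would collapse into 101.1 if the column were loaded as a float.

```python
def split_row_id(row_id):
    """Split 'msgId.commId' into (message_id, comment_id).

    comment_id is None for plain messages whose ID has no period.
    """
    message_part, _, comment_part = str(row_id).partition(".")
    comment_id = int(comment_part) if comment_part else None
    return int(message_part), comment_id


# Example: the third comment on message 101
message_id, comment_id = split_row_id("101.3")
```

Reading the ID column as a string is therefore the safer choice when loading the CSV into other tools.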
Optional
If you want to run Selenium with Docker, use the selenium/standalone-chrome:3.141.59-yttrium image.
Also see:
https://stackoverflow.com/questions/45323271/how-to-run-selenium-with-chrome-in-docker
