Webspot
Webspot is an intelligent web service to automatically detect web content and extract information from it.
Screenshots
Detected Results

Extracted Fields

Extracted Data

Get Started
Docker
Make sure you have installed Docker and Docker Compose.
# clone git repo
git clone https://github.com/crawlab-team/webspot
# start docker containers
docker-compose up -d
Then you can access the web UI at http://localhost:9999.
API Reference
Once Webspot is running, you can go to http://localhost:9999/redoc to view the API reference.
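As a sketch of how a client might call the service: the endpoint path and payload below are assumptions for illustration only, not taken from the real API; consult http://localhost:9999/redoc for the actual routes and schemas.

```python
import json
from urllib.request import Request, urlopen

def build_request(base_url, target_url):
    """Build a JSON POST asking Webspot to detect content on target_url.
    The /api/requests route is a hypothetical placeholder; verify against /redoc."""
    payload = json.dumps({"url": target_url}).encode()
    return Request(
        f"{base_url}/api/requests",  # assumed route, not confirmed by the source
        data=payload,
        headers={"Content-Type": "application/json"},
    )

if __name__ == "__main__":
    req = build_request("http://localhost:9999", "https://quotes.toscrape.com")
    with urlopen(req) as resp:  # requires a running Webspot instance
        print(json.load(resp))
```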
Architecture
The overall process of how Webspot detects meaningful elements from HTML or web pages is shown in the following figure.
graph LR
hr[HtmlRequester]
gl[GraphLoader]
d[Detector]
r[Results]
hr --"html + json"--> gl --"graph"--> d --"output"--> r
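The stages in the figure above can be sketched in miniature. The classes and functions below are simplified stand-ins for illustration, not the actual webspot API; see the webspot/ package for the real implementations.

```python
from html.parser import HTMLParser
from collections import Counter

class GraphLoader(HTMLParser):
    """Stand-in for webspot's GraphLoader: builds a parent->child tag graph."""
    def __init__(self):
        super().__init__()
        self.stack = []
        self.edges = []  # (parent_tag, child_tag) pairs

    def handle_starttag(self, tag, attrs):
        if self.stack:
            self.edges.append((self.stack[-1], tag))
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

def detect(edges):
    """Toy detector: flags parents whose child tag repeats, a crude
    proxy for detecting list-like content."""
    counts = Counter(edges)
    return [edge for edge, n in counts.items() if n >= 3]

def run_pipeline(html):
    # HtmlRequester would fetch the HTML first; here it is passed in directly.
    loader = GraphLoader()
    loader.feed(html)
    return detect(loader.edges)  # Results

if __name__ == "__main__":
    html = "<ul>" + "<li>item</li>" * 5 + "</ul>"
    print(run_pipeline(html))  # the <ul> with repeated <li> children is detected
```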
Development
Follow the guidance below to get started.
Pre-requisites
- Python >=3.8 and <=3.10
- Go 1.16 or higher
- MongoDB 4.2 or higher
Install dependencies
# dependencies
pip install -r requirements.txt
Configure Environment Variables
Database configuration is located in the .env file. You can copy the example file and modify it.
cp .env.example .env
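For orientation only, a copied .env might contain MongoDB settings along these lines. The variable names below are illustrative guesses, not the actual keys; always start from the keys present in .env.example.

```
# Hypothetical example values; use the keys from .env.example
MONGO_HOST=localhost
MONGO_PORT=27017
MONGO_DB=webspot
```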
Start web server
# start development server
python main.py web
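To confirm the server came up, a small probe like the one below can help. It assumes the default port 9999 mentioned in the Docker section; adjust if your configuration differs.

```python
from urllib.request import urlopen
from urllib.error import URLError

def is_up(url, timeout=2.0):
    """Return True if an HTTP GET to url succeeds within timeout seconds."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except (URLError, OSError):
        return False

if __name__ == "__main__":
    print("web UI reachable:", is_up("http://localhost:9999"))
```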
Code Structure
The core code is located in webspot directory. The main.py file is the entry point of the web server.
webspot
├── cmd # command line tools
├── crawler # web crawler
├── data # data files (html, json, etc.)
├── db # database
├── detect # web content detection
├── graph # graph module
├── models # models
├── request # request helper
├── test # test cases
├── utils # utilities
└── web # web server
TODOs
Webspot aims to automate the process of web content detection and extraction. It is far from ready for production use. The following features are planned:
- [ ] Table detection
- [ ] Nested list detection
- [ ] Export to spiders
- [ ] Advanced browser request
Disclaimer
Please follow local laws and regulations when using Webspot. The author is not responsible for any legal issues arising from its use. Please read the Disclaimer for details.
Community
If you are interested in Webspot, add the author's WeChat account "tikazyq1" with the note "Webspot" to join the discussion group.
<p align="center"> <img src="https://crawlab.oss-cn-hangzhou.aliyuncs.com/gitbook/qrcode.png" height="360"> </p>
