GoScraper 🚀
Enterprise-Grade Web Scraping Library & Microservice for Go
A modern, fast, stealth-oriented web scraping library with AI-powered extraction, bot-detection evasion, and a microservice architecture. Suited to e-commerce, news, and data extraction at scale.
🌟 Key Features
🤖 AI-Powered Smart Extraction
- Multiple AI Providers: OpenAI GPT-4, Anthropic Claude, Local models
- Smart Content Detection: Automatically identifies and extracts structured data
- Confidence Scoring: Quality assurance for extracted data
- Fallback Chain: CSS/XPath extraction when AI fails
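The fallback idea in the list above can be sketched in a few lines of Go. This is a minimal illustration, not GoScraper's actual implementation: the `Result` and `Extractor` types, the `extractWithFallback` function, and the two stand-in extractors are all hypothetical names introduced here. It assumes each extractor reports a confidence score and that a result below the threshold is treated like a failure.

```go
package main

import (
	"errors"
	"fmt"
)

// Result is a hypothetical extraction result; the field names are
// illustrative and not taken from the GoScraper API.
type Result struct {
	Source     string
	Title      string
	Confidence float64
}

// Extractor is a hypothetical signature for one step in the chain.
type Extractor func(html string) (Result, error)

// extractWithFallback tries each extractor in order and returns the first
// result whose confidence meets the threshold.
func extractWithFallback(html string, threshold float64, chain ...Extractor) (Result, error) {
	for _, ex := range chain {
		res, err := ex(html)
		if err != nil || res.Confidence < threshold {
			continue // fall through to the next extractor
		}
		return res, nil
	}
	return Result{}, errors.New("no extractor met the confidence threshold")
}

// aiExtractor stands in for an AI provider that returns a low-confidence answer.
func aiExtractor(html string) (Result, error) {
	return Result{Source: "ai", Title: "?", Confidence: 0.4}, nil
}

// cssExtractor stands in for a deterministic CSS selector extractor.
func cssExtractor(html string) (Result, error) {
	return Result{Source: "css", Title: "Example Domain", Confidence: 1.0}, nil
}

func main() {
	res, err := extractWithFallback("<html>...</html>", 0.8, aiExtractor, cssExtractor)
	if err != nil {
		fmt.Println("extraction failed:", err)
		return
	}
	fmt.Printf("%s extractor returned %q\n", res.Source, res.Title)
}
```

Here the AI stand-in scores 0.4, below the 0.8 threshold, so the CSS stand-in supplies the result.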
🏗️ Microservice Architecture
- HTTP API Server: RESTful endpoints for scraping operations
- Docker Support: Container-ready with Docker Compose
- Kubernetes Ready: Production deployment manifests included
- Load Balancing: Nginx configuration for horizontal scaling
⚙️ Flexible Configuration System
- JSON Configuration: File-based configuration management
- Environment Variables: 12-factor app compliance
- CLI Tools: Interactive setup and validation
- Hot Reloading: Runtime configuration updates
🌐 Multi-Engine Browser Support
- ChromeDP: High-performance Chrome automation
- Rod: Lightning-fast browser control
- Stealth Mode: Advanced anti-detection techniques
- Headless & GUI: Flexible rendering options
🚀 Production Features
- Rate Limiting: Configurable request throttling
- Caching: Redis and in-memory caching
- Proxy Support: IP rotation and geo-targeting
- Health Checks: Monitoring and observability
- Graceful Shutdown: Clean resource management
📦 Installation
go get github.com/ramusaaa/goscraper
🚀 Quick Start
Method 1: Interactive Setup (Recommended)
# 1. Initialize configuration
make init-config
# 2. Interactive setup wizard
make setup
# Follow prompts to configure AI keys, caching, etc.
# 3. Validate configuration
make validate-config
# 4. Start the server
make run
Method 2: Environment Variables
# Set your API keys
export OPENAI_API_KEY="your-openai-key"
export GOSCRAPER_AI_ENABLED=true
# Start the server
go run ./cmd/api
Method 3: Manual Configuration
# Create config file
cp goscraper.example.json goscraper.json
# Edit configuration
vim goscraper.json
# Start server
go run ./cmd/api
💻 Usage Examples
Basic Library Usage
package main

import (
	"fmt"
	"log"

	"github.com/ramusaaa/goscraper"
)

func main() {
	// Simple scraping
	scraper := goscraper.New()
	resp, err := scraper.Get("https://example.com")
	if err != nil {
		log.Fatal(err)
	}

	title := resp.Document.Find("title").Text()
	fmt.Printf("Page title: %s\n", title)
}
Advanced Configuration
// This fragment needs the "time" package imported alongside goscraper.
scraper := goscraper.New(
	goscraper.WithTimeout(30*time.Second),
	goscraper.WithUserAgent("MyBot/1.0"),
	goscraper.WithHeaders(map[string]string{
		"Accept-Language": "en-US,en;q=0.9",
	}),
	goscraper.WithRateLimit(500*time.Millisecond),
	goscraper.WithMaxRetries(3),
	goscraper.WithProxy("http://proxy.example.com:8080"),
	goscraper.WithStealth(true),
)
HTTP API Usage
# Health check
curl http://localhost:8080/health
# Get configuration
curl http://localhost:8080/config
# Scrape a website
curl -X POST http://localhost:8080/api/scrape \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com"}'
# Smart AI-powered scraping
curl -X POST http://localhost:8080/api/smart-scrape \
-H "Content-Type: application/json" \
-d '{"url": "https://shop.example.com/products"}'
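The same scrape request can be issued from Go with only the standard library. This is a sketch under assumptions: the `postScrape` helper is a hypothetical name, and because the response schema is not documented in this section, the body is returned as raw bytes rather than decoded into a struct.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"time"
)

// postScrape sends {"url": target} to <baseURL>/api/scrape and returns the
// HTTP status code and the raw response body.
func postScrape(baseURL, target string) (int, []byte, error) {
	payload, err := json.Marshal(map[string]string{"url": target})
	if err != nil {
		return 0, nil, err
	}

	httpClient := &http.Client{Timeout: 30 * time.Second}
	resp, err := httpClient.Post(baseURL+"/api/scrape", "application/json", bytes.NewReader(payload))
	if err != nil {
		return 0, nil, err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	return resp.StatusCode, body, err
}

func main() {
	// Assumes the API server from the Quick Start is running locally.
	status, body, err := postScrape("http://localhost:8080", "https://example.com")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Printf("status %d, %d bytes\n", status, len(body))
}
```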
Client SDK Usage
package main

import (
	"fmt"
	"log"

	"github.com/ramusaaa/goscraper/client"
)

func main() {
	// Create a client for the remote scraper service.
	// It is named "c" so the variable does not shadow the imported client package.
	c := client.NewScraperClient("http://localhost:8080")

	// Health check
	if err := c.Health(); err != nil {
		log.Fatalf("Service unavailable: %v", err)
	}

	// Scrape website
	data, err := c.Scrape("https://example.com")
	if err != nil {
		log.Fatalf("Scraping failed: %v", err)
	}

	fmt.Printf("Title: %s\n", data.Title)
	fmt.Printf("Status: %d\n", data.StatusCode)
}
📋 Configuration Reference
Configuration File Structure
{
  "server": {
    "port": "8080",
    "host": "0.0.0.0",
    "read_timeout": "30s",
    "write_timeout": "30s"
  },
  "ai": {
    "enabled": true,
    "provider": "openai",
    "confidence_threshold": 0.8,
    "fallback_chain": ["openai", "css", "xpath"],
    "models": {
      "openai": {
        "api_key": "your-openai-key",
        "model": "gpt-4"
      },
      "anthropic": {
        "api_key": "your-anthropic-key",
        "model": "claude-3-sonnet-20240229"
      }
    }
  },
  "browser": {
    "engine": "chromedp",
    "headless": true,
    "stealth": true,
    "pool_size": 5
  },
  "cache": {
    "enabled": true,
    "type": "redis",
    "ttl": "1h",
    "redis": {
      "host": "localhost",
      "port": 6379
    }
  },
  "rate_limit": {
    "requests_per_second": 10,
    "delay": "100ms"
  }
}
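To show how a file with this shape maps onto Go types, the sketch below decodes the `server` and `ai` sections with `encoding/json` and runs two basic checks. The `Config` struct and `parseConfig` function are hypothetical names written for this example, not GoScraper's own types, and the rule that the confidence threshold lies between 0 and 1 is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Config mirrors part of the JSON structure above (hypothetical type).
type Config struct {
	Server struct {
		Port        string `json:"port"`
		Host        string `json:"host"`
		ReadTimeout string `json:"read_timeout"`
	} `json:"server"`
	AI struct {
		Enabled             bool     `json:"enabled"`
		Provider            string   `json:"provider"`
		ConfidenceThreshold float64  `json:"confidence_threshold"`
		FallbackChain       []string `json:"fallback_chain"`
	} `json:"ai"`
}

// sampleJSON is a trimmed copy of the configuration shown above.
const sampleJSON = `{
  "server": {"port": "8080", "host": "0.0.0.0", "read_timeout": "30s"},
  "ai": {
    "enabled": true,
    "provider": "openai",
    "confidence_threshold": 0.8,
    "fallback_chain": ["openai", "css", "xpath"]
  }
}`

// parseConfig decodes the JSON and validates a few fields.
func parseConfig(data []byte) (Config, error) {
	var cfg Config
	if err := json.Unmarshal(data, &cfg); err != nil {
		return cfg, err
	}
	// Duration strings such as "30s" are checked with time.ParseDuration.
	if _, err := time.ParseDuration(cfg.Server.ReadTimeout); err != nil {
		return cfg, fmt.Errorf("server.read_timeout: %w", err)
	}
	// Assumed rule: the threshold is a probability-like score.
	if cfg.AI.ConfidenceThreshold < 0 || cfg.AI.ConfidenceThreshold > 1 {
		return cfg, fmt.Errorf("ai.confidence_threshold must be between 0 and 1")
	}
	return cfg, nil
}

func main() {
	cfg, err := parseConfig([]byte(sampleJSON))
	if err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	fmt.Println(cfg.Server.Host+":"+cfg.Server.Port, cfg.AI.Provider, cfg.AI.FallbackChain)
}
```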
Environment Variables
# Server Configuration
GOSCRAPER_PORT=8080
GOSCRAPER_HOST=0.0.0.0
# AI Configuration
GOSCRAPER_AI_ENABLED=true
GOSCRAPER_AI_PROVIDER=openai
OPENAI_API_KEY=your-openai-key
ANTHROPIC_API_KEY=your-anthropic-key
# Browser Configuration
GOSCRAPER_BROWSER_ENGINE=chromedp
GOSCRAPER_BROWSER_HEADLESS=true
GOSCRAPER_BROWSER_STEALTH=true
# Cache Configuration
GOSCRAPER_CACHE_ENABLED=true
GOSCRAPER_CACHE_TYPE=redis
REDIS_HOST=localhost
REDIS_PORT=6379
# Rate Limiting
GOSCRAPER_RATE_LIMIT_RPS=10
GOSCRAPER_RATE_LIMIT_DELAY=100ms
🛠️ CLI Tools
Available Commands
# Configuration Management
make init-config # Create default configuration
make setup # Interactive setup wizard
make validate-config # Validate configuration
make show-config # Display current configuration
# Development
make build # Build binaries
make run # Start API server
make test # Run tests
# Docker
make docker-build # Build Docker image
make docker-compose-up # Start with Docker Compose
make docker-compose-down # Stop Docker services
# Kubernetes
make k8s-deploy # Deploy to Kubernetes
make k8s-delete # Remove from Kubernetes
CLI Usage Examples
# Initialize new project
goscraper init
# Interactive setup
goscraper setup
# Validate configuration
goscraper validate
# Show current configuration
goscraper config
🏗️ Architecture Overview
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Load Balancer  │     │   API Gateway   │     │  Web Dashboard  │
│     (Nginx)     │     │   (Optional)    │     │   (Optional)    │
└─────────────────┘     └─────────────────┘     └─────────────────┘
         │                       │                       │
         └───────────────────────┼───────────────────────┘
                                 │
         ┌───────────────────────┼───────────────────────┐
         │                       │                       │
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│ Scraper Node 1  │     │ Scraper Node 2  │     │ Scraper Node N  │
│                 │     │                 │     │                 │
│ ┌─────────────┐ │     │ ┌─────────────┐ │     │ ┌─────────────┐ │
│ │HTTP API     │ │     │ │HTTP API     │ │     │ │HTTP API     │ │
│ │Server       │ │     │ │Server       │ │     │ │Server       │ │
│ └─────────────┘ │     │ └─────────────┘ │     │ └─────────────┘ │
│ ┌─────────────┐ │     │ ┌─────────────┐ │     │ ┌─────────────┐ │
│ │Browser Pool │ │     │ │Browser Pool │ │     │ │Browser Pool │ │
│ │+ AI Engine  │ │     │ │+ AI Engine  │ │     │ │+ AI Engine  │ │
│ └─────────────┘ │     │ └─────────────┘ │     │ └─────────────┘ │
└─────────────────┘     └─────────────────┘     └─────────────────┘
         │                       │                       │
         └───────────────────────┼───────────────────────┘
                                 │
    ┌─────────────────────────────────────────────────────────┐
    │                  Infrastructure Layer                   │
    │                                                         │
    │  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐  │
    │  │    Redis    │    │   Config    │    │    Proxy    │  │
    │  │    Cache    │    │   Storage   │    │  Rotation   │  │
    │  └─────────────┘    └─────────────┘    └─────────────┘  │
    │                                                         │
    │  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐  │
    │  │   OpenAI    │    │  Anthropic  │    │    Local    │  │
    │  │     API     │    │     API     │    │   Models    │  │
    │  └─────────────┘    └─────────────┘    └─────────────┘  │
    └─────────────────────────────────────────────────────────┘
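In the diagram, the load balancer spreads requests across identical scraper nodes. The simplest distribution policy is round-robin, sketched below in Go purely to illustrate the idea. The `roundRobin` type and the node addresses are hypothetical; in the deployment described here that job belongs to Nginx, not to application code. The sketch uses `atomic.Uint64`, which requires Go 1.19 or later.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// roundRobin hands out node addresses in rotation and is safe for
// concurrent use.
type roundRobin struct {
	nodes []string
	next  atomic.Uint64
}

// pick returns the next node in the rotation.
func (r *roundRobin) pick() string {
	n := r.next.Add(1) - 1 // index of this call, starting at 0
	return r.nodes[n%uint64(len(r.nodes))]
}

func main() {
	// Hypothetical node addresses.
	rr := &roundRobin{nodes: []string{
		"http://scraper-1:8080",
		"http://scraper-2:8080",
		"http://scraper-3:8080",
	}}
	for i := 0; i < 4; i++ {
		fmt.Println(rr.pick())
	}
}
```

The fourth call wraps around to the first node again.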
🚀 Deployment Options
1. Standalone Binary
# Build and run
go build -o goscraper ./cmd/api
./goscraper
2. Docker Container
# Build image
docker build -t goscraper:latest .
# Run container
docker run -p 8080:8080 \
-e OPENAI_API_KEY=your-key \
-e GOSCRAPER_AI_ENABLED=true \
goscraper:latest
3. Docker Compose
# Start services
docker-compose up -d
# View logs
docker-compose logs -f scraper-api
# Stop services
docker-compose down
4. Kubernetes
