CogniScrape
Intelligent Web Scraping Library with LLMs
🕷️ CogniScrape
Intelligent Web Scraping with LLMs - A TypeScript library that combines traditional web scraping with Large Language Models for intelligent, structured data extraction.
✨ Features
- 🤖 Dual LLM Support: Ollama (free/local) + Google Gemini (cloud)
- 📊 Graph-Based Architecture: Composable, reusable node pipelines
- 🚀 Production-Ready: Built-in caching, retries, rate limiting, and proxy rotation
- 🎯 Smart Parsing: Automatic HTML→Markdown conversion and intelligent chunking
- ✅ Schema Validation: Zod integration for type-safe outputs
- 📝 Multiple Formats: JSON, CSV, XML, PDF support
- 🌐 Browser Automation: Playwright for dynamic content
- 🧠 RAG Integration: Retrieval-Augmented Generation for better accuracy
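This README does not describe how "intelligent chunking" works internally. As a rough illustration of the general idea only, the sketch below splits Markdown into chunks under a character budget, preferring paragraph boundaries. The function name `chunkMarkdown` and the `maxChars` parameter are hypothetical and are not part of CogniScrape's API.

```typescript
// Hypothetical helper, not part of CogniScrape: splits Markdown into chunks
// of at most `maxChars` characters, preferring paragraph boundaries.
function chunkMarkdown(markdown: string, maxChars: number): string[] {
  if (maxChars <= 0) throw new RangeError('maxChars must be positive');
  const paragraphs = markdown
    .split(/\n{2,}/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);
  const chunks: string[] = [];
  let current = '';
  for (const paragraph of paragraphs) {
    // A single oversized paragraph is hard-split so no chunk exceeds the budget.
    if (paragraph.length > maxChars) {
      if (current) {
        chunks.push(current);
        current = '';
      }
      for (let i = 0; i < paragraph.length; i += maxChars) {
        chunks.push(paragraph.slice(i, i + maxChars));
      }
      continue;
    }
    // Otherwise, append to the current chunk while it still fits.
    const candidate = current ? `${current}\n\n${paragraph}` : paragraph;
    if (candidate.length <= maxChars) {
      current = candidate;
    } else {
      chunks.push(current);
      current = paragraph;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

A real implementation would more likely budget by LLM tokens than by characters; characters are used here to keep the sketch dependency-free.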
📦 Installation
npm install cogniscrape
🚀 Quick Start
Basic Web Scraping with Gemini
import { SmartScraperGraph } from 'cogniscrape';
const scraper = new SmartScraperGraph({
prompt: 'Extract all product names and prices',
source: 'https://example.com/products',
config: {
llm: {
provider: 'gemini',
model: 'gemini-2.0-flash-exp',
apiKey: process.env.GEMINI_API_KEY,
},
verbose: true,
},
});
const result = await scraper.run();
console.log(result);
Using Ollama (100% Free & Local)
import { SmartScraperGraph } from 'cogniscrape';
const scraper = new SmartScraperGraph({
prompt: 'List all article titles and summaries',
source: 'https://news.example.com',
config: {
llm: {
provider: 'ollama',
model: 'llama2', // or 'mistral', 'codellama', etc.
baseUrl: 'http://localhost:11434',
},
},
});
const result = await scraper.run();
🎯 Available Graphs
| Graph | Purpose | Use Case |
|-------|---------|----------|
| SmartScraperGraph | Basic scraping | Extract data from a single URL |
| SmartScraperMultiGraph | Multi-URL scraping | Scrape multiple sources (parallel/sequential) |
| SearchGraph | Internet search + scrape | Search engines + content extraction |
| DepthSearchGraph | Deep analysis | Search + reasoning + comprehensive analysis |
| CSVScraperGraph | CSV export | Scrape data → export to CSV |
| JSONScraperGraph | JSON export | Schema-validated JSON output |
📚 Examples
Multi-URL Scraping
import { SmartScraperMultiGraph, createLLM } from 'cogniscrape';
const llm = createLLM({
provider: 'gemini',
model: 'gemini-2.0-flash-exp',
apiKey: process.env.GEMINI_API_KEY,
});
const scraper = new SmartScraperMultiGraph(
'Extract company names and descriptions',
[
'https://company1.com',
'https://company2.com',
'https://company3.com',
],
{ llm },
llm,
true // parallel execution
);
const result = await scraper.run();
CSV Export with Schema Validation
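import { CSVScraperGraph, createLLM } from 'cogniscrape';
import { z } from 'zod';
// Create the LLM instance that is passed to the constructor below
const llm = createLLM({
provider: 'gemini',
model: 'gemini-2.0-flash-exp',
apiKey: process.env.GEMINI_API_KEY,
});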
const schema = z.object({
products: z.array(z.object({
name: z.string(),
price: z.number(),
rating: z.number().optional(),
})),
});
const scraper = new CSVScraperGraph(
'Extract all products with their prices',
'https://shop.example.com',
{
llm: {
provider: 'gemini',
model: 'gemini-2.0-flash-exp',
apiKey: process.env.GEMINI_API_KEY,
},
schema,
},
llm,
'products.csv'
);
await scraper.run();
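How CSVScraperGraph serializes rows is not shown in this README. The sketch below illustrates one common approach, quoting fields in the style of RFC 4180. The names `toCsv`, `escapeCsvField`, and `Row` are hypothetical, and the sample values are made up.

```typescript
// Hypothetical helper, not CogniScrape's implementation: serializes flat
// records to CSV, quoting fields in the style of RFC 4180.
type Row = Record<string, string | number | undefined>;

function escapeCsvField(value: string | number | undefined): string {
  const text = value === undefined ? '' : String(value);
  // Quote fields containing commas, quotes, or line breaks; double inner quotes.
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

function toCsv(rows: Row[], columns: string[]): string {
  const header = columns.map(escapeCsvField).join(',');
  const lines = rows.map((row) =>
    columns.map((col) => escapeCsvField(row[col])).join(','),
  );
  return [header, ...lines].join('\r\n');
}

// Sample data shaped like the `products` schema above (values are made up).
const sampleProducts: Row[] = [
  { name: 'Desk, oak', price: 249.5, rating: 4.7 },
  { name: 'Lamp "Luna"', price: 39 },
];
const sampleCsv = toCsv(sampleProducts, ['name', 'price', 'rating']);
```

Missing optional fields such as `rating` become empty cells, so every row keeps the same number of columns.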
Internet Search Graph
import { SearchGraph, createLLM } from 'cogniscrape';
const llm = createLLM({
provider: 'gemini',
model: 'gemini-2.0-flash-exp',
apiKey: process.env.GEMINI_API_KEY,
});
const searchGraph = new SearchGraph(
'Latest news about AI developments in 2026',
{
llm,
searchEngine: 'duckduckgo',
maxDepth: 3,
},
llm
);
const result = await searchGraph.run();
⚙️ Configuration Options
interface ScraperConfig {
llm: LLMConfig;
verbose?: boolean; // Enable logging
headless?: boolean; // Headless browser mode
timeout?: number; // Request timeout (ms)
cut?: boolean; // Enable HTML minification
htmlMode?: boolean; // Skip parsing (use raw HTML)
// Production features
proxy?: ProxyConfig; // Proxy configuration
retry?: RetryConfig; // Retry with backoff
rateLimit?: RateLimitConfig; // Rate limiting
cache?: CacheConfig; // Response caching
// Advanced
schema?: any; // Zod schema for validation
additionalInfo?: string; // Extra context for LLM
reasoning?: boolean; // Enable reasoning mode
}
🔧 Production Features
Proxy Rotation
const config = {
llm: { /* ... */ },
proxy: {
enabled: true,
proxies: [
'http://proxy1.com:8080',
'http://proxy2.com:8080',
],
},
};
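The README does not say which rotation strategy the library uses. Assuming simple round-robin, the idea can be sketched as follows; `createProxyRotator` is a hypothetical name, not a CogniScrape export.

```typescript
// Hypothetical helper, not CogniScrape's implementation: returns a function
// that cycles through a fixed proxy list in round-robin order.
function createProxyRotator(proxies: readonly string[]): () => string {
  if (proxies.length === 0) {
    throw new Error('At least one proxy is required');
  }
  let index = 0;
  return () => {
    const proxy = proxies[index] as string;
    // Advance the cursor, wrapping back to the first proxy at the end.
    index = (index + 1) % proxies.length;
    return proxy;
  };
}

const nextProxy = createProxyRotator([
  'http://proxy1.com:8080',
  'http://proxy2.com:8080',
]);
```

Each call to `nextProxy()` yields the next entry, so consecutive requests alternate between the two proxies.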
Retry with Exponential Backoff
const config = {
llm: { /* ... */ },
retry: {
maxRetries: 3,
initialDelay: 1000,
maxDelay: 10000,
backoffMultiplier: 2,
},
};
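To make the effect of these four settings concrete, here is a minimal sketch of exponential backoff, assuming the delay before retry `n` is `initialDelay * backoffMultiplier^n`, capped at `maxDelay`. The names `RetrySettings`, `backoffDelay`, and `withRetry` are hypothetical, and the library's own formula may differ (for example, by adding jitter).

```typescript
// Hypothetical type mirroring the fields of the `retry` block above.
interface RetrySettings {
  maxRetries: number;
  initialDelay: number; // milliseconds
  maxDelay: number; // milliseconds
  backoffMultiplier: number;
}

// Delay before retry number `attempt` (0-based), capped at maxDelay.
function backoffDelay(attempt: number, cfg: RetrySettings): number {
  const raw = cfg.initialDelay * Math.pow(cfg.backoffMultiplier, attempt);
  return Math.min(raw, cfg.maxDelay);
}

// Runs `task` once, then retries up to `maxRetries` times on failure.
async function withRetry<T>(
  task: () => Promise<T>,
  cfg: RetrySettings,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= cfg.maxRetries; attempt++) {
    try {
      return await task();
    } catch (error) {
      lastError = error;
      // Wait only if another attempt will follow.
      if (attempt < cfg.maxRetries) await sleep(backoffDelay(attempt, cfg));
    }
  }
  throw lastError;
}

const retrySettings: RetrySettings = {
  maxRetries: 3,
  initialDelay: 1000,
  maxDelay: 10000,
  backoffMultiplier: 2,
};
```

Under this assumption, the settings above wait 1000 ms, 2000 ms, and 4000 ms before the three retries; the 10000 ms cap would only take effect from a fifth retry onward.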
Rate Limiting
const config = {
llm: { /* ... */ },
rateLimit: {
maxRequests: 10,
windowMs: 1000,
minDelay: 100,
},
};
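The exact semantics of these fields are not documented here. The sketch below assumes `maxRequests` is a cap per sliding window of `windowMs` milliseconds and `minDelay` is the minimum gap between consecutive requests. `SlidingWindowLimiter` and `RateLimitSettings` are hypothetical names; timestamps are passed in explicitly so the logic is easy to follow and test.

```typescript
// Hypothetical type mirroring the fields of the `rateLimit` block above.
interface RateLimitSettings {
  maxRequests: number;
  windowMs: number;
  minDelay: number;
}

// Hypothetical limiter, not CogniScrape's implementation.
class SlidingWindowLimiter {
  private timestamps: number[] = [];
  private settings: RateLimitSettings;

  constructor(settings: RateLimitSettings) {
    this.settings = settings;
  }

  // Milliseconds the caller should wait before sending a request at `now`.
  waitTime(now: number): number {
    const { maxRequests, windowMs, minDelay } = this.settings;
    // Forget requests that have left the sliding window.
    this.timestamps = this.timestamps.filter((t) => now - t < windowMs);
    let wait = 0;
    const last = this.timestamps[this.timestamps.length - 1];
    if (last !== undefined) wait = Math.max(wait, last + minDelay - now);
    // If the window is full, wait until its oldest relevant entry expires.
    const oldest = this.timestamps[this.timestamps.length - maxRequests];
    if (oldest !== undefined) wait = Math.max(wait, oldest + windowMs - now);
    return Math.max(0, wait);
  }

  // Records that a request was sent at time `now`.
  record(now: number): void {
    this.timestamps.push(now);
  }
}
```

In real use, a caller would pass `Date.now()` to `waitTime`, sleep for the returned duration, and then call `record` when the request is actually sent.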
Caching
const config = {
llm: { /* ... */ },
cache: {
enabled: true,
ttl: 3600000, // 1 hour
maxSize: 1000,
},
};
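As an illustration of what `ttl` and `maxSize` typically mean, the sketch below implements a small in-memory cache with time-based expiry and a size cap. `TtlCache` is a hypothetical name, and evicting the oldest entry when the cache is full is an assumption; the README does not state the library's eviction policy.

```typescript
// Hypothetical cache, not CogniScrape's implementation.
class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  private ttl: number;
  private maxSize: number;
  private now: () => number;

  // `now` is injectable so expiry can be exercised without real waiting.
  constructor(ttl: number, maxSize: number, now: () => number = () => Date.now()) {
    this.ttl = ttl;
    this.maxSize = maxSize;
    this.now = now;
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    // Expired entries are removed lazily on read.
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    // Delete first so an updated key moves to the end of the insertion order.
    this.entries.delete(key);
    if (this.entries.size >= this.maxSize) {
      // Map iterates in insertion order, so the first key is the oldest.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(key, { value, expiresAt: this.now() + this.ttl });
  }
}
```

With `ttl: 3600000` and `maxSize: 1000`, a cache like this would keep at most 1000 responses, each for up to one hour.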
🧪 Testing
npm test
🛠️ Development
# Install dependencies
npm install
# Build the project
npm run build
# Watch mode
npm run dev
# Run examples
npx ts-node examples/smart-scraper-gemini.ts
📖 API Reference
Models
- OllamaModel - Local LLM support
- GeminiModel - Google Gemini integration
- createLLM(config) - Factory function
Graphs
- SmartScraperGraph - Basic web scraping
- SmartScraperMultiGraph - Multi-URL scraping
- SearchGraph - Search + scrape
- DepthSearchGraph - Deep search with reasoning
- CSVScraperGraph - Export to CSV
- JSONScraperGraph - Export to JSON
Nodes
- FetchNode - Fetch content
- ParseNode - Parse & chunk
- GenerateAnswerNode - LLM answer generation
- RAGNode - Retrieval-Augmented Generation
- SearchNode - Internet search
- MergeNode - Merge results
- PDFScraperNode - PDF extraction
- XMLScraperNode - XML parsing
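The node names above suggest a pipeline in which each node reads a shared state and contributes new keys to it. The README does not document the actual node interface, so the following is only a conceptual sketch: `PipelineNode`, `PipelineState`, `runPipeline`, and both stub nodes are hypothetical, and the stubs are synchronous stand-ins for what would be asynchronous work in practice.

```typescript
// Hypothetical types; CogniScrape's real node interface may differ.
type PipelineState = Record<string, unknown>;

interface PipelineNode {
  name: string;
  // Receives the accumulated state and returns the keys it adds or updates.
  execute(state: PipelineState): PipelineState;
}

// Runs nodes in order, merging each node's output into the shared state.
function runPipeline(nodes: PipelineNode[], initial: PipelineState): PipelineState {
  return nodes.reduce<PipelineState>(
    (state, node) => ({ ...state, ...node.execute(state) }),
    initial,
  );
}

// Stand-in for a fetch step: fabricates HTML instead of making a request.
const fetchStub: PipelineNode = {
  name: 'fetch',
  execute: (state) => ({ html: `<h1>${String(state['url'])}</h1>` }),
};

// Stand-in for a parse step: strips tags from the fetched HTML.
const parseStub: PipelineNode = {
  name: 'parse',
  execute: (state) => ({ text: String(state['html']).replace(/<[^>]+>/g, '') }),
};

const pipelineResult = runPipeline([fetchStub, parseStub], {
  url: 'https://example.com',
});
```

Because every node shares the same shape, nodes can be reordered or reused across graphs, which is the main appeal of a graph-based design.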
🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
📄 License
MIT License - see LICENSE file for details
📬 Support
- 📧 Email: bonderiddhish@gmail.com
- 🐛 Issues: GitHub Issues
- 💬 Discussions: GitHub Discussions
Made with ❤️ for the TypeScript community
