🕷️ CogniScrape

Intelligent Web Scraping with LLMs - A TypeScript library that combines traditional web scraping with Large Language Models for intelligent, structured data extraction.

✨ Features

🤖 Dual LLM Support: Ollama (free/local) + Google Gemini (cloud)
📊 Graph-Based Architecture: Composable, reusable node pipelines
🚀 Production-Ready: Built-in caching, retries, rate limiting, and proxy rotation
🎯 Smart Parsing: Automatic HTML→Markdown conversion and intelligent chunking
✅ Schema Validation: Zod integration for type-safe outputs
📝 Multiple Formats: JSON, CSV, XML, PDF support
🌐 Browser Automation: Playwright for dynamic content
🧠 RAG Integration: Retrieval-Augmented Generation for better accuracy

📦 Installation

npm install cogniscrape

🚀 Quick Start

Basic Web Scraping with Gemini

import { SmartScraperGraph } from 'cogniscrape';

const scraper = new SmartScraperGraph({
  prompt: 'Extract all product names and prices',
  source: 'https://example.com/products',
  config: {
    llm: {
      provider: 'gemini',
      model: 'gemini-2.0-flash-exp',
      apiKey: process.env.GEMINI_API_KEY,
    },
    verbose: true,
  },
});

const result = await scraper.run();
console.log(result);

Using Ollama (100% Free & Local)

import { SmartScraperGraph } from 'cogniscrape';

const scraper = new SmartScraperGraph({
  prompt: 'List all article titles and summaries',
  source: 'https://news.example.com',
  config: {
    llm: {
      provider: 'ollama',
      model: 'llama2',  // or 'mistral', 'codellama', etc.
      baseUrl: 'http://localhost:11434',
    },
  },
});

const result = await scraper.run();

🎯 Available Graphs

| Graph | Purpose | Use Case |
|-------|---------|----------|
| SmartScraperGraph | Basic scraping | Extract data from single URL |
| SmartScraperMultiGraph | Multi-URL scraping | Scrape multiple sources (parallel/sequential) |
| SearchGraph | Internet search + scrape | Search engines + content extraction |
| DepthSearchGraph | Deep analysis | Search + reasoning + comprehensive analysis |
| CSVScraperGraph | CSV export | Scrape data → export to CSV |
| JSONScraperGraph | JSON export | Schema-validated JSON output |

📚 Examples

Multi-URL Scraping

import { SmartScraperMultiGraph, createLLM } from 'cogniscrape';

const llm = createLLM({
  provider: 'gemini',
  model: 'gemini-2.0-flash-exp',
  apiKey: process.env.GEMINI_API_KEY,
});

const scraper = new SmartScraperMultiGraph(
  'Extract company names and descriptions',
  [
    'https://company1.com',
    'https://company2.com',
    'https://company3.com',
  ],
  { llm },
  llm,
  true // parallel execution
);

const result = await scraper.run();

CSV Export with Schema Validation

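import { CSVScraperGraph, createLLM } from 'cogniscrape';
import { z } from 'zod';

// Model instance passed as the fourth constructor argument below
const llm = createLLM({
  provider: 'gemini',
  model: 'gemini-2.0-flash-exp',
  apiKey: process.env.GEMINI_API_KEY,
});

// Expected output shape; rating is optional
const schema = z.object({
  products: z.array(z.object({
    name: z.string(),
    price: z.number(),
    rating: z.number().optional(),
  })),
});

const scraper = new CSVScraperGraph(
  'Extract all products with their prices',
  'https://shop.example.com',
  {
    llm: {
      provider: 'gemini',
      model: 'gemini-2.0-flash-exp',
      apiKey: process.env.GEMINI_API_KEY,
    },
    schema,
  },
  llm,
  'products.csv' // output file
);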

await scraper.run();
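
How CSVScraperGraph serializes the extracted records is not documented here. The following is a minimal sketch of row-to-CSV conversion, assuming RFC 4180 style quoting. `Row`, `escapeCsv`, and `toCsv` are hypothetical helpers for illustration, not CogniScrape APIs.

```typescript
// Hypothetical row type: one extracted record, keyed by column name.
type Row = Record<string, string | number | undefined>;

// Quote a field only when it contains a comma, quote, or line break.
function escapeCsv(value: string | number | undefined): string {
  if (value === undefined) return '';
  const text = String(value);
  return /[",\n\r]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Build a CSV string with a header row followed by one line per record.
function toCsv(rows: Row[], columns: string[]): string {
  const header = columns.map(escapeCsv).join(',');
  const lines = rows.map((row) => columns.map((c) => escapeCsv(row[c])).join(','));
  return [header, ...lines].join('\n');
}

const csv = toCsv(
  [
    { name: 'Desk, oak', price: 199.5, rating: 4.5 },
    { name: 'Lamp', price: 25 }, // optional rating missing -> empty field
  ],
  ['name', 'price', 'rating'],
);
console.log(csv);
```

Passing the column list explicitly keeps the column order stable even when optional fields such as `rating` are missing from some records.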

Internet Search Graph

import { SearchGraph, createLLM } from 'cogniscrape';

const llm = createLLM({
  provider: 'gemini',
  model: 'gemini-2.0-flash-exp',
  apiKey: process.env.GEMINI_API_KEY,
});

const searchGraph = new SearchGraph(
  'Latest news about AI developments in 2026',
  {
    llm,
    searchEngine: 'duckduckgo',
    maxDepth: 3,
  },
  llm
);

const result = await searchGraph.run();

⚙️ Configuration Options

interface ScraperConfig {
  llm: LLMConfig;
  verbose?: boolean;          // Enable logging
  headless?: boolean;         // Headless browser mode
  timeout?: number;           // Request timeout (ms)
  cut?: boolean;              // Enable HTML minification
  htmlMode?: boolean;         // Skip parsing (use raw HTML)
  
  // Production features
  proxy?: ProxyConfig;        // Proxy configuration
  retry?: RetryConfig;        // Retry with backoff
  rateLimit?: RateLimitConfig; // Rate limiting
  cache?: CacheConfig;        // Response caching
  
  // Advanced
  schema?: any;               // Zod schema for validation
  additionalInfo?: string;    // Extra context for LLM
  reasoning?: boolean;        // Enable reasoning mode
}

🔧 Production Features

Proxy Rotation

const config = {
  llm: { /* ... */ },
  proxy: {
    enabled: true,
    proxies: [
      'http://proxy1.com:8080',
      'http://proxy2.com:8080',
    ],
  },
};
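
The README does not say how a proxy is chosen for each request. One common strategy is round-robin, sketched below under that assumption. `ProxyConfig` mirrors the fields shown above, and `createProxyRotator` is a hypothetical helper, not part of the library.

```typescript
// Shape assumed from the proxy config shown above.
interface ProxyConfig {
  enabled: boolean;
  proxies: string[];
}

// Returns a function that yields the next proxy in round-robin order,
// or undefined when rotation is disabled or no proxies are configured.
function createProxyRotator(cfg: ProxyConfig): () => string | undefined {
  let index = 0;
  return () => {
    if (!cfg.enabled || cfg.proxies.length === 0) return undefined;
    const proxy = cfg.proxies[index % cfg.proxies.length];
    index += 1;
    return proxy;
  };
}

const nextProxy = createProxyRotator({
  enabled: true,
  proxies: ['http://proxy1.com:8080', 'http://proxy2.com:8080'],
});
console.log(nextProxy(), nextProxy(), nextProxy());
// http://proxy1.com:8080 http://proxy2.com:8080 http://proxy1.com:8080
```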

Retry with Exponential Backoff

const config = {
  llm: { /* ... */ },
  retry: {
    maxRetries: 3,
    initialDelay: 1000,
    maxDelay: 10000,
    backoffMultiplier: 2,
  },
};
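
To see what these settings imply, here is a minimal sketch of the delay schedule. It assumes the conventional formula `initialDelay * backoffMultiplier^attempt`, capped at `maxDelay`. The library's exact formula, and whether it adds jitter, is not stated in this README. `backoffDelay` and `retrySchedule` are hypothetical helpers.

```typescript
// Shape assumed from the retry config shown above (all delays in ms).
interface RetryConfig {
  maxRetries: number;
  initialDelay: number;
  maxDelay: number;
  backoffMultiplier: number;
}

// Delay before retry number `attempt` (0-based), capped at maxDelay.
function backoffDelay(cfg: RetryConfig, attempt: number): number {
  const raw = cfg.initialDelay * Math.pow(cfg.backoffMultiplier, attempt);
  return Math.min(raw, cfg.maxDelay);
}

// Full list of delays, one per allowed retry.
function retrySchedule(cfg: RetryConfig): number[] {
  return Array.from({ length: cfg.maxRetries }, (_, i) => backoffDelay(cfg, i));
}

const schedule = retrySchedule({
  maxRetries: 3,
  initialDelay: 1000,
  maxDelay: 10000,
  backoffMultiplier: 2,
});
console.log(schedule); // [ 1000, 2000, 4000 ]
```

Under this assumption, `maxDelay` only takes effect from the fifth retry onward (1000 × 2⁴ = 16000 ms, capped to 10000 ms), so with `maxRetries: 3` the cap is never reached.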

Rate Limiting

const config = {
  llm: { /* ... */ },
  rateLimit: {
    maxRequests: 10,
    windowMs: 1000,
    minDelay: 100,
  },
};
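
One plausible reading of these three options is a sliding window (`maxRequests` per `windowMs`) combined with a minimum gap between consecutive requests (`minDelay`). The sketch below implements that reading. `SlidingWindowLimiter` is a hypothetical class, and the library's actual algorithm may differ. Timestamps are passed in explicitly so the logic can be followed without real timers.

```typescript
// Shape assumed from the rate limit config shown above (times in ms).
interface RateLimitConfig {
  maxRequests: number;
  windowMs: number;
  minDelay: number;
}

class SlidingWindowLimiter {
  private cfg: RateLimitConfig;
  private sent: number[] = []; // timestamps of recorded requests, ascending

  constructor(cfg: RateLimitConfig) {
    this.cfg = cfg;
  }

  // How long (ms) a caller should wait at time `now` before sending.
  waitTime(now: number): number {
    // Forget requests that have left the window.
    this.sent = this.sent.filter((t) => now - t < this.cfg.windowMs);
    let wait = 0;
    const last = this.sent[this.sent.length - 1];
    if (last !== undefined) {
      wait = Math.max(wait, last + this.cfg.minDelay - now);
    }
    if (this.sent.length >= this.cfg.maxRequests) {
      // Wait until the request that fills the window has expired.
      const oldest = this.sent[this.sent.length - this.cfg.maxRequests];
      if (oldest !== undefined) {
        wait = Math.max(wait, oldest + this.cfg.windowMs - now);
      }
    }
    return wait;
  }

  // Record a request as sent at time `at`.
  record(at: number): void {
    this.sent.push(at);
  }
}

const limiter = new SlidingWindowLimiter({ maxRequests: 10, windowMs: 1000, minDelay: 100 });
limiter.record(0);
console.log(limiter.waitTime(50)); // 50
```

With the values shown above, `minDelay: 100` already spaces requests to at most 10 per second, so the window limit only becomes the binding constraint when `minDelay` is smaller than `windowMs / maxRequests`.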

Caching

const config = {
  llm: { /* ... */ },
  cache: {
    enabled: true,
    ttl: 3600000, // 1 hour
    maxSize: 1000,
  },
};
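
A minimal sketch of what a cache with these options might do: entries expire after `ttl` milliseconds, and the oldest entry is evicted once `maxSize` is reached. `TtlCache` is a hypothetical class, and the eviction policy is an assumption, since the README does not specify one. The current time is a parameter so expiry can be demonstrated deterministically.

```typescript
// Shape assumed from the cache config shown above (ttl in ms).
interface CacheConfig {
  enabled: boolean;
  ttl: number;
  maxSize: number;
}

class TtlCache<V> {
  private cfg: CacheConfig;
  // Map preserves insertion order, so the first key is the oldest entry.
  private entries = new Map<string, { value: V; expiresAt: number }>();

  constructor(cfg: CacheConfig) {
    this.cfg = cfg;
  }

  get(key: string, now: number = Date.now()): V | undefined {
    if (!this.cfg.enabled) return undefined;
    const hit = this.entries.get(key);
    if (hit === undefined) return undefined;
    if (now >= hit.expiresAt) {
      this.entries.delete(key); // expired: drop it and report a miss
      return undefined;
    }
    return hit.value;
  }

  set(key: string, value: V, now: number = Date.now()): void {
    if (!this.cfg.enabled) return;
    this.entries.delete(key); // re-inserting makes the key the newest entry
    if (this.entries.size >= this.cfg.maxSize) {
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(key, { value, expiresAt: now + this.cfg.ttl });
  }
}

const cache = new TtlCache<string>({ enabled: true, ttl: 3600000, maxSize: 1000 });
cache.set('https://example.com', '<html>...</html>', 0);
console.log(cache.get('https://example.com', 1000) !== undefined); // true
console.log(cache.get('https://example.com', 3600000)); // undefined
```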

🧪 Testing

npm test

🛠️ Development

# Install dependencies
npm install

# Build the project
npm run build

# Watch mode
npm run dev

# Run examples
npx ts-node examples/smart-scraper-gemini.ts

📖 API Reference

Models

OllamaModel - Local LLM support
GeminiModel - Google Gemini integration
createLLM(config) - Factory function

Graphs

SmartScraperGraph - Basic web scraping
SmartScraperMultiGraph - Multi-URL scraping
SearchGraph - Search + scrape
DepthSearchGraph - Deep search with reasoning
CSVScraperGraph - Export to CSV
JSONScraperGraph - Export to JSON

Nodes

FetchNode - Fetch content
ParseNode - Parse & chunk
GenerateAnswerNode - LLM answer generation
RAGNode - Retrieval-Augmented Generation
SearchNode - Internet search
MergeNode - Merge results
PDFScraperNode - PDF extraction
XMLScraperNode - XML parsing
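
To illustrate the idea of composing nodes into a pipeline, here is a minimal sketch in which each node reads a shared state object and returns the keys it adds. Everything in it (`SketchNode`, `runPipeline`, the stub nodes) is hypothetical and does not reflect the actual signatures of FetchNode, ParseNode, or the other classes listed above.

```typescript
// Shared state passed from node to node.
type State = Record<string, unknown>;

// Hypothetical node contract: take the current state, return new keys.
interface SketchNode {
  name: string;
  run(state: State): State;
}

// Run nodes in order, merging each node's output into the state.
function runPipeline(nodes: SketchNode[], initial: State): State {
  return nodes.reduce<State>(
    (state, node) => ({ ...state, ...node.run(state) }),
    initial,
  );
}

// Stub standing in for a fetch step: fabricates HTML instead of using the network.
const fetchStub: SketchNode = {
  name: 'fetch',
  run: (state) => ({ html: `<h1>${String(state['source'])}</h1>` }),
};

// Stub standing in for a parse step: strips tags from the HTML.
const parseStub: SketchNode = {
  name: 'parse',
  run: (state) => ({ text: String(state['html']).replace(/<[^>]+>/g, '') }),
};

const finalState = runPipeline([fetchStub, parseStub], { source: 'https://example.com' });
console.log(finalState['text']); // https://example.com
```

Because each node only depends on keys in the state, the same node can be reused in several graphs, which is the property the graph-based design is meant to provide.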

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📄 License

MIT License - see LICENSE file for details

📬 Support

📧 Email: bonderiddhish@gmail.com
🐛 Issues: GitHub Issues
💬 Discussions: GitHub Discussions

Made with ❤️ for the TypeScript community
