# Extractor

Using LLMs and AI browser automation to robustly extract web data
## Overview

Lightfeed Extractor is a TypeScript library built for robust web data extraction using LLMs and Playwright. Use natural language prompts to navigate web pages and extract structured data. Get complete, accurate results with great token efficiency — critical for production data pipelines.
## Features

- 🤖 **Browser Automation in Stealth Mode** - Launch Playwright browsers locally, in serverless clouds, or connect to a remote browser server. Avoid detection with built-in anti-bot patches and proxy configuration for reliable web scraping.
- 🧭 **AI Browser Navigation** - Pair with `@lightfeed/browser-agent` to navigate pages using natural language commands before extracting structured data.
- 🧹 **LLM-ready Markdown** - Convert HTML to LLM-ready markdown, with options to extract only main content and clean URLs by removing tracking parameters.
- ⚡️ **LLM Extraction** - Use LLMs in JSON mode to extract structured data according to an input Zod schema. Token usage limits and tracking are included.
- 🛠️ **JSON Recovery** - Sanitize and recover failed JSON output. This makes complex schema extraction much more robust, especially with deeply nested objects and arrays.
- 🔗 **URL Validation** - Handle relative URLs, remove invalid ones, and repair markdown-escaped links.
> [!TIP]
> Building retail competitor intelligence at scale? Go to [app.lightfeed.ai](https://app.lightfeed.ai), our full platform for tracking competitor pricing, sales, promotions, and SEO across 1,000+ retail chains, and get started for free. For generic web data pipelines with AI enrichment and workflow automation, check out [lightfeed.ai](https://lightfeed.ai).
## Installation

Install the extractor:

```bash
npm install @lightfeed/extractor
```

Then install the LLM provider you want to use:

```bash
# OpenAI
npm install @langchain/openai

# Google Gemini
npm install @langchain/google-genai

# Anthropic
npm install @langchain/anthropic

# Ollama (local models)
npm install @langchain/ollama
```
`@langchain/core` will be installed automatically as a peer dependency.
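The usage examples below read the provider API key from an environment variable (`GOOGLE_API_KEY` in this README's Gemini examples); set the one matching your chosen provider before running them. The other variable names shown are the conventional ones for each LangChain provider package:

```shell
# Google Gemini (used in the examples in this README)
export GOOGLE_API_KEY="your-api-key"

# Or, depending on your provider:
# export OPENAI_API_KEY="your-api-key"     # OpenAI
# export ANTHROPIC_API_KEY="your-api-key"  # Anthropic
```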
## Usage

### E-commerce Product Extraction

This example demonstrates extracting structured product data from a real e-commerce website using a local headed Playwright browser. For production environments, you can run the Playwright browser in serverless or remote mode.
```typescript
import { ChatGoogleGenerativeAI } from "@langchain/google-genai";
import { extract, ContentFormat, Browser } from "@lightfeed/extractor";
import { z } from "zod";

// Define schema for product catalog extraction
const productCatalogSchema = z.object({
  products: z
    .array(
      z.object({
        name: z.string().describe("Product name or title"),
        brand: z.string().optional().describe("Brand name"),
        price: z.number().describe("Current price"),
        originalPrice: z
          .number()
          .optional()
          .describe("Original price if on sale"),
        rating: z.number().optional().describe("Product rating out of 5"),
        reviewCount: z.number().optional().describe("Number of reviews"),
        productUrl: z.string().url().describe("Link to product detail page"),
        imageUrl: z.string().url().optional().describe("Product image URL"),
      })
    )
    .describe("List of bread and bakery products"),
});

// Create browser instance
const browser = new Browser({
  type: "local", // serverless and remote browsers are also supported
  headless: false,
});

try {
  await browser.start();
  console.log("Browser started successfully");

  // Create page and navigate to e-commerce site
  const page = await browser.newPage();
  const pageUrl =
    "https://www.walmart.ca/en/browse/grocery/bread-bakery/10019_6000194327359";
  await page.goto(pageUrl);
  try {
    await page.waitForLoadState("networkidle", { timeout: 10000 });
  } catch {
    console.log("Network idle timeout, continuing...");
  }

  // Get HTML content
  const html = await page.content();
  console.log(`Loaded ${html.length} characters of HTML`);

  // Extract structured product data
  console.log("Extracting product data using LLM...");
  const result = await extract({
    llm: new ChatGoogleGenerativeAI({
      apiKey: process.env.GOOGLE_API_KEY,
      model: "gemini-2.5-flash",
      temperature: 0,
    }),
    content: html,
    format: ContentFormat.HTML,
    sourceUrl: pageUrl,
    schema: productCatalogSchema,
    htmlExtractionOptions: {
      extractMainHtml: true,
      includeImages: true,
      cleanUrls: true,
    },
  });

  console.log("Extraction successful!");
  console.log("Found products:", result.data.products.length);

  // Print the extracted data
  console.log(JSON.stringify(result.data, null, 2));
} catch (error) {
  console.error("Error during extraction:", error);
} finally {
  await browser.close();
  console.log("Browser closed");
}

/* Expected output:
{
  "products": [
    {
      "name": "Dempster's® Signature The Classic Burger Buns, Pack of 8; 568 g",
      "brand": "Dempster's",
      "price": 3.98,
      "originalPrice": 4.57,
      "rating": 4.7376,
      "reviewCount": 141,
      "productUrl": "https://www.walmart.ca/en/ip/dempsters-signature-the-classic-burger-buns/6000188080451?classType=REGULAR&athbdg=L1300",
      "imageUrl": "https://i5.walmartimages.ca/images/Enlarge/725/979/6000196725979.jpg?odnHeight=580&odnWidth=580&odnBg=FFFFFF"
    },
    ... (more products)
  ]
}
*/
```
> [!TIP]
> Run `npm run test:browser` to execute this example, or view the complete code in `testBrowserExtraction.ts`.
### Using with Browser Agent
For pages that require interaction before extraction — searching, clicking through pagination, dismissing popups, etc. — you can pair this library with `@lightfeed/browser-agent`. The browser agent uses AI to navigate pages via natural language commands, and this library extracts structured data from the result.
Install both packages:

```bash
npm install @lightfeed/extractor @lightfeed/browser-agent
```
Then use the browser agent to navigate and the extractor to pull structured data:
```typescript
import { BrowserAgent } from "@lightfeed/browser-agent";
import { ChatGoogleGenerativeAI } from "@langchain/google-genai";
import { extract, ContentFormat } from "@lightfeed/extractor";
import { z } from "zod";

const schema = z.object({
  products: z.array(
    z.object({
      name: z.string(),
      price: z.number(),
      rating: z.number().optional(),
      productUrl: z.string().url(),
    })
  ),
});

// 1. Use browser agent to navigate with AI
const agent = new BrowserAgent({ browserProvider: "Local" });
const page = await agent.newPage();
await page.goto("https://amazon.com");
await page.ai("Search for 'organic coffee' and go to the second page of results");

// 2. Extract structured data from the resulting page
const html = await page.content();
const result = await extract({
  llm: new ChatGoogleGenerativeAI({
    model: "gemini-2.5-flash",
    apiKey: process.env.GOOGLE_API_KEY,
    temperature: 0,
  }),
  content: html,
  format: ContentFormat.HTML,
  sourceUrl: page.url(),
  schema,
  prompt: "Extract all product listings from the search results",
  htmlExtractionOptions: {
    extractMainHtml: true,
    includeImages: true,
    cleanUrls: true,
  },
});

console.log(result.data.products);

await agent.close();
```
The browser agent supports local, serverless, and remote browsers — see the browser-agent docs for configuration options.
### Extracting from Markdown or Plain Text

You can also extract structured data directly from an HTML, Markdown, or plain-text string. Pass any LangChain chat model:
```typescript
import { ChatGoogleGenerativeAI } from "@langchain/google-genai";
import { extract, ContentFormat } from "@lightfeed/extractor";

const result = await extract({
  llm: new ChatGoogleGenerativeAI({
    apiKey: process.env.GOOGLE_API_KEY,
    model: "gemini-2.5-flash",
    temperature: 0,
  }),
  content: markdownContent,
  format: ContentFormat.MARKDOWN,
  schema: mySchema,
});
```
### Custom Extraction Prompts

You can provide a custom prompt to guide the extraction process:
```typescript
const result = await extract({
  llm: myLLM,
  content: htmlContent,
  format: ContentFormat.HTML,
  schema: mySchema,
  sourceUrl: "https://example.com/products",
  prompt:
    "Extract ONLY products that are on sale or have special discounts. Include their original prices, discounted prices, and product URL.",
});
```
If no prompt is provided, a default extraction prompt will be used.