# Extractor

Using LLMs and AI browser automation to robustly extract web data
## Overview

Lightfeed Extractor is a TypeScript library built for robust web data extraction using LLMs and Playwright. Use natural language prompts to navigate web pages and extract structured data. Get complete, accurate results with great token efficiency — critical for production data pipelines.
## Features

- 🤖 **Browser Automation in Stealth Mode** - Launch Playwright browsers locally, in serverless clouds, or connect to a remote browser server. Avoid detection with built-in anti-bot patches and proxy configuration for reliable web scraping.
- 🧭 **AI Browser Navigation** - Pair with `@lightfeed/browser-agent` to navigate pages using natural language commands before extracting structured data.
- 🧹 **LLM-ready Markdown** - Convert HTML to LLM-ready markdown, with options to extract only main content and clean URLs by removing tracking parameters.
- ⚡️ **LLM Extraction** - Use LLMs in JSON mode to extract structured data according to an input Zod schema. Token usage limits and tracking are included.
- 🛠️ **JSON Recovery** - Sanitize and recover failed JSON output. This makes complex schema extraction much more robust, especially with deeply nested objects and arrays.
- 🔗 **URL Validation** - Handle relative URLs, remove invalid ones, and repair markdown-escaped links.
> [!TIP]
> Building retail competitor intelligence at scale? Go to [app.lightfeed.ai](https://app.lightfeed.ai), our full platform for tracking competitor pricing, sales, promotions, and SEO across 1,000+ retail chains, and get started for free. For generic web data pipelines with AI enrichment and workflow automation, check out [lightfeed.ai](https://lightfeed.ai).
## Installation

Install the extractor:

```bash
npm install @lightfeed/extractor
```

Then install the LLM provider you want to use:

```bash
# OpenAI
npm install @langchain/openai

# Google Gemini
npm install @langchain/google-genai

# Anthropic
npm install @langchain/anthropic

# Ollama (local models)
npm install @langchain/ollama
```
`@langchain/core` will be installed automatically as a peer dependency.
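The usage examples below read the provider API key from an environment variable (`GOOGLE_API_KEY` in this README's Gemini examples); set the one matching your chosen provider before running them. The other variable names shown are the conventional ones for each LangChain provider package:

```shell
# Google Gemini (used in the examples in this README)
export GOOGLE_API_KEY="your-api-key"

# Or, depending on your provider:
# export OPENAI_API_KEY="your-api-key"     # OpenAI
# export ANTHROPIC_API_KEY="your-api-key"  # Anthropic
```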
## Usage

### E-commerce Product Extraction

This example demonstrates extracting structured product data from a real e-commerce website using a local headed Playwright browser. For production environments, you can run the Playwright browser in serverless or remote mode.
```typescript
import { ChatGoogleGenerativeAI } from "@langchain/google-genai";
import { extract, ContentFormat, Browser } from "@lightfeed/extractor";
import { z } from "zod";

// Define schema for product catalog extraction
const productCatalogSchema = z.object({
  products: z
    .array(
      z.object({
        name: z.string().describe("Product name or title"),
        brand: z.string().optional().describe("Brand name"),
        price: z.number().describe("Current price"),
        originalPrice: z
          .number()
          .optional()
          .describe("Original price if on sale"),
        rating: z.number().optional().describe("Product rating out of 5"),
        reviewCount: z.number().optional().describe("Number of reviews"),
        productUrl: z.string().url().describe("Link to product detail page"),
        imageUrl: z.string().url().optional().describe("Product image URL"),
      })
    )
    .describe("List of bread and bakery products"),
});

// Create browser instance
const browser = new Browser({
  type: "local", // serverless and remote browsers are also supported
  headless: false,
});

try {
  await browser.start();
  console.log("Browser started successfully");

  // Create page and navigate to e-commerce site
  const page = await browser.newPage();
  const pageUrl =
    "https://www.walmart.ca/en/browse/grocery/bread-bakery/10019_6000194327359";
  await page.goto(pageUrl);
  try {
    await page.waitForLoadState("networkidle", { timeout: 10000 });
  } catch {
    console.log("Network idle timeout, continuing...");
  }

  // Get HTML content
  const html = await page.content();
  console.log(`Loaded ${html.length} characters of HTML`);

  // Extract structured product data
  console.log("Extracting product data using LLM...");
  const result = await extract({
    llm: new ChatGoogleGenerativeAI({
      apiKey: process.env.GOOGLE_API_KEY,
      model: "gemini-2.5-flash",
      temperature: 0,
    }),
    content: html,
    format: ContentFormat.HTML,
    sourceUrl: pageUrl,
    schema: productCatalogSchema,
    htmlExtractionOptions: {
      extractMainHtml: true,
      includeImages: true,
      cleanUrls: true,
    },
  });

  console.log("Extraction successful!");
  console.log("Found products:", result.data.products.length);

  // Print the extracted data
  console.log(JSON.stringify(result.data, null, 2));
} catch (error) {
  console.error("Error during extraction:", error);
} finally {
  await browser.close();
  console.log("Browser closed");
}

/* Expected output:
{
  "products": [
    {
      "name": "Dempster's® Signature The Classic Burger Buns, Pack of 8; 568 g",
      "brand": "Dempster's",
      "price": 3.98,
      "originalPrice": 4.57,
      "rating": 4.7376,
      "reviewCount": 141,
      "productUrl": "https://www.walmart.ca/en/ip/dempsters-signature-the-classic-burger-buns/6000188080451?classType=REGULAR&athbdg=L1300",
      "imageUrl": "https://i5.walmartimages.ca/images/Enlarge/725/979/6000196725979.jpg?odnHeight=580&odnWidth=580&odnBg=FFFFFF"
    },
    ... (more products)
  ]
}
*/
```
> [!TIP]
> Run `npm run test:browser` to execute this example, or view the complete code in `testBrowserExtraction.ts`.
### Using with Browser Agent
For pages that require interaction before extraction — searching, clicking through pagination, dismissing popups, etc. — you can pair this library with `@lightfeed/browser-agent`. The browser agent uses AI to navigate pages via natural language commands, and this library extracts structured data from the result.
Install both packages:

```bash
npm install @lightfeed/extractor @lightfeed/browser-agent
```
Then use the browser agent to navigate and the extractor to pull structured data:
```typescript
import { BrowserAgent } from "@lightfeed/browser-agent";
import { ChatGoogleGenerativeAI } from "@langchain/google-genai";
import { extract, ContentFormat } from "@lightfeed/extractor";
import { z } from "zod";

const schema = z.object({
  products: z.array(
    z.object({
      name: z.string(),
      price: z.number(),
      rating: z.number().optional(),
      productUrl: z.string().url(),
    })
  ),
});

// 1. Use browser agent to navigate with AI
const agent = new BrowserAgent({ browserProvider: "Local" });
const page = await agent.newPage();
await page.goto("https://amazon.com");
await page.ai("Search for 'organic coffee' and go to the second page of results");

// 2. Extract structured data from the resulting page
const html = await page.content();
const result = await extract({
  llm: new ChatGoogleGenerativeAI({
    model: "gemini-2.5-flash",
    apiKey: process.env.GOOGLE_API_KEY,
    temperature: 0,
  }),
  content: html,
  format: ContentFormat.HTML,
  sourceUrl: page.url(),
  schema,
  prompt: "Extract all product listings from the search results",
  htmlExtractionOptions: {
    extractMainHtml: true,
    includeImages: true,
    cleanUrls: true,
  },
});

console.log(result.data.products);

await agent.close();
```
The browser agent supports local, serverless, and remote browsers — see the browser-agent docs for configuration options.
### Extracting from Markdown or Plain Text

You can also extract structured data directly from an HTML, Markdown, or plain-text string. Pass any LangChain chat model:
```typescript
import { ChatGoogleGenerativeAI } from "@langchain/google-genai";
import { extract, ContentFormat } from "@lightfeed/extractor";

const result = await extract({
  llm: new ChatGoogleGenerativeAI({
    apiKey: process.env.GOOGLE_API_KEY,
    model: "gemini-2.5-flash",
    temperature: 0,
  }),
  content: markdownContent,
  format: ContentFormat.MARKDOWN,
  schema: mySchema,
});
```
### Custom Extraction Prompts

You can provide a custom prompt to guide the extraction process:
```typescript
const result = await extract({
  llm: myLLM,
  content: htmlContent,
  format: ContentFormat.HTML,
  schema: mySchema,
  sourceUrl: "https://example.com/products",
  prompt:
    "Extract ONLY products that are on sale or have special discounts. Include their original prices, discounted prices, and product URL.",
});
```
If no prompt is provided, a default extraction prompt will be used.