ExtractThinker

ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.

Install / Use

/learn @enoch3712/ExtractThinker

ExtractThinker

ExtractThinker is a flexible document intelligence tool that leverages Large Language Models (LLMs) to extract and classify structured data from documents, functioning like an ORM for seamless document processing workflows.

TL;DR: Document Intelligence for LLMs

🚀 Key Features

Flexible Document Loaders: Support for multiple document loaders, including Tesseract OCR, Azure Form Recognizer, AWS Textract, Google Document AI, and more.
Customizable Contracts: Define custom extraction contracts using Pydantic models for precise data extraction.
Advanced Classification: Classify documents or document sections using custom classifications and strategies.
Asynchronous Processing: Utilize asynchronous processing for efficient handling of large documents.
Multi-format Support: Seamlessly work with various document formats like PDFs, images, spreadsheets, and more.
ORM-style Interaction: Interact with documents and LLMs in an ORM-like fashion for intuitive development.
Splitting Strategies: Implement lazy or eager splitting strategies to process documents page by page or as a whole.
Integration with LLMs: Easily integrate with different LLM providers like OpenAI, Anthropic, Cohere, and more.
Community-driven Development: Inspired by the LangChain ecosystem with a focus on intelligent document processing.
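
To make the difference between the lazy and eager splitting strategies listed above concrete, here is a minimal sketch in plain Python. It does not use ExtractThinker itself: `Page`, `belongs_together`, `split_eager`, and `split_lazy` are hypothetical names invented for illustration, and the boundary check is a simple string comparison standing in for whatever decision a real splitter makes.

```python
from typing import Iterable, Iterator, List

Page = str  # stand-in for page content; a real page would hold text or an image


def belongs_together(prev: Page, curr: Page) -> bool:
    """Hypothetical boundary check; a real splitter would likely ask an LLM."""
    return prev.split(":")[0] == curr.split(":")[0]


def split_eager(pages: List[Page]) -> List[List[Page]]:
    """Eager: take the whole document, then return every group at once."""
    groups: List[List[Page]] = []
    for page in pages:
        if groups and belongs_together(groups[-1][-1], page):
            groups[-1].append(page)
        else:
            groups.append([page])
    return groups


def split_lazy(pages: Iterable[Page]) -> Iterator[List[Page]]:
    """Lazy: consume pages one at a time and yield each group as soon as it closes."""
    current: List[Page] = []
    for page in pages:
        if current and not belongs_together(current[-1], page):
            yield current
            current = []
        current.append(page)
    if current:
        yield current


pages = ["invoice:p1", "invoice:p2", "license:p1"]
eager_groups = split_eager(pages)
lazy_groups = list(split_lazy(iter(pages)))
```

Both functions produce the same groups here. The lazy version never needs the full page list in memory, which is the usual reason to prefer it for large documents.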

📦 Installation

Install ExtractThinker using pip:

pip install extract_thinker

🛠️ Usage

Basic Extraction Example

Here's a quick example to get you started with ExtractThinker. This example demonstrates how to load a document using PyPdf and extract specific fields defined in a contract.

import os
from dotenv import load_dotenv
from extract_thinker import Extractor, DocumentLoaderPyPdf, Contract

load_dotenv()

class InvoiceContract(Contract):
    invoice_number: str
    invoice_date: str

# Path to the document you want to process
test_file_path = os.path.join("path_to_your_files", "invoice.pdf")

# Initialize the extractor
extractor = Extractor()
extractor.load_document_loader(DocumentLoaderPyPdf())
extractor.load_llm("gpt-4o-mini")  # or any other supported model

# Extract data from the document
result = extractor.extract(test_file_path, InvoiceContract)

print("Invoice Number:", result.invoice_number)
print("Invoice Date:", result.invoice_date)

Classification Example

ExtractThinker allows you to classify documents or parts of documents using custom classifications:

from dotenv import load_dotenv
from extract_thinker import Extractor, Classification, DocumentLoaderPyPdf, Contract

load_dotenv()

class InvoiceContract(Contract):
    invoice_number: str
    invoice_date: str

class DriverLicenseContract(Contract):
    name: str
    license_number: str

# Initialize the extractor and load the document loader
extractor = Extractor()
extractor.load_document_loader(DocumentLoaderPyPdf())
extractor.load_llm("gpt-4o-mini")

# Define classifications
classifications = [
    Classification(
        name="Invoice",
        description="An invoice document",
        contract=InvoiceContract,
        extractor=extractor,
    ),
    Classification(
        name="Driver License",
        description="A driver's license document",
        contract=DriverLicenseContract,
        extractor=extractor,
    ),
]

# Classify the document directly using the extractor
result = extractor.classify(
    "path_to_your_document.pdf",  # Can be a file path or IO stream
    classifications,
    image=True  # Set to True for image-based classification
)

# The result will be a ClassificationResponse object with 'name' and 'confidence' fields
print(f"Document classified as: {result.name}")
print(f"Confidence level: {result.confidence}")
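
A common follow-up is to act on the confidence value rather than just print it. The sketch below shows one way to route low-confidence results to manual review. `FakeClassificationResponse` and `accept_or_review` are hypothetical stand-ins, not part of ExtractThinker, and the numeric scale and threshold are assumptions; check what range `confidence` actually takes in your version before choosing a cutoff.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeClassificationResponse:
    """Stand-in exposing the two fields read in the example above."""
    name: str
    confidence: float  # assumed numeric; the real scale may differ


def accept_or_review(result, min_confidence: float) -> Optional[str]:
    """Return the class name if confident enough, else None to flag for review."""
    if result.confidence >= min_confidence:
        return result.name
    return None


confident = FakeClassificationResponse(name="Invoice", confidence=9)
unsure = FakeClassificationResponse(name="Driver License", confidence=3)

accepted = accept_or_review(confident, min_confidence=7)
flagged = accept_or_review(unsure, min_confidence=7)
```

Because `accept_or_review` only reads `.name` and `.confidence`, the same function should work on any object with those two attributes.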

Splitting Files Example

ExtractThinker allows you to split and process documents using different strategies. Here's how you can split a document and extract data based on classifications.

import os
from dotenv import load_dotenv
from extract_thinker import (
    Extractor,
    Process,
    Classification,
    ImageSplitter,
    DocumentLoaderPyPdf,
    Contract,
    SplittingStrategy,
)

load_dotenv()

class DriverLicenseContract(Contract):
    name: str
    license_number: str

class InvoiceContract(Contract):
    invoice_number: str
    invoice_date: str

# Initialize the extractor and load the document loader
extractor = Extractor()
extractor.load_document_loader(DocumentLoaderPyPdf())
extractor.load_llm("gpt-4o-mini")

# Define classifications
classifications = [
    Classification(
        name="Driver License",
        description="A driver's license document",
        contract=DriverLicenseContract,
        extractor=extractor,
    ),
    Classification(
        name="Invoice",
        description="An invoice document",
        contract=InvoiceContract,
        extractor=extractor,
    ),
]

# Initialize the process and load the splitter
process = Process()
process.load_document_loader(DocumentLoaderPyPdf())
process.load_splitter(ImageSplitter(model="gpt-4o-mini"))

# Load and process the document
path_to_document = "path_to_your_multipage_document.pdf"
split_content = (
    process.load_file(path_to_document)
    .split(classifications, strategy=SplittingStrategy.LAZY)
    .extract()
)

# Process the extracted content as needed
for item in split_content:
    if isinstance(item, InvoiceContract):
        print("Extracted Invoice:")
        print("Invoice Number:", item.invoice_number)
        print("Invoice Date:", item.invoice_date)
    elif isinstance(item, DriverLicenseContract):
        print("Extracted Driver License:")
        print("Name:", item.name)
        print("License Number:", item.license_number)
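
When the number of contract types grows, the `isinstance` chain above can become long. One possible alternative is a dispatch table keyed by type, sketched here with plain dataclasses standing in for the contracts. All names (`InvoiceStandIn`, `DriverLicenseStandIn`, `HANDLERS`, `describe_all`) are hypothetical. Note that this looks up the exact type, so unlike `isinstance` it does not match subclasses.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class InvoiceStandIn:
    invoice_number: str
    invoice_date: str


@dataclass
class DriverLicenseStandIn:
    name: str
    license_number: str


def describe_invoice(item: InvoiceStandIn) -> str:
    return f"Invoice {item.invoice_number} dated {item.invoice_date}"


def describe_license(item: DriverLicenseStandIn) -> str:
    return f"License {item.license_number} for {item.name}"


# Map each result type to the function that handles it
HANDLERS: Dict[type, Callable[..., str]] = {
    InvoiceStandIn: describe_invoice,
    DriverLicenseStandIn: describe_license,
}


def describe_all(items: List[object]) -> List[str]:
    """Apply the matching handler to each item; skip items with no handler."""
    summaries = []
    for item in items:
        handler = HANDLERS.get(type(item))
        if handler is not None:
            summaries.append(handler(item))
    return summaries


summaries = describe_all([
    InvoiceStandIn("A-17", "2024-05-01"),
    DriverLicenseStandIn("Jane Doe", "X123"),
    "unrecognized page",
])
```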

Batch Processing Example

You can also perform batch processing of documents:

import asyncio

from extract_thinker import Extractor, Contract

class ReceiptContract(Contract):
    store_name: str
    total_amount: float

extractor = Extractor()
extractor.load_llm("gpt-4o-mini")

# Document to process (a single file path in this example)
document = "receipt1.jpg"

batch_job = extractor.extract_batch(
    source=document,
    response_model=ReceiptContract,
    vision=True,
)

async def main():
    # Monitor the batch job status
    print("Batch Job Status:", await batch_job.get_status())

    # Retrieve results once processing is complete
    results = await batch_job.get_result()
    for result in results.parsed_results:
        print("Store Name:", result.store_name)
        print("Total Amount:", result.total_amount)

# In a script, `await` is only valid inside a coroutine, so run it through asyncio
asyncio.run(main())
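
Batch jobs usually take a while, so in practice you may want to poll the status until the job finishes. The following is a minimal sketch of such a loop using only `asyncio`. `FakeBatchJob` is a hypothetical stand-in that mimics the two awaitable methods used above, and the status strings `"processing"` and `"completed"` are assumptions; substitute the values your batch job actually reports.

```python
import asyncio


class FakeBatchJob:
    """Hypothetical stand-in exposing async get_status() and get_result()."""

    def __init__(self, polls_until_done: int):
        self._remaining = polls_until_done

    async def get_status(self) -> str:
        # Report "processing" a fixed number of times, then "completed"
        if self._remaining > 0:
            self._remaining -= 1
            return "processing"
        return "completed"

    async def get_result(self) -> list:
        return ["receipt-1"]


async def wait_for(job, interval: float = 0.01, max_polls: int = 100):
    """Poll until the job reports completion, then fetch its result."""
    for _ in range(max_polls):
        if await job.get_status() == "completed":
            return await job.get_result()
        await asyncio.sleep(interval)
    raise TimeoutError("batch job did not complete in time")


results = asyncio.run(wait_for(FakeBatchJob(polls_until_done=2)))
```

The `max_polls` bound keeps the loop from running forever if a job never completes; a real deployment would likely use a much longer interval.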

Local LLM Integration Example

ExtractThinker supports custom LLM integrations. Here's how you can use a custom LLM:

import os

from extract_thinker import Extractor, LLM, DocumentLoaderTesseract, Contract

class InvoiceContract(Contract):
    invoice_number: str
    invoice_date: str

# Initialize the extractor
extractor = Extractor()
extractor.load_document_loader(DocumentLoaderTesseract(os.getenv("TESSERACT_PATH")))

# Load a custom LLM (e.g., Ollama)
os.environ['API_BASE'] = "http://localhost:11434"
llm = LLM('ollama/phi3')
extractor.load_llm(llm)

# Extract data
result = extractor.extract("invoice.png", InvoiceContract)
print("Invoice Number:", result.invoice_number)
print("Invoice Date:", result.invoice_date)

📚 Documentation and Resources

Examples: Check out the examples directory for Jupyter notebooks and scripts demonstrating various use cases.
Medium Articles: Read articles about ExtractThinker on the author's Medium page.
Test Suite: Explore the test suite in the tests/ directory for more advanced usage examples and test cases.

🧩 Integration with LLM Providers

ExtractThinker supports integration with multiple LLM providers:

OpenAI: Use models like gpt-3.5-turbo, gpt-4, etc.
Anthropic: Integrate with Claude models.
Cohere: Utilize Cohere's language models.
Azure OpenAI: Connect with Azure's OpenAI services.
Local Models: Ollama compatible models.

⚙️ How It Works

ExtractThinker uses a modular architecture inspired by the LangChain ecosystem:

Document Loaders: Responsible for loading and preprocessing documents from various sources and formats.
Extractors: Orchestrate the interaction between the document loaders and LLMs to extract structured data.
Splitters: Implement strategies to split documents into manageable chunks for processing.
Contracts: Define the expected structure of the extracted data using Pydantic models.
Classifications: Classify documents or document sections to apply appropriate extraction contracts.
Processes: Manage the workflow of loading, classifying, splitting, and extracting data from documents.
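
The way these components fit together can be sketched in a few lines of plain Python. This is an illustration of the pattern only, not ExtractThinker's implementation: `InvoiceSketch`, `load_text`, `fake_llm`, and `MiniExtractor` are hypothetical names, a dataclass stands in for a Pydantic contract, and a pair of regular expressions stands in for the LLM call.

```python
import re
from dataclasses import dataclass


@dataclass
class InvoiceSketch:
    """Plays the role of a Contract: the shape of the extracted data."""
    invoice_number: str
    invoice_date: str


def load_text(source: str) -> str:
    """Plays the role of a Document Loader: turn a source into text."""
    return source.strip()


def fake_llm(text: str) -> dict:
    """Plays the role of the LLM: map raw text to field values."""
    number = re.search(r"Invoice #(\S+)", text)
    date = re.search(r"Date: (\S+)", text)
    return {"invoice_number": number.group(1), "invoice_date": date.group(1)}


class MiniExtractor:
    """Plays the role of an Extractor: wire the loader, the LLM, and the contract."""

    def __init__(self, loader, llm):
        self.loader = loader
        self.llm = llm

    def extract(self, source, contract):
        fields = self.llm(self.loader(source))
        return contract(**fields)


extractor = MiniExtractor(load_text, fake_llm)
invoice = extractor.extract("  Invoice #A-17\nDate: 2024-05-01  ", InvoiceSketch)
```

Because the loader and the LLM are passed in rather than hard-coded, either can be swapped without touching the extractor, which is the point of the modular design described above.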

📝 Why Use ExtractThinker?

While general frameworks like LangChain offer a broad range of functionalities, ExtractThinker is specialized for Intelligent Document Processing (IDP). It simplifies the complexities associated with IDP by providing:

Specialized Components: Tailored tools for document loading, splitting, and extraction.
High Accuracy with LLMs: Leverages the power of L
