<p align="center"> <img src="https://github.com/enoch3712/Open-DocLLM/assets/9283394/41d9d151-acb5-44da-9c10-0058f76c2512" alt="Extract Thinker Logo" width="200"/> </p> <p align="center"> <img alt="Python Version" src="https://img.shields.io/badge/Python-3.9%2B-blue.svg" /> <a href="https://medium.com/@enoch3712"> <img alt="Medium" src="https://img.shields.io/badge/Medium-12100E?style=flat&logo=medium&logoColor=white" /> </a> <img alt="GitHub Last Commit" src="https://img.shields.io/github/last-commit/enoch3712/Open-DocLLM" /> <img alt="Github License" src="https://img.shields.io/badge/License-Apache%202.0-blue.svg" /> </p>

ExtractThinker

ExtractThinker is a flexible document intelligence tool that leverages Large Language Models (LLMs) to extract and classify structured data from documents, functioning like an ORM for seamless document processing workflows.

TL;DR Document Intelligence for LLMs

🚀 Key Features

  • Flexible Document Loaders: Support for multiple document loaders, including Tesseract OCR, Azure Form Recognizer, AWS Textract, Google Document AI, and more.
  • Customizable Contracts: Define custom extraction contracts using Pydantic models for precise data extraction.
  • Advanced Classification: Classify documents or document sections using custom classifications and strategies.
  • Asynchronous Processing: Utilize asynchronous processing for efficient handling of large documents.
  • Multi-format Support: Seamlessly work with various document formats like PDFs, images, spreadsheets, and more.
  • ORM-style Interaction: Interact with documents and LLMs in an ORM-like fashion for intuitive development.
  • Splitting Strategies: Implement lazy or eager splitting strategies to process documents page by page or as a whole.
  • Integration with LLMs: Easily integrate with different LLM providers like OpenAI, Anthropic, Cohere, and more.
  • Community-driven Development: Inspired by the LangChain ecosystem with a focus on intelligent document processing.

📦 Installation

Install ExtractThinker using pip:

pip install extract_thinker

🛠️ Usage

Basic Extraction Example

Here's a quick example to get you started with ExtractThinker. This example demonstrates how to load a document using PyPdf and extract specific fields defined in a contract.

import os
from dotenv import load_dotenv
from extract_thinker import Extractor, DocumentLoaderPyPdf, Contract

load_dotenv()

class InvoiceContract(Contract):
    invoice_number: str
    invoice_date: str

# Set the path to the document you want to process
test_file_path = os.path.join("path_to_your_files", "invoice.pdf")

# Initialize the extractor
extractor = Extractor()
extractor.load_document_loader(DocumentLoaderPyPdf())
extractor.load_llm("gpt-4o-mini")  # or any other supported model

# Extract data from the document
result = extractor.extract(test_file_path, InvoiceContract)

print("Invoice Number:", result.invoice_number)
print("Invoice Date:", result.invoice_date)

Classification Example

ExtractThinker allows you to classify documents or parts of documents using custom classifications:

from dotenv import load_dotenv
from extract_thinker import (
    Extractor, Classification,
    DocumentLoaderPyPdf, Contract
)

load_dotenv()

class InvoiceContract(Contract):
    invoice_number: str
    invoice_date: str

class DriverLicenseContract(Contract):
    name: str
    license_number: str

# Initialize the extractor and load the document loader
extractor = Extractor()
extractor.load_document_loader(DocumentLoaderPyPdf())
extractor.load_llm("gpt-4o-mini")

# Define classifications
classifications = [
    Classification(
        name="Invoice",
        description="An invoice document",
        contract=InvoiceContract,
        extractor=extractor,
    ),
    Classification(
        name="Driver License",
        description="A driver's license document",
        contract=DriverLicenseContract,
        extractor=extractor,
    ),
]

# Classify the document directly using the extractor
result = extractor.classify(
    "path_to_your_document.pdf",  # Can be a file path or IO stream
    classifications,
    image=True  # Set to True for image-based classification
)

# The result will be a ClassificationResponse object with 'name' and 'confidence' fields
print(f"Document classified as: {result.name}")
print(f"Confidence level: {result.confidence}")

Splitting Files Example

ExtractThinker allows you to split and process documents using different strategies. Here's how you can split a document and extract data based on classifications.

import os
from dotenv import load_dotenv
from extract_thinker import (
    Extractor,
    Process,
    Classification,
    ImageSplitter,
    DocumentLoaderPyPdf,
    Contract,
    SplittingStrategy,
)

load_dotenv()

class DriverLicenseContract(Contract):
    name: str
    license_number: str

class InvoiceContract(Contract):
    invoice_number: str
    invoice_date: str

# Initialize the extractor and load the document loader
extractor = Extractor()
extractor.load_document_loader(DocumentLoaderPyPdf())
extractor.load_llm("gpt-4o-mini")

# Define classifications
classifications = [
    Classification(
        name="Driver License",
        description="A driver's license document",
        contract=DriverLicenseContract,
        extractor=extractor,
    ),
    Classification(
        name="Invoice",
        description="An invoice document",
        contract=InvoiceContract,
        extractor=extractor,
    ),
]

# Initialize the process and load the splitter
process = Process()
process.load_document_loader(DocumentLoaderPyPdf())
process.load_splitter(ImageSplitter(model="gpt-4o-mini"))

# Load and process the document
path_to_document = "path_to_your_multipage_document.pdf"
split_content = (
    process.load_file(path_to_document)
    .split(classifications, strategy=SplittingStrategy.LAZY)
    .extract()
)

# Process the extracted content as needed
for item in split_content:
    if isinstance(item, InvoiceContract):
        print("Extracted Invoice:")
        print("Invoice Number:", item.invoice_number)
        print("Invoice Date:", item.invoice_date)
    elif isinstance(item, DriverLicenseContract):
        print("Extracted Driver License:")
        print("Name:", item.name)
        print("License Number:", item.license_number)
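
The difference between eager and lazy splitting can be pictured with a small standalone sketch. It does not use ExtractThinker and is not its implementation: `split_eager`, `split_lazy`, and `same_prefix` are hypothetical names, and a toy first-word rule stands in for the model call that decides whether two pages belong to the same document.

```python
from typing import Callable, Iterable, Iterator, List

Page = str
# Hypothetical stand-in for the LLM call that decides whether two
# consecutive pages belong to the same logical document.
BelongsTogether = Callable[[Page, Page], bool]


def split_eager(pages: List[Page], belongs: BelongsTogether) -> List[List[Page]]:
    """Take the whole document and return every group in one go."""
    groups: List[List[Page]] = []
    for page in pages:
        if groups and belongs(groups[-1][-1], page):
            groups[-1].append(page)
        else:
            groups.append([page])
    return groups


def split_lazy(pages: Iterable[Page], belongs: BelongsTogether) -> Iterator[List[Page]]:
    """Walk the pages one by one and yield each group as soon as it closes."""
    current: List[Page] = []
    for page in pages:
        if current and not belongs(current[-1], page):
            yield current
            current = []
        current.append(page)
    if current:
        yield current


def same_prefix(a: Page, b: Page) -> bool:
    # Toy rule for the demo: pages match when their first word matches.
    return a.split()[0] == b.split()[0]


pages = ["invoice page 1", "invoice page 2", "license page 1"]
eager_groups = split_eager(pages, same_prefix)
lazy_groups = list(split_lazy(iter(pages), same_prefix))
print(eager_groups)
```

Both functions produce the same groups here. The lazy version only needs one page at a time in memory, which is the usual reason to prefer it for long documents.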

Batch Processing Example

You can also perform batch processing of documents:

import asyncio
from extract_thinker import Extractor, Contract

class ReceiptContract(Contract):
    store_name: str
    total_amount: float

extractor = Extractor()
extractor.load_llm("gpt-4o-mini")

# Source document (a single file path here)
document = "receipt1.jpg"

async def main():
    batch_job = extractor.extract_batch(
        source=document,
        response_model=ReceiptContract,
        vision=True,
    )

    # Monitor the batch job status
    print("Batch Job Status:", await batch_job.get_status())

    # Retrieve results once processing is complete
    results = await batch_job.get_result()
    for result in results.parsed_results:
        print("Store Name:", result.store_name)
        print("Total Amount:", result.total_amount)

# `await` is only valid inside a coroutine, so run the calls through asyncio
asyncio.run(main())
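
Because a batch job finishes some time after it is submitted, callers usually poll the status before fetching results. The sketch below shows one way to do that with `asyncio`. `FakeBatchJob` is a hypothetical stand-in, not the ExtractThinker class, and the status strings `"in_progress"` and `"completed"` are assumptions made for the demo; check the values your version actually returns.

```python
import asyncio


class FakeBatchJob:
    """Hypothetical stand-in for a batch job; not the ExtractThinker class."""

    def __init__(self, polls_until_done: int) -> None:
        self._remaining = polls_until_done

    async def get_status(self) -> str:
        # Report "completed" only after a fixed number of polls (assumed status names).
        if self._remaining > 0:
            self._remaining -= 1
            return "in_progress"
        return "completed"

    async def get_result(self) -> list:
        return [{"store_name": "Corner Shop", "total_amount": 12.5}]


async def wait_for_results(job, interval: float = 0.01, max_polls: int = 100) -> list:
    """Poll the job until it reports completion, then fetch the results."""
    for _ in range(max_polls):
        if await job.get_status() == "completed":
            return await job.get_result()
        await asyncio.sleep(interval)
    raise TimeoutError("batch job did not complete in time")


results = asyncio.run(wait_for_results(FakeBatchJob(polls_until_done=3)))
print(results)
```

The `max_polls` cap keeps the loop from running forever if a job never completes. A real deployment would likely use a much longer polling interval.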

Local LLM Integration Example

ExtractThinker supports custom LLM integrations, including locally hosted models. Here's how you can point it at a model served by Ollama:

import os
from extract_thinker import Extractor, LLM, DocumentLoaderTesseract, Contract

class InvoiceContract(Contract):
    invoice_number: str
    invoice_date: str

# Initialize the extractor
extractor = Extractor()
extractor.load_document_loader(DocumentLoaderTesseract(os.getenv("TESSERACT_PATH")))

# Load a custom LLM (e.g., Ollama)
os.environ['API_BASE'] = "http://localhost:11434"
llm = LLM('ollama/phi3')
extractor.load_llm(llm)

# Extract data
result = extractor.extract("invoice.png", InvoiceContract)
print("Invoice Number:", result.invoice_number)
print("Invoice Date:", result.invoice_date)

📚 Documentation and Resources

  • Examples: Check out the examples directory for Jupyter notebooks and scripts demonstrating various use cases.
  • Medium Articles: Read articles about ExtractThinker on the author's Medium page.
  • Test Suite: Explore the test suite in the tests/ directory for more advanced usage examples and test cases.

🧩 Integration with LLM Providers

ExtractThinker supports integration with multiple LLM providers:

  • OpenAI: Use models like gpt-3.5-turbo, gpt-4, etc.
  • Anthropic: Integrate with Claude models.
  • Cohere: Utilize Cohere's language models.
  • Azure OpenAI: Connect with Azure's OpenAI services.
  • Local Models: Run Ollama-compatible models.

⚙️ How It Works

ExtractThinker uses a modular architecture inspired by the LangChain ecosystem:

  • Document Loaders: Responsible for loading and preprocessing documents from various sources and formats.
  • Extractors: Orchestrate the interaction between the document loaders and LLMs to extract structured data.
  • Splitters: Implement strategies to split documents into manageable chunks for processing.
  • Contracts: Define the expected structure of the extracted data using Pydantic models.
  • Classifications: Classify documents or document sections to apply appropriate extraction contracts.
  • Processes: Manage the workflow of loading, classifying, splitting, and extracting data from documents.
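
How these pieces fit together can be illustrated with a stdlib-only sketch. It is an assumption-laden toy, not ExtractThinker's code: `PlainTextLoader`, `SketchExtractor`, and `fake_llm` are hypothetical names, a dataclass stands in for the Pydantic-based contract, and the model call is a stub that returns canned JSON.

```python
import json
from dataclasses import dataclass, fields
from typing import Callable, Type, TypeVar

T = TypeVar("T")


@dataclass
class InvoiceContract:
    # Contract: the expected shape of the extracted data.
    invoice_number: str
    invoice_date: str


class PlainTextLoader:
    # Document loader: turns a source into text the LLM can read.
    def load(self, source: str) -> str:
        return source.strip()


class SketchExtractor:
    # Extractor: wires a loader and an LLM callable together.
    def __init__(self, loader: PlainTextLoader, llm: Callable[[str], str]) -> None:
        self.loader = loader
        self.llm = llm

    def extract(self, source: str, contract: Type[T]) -> T:
        text = self.loader.load(source)
        names = [f.name for f in fields(contract)]
        prompt = f"Return JSON with keys {names} for this document:\n{text}"
        data = json.loads(self.llm(prompt))
        # Keep only the fields the contract declares; a missing key raises KeyError.
        return contract(**{name: data[name] for name in names})


def fake_llm(prompt: str) -> str:
    # Stub standing in for a real model call.
    return json.dumps(
        {"invoice_number": "INV-001", "invoice_date": "2024-01-31", "extra": "ignored"}
    )


extractor = SketchExtractor(PlainTextLoader(), fake_llm)
invoice = extractor.extract("  Invoice INV-001 dated 2024-01-31  ", InvoiceContract)
print(invoice)
```

Keeping the loader and the LLM behind small interfaces is what lets either one be swapped without touching the contract, which is the point of the modular layout described above.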

📝 Why Use ExtractThinker?

While general frameworks like LangChain offer a broad range of functionalities, ExtractThinker is specialized for Intelligent Document Processing (IDP). It simplifies the complexities associated with IDP by providing:

  • Specialized Components: Tailored tools for document loading, splitting, and extraction.
  • High Accuracy with LLMs: Leverages the power of L
