Parsemypdf

Collection of PDF parsing libraries like AI based docling, claude, openai, gemini, meta's llama-vision, unstructured-io, and pdfminer, pymupdf, pdfplumber etc for efficient snapshot, text, table, and metadata extraction.

Generate Convert Improve

Install / Use

/learn @genieincodebottle/Parsemypdf

About this skill

Quality Score

0/100

README

<a target="_blank" href="https://github.com/genieincodebottle/generative-ai/blob/main/GenAI_Roadmap.md">👉 GenAI Roadmap - 2025</a></h3>

🖼️ OCR with Multimodal | Vision Language Models

📑 Complex PDF Parsing

Comprehensive example code for extracting content from complex PDFs with mixed elements, including text and image data extraction. Includes two Streamlit apps:

PDF Parser & RAG Evaluator (pdf_parser_app.py) - Parse PDFs with 13 different parsers + ask questions using RAG
VLM OCR App (vlm_ocr_app.py) - Extract text from images using Vision Language Models (Claude, Gemini, GPT-4o, Mistral-OCR, Ollama, OmniAI)

Also, check -> PDF Parsing Guide

🎥 YouTube Video: Walkthrough on setup and running the app

📦 Implementation Options

1. ☁️ Paid - API Based Methods

| Model Provider | Models | Details | Example Code | Doc | | -------------- | -------|---------|:------------:|:---:| | Anthropic | claude-opus-4-20250514, claude-sonnet-4-20250514, claude-3-7-sonnet-20250219, claude-3-5-sonnet-20241022 | Claude 4/3.7/3.5 Sonnet is a multimodal AI model developed by Anthropic, capable of processing both text and images. It excels in visual reasoning tasks, such as interpreting charts and graphs, and can accurately transcribe text from imperfect images. Supports native PDF input via base64 encoding. | Code | Doc | Gemini | gemini-2.5-pro, gemini-2.5-flash, gemini-2.5-flash-lite-preview-06-17, gemini-2.0-flash, gemini-2.0-flash-lite | Gemini 2.5/2.0 models offer superior speed, native tool integration, and multimodal generation capabilities. Support 1M token context window, native PDF input, and multimodal outputs. | Code | Doc | OpenAI | gpt-4.1-2025-04-14, gpt-4.1-mini-2025-04-14, gpt-4o, gpt-4o-mini | GPT-4.1/4o is a multimodal AI model capable of processing text, images, and audio with high efficiency. It enhances text generation, reasoning, and vision tasks while improving latency and cost. | Code | Doc | Mistral-OCR | mistral-ocr-latest | Mistral OCR is an advanced AI-powered OCR API for extracting structured text, tables, and equations from documents with high accuracy. Supports multiple languages, processes up to 2,000 pages/min, and provides structured markdown output. | Code | Doc | Unstructured IO | -- | Advanced content partitioning and classification. Processes PDFs, HTML, Word, and images. The Enterprise ETL Platform automates data ingestion and cleaning, integrating seamlessly with GenAI stacks. | Code | Doc | Llama-Parse | -- | GenAI-native document parser for LLM applications like RAG and agents. Supports PDFs, PowerPoint, Word, Excel, and HTML. Free users get 1,000 pages/day. | Code | Doc | Amazon Textract | -- | AWS ML service that extracts text, forms, tables, and signatures from scanned documents. Goes beyond OCR by preserving structure for easy data integration. Supports PNG, JPEG, TIFF, and PDF. | Code | Doc | Azure Doc Intelligence | -- | Azure AI service (formerly Form Recognizer) for extracting text, tables, key-value pairs, and structure from documents. Supports handwriting, scanned docs, and custom models. Free tier: 500 pages/month. | Code | Doc | Zerox | -- | Vision model-based OCR by OmniAI. Converts PDF pages to images, then uses GPT-4o/mini for extraction. Supports structured data extraction via schemas. Clean markdown output. | Code | Doc

2. 🖥️ Open Weight - Local Methods

| Model/Framework Provider | Name | Details | Example Code | Doc | | -------------- | -------|---------|:------------:|:---:| | Meta | llama3.2-vision | Llama 3.2-11B Vision is a multimodal AI model designed to process both text and images. It excels in visual recognition, image reasoning, captioning, and answering general questions about images. 128K token context length. | Code | Doc | IBM | Docling | Excellent for complex PDFs with mixed content. Simplifies document processing, parsing diverse formats with advanced PDF understanding and seamless integrations with the GenAI ecosystem. | Code | Doc | Microsoft | MarkItDown | Converts various files to Markdown. Supports: PDF, PowerPoint, Word, Excel, Images (EXIF + OCR), Audio (EXIF + speech transcription), HTML, CSV, JSON, XML, ZIP files. | Code | Doc | -- | Marker | Quickly converts PDFs and images to Markdown, JSON, and HTML with high accuracy. Supports all languages and document types, handles tables, forms, math, links, and code blocks. Runs on GPU, CPU, or MPS. | Code | Doc | Camelot-Dev | Camelot | Specialized table extraction from text-based PDFs using "Lattice" (grid-based) and "Stream" (whitespace-based) methods. Outputs tables as pandas DataFrames. | Code | Doc | PyPdf | pypdf | Free, open-source, pure-Python PDF library for splitting, merging, cropping, transforming pages, and extracting text and metadata. | Code | Doc | PDFMiner | pdfminer.six | Text and layout extraction from PDFs, supporting various fonts and complex layouts. Enables conversion to HTML/XML and automatic layout analysis. | Code | Doc | Artifex Software | PyMuPDF | Fast Python library for extracting, analyzing, converting, and manipulating PDFs, XPS, and eBooks. Supports text/image extraction, rendering to PNG/SVG, and conversion to HTML, XML, JSON. | Code | Doc | Google | PDFium | Google's open-source C++ library for viewing, parsing, and rendering PDFs. Powers Chromium, enabling text extraction, metadata access, and page rendering. | Code | Doc | LangChain | PyPDFDirectory | Batch PDF content extraction using PyPDF Directory Loader. Process all PDFs in a folder at once. | Code | Doc | -- | PDFPlumber | Text and layout extraction. Extends pdfminer.six for PDF data extraction, handling text, tables, and shapes with visual debugging. Excels at extracting tables into pandas DataFrames. | Code | Doc | Datalab | Surya OCR | Lightweight OCR toolkit supporting 90+ languages with line-level detection, layout analysis, and table recognition. By the creator of Marker. Outperforms Tesseract on most benchmarks. Runs locally, no API key needed. | Code | Doc | StepFun | GOT-OCR2 | Unified end-to-end 580M parameter model for text, tables, charts, equations, and LaTeX. Supports formatted markdown output. Runs on consumer GPUs (8GB+ VRAM). | Code | Doc

⚙️ Setup Instructions

Prerequisites

Python 3.10 or higher
pip (Python package installer)

Installation

Clone the repository:

git clone https://github.com/genieincodebottle/parsemypdf.git
cd parsemypdf

Create a virtual environment:

pip install uv  # if uv not installed
uv venv
.venv\Scripts\activate  # On Linux/Mac -> source .venv/bin/activate

Install dependencies:
```
uv pip install -r requirements.txt
```
Configure environment variables:

Rename .env.example to .env and add the API keys you need.

You don't need ALL keys. Only add keys for the parsers/LLMs you want to use. Start with a free one.
```
# --- Free-tier (no credit card) ---
GROQ_
```

Related Skills

node-connect

339.3k

Diagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps

frontend-design

83.9k

Create distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.

openai-whisper-api

339.3k

Transcribe audio via OpenAI Audio Transcriptions API (Whisper).

triage-issue

83.9k

Triage GitHub issues by analyzing and applying labels