PDF to Markdown Converter (pdfmd)

A refined, privacy-first desktop and CLI tool that converts PDFs—including scanned documents—into clean, structured Markdown. Built for researchers, professionals, and creators who demand accuracy, speed, and absolute data privacy.

Fast. Local. Intelligent. Fully offline.

📑 Table of Contents

Privacy & Security First
- Trusted for Sensitive Workflows
- Password-Protected PDFs
Key Features
Interface Preview
Architecture Overview
Installation
Usage
- GUI Application
- Command-Line Interface
API Documentation
Configuration Options
- Key Settings
- Profile Storage
Example Output
- Table Example
- Math Example
Performance Tips
Troubleshooting
Contributing
License
Acknowledgments
- Special Thanks
Links
Support
- Getting Help
- Feature Requests
Tips & Best Practices

🛡️ Privacy & Security First

Many PDF converters silently upload documents to remote servers. This tool does not.

No uploads: Your files never leave your machine
No telemetry: No usage tracking or analytics
No cloud processing: All computation happens locally
No background requests: Completely offline operation

Every step—extraction, OCR, reconstruction, and rendering—happens locally on your machine.

Trusted for Sensitive Workflows

Intentionally designed for environments where confidentiality is non-negotiable:

🏥 Medical: Clinical notes, diagnostic reports, patient records
⚖️ Legal: Case files, evidence bundles, attorney-client communications
🏛️ Government: Policy drafts, restricted documents, classified materials
🎓 Academic Research: Paywalled journals, unpublished materials, grant proposals
💼 Corporate: Financial reports, IP-sensitive designs, strategic plans

Password-Protected PDFs — Secure Support

Full support for encrypted PDFs with security-first design:

✅ Passwords never logged or saved — Memory-only processing
✅ No command-line exposure — Prevents process monitoring attacks
✅ Auto-cleanup — Temporary files deleted immediately
✅ Interactive prompts — Hidden input in GUI and CLI

GUI: Modal password dialog with masked input (*****)
CLI: getpass hidden terminal input

Supports the standard PDF encryption schemes: 40-bit RC4, 128-bit RC4, and 128/256-bit AES.
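The CLI behaviour described above (hidden input, nothing logged, nothing passed on the command line) can be sketched with the standard-library `getpass` module. This is an illustrative sketch, not pdfmd's actual code: `prompt_for_password`, the `authenticate` callback, and the `ask` parameter are hypothetical names introduced here.

```python
import getpass
from typing import Callable


def prompt_for_password(
    authenticate: Callable[[str], bool],
    max_attempts: int = 3,
    ask: Callable[[str], str] = getpass.getpass,
) -> bool:
    """Ask for a password with hidden input until `authenticate` accepts it.

    `authenticate` is a hypothetical callback, for example a thin wrapper
    around the PDF library's own unlock call. The password lives only in a
    local variable and is never logged or written to disk.
    """
    for _ in range(max_attempts):
        password = ask("PDF password: ")  # input is not echoed to the terminal
        ok = authenticate(password)
        del password  # drop our reference as soon as it is no longer needed
        if ok:
            return True
    return False
```

Because the password is read interactively rather than from `sys.argv`, it does not appear in shell history or process listings. The `ask` parameter exists only so the prompt can be replaced in tests.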

✨ Key Features

🎯 Accurate Markdown From Any PDF

Smart paragraph reconstruction — Joins wrapped lines intelligently
Heading inference — Uses font metrics to detect document structure
Bullet & numbered list detection — Recognizes various formats (•, ○, -, 1., a., etc.)
Hyphenation repair — Automatically unwraps "hy-\nphen" patterns
URL auto-linking — Converts plain URLs into clickable Markdown links
Inline formatting — Preserves bold and italic styling
Header/footer removal — Detects and strips repeating page elements
Multi-column awareness — Reduces cross-column text mixing
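Two of the features above, hyphenation repair and URL auto-linking, are easy to picture as regular-expression passes. The sketch below is an assumption about how such passes might look, not the tool's implementation. The function names and patterns are illustrative.

```python
import re

# Word characters on both sides of "-\n" suggest a word split across lines.
_HYPHEN_BREAK = re.compile(r"(\w)-\n(\w)")
# Deliberately simple URL pattern. The lookbehind skips URLs that already
# sit inside (...), <...> or [...], such as existing Markdown links.
_URL = re.compile(r"(?<![(<\[])\bhttps?://[^\s<>()\[\]]+")


def repair_hyphenation(text: str) -> str:
    """Join words that were hyphenated across a line break."""
    return _HYPHEN_BREAK.sub(r"\1\2", text)


def autolink(text: str) -> str:
    """Wrap bare URLs in angle brackets so Markdown renders them as links."""
    def wrap(match: re.Match) -> str:
        url = match.group(0)
        trailing = ""
        # Keep sentence punctuation outside the link.
        while url and url[-1] in ".,;:!?":
            trailing = url[-1] + trailing
            url = url[:-1]
        return f"<{url}>{trailing}"

    return _URL.sub(wrap, text)
```

Note that this naive repair also joins genuine compounds that happen to break at the hyphen ("well-\nknown" becomes "wellknown"). A production version would likely need a dictionary check or similar safeguard.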

📊 Automatic Table Detection & Reconstruction

PDFs often contain tables that are split across blocks and columns or distorted by other layout quirks. The table engine handles:

Column-aligned tables — Detects 2+ space separated columns
Bordered tables — Recognizes explicit | and ¦ delimiters
Tab-separated blocks — Handles tab-delimited data
Multi-block vertical tables — Stitches tables split across PyMuPDF blocks
Full Markdown rendering — Generates proper pipe tables with alignment
Header row detection — Automatically identifies table headers
Conservative heuristics — Avoids false positives on prose and lists

Perfect for academic papers, financial documents, and structured reports.

Detection Strategies (priority order):

1. Bordered tables (highest confidence)
2. Vertical multi-block tables
3. ASCII whitespace-separated tables
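As a rough illustration of the whitespace-separated strategy, the sketch below splits lines on runs of two or more spaces and emits a pipe table only when every row has the same number of columns. It is a simplified sketch under that assumption, and the function names are hypothetical. The actual engine's heuristics are not shown in this README.

```python
import re
from typing import List, Optional

_COLUMN_GAP = re.compile(r" {2,}")  # two or more spaces separate columns


def split_columns(line: str) -> List[str]:
    return [c.strip() for c in _COLUMN_GAP.split(line.strip()) if c.strip()]


def render_aligned_table(lines: List[str]) -> Optional[str]:
    """Return a Markdown pipe table, or None if the block is not tabular."""
    rows = [split_columns(line) for line in lines if line.strip()]
    if len(rows) < 2:
        return None
    width = len(rows[0])
    # Conservative check: at least two columns, same count on every row.
    if width < 2 or any(len(row) != width for row in rows):
        return None
    header, body = rows[0], rows[1:]  # first row is treated as the header
    out = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    out += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(out)
```

Returning `None` for anything irregular is one way to keep false positives on prose low. Cells that themselves contain `|` would need escaping, which this sketch omits.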

🧮 Math-Aware Extraction & LaTeX Preservation

Scientific documents finally convert cleanly. The Math Engine automatically:

Detects inline & display math regions — Distinguishes equations from prose
Converts Unicode math to LaTeX — α → \alpha, √x → \sqrt{x}
Handles superscripts/subscripts — x² → x^{2}, x₁₀ → x_{10}
Preserves existing LaTeX — Keeps $...$ and $$...$$ intact
Avoids Markdown escaping — Math content bypasses normal escaping
Maintains equation integrity — Keeps equations intact across line breaks

Ideal for scientific PDFs in physics, mathematics, engineering, and chemistry.

Examples:

E = mc² → E = mc^{2}
α + β³ → \alpha + \beta^{3}
∫₀^∞ e^(-x²) dx → \int_{0}^{\infty} e^{-x^{2}} dx
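The simpler conversions above (Greek letters, superscripts, subscripts) can be approximated with lookup tables. The following is a minimal sketch with a deliberately tiny symbol table, not the Math Engine itself. It does not attempt `√x`, ASCII `^(...)` groups, or math-region detection.

```python
import re

# Small illustrative subset; a real table would be much larger.
GREEK = {"α": r"\alpha", "β": r"\beta", "γ": r"\gamma", "π": r"\pi", "∞": r"\infty"}
SUPERSCRIPTS = str.maketrans("⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻", "0123456789+-")
SUBSCRIPTS = str.maketrans("₀₁₂₃₄₅₆₇₈₉₊₋", "0123456789+-")

_SUP_RUN = re.compile("[⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻]+")
_SUB_RUN = re.compile("[₀₁₂₃₄₅₆₇₈₉₊₋]+")
_GREEK = re.compile("[" + "".join(GREEK) + "]")


def unicode_math_to_latex(text: str) -> str:
    """Rewrite Unicode super/subscript runs and a few symbols as LaTeX."""
    # Group consecutive characters so that x₁₀ becomes x_{10}, not x_{1}_{0}.
    text = _SUP_RUN.sub(lambda m: "^{" + m.group(0).translate(SUPERSCRIPTS) + "}", text)
    text = _SUB_RUN.sub(lambda m: "_{" + m.group(0).translate(SUBSCRIPTS) + "}", text)

    def greek(m: re.Match) -> str:
        command = GREEK[m.group(0)]
        following = m.string[m.end():m.end() + 1]
        # Add a space when a letter follows, so "αx" does not become "\alphax".
        return command + (" " if following.isalpha() else "")

    return _GREEK.sub(greek, text)
```

For example, `unicode_math_to_latex("E = mc²")` yields `E = mc^{2}`.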

📸 Scanned PDF Support (OCR)

Tesseract OCR — Lightweight, accurate, works on all major platforms
OCRmyPDF — High-fidelity layout preservation
Auto-detection — Automatically identifies scanned pages
Configurable quality — Balance between speed and accuracy
Mixed-mode support — Handles PDFs with both digital text and scanned pages

Auto-Detection Heuristics:

Text density analysis (fewer than 50 characters per page suggests a scanned page)
Image coverage detection (images covering more than 30% of the page area)
Combined signals trigger OCR automatically
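The heuristics above can be expressed as a small predicate. This sketch assumes that "combined signals" means both conditions must hold. The README does not say whether the tool requires both or either, and the names `PageStats`, `looks_scanned`, and `pages_needing_ocr` are hypothetical.

```python
from dataclasses import dataclass
from typing import List

MIN_TEXT_CHARS = 50        # fewer characters than this: little native text
MAX_IMAGE_COVERAGE = 0.30  # more than this fraction: images dominate the page


@dataclass
class PageStats:
    char_count: int        # characters of natively extractable text
    image_coverage: float  # fraction of page area covered by images (0.0 to 1.0)


def looks_scanned(page: PageStats) -> bool:
    """Assumption: a page is treated as scanned only when both signals agree."""
    low_text = page.char_count < MIN_TEXT_CHARS
    image_heavy = page.image_coverage > MAX_IMAGE_COVERAGE
    return low_text and image_heavy


def pages_needing_ocr(pages: List[PageStats]) -> List[int]:
    """Return zero-based indices of pages that should go through OCR."""
    return [i for i, page in enumerate(pages) if looks_scanned(page)]
```

Deciding per page rather than per document is what would allow mixed PDFs to keep native text where it exists and run OCR only where needed.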

🎨 Modern GUI Experience

Dark/Light themes — Obsidian-style dark mode (default) with instant toggle
Live progress tracking — Determinate progress bar with full logging
Real-time console — View extraction and conversion logs as they happen
Quick access — "Open Output Folder" link to finished Markdown
Non-blocking conversion — Cancel long-running jobs anytime with Esc
Keyboard shortcuts — Power-user workflow (Ctrl+Enter to convert)
Persistent settings — Theme, paths, options, and profiles saved between sessions
Conversion profiles — Built-in and custom presets for different document types

🖼️ Interface Preview

Dark Mode (Default)


Obsidian-inspired dark theme with purple accents for optimal late-night work sessions.

Toggle between themes instantly — your preference is saved between sessions.

🧠 Architecture Overview

A modular pipeline ensures clarity, stability, and extensibility.

PDF Input
    ↓
┌─────────────────┐
│  1. EXTRACT     │ ← Native PyMuPDF or OCR (Tesseract/OCRmyPDF)
└─────────────────┘
    ↓
┌─────────────────┐
│  2. TRANSFORM   │ ← Clean text, remove headers/footers, detect structure
└─────────────────┘
    ↓
┌─────────────────┐
│  3. RENDER      │ ← Generate Markdown with headings, lists, links
└─────────────────┘
    ↓
┌─────────────────┐
│  4. EXPORT      │ ← Write .md file + optional image assets
└─────────────────┘
    ↓
Markdown Output
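To make the staged design concrete, here is a toy version of stages 2 to 4 operating on already-extracted page text. All function names are hypothetical and the header/footer logic is a simplified stand-in; it only shows how single-purpose stages compose, and does not reflect pdfmd's real module API.

```python
from collections import Counter
from pathlib import Path
from typing import List


def transform(pages: List[str], min_ratio: float = 0.6) -> List[str]:
    """Remove lines that open or close most pages (likely headers/footers)."""
    edges: Counter = Counter()
    for page in pages:
        lines = [line.strip() for line in page.splitlines() if line.strip()]
        if lines:
            # A set avoids double-counting one-line pages.
            edges.update({lines[0], lines[-1]})
    threshold = max(2, int(len(pages) * min_ratio))
    repeated = {line for line, count in edges.items() if count >= threshold}
    cleaned = []
    for page in pages:
        # Simplification: drops every occurrence of a repeated edge line.
        kept = [line for line in page.splitlines() if line.strip() not in repeated]
        cleaned.append("\n".join(kept).strip())
    return cleaned


def render(pages: List[str]) -> str:
    """Join non-empty pages into one Markdown document."""
    return "\n\n".join(page for page in pages if page) + "\n"


def export(markdown: str, out_path: Path) -> Path:
    """Write the Markdown to disk and return the path."""
    out_path.write_text(markdown, encoding="utf-8")
    return out_path


def convert(pages: List[str], out_path: Path) -> Path:
    # Stage 1 (extract) is assumed to have produced `pages` already.
    return export(render(transform(pages)), out_path)
```

Exact-match comparison will miss footers that change per page, such as "Page 1" and "Page 2". Handling those would require normalising digits or a fuzzier comparison.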

📦 Module Overview

Each module maintains a single responsibility, ensuring the system remains clean, testable, and easy to extend.

| Module | Purpose | | ----------------
