Pdfmd
Smart PDF to Markdown converter with intelligent heading detection, automatic header/footer removal, orphan fragment merging, and image export. Features a user-friendly GUI with preview mode, persistent settings, and per-page error recovery. Optimized for Obsidian and other Markdown-based note-taking workflows.
PDF to Markdown Converter (pdfmd)
A refined, privacy-first desktop and CLI tool that converts PDFs—including scanned documents—into clean, structured Markdown. Built for researchers, professionals, and creators who demand accuracy, speed, and absolute data privacy.
Fast. Local. Intelligent. Fully offline.
🛡️ Privacy & Security First
Many PDF converters silently upload documents to remote servers. This tool does not.
- No uploads: Your files never leave your machine
- No telemetry: No usage tracking or analytics
- No cloud processing: All computation happens locally
- No background requests: Completely offline operation
Every step—extraction, OCR, reconstruction, and rendering—happens locally on your machine.
Trusted for Sensitive Workflows
Intentionally designed for environments where confidentiality is non-negotiable:
- 🏥 Medical: Clinical notes, diagnostic reports, patient records
- ⚖️ Legal: Case files, evidence bundles, attorney-client communications
- 🏛️ Government: Policy drafts, restricted documents, classified materials
- 🎓 Academic Research: Paywalled journals, unpublished materials, grant proposals
- 💼 Corporate: Financial reports, IP-sensitive designs, strategic plans
Password-Protected PDFs — Secure Support
Full support for encrypted PDFs with security-first design:
✅ Passwords never logged or saved — Memory-only processing
✅ No command-line exposure — Passwords are entered at a prompt rather than passed as arguments, so they do not show up in process listings
✅ Auto-cleanup — Temporary files deleted immediately
✅ Interactive prompts — Hidden input in GUI and CLI
GUI: Modal password dialog with masked input (*****)
CLI: getpass hidden terminal input
Supports all PDF encryption standards: 40-bit RC4, 128-bit RC4, 128/256-bit AES.
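The CLI prompt flow described above can be sketched roughly as follows. This is a minimal illustration, not pdfmd's actual code: `unlock` is a hypothetical helper, and the `doc` argument is assumed to expose `needs_pass` and `authenticate()` the way a PyMuPDF `Document` does.

```python
import getpass


def unlock(doc, prompt=getpass.getpass, max_attempts=3):
    """Return True once `doc` is readable, False if every attempt fails.

    The password lives only in a local variable: it is never printed,
    logged, or written to disk by this function.
    """
    if not doc.needs_pass:
        return True  # document is not password-protected
    for _ in range(max_attempts):
        # getpass reads from the terminal without echoing the input.
        password = prompt("PDF password: ")
        # Assumption: authenticate() returns a truthy value on success.
        if doc.authenticate(password):
            return True
    return False
```

Because the prompt function is injectable, a GUI could pass its own masked-dialog callback in place of `getpass.getpass`.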
✨ Key Features
🎯 Accurate Markdown From Any PDF
- Smart paragraph reconstruction — Joins wrapped lines intelligently
- Heading inference — Uses font metrics to detect document structure
- Bullet & numbered list detection — Recognizes various formats (•, ○, -, 1., a., etc.)
- Hyphenation repair — Automatically unwraps "hy-\nphen" patterns
- URL auto-linking — Converts plain URLs into clickable Markdown links
- Inline formatting — Preserves bold and italic styling
- Header/footer removal — Detects and strips repeating page elements
- Multi-column awareness — Reduces cross-column text mixing
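Three of the clean-up steps listed above (hyphenation repair, URL auto-linking, and header/footer removal) can be approximated with a few lines of standard-library Python. The function names, the regular expressions, and the 60% repetition threshold are illustrative assumptions, not the tool's real implementation.

```python
import math
import re
from collections import Counter

# Assumption: only join when both sides of the break are lowercase letters,
# so list dashes and capitalised compounds are left alone.
HYPHEN_BREAK = re.compile(r"([a-z])-\n([a-z])")
# Skip URLs that already sit inside <...>, (...) or [...].
BARE_URL = re.compile(r"(?<![<(\[])\bhttps?://[^\s<>()\[\]]+")


def repair_hyphenation(text):
    """Unwrap words that were hyphenated across a line break."""
    return HYPHEN_BREAK.sub(r"\1\2", text)


def autolink_urls(text):
    """Wrap bare URLs in <...>, keeping trailing punctuation outside."""
    def wrap(match):
        url = match.group(0)
        trimmed = url.rstrip(".,;:!?")
        return f"<{trimmed}>{url[len(trimmed):]}"
    return BARE_URL.sub(wrap, text)


def strip_repeating_edges(pages, min_ratio=0.6):
    """Drop first/last lines that repeat on most pages (digits ignored)."""
    def norm(line):
        # "Page 1" and "Page 2" should count as the same footer.
        return re.sub(r"\d+", "#", line.strip())

    counts = Counter()
    for lines in pages:
        if lines:
            counts.update({norm(lines[0]), norm(lines[-1])})
    threshold = max(2, math.ceil(min_ratio * len(pages)))
    repeated = {key for key, n in counts.items() if n >= threshold}

    cleaned = []
    for lines in pages:
        last = len(lines) - 1
        cleaned.append([
            line for i, line in enumerate(lines)
            if not (i in (0, last) and norm(line) in repeated)
        ])
    return cleaned
```

A lowercase-only hyphenation rule is deliberately cautious: it will also join genuine compounds such as "well-\nknown", which is one reason a real converter needs more context than a single regex.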
📊 Automatic Table Detection & Reconstruction
PDFs often contain tables that are split across text blocks or columns, or distorted by other layout quirks. The table engine handles:
- Column-aligned tables — Detects 2+ space separated columns
- Bordered tables — Recognizes explicit | and ¦ delimiters
- Tab-separated blocks — Handles tab-delimited data
- Multi-block vertical tables — Stitches tables split across PyMuPDF blocks
- Full Markdown rendering — Generates proper pipe tables with alignment
- Header row detection — Automatically identifies table headers
- Conservative heuristics — Avoids false positives on prose and lists
Perfect for academic papers, financial documents, and structured reports.
Detection Strategies (priority order):
1. Bordered tables (highest confidence)
2. Vertical multi-block tables
3. ASCII whitespace-separated tables
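As a rough illustration of the whitespace-separated strategy, the sketch below splits each line on runs of two or more spaces and emits a pipe table only when every row has the same number of columns. `to_pipe_table` is a hypothetical name, and treating the first row as the header is an assumption made for brevity.

```python
import re


def to_pipe_table(lines):
    """Render column-aligned text as a Markdown pipe table, or return None.

    Conservative on purpose: anything that does not look like a
    rectangular grid of at least two columns is rejected.
    """
    rows = [re.split(r" {2,}", ln.strip()) for ln in lines if ln.strip()]
    if len(rows) < 2:
        return None
    width = len(rows[0])
    if width < 2 or any(len(row) != width for row in rows):
        return None  # ragged or single-column input is probably prose

    header, *body = rows  # assumption: the first row is the header
    out = ["| " + " | ".join(header) + " |",
           "| " + " | ".join(["---"] * width) + " |"]
    out += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(out)
```

Returning `None` instead of a best-effort table keeps ordinary paragraphs from being mangled, which matches the conservative behaviour described above.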
🧮 Math-Aware Extraction & LaTeX Preservation
Scientific documents finally convert cleanly. The Math Engine automatically:
- Detects inline & display math regions — Distinguishes equations from prose
- Converts Unicode math to LaTeX — α → \alpha, √x → \sqrt{x}
- Handles superscripts/subscripts — x² → x^{2}, x₁₀ → x_{10}
- Preserves existing LaTeX — Keeps $...$ and $$...$$ intact
- Avoids Markdown escaping — Math content bypasses normal escaping
- Maintains equation integrity — Keeps equations intact across line breaks
Ideal for scientific PDFs in physics, mathematics, engineering, and chemistry.
Examples:
- E = mc² → E = mc^{2}
- α + β³ → \alpha + \beta^{3}
- ∫₀^∞ e^(-x²) dx → \int_{0}^{\infty} e^{-x^{2}} dx
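The simpler conversions shown above (Greek letters, superscript and subscript digits) can be sketched with translation tables. This is only an illustration under stated assumptions: the symbol table is a tiny hypothetical subset, and radicals, integrals, and parenthesised exponents are not handled.

```python
import re

# Assumption: a deliberately small symbol table for demonstration.
GREEK = {"α": r"\alpha", "β": r"\beta", "∞": r"\infty"}
GREEK_RE = re.compile("[" + "".join(GREEK) + "]")

SUPER_DIGITS = "⁰¹²³⁴⁵⁶⁷⁸⁹"
SUB_DIGITS = "₀₁₂₃₄₅₆₇₈₉"
TO_ASCII_SUPER = str.maketrans(SUPER_DIGITS, "0123456789")
TO_ASCII_SUB = str.maketrans(SUB_DIGITS, "0123456789")
SUPER_RUN = re.compile(f"[{SUPER_DIGITS}]+")
SUB_RUN = re.compile(f"[{SUB_DIGITS}]+")


def _greek(match):
    cmd = GREEK[match.group(0)]
    # Add a space when a letter follows, so "αx" becomes "\alpha x"
    # rather than the undefined command "\alphax".
    following = match.string[match.end():match.end() + 1]
    return cmd + (" " if following.isalpha() else "")


def unicode_math_to_latex(text):
    """Rewrite Unicode super/subscript digit runs and a few symbols as LaTeX."""
    text = SUPER_RUN.sub(
        lambda m: "^{" + m.group(0).translate(TO_ASCII_SUPER) + "}", text)
    text = SUB_RUN.sub(
        lambda m: "_{" + m.group(0).translate(TO_ASCII_SUB) + "}", text)
    return GREEK_RE.sub(_greek, text)
```

Runs of digits are grouped before conversion so that x₁₀ becomes a single subscript `x_{10}` instead of two separate ones.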
📸 Scanned PDF Support (OCR)
- Tesseract OCR — Lightweight, accurate, works on all major platforms
- OCRmyPDF — High-fidelity layout preservation
- Auto-detection — Automatically identifies scanned pages
- Configurable quality — Balance between speed and accuracy
- Mixed-mode support — Handles PDFs with both digital text and scanned pages
Auto-Detection Heuristics:
- Text density analysis (fewer than 50 characters on a page suggests a scan)
- Image coverage detection (images covering more than 30% of the page area)
- Combined signals trigger OCR automatically
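The heuristics above reduce to a small predicate. In this sketch the thresholds come from the list, while the function name, its parameters, and the choice to require both signals at once are assumptions; the text and area values would come from whatever extractor is in use (for example, PyMuPDF's `page.get_text()` for the text).

```python
def looks_scanned(text, image_area, page_area,
                  min_chars=50, max_image_ratio=0.30):
    """Guess whether a page is a scan that should be sent to OCR.

    Assumption: both signals must agree (sparse text AND image-heavy),
    which avoids running OCR on blank or nearly blank digital pages.
    """
    sparse_text = len(text.strip()) < min_chars
    image_heavy = page_area > 0 and (image_area / page_area) > max_image_ratio
    return bool(sparse_text and image_heavy)
```

For a US Letter page (612 x 792 points) with no extractable text and a full-page image, `looks_scanned("", 612 * 792, 612 * 792)` is true; the same page with a few hundred characters of real text is left alone.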
🎨 Modern GUI Experience
- Dark/Light themes — Obsidian-style dark mode (default) with instant toggle
- Live progress tracking — Determinate progress bar with full logging
- Real-time console — View extraction and conversion logs as they happen
- Quick access — "Open Output Folder" link to finished Markdown
- Non-blocking conversion — Cancel long-running jobs anytime with Esc
- Keyboard shortcuts — Power-user workflow (Ctrl+Enter to convert)
- Persistent settings — Theme, paths, options, and profiles saved between sessions
- Conversion profiles — Built-in and custom presets for different document types
🖼️ Interface Preview
Dark Mode (Default)

Obsidian-inspired dark theme with purple accents for optimal late-night work sessions.
Toggle between themes instantly — your preference is saved between sessions.
🧠 Architecture Overview
A modular pipeline ensures clarity, stability, and extensibility.
PDF Input
↓
┌─────────────────┐
│ 1. EXTRACT │ ← Native PyMuPDF or OCR (Tesseract/OCRmyPDF)
└─────────────────┘
↓
┌─────────────────┐
│ 2. TRANSFORM │ ← Clean text, remove headers/footers, detect structure
└─────────────────┘
↓
┌─────────────────┐
│ 3. RENDER │ ← Generate Markdown with headings, lists, links
└─────────────────┘
↓
┌─────────────────┐
│ 4. EXPORT │ ← Write .md file + optional image assets
└─────────────────┘
↓
Markdown Output
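The four stages in the diagram compose as plain functions. The sketch below is a simplified stand-in rather than the project's module layout: `Span`, the stage function names, and the median-based heading thresholds (1.5x and 1.2x the body size) are all hypothetical, and `extract` takes pre-extracted `(text, font_size)` pairs instead of reading a PDF.

```python
from dataclasses import dataclass
from pathlib import Path
from statistics import median


@dataclass
class Span:
    text: str
    font_size: float
    heading_level: int = 0  # 0 means body text


def extract(raw):
    """Stage 1 stand-in: wrap (text, font_size) pairs as Span objects."""
    return [Span(text, size) for text, size in raw]


def transform(spans):
    """Stage 2: drop empty spans and infer headings from font size."""
    spans = [s for s in spans if s.text.strip()]
    if not spans:
        return spans
    body = median(s.font_size for s in spans)  # assumed body text size
    for s in spans:
        if s.font_size >= body * 1.5:
            s.heading_level = 1
        elif s.font_size >= body * 1.2:
            s.heading_level = 2
    return spans


def render(spans):
    """Stage 3: turn spans into Markdown paragraphs and headings."""
    parts = []
    for s in spans:
        prefix = "#" * s.heading_level + " " if s.heading_level else ""
        parts.append(prefix + s.text.strip())
    return "\n\n".join(parts) + "\n"


def export(markdown, path):
    """Stage 4: write the Markdown file and return its path."""
    path = Path(path)
    path.write_text(markdown, encoding="utf-8")
    return path


def convert(raw, out_path):
    return export(render(transform(extract(raw))), out_path)
```

Keeping each stage a pure function of the previous stage's output means any one of them can be tested or swapped (for example, replacing `extract` with an OCR-backed version) without touching the rest.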
📦 Module Overview
Each module maintains a single responsibility, ensuring the system remains clean, testable, and easy to extend.
| Module | Purpose | | ----------------
