PMCGrab -- From PubMed Central ID to AI-Ready JSON in Seconds
Every AI workflow that touches biomedical literature hits the same wall:
- Download PMC XML hoping it's "structured."
- Fight nested tags, footnotes, figure refs, and half-broken links.
- Hope your regex didn't blow away the Methods section you actually need.
That wall steals hours from RAG pipelines, knowledge-graph builds, LLM fine-tuning -- any downstream AI task.
PMCGrab knocks it down. Feed it a list of PMC IDs -- or point it at a directory of bulk-downloaded XML -- and get back clean, section-aware JSON you can drop straight into a vector DB or LLM prompt. No network required for local files. No timeouts. No XML wrestling.
The Hidden Cost of "I'll Just Parse It Myself"
| Task | Manual / ad-hoc | PMCGrab |
| --------------------------- | ----------------------- | -------------------------------------------- |
| Install dependencies | 5-10 min | ~2 s (uv add pmcgrab) |
| Convert one article to JSON | 15-30 min | ~3 s (network) / instant (local XML) |
| Capture every IMRaD section | Hope & regex            | 98% detection accuracy*                      |
| Parallel processing | Bash loops & temp files | --workers N flag |
| Edge-case maintenance | Yours forever | 200+ tests, active upkeep |
*Evaluated on 7,500 PMC papers used in a disease-specific knowledge-graph pipeline.
At $50/hour, hand-parsing 100 papers burns $1,000+. PMCGrab does the same job for $0 -- within minutes -- so you can focus on using the information instead of extracting it.
Quick Install
Recommended (via uv):
uv add pmcgrab
Or with pip:
pip install pmcgrab
Python >= 3.10 required. Tested on 3.10, 3.11, 3.12, and 3.13.
Optional extras:
pip install pmcgrab[dev] # Linting, type-checking, pre-commit
pip install pmcgrab[test] # pytest + coverage
pip install pmcgrab[docs] # MkDocs + Material theme
pip install pmcgrab[notebook] # Jupyter support
30-Second Quick Start
from pmcgrab import Paper
paper = Paper.from_pmc("7181753")
print(paper.title)
# => "Single-cell transcriptomes of the human skin reveal age-related loss of ..."
print(paper.abstract_as_str()[:200])
# => "Fibroblasts are an essential cell population for human skin architecture ..."
# Every section, clean and ready
for section, text in paper.body_as_dict().items():
print(f"{section}: {len(text.split())} words")
# Save to JSON
paper.to_json()
That's it. One import, one line to fetch, structured data everywhere.
Ways to Use PMCGrab
1. Python API -- the Paper class (recommended)
The Paper class is the primary interface. It wraps every piece of parsed data with convenient accessor methods.
From the network:
from pmcgrab import Paper
paper = Paper.from_pmc("7181753", suppress_warnings=True)
From a local XML file (no network needed):
paper = Paper.from_local_xml("path/to/PMC7181753.xml")
Output methods -- choose the shape that fits your pipeline:
# Abstract
paper.abstract_as_str() # Plain-text string
paper.abstract_as_dict() # {"Background": "...", "Results": "..."}
# Body
paper.body_as_dict() # Flat: {"Introduction": "...", "Methods": "..."}
paper.body_as_nested_dict() # Hierarchical: preserves subsections
paper.body_as_paragraphs() # List of dicts -- ideal for RAG chunking
# [{"section": "Methods", "text": "...", "paragraph_index": 0}, ...]
# Full text
paper.full_text() # Abstract + body as one continuous string
# Table of contents
paper.get_toc() # ["Introduction", "Methods", "Results", ...]
# Serialization
paper.to_dict() # Full JSON-serializable dictionary
paper.to_json() # JSON string (pretty-printed)
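The list-of-dicts shape from body_as_paragraphs() slots directly into RAG chunking. A minimal sketch using only the standard library; the sample records mimic the documented shape, and chunk_paragraphs is an illustrative helper, not part of PMCGrab:

```python
def chunk_paragraphs(paragraphs, max_words=100):
    """Group paragraph records (shaped like paper.body_as_paragraphs()
    output) into word-bounded chunks ready for embedding."""
    chunks, current, count = [], [], 0
    for rec in paragraphs:
        words = len(rec["text"].split())
        # Flush the current chunk before it would exceed the word budget
        if current and count + words > max_words:
            chunks.append(" ".join(p["text"] for p in current))
            current, count = [], 0
        current.append(rec)
        count += words
    if current:
        chunks.append(" ".join(p["text"] for p in current))
    return chunks

# Stand-in records; in practice: paragraphs = paper.body_as_paragraphs()
sample = [
    {"section": "Methods", "text": "We sequenced skin biopsies.", "paragraph_index": 0},
    {"section": "Methods", "text": "Cells were sorted by FACS.", "paragraph_index": 1},
]
print(chunk_paragraphs(sample, max_words=100))
```

Because each record carries its section name, you can also keep section boundaries as hard chunk boundaries by grouping on rec["section"] first.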
Metadata you can access directly:
paper.title # Article title
paper.authors # pandas DataFrame (names, emails, affiliations)
paper.journal_title # "Genome Biology"
paper.article_id # {"pmcid": "PMC7181753", "doi": "10.1038/...", ...}
paper.keywords # ["fibroblasts", "aging", ...]
paper.published_date # {"epub": "2020-04-24", ...}
paper.citations # Structured reference list
paper.tables # List of pandas DataFrames
paper.figures # Figure metadata + captions
paper.permissions # Copyright, license info
paper.funding # Funding sources
paper.equations # MathML + TeX equations
# ... and 20+ more attributes (see "Extracted Metadata" below)
2. Dict-Based API (for data pipelines)
If you prefer raw dictionaries over the Paper object:
from pmcgrab import process_single_pmc, process_single_local_xml
# From network
data = process_single_pmc("7181753")
# From local XML
data = process_single_local_xml("path/to/article.xml")
print(data["title"])
print(data["abstract_text"]) # Plain-text abstract
print(data["abstract"]) # Structured abstract (dict)
print(list(data["body"].keys())) # Section titles
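Plain dicts stream naturally into JSONL, one object per line, which is a convenient hand-off format for downstream pipelines (and similar in spirit to the CLI's --format jsonl). A sketch using only the standard library; the records are stand-ins for process_single_pmc() output:

```python
import json

# Stand-in records; in practice: records = [process_single_pmc(i) for i in ids]
records = [
    {"pmc_id": "7181753", "title": "Example article A"},
    {"pmc_id": "3539614", "title": "Example article B"},
]

# Write one JSON object per line (JSONL)
with open("articles.jsonl", "w", encoding="utf-8") as fh:
    for rec in records:
        fh.write(json.dumps(rec, ensure_ascii=False) + "\n")

# Read it back line by line
with open("articles.jsonl", encoding="utf-8") as fh:
    titles = [json.loads(line)["title"] for line in fh]
print(titles)
```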
3. Bulk / Local XML Processing
This feature was inspired by a great suggestion from @vanAmsterdam, who pointed out that working with bulk-exported PMC data could be orders of magnitude faster than fetching articles one-by-one over the network.
We built it. Local XML processing skips the network entirely -- no HTTP requests, no timeouts, no rate limits. It is the fastest way to parse PMC articles at scale.
Python API:
from pmcgrab import Paper, process_single_local_xml, process_local_xml_dir
# Single file
paper = Paper.from_local_xml("./pmc_bulk/PMC7181753.xml")
# Single file (dict output)
data = process_single_local_xml("./pmc_bulk/PMC7181753.xml")
# Entire directory -- concurrent with 16 workers by default
results = process_local_xml_dir("./pmc_bulk/", workers=16)
for filename, data in results.items():
if data:
        print(f"{filename}: {data['title'][:60]}")
CLI:
# Process a directory of bulk-downloaded XML
pmcgrab --from-dir ./pmc_bulk_xml/ --output-dir ./results
# Process specific files
pmcgrab --from-file article1.xml article2.xml --output-dir ./results
How to get bulk XML: Download from the PMC FTP service or the PMC Open Access subset. Each .xml file is a standard JATS XML article that PMCGrab can parse directly.
4. Command Line
PMCGrab's CLI supports six input modes, all mutually exclusive:
# PMC IDs (accepts PMC7181753, pmc7181753, or just 7181753)
pmcgrab --pmcids 7181753 3539614 --output-dir ./results
# PubMed IDs (auto-converted to PMC IDs via NCBI API)
pmcgrab --pmids 33087749 34567890 --output-dir ./results
# DOIs (auto-converted to PMC IDs via NCBI API)
pmcgrab --dois 10.1038/s41586-020-2832-5 --output-dir ./results
# IDs from a text file (one per line -- PMCIDs, PMIDs, or DOIs)
pmcgrab --from-id-file ids.txt --output-dir ./results
# Local XML directory (bulk mode -- no network)
pmcgrab --from-dir ./xml_bulk/ --output-dir ./results
# Specific local XML files (no network)
pmcgrab --from-file article1.xml article2.xml --output-dir ./results
Additional flags:
| Flag | Description | Default |
| ---------------------------- | ------------------------------------------------------ | -------------- |
| --output-dir / --out | Output directory for JSON files | ./pmc_output |
| --batch-size / --workers | Number of concurrent worker threads | 10 |
| --format | json (one file per article) or jsonl (single file) | json |
| --verbose / -v | Enable debug logging | off |
| --quiet / -q | Suppress progress bars | off |
5. Async Support
For asyncio-based applications:
import asyncio
from pmcgrab.application.processing import async_process_pmc_ids
results = asyncio.run(async_process_pmc_ids(
["7181753", "3539614", "3084273"],
max_concurrency=10,
))
for pid, data in results.items():
print(pid, "OK" if data else "FAIL")
6. Batch Processing
Process thousands of articles with built-in concurrency, retries, and rate-limit compliance:
from pmcgrab import process_pmc_ids_in_batches
pmc_ids = ["7181753", "3539614", "5454911", "3084273"]
process_pmc_ids_in_batches(pmc_ids, "./output", batch_size=8)
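To see what batch_size controls, here is how a flat ID list divides into batches. This helper is illustrative only, not PMCGrab's internal implementation:

```python
from itertools import islice

def batched(ids, size):
    """Yield successive batches of at most `size` IDs
    (illustrative sketch, not PMCGrab internals)."""
    it = iter(ids)
    while batch := list(islice(it, size)):
        yield batch

pmc_ids = ["7181753", "3539614", "5454911", "3084273"]
print(list(batched(pmc_ids, 2)))
# => [['7181753', '3539614'], ['5454911', '3084273']]
```

Each batch of batch_size IDs is processed concurrently before the next batch begins, which keeps memory bounded and makes progress easy to checkpoint.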
Output Example
Every parsed article produces a comprehensive JSON structure:
{
"pmc_id": "7181753",
"title": "Single-cell transcriptomes of the human skin reveal .
