
PMCGrab -- From PubMed Central ID to AI-Ready JSON in Seconds


Every AI workflow that touches biomedical literature hits the same wall:

  1. Download PMC XML hoping it's "structured."
  2. Fight nested tags, footnotes, figure refs, and half-broken links.
  3. Hope your regex didn't blow away the Methods section you actually need.

That wall steals hours from RAG pipelines, knowledge-graph builds, LLM fine-tuning -- any downstream AI task.

PMCGrab knocks it down. Feed it a list of PMC IDs -- or point it at a directory of bulk-downloaded XML -- and get back clean, section-aware JSON you can drop straight into a vector DB or LLM prompt. No network required for local files. No timeouts. No XML wrestling.


The Hidden Cost of "I'll Just Parse It Myself"

| Task                        | Manual / ad-hoc         | PMCGrab                              |
| --------------------------- | ----------------------- | ------------------------------------ |
| Install dependencies        | 5-10 min                | ~2 s (`uv add pmcgrab`)              |
| Convert one article to JSON | 15-30 min               | ~3 s (network) / instant (local XML) |
| Capture every IMRaD section | Hope & regex            | 98% detection accuracy*              |
| Parallel processing         | Bash loops & temp files | `--workers N` flag                   |
| Edge-case maintenance       | Yours forever           | 200+ tests, active upkeep            |

*Evaluated on 7,500 PMC papers used in a disease-specific knowledge-graph pipeline.

At $50/hour, hand-parsing 100 papers burns $1,000+. PMCGrab does the same job for $0 -- within minutes -- so you can focus on using the information instead of extracting it.


Quick Install

Recommended (via uv):

uv add pmcgrab

Or with pip:

pip install pmcgrab

Python >= 3.10 required. Tested on 3.10, 3.11, 3.12, and 3.13.

Optional extras:

pip install pmcgrab[dev]       # Linting, type-checking, pre-commit
pip install pmcgrab[test]      # pytest + coverage
pip install pmcgrab[docs]      # MkDocs + Material theme
pip install pmcgrab[notebook]  # Jupyter support

30-Second Quick Start

from pmcgrab import Paper

paper = Paper.from_pmc("7181753")

print(paper.title)
# => "Single-cell transcriptomes of the human skin reveal age-related loss of ..."

print(paper.abstract_as_str()[:200])
# => "Fibroblasts are an essential cell population for human skin architecture ..."

# Every section, clean and ready
for section, text in paper.body_as_dict().items():
    print(f"{section}: {len(text.split())} words")

# Save to JSON
paper.to_json()

That's it. One import, one line to fetch, structured data everywhere.


Ways to Use PMCGrab

1. Python API -- the Paper class (recommended)

The Paper class is the primary interface. It wraps every piece of parsed data with convenient accessor methods.

From the network:

from pmcgrab import Paper

paper = Paper.from_pmc("7181753", suppress_warnings=True)

From a local XML file (no network needed):

paper = Paper.from_local_xml("path/to/PMC7181753.xml")

Output methods -- choose the shape that fits your pipeline:

# Abstract
paper.abstract_as_str()          # Plain-text string
paper.abstract_as_dict()         # {"Background": "...", "Results": "..."}

# Body
paper.body_as_dict()             # Flat: {"Introduction": "...", "Methods": "..."}
paper.body_as_nested_dict()      # Hierarchical: preserves subsections
paper.body_as_paragraphs()       # List of dicts -- ideal for RAG chunking
                                 #   [{"section": "Methods", "text": "...", "paragraph_index": 0}, ...]

# Full text
paper.full_text()                # Abstract + body as one continuous string

# Table of contents
paper.get_toc()                  # ["Introduction", "Methods", "Results", ...]

# Serialization
paper.to_dict()                  # Full JSON-serializable dictionary
paper.to_json()                  # JSON string (pretty-printed)
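
The paragraph dicts from `body_as_paragraphs()` are the natural input for RAG chunking. Here is a minimal sketch of that step (`chunk_paragraphs` is our own helper, not part of PMCGrab; the dict shape mirrors the example above):

```python
def chunk_paragraphs(paragraphs, max_words=200):
    """Greedily merge consecutive paragraph dicts (shaped like the output
    of body_as_paragraphs()) into chunks of at most max_words words,
    never mixing text from different sections."""
    chunks, current, count = [], [], 0

    def flush():
        chunks.append({
            "section": current[0]["section"],
            "text": " ".join(p["text"] for p in current),
        })

    for para in paragraphs:
        words = len(para["text"].split())
        # Start a new chunk on overflow or on a section boundary.
        if current and (count + words > max_words
                        or para["section"] != current[-1]["section"]):
            flush()
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        flush()
    return chunks

# Demo with hand-made paragraph dicts (no network needed):
paras = [
    {"section": "Methods", "text": "First paragraph.", "paragraph_index": 0},
    {"section": "Methods", "text": "Second paragraph.", "paragraph_index": 1},
    {"section": "Results", "text": "Third paragraph.", "paragraph_index": 0},
]
print(chunk_paragraphs(paras, max_words=50))
```

Each chunk stays within one section, so section names can be attached as metadata when embedding into a vector DB.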

Metadata you can access directly:

paper.title                      # Article title
paper.authors                    # pandas DataFrame (names, emails, affiliations)
paper.journal_title              # "Genome Biology"
paper.article_id                 # {"pmcid": "PMC7181753", "doi": "10.1038/...", ...}
paper.keywords                   # ["fibroblasts", "aging", ...]
paper.published_date             # {"epub": "2020-04-24", ...}
paper.citations                  # Structured reference list
paper.tables                     # List of pandas DataFrames
paper.figures                    # Figure metadata + captions
paper.permissions                # Copyright, license info
paper.funding                    # Funding sources
paper.equations                  # MathML + TeX equations
# ... and 20+ more attributes (see "Extracted Metadata" below)

2. Dict-Based API (for data pipelines)

If you prefer raw dictionaries over the Paper object:

from pmcgrab import process_single_pmc, process_single_local_xml

# From network
data = process_single_pmc("7181753")

# From local XML
data = process_single_local_xml("path/to/article.xml")

print(data["title"])
print(data["abstract_text"])       # Plain-text abstract
print(data["abstract"])            # Structured abstract (dict)
print(list(data["body"].keys()))   # Section titles
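
These dicts slot straight into a corpus-building step. A small sketch of flattening one result into a JSONL-ready record (`to_record` is a hypothetical helper; the field names are the ones shown above):

```python
import json

def to_record(data):
    """Flatten a PMCGrab result dict (keys as shown above) into a
    compact record suitable for one line of a JSONL corpus."""
    return {
        "title": data["title"],
        "abstract": data["abstract_text"],
        "sections": list(data["body"].keys()),
        "n_words": sum(len(t.split()) for t in data["body"].values()),
    }

# Demo with a stand-in dict shaped like process_single_pmc output:
sample = {
    "title": "Example article",
    "abstract_text": "Short abstract.",
    "body": {"Introduction": "Intro text here.", "Methods": "Methods text."},
}
print(json.dumps(to_record(sample)))
```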

3. Bulk / Local XML Processing

This feature was inspired by a great suggestion from @vanAmsterdam, who pointed out that working with bulk-exported PMC data could be orders of magnitude faster than fetching articles one-by-one over the network.

We built it. Local XML processing skips the network entirely -- no HTTP requests, no timeouts, no rate limits. It is the fastest way to parse PMC articles at scale.

Python API:

from pmcgrab import Paper, process_single_local_xml, process_local_xml_dir

# Single file
paper = Paper.from_local_xml("./pmc_bulk/PMC7181753.xml")

# Single file (dict output)
data = process_single_local_xml("./pmc_bulk/PMC7181753.xml")

# Entire directory -- concurrent with 16 workers by default
results = process_local_xml_dir("./pmc_bulk/", workers=16)
for filename, data in results.items():
    if data:
        print(f"{filename}: {data['title'][:60]}")

CLI:

# Process a directory of bulk-downloaded XML
pmcgrab --from-dir ./pmc_bulk_xml/ --output-dir ./results

# Process specific files
pmcgrab --from-file article1.xml article2.xml --output-dir ./results

How to get bulk XML: Download from the PMC FTP service or the PMC Open Access subset. Each .xml file is a standard JATS XML article that PMCGrab can parse directly.


4. Command Line

PMCGrab's CLI supports six input modes, all mutually exclusive:

# PMC IDs (accepts PMC7181753, pmc7181753, or just 7181753)
pmcgrab --pmcids 7181753 3539614 --output-dir ./results

# PubMed IDs (auto-converted to PMC IDs via NCBI API)
pmcgrab --pmids 33087749 34567890 --output-dir ./results

# DOIs (auto-converted to PMC IDs via NCBI API)
pmcgrab --dois 10.1038/s41586-020-2832-5 --output-dir ./results

# IDs from a text file (one per line -- PMCIDs, PMIDs, or DOIs)
pmcgrab --from-id-file ids.txt --output-dir ./results

# Local XML directory (bulk mode -- no network)
pmcgrab --from-dir ./xml_bulk/ --output-dir ./results

# Specific local XML files (no network)
pmcgrab --from-file article1.xml article2.xml --output-dir ./results

Additional flags:

| Flag                         | Description                                            | Default        |
| ---------------------------- | ------------------------------------------------------ | -------------- |
| `--output-dir` / `--out`     | Output directory for JSON files                        | `./pmc_output` |
| `--batch-size` / `--workers` | Number of concurrent worker threads                    | 10             |
| `--format`                   | `json` (one file per article) or `jsonl` (single file) | `json`         |
| `--verbose` / `-v`           | Enable debug logging                                   | off            |
| `--quiet` / `-q`             | Suppress progress bars                                 | off            |


5. Async Support

For asyncio-based applications:

import asyncio
from pmcgrab.application.processing import async_process_pmc_ids

results = asyncio.run(async_process_pmc_ids(
    ["7181753", "3539614", "3084273"],
    max_concurrency=10,
))

for pid, data in results.items():
    print(pid, "OK" if data else "FAIL")
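
`max_concurrency` caps how many fetches are in flight at once. The standard semaphore pattern below sketches that idea (`fetch_one` and `gather_bounded` are stand-ins, not PMCGrab internals):

```python
import asyncio

async def fetch_one(pid):
    # Stand-in for a real network fetch; returns the ID after a short pause.
    await asyncio.sleep(0.01)
    return pid

async def gather_bounded(pids, max_concurrency=10):
    """Run fetches concurrently, but never more than max_concurrency
    at once. gather() preserves input order in the result list."""
    sem = asyncio.Semaphore(max_concurrency)

    async def guarded(pid):
        async with sem:
            return await fetch_one(pid)

    return await asyncio.gather(*(guarded(p) for p in pids))

results = asyncio.run(gather_bounded(["7181753", "3539614"], max_concurrency=2))
print(results)
```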

6. Batch Processing

Process thousands of articles with built-in concurrency, retries, and rate-limit compliance:

from pmcgrab import process_pmc_ids_in_batches

pmc_ids = ["7181753", "3539614", "5454911", "3084273"]
process_pmc_ids_in_batches(pmc_ids, "./output", batch_size=8)
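
One way to picture what `batch_size` controls is simple slicing: the ID list is split into fixed-size groups that are processed in turn (`batched` below is illustrative, not PMCGrab API):

```python
def batched(seq, size):
    """Yield successive slices of `size` items from seq; the final
    slice may be shorter."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

print(list(batched(["7181753", "3539614", "5454911", "3084273"], 3)))
```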

Output Example

Every parsed article produces a comprehensive JSON structure:

{
  "pmc_id": "7181753",
  "title": "Single-cell transcriptomes of the human skin reveal .