paperscraper

paperscraper is a Python package for scraping publication metadata or full-text files (PDF or XML) from PubMed and from preprint servers such as arXiv, medRxiv, bioRxiv and chemRxiv. It provides a streamlined interface for scraping metadata, allows you to retrieve citation counts from Google Scholar and impact factors for journals, and comes with simple postprocessing functions and plotting routines for meta-analysis.

Table of Contents

  1. Getting Started
  2. Examples
  3. Plotting
  4. Citation
  5. Contributions

Getting Started

pip install paperscraper

This is enough to query PubMed, arXiv or Google Scholar.

Local development

uv sync

This installs the project and dev tooling into .venv. Use uv run to execute commands, for example:

uv run python -c "import paperscraper"

Download X-rxiv Dumps

To scrape publication data from the preprint servers bioRxiv, medRxiv and chemRxiv, however, the setup is different. The entire history of papers is downloaded and stored in the server_dumps folder in .jsonl format (one paper per line). This takes a while; the estimates below are as of November 2025:

from paperscraper.get_dumps import biorxiv, medrxiv, chemrxiv
chemrxiv()  # Takes <15 min -> +50K papers (~30 MB file)
medrxiv()   # Takes <30 min -> +100K papers (~200 MB file)
biorxiv()   # Takes <3 h -> +450K papers (~800 MB file)
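Each dump stores one JSON record per line, so it can be read with nothing but the standard library. A minimal sketch of that format (the field names here are illustrative, not the exact dump schema, and we synthesize a tiny file rather than download one):

```python
import json
import os
import tempfile

# Synthesize a tiny dump in the same one-record-per-line format.
records = [
    {"title": "Paper A", "doi": "10.1101/0001", "date": "2025-01-02"},
    {"title": "Paper B", "doi": "10.1101/0002", "date": "2025-01-03"},
]
path = os.path.join(tempfile.mkdtemp(), "chemrxiv_sample.jsonl")
with open(path, "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Reading a dump: one json.loads call per line.
with open(path) as f:
    papers = [json.loads(line) for line in f]
print(len(papers))  # 2
```

The same pattern applies to the real dumps, which are simply much larger files of the same shape.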

NOTE: Once the dumps are stored, make sure to restart the Python interpreter so that the changes take effect.
NOTE: If you experience API connection issues, retry and request behavior can be tuned, e.g.:

biorxiv(
    max_retries=12,
    request_timeout=(5.0, 45.0),      # connect timeout, read timeout
    retry_backoff_seconds=1.0,        # initial retry backoff
    max_workers=8,                    # number of parallel date windows
    window_days=30,                   # smaller windows increase parallelism
)

Since v0.2.5, paperscraper also allows scraping {med/bio/chem}rxiv for specific dates.

medrxiv(start_date="2023-04-01", end_date="2023-04-08")

But watch out: the resulting .jsonl file will be labelled with the current date, and all your subsequent searches will be based on this file only. If you use this option, keep an eye on the source files (paperscraper/server_dumps/*jsonl) to ensure they contain the metadata for all papers you are interested in.
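A quick way to keep that eye on the dumps is to list the folder and inspect the date suffixes, which indicate the snapshot each search will use. A small sketch (the folder and filenames below are simulated for illustration):

```python
import glob
import os
import tempfile

# Simulate a server_dumps folder with two date-stamped dumps.
dump_dir = os.path.join(tempfile.mkdtemp(), "server_dumps")
os.makedirs(dump_dir)
for name in ["medrxiv_2023-04-08.jsonl", "chemrxiv_2020-11-10.jsonl"]:
    open(os.path.join(dump_dir, name), "w").close()

# List the dumps; the date suffix tells you how current each snapshot is.
dumps = sorted(os.path.basename(p)
               for p in glob.glob(os.path.join(dump_dir, "*.jsonl")))
print(dumps)
```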

Arxiv local dump

If you prefer local search rather than using the arXiv API:

from paperscraper.get_dumps import arxiv
arxiv(start_date='2024-01-01', end_date=None) # scrapes all metadata from 2024 until today.

Afterwards you can search the local arxiv dump just like the other x-rxiv dumps. The direct endpoint is paperscraper.arxiv.get_arxiv_papers_local. You can also specify the backend directly in the get_and_dump_arxiv_papers function:

from paperscraper.arxiv import get_and_dump_arxiv_papers
get_and_dump_arxiv_papers(..., backend='local')

Examples

paperscraper is built on top of the packages arxiv, pymed, and scholarly.

Publication keyword search

Consider you want to perform a publication keyword search with the query: COVID-19 AND Artificial Intelligence AND Medical Imaging.

  • Scrape papers from PubMed:
from paperscraper.pubmed import get_and_dump_pubmed_papers
covid19 = ['COVID-19', 'SARS-CoV-2']
ai = ['Artificial intelligence', 'Deep learning', 'Machine learning']
mi = ['Medical imaging']
query = [covid19, ai, mi]

get_and_dump_pubmed_papers(query, output_filepath='covid19_ai_imaging.jsonl')
  • Scrape papers from arXiv:
from paperscraper.arxiv import get_and_dump_arxiv_papers

get_and_dump_arxiv_papers(query, output_filepath='covid19_ai_imaging.jsonl')
  • Scrape papers from bioRxiv, medRxiv or chemRxiv:
from paperscraper.xrxiv.xrxiv_query import XRXivQuery

querier = XRXivQuery('server_dumps/chemrxiv_2020-11-10.jsonl')
querier.search_keywords(query, output_filepath='covid19_ai_imaging.jsonl')
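In all three cases the nested query list is interpreted as an AND of OR-groups: terms within an inner list are synonyms (OR), and the outer list combines the groups (AND). A small illustrative helper (not part of paperscraper) makes the mapping explicit:

```python
from typing import List

def to_boolean(query: List[List[str]]) -> str:
    """Illustrative helper (not a paperscraper function): render the
    nested query list as the Boolean expression it represents."""
    return " AND ".join("(" + " OR ".join(group) + ")" for group in query)

covid19 = ['COVID-19', 'SARS-CoV-2']
ai = ['Artificial intelligence', 'Deep learning', 'Machine learning']
mi = ['Medical imaging']
print(to_boolean([covid19, ai, mi]))
# (COVID-19 OR SARS-CoV-2) AND (Artificial intelligence OR Deep learning OR Machine learning) AND (Medical imaging)
```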

You can also use dump_queries to iterate over a bunch of queries for all available databases.

from paperscraper import dump_queries

queries = [[covid19, ai, mi], [covid19, ai], [ai]]
dump_queries(queries, '.')

Or use the harmonized interface of QUERY_FN_DICT to query multiple databases of your choice:

from paperscraper.load_dumps import QUERY_FN_DICT
print(QUERY_FN_DICT.keys())

QUERY_FN_DICT['biorxiv'](query, output_filepath='biorxiv_covid_ai_imaging.jsonl')
QUERY_FN_DICT['medrxiv'](query, output_filepath='medrxiv_covid_ai_imaging.jsonl')
  • Scrape papers from Google Scholar:

Thanks to scholarly, there is an endpoint for Google Scholar too. It does not understand Boolean expressions like the others; use it just like the Google Scholar search field.

from paperscraper.scholar import get_and_dump_scholar_papers
topic = 'Machine Learning'
get_and_dump_scholar_papers(topic)

NOTE: The scholar endpoint does not require authentication, but since it regularly prompts with captchas, it is difficult to use at scale.

Full-Text Retrieval (PDFs & XMLs)

paperscraper allows you to download full text of publications using DOIs. The basic functionality works reliably for preprint servers (arXiv, bioRxiv, medRxiv, chemRxiv), but retrieving papers from PubMed dumps is more challenging due to publisher restrictions and paywalls.

Standard Usage

The main download functions work for all paper types with automatic fallbacks:

from paperscraper.pdf import save_pdf
paper_data = {'doi': "10.48550/arXiv.2207.03928"}
save_pdf(paper_data, filepath='gt4sd_paper.pdf')

To batch download full texts from your metadata search results:

from paperscraper.pdf import save_pdf_from_dump

# Save PDFs/XMLs in current folder and name the files by their DOI
save_pdf_from_dump('medrxiv_covid_ai_imaging.jsonl', pdf_path='.', key_to_save='doi')
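After a batch run it can be useful to check which DOIs from the dump actually produced a file, so the stragglers can be retried. A self-contained sketch, under the assumption (hypothetical, for illustration) that filenames are derived from DOIs with '/' replaced by '_':

```python
import json
import os
import tempfile

# Simulate a metadata dump and a partially completed download run.
workdir = tempfile.mkdtemp()
dump = os.path.join(workdir, "results.jsonl")
dois = ["10.1101/0001", "10.1101/0002", "10.1101/0003"]
with open(dump, "w") as f:
    for doi in dois:
        f.write(json.dumps({"doi": doi}) + "\n")

# Pretend only the first two downloads succeeded.
for doi in dois[:2]:
    open(os.path.join(workdir, doi.replace("/", "_") + ".pdf"), "w").close()

# Compare the dump against the files on disk.
with open(dump) as f:
    wanted = [json.loads(line)["doi"] for line in f]
missing = [d for d in wanted
           if not os.path.exists(os.path.join(workdir, d.replace("/", "_") + ".pdf"))]
print(missing)  # DOIs still to retrieve
```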

Automatic Fallback Mechanisms

When the standard text retrieval fails, paperscraper automatically falls back to alternative retrieval routes. These fallbacks are tried without requiring any additional configuration.

Enhanced Retrieval with Publisher APIs

For more comprehensive access to papers from major publishers, you can provide API keys for:

  • Wiley TDM API: Enables access to Wiley publications (2,000+ journals).
  • Elsevier TDM API: Enables access to Elsevier publications (The Lancet, Cell, ...).
  • bioRxiv TDM API: Enables access to bioRxiv publications (since May 2025, bioRxiv is protected by Cloudflare).

To use publisher APIs:

  1. Create a file with your API keys:
WILEY_TDM_API_TOKEN=your_wiley_token_here
ELSEVIER_TDM_API_KEY=your_elsevier_key_here
AWS_ACCESS_KEY_ID=your_aws_access_key_here
AWS_SECRET_ACCESS_KEY=your_aws_secret_key_here

NOTE: The AWS keys can be created in your AWS/IAM account. When creating the key, make sure you tick the AmazonS3ReadOnlyAccess permission policy.
NOTE: If you name the file .env, it will be loaded automatically (if it is in the current working directory or anywhere up the tree to your home directory).

  2. Pass the file path when calling retrieval functions:
from paperscraper.pdf import save_pdf_from_dump

save_pdf_from_dump(
    'pubmed_query_results.jsonl',
    pdf_path='./papers',
    key_to_save='doi',
    api_keys='path/to/your/api_keys.txt'
)
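Before launching a long retrieval run, it can help to confirm the expected key variables are actually visible to the process (e.g. that the .env file was picked up). A minimal sketch, using the variable names from the api_keys file above; the helper function is illustrative, not part of paperscraper:

```python
import os
from typing import List, Mapping

def missing_keys(env: Mapping[str, str], expected: List[str]) -> List[str]:
    """Illustrative helper: which expected API-key variables
    are not set in the given environment mapping."""
    return [k for k in expected if k not in env]

EXPECTED = [
    "WILEY_TDM_API_TOKEN",
    "ELSEVIER_TDM_API_KEY",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
]

# In practice you would check the real environment:
#   print(missing_keys(os.environ, EXPECTED))
# Demo with a simulated environment where only one key is set:
demo_env = {"WILEY_TDM_API_TOKEN": "token"}
print(missing_keys(demo_env, EXPECTED))
```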

For obtaining API keys, refer to the respective publisher portals.
