
pubget


pubget is a command-line tool for collecting data for biomedical text-mining, and in particular large-scale coordinate-based neuroimaging meta-analysis. It exposes some of the machinery that was used to create the neuroquery dataset, which powers neuroquery.org.

pubget downloads full-text articles from PubMed Central and extracts their text, metadata and stereotactic coordinates. It can also compute TFIDF features for the extracted text, fit NeuroQuery or NeuroSynth, and format its output for use with NiMARE or labelbuddy. It can be extended with plugins.

Besides the command-line interface, pubget's functionality is also exposed through its Python API.

Installation

You can install pubget by running:

pip install pubget

This will install the pubget Python package, as well as the pubget command.

Quick Start

Once pubget is installed, we can download and process biomedical publications so that we can later use them for text-mining or meta-analysis.

pubget run ./pubget_data -q "fMRI[title]"

See pubget run --help for a description of this command. For example, the --n_jobs option allows running some of the steps in parallel.

Usage

The creation of a dataset happens in 3 steps:

  • Downloading the articles in bulk from the PMC API.
  • Extracting the individual articles from the bulk download.
  • Extracting text, stereotactic coordinates and metadata from the articles, and storing this information in CSV files.

Afterwards, some optional steps can also be run, such as:

  • Vectorizing the text: transforming it into vectors of TFIDF features.
  • Running the same analyses as NeuroSynth or NeuroQuery.
  • Preparing the data for use with labelbuddy or NiMARE.

Each of these steps stores its output in a separate directory. Normally, you will run the whole procedure in one command by invoking pubget run. However, separate commands are also provided to run each step separately. Below, we describe each step and its output. Use pubget -h to see a list of all available commands and pubget run -h to see all the options of the main command.

All articles downloaded by pubget come from PubMed Central, and are therefore identified by their PubMed Central ID (pmcid). Note this is not the same as the PubMed ID (pmid). Not all articles in PMC have a pmid.

pubget only downloads articles from the Open Access subset of PMC, that is, papers whose licenses allow downloading their text for text-mining and other reuse (Creative Commons or similar licenses). To restrict search results to the Open Access subset on the PMC website (and preview which papers pubget would download), select "Open access" in the "article attributes" list.

Step 1: Downloading articles from PMC

This step is executed by the pubget download command. Articles to download can be selected in 2 different ways: by using a query to search the PMC database, or by providing an explicit list of article PMCIDs. To use a list of PMCIDs, we must pass the path to a file containing the IDs as the --pmcids_file parameter. It must contain one ID per line, for example:

8217889
7518235
7500239
7287136
7395771
7154153

Note these must be PubMed Central IDs, not PubMed IDs. Moreover, some articles can be viewed on the PubMed Central website but are not in the Open Access subset; their publishers forbid downloading their full text in XML form. pubget filters the list of PMCIDs and only downloads those that are in the Open Access subset. When we use a query instead of a PMCID list, only articles in the Open Access subset are considered.
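As a convenience, the one-ID-per-line file format described above can be read and checked before handing it to pubget. This is an illustrative sketch, not part of pubget's API: the helper name read_pmcids is hypothetical, and it only enforces the "one numeric PMCID per line" convention stated here.

```python
from pathlib import Path


def read_pmcids(path):
    """Read a PMCID list file (one numeric ID per line), skipping blank lines.

    Raises ValueError for entries that are not plain numbers, e.g. a
    'PMC'-prefixed ID pasted by mistake.
    """
    ids = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        if not line.isdigit():
            raise ValueError(f"not a numeric PMCID: {line!r}")
        ids.append(line)
    return ids
```

A quick check like this catches 'PMC'-prefixed or otherwise malformed IDs before the download step filters them against the Open Access subset.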

If we use a query instead, we do not use the --pmcids_file option, but either --query or --query_file. Everything else works in the same way, and the rest of this documentation relies on an example that uses a query.

We must first define our query, with which PubMed Central will be searched for articles. It can be as simple as fMRI, or more specific, such as fMRI[Abstract] AND (2000[PubDate] : 2022[PubDate]). You can build the query using the PMC advanced search interface. For more information see the E-Utilities help. Some examples are provided in the pubget git repository, in docs/example_queries.

The query can be passed either as a string on the command-line with -q or --query or by passing the path of a text file containing the query with -f or --query_file.

If we have an NCBI API key (see details in the E-utilities documentation), we can provide it through the NCBI_API_KEY environment variable or through the --api_key command line argument (the latter has higher precedence).

We must also specify the directory in which all pubget data will be stored. It can be provided either as a command-line argument (as in the examples below), or by exporting the PUBGET_DATA_DIR environment variable. Subdirectories will be created for each different query. In the following we suppose we are storing our data in a directory called pubget_data.

We can thus download all articles with "fMRI" in their title published in 2019 by running:

pubget download -q "fMRI[Title] AND (2019[PubDate] : 2019[PubDate])" pubget_data

Note: writing the query in a file rather than passing it as an argument is more convenient for complex queries, for example those that contain whitespace, newlines or quotes. By storing it in a file we do not need to take care to quote or escape characters that would be interpreted by the shell. In this case we would store our query in a file, say query.txt:

fMRI[Title] AND (2019[PubDate] : 2019[PubDate])

and run

pubget download -f query.txt pubget_data

After running this command, these are the contents of our data directory:

· pubget_data
  └── query_3c0556e22a59e7d200f00ac8219dfd6c
      ├── articlesets
      │   ├── articleset_00000.xml
      │   └── info.json
      └── query.txt

pubget has created a directory for this query, query_3c0556e22a59e7d200f00ac8219dfd6c — in the following we will call it "the query directory". Its name contains the md5 checksum of the query (or PMCID list), which is useful for pubget to reuse the same directory if we run the same query again, but not very helpful for us humans. Therefore, we can use the --alias command-line argument to give this query an alternative name, and pubget will create a symbolic link for us. For example if we run the query above with the added option --alias "fMRI-2019", our pubget_data directory will look like this:

· pubget_data
  ├── fMRI-2019 -> query_3c0556e22a59e7d200f00ac8219dfd6c
  └── query_3c0556e22a59e7d200f00ac8219dfd6c

If we had used a PMCID list instead of a query, the directory name would start with pmcidList_ instead of query_.

If we used a query it will be stored in query.txt, and if we used a list of PMCIDs, in requested_pmcids.txt, in the query directory.
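The naming scheme described above can be sketched in a few lines of Python. This is an assumption-laden illustration, not pubget's own code: it supposes the checksum is the MD5 hex digest of the UTF-8-encoded query text, so verify the result against the directory pubget actually creates for your query.

```python
import hashlib


def query_dir_name(query: str) -> str:
    # Assumption: the query directory is named "query_" followed by the
    # MD5 hex digest of the UTF-8-encoded query string. Check this
    # against pubget's actual output before relying on it.
    return "query_" + hashlib.md5(query.encode("utf-8")).hexdigest()
```

A helper like this is handy for locating the query directory from a script when no --alias was given.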

Inside the query directory, the results of the bulk download are stored in the articlesets subdirectory. The articles themselves are in XML files named articleset_*.xml, each bundling up to 500 articles. Here there is only one because the search returned fewer than 500 articles.

Some information about the download is stored in info.json. In particular, is_complete indicates if all articles matching the search have been downloaded. If the download was interrupted, some batches failed to download, or the number of results was limited by using the --n_docs parameter, is_complete will be false and the exit status of the program will be 1. You may want to re-run the command before moving on to the next step if the download is incomplete.
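Checking is_complete before moving on can be scripted. The sketch below assumes only what is stated above: that info.json lives in the articlesets directory and contains an is_complete boolean; the helper name download_is_complete is our own.

```python
import json
from pathlib import Path


def download_is_complete(articlesets_dir) -> bool:
    """Return the is_complete flag recorded in articlesets/info.json.

    Defaults to False if the flag is absent, so a missing or partial
    info.json is treated as an incomplete download.
    """
    info = json.loads((Path(articlesets_dir) / "info.json").read_text())
    return bool(info.get("is_complete", False))
```

A script can use this to decide whether to re-run pubget download before starting extraction.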

If we run the same query again, only missing batches will be downloaded. If we want to force re-running the search and downloading the whole data we need to remove the articlesets directory.

Step 2: extracting articles from bulk download

This step is executed by the pubget extract_articles command.

Once our download is complete, we extract articles and store each of them in a separate directory. To do so, we pass the articlesets directory created by the pubget download command in step 1:

pubget extract_articles pubget_data/query_3c0556e22a59e7d200f00ac8219dfd6c/articlesets

This creates an articles subdirectory in the query directory, containing the articles. To avoid having a large number of files in a single directory when there are many articles, which can be problematic on some filesystems, the articles are spread over many subdirectories. The names of these subdirectories range from 000 to fff, and an article goes in the subdirectory that matches the first 3 hexadecimal digits of the md5 hash of its pmcid.
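The bucketing rule just described can be sketched as follows. This is an illustration under an assumption: that the pmcid is hashed as its decimal string representation. Confirm against the directories pubget actually creates before using it to locate an article.

```python
import hashlib


def article_subdir(pmcid) -> str:
    # Assumption: the pmcid is hashed as its decimal string; the
    # subdirectory name is the first 3 hex digits of the MD5 digest.
    return hashlib.md5(str(pmcid).encode("utf-8")).hexdigest()[:3]
```

Given a pmcid, this yields the 000-fff bucket its directory should sit in.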

Our data directory now looks like this (with many articles omitted for conciseness):

· pubget_data
  └── query_3c0556e22a59e7d200f00ac8219dfd6c
      ├── articles
      │   ├── 000
      │   │   └── ...
      │   ├── ...
      │   └── fff
      │       └── ...
      ├── articlesets
      │   ├── articleset_00000.xml
      │   └── info.json
      └── query.txt