
✍️📜 Cadmus

This project aims to build an automated full-text retrieval system for generating large biomedical corpora from the published literature for research purposes. Cadmus has been developed for use in non-commercial research; use outwith this remit is not recommended, nor is it the intended purpose.


📋 Requirements

In order to run the code, you need a few things:

You need Java 7+.

You need to git clone the project and install it.

An API key from NCBI (this is used to search PubMed for articles using a search string or a list of PubMed IDs; you can find more information here).

If you are running Cadmus on a shared machine and you are not the owner of the Tika instances present in the tmp directory, you need to terminate them so that Cadmus can restart them for you.

Recommended requirements:

An API key from Wiley. This key gives you access to the OA publications, and to the publications you or your institution have the right to access, from Wiley. You can find more information here.

An API key from Elsevier. This key gives you access to the OA publications, and to the publications you or your institution have the right to access, from Elsevier. You can find more information here.


⚙️ Installation

Cadmus has a number of dependencies on other Python packages; it is recommended to install it in an isolated environment.

git clone https://github.com/biomedicalinformaticsgroup/cadmus.git
pip install ./cadmus
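For instance, a minimal sketch of an isolated setup using Python's standard venv module (the environment name cadmus-env is just a placeholder):

```shell
# create and activate a fresh virtual environment (name is arbitrary)
python3 -m venv cadmus-env
source cadmus-env/bin/activate

# then install as above, inside the environment
git clone https://github.com/biomedicalinformaticsgroup/cadmus.git
pip install ./cadmus
```

Any other environment manager (e.g. conda) works equally well; the point is to keep Cadmus's dependencies separate from your system Python.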

🚀 Get started

The format we are using for the search term(s) is the same as the one for PubMed. You can first try your search term(s) on PubMed and then use the same search term(s) as input for cadmus bioscraping.

In order to create your corpora, you are going to use the function called bioscraping. The function takes the following required parameters:

  1. A PubMed query string or a Python list of PubMed IDs
  2. An email address
  3. Your NCBI_API_KEY

The function can also receive optional parameters.

  1. The wiley_api_key parameter allows Wiley to identify which publications you or your institution have the right to access. It gives you access to OA publications that you would not get without the key. RECOMMENDED
  2. The elsevier_api_key parameter allows Elsevier to identify which publications you or your institution have the right to access. It gives you access to OA publications that you would not get without the key. RECOMMENDED
  3. The "start" parameter tells the function which service the run had reached before failure (e.g. crossref, doi, PubMed Central API, ...).
  4. The "idx" parameter tells the function the last saved row index (article).

Start and idx are designed to be used when restarting Cadmus after a program failure. While Cadmus is running, a progress line is repeatedly printed at the top of the live output; it shows the stage and index at which your output dataframe was last saved, in case of failure for whatever reason. By passing these values as the optional parameters, the program will pick up where it left off, saving you from starting the process from the beginning again.
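As an illustration, a hypothetical restart call might look like this (the stage name and index below are made up — use the values shown in the last progress line of your failed run):

```python
from cadmus import bioscraping
bioscraping(
    INPUT,        #type str, same query as the interrupted run
    EMAIL,        #type str
    NCBI_API_KEY, #type str
    start = 'crossref', # stage from the last saved-progress line
    idx = 1042          # row index from the last saved-progress line
    )
```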

  5. "full_search", in case you want to check whether a document has become available since the last time you tried. "full_search" has three predefined values:

    • The default value is 'None': the function only looks for new articles since the last run.
    • 'light': the function looks for new articles since the last run and also retries the rows where no format was retrieved.
    • 'heavy': the function looks for new articles since the last run and also retries the rows where it did not retrieve at least one tagged version (i.e. HTML or XML) in combination with the PDF format.
  6. The "keep_abstract" parameter has the default value 'True' and can be changed to 'False'. When set to 'True', our parsing loads each format from the beginning of the document. When set to 'False', our parsing tries to identify the abstract in each format and starts extracting the text after it. We offer the option of removing the abstract, but we cannot guarantee the reliability of our approach for doing so. If you would like to apply your own parsing method for removing the abstract, feel free to load any file saved during retrieval, available in the output folder: "output/formats/{format}s/{index}.{suffix}.zip".
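As a starting point for such custom post-processing, here is a minimal sketch for reading one of those saved archives (the helper name read_saved_format is ours, and we assume each archive holds a single file):

```python
import zipfile

def read_saved_format(zip_path):
    """Return the decoded contents of the single file inside a saved format zip."""
    with zipfile.ZipFile(zip_path) as z:
        # each format archive is assumed to contain exactly one file
        name = z.namelist()[0]
        return z.read(name).decode("utf-8")
```

You can then strip the abstract from the returned string with whatever heuristic suits your corpus.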

You can now run bioscraping with the following example:

Minimum requirements:

from cadmus import bioscraping
bioscraping(
    INPUT, #type str
    EMAIL, #type str
    NCBI_API_KEY #type str
    )

Minimum recommended requirements:

from cadmus import bioscraping
bioscraping(
    INPUT, #type str
    EMAIL, #type str
    NCBI_API_KEY, #type str
    wiley_api_key = YOUR_WILEY_API_KEY, #type str
    elsevier_api_key = YOUR_ELSEVIER_API_KEY #type str
    )

🔬 Load the result

The output from Cadmus is a directory with the content text of each retrieved publication saved as a zip file containing a txt file. You can find the files here: "./output/retrieved_parsed_files/content_text/*.txt.zip". It also provides the metadata, saved both as a zip file containing a JSON file and as a zip file containing a TSV file. In order to load the metadata, you can use the following lines of code.

import zipfile
import json
import pandas as pd

with zipfile.ZipFile("./output/retrieved_df/retrieved_df2.json.zip", "r") as z:
    for filename in z.namelist():
        with z.open(filename) as f:
            # the archive contains a single JSON file
            data = json.loads(f.read())

metadata_retrieved_df = pd.DataFrame.from_dict(data, orient='index')
metadata_retrieved_df.pmid = metadata_retrieved_df.pmid.astype(str)
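Once loaded, the 1/0 format-flag columns described under "Output details" make it easy to subset the metadata. A sketch, using a toy dataframe in place of the real one, that keeps only rows where at least one format was retrieved:

```python
import pandas as pd

# toy stand-in for metadata_retrieved_df with the 1/0 format flag columns
df = pd.DataFrame(
    {"pmid":  ["111", "222", "333"],
     "pdf":   [1, 0, 0],
     "xml":   [0, 0, 1],
     "html":  [0, 0, 1],
     "plain": [0, 0, 0]}
)

# keep rows where any of the four formats was successfully downloaded
has_any_format = df[["pdf", "xml", "html", "plain"]].sum(axis=1) > 0
retrieved = df[has_any_format]
print(retrieved.pmid.tolist())  # → ['111', '333']
```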

Here is a helper function you can call to generate a DataFrame with the same index as the one used for the metadata and the content text. The content text is the "best" available representation of the full text, chosen from the available formats in order of cleanliness: XML, HTML, plain text, then PDF. It is advised to keep the result somewhere other than the output directory; as the DataFrame gets bigger, the function takes longer to run.

from cadmus import parsed_to_df
retrieved_df = parsed_to_df(path = './output/retrieved_parsed_files/content_text/')

By default, we assume the files are in "./output/retrieved_parsed_files/content_text/"; please change the parameter 'path' otherwise.


🔎 Output details

retrieved_df

The metadata output is a pandas dataframe saved as a zip containing a JSON file.
This is stored at "./output/retrieved_df/retrieved_df2.json.zip". The dataframe columns are:

  • pmid <class 'int64'>
    • PubMed ID. If you prefer to change the data type of PMIDs to <class 'str'>, you can use the following example: metadata_retrieved_df.pmid = metadata_retrieved_df.pmid.astype(str)
  • pmcid <class 'float'>
    • PubMed Central ID.
  • title <class 'str'>
  • abstract <class 'str'>
    • Abstract (from PubMed metadata).
  • mesh <class 'list'>
    • MeSH (Medical Subject Headings) provided by Medline.
  • keywords <class 'list'>
    • This field contains largely non-MeSH subject terms that describe the content of the article. Beginning in January 2013, these are the author-supplied keywords.
  • authors <class 'list'>
  • journal <class 'str'>
  • pub_type <class 'list'>
    • Publication type (from PubMed metadata).
  • pub_date <class 'str'>
    • Publication date (from PubMed metadata).
  • doi <class 'str'>
  • issn <class 'str'>
  • crossref <class 'numpy.int64'>
    • 1/0 for the presence of a crossref record when searching on doi.
  • full_text_links <class 'dict'>
    • dict_keys:
      • 'cr_tdm' (list of crossref tdm links),
      • 'html_parse' (list of links parsed from HTML files),
      • 'pubmed_links' (list of links from "linkout" section on PubMed page, not including PMC).
  • licenses <class 'list'>
  • pdf <class 'numpy.int64'>
    • (1/0) for successful download of the PDF version.
  • xml <class 'numpy.int64'>
    • (1/0) for successful download of the XML version.
  • html <class 'numpy.int64'>
    • (1/0) for successful download of the HTML version.
  • plain <class 'numpy.int64'>
    • (1/0) for successful download of the plain text version.
  • pmc_tgz <class 'numpy.int64'>
    • (1/0) for successful download of the PubMed Central tar.gz archive.
  • xml_parse_d <class 'dict'>
  • html_parse_d <class 'dict'>
  • pdf_parse_d <class 'dict'>
  • plain_parse_d <class 'dict'>
    • all parse_d have the same structure in the dictionary
    • dict_keys:
      • 'file_path' (string representation of the path to the raw file saved at "output/formats/{format}s/{index}.{suffix}.zip"),
      • 'size' (file size - bytes),
      • 'wc' (rough word count based on string.split() for the content text (int)),
      • 'wc_abs' (rough word count based on string.split() for the abstract (int)).
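To illustrate the shape of these columns, a small sketch using a hand-built row (all values are made up, mirroring the dict_keys listed above):

```python
# a made-up row illustrating the format flags and a parse_d dictionary
row = {
    "pdf": 1,  # PDF version was downloaded
    "xml_parse_d": {
        "file_path": "output/formats/xmls/0.xml.zip",  # illustrative path
        "size": 48213,   # file size in bytes
        "wc": 5204,      # rough word count of the content text
        "wc_abs": 212,   # rough word count of the abstract
    },
}

# e.g. inspect which versions exist and how large the parsed text is
if row["pdf"]:
    print("PDF was downloaded")
print(row["xml_parse_d"]["wc"])  # → 5204
```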