The COVID-19 Open Research Dataset (CORD-19)

CORD-19 is a corpus of academic papers about COVID-19 and related coronavirus research. It's curated and maintained by the Semantic Scholar team at the Allen Institute for AI to support text mining and NLP research. Please read our paper for an in-depth description of how it was created: https://www.aclweb.org/anthology/2020.nlpcovid19-acl.1/

The final version of CORD-19 was released on June 2, 2022. Since we launched the dataset on March 13, 2020, we have released an updated version of the dataset almost every week. Starting from around 40K articles in its first version, the dataset has grown to index over 1M papers, and includes full text content for nearly 370K papers. We thank you for your support and feedback throughout this process. For more information, please see this blog post. A list of alternate data resources is provided under Other resources.

Updates

  • 2022-06-02 - Final release of CORD-19
  • 2021-03-01 - Review article published in Briefings in Bioinformatics
  • 2020-07-09 - CORD-19 presented at the NLP-COVID workshop.
  • 2020-03-13 - CORD-19 initial release

Important notes

We have performed data cleaning sufficient to fuel most text mining & NLP research efforts, but we do not intend to provide cleaning sufficient for directly consuming (reading) papers about COVID-19 or coronaviruses. There will always be some amount of error, which will make CORD-19 more or less usable for certain applications than others. We leave this determination to the user, though please feel free to consult us for recommendations.

While CORD-19 was initially released on 2020-03-13, the current schema is defined based on an update from 2020-05-26. Older versions of CORD-19 will not necessarily adhere exactly to the schema defined in this README. Please reach out for help if you are working with older CORD-19 versions.

Download

All versions of CORD-19 can be found HERE.

First published version (2020-03-13): Download Link (size: 0.3Gb, md5: a36fe181, sha1: 8fbea927)

Last published version (2022-06-02): Download Link (size: 18.7Gb, md5: c557069e, sha1: dd2c32bc)
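The md5 and sha1 values above (which appear to be 8-character prefixes of the full digests) let you check a download for corruption. A minimal sketch using Python's standard hashlib; the filename and prefix comparison are assumptions based on the listing above:

```python
import hashlib

def file_digests(path, chunk_size=1 << 20):
    """Compute md5 and sha1 hex digests of a file, streaming in chunks
    so large tarballs don't need to fit in memory."""
    md5, sha1 = hashlib.md5(), hashlib.sha1()
    with open(path, 'rb') as f:
        while chunk := f.read(chunk_size):
            md5.update(chunk)
            sha1.update(chunk)
    return md5.hexdigest(), sha1.hexdigest()

# Compare against the listed prefixes, e.g. for the last published version:
# md5, sha1 = file_digests('cord-19_2022-06-02.tar.gz')
# assert md5.startswith('c557069e') and sha1.startswith('dd2c32bc')
```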

Dataset Versions Used for TREC-COVID Shared Task

TREC-COVID Shared Task Website: https://ir.nist.gov/covidSubmit/index.html

| TREC-COVID | Date | Changelog | Link to download | md5 | sha1 |
| :--- | :--- | :--- | :--- | :--- | :--- |
| Round 1 | 2020-04-10 | link | cord-19_2020-04-10.tar.gz (1.5GB) | f4c3e742 | 4980d8ee |
| Round 2 | 2020-05-01 | link | cord-19_2020-05-01.tar.gz (1.7GB) | e8c56920 | dc22dbc9 |
| Round 3 | 2020-05-19 | link | cord-19_2020-05-19.tar.gz (2.8GB) | 6424de9c | 1781b935 |
| Round 4 | 2020-06-19 | link | cord-19_2020-06-19.tar.gz (3.3GB) | 47b61215 | fdd0490e |
| Round 5 | 2020-07-16 | link | cord-19_2020-07-16.tar.gz (3.7GB) | 018c4bc4 | 7adcf31a |

Dataset Versions Used for EPIC-QA Shared Task

EPIC-QA Shared Task Website: https://bionlp.nlm.nih.gov/epic_qa/

| EPIC-QA | Date | Changelog | Link to download | md5 | sha1 |
| :--- | :--- | :--- | :--- | :--- | :--- |
| Preliminary round | 2020-06-19 | link | cord-19_2020-06-19.tar.gz (3.3GB) | 47b61215 | fdd0490e |
| Primary round | 2020-10-22 | link | cord-19_2020-10-22.tar.gz (5.3GB) | 7cb9e743 | 7efe285f |

Overview

CORD-19 is released weekly. Each version of the corpus is tagged with a datestamp (e.g. 2020-05-26). Releases look like:

|-- 2020-05-26/
    |-- changelog
    |-- cord_19_embeddings.tar.gz
    |-- document_parses.tar.gz
    |-- metadata.csv
|-- 2020-05-27/
|-- ...
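Because each release directory is named with an ISO datestamp, the directory names sort chronologically as plain strings, so the newest snapshot can be found with a simple max(). A minimal sketch, assuming the release directories sit under a local root such as `releases/`:

```python
import os

def latest_release(root):
    """Return the name of the most recent CORD-19 release directory
    under `root`. ISO datestamps (YYYY-MM-DD) sort lexicographically
    in chronological order, so max() over the names suffices."""
    versions = [d for d in os.listdir(root)
                if os.path.isdir(os.path.join(root, d))]
    return max(versions)
```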

The files in each version are:

  • changelog: A text file summarizing changes between this and the previous version.
  • cord_19_embeddings.tar.gz: A collection of precomputed SPECTER document embeddings for each CORD-19 paper
  • document_parses.tar.gz: A collection of JSON files that contain full text parses of a subset of CORD-19 papers
  • metadata.csv: Metadata for all CORD-19 papers.

When cord_19_embeddings.tar.gz is uncompressed, it is a 769-column CSV file, where the first column is the cord_uid and the remaining columns correspond to a 768-dimensional document embedding. For example:

ug7v899j,-2.939983606338501,-6.312200546264648,-1.0459030866622925,5.164162635803223,-0.32564637064933777,-2.507413387298584,1.735608696937561,1.9363566637039185,0.622501015663147,1.5613162517547607,...
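Since the embeddings file has no header row, it can be read with the plain csv reader, taking the first column as the `cord_uid` and the rest as the vector. A sketch (the local filename is an assumption; inside the archive it may differ):

```python
import csv

def load_embeddings(path):
    """Map each cord_uid to its 768-dimensional SPECTER embedding,
    parsing the remaining columns of each row as floats."""
    embeddings = {}
    with open(path) as f:
        for row in csv.reader(f):
            embeddings[row[0]] = [float(x) for x in row[1:]]
    return embeddings
```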

When document_parses.tar.gz is uncompressed, it is a directory:

|-- document_parses/
    |-- pdf_json/
        |-- 80013c44d7d2d3949096511ad6fa424a2c740813.json
        |-- bfe20b3580e7c539c16ce4b1e424caf917d3be39.json
        |-- ...
    |-- pmc_json/
        |-- PMC7096781.xml.json
        |-- PMC7118448.xml.json
        |-- ...

Example usage

We recommend everyone primarily use metadata.csv and augment it when needed with full text from document_parses/. For example, let's say we want to collect the titles, abstracts, and introductions of papers. In Python, such a script might look like:

import csv
import json
from collections import defaultdict

cord_uid_to_text = defaultdict(list)

# open the file
with open('metadata.csv') as f_in:
    reader = csv.DictReader(f_in)
    for row in reader:

        # access some metadata
        cord_uid = row['cord_uid']
        title = row['title']
        abstract = row['abstract']
        authors = row['authors'].split('; ')

        # access the full text (if available) for Intro
        introduction = []
        if row['pdf_json_files']:
            for json_path in row['pdf_json_files'].split('; '):
                with open(json_path) as f_json:
                    full_text_dict = json.load(f_json)

                    # grab introduction section from *some* version of the full text
                    for paragraph_dict in full_text_dict['body_text']:
                        paragraph_text = paragraph_dict['text']
                        section_name = paragraph_dict['section']
                        if 'intro' in section_name.lower():
                            introduction.append(paragraph_text)

                    # stop searching other copies of full text if already got introduction
                    if introduction:
                        break

        # save for later usage
        cord_uid_to_text[cord_uid].append({
            'title': title,
            'abstract': abstract,
            'introduction': introduction
        })

metadata.csv overview

We recommend everyone work with metadata.csv as the starting point. This file is comma-separated with the following columns:

  • cord_uid: A str-valued field that assigns a unique identifier to each CORD-19 paper. This is not necessarily unique per row, which is explained in the FAQs.
  • sha: A List[str]-valued field that is the SHA1 of all PDFs associated with the CORD-19 paper. Most papers will have either zero or one value here (since we either have a PDF or we don't), but some papers will have multiple. For example, the main paper might have supplemental information saved in a separate PDF. Or we might have two separate PDF copies of the same paper. If multiple PDFs exist, their SHA1 will be semicolon-separated (e.g. '4eb6e165ee705e2ae2a24ed2d4e67da42831ff4a; d4f0247db5e916c20eae3f6d772e8572eb828236')
  • source_x: A List[str]-valued field that is the names of sources from which we received this paper. Also semicolon-separated. For example, 'ArXiv; Elsevier; PMC; WHO'. There should always be at least one source listed.
  • title: A str-valued field for the paper title
  • doi: A str-valued field for the paper DOI
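Several of these columns (such as sha and source_x) pack multiple values into one semicolon-separated cell. A small helper for splitting them; the function name is illustrative:

```python
def split_multivalued(cell):
    """Split a semicolon-separated metadata cell into a list of values.
    Empty cells become empty lists rather than ['']."""
    return [v.strip() for v in cell.split(';')] if cell else []

# e.g. the sha column for a paper with two associated PDFs:
shas = split_multivalued(
    '4eb6e165ee705e2ae2a24ed2d4e67da42831ff4a; '
    'd4f0247db5e916c20eae3f6d772e8572eb828236')
```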