Covid19Search

This repository contains source code for searching COVID-19-related papers in the COVID-19 Open Research Dataset (CORD-19). It also provides a solution to the tasks in the COVID-19 Open Research Dataset Challenge (CORD-19) on Kaggle. Update: 2020-04-14.

Features

Supports multiple bag-of-words models (count, tf-idf, bm25).
Supports semantic search models such as fasttext and glove.
Allows the two types of models above to be combined.
Provides a live web application in which the models can be customized for end-users.
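
To illustrate what combining the two model types can look like, here is a minimal sketch of score fusion. It is not taken from this repository: the function names (min_max_normalize, combine_scores), the bow_weight parameter, and the sample scores are all hypothetical, and the actual ensemble logic in examples/ensemble_run.py may differ. The idea is to rescale each model's scores to a common range and then blend them with a weighted sum.

```python
def min_max_normalize(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        # All scores are equal, so no document is preferred over another.
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - low) / (high - low) for doc_id, s in scores.items()}


def combine_scores(bow_scores, embedding_scores, bow_weight=0.5):
    """Blend two score mappings with a weighted sum after normalization."""
    bow = min_max_normalize(bow_scores)
    emb = min_max_normalize(embedding_scores)
    combined = {}
    for doc_id in set(bow) | set(emb):
        # A document missing from one model contributes 0 for that model.
        combined[doc_id] = (bow_weight * bow.get(doc_id, 0.0)
                            + (1.0 - bow_weight) * emb.get(doc_id, 0.0))
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores: raw bm25-style values and cosine similarities.
bow_scores = {"paper_a": 12.0, "paper_b": 4.0, "paper_c": 8.0}
embedding_scores = {"paper_a": 0.2, "paper_b": 0.6, "paper_c": 0.9}
ranking = combine_scores(bow_scores, embedding_scores, bow_weight=0.5)
print(ranking)
```

Normalizing first matters because bag-of-words scores and embedding similarities usually live on different scales, so a raw sum would let one model dominate.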

Quick Start

git clone https://github.com/wangcongcong123/covidsearch.git
cd covidsearch
pip install -e .

import time

from cord import *

# make sure to put the paper collections (four .tar.gz files) and the metadata csv file under dataset_folder
dataset_folder = "dataset/"
# load metadata and full texts of papers
metadata = load_metadata_papers(dataset_folder, "metadata.csv")
full_papers = load_full_papers(dataset_folder)
# full_input_instances include title, abstract, body text
full_input_instances = [(id_, metadata[id_]["title"], metadata[id_]["abstract"], body) for id_, body in
                        full_papers.items() if id_ in metadata]
tfidf_model = FullTextModel(full_input_instances, weights=[3, 2, 1], vectorizer_type="tfidf")
query = "covid-19 transmission characteristics"
top_k = 10
start = time.time()
results = tfidf_model.query(query, top_k=top_k)
print("Query time: ", time.time() - start)
# around 0.3 s on re-runs (the first run takes longer because of object serialisation)
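
For readers curious about what a field-weighted tf-idf query involves, the following standalone sketch may help. It does not use the cord package and is not a description of how FullTextModel applies its weights argument, which this README does not document. It assumes that each weight scales the term counts of its field (title, abstract, body), and the helper names (tokenize, weighted_term_counts, rank) and sample documents are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def weighted_term_counts(fields, weights):
    """Sum term counts over the fields, scaling each field by its weight."""
    counts = Counter()
    for text, weight in zip(fields, weights):
        for term in tokenize(text):
            counts[term] += weight
    return counts


def rank(documents, query, weights=(3, 2, 1), top_k=10):
    """Rank {doc_id: (title, abstract, body)} against a query with tf-idf."""
    doc_counts = {doc_id: weighted_term_counts(fields, weights)
                  for doc_id, fields in documents.items()}
    n_docs = len(doc_counts)
    idf = {}
    for term in set(tokenize(query)):
        df = sum(1 for counts in doc_counts.values() if term in counts)
        if df:
            # Smoothed inverse document frequency.
            idf[term] = math.log((1 + n_docs) / (1 + df)) + 1.0
    scores = {doc_id: sum(counts[term] * value for term, value in idf.items())
              for doc_id, counts in doc_counts.items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Hypothetical toy collection.
documents = {
    "p1": ("covid-19 transmission", "how the virus spreads", "transmission in households"),
    "p2": ("vaccine trial", "covid-19 vaccine results", "dose and response"),
    "p3": ("bat ecology", "habitat survey", "field notes"),
}
toy_results = rank(documents, "covid-19 transmission", top_k=3)
print(toy_results)
```

With weights of (3, 2, 1), a query term found in a title counts three times as much as the same term found in the body text.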

Examples

Bag-of-words search: includes count, tf-idf, and bm25 (examples/full_text_run.py).
Embedding-based search: includes fasttext and glove (examples/embedding_run.py).
Model combinations: combines the two types of models above (examples/ensemble_run.py).
Pre-train insights: pre-trains insights based on the Kaggle tasks (examples/insight_from_scratch.py).
Insights extraction: loads pre-trained insights for the Kaggle tasks (examples/insight_extract.py).

Try running python examples/insight_extract.py, which loads a pre-trained insights file and presents it to you. If you do not want to use the pre-trained insights, you can pre-train them from scratch with python examples/insight_from_scratch.py (have a look at this file to customize the pre-training process).

Start as a web server

This section demonstrates only the pre-trained insights as an example. To customise the application (for query search, say), edit app.py and templates/layout.html; both are easy to follow. Make sure you first download metadata.csv from the CORD-19 dataset and put it under ./dataset, then enter:

python app.py

Open http://127.0.0.1:5000 in a browser to see the web application.

Server as a service

The server can also be requested cross-origin.
Send a GET or POST request to obtain insights by task name.
An example GET request: http://127.0.0.1:5000/kaggle_task?task_name=task1.
An example POST request: curl -i -X POST -H "Content-Type: application/json" -d "{\"task_name\":\"task1\"}" http://127.0.0.1:5000/kaggle_task.
Adapt these to Ajax GET/POST requests if you want to embed the service in your own front-end HTML pages.
Try the live one: https://www.thinkingso.cf/kaggle_task?task_name=task1
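
The same two requests can be sketched from Python using only the standard library. The endpoint path and the task_name parameter come from the examples above; the helper names (build_get_request, build_post_request, fetch) are hypothetical, and because the response format is not described here, fetch simply returns the raw body as text.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://127.0.0.1:5000"  # local server started with `python app.py`


def build_get_request(task_name, base_url=BASE_URL):
    """Build a GET request that passes task_name as a query parameter."""
    query = urllib.parse.urlencode({"task_name": task_name})
    return urllib.request.Request(f"{base_url}/kaggle_task?{query}", method="GET")


def build_post_request(task_name, base_url=BASE_URL):
    """Build a POST request that passes task_name in a JSON body."""
    body = json.dumps({"task_name": task_name}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/kaggle_task",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch(request, timeout=10):
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


get_request = build_get_request("task1")
post_request = build_post_request("task1")
print(get_request.full_url)
print(post_request.get_method(), post_request.data)
# With the server running, send either request:
# print(fetch(get_request))
```

Building the requests separately from sending them makes it easy to inspect the URL and payload before the server is running.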

Contributions

Feedback and pull requests are welcome to help improve the project.
