
Covidsearch

This repository provides a retrieval system for searching COVID-19-related papers, built on the CORD-19 dataset.


Covid19Search

Contributions welcome. License: MIT.

This repository contains source code for searching COVID-19-related papers in the COVID-19 Open Research Dataset (CORD-19). It also provides a solution to the tasks in the COVID-19 Open Research Dataset Challenge (CORD-19) on Kaggle. Last updated: 2020-04-14.

Features

  • Supports multiple bag-of-words models (count, tf-idf, bm25).
  • Supports semantic search models such as fastText and GloVe.
  • Allows the two types of models above to be combined.
  • Provides a live web application in which end users can customize the models.
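
As a rough illustration of what combining a bag-of-words model with a semantic model can look like, the sketch below blends two sets of scores after min-max normalisation. This is an assumption about one common fusion strategy, not a description of how this repository combines its models; the function names (`min_max`, `combine_scores`), the `alpha` weight, and the scores themselves are all hypothetical.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the [0, 1] range."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores are equal, so there is nothing to distinguish: map to 0.
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def combine_scores(lexical, semantic, alpha=0.5):
    """Blend two score mappings; alpha is the weight given to the lexical scores."""
    lexical, semantic = min_max(lexical), min_max(semantic)
    doc_ids = set(lexical) | set(semantic)
    combined = {
        doc_id: alpha * lexical.get(doc_id, 0.0)
        + (1 - alpha) * semantic.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
    # Highest combined score first; ties are broken by doc_id for determinism.
    return sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))


# Made-up scores for three hypothetical papers.
bm25_scores = {"paper_a": 12.0, "paper_b": 4.0, "paper_c": 8.0}
embedding_scores = {"paper_a": 0.2, "paper_b": 0.9, "paper_c": 0.6}
ranking = combine_scores(bm25_scores, embedding_scores, alpha=0.5)
print(ranking)
```

Normalising first matters because bag-of-words scores and embedding similarities usually live on different scales; without it, one model would dominate the blend regardless of `alpha`.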

Quick Start

git clone https://github.com/wangcongcong123/covidsearch.git
cd covidsearch
pip install -e .

Then, in Python:

import time

from cord import *

# make sure the paper collections (four .tar.gz files) and the metadata csv file are under dataset_folder
dataset_folder = "dataset/"
# load metadata and full texts of papers
metadata = load_metadata_papers(dataset_folder, "metadata.csv")
full_papers = load_full_papers(dataset_folder)
# full_input_instances include title, abstract, body text
full_input_instances = [(id_, metadata[id_]["title"], metadata[id_]["abstract"], body) for id_, body in
                        full_papers.items() if id_ in metadata]
tfidf_model = FullTextModel(full_input_instances, weights=[3, 2, 1], vectorizer_type="tfidf")
query = "covid-19 transmission characteristics"
top_k = 10
start = time.time()
results = tfidf_model.query(query, top_k=top_k)
print("Query time: ", time.time() - start)
# around 0.3 s on re-runs (the first run takes longer because of object serialisation)
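
The `weights=[3, 2, 1]` argument above presumably gives the title, abstract, and body text decreasing importance, in that order. That reading is an assumption, and the sketch below is not the `FullTextModel` implementation: it is a minimal, standard-library illustration of field-weighted tf-idf with cosine similarity, using hypothetical helper names and invented toy documents.

```python
import math
from collections import Counter


def weighted_counts(fields, weights):
    """Sum term counts over the fields, scaling each field by its weight."""
    counts = Counter()
    for text, weight in zip(fields, weights):
        for token in text.lower().split():
            counts[token] += weight
    return counts


def build_index(docs, weights):
    """Build tf-idf vectors for {doc_id: (title, abstract, body)}."""
    counts = {doc_id: weighted_counts(f, weights) for doc_id, f in docs.items()}
    df = Counter(token for c in counts.values() for token in c)
    n = len(docs)
    # Smoothed idf, so terms present in every document keep a positive weight.
    idf = {token: math.log((1 + n) / (1 + d)) + 1 for token, d in df.items()}
    vectors = {
        doc_id: {t: tf * idf[t] for t, tf in c.items()}
        for doc_id, c in counts.items()
    }
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def search(query, vectors, idf, top_k=10):
    """Rank documents by cosine similarity to the query, best first."""
    q_counts = Counter(query.lower().split())
    q = {t: c * idf[t] for t, c in q_counts.items() if t in idf}
    scored = [(doc_id, cosine(q, vec)) for doc_id, vec in vectors.items()]
    return sorted(scored, key=lambda kv: -kv[1])[:top_k]


# Toy documents invented for illustration: (title, abstract, body).
docs = {
    "d1": ("virus transmission", "how the virus spreads", "details on masks"),
    "d2": ("vaccine trials", "early results", "notes on transmission in households"),
}
vectors, idf = build_index(docs, weights=[3, 2, 1])
toy_results = search("transmission", vectors, idf, top_k=2)
print(toy_results)
```

In this toy example "transmission" appears in the title of `d1` but only in the body of `d2`, so the field weights push `d1` ahead even though both documents contain the term once.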

Examples

Run python examples/insight_extract.py to load and display a pre-trained insights file. If you do not want to use the pre-trained insights, you can generate them from scratch with python examples/insight_from_scratch.py (have a look at this file to customize the pre-training process).

Start as a web server

The web server demonstrates the pre-trained insights as an example. To customise it (for example, for query search), edit app.py and templates/layout.html. Make sure you first download metadata.csv from the CORD-19 dataset and put it under ./dataset, then run:

python app.py

Then open http://127.0.0.1:5000 in a browser to see the web application.

Server as a service

  • The server also accepts cross-origin requests.
  • Send a GET or POST request to obtain insights by task name.
  • Example GET request: http://127.0.0.1:5000/kaggle_task?task_name=task1
  • Example POST request: curl -i -X POST -H "Content-Type: application/json" -d "{\"task_name\":\"task1\"}" http://127.0.0.1:5000/kaggle_task
  • Adapt these to Ajax GET/POST requests if you want to embed the insights in your own front-end HTML pages.
  • Try the live one: https://www.thinkingso.cf/kaggle_task?task_name=task1
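
The GET and POST examples above can also be issued from Python. The sketch below uses only the standard library and mirrors the URL, JSON body, and header from the curl example; the helper names (`build_get_request`, `build_post_request`, `fetch`) are hypothetical, and the response format is not documented here, so `fetch` returns the raw text rather than parsing it.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://127.0.0.1:5000"  # local server started with `python app.py`


def build_get_request(task_name, base_url=BASE_URL):
    """Build a GET request that passes task_name as a query parameter."""
    query = urllib.parse.urlencode({"task_name": task_name})
    return urllib.request.Request(f"{base_url}/kaggle_task?{query}", method="GET")


def build_post_request(task_name, base_url=BASE_URL):
    """Build a POST request that passes task_name in a JSON body."""
    body = json.dumps({"task_name": task_name}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/kaggle_task",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch(request, timeout=10):
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With the server running, `print(fetch(build_get_request("task1")))` or `print(fetch(build_post_request("task1")))` should print whatever the endpoint returns for that task.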

Contributions

Feedback and pull requests are welcome to help improve the project.
