BioDEX
BioDEX: Large-Scale Biomedical Adverse Drug Event Extraction for Real-World Pharmacovigilance.
This is the official repository for the BioDEX paper.
BioDEX is a raw resource for drug safety monitoring that bundles full-text and abstract-only PubMed papers with drug safety reports. These reports contain structured information about the Adverse Drug Events (ADEs) described in the papers, and are produced by medical experts in real-world settings.
BioDEX contains 19k full-text papers, 65k abstracts, and over 256k associated drug-safety reports.
Our data and models are available on Hugging Face. If you're interested in full drug reports, use BioDEX-ICSR. If you're here to only extract reactions (as in In-Context Learning for Extreme Multi-Label Classification), use BioDEX-Reactions.
Overview of this repository
This repository is structured as follows:
- `demo.ipynb` contains some quick demonstrations of the data.
- `analysis/` contains the data and notebooks to reproduce all plots in the paper.
- `src/` contains all code to represent the data objects and calculate the metrics.
- `data_creation/` contains the code to create the Report-Extraction dataset starting from the raw resource. Code to create the raw resource from scratch will be released soon.
- `task/icsr_extraction/` contains the code to train and evaluate models for the Report-Extraction task.
Overview of this readme
- Installation
- Demos
- Train and Evaluate models
- Limitations
- Contact
- Data License
- Citation
- BioDEX Data Schema
Installation
Create the conda environment and install the code:
```shell
conda create -n biodex python=3.9
conda activate biodex
pip install -r requirements.txt
pip install .
```
Demos
You can find the code for these demos in demo.ipynb or in the sections below.
Load the raw resource
```python
import datasets

# load the raw dataset
dataset = datasets.load_dataset("BioDEX/raw_dataset")['train']
print(len(dataset)) # 65,648

# investigate an example
article = dataset[1]['article']
report = dataset[1]['reports'][0]

print(article['title'])            # Case Report: Perioperative Kounis Syndrome in an Adolescent With Congenital Glaucoma.
print(article['abstract'])         # A 12-year-old male patient suffering from congenital glaucoma developed bradycardia, ...
print(article['fulltext'])         # ...
print(article['fulltext_license']) # CC BY

print(report['patient']['patientsex']) # 1
print(report['patient']['drug'][0]['activesubstance']['activesubstancename']) # ATROPINE SULFATE
print(report['patient']['drug'][0]['drugadministrationroute']) # 040
print(report['patient']['drug'][1]['activesubstance']['activesubstancename']) # MIDAZOLAM
print(report['patient']['drug'][1]['drugindication']) # Anaesthesia
print(report['patient']['reaction'][0]['reactionmeddrapt']) # Kounis syndrome
print(report['patient']['reaction'][1]['reactionmeddrapt']) # Hypersensitivity
```
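Some report fields hold numeric codes rather than readable values (e.g. `patientsex` above prints `1`). A small lookup makes demo output human-readable; a minimal sketch, assuming the ICH E2B / FAERS convention for `patientsex` (0 = Unknown, 1 = Male, 2 = Female) — verify against the official field reference before relying on it:

```python
# Assumed ICH E2B / FAERS coding for the patientsex field.
PATIENT_SEX = {"0": "Unknown", "1": "Male", "2": "Female"}

def decode_patientsex(code) -> str:
    """Map a raw patientsex code (int or str) to a readable label."""
    return PATIENT_SEX.get(str(code), f"Unrecognized code: {code}")

print(decode_patientsex(1))  # Male
```

The same pattern applies to other coded fields such as `drugadministrationroute`, given the appropriate code table.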
Optionally, use our code to parse the raw resource into Python objects for easy manipulation:
```python
import datasets
from src.utils import get_matches

# load the raw dataset
dataset = datasets.load_dataset("BioDEX/raw_dataset")['train']
dataset = get_matches(dataset)
print(len(dataset)) # 65,648

# investigate an example
article = dataset[1].article
report = dataset[1].reports[0]

print(article.title)            # Case Report: Perioperative Kounis Syndrome in an Adolescent With Congenital Glaucoma.
print(article.abstract)         # A 12-year-old male patient suffering from congenital glaucoma developed bradycardia, ...
print(article.fulltext)         # ...
print(article.fulltext_license) # CC BY

print(report.patient.patientsex) # 1
print(report.patient.drug[0].activesubstance.activesubstancename) # ATROPINE SULFATE
print(report.patient.drug[0].drugadministrationroute) # 040
print(report.patient.drug[1].activesubstance.activesubstancename) # MIDAZOLAM
print(report.patient.drug[1].drugindication) # Anaesthesia
print(report.patient.reaction[0].reactionmeddrapt) # Kounis syndrome
print(report.patient.reaction[1].reactionmeddrapt) # Hypersensitivity
```
Load the Report-Extraction dataset
```python
import datasets

# load the report-extraction dataset
dataset = datasets.load_dataset("BioDEX/BioDEX-ICSR")
print(len(dataset['train']))      # 9,624
print(len(dataset['validation'])) # 2,407
print(len(dataset['test']))       # 3,628

example = dataset['train'][0]
print(example['fulltext_processed'][:1000], '...') # TITLE: # SARS-CoV-2-related ARDS in a maintenance hemodialysis patient ...
print(example['target']) # serious: 1 patientsex: 1 drugs: ACETAMINOPHEN, ASPIRIN ...
```
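The flat `target` string can be split back into its fields with a small helper. A minimal sketch, assuming the `serious: ... patientsex: ... drugs: ... reactions: ...` layout shown above (the helper itself is not part of the BioDEX codebase):

```python
import re

def parse_target(target: str) -> dict:
    """Split a flat BioDEX-ICSR target string into its four fields."""
    # Each label is followed by its value, up to the next label or end of string.
    pattern = r"serious:\s*(.*?)\s*patientsex:\s*(.*?)\s*drugs:\s*(.*?)\s*reactions:\s*(.*)"
    match = re.fullmatch(pattern, target.strip(), flags=re.DOTALL)
    if match is None:
        raise ValueError(f"Unexpected target format: {target!r}")
    parsed = dict(zip(["serious", "patientsex", "drugs", "reactions"], match.groups()))
    # Drugs and reactions are comma-separated lists.
    parsed["drugs"] = [d.strip() for d in parsed["drugs"].split(",")]
    parsed["reactions"] = [r.strip() for r in parsed["reactions"].split(",")]
    return parsed

example = "serious: 1 patientsex: 2 drugs: AMLODIPINE BESYLATE, LISINOPRIL reactions: Intentional overdose, Metabolic acidosis"
print(parse_target(example))
```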
Use our fine-tuned Report-Extraction model
```python
from transformers import AutoTokenizer, T5ForConditionalGeneration
import datasets

# load the report-extraction dataset
dataset = datasets.load_dataset("BioDEX/BioDEX-ICSR")

# load the model
model_path = "BioDEX/flan-t5-large-report-extraction"
model = T5ForConditionalGeneration.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# get an input and encode it
input = dataset['validation'][1]['fulltext_processed']
input_encoded = tokenizer(input, max_length=2048, truncation=True, padding="max_length", return_tensors='pt')

# forward pass
output_encoded = model.generate(**input_encoded, max_length=256)
output = tokenizer.batch_decode(output_encoded, skip_special_tokens=True)
output = output[0]

print(output) # serious: 1 patientsex: 2 drugs: AMLODIPINE BESYLATE, LISINOPRIL reactions: Intentional overdose, Metabolic acidosis, Shock
```
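To eyeball model quality on a single example, you can compare the predicted string against the gold `target`. A rough sketch using set overlap of the comma-separated reaction lists (this is an illustration, not the paper's official evaluation metric; the two strings below are made-up stand-ins for a real prediction/gold pair):

```python
def reaction_set(target: str) -> set:
    """Extract the comma-separated reaction list from a flat target string."""
    reactions = target.split("reactions:")[-1]
    return {r.strip().lower() for r in reactions.split(",")}

# Illustrative strings in the BioDEX-ICSR target format (not real model output).
pred = "serious: 1 patientsex: 2 drugs: AMLODIPINE BESYLATE reactions: Intentional overdose, Shock"
gold = "serious: 1 patientsex: 2 drugs: AMLODIPINE BESYLATE, LISINOPRIL reactions: Intentional overdose, Metabolic acidosis, Shock"

overlap = reaction_set(pred) & reaction_set(gold)
precision = len(overlap) / len(reaction_set(pred))  # 2/2 = 1.0
recall = len(overlap) / len(reaction_set(gold))     # 2/3
print(f"precision={precision:.2f} recall={recall:.2f}")
```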
Train and evaluate Report-Extraction models
All code for this task is located in task/icsr_extraction/.
Make sure to activate the biodex environment!
Fine-tune a new Report-Extraction model
```shell
cd tasks/icsr_extraction

python run_encdec_for_icsr_extraction.py \
    --overwrite_cache False \
    --seed 42 \
    --dataset_name BioDEX/BioDEX-ICSR \
    --text_column fulltext_processed \
    --summary_column target \
    --model_name_or_path google/flan-t5-large \
    --output_dir ../../checkpoints/flan-t5-large-report-extraction \
    --max_source_length 2048 \
    --max_target_length 256 \
    --do_train True \
    --do_eval True \
    --lr_scheduler_type linear \
    --warmup_ratio 0.0 \
    --learning_rate 0.0001 \
    --optim adafactor \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --eval_accumulation_steps 16 \
    --num_train_epochs 5 \
    --bf16 True \
    --evaluation_strategy epoch \
    --logging_strategy steps \
    --save_strategy epoch \
    --logging_steps 100 \
    --save_total_limit 1 \
    --report_to wandb \
    --load_best_model_at_end True \
    --metric_for_best_model loss \
    --greater_is_better False \
    --predict_with_generate True \
    --generation_max_length 256 \
    --num_beams 1 \
    --repetition_penalty 1.0
```
Thus far, the paper only considers fine-tuning encoder-decoder models. Training a decoder-only model is still a work in progress, but we've supplied some code at `./tasks/icsr_extraction/run_decoder_for_icsr_extraction.py`.
Reproduce our fine-tune evaluation run
Using our model on Hugging Face.
```shell
cd tasks/icsr_extraction

python run_encdec_for_icsr_extraction.py \
    --overwrite_cache False \
    --seed 42 \
    --dataset_name BioDEX/BioDEX-ICSR \
    --text_column fulltext_processed \
    --summary_column target \
    --model_name_or_path BioDEX/flan-t5-large-report-extraction \
    --output_dir ../../checkpoints/flan-t5-large-report-extraction \
    --max_source_length 2048 \
    --max_target_length 256 \
    --do_train False \
    --do_eval True \
    --lr_scheduler_type linear \
    --warmup_ratio 0.0 \
    --learning_rate 0.0001 \
    --optim adafactor \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --eval_accumulation_steps 16 \
    --num_train_epochs 5 \
    --bf16 True \
    --evaluation_strategy epoch \
    --logging_strategy steps \
    --save_strategy epoch \
    --logging_steps 100 \
    --save_total_limit 1 \
    --report_to wandb \
    --load_best_model_at_end True \
    --metric_for_best_model loss \
    --greater_is_better False \
    --predict_with_generate True \
    --generation_max_length 256 \
    --num_beams 1 \
    --repetition_penalty 1.0
```
Add `--do_predict True` to get the results on the test set.
Reproduce our few-shot in-context learning results
We use the DSP framework to perform in-context learning experiments.
At the time of writing, DSP does not support a truncation strategy, which is vital for our task given the long inputs. To fix this and reproduce our results, replace the `predict.py` file of your local dsp package (`path/to/local/dsp/primitives/predict.py`) with the adapted version located at `tasks/icsr_extraction/dsp_predict_path.py`.
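One way to perform the replacement, assuming dsp is installed in the active environment (the `python -c` one-liner just locates the package directory; adjust paths to your setup):

```shell
# Locate the installed dsp package, then overwrite its predict.py
# with the patched version shipped in this repository.
DSP_DIR=$(python -c "import dsp, os; print(os.path.dirname(dsp.__file__))")
cp tasks/icsr_extraction/dsp_predict_path.py "$DSP_DIR/primitives/predict.py"
```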
Run text-davinci-003:
```shell
cd tasks/icsr_extraction

python run_gpt3_for_icsr_extraction.py \
    --max_dev_samples 100 \
    --max_tokens 128 \
    --max_prompt_length 4096 \
    --n_demos 7 \
    --output_dir ../../checkpoints/ \
    --model_na
```
