Biobert
Bioinformatics'2020: BioBERT: a pre-trained biomedical language representation model for biomedical text mining
BioBERT
This repository provides the code for fine-tuning BioBERT, a biomedical language representation model designed for biomedical text mining tasks such as biomedical named entity recognition, relation extraction, and question answering. Please refer to our paper, BioBERT: a pre-trained biomedical language representation model for biomedical text mining, for more details. This project is maintained by DMIS-Lab.
Download
We provide six versions of pre-trained weights. Pre-training was based on the original BERT code provided by Google, and training details are described in our paper. The currently available versions of pre-trained weights are as follows (SHA1SUM):
- BioBERT-Base v1.2 (+ PubMed 1M) - trained in the same way as BioBERT-Base v1.1 but includes LM head, which can be useful for probing (available in PyTorch)
- BioBERT-Large v1.1 (+ PubMed 1M) - based on BERT-large-Cased (custom 30k vocabulary), NER/QA Results
- BioBERT-Base v1.1 (+ PubMed 1M) - based on BERT-base-Cased (same vocabulary), Results in the Paper
- BioBERT-Base v1.0 (+ PubMed 200K) - based on BERT-base-Cased (same vocabulary), Results in the Paper
- BioBERT-Base v1.0 (+ PMC 270K) - based on BERT-base-Cased (same vocabulary), Results in the Paper
- BioBERT-Base v1.0 (+ PubMed 200K + PMC 270K) - based on BERT-base-Cased (same vocabulary), Results in the Paper
Note that the performance of the v1.0 and v1.1 base models (BioBERT-Base v1.0, BioBERT-Base v1.1) is reported in the paper. Alternatively, you can download pre-trained weights from here.
Installation
The sections below describe the installation and fine-tuning process of BioBERT based on TensorFlow 1 (Python version <= 3.7). For the PyTorch version of BioBERT, you can check out this repository. If you are not familiar with coding and just want to recognize biomedical entities in your text using BioBERT, please use this tool, which uses BioBERT for multi-type NER and normalization.
To fine-tune BioBERT, you need to download the pre-trained weights of BioBERT.
After downloading the pre-trained weights, clone the repository and install its dependencies from requirements.txt as follows:
$ git clone https://github.com/dmis-lab/biobert.git
$ cd biobert; pip install -r requirements.txt
Note that this repository is based on the BERT repository by Google.
All the fine-tuning experiments were conducted on a machine with a single TITAN Xp GPU, which has 12GB of memory.
You might want to install Java to use the official evaluation script of BioASQ. See requirements.txt for other details.
Quick Links
Link | Detail
------------- | -------------
BioBERT-PyTorch | PyTorch-based BioBERT implementation
BERN | Web-based biomedical NER + normalization using BioBERT
BERN2 | Advanced version of BERN (web-based biomedical NER) w/ NER from BioLM + NEN from PubMedBERT
covidAsk | BioBERT based real-time question answering model for COVID-19
7th BioASQ | Code for the seventh BioASQ challenge winning model (factoid/yesno/list)
Paper | Paper link with BibTeX (Bioinformatics)
FAQs
- How can I use BioBERT with PyTorch?
- Can I get word/sentence embeddings using BioBERT?
- How can I pre-train QA models on SQuAD?
- What vocabulary does BioBERT use?
Datasets
We provide a pre-processed version of benchmark datasets for each task as follows:
- Named Entity Recognition: (17.3 MB), 8 datasets on biomedical named entity recognition
- Relation Extraction: (2.5 MB), 2 datasets on biomedical relation extraction
- Question Answering: (5.23 MB), 3 datasets on biomedical question answering
You can simply run download.sh to download all the datasets at once.
$ ./download.sh
This will download the datasets into the folder datasets.
Because of copyright restrictions on some other datasets, we provide links to them instead: 2010 i2b2/VA, ChemProt.
Fine-tuning BioBERT
After downloading one of the pre-trained weights, unpack it to any directory you want. We will refer to this directory as $BIOBERT_DIR.
For instance, when using BioBERT-Base v1.1 (+ PubMed 1M), set the BIOBERT_DIR environment variable as follows:
$ export BIOBERT_DIR=./biobert_v1.1_pubmed
$ echo $BIOBERT_DIR
>>> ./biobert_v1.1_pubmed
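Before launching a long fine-tuning run, it can help to confirm that $BIOBERT_DIR really contains the files the commands in this README point at. The following is a minimal sketch, not part of the repository: `missing_files` and `EXPECTED_FILES` are hypothetical names, the file list is taken from the fine-tuning commands below, and the `.index` suffix is an assumption about how TensorFlow 1 stores the `model.ckpt-1000000` checkpoint on disk.

```python
import os

# File names taken from the fine-tuning commands in this README.
# ASSUMPTION: the checkpoint prefix model.ckpt-1000000 is stored with a
# companion ".index" file, as TensorFlow 1 checkpoints usually are.
EXPECTED_FILES = ("vocab.txt", "bert_config.json", "model.ckpt-1000000.index")


def missing_files(biobert_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from biobert_dir."""
    return [
        name for name in expected
        if not os.path.isfile(os.path.join(biobert_dir, name))
    ]


if __name__ == "__main__":
    # Fall back to the example directory used in this README.
    biobert_dir = os.environ.get("BIOBERT_DIR", "./biobert_v1.1_pubmed")
    missing = missing_files(biobert_dir)
    if missing:
        print("Missing from %s: %s" % (biobert_dir, ", ".join(missing)))
    else:
        print("All expected files found in %s" % biobert_dir)
```

If you use a different release (for example BioBERT-Large), the checkpoint name may differ, so adjust the list accordingly.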
Named Entity Recognition (NER)
Let $NER_DIR denote a folder for a single NER dataset that contains train_dev.tsv, train.tsv, devel.tsv and test.tsv. Also, set $OUTPUT_DIR to a directory for NER outputs (trained models, test predictions, etc.). For example, when fine-tuning on the NCBI disease corpus:
$ export NER_DIR=./datasets/NER/NCBI-disease
$ export OUTPUT_DIR=./ner_outputs
The following command runs the fine-tuning code on NER with default arguments.
$ mkdir -p $OUTPUT_DIR
$ python run_ner.py --do_train=true --do_eval=true --vocab_file=$BIOBERT_DIR/vocab.txt --bert_config_file=$BIOBERT_DIR/bert_config.json --init_checkpoint=$BIOBERT_DIR/model.ckpt-1000000 --num_train_epochs=10.0 --data_dir=$NER_DIR --output_dir=$OUTPUT_DIR
You can change the arguments as you want. Once you have trained your model, you can use it in inference mode by using --do_train=false --do_predict=true for evaluating test.tsv.
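As an illustration of the switch to inference mode, the sketch below builds the training argument list from the command above and derives the prediction variant by overriding only the two mode flags. The helper names `ner_train_args` and `with_flags` are hypothetical, and keeping every other flag unchanged is an assumption; this README does not state which flags are strictly required in inference mode.

```python
def ner_train_args(biobert_dir, ner_dir, output_dir):
    """Argument list mirroring the NER fine-tuning command in this README."""
    return [
        "python", "run_ner.py",
        "--do_train=true",
        "--do_eval=true",
        "--vocab_file=%s/vocab.txt" % biobert_dir,
        "--bert_config_file=%s/bert_config.json" % biobert_dir,
        "--init_checkpoint=%s/model.ckpt-1000000" % biobert_dir,
        "--num_train_epochs=10.0",
        "--data_dir=%s" % ner_dir,
        "--output_dir=%s" % output_dir,
    ]


def with_flags(args, **overrides):
    """Return a copy of args with --name=value flags replaced or appended."""
    remaining = dict(overrides)
    result = []
    for arg in args:
        # Only arguments of the form --name=value are candidates for override.
        name = arg[2:].split("=", 1)[0] if arg.startswith("--") else None
        if name in remaining:
            result.append("--%s=%s" % (name, remaining.pop(name)))
        else:
            result.append(arg)
    # Flags that were not present yet are appended in sorted order.
    for name in sorted(remaining):
        result.append("--%s=%s" % (name, remaining[name]))
    return result


train = ner_train_args("./biobert_v1.1_pubmed",
                       "./datasets/NER/NCBI-disease",
                       "./ner_outputs")
# Inference mode as described above: no training, predict on test.tsv.
predict = with_flags(train, do_train="false", do_predict="true")

if __name__ == "__main__":
    print(" ".join(predict))
```

The resulting list can be passed to `subprocess.run` or joined into a shell command; it is only a convenience for keeping the two invocations consistent.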
The token-level evaluation results are printed to stdout.
For example, the results for the NCBI-disease dataset look like this:
INFO:tensorflow:***** token-level evaluation results *****
INFO:tensorflow: eval_f = 0.8972311
INFO:tensorflow: eval_precision = 0.88150835
INFO:tensorflow: eval_recall = 0.9136615
INFO:tensorflow: global_step = 2571
INFO:tensorflow: loss = 28.247158
(Tip: scroll up a few lines to find the result. It appears before INFO:tensorflow:**** Trainable Variables ****.)
Note that this result is the token-level evaluation measure while the official evaluation should use the entity-level evaluation measure.
The results of python run_ner.py are written to two files in $OUTPUT_DIR: token_test.txt and label_test.txt.
Use ./biocodes/ner_detokenize.py to obtain the word-level prediction file.
$ python biocodes/ner_detokenize.py --token_test_path=$OUTPUT_DIR/token_test.txt --label_test_path=$OUTPUT_DIR/label_test.txt --answer_path=$NER_DIR/test.tsv --output_dir=$OUTPUT_DIR
This will generate NER_result_conll.txt in $OUTPUT_DIR.
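To give an idea of what going from token-level to word-level predictions involves, here is a minimal sketch of one common approach, assuming BERT-style WordPiece tokens in which continuation pieces start with `##`. It is not the code of ner_detokenize.py, whose exact rules and file formats are not described here: the function name `merge_wordpieces`, the example tokens, and the `X` placeholder label for continuation pieces are all assumptions made for illustration.

```python
def merge_wordpieces(tokens, labels):
    """Merge '##'-prefixed pieces into words, keeping the first piece's label."""
    words, word_labels = [], []
    for token, label in zip(tokens, labels):
        if token.startswith("##") and words:
            # Continuation piece: glue it onto the previous word and
            # ignore its label.
            words[-1] += token[2:]
        else:
            # First piece of a new word: it carries the word's label.
            words.append(token)
            word_labels.append(label)
    return words, word_labels


# Hypothetical tokenization of a disease mention, for illustration only.
tokens = ["aden", "##oma", "##tous", "poly", "##posis", "coli"]
labels = ["B", "X", "X", "I", "X", "I"]
words, word_labels = merge_wordpieces(tokens, labels)

if __name__ == "__main__":
    for word, label in zip(words, word_labels):
        print("%s\t%s" % (word, label))
```

Taking the label of the first piece is a design choice; other schemes (for example majority voting over pieces) are possible.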
Use ./biocodes/conlleval.pl for entity-level exact match evaluation results.
$ perl biocodes/conlleval.pl < $OUTPUT_DIR/NER_result_conll.txt
The entity-level results for the NCBI disease corpus look like this:
processed 24497 tokens with 960 phrases; found: 983 phrases; correct: 852.
accuracy: 98.49%; precision: 86.67%; recall: 88.75%; FB1: 87.70
MISC: precision: 86.67%; recall: 88.75%; FB1: 87.70 983
Note that this is a sample run of an NER model. The performance of NER models usually converges only after more than 50 epochs (a learning rate of 1e-5 is recommended).
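The entity-level scores in the sample conlleval output above follow directly from the three counts on its first line (960 gold phrases, 983 found, 852 correct). The sketch below recomputes them using the standard definitions of precision, recall and F1; the function name `entity_level_scores` is hypothetical and is not part of the repository.

```python
def entity_level_scores(gold, found, correct):
    """Return (precision, recall, f1) for exact-match entity counts."""
    precision = correct / found if found else 0.0
    recall = correct / gold if gold else 0.0
    total = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Counts from the sample output: 960 phrases, found: 983, correct: 852.
precision, recall, f1 = entity_level_scores(gold=960, found=983, correct=852)

if __name__ == "__main__":
    print("precision: %.2f%%; recall: %.2f%%; FB1: %.2f"
          % (100 * precision, 100 * recall, 100 * f1))
    # prints "precision: 86.67%; recall: 88.75%; FB1: 87.70"
```

This is useful as a sanity check when comparing runs, since the counts make clear how many phrases separate two F1 values.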
Relation Extraction (RE)
Let $RE_DIR denote a folder for a single RE dataset, $TASK_NAME denote the name of the task (two possible options: {gad, euadr}), and $OUTPUT_DIR denote a directory for RE outputs:
$ export RE_DIR=./datasets/RE/GAD/1
$ export TASK_NAME=gad
$ export OUTPUT_DIR=./re_outputs_1
The following command runs the fine-tuning code on RE with default arguments.
$ python run_re.py --task_name=$TASK_NAME --do_train=true --do_eval=true --do_predict=true --vocab_file=$BIOBERT_DIR/vocab.txt --bert_config_file=$BIOBERT_DIR/bert_config.json --init_checkpoint=$BIOBERT_DIR/model.ckpt-1000000 --max_seq_length=128 --train_batch_size=32 --learning_rate=2e-5 --num_train_epochs=3.0 --do_lower_case=false --data_dir=$RE_DIR --output_dir=$OUTPUT_DIR
The predictions will be saved to a file called test_results.tsv in $OUTPUT_DIR.
Use ./biocodes/re_eval.py for the evaluation.
Note that the CHEMPROT dataset is a multi-class classification dataset and to evaluate the CHEMPROT result, you should run re_eval.py with additional `--task=chempr
