CBLUE
CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark
AI (Artificial Intelligence) plays an indispensable role in the biomedical field, helping improve medical technology. To further accelerate AI research in the biomedical field, we present the Chinese Biomedical Language Understanding Evaluation (CBLUE) benchmark, which includes datasets collected from real-world biomedical scenarios, baseline models, and an online platform for model evaluation, comparison, and analysis.
CBLUE Benchmark
We evaluate 11 current Chinese pre-trained models on eight biomedical language understanding tasks and report their baselines below.
| Model                 | CMedEE | CMedIE | CDN  | CTC  | STS  | QIC  | QTR  | QQR  | Avg. |
| --------------------- | :----: | :----: | :--: | :--: | :--: | :--: | :--: | :--: | :--: |
| BERT-base             |  62.1  |  54.0  | 55.4 | 69.2 | 83.0 | 84.3 | 60.0 | 84.7 | 69.0 |
| BERT-wwm-ext-base     |  61.7  |  54.0  | 55.4 | 70.1 | 83.9 | 84.5 | 60.9 | 84.4 | 69.4 |
| ALBERT-tiny           |  50.5  |  35.9  | 50.2 | 61.0 | 79.7 | 75.8 | 55.5 | 79.8 | 61.1 |
| ALBERT-xxlarge        |  61.8  |  47.6  | 37.5 | 66.9 | 84.8 | 84.8 | 62.2 | 83.1 | 66.1 |
| RoBERTa-large         |  62.1  |  54.4  | 56.5 | 70.9 | 84.7 | 84.2 | 60.9 | 82.9 | 69.6 |
| RoBERTa-wwm-ext-base  |  62.4  |  53.7  | 56.4 | 69.4 | 83.7 | 85.5 | 60.3 | 82.7 | 69.3 |
| RoBERTa-wwm-ext-large |  61.8  |  55.9  | 55.7 | 69.0 | 85.2 | 85.3 | 62.8 | 84.4 | 70.0 |
| PCL-MedBERT           |  60.6  |  49.1  | 55.8 | 67.8 | 83.8 | 84.3 | 59.3 | 82.5 | 67.9 |
| ZEN                   |  61.0  |  50.1  | 57.8 | 68.6 | 83.5 | 83.2 | 60.3 | 83.0 | 68.4 |
| MacBERT-base          |  60.7  |  53.2  | 57.7 | 67.7 | 84.4 | 84.9 | 59.7 | 84.0 | 69.0 |
| MacBERT-large         |  62.4  |  51.6  | 59.3 | 68.6 | 85.6 | 82.7 | 62.9 | 83.5 | 69.6 |
| Human                 |  67.0  |  66.0  | 65.0 | 78.0 | 93.0 | 88.0 | 71.0 | 89.0 | 77.1 |
Baselines of tasks
We present baseline models for the biomedical tasks and release the corresponding code for a quick start.
Requirements
python3 / pytorch 1.7 / transformers 4.5.1 / jieba / gensim / sklearn
Data preparation
The whole zip package includes the datasets of the 8 biomedical NLU tasks (more details in the following section). Every task includes the following files:
├── {Task}
| └── {Task}_train.json
| └── {Task}_test.json
| └── {Task}_dev.json
| └── example_gold.json
| └── example_pred.json
| └── README.md
Notice: a few tasks have additional files; for example, the CHIP-CTC task includes a category.xlsx file.
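To get oriented quickly, you can inspect any task's data with a few lines of Python. A minimal sketch, assuming the task files are JSON arrays of records (field names differ from task to task):

```python
# Peek at a task's training data. Field names differ per task,
# so we just pretty-print whatever the first record contains.
import json

task = "KUAKE-QQR"  # any of the 8 tasks
with open(f"CBLUEDatasets/{task}/{task}_train.json", encoding="utf-8") as f:
    train = json.load(f)

print(len(train), "training examples")
print(json.dumps(train[0], ensure_ascii=False, indent=2))
```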
You can download the Chinese pre-trained models you need (download URLs are provided above). With Huggingface-Transformers, the models above can be easily accessed and loaded, as sketched below.
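For example, a minimal sketch of loading a downloaded model with Huggingface-Transformers (the local path is illustrative; point it at wherever you placed the weights):

```python
# Load a downloaded Chinese pre-trained model from a local directory
# with Huggingface-Transformers.
from transformers import BertModel, BertTokenizer

model_path = "data/model_data/chinese-bert-wwm"  # assumed local directory
tokenizer = BertTokenizer.from_pretrained(model_path)
model = BertModel.from_pretrained(model_path)

inputs = tokenizer("患者出现发热症状", return_tensors="pt")  # "the patient has a fever"
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)
```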
The reference directory:
├── CBLUE
| └── baselines
| └── run_classifier.py
| └── ...
| └── examples
| └── run_qqr.sh
| └── ...
| └── cblue
| └── CBLUEDatasets
| └── KUAKE-QQR
| └── ...
| └── data
| └── output
| └── model_data
| └── bert-base
| └── ...
| └── result_output
| └── KUAKE-QQR_test.json
| └── ...
Running examples
Shell scripts for training and evaluating every task are provided in examples/ and can be run directly.
Also, you can use the runner scripts in baselines/ and write your own shell scripts as needed (a task-to-script lookup sketch follows this list):
- baselines/run_classifier.py: supports the {sts, qqr, qtr, qic, ctc, ee} tasks;
- baselines/run_cdn.py: supports the {cdn} task;
- baselines/run_ie.py: supports the {ie} task.
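The same routing expressed as a small lookup, in case you script your own launches (an illustrative helper, not part of the repo):

```python
# Task -> runner script, transcribed from the list above (illustrative).
RUNNER_FOR_TASK = {
    **dict.fromkeys(["sts", "qqr", "qtr", "qic", "ctc", "ee"],
                    "baselines/run_classifier.py"),
    "cdn": "baselines/run_cdn.py",
    "ie": "baselines/run_ie.py",
}

print(RUNNER_FOR_TASK["qqr"])  # baselines/run_classifier.py
```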
Training models
Run the shell scripts with bash examples/run_{task}.sh; their contents are as follows:
DATA_DIR="CBLUEDatasets"
TASK_NAME="qqr"
MODEL_TYPE="bert"
MODEL_DIR="data/model_data"
MODEL_NAME="chinese-bert-wwm"
OUTPUT_DIR="data/output"
RESULT_OUTPUT_DIR="data/result_output"
MAX_LENGTH=128
python baselines/run_classifier.py \
--data_dir=${DATA_DIR} \
--model_type=${MODEL_TYPE} \
--model_dir=${MODEL_DIR} \
--model_name=${MODEL_NAME} \
--task_name=${TASK_NAME} \
--output_dir=${OUTPUT_DIR} \
--result_output_dir=${RESULT_OUTPUT_DIR} \
--do_train \
--max_length=${MAX_LENGTH} \
--train_batch_size=16 \
--eval_batch_size=16 \
--learning_rate=3e-5 \
--epochs=3 \
--warmup_proportion=0.1 \
--earlystop_patience=3 \
--logging_steps=250 \
--save_steps=250 \
--seed=2021
Notice: the best checkpoint is saved in OUTPUT_DIR/MODEL_NAME/.
- MODEL_TYPE: supports the {bert, roberta, albert, zen} model types;
- MODEL_NAME: supports the {bert-base, bert-wwm-ext, albert-tiny, albert-xxlarge, zen, pcl-medbert, roberta-large, roberta-wwm-ext-base, roberta-wwm-ext-large, macbert-base, macbert-large} Chinese pre-trained models.
The MODEL_TYPE-MODEL_NAME mappings are listed below.
| MODEL_TYPE | MODEL_NAME |
| :--------: | :----------------------------------------------------------- |
| bert | bert-base, bert-wwm-ext, pcl-medbert, macbert-base, macbert-large |
| roberta | roberta-large, roberta-wwm-ext-base, roberta-wwm-ext-large |
| albert | albert-tiny, albert-xxlarge |
| zen | zen |
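If you script your own runs, the table above can double as a validation lookup; a minimal sketch (not part of the repo's code):

```python
# MODEL_TYPE -> valid MODEL_NAME values, transcribed from the table above.
MODEL_NAMES = {
    "bert": {"bert-base", "bert-wwm-ext", "pcl-medbert",
             "macbert-base", "macbert-large"},
    "roberta": {"roberta-large", "roberta-wwm-ext-base", "roberta-wwm-ext-large"},
    "albert": {"albert-tiny", "albert-xxlarge"},
    "zen": {"zen"},
}

def check_pair(model_type: str, model_name: str) -> None:
    """Raise if MODEL_NAME is not valid for MODEL_TYPE."""
    if model_name not in MODEL_NAMES.get(model_type, set()):
        raise ValueError(f"{model_name!r} is not a {model_type!r} model")

check_pair("bert", "macbert-large")  # passes silently
```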
Inference & generation of results
Run the shell scripts with bash examples/run_{task}.sh predict; their contents are as follows:
DATA_DIR="CBLUEDatasets"
TASK_NAME="qqr"
MODEL_TYPE="bert"
MODEL_DIR="data/model_data"
MODEL_NAME="chinese-bert-wwm"
OUTPUT_DIR="data/output"
RESULT_OUTPUT_DIR="data/result_output"
MAX_LENGTH=128
python baselines/run_classifier.py \
--data_dir=${DATA_DIR} \
--model_type=${MODEL_TYPE} \
--model_name=${MODEL_NAME} \
--model_dir=${MODEL_DIR} \
--task_name=${TASK_NAME} \
--output_dir=${OUTPUT_DIR} \
--result_output_dir=${RESULT_OUTPUT_DIR} \
--do_predict \
--max_length=${MAX_LENGTH} \
--eval_batch_size=16 \
--seed=2021
Notice: the prediction result {TASK_NAME}_test.json will be generated in RESULT_OUTPUT_DIR.
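A quick way to sanity-check that the prediction file was written (a sketch; the path assumes the RESULT_OUTPUT_DIR used above and a JSON-array output like the inputs):

```python
# Confirm the prediction file exists and count its records (illustrative).
import json
import os

result = "data/result_output/KUAKE-QQR_test.json"  # {TASK_NAME}_test.json
assert os.path.exists(result), "run the predict step first"
with open(result, encoding="utf-8") as f:
    preds = json.load(f)
print(f"{len(preds)} predictions in {result}")
```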
Check format
Before you submit the predicted test files, you can check their format with the format_checker scripts to avoid an invalid evaluation score caused by format errors.
- Step1: Copy the original test file (without answers) {taskname}_test.[json|jsonl|tsv] to the format_checker directory, and rename it {taskname}_test_raw.[json|jsonl|tsv].
# take the CMeEE task for example:
cp ${path_to_CMeEE}/CMeEE_test.json ${current_dir}/CMeEE_test_raw.json
- Step2: Execute the following format_checker script using the raw test file (from Step1) and your prediction file:
python3 format_checker_${taskname}.py {taskname}_test_raw.[json|jsonl|tsv] {taskname}_test.[json|jsonl|tsv]
# take the CMeEE task for example:
python3 format_checker_CMeEE.py CMeEE_test_raw.json CMeEE_test.json
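Conceptually, the checkers verify that your prediction file lines up record-for-record with the raw test file. A sketch of that kind of check for the JSON-based tasks (not the repo's actual script):

```python
# Sketch of the kind of consistency check the format_checker scripts perform:
# the prediction file should contain exactly one record per raw test record.
import json
import sys

def load(path):
    with open(path, encoding="utf-8") as f:
        return json.load(f)

raw = load(sys.argv[1])    # {taskname}_test_raw.json
pred = load(sys.argv[2])   # {taskname}_test.json (your predictions)
assert len(raw) == len(pred), f"record count mismatch: {len(raw)} vs {len(pred)}"
print("basic count check passed")
```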
What is special?
IMCS-NER & IMCS-V2-NER tasks:
- Step1: Copy both the original test file (without answers) IMCS-NER_test.json (or IMCS-V2-NER_test.json) and IMCS_test.json (or IMCS-V2_test.json) to this directory, and rename the former IMCS-NER_test_raw.json (or IMCS-V2-NER_test_raw.json):
# for IMCS-NER task:
cp ${path_to_IMCS-NER}/IMCS-NER_test.json ${current_dir}/IMCS-NER_test_raw.json
cp ${path_to_IMCS-NER}/IMCS_test.json ${current_dir}
# for IMCS-V2-NER task:
cp ${path_to_IMCS-V2-NER}/IMCS-V2-NER_test.json ${current_dir}/IMCS-V2-NER_test_raw.json
cp ${path_to_IMCS-V2-NER}/IMCS-V2_test.json ${current_dir}
- Step2: Execute the following format_checker script using the raw test file (from Step1) and your prediction file:
# for IMCS-NER task:
python3 format_checker_IMCS_V1_NER.py IMCS-NER_test_raw.json IMCS-NER_test.json IMCS_test.json
# for IMCS-V2-NER task:
python3 format_checker_IMCS_V2_NER.py IMCS-V2-NER_test_raw.json IMCS-V2-NER_test.json IMCS-V2_test.json
IMCS-SR & IMCS-V2-SR & MedDG tasks:
If you want to enable the optional check logic in the check_format function, which is commented out in the master branch, you also need to copy the normalized dictionary files to the current directory.
- MedDG: the dictionary file is entity_list.txt
- IMCS-SR: the dictionary file is *
