CBLUE
CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark
AI (Artificial Intelligence) plays an indispensable role in the biomedical field, helping improve medical technology. To further accelerate AI research in the biomedical field, we present the Chinese Biomedical Language Understanding Evaluation (CBLUE) benchmark, which includes datasets collected from real-world biomedical scenarios, baseline models, and an online platform for model evaluation, comparison, and analysis.
CBLUE Benchmark
We evaluate 11 current Chinese pre-trained models on eight biomedical language understanding tasks and report their baselines below.
| Model                 | CMedEE | CMedIE | CDN  | CTC  | STS  | QIC  | QTR  | QQR  | Avg. |
| --------------------- | :----: | :----: | :--: | :--: | :--: | :--: | :--: | :--: | :--: |
| BERT-base             |  62.1  |  54.0  | 55.4 | 69.2 | 83.0 | 84.3 | 60.0 | 84.7 | 69.0 |
| BERT-wwm-ext-base     |  61.7  |  54.0  | 55.4 | 70.1 | 83.9 | 84.5 | 60.9 | 84.4 | 69.4 |
| ALBERT-tiny           |  50.5  |  35.9  | 50.2 | 61.0 | 79.7 | 75.8 | 55.5 | 79.8 | 61.1 |
| ALBERT-xxlarge        |  61.8  |  47.6  | 37.5 | 66.9 | 84.8 | 84.8 | 62.2 | 83.1 | 66.1 |
| RoBERTa-large         |  62.1  |  54.4  | 56.5 | 70.9 | 84.7 | 84.2 | 60.9 | 82.9 | 69.6 |
| RoBERTa-wwm-ext-base  |  62.4  |  53.7  | 56.4 | 69.4 | 83.7 | 85.5 | 60.3 | 82.7 | 69.3 |
| RoBERTa-wwm-ext-large |  61.8  |  55.9  | 55.7 | 69.0 | 85.2 | 85.3 | 62.8 | 84.4 | 70.0 |
| PCL-MedBERT           |  60.6  |  49.1  | 55.8 | 67.8 | 83.8 | 84.3 | 59.3 | 82.5 | 67.9 |
| ZEN                   |  61.0  |  50.1  | 57.8 | 68.6 | 83.5 | 83.2 | 60.3 | 83.0 | 68.4 |
| MacBERT-base          |  60.7  |  53.2  | 57.7 | 67.7 | 84.4 | 84.9 | 59.7 | 84.0 | 69.0 |
| MacBERT-large         |  62.4  |  51.6  | 59.3 | 68.6 | 85.6 | 82.7 | 62.9 | 83.5 | 69.6 |
| Human                 |  67.0  |  66.0  | 65.0 | 78.0 | 93.0 | 88.0 | 71.0 | 89.0 | 77.1 |
Baselines of tasks
We present baseline models for the biomedical tasks and release the corresponding code for a quick start.
Requirements
python3 / pytorch 1.7 / transformers 4.5.1 / jieba / gensim / sklearn
Data preparation
The whole zip package includes the datasets of the 8 biomedical NLU tasks (more details in the following section). Every task includes the following files:
├── {Task}
| └── {Task}_train.json
| └── {Task}_test.json
| └── {Task}_dev.json
| └── example_gold.json
| └── example_pred.json
| └── README.md
Notice: a few tasks have additional files; for example, the CHIP-CTC task includes a category.xlsx file.
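To get oriented quickly, you can inspect any task's data with a few lines of Python. A minimal sketch, assuming the task files are JSON arrays of records (field names differ from task to task):

```python
# Peek at a task's training data. Field names differ per task,
# so we just pretty-print whatever the first record contains.
import json

task = "KUAKE-QQR"  # any of the 8 tasks
with open(f"CBLUEDatasets/{task}/{task}_train.json", encoding="utf-8") as f:
    train = json.load(f)

print(len(train), "training examples")
print(json.dumps(train[0], ensure_ascii=False, indent=2))
```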
You can download the Chinese pre-trained models you need (download URLs are provided above). With Huggingface-Transformers, the models above can be easily accessed and loaded, as sketched below.
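For example, a minimal sketch of loading a downloaded model with Huggingface-Transformers (the local path is illustrative; point it at wherever you placed the weights):

```python
# Load a downloaded Chinese pre-trained model from a local directory
# with Huggingface-Transformers.
from transformers import BertModel, BertTokenizer

model_path = "data/model_data/chinese-bert-wwm"  # assumed local directory
tokenizer = BertTokenizer.from_pretrained(model_path)
model = BertModel.from_pretrained(model_path)

inputs = tokenizer("患者出现发热症状", return_tensors="pt")  # "the patient has a fever"
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)
```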
The reference directory:
├── CBLUE
| └── baselines
| └── run_classifier.py
| └── ...
| └── examples
| └── run_qqr.sh
| └── ...
| └── cblue
| └── CBLUEDatasets
| └── KUAKE-QQR
| └── ...
| └── data
| └── output
| └── model_data
| └── bert-base
| └── ...
| └── result_output
| └── KUAKE-QQR_test.json
| └── ...
Running examples
Shell scripts for training and evaluating every task are provided in examples/ and can be run directly.
Also, you can use the runner scripts in baselines/ and write your own shell scripts as needed (a task-to-script lookup sketch follows this list):
- baselines/run_classifier.py: supports the {sts, qqr, qtr, qic, ctc, ee} tasks;
- baselines/run_cdn.py: supports the {cdn} task;
- baselines/run_ie.py: supports the {ie} task.
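The same routing expressed as a small lookup, in case you script your own launches (an illustrative helper, not part of the repo):

```python
# Task -> runner script, transcribed from the list above (illustrative).
RUNNER_FOR_TASK = {
    **dict.fromkeys(["sts", "qqr", "qtr", "qic", "ctc", "ee"],
                    "baselines/run_classifier.py"),
    "cdn": "baselines/run_cdn.py",
    "ie": "baselines/run_ie.py",
}

print(RUNNER_FOR_TASK["qqr"])  # baselines/run_classifier.py
```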
Training models
Run the shell scripts with bash examples/run_{task}.sh; their contents are as follows:
DATA_DIR="CBLUEDatasets"
TASK_NAME="qqr"
MODEL_TYPE="bert"
MODEL_DIR="data/model_data"
MODEL_NAME="chinese-bert-wwm"
OUTPUT_DIR="data/output"
RESULT_OUTPUT_DIR="data/result_output"
MAX_LENGTH=128
python baselines/run_classifier.py \
--data_dir=${DATA_DIR} \
--model_type=${MODEL_TYPE} \
--model_dir=${MODEL_DIR} \
--model_name=${MODEL_NAME} \
--task_name=${TASK_NAME} \
--output_dir=${OUTPUT_DIR} \
--result_output_dir=${RESULT_OUTPUT_DIR} \
--do_train \
--max_length=${MAX_LENGTH} \
--train_batch_size=16 \
--eval_batch_size=16 \
--learning_rate=3e-5 \
--epochs=3 \
--warmup_proportion=0.1 \
--earlystop_patience=3 \
--logging_steps=250 \
--save_steps=250 \
--seed=2021
Notice: the best checkpoint is saved in OUTPUT_DIR/MODEL_NAME/.
- MODEL_TYPE: supports the {bert, roberta, albert, zen} model types;
- MODEL_NAME: supports the {bert-base, bert-wwm-ext, albert-tiny, albert-xxlarge, zen, pcl-medbert, roberta-large, roberta-wwm-ext-base, roberta-wwm-ext-large, macbert-base, macbert-large} Chinese pre-trained models.
The MODEL_TYPE-MODEL_NAME mappings are listed below.
| MODEL_TYPE | MODEL_NAME |
| :--------: | :----------------------------------------------------------- |
| bert | bert-base, bert-wwm-ext, pcl-medbert, macbert-base, macbert-large |
| roberta | roberta-large, roberta-wwm-ext-base, roberta-wwm-ext-large |
| albert | albert-tiny, albert-xxlarge |
| zen | zen |
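If you script your own runs, the table above can double as a validation lookup; a minimal sketch (not part of the repo's code):

```python
# MODEL_TYPE -> valid MODEL_NAME values, transcribed from the table above.
MODEL_NAMES = {
    "bert": {"bert-base", "bert-wwm-ext", "pcl-medbert",
             "macbert-base", "macbert-large"},
    "roberta": {"roberta-large", "roberta-wwm-ext-base", "roberta-wwm-ext-large"},
    "albert": {"albert-tiny", "albert-xxlarge"},
    "zen": {"zen"},
}

def check_pair(model_type: str, model_name: str) -> None:
    """Raise if MODEL_NAME is not valid for MODEL_TYPE."""
    if model_name not in MODEL_NAMES.get(model_type, set()):
        raise ValueError(f"{model_name!r} is not a {model_type!r} model")

check_pair("bert", "macbert-large")  # passes silently
```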
Inference & generation of results
Run the shell scripts with bash examples/run_{task}.sh predict; their contents are as follows:
DATA_DIR="CBLUEDatasets"
TASK_NAME="qqr"
MODEL_TYPE="bert"
MODEL_DIR="data/model_data"
MODEL_NAME="chinese-bert-wwm"
OUTPUT_DIR="data/output"
RESULT_OUTPUT_DIR="data/result_output"
MAX_LENGTH=128
python baselines/run_classifier.py \
--data_dir=${DATA_DIR} \
--model_type=${MODEL_TYPE} \
--model_name=${MODEL_NAME} \
--model_dir=${MODEL_DIR} \
--task_name=${TASK_NAME} \
--output_dir=${OUTPUT_DIR} \
--result_output_dir=${RESULT_OUTPUT_DIR} \
--do_predict \
--max_length=${MAX_LENGTH} \
--eval_batch_size=16 \
--seed=2021
Notice: the prediction result {TASK_NAME}_test.json will be generated in RESULT_OUTPUT_DIR.
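A quick way to sanity-check that the prediction file was written (a sketch; the path assumes the RESULT_OUTPUT_DIR used above and a JSON-array output like the inputs):

```python
# Confirm the prediction file exists and count its records (illustrative).
import json
import os

result = "data/result_output/KUAKE-QQR_test.json"  # {TASK_NAME}_test.json
assert os.path.exists(result), "run the predict step first"
with open(result, encoding="utf-8") as f:
    preds = json.load(f)
print(f"{len(preds)} predictions in {result}")
```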
Check format
Before you submit the predicted test files, you can check their format with the format_checker scripts to avoid an invalid evaluation score caused by format errors.
- Step1: Copy the original test file (without answers) {taskname}_test.[json|jsonl|tsv] to the format_checker directory, and rename it {taskname}_test_raw.[json|jsonl|tsv].
# take the CMeEE task for example:
cp ${path_to_CMeEE}/CMeEE_test.json ${current_dir}/CMeEE_test_raw.json
- Step2: Execute the following format_checker script using the raw test file (from Step1) and your prediction file:
python3 format_checker_${taskname}.py {taskname}_test_raw.[json|jsonl|tsv] {taskname}_test.[json|jsonl|tsv]
# take the CMeEE task for example:
python3 format_checker_CMeEE.py CMeEE_test_raw.json CMeEE_test.json
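Conceptually, the checkers verify that your prediction file lines up record-for-record with the raw test file. A sketch of that kind of check for the JSON-based tasks (not the repo's actual script):

```python
# Sketch of the kind of consistency check the format_checker scripts perform:
# the prediction file should contain exactly one record per raw test record.
import json
import sys

def load(path):
    with open(path, encoding="utf-8") as f:
        return json.load(f)

raw = load(sys.argv[1])    # {taskname}_test_raw.json
pred = load(sys.argv[2])   # {taskname}_test.json (your predictions)
assert len(raw) == len(pred), f"record count mismatch: {len(raw)} vs {len(pred)}"
print("basic count check passed")
```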
What is special?
IMCS-NER & IMCS-V2-NER tasks:
- Step1: Copy both the original test file (without answers) IMCS-NER_test.json (or IMCS-V2-NER_test.json) and IMCS_test.json (or IMCS-V2_test.json) to this directory, and rename the former IMCS-NER_test_raw.json (or IMCS-V2-NER_test_raw.json):
# for IMCS-NER task:
cp ${path_to_IMCS-NER}/IMCS-NER_test.json ${current_dir}/IMCS-NER_test_raw.json
cp ${path_to_IMCS-NER}/IMCS_test.json ${current_dir}
# for IMCS-V2-NER task:
cp ${path_to_IMCS-V2-NER}/IMCS-V2-NER_test.json ${current_dir}/IMCS-V2-NER_test_raw.json
cp ${path_to_IMCS-V2-NER}/IMCS-V2_test.json ${current_dir}
- Step2: Execute the following format_checker script using the raw test file (from Step1) and your prediction file:
# for IMCS-NER task:
python3 format_checker_IMCS_V1_NER.py IMCS-NER_test_raw.json IMCS-NER_test.json IMCS_test.json
# for IMCS-V2-NER task:
python3 format_checker_IMCS_V2_NER.py IMCS-V2-NER_test_raw.json IMCS-V2-NER_test.json IMCS-V2_test.json
IMCS-SR & IMCS-V2-SR & MedDG tasks:
If you want to enable the optional check logic in the check_format function, which is commented out in the master branch, you also need to copy the normalized dictionary files to the current directory.
- MedDG: the dictionary file is entity_list.txt
- IMCS-SR: the dictionary file is *
