BONSAI
A BERT-based framework for processing and analyzing Electronic Health Records (EHR) data. It provides an end-to-end pipeline for data preprocessing, model training, and clinical outcome prediction.
BONSAI helps researchers and data scientists preprocess EHR data, train models, and generate outcomes for downstream clinical predictions and analyses.
Setup (requires Python 3.12)
git clone https://github.com/FGA-DIKU/BONSAI.git
cd BONSAI
pip install -e .
cp template_env .env
You can adapt the paths in .env to point to alternative directories for custom configs and input data, or to change where model checkpoints are saved.
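As a rough sketch, a filled-in .env might look like the following. The variable names below are illustrative placeholders, not the project's actual keys; copy template_env and use the names defined there.

```shell
# Illustrative placeholders only -- see template_env for the real variable names.
CONFIG_DIR=/path/to/custom/configs      # directory with your config files
DATA_DIR=/path/to/input/data            # directory with input EHR data
CHECKPOINT_DIR=/path/to/checkpoints     # where model checkpoints are written
```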
Basic usage:

1. Create data.
   python bonsai/run/create_data.py --config-name examples/example_data dataset=correlated_MEDS_data
   We use the example_data.yaml config, which transforms the correlated_MEDS_data in the example_data folder into the training format. The result is saved in data/correlated_MEDS_data.
2. Pretrain model.
   python bonsai/run/pretrain.py --config-name examples/example_pretrain dataset=correlated_MEDS_data
   We use the example_pretrain.yaml config for a short, resource-light training run that can run locally, pointed at the dataset created in step 1.
3. Create outcomes (labels for finetuning).
   python bonsai/run/create_outcome.py --config-name examples/example_outcome1 dataset=correlated_MEDS_data
   We use the example_outcome1.yaml config, which processes the target outcomes for the correlated_MEDS_data in the example_data folder and saves them to data/correlated_MEDS_data/outcomes/examples/example_outcome1.parquet.
4. Finetune model.
   python bonsai/run/finetune.py --config-name examples/example_finetune dataset=correlated_MEDS_data outcome=examples/example_outcome1 pretrain_path=/path/to/your/pretrained/checkpoints/best.ckpt
   We use the example_finetune.yaml config for a short, resource-light training run that can run locally, pointed at the dataset created in step 1, the checkpoint created in step 2, and the labels created in step 3.
5. Train model (from scratch, without pretraining).
   python bonsai/run/train.py --config-name examples/example_finetune dataset=correlated_MEDS_data outcome=examples/example_outcome1
   We use the example_finetune.yaml config for a short, resource-light run without pretraining that can run locally, pointed at the dataset created in step 1 and the labels created in step 3.
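The pipeline above starts from data in MEDS format (the Medical Event Data Standard), where each clinical event is one row per subject, timestamp, and code. As a hand-made illustration — not the actual correlated_MEDS_data, whose codes and values are invented here — such an event table can be built like this:

```python
import pandas as pd

# Toy event table following the MEDS schema
# (subject_id, time, code, numeric_value).
# Subject IDs, codes, and values below are invented for illustration.
events = pd.DataFrame(
    {
        "subject_id": [1, 1, 1, 2],
        "time": pd.to_datetime(
            ["2020-01-01", "2020-01-01", "2020-03-15", "2021-06-02"]
        ),
        "code": ["ADMISSION", "LAB//CREATININE", "DIAGNOSIS//I10", "ADMISSION"],
        "numeric_value": [None, 1.1, None, None],
    }
)

# MEDS data is typically stored sorted by subject and time, e.g. as Parquet.
events = events.sort_values(["subject_id", "time"]).reset_index(drop=True)
print(events)
```

Data in this shape is what step 1 transforms into the framework's training format.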
To use the old pre-Lightning version of the codebase, check out the tagged release:
git checkout tags/pre-lightning
Contributing
We welcome contributions! Please see our Contributing Guidelines for details on:
- Code style and formatting
- Testing requirements
- Pull request process
- Issue reporting
License
This project is licensed under the MIT License - see the LICENSE file for details.
Citation
If you use BONSAI in your research, please cite the following paper:
@article{Montgomery2025,
  author  = {Montgomery, A. and others},
  title   = {BONSAI: A framework for processing and analysing {E}lectronic {H}ealth {R}ecords ({EHR}) data using transformer-based models},
  journal = {Journal of Open Source Software},
  volume  = {10},
  number  = {114},
  pages   = {8869},
  year    = {2025},
  doi     = {10.21105/joss.08869}
}
