HealthGPT
【ICML 2025 Spotlight】 Official repo for the paper "HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation"
Mengze Li<sup>4</sup>, Xiaohui Song<sup>1</sup>, Siliang Tang<sup>1</sup>, Jun Xiao<sup>1</sup>, Hui Lin<sup>1</sup>, Yueting Zhuang<sup>1</sup>, Beng Chin Ooi<sup>1</sup> <br><br>
<sup>1</sup>Zhejiang University, <sup>2</sup>University of Electronic Science and Technology of China,
<sup>3</sup>Alibaba, <sup>4</sup>The Hong Kong University of Science and Technology,
<a href='https://arxiv.org/abs/2502.09838'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a> <a href='https://huggingface.co/lintw/HealthGPT-M3'><img src='https://img.shields.io/badge/Model-Huggingface-yellow'></a> <a href='https://huggingface.co/datasets/lintw/VL-Health'><img src='https://img.shields.io/badge/Dataset-Huggingface-E59FB6'></a> <a href='https://llsuzy.github.io/HealthGPT.github.io/'><img src='https://img.shields.io/badge/Home-Page-green'></a> <a href='https://www.youtube.com/watch?v=UtusB3L2msk'><img src='https://img.shields.io/badge/Overview-Video-blue'></a>
</div>

🌟 Overview
Welcome to HealthGPT! 🚀 HealthGPT is an advanced medical Large Vision-Language Model with a unified framework that integrates both medical visual comprehension and generation capabilities. This project proposes heterogeneous low-rank adaptation (H-LoRA) and a three-stage learning strategy, which together enable the pre-trained large language model to efficiently follow both visual comprehension and generation instructions.
🔥 News
- [2025.05.02] 🎉🎉🎉 HealthGPT has been accepted by ICML 2025 as a Spotlight presentation.
- [2025.03.20] We upgraded our specialized comprehension model, HealthGPT-XL32, which is based on Qwen2.5-32B-Instruct. This enhanced model significantly outperforms HealthGPT-L14, with a score of 70.4 compared to 66.4.
- [2025.03.06] We have released the VL-Health Dataset.
- [2025.02.26] We have released the UI/UX for the inference.
- [2025.02.17] We have released the pre-trained weights on HuggingFace and the inference script.
TODO
- [x] Release inference code.
- [x] Release the pre-trained weights of the model.
- [x] Release the inference UI/UX.
- [x] Release VL-Health dataset.
- [ ] Release training scripts.
- [ ] Construct the website.
📚 Task Classification and Support
HealthGPT supports 7 types of medical comprehension tasks and 5 types of medical generation tasks, outperforming recent unified visual models and medical-specific models.
<p align="center"> <img src="images/intro.png" alt="Example Image" style="width:97%;"> </p>

🏗️ Architecture
The HealthGPT architecture integrates hierarchical visual perception and H-LoRA. A task-specific hard router selects the visual features and H-LoRA plugins, and the model generates text and vision outputs in an autoregressive manner.
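To make the routing idea concrete, here is a minimal sketch of what "hard" routing means: the task type selects one fixed combination of visual features and H-LoRA plugin by table lookup, with no learned soft gating. This is an illustration only, not the authors' implementation. The names `TaskType`, `Route`, `ROUTES`, and `hard_route`, as well as the feature and plugin labels, are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class TaskType(Enum):
    COMPREHENSION = "comprehension"  # e.g. medical visual question answering
    GENERATION = "generation"        # e.g. image reconstruction


@dataclass(frozen=True)
class Route:
    visual_features: str  # hypothetical label for the visual features to use
    hlora_plugin: str     # hypothetical label for the H-LoRA plugin to activate


# Hypothetical routing table: exactly one fixed entry per task type.
ROUTES = {
    TaskType.COMPREHENSION: Route("comprehension_features", "comprehension_hlora"),
    TaskType.GENERATION: Route("generation_features", "generation_hlora"),
}


def hard_route(task: TaskType) -> Route:
    """Select features and plugin by table lookup (no learned, soft gating)."""
    return ROUTES[task]
```

Because the choice is a lookup rather than a weighted mixture, each request activates a single plugin, which is the property the word "hard" refers to in this sketch.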
<p align="center"> <img src="images/Framework.png" alt="Example Image" style="width:88%;"> </p>

🛠️ Getting Started
We have released our model in two configurations, HealthGPT-M3 and HealthGPT-L14, to suit different requirements and resource availability:
- HealthGPT-M3: A smaller version optimized for speed and reduced memory usage.
- HealthGPT-L14: A larger version designed for higher performance and more complex tasks.
Installation
1. Prepare Environment
First, clone our repository and create the Python environment for running HealthGPT using the following commands:
# clone our project
git clone https://github.com/DCDmllm/HealthGPT.git
cd HealthGPT
# prepare python environment
conda create -n HealthGPT python=3.10
conda activate HealthGPT
pip install -r requirements.txt
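After installation, a quick sanity check of the environment can save a failed first run. The sketch below verifies the Python version and looks for a few packages without importing them. The package list is an assumption made for illustration (only Gradio is mentioned in this README); `requirements.txt` remains the authoritative list.

```python
import importlib.util
import sys

# Assumed package names for illustration; requirements.txt is authoritative.
ASSUMED_PACKAGES = ["torch", "transformers", "gradio"]


def missing_packages(names):
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def check_environment(expected_python=(3, 10)):
    """Print a short report and return True if everything looks fine."""
    python_ok = sys.version_info[:2] == expected_python
    missing = missing_packages(ASSUMED_PACKAGES)
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}:",
          "ok" if python_ok else f"expected {expected_python[0]}.{expected_python[1]}")
    print("Missing packages:", ", ".join(missing) if missing else "none")
    return python_ok and not missing
```

Call `check_environment()` inside the activated `HealthGPT` conda environment.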
2. Prepare Pre-trained Weights
HealthGPT utilizes clip-vit-large-patch14-336 as the visual encoder and employs Phi-3-mini-4k-instruct and phi-4 as the pre-trained LLM base models for HealthGPT-M3 and HealthGPT-L14, respectively. Please download the corresponding weights:
|Model Type|Model Name|Download|
|:-:|:-:|:-:|
|ViT|clip-vit-large-patch14-336|Download|
|Base Model (HealthGPT-M3)|Phi-3-mini-4k-instruct|Download|
|Base Model (HealthGPT-L14)|phi-4|Download|
|Base Model (HealthGPT-XL32)|Qwen2.5-32B-Instruct|Download|
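The weights in the table can also be fetched programmatically. The following is a minimal sketch, assuming the `huggingface_hub` package is installed and that the Hugging Face repository IDs below correspond to the table entries. The local `weights/` layout and the helper names are hypothetical.

```python
from pathlib import Path

# Assumed Hugging Face repository IDs for the models named in the table.
REPO_IDS = {
    "vit": "openai/clip-vit-large-patch14-336",
    "HealthGPT-M3": "microsoft/Phi-3-mini-4k-instruct",
    "HealthGPT-L14": "microsoft/phi-4",
    "HealthGPT-XL32": "Qwen/Qwen2.5-32B-Instruct",
}


def local_dir_for(repo_id: str, root: str = "weights") -> Path:
    """Map a repository ID to a local folder (hypothetical layout)."""
    return Path(root) / repo_id.split("/")[-1]


def download(keys, root: str = "weights"):
    """Download the selected repositories; requires network access."""
    # Imported lazily so the helpers above work without huggingface_hub.
    from huggingface_hub import snapshot_download

    for key in keys:
        repo_id = REPO_IDS[key]
        snapshot_download(repo_id=repo_id, local_dir=str(local_dir_for(repo_id, root)))


# Example (downloads several GB):
# download(["vit", "HealthGPT-M3"])
```

Downloading only the keys you need avoids pulling the 32B base model when you plan to run HealthGPT-M3.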
For medical vision generation tasks, please follow the official VQGAN guide and download the VQGAN OpenImages (f=8), 8192 model weights from the "Overview of pretrained models" section. Below is the direct link to the corresponding VQGAN pre-trained weights:
|Model Name|Download|
|:-:|:-:|
|VQGAN OpenImages (f=8), 8192, GumbelQuantization|Download|
After downloading, place the last.ckpt and model.yaml files in the taming_transformers/ckpt directory.
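A small check like the one below can confirm that both VQGAN files are where the generation scripts expect them. It is a sketch that assumes you run it from the repository root; the function name is hypothetical.

```python
from pathlib import Path

# File names and directory as stated in the instructions above.
REQUIRED_FILES = ("last.ckpt", "model.yaml")


def missing_vqgan_files(ckpt_dir="taming_transformers/ckpt"):
    """Return the required VQGAN files that are not present in ckpt_dir."""
    directory = Path(ckpt_dir)
    return [name for name in REQUIRED_FILES if not (directory / name).is_file()]
```

An empty list means both files are in place; otherwise the returned names tell you which file still needs to be copied.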
3. Prepare H-LoRA and Adapter Weights
HealthGPT enhances the base model's capabilities for medical visual comprehension and generation by training a small number of H-LoRA parameters and adapter layers that align vision and text. So far we have released a subset of the trained weights, which support medical visual question answering and open-world visual reconstruction. Here are the corresponding weights: Download.
We will soon be releasing the full weights for HealthGPT-L14, along with the H-LoRA weights for medical generation tasks. Stay tuned!!!
⚡ Inference
Medical Visual Question Answering
To perform inference using HealthGPT, please follow these steps:
- Download Necessary Files:
  - Ensure you have downloaded all the required model weights and resources.
- Update Script Paths:
  - Open the script located at llava/demo/com_infer.sh.
  - Modify the following variables to point to the paths where you stored the downloaded files:
    - MODEL_NAME_OR_PATH: Path or identifier for the base model.
    - VIT_PATH: Path to the Vision Transformer model weights.
    - HLORA_PATH: Path to the H-LoRA weights file for visual comprehension.
    - FUSION_LAYER_PATH: Path to your fusion layer weights file.
- Run the Script:
  - Execute the script in your terminal to begin inference:

cd llava/demo
bash com_infer.sh
Alternatively, you can run the Python command directly in your terminal, specifying the paths and parameters yourself. This approach makes it easy to change the image or question as needed:
python3 com_infer.py \
--model_name_or_path "microsoft/Phi-3-mini-4k-instruct" \
--dtype "FP16" \
--hlora_r "64" \
--hlora_alpha "128" \
--hlora_nums "4" \
--vq_idx_nums "8192" \
--instruct_template "phi3_instruct" \
--vit_path "openai/clip-vit-large-patch14-336/" \
--hlora_path "path/to/your/local/com_hlora_weights.bin" \
--fusion_layer_path "path/to/your/local/fusion_layer_weights.bin" \
--question "Your question" \
--img_path "path/to/image.jpg"
- Customize the Question and Image: You can modify the --question and --img_path parameters to ask different questions or analyze different images.
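If you want to ask several questions or analyze several images, the command above can be assembled from Python. The sketch below only rebuilds the HealthGPT-M3 flags shown above; the helper names `build_com_infer_cmd` and `run_batch` are hypothetical, and each call starts a fresh process, so the model is reloaded every time.

```python
import subprocess
import sys


def build_com_infer_cmd(question, img_path, hlora_path, fusion_layer_path,
                        model_name_or_path="microsoft/Phi-3-mini-4k-instruct",
                        vit_path="openai/clip-vit-large-patch14-336/"):
    """Build the argument list for com_infer.py using the HealthGPT-M3 flags."""
    return [
        sys.executable, "com_infer.py",
        "--model_name_or_path", model_name_or_path,
        "--dtype", "FP16",
        "--hlora_r", "64",
        "--hlora_alpha", "128",
        "--hlora_nums", "4",
        "--vq_idx_nums", "8192",
        "--instruct_template", "phi3_instruct",
        "--vit_path", vit_path,
        "--hlora_path", hlora_path,
        "--fusion_layer_path", fusion_layer_path,
        "--question", question,
        "--img_path", img_path,
    ]


def run_batch(pairs, hlora_path, fusion_layer_path, demo_dir="llava/demo"):
    """Run one inference per (question, image path) pair; stop at the first failure."""
    for question, img_path in pairs:
        cmd = build_com_infer_cmd(question, img_path, hlora_path, fusion_layer_path)
        # Relative paths are resolved against demo_dir, so prefer absolute paths.
        subprocess.run(cmd, cwd=demo_dir, check=True)
```

Passing the arguments as a list avoids shell quoting problems when a question contains spaces or quotes.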
Correspondingly, the visual question answering task of HealthGPT-L14 can be executed with the following Python command:
python3 com_infer_phi4.py \
--model_name_or_path "microsoft/Phi-4" \
--dtype "FP16" \
--hlora_r "32" \
--hlora_alpha "64" \
--hlora_nums "4" \
--vq_idx_nums "8192" \
--instruct_template "phi4_instruct" \
--vit_path "openai/clip-vit-large-patch14-336/" \
--hlora_path "path/to/your/local/com_hlora_weights_phi4.bin" \
--question "Your question" \
--img_path "path/to/image.jpg"
The weights of com_hlora_weights_phi4.bin can be downloaded here.
Image Reconstruction
Similarly, set HLORA_PATH to point to the gen_hlora_weights.bin file and configure the other model paths. Then you can perform the image reconstruction task using the following script:
cd llava/demo
bash gen_infer.sh
You can also execute the following Python command directly:
python3 gen_infer.py \
--model_name_or_path "microsoft/Phi-3-mini-4k-instruct" \
--dtype "FP16" \
--hlora_r "256" \
--hlora_alpha "512" \
--hlora_nums "4" \
--vq_idx_nums "8192" \
--instruct_template "phi3_instruct" \
--vit_path "openai/clip-vit-large-patch14-336/" \
--hlora_path "path/to/your/local/gen_hlora_weights.bin" \
--fusion_layer_path "path/to/your/local/fusion_layer_weights.bin" \
--question "Reconstruct the image." \
--img_path "path/to/image.jpg" \
--save_path "path/to/save.jpg"
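The three commands in this section differ mainly in script name, base model, instruction template, and H-LoRA rank settings. The sketch below collects those values in one place for comparison. The `InferencePreset` class and the preset keys are hypothetical names; the values are copied from the commands above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InferencePreset:
    script: str
    model_name_or_path: str
    instruct_template: str
    hlora_r: int
    hlora_alpha: int
    passes_fusion_layer_path: bool  # whether the command above sets --fusion_layer_path


PRESETS = {
    "M3-comprehension": InferencePreset(
        "com_infer.py", "microsoft/Phi-3-mini-4k-instruct", "phi3_instruct", 64, 128, True),
    "L14-comprehension": InferencePreset(
        "com_infer_phi4.py", "microsoft/Phi-4", "phi4_instruct", 32, 64, False),
    "M3-generation": InferencePreset(
        "gen_infer.py", "microsoft/Phi-3-mini-4k-instruct", "phi3_instruct", 256, 512, True),
}


def alpha_to_rank_ratio(preset: InferencePreset) -> float:
    """Ratio hlora_alpha / hlora_r for a preset.

    In standard LoRA this ratio scales the low-rank update; whether H-LoRA
    applies it in exactly the same way is an assumption here.
    """
    return preset.hlora_alpha / preset.hlora_r
```

All three presets use an alpha that is twice the rank, so mixing a rank from one command with weights trained for another is a likely source of loading errors.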
Server
We provide an interactive chat UI built on Gradio. It accepts text and image input and returns text or images depending on the selected mode.
📌 Project Introduction
This project provides a Gradio front-end interface that lets users:
- Analyze image (comprehension task): input text + image, output text
- Generate image (generation task): input text + image, output **i
