<h1 align = "center"> <img src="images/HealthGPT.png" alt="icon" style="width:50px; vertical-align:middle;" /> HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation </h1>
<div align="center">
Tianwei Lin<sup>1</sup>, Wenqiao Zhang<sup>1</sup>, Sijing Li<sup>1</sup>, Yuqian Yuan<sup>1</sup>, Binhe Yu<sup>2</sup>, Haoyuan Li<sup>3</sup>, Wanggui He<sup>3</sup>, Hao Jiang<sup>3</sup>,

Mengze Li<sup>4</sup>, Xiaohui Song<sup>1</sup>, Siliang Tang<sup>1</sup>, Jun Xiao<sup>1</sup>, Hui Lin<sup>1</sup>, Yueting Zhuang<sup>1</sup>, Beng Chin Ooi<sup>1</sup>

<sup>1</sup>Zhejiang University, <sup>2</sup>University of Electronic Science and Technology of China,

<sup>3</sup>Alibaba, <sup>4</sup>The Hong Kong University of Science and Technology

</div>

🌟 Overview

Welcome to HealthGPT! 🚀 HealthGPT is a medical Large Vision-Language Model with a unified framework that integrates medical visual comprehension and generation. This project proposes heterogeneous low-rank adaptation (H-LoRA) and a three-stage learning strategy, which together enable a pre-trained large language model to follow both visual comprehension and generation instructions efficiently.

🔥 News

[2025.05.02] 🎉🎉🎉 HealthGPT has been accepted to ICML 2025 as a Spotlight presentation.
[2025.03.20] We released an upgraded specialized comprehension model, HealthGPT-XL32, based on Qwen2.5-32B-Instruct. It outperforms HealthGPT-L14, scoring 70.4 versus 66.4.
[2025.03.06] We have released the VL-Health dataset.
[2025.02.26] We have released the inference UI/UX.
[2025.02.17] We have released the pre-trained weights on HuggingFace, along with the inference script.

TODO

[x] Release inference code.
[x] Release the pre-trained weights of the model.
[x] Release the inference UI/UX.
[x] Release VL-Health dataset.
[ ] Release training scripts.
[ ] Construct the website.

📚 Task Classification and Support

HealthGPT supports 7 types of medical comprehension tasks and 5 types of medical generation tasks, outperforming recent unified visual models and medical-specific models.

🏗️ Architecture

The HealthGPT architecture integrates hierarchical visual perception and H-LoRA. A task-specific hard router selects the visual features and H-LoRA plugins to use, and the model generates text and vision outputs in an autoregressive manner.
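
As a rough illustration of the hard-routing idea, the sketch below maps a task type to a visual feature level and an H-LoRA plugin slot. Everything in it (`Route`, `ROUTES`, `route`, the feature labels, and the slot numbers) is hypothetical and not taken from the HealthGPT codebase. It only shows that a hard router can be a fixed lookup keyed by task.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    feature_level: str  # placeholder label for which visual features to use
    plugin_slot: int    # placeholder index of the H-LoRA plugin to activate


# Hypothetical routing table: one fixed entry per task type.
ROUTES = {
    "comprehension": Route(feature_level="high_level", plugin_slot=0),
    "generation": Route(feature_level="low_level", plugin_slot=1),
}


def route(task: str) -> Route:
    """Return the fixed route for a task; unknown tasks raise ValueError."""
    try:
        return ROUTES[task]
    except KeyError:
        raise ValueError(f"unsupported task: {task!r}") from None


print(route("comprehension"))
```

Because the choice depends only on the task label, the same input task always activates the same features and plugin.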

🛠️ Getting Started

We have released our model in two configurations, HealthGPT-M3 and HealthGPT-L14, to suit different requirements and resource availability:

HealthGPT-M3: A smaller version optimized for speed and reduced memory usage.
HealthGPT-L14: A larger version designed for higher performance and more complex tasks.

Installation

1. Prepare Environment

First, clone our repository and create the Python environment for running HealthGPT using the following commands:

# clone our project
git clone https://github.com/DCDmllm/HealthGPT.git
cd HealthGPT

# prepare python environment
conda create -n HealthGPT python=3.10
conda activate HealthGPT
pip install -r requirements.txt

2. Prepare Pre-trained Weights

For medical vision generation tasks, please follow the official VQGAN guide and download the VQGAN OpenImages (f=8), 8192 model weights from the "Overview of pretrained models" section. Below is the direct link to the corresponding VQGAN pre-trained weights:

|Model Name|Download|
|:-:|:-:|
|VQGAN OpenImages (f=8), 8192, GumbelQuantization|Download|

After downloading, place the last.ckpt and model.yaml files in the taming_transformers/ckpt directory.
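
If you want to confirm the files are in place before running generation, a minimal check could look like the sketch below. The file names and directory come from the step above; the helper name `missing_vqgan_files` is hypothetical, and the script assumes it is run from the repository root.

```python
from pathlib import Path

# File names from the step above.
REQUIRED_FILES = ("last.ckpt", "model.yaml")


def missing_vqgan_files(ckpt_dir="taming_transformers/ckpt"):
    """Return the required VQGAN files that are not present in ckpt_dir."""
    root = Path(ckpt_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_vqgan_files()
    if missing:
        print("Missing VQGAN files:", ", ".join(missing))
    else:
        print("VQGAN checkpoint files found.")
```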

3. Prepare H-LoRA and Adapter Weights

HealthGPT enhances the base model's capabilities for medical visual comprehension and generation by training a small number of H-LoRA parameters and the adapter layers that align vision and text. So far we have released a subset of the trained weights, which supports medical visual question answering and open-world visual reconstruction tasks. Here are the corresponding weights: Download.

We will soon release the full weights for HealthGPT-L14, along with the H-LoRA weights for medical generation tasks. Stay tuned!

⚡ Inference

Medical Visual Question Answering

To perform inference using HealthGPT, please follow these steps:

Download Necessary Files:
- Ensure you have downloaded all the required model weights and resources.
Update Script Paths:
- Open the script located at llava/demo/com_infer.sh.
- Modify the following variables to point to the paths where you stored the downloaded files:
  - MODEL_NAME_OR_PATH: Path or identifier of the base model.
  - VIT_PATH: Path to the Vision Transformer model weights.
  - HLORA_PATH: Path to the H-LoRA weights file for visual comprehension.
  - FUSION_LAYER_PATH: Path to your fusion layer weights file.
Run the Script:
- Execute the script in your terminal to begin inference:
```
cd llava/demo
bash com_infer.sh
```

Alternatively, you can run the Python command directly in your terminal, specifying the paths and parameters as arguments. This makes it easy to change the image or question as needed:

python3 com_infer.py \
    --model_name_or_path "microsoft/Phi-3-mini-4k-instruct" \
    --dtype "FP16" \
    --hlora_r "64" \
    --hlora_alpha "128" \
    --hlora_nums "4" \
    --vq_idx_nums "8192" \
    --instruct_template "phi3_instruct" \
    --vit_path "openai/clip-vit-large-patch14-336/" \
    --hlora_path "path/to/your/local/com_hlora_weights.bin" \
    --fusion_layer_path "path/to/your/local/fusion_layer_weights.bin" \
    --question "Your question" \
    --img_path "path/to/image.jpg"

Customize the Question and Image: You can modify the --question and --img_path parameters to ask different questions or analyze different images.
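
To ask several questions in a row, one option is to generate the command line for each question/image pair from a small script. The sketch below reuses only the flags shown in the command above; the helper `build_command` and the example questions and image paths are hypothetical placeholders.

```python
import shlex

# Fixed arguments copied from the com_infer.py command above.
BASE_ARGS = [
    "python3", "com_infer.py",
    "--model_name_or_path", "microsoft/Phi-3-mini-4k-instruct",
    "--dtype", "FP16",
    "--hlora_r", "64",
    "--hlora_alpha", "128",
    "--hlora_nums", "4",
    "--vq_idx_nums", "8192",
    "--instruct_template", "phi3_instruct",
    "--vit_path", "openai/clip-vit-large-patch14-336/",
    "--hlora_path", "path/to/your/local/com_hlora_weights.bin",
    "--fusion_layer_path", "path/to/your/local/fusion_layer_weights.bin",
]


def build_command(question, img_path):
    """Return the full argument list for one question/image pair."""
    return BASE_ARGS + ["--question", question, "--img_path", img_path]


if __name__ == "__main__":
    # Hypothetical inputs; replace with your own questions and images.
    jobs = [
        ("Is there any abnormality in this image?", "path/to/image1.jpg"),
        ("What imaging modality is shown?", "path/to/image2.jpg"),
    ]
    for question, img_path in jobs:
        # Print a shell-safe command line for each job.
        print(shlex.join(build_command(question, img_path)))
```

The printed lines can be pasted into a terminal from `llava/demo`, or the argument list can be passed to `subprocess.run` to execute each job directly.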

Correspondingly, the visual question answering task of HealthGPT-L14 can be executed with the following Python command:

python3 com_infer_phi4.py \
    --model_name_or_path "microsoft/Phi-4" \
    --dtype "FP16" \
    --hlora_r "32" \
    --hlora_alpha "64" \
    --hlora_nums "4" \
    --vq_idx_nums "8192" \
    --instruct_template "phi4_instruct" \
    --vit_path "openai/clip-vit-large-patch14-336/" \
    --hlora_path "path/to/your/local/com_hlora_weights_phi4.bin" \
    --question "Your question" \
    --img_path "path/to/image.jpg"

The weights of com_hlora_weights_phi4.bin can be downloaded here.

Image Reconstruction

Similarly, set HLORA_PATH to point to the gen_hlora_weights.bin file and configure the other model paths. You can then perform the image reconstruction task using the following script:

cd llava/demo
bash gen_infer.sh

You can also execute the following Python command directly:

python3 gen_infer.py \
    --model_name_or_path "microsoft/Phi-3-mini-4k-instruct" \
    --dtype "FP16" \
    --hlora_r "256" \
    --hlora_alpha "512" \
    --hlora_nums "4" \
    --vq_idx_nums "8192" \
    --instruct_template "phi3_instruct" \
    --vit_path "openai/clip-vit-large-patch14-336/" \
    --hlora_path "path/to/your/local/gen_hlora_weights.bin" \
    --fusion_layer_path "path/to/your/local/fusion_layer_weights.bin" \
    --question "Reconstruct the image." \
    --img_path "path/to/image.jpg" \
    --save_path "path/to/save.jpg"
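
The three commands in this section differ mainly in their base model and in the `--hlora_r` and `--hlora_alpha` values. The sketch below collects those values and computes `alpha / r`, which is the scaling factor in standard LoRA. Whether H-LoRA applies the same scaling is an assumption here, not something this README states, and the dictionary labels are hypothetical.

```python
# (hlora_r, hlora_alpha) values copied from the commands in this section.
HLORA_CONFIGS = {
    "Phi-3-mini comprehension": (64, 128),
    "Phi-4 comprehension": (32, 64),
    "Phi-3-mini generation": (256, 512),
}


def lora_scale(r, alpha):
    """Standard LoRA scaling factor alpha / r (assumed, not confirmed for H-LoRA)."""
    return alpha / r


for name, (r, alpha) in HLORA_CONFIGS.items():
    print(f"{name}: r={r}, alpha={alpha}, alpha/r={lora_scale(r, alpha):.1f}")
```

In all three released configurations the alpha value is twice the rank, so the rank and alpha flags should be changed together to match the weights being loaded.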

Server

We provide an interactive chat UI built on Gradio. It accepts text and image input and returns text or images, depending on the selected mode.

📌 Project Introduction

This project provides a Gradio front-end interface that lets users:

Analyze image (comprehension task): input text + image, output text
Generate image (generation task): input text + image, output image
