VoxTell
Free-Text Promptable Universal 3D Medical Image Segmentation
<div align="center">
  <img src="documentation/assets/VoxTellLogo.png" alt="VoxTell Logo"/>
</div>

This repository contains the official implementation of our CVPR 2026 paper:
VoxTell: Free-Text Promptable Universal 3D Medical Image Segmentation
VoxTell is a 3D vision–language segmentation model that directly maps free-form text prompts, from single words to full clinical sentences, to volumetric masks. By leveraging multi-stage vision–language fusion, VoxTell achieves state-of-the-art performance on anatomical and pathological structures across CT, PET, and MRI modalities, excelling on familiar concepts while generalizing to related unseen classes.
Authors: Maximilian Rokuss*, Moritz Langenberg*, Yannick Kirchhoff, Fabian Isensee, Benjamin Hamm, Constantin Ulrich, Sebastian Regnery, Lukas Bauer, Efthimios Katsigiannopulos, Tobias Norajitra, Klaus Maier-Hein
Paper:
📰 News
- 03/2026: 🥇 First place on the official ReXGroundingCT benchmark
- 02/2026: 📄 VoxTell was accepted at CVPR 2026!
- 02/2026: 🎉 The community built a VoxTell web interface - thank you! 👉 voxtell-web-plugin
- 01/2026: 🧩 Model checkpoint v1.1 released and now available with official napari plugin 👉 napari-voxtell
- 12/2025: 🚀
VoxTelllaunched with a Python backend and PyPI package (pip install voxtell)
Overview
VoxTell is trained on a large-scale, multi-modality 3D medical imaging dataset, aggregating 158 public sources with over 62,000 volumetric images. The data covers:
- Brain, head & neck, thorax, abdomen, pelvis
- Musculoskeletal system and extremities
- Vascular structures, major organs, substructures, and lesions
This rich semantic diversity enables language-conditioned 3D reasoning, allowing VoxTell to generate volumetric masks from flexible textual descriptions, from coarse anatomical labels to fine-grained pathological findings.
Architecture
VoxTell combines 3D image encoding with text-prompt embeddings and multi-stage vision–language fusion:
- Image Encoder: Processes 3D volumetric input into latent feature representations
- Prompt Encoder: Embeds text prompts with a frozen Qwen3-Embedding-4B model
- Prompt Decoder: Transforms text queries and image latents into multi-scale text features
- Image Decoder: Fuses visual and textual information at multiple resolutions using MaskFormer-style query-image fusion with deep supervision
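To make the data flow concrete, here is a self-contained toy sketch of the four stages above in PyTorch. It is purely illustrative: every module name and shape is a hypothetical stand-in, not VoxTell's actual API, and the multi-stage, multi-resolution fusion is collapsed into a single attention step.

```python
import torch
import torch.nn as nn

class VoxTellSketch(nn.Module):
    """Toy sketch of the VoxTell information flow. All names are hypothetical."""

    def __init__(self, vis_dim=32, txt_dim=64):
        super().__init__()
        self.image_encoder = nn.Conv3d(1, vis_dim, 3, padding=1)            # 3D volume -> latent features
        self.prompt_proj = nn.Linear(txt_dim, vis_dim)                      # frozen LLM embedding -> query
        self.fusion = nn.MultiheadAttention(vis_dim, 4, batch_first=True)   # query-image fusion
        self.mask_head = nn.Conv3d(vis_dim, 1, 1)                           # per-prompt mask logits

    def forward(self, volume, text_emb):
        feats = self.image_encoder(volume)                # (B, C, X, Y, Z)
        b, c = feats.shape[:2]
        tokens = feats.flatten(2).transpose(1, 2)         # (B, X*Y*Z, C) image tokens
        query = self.prompt_proj(text_emb).unsqueeze(1)   # (B, 1, C) text query
        fused, _ = self.fusion(query, tokens, tokens)     # text query attends to image tokens
        gate = fused.transpose(1, 2).reshape(b, c, 1, 1, 1)
        return self.mask_head(feats * gate)               # (B, 1, X, Y, Z) mask logits

vol = torch.randn(1, 1, 16, 16, 16)     # toy 3D volume
emb = torch.randn(1, 64)                # stand-in for a Qwen3 text embedding
print(VoxTellSketch()(vol, emb).shape)  # torch.Size([1, 1, 16, 16, 16])
```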
🛠 Installation
1. Create a Virtual Environment
VoxTell supports Python 3.10+ and works with Conda, pip, or any other virtual environment manager. Here's an example using Conda:
```bash
conda create -n voxtell python=3.12
conda activate voxtell
```
2. Install PyTorch
> [!WARNING]
> Temporary Compatibility Warning
>
> There is a known issue with PyTorch 2.9.0 causing OOM errors during inference (related to 3D convolutions; see the PyTorch issue here).
> Until this is resolved, please use PyTorch 2.8.0 or earlier.
Install PyTorch compatible with your CUDA version. For example, for Ubuntu with a modern NVIDIA GPU:
```bash
pip install torch==2.8.0 torchvision==0.23.0 --index-url https://download.pytorch.org/whl/cu126
```
For other configurations (macOS, CPU, different CUDA versions), please refer to the PyTorch Get Started page.
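You can quickly confirm that the expected version is installed and that the GPU is visible:

```python
import torch

print(torch.__version__)          # should report 2.8.0 or earlier (see warning above)
print(torch.cuda.is_available())  # True if the CUDA build can see your GPU
```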
3. Install VoxTell
Install via pip (you can also use uv):
```bash
pip install voxtell
```
or install directly from the repository:
```bash
git clone https://github.com/MIC-DKFZ/VoxTell
cd VoxTell
pip install -e .
```
🚀 Getting Started
👉 NEW: Try VoxTell interactively in the napari viewer
You can download VoxTell checkpoints using the Hugging Face huggingface_hub library:
```python
from huggingface_hub import snapshot_download

MODEL_NAME = "voxtell_v1.1"       # Updated models may be available in the future
DOWNLOAD_DIR = "/home/user/temp"  # Optionally specify the download directory

download_path = snapshot_download(
    repo_id="mrokuss/VoxTell",
    allow_patterns=[f"{MODEL_NAME}/*", "*.json"],
    local_dir=DOWNLOAD_DIR,
)

# Path to the model directory, e.g., "/home/user/temp/voxtell_v1.1"
model_path = f"{download_path}/{MODEL_NAME}"
```
Command-Line Interface (CLI)
VoxTell provides a convenient command-line interface for running predictions:
```bash
voxtell-predict -i input.nii.gz -o output_folder -m /path/to/model -p "liver" "spleen" "kidney"
```
Single prompt:
```bash
voxtell-predict -i case001.nii.gz -o output_folder -m /path/to/model -p "liver"
# Output: output_folder/case001_liver.nii.gz
```
Multiple prompts (saves individual files by default):
```bash
voxtell-predict -i case001.nii.gz -o output_folder -m /path/to/model -p "liver" "spleen" "right kidney"
# Outputs:
# output_folder/case001_liver.nii.gz
# output_folder/case001_spleen.nii.gz
# output_folder/case001_right_kidney.nii.gz
```
Save combined multi-label file:
```bash
voxtell-predict -i case001.nii.gz -o output_folder -m /path/to/model -p "liver" "spleen" --save-combined
# Output: output_folder/case001.nii.gz (multi-label: 1=liver, 2=spleen)
# ⚠️ WARNING: Overlapping structures will be overwritten by later prompts
```
CLI Options
| Argument | Short | Required | Description |
|----------|-------|----------|-------------|
| --input | -i | Yes | Path to input NIfTI file |
| --output | -o | Yes | Path to output folder |
| --model | -m | Yes | Path to VoxTell model directory |
| --prompts | -p | Yes | Text prompt(s) for segmentation |
| --device | | No | Device to use: cuda (default) or cpu |
| --gpu | | No | GPU device ID (default: 0) |
| --save-combined | | No | Save multi-label file instead of individual files |
| --verbose | | No | Enable verbose output |
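For instance, the optional flags from the table can be combined to run verbosely on a specific GPU:

```bash
voxtell-predict -i case001.nii.gz -o output_folder -m /path/to/model -p "liver" --device cuda --gpu 1 --verbose
```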
Python API
For more control or integration into Python workflows, use the Python API:
```python
import torch

from voxtell.inference.predictor import VoxTellPredictor
from nnunetv2.imageio.nibabel_reader_writer import NibabelIOWithReorient

# Select device
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Load image
image_path = "/path/to/your/image.nii.gz"
img, _ = NibabelIOWithReorient().read_images([image_path])

# Define text prompts
text_prompts = ["liver", "right kidney", "left kidney", "spleen"]

# Initialize predictor
predictor = VoxTellPredictor(
    model_dir="/path/to/voxtell_model_directory",
    device=device,
)

# Run prediction
# Output shape: (num_prompts, x, y, z)
voxtell_seg = predictor.predict_single_image(img, text_prompts)
```
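To write each predicted mask back to disk in the geometry of the (reoriented) input, one option is the same nnU-Net reader/writer class. This is a minimal sketch, assuming write_seg accepts a 3D integer mask together with the properties dict returned by read_images (which the snippet above discards):

```python
# Sketch: save one binary NIfTI mask per prompt via the nnU-Net writer.
# Re-read the image here only to keep the `properties` dict.
io = NibabelIOWithReorient()
img, properties = io.read_images([image_path])

for prompt, mask in zip(text_prompts, voxtell_seg):
    out_file = f"output_folder/{prompt.replace(' ', '_')}.nii.gz"
    io.write_seg(mask.astype("uint8"), out_file, properties)
```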
Optional: Visualize Results
You can visualize the segmentation results using napari:
```bash
pip install "napari[all]"
```
> [!TIP]
> If you already work in napari, the napari-voxtell plugin offers the fastest way to explore VoxTell results interactively.
```python
import napari
import numpy as np

# Create a napari viewer and add the original image
viewer = napari.Viewer()
viewer.add_image(img, name='Image')

# Add segmentation results as label layers for each prompt
for i, prompt in enumerate(text_prompts):
    viewer.add_labels(voxtell_seg[i].astype(np.uint8), name=prompt)

# Run napari
napari.run()
```
Important: Image Orientation and Spacing
- ⚠️ Image Orientation (Critical): For correct anatomical localization (e.g., distinguishing left from right), images must be in RAS orientation. VoxTell was trained on data reoriented with the NibabelIOWithReorient reader used in the Python API example above, so orientation mismatches are a common source of error. A telltale sign is when a simple prompt like "liver" fails and segments parts of the spleen instead. Make sure your image metadata is correct; a quick way to check and fix the orientation is shown in the snippet below.
- Image Spacing: For faster inference, the model does not resample images to a standardized spacing. Performance may degrade on images with very uncommon voxel spacings (e.g., super-high-resolution brain MRI). In such cases, consider resampling the image to a more common spacing beforehand.
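As a quick check, nibabel can report an image's axis codes and reorient it to RAS. A minimal sketch (file names are placeholders):

```python
import nibabel as nib

nii = nib.load("image.nii.gz")
print(nib.aff2axcodes(nii.affine))       # ('R', 'A', 'S') is what VoxTell expects

ras_nii = nib.as_closest_canonical(nii)  # reorient to the closest RAS-aligned form
nib.save(ras_nii, "image_ras.nii.gz")
```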
