DualTrack
DualTrack: Sensorless 3D Ultrasound needs Local and Global Context
Official repository for "DualTrack: Sensorless 3D Ultrasound Needs Local and Global Context" (MICCAI ASMUS Workshop 2025, arXiv paper), the winning entry of the TUS-REC 2025 challenge.

Abstract
Motivation. 3D Ultrasound is cost-effective and has many clinical applications. AI models can analyze 2D ultrasound scans to infer the scan trajectory and build a 3D image, eliminating the need for expensive and/or cumbersome hardware used in conventional 3D ultrasound.
Method. Two types of information can be used to infer scan trajectory from a 2D ultrasound sequence:
- Local features: frame-to-frame motion cues and speckle patterns.
- Global features: scan-level context such as anatomical landmarks and the shape/continuity of anatomical structures.
To best exploit these dual, complementary sources of information, we designed a network called DualTrack. DualTrack features a dual-encoder architecture, with separate modules specializing in local and global features, respectively. These features are combined using a powerful fusion module to predict scan trajectory.

Results. On the TUS-REC 2024 benchmark—a large dataset of over 1000 forearm scans with complex trajectory shapes—DualTrack achieved an average error of < 5 mm (a statistically significant 18.3% improvement over prior state-of-the-art).
We’ve since adapted DualTrack to numerous other datasets with excellent results:
| Dataset | Avg. Error (mm) |
|---------------------------------|-----------------|
| Carotid artery scans | 3.4 |
| Thyroid scans | 4.9 |
| TUS-REC 2025 Challenge Dataset | 9.2 |
Efficiency. DualTrack is efficient and runs on a consumer GPU in < 0.5 s for a 30-second ultrasound scan.
Highlights
- 🧭 Dual-encoder design for local and global context
- 🔗 Robust feature fusion for trajectory prediction
- 📏 Accurate: < 5 mm error on TUS-REC 2024; strong cross-dataset results
- ⚡ Fast: sub-second inference on consumer GPUs
Publicly available models:
| Model | Dataset | Avg. GPE Error (mm) | Download link | Config |
|----------------|---------|---------------------|---------------|--------|
| DualTrack | TUS-REC 2024 | 4.9 (validation set) | dualtrack_final.pt | configs/model/dualtrack.yaml |
| DualTrack Finetuned (TUS-REC 2025 Challenge winner) | TUS-REC 2025 | 9.2 | dualtrack_ft_tus_rec_2025_v3_best.pt | configs/model/dualtrack_ft_tus_rec_2025.yaml |
Instantiate the model using the following code snippet:
from omegaconf import OmegaConf
from src.models import get_model
cfg_path = 'path/to/config.yaml'
cfg = OmegaConf.load(cfg_path)
cfg.checkpoint = 'path/to/checkpoint.pt'
model = get_model(**cfg)
Acknowledgements
We thank the TUS-REC challenge organizing team for putting together the datasets used for training and benchmarking our models! If you find this work interesting please also check out the TUS-REC 2024 paper and dataset.
Usage
Installation
Create a Python environment with python>=3.10 and install the requirements listed in requirements.txt.
Data Preparation
Data Format
To store a tracked ultrasound sweep, this codebase uses an h5 file with the following keys/data structures:
- images: an $N \times H \times W$ uint8 array containing the pixel values of each ultrasound image in the sweep. Here, $N$ is the number of timesteps in the sweep, and $H$ and $W$ are the height and width (axial and lateral dimensions) of the ultrasound image.
- tracking: an $N \times 4 \times 4$ float array containing the sequence $T_0, T_1, \ldots, T_{N-1}$ of tracking transforms. Each $T_i$ is stored as a $4 \times 4$ homogeneous transform matrix mapping from the image coordinate system to the world coordinate system. The image coordinate system is in mm relative to the center of the image, with the following orientation for a vector $(x, y, z, 1)$:
- dimensions: a single array storing the image dimensions as $(W, H, 1)$.
- spacing: a single array storing the image spacing (millimeters per pixel) as (W_spacing, H_spacing, 1).
- pixel_to_image: a single $4 \times 4$ float array containing the transform that maps from the pixel coordinate system to the image coordinate system. The pixel coordinate system has the same orientation as the image coordinate system, but its origin is at the top-left of the image, and its units are pixels rather than millimeters. This transform is used for dense displacement field metrics, which are based on the physical positions of image points.
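The relationship between the pixel, image, and world coordinate systems described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the repository's implementation: the helper names (`make_pixel_to_image`, `mean_point_error`, and so on) are hypothetical, the image center is assumed to sit at `(W / 2, H / 2)` in pixel units, and the point-distance metric is only an assumed example of a metric based on the physical positions of image points.

```python
import math

def matmul4(a, b):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def apply4(t, point):
    """Apply a 4x4 homogeneous transform to a 3D point (x, y, z)."""
    v = (point[0], point[1], point[2], 1.0)
    return tuple(sum(t[i][k] * v[k] for k in range(4)) for i in range(3))

def make_pixel_to_image(width, height, spacing_w, spacing_h):
    """Scale pixels to mm and shift the origin from the top-left corner
    to the image center (assumed to be at width / 2, height / 2)."""
    return [
        [spacing_w, 0.0, 0.0, -0.5 * width * spacing_w],
        [0.0, spacing_h, 0.0, -0.5 * height * spacing_h],
        [0.0, 0.0, 1.0, 0.0],
        [0.0, 0.0, 0.0, 1.0],
    ]

def translation(tx, ty, tz):
    """A pure-translation tracking transform, used here as toy data."""
    return [[1.0, 0.0, 0.0, tx],
            [0.0, 1.0, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]

def mean_point_error(t_true, t_pred, pixel_to_image, pixels):
    """Mean Euclidean distance (mm) between the world positions of the
    same pixels under a ground-truth and a predicted tracking transform."""
    world_true = matmul4(t_true, pixel_to_image)  # pixel -> image -> world
    world_pred = matmul4(t_pred, pixel_to_image)
    dists = [math.dist(apply4(world_true, p), apply4(world_pred, p))
             for p in pixels]
    return sum(dists) / len(dists)

# Toy example: a 640 x 480 image with 0.25 mm x 0.5 mm pixels.
p2i = make_pixel_to_image(640, 480, 0.25, 0.5)
sample_pixels = [(0, 0, 0), (320, 240, 0), (639, 479, 0)]
t_true = translation(10.0, 0.0, 5.0)
t_pred = translation(10.0, 3.0, 9.0)

print(apply4(p2i, (0, 0, 0)))  # top-left pixel expressed in image coordinates (mm)
print(mean_point_error(t_true, t_pred, p2i, sample_pixels))
```

Composing `tracking[i]` with `pixel_to_image` gives a single pixel-to-world transform per frame, which is why storing `pixel_to_image` alongside the tracking data is convenient for point-based metrics.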
If you have a collection of .h5 files in this format, it is easy to create and register a "dataset" with the code base. To prepare a dataset for training and evaluation, first create a .csv file containing at least 4 columns:
- an index column
- sweep_id, a unique id for each sweep
- processed_sweep_path, the .h5 filepath corresponding to the sweep
- split, one of [train, val], indicating whether the sweep should be used for training or validation
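A metadata file with these columns could be written with Python's standard csv module, as in the sketch below. The function name `write_metadata_csv`, the sweep ids, and the file paths are hypothetical, and leaving the index column header blank (as pandas does by default) is an assumption, since the expected header name is not specified here.

```python
import csv
import io

# Column order: index, sweep_id, processed_sweep_path, split.
# Assumption: the index column has an empty header.
COLUMNS = ["", "sweep_id", "processed_sweep_path", "split"]
VALID_SPLITS = ("train", "val")

def write_metadata_csv(file_obj, sweeps):
    """Write one row per sweep; sweeps is an iterable of
    (sweep_id, h5_path, split) tuples."""
    writer = csv.writer(file_obj)
    writer.writerow(COLUMNS)
    for index, (sweep_id, h5_path, split) in enumerate(sweeps):
        if split not in VALID_SPLITS:
            raise ValueError(f"split must be one of {VALID_SPLITS}, got {split!r}")
        writer.writerow([index, sweep_id, h5_path, split])

# Placeholder entries; replace with your own sweeps.
sweeps = [
    ("sweep_000", "/path/to/sweep_000.h5", "train"),
    ("sweep_001", "/path/to/sweep_001.h5", "train"),
    ("sweep_002", "/path/to/sweep_002.h5", "val"),
]

buffer = io.StringIO()
write_metadata_csv(buffer, sweeps)

# Read the file back to confirm the structure.
rows = list(csv.DictReader(io.StringIO(buffer.getvalue())))
print([row["split"] for row in rows])
```

To write to disk instead of an in-memory buffer, pass a file opened with `open(path, "w", newline="")`.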
Finally, you should register your dataset by creating a file (or adding to a file) located at data/datasets.yaml with the following format:
tus-rec:
  data_csv_path: /path/to/metadata.csv
my-dataset-2:
  data_csv_path: "..."
Now, the dataset will be registered with the codebase. You can test this by running:
from src.datasets.sweeps_dataset_v2 import SweepsDataset
ds = SweepsDataset(name='tus-rec')
print(ds[0]['images'].shape)  # shape of the loaded sweep: (N_timesteps, H, W)
TUS-REC To DualTrack Format Conversion
If your data comes from the TUS-REC Challenge, the script scripts/data/convert_tus_rec_format_to_dualtrack_format.py converts it into our format. You only need to prepare a .csv file pointing to the TUS-REC challenge input files. For command-line help on how to run the script, run:
python scripts/data/convert_tus_rec_format_to_dualtrack_format.py -h
Feel free to raise a GitHub issue if you run into any problems using this script.
Run Model Training and Evaluation
DualTrack uses the train.py script for training and evaluate.py script for evaluation, for example:
python train.py -c path/to/config --log_dir="experiment/v0"
python evaluate.py -c path/to/config --log_dir="experiment/eval/v0"
Note: Training scripts will generate a log directory where checkpoints (best/last) will be saved. Certain experiments will use the checkpoints of a previous experiment to initialize components of the model.
Training configurations are found in the folder configs/dualtrack_train_tus_rec/, and evaluation configurations are found in the folder configs/dualtrack_evaluation. A typical config looks like the following:
model:
  name: dualtrack_loc_enc_stg1
data: # dataset options
  version: local_encoder
  dataset: tus-rec # <- use the name you registered your dataset with
  sequence_length_train: 16
  augmentations: true
train: # training options
  lr: 0.0001
  epochs: 5000
  warmup_epochs: 0
  weight_decay: 0.001
  batch_size: 16
  val_every: 100
seed: 0
device: cuda
use_amp: true
logger: wandb # could be tensorboard, or console if not using wandb
logger_kw:
  wandb_project: dualtrack # logger specific options
debug: false
Training DualTrack
Training DualTrack involves three main steps:
- Pretrain the local encoder
- Pretrain the global encoder
- Train the final model
1. Pretraining Local Encoder
Pretraining the local encoder happens in 3 stages:
- Pretraining step 1: we pretrain the 3D CNN backbone on small subsequences of images for 5000 epochs (this should take 4-5 days on an NVIDIA A40 GPU). Use this config.
- Pretraining step 2: we add a ViT stage for frame-wise spatial self-attention on top of the frozen CNN backbone from step 1 using this config. You will need to edit the model.backbone_weights field to point to the best checkpoint from the step 1 experiment.
- Pretraining step 3: we add a temporal attention stage and pretrain it on top of the frozen CNN + ViT model from step 2 using this config. As in step 2, edit model.backbone_weights to point to the best checkpoint from the previous step.
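The chaining of checkpoints across the three stages can be sketched as follows. This only illustrates the bookkeeping and is not part of the codebase: `make_stage_config`, the log directories, the checkpoint filename `best.pt`, and the stage 2 and stage 3 model names are all hypothetical (only `dualtrack_loc_enc_stg1` and the `model.backbone_weights` field appear in this README).

```python
import copy
from pathlib import PurePosixPath

def make_stage_config(base_cfg, model_name, prev_log_dir=None, ckpt_name="best.pt"):
    """Return a copy of base_cfg for one pretraining stage. If a previous
    stage exists, point model.backbone_weights at its best checkpoint.
    The checkpoint filename is an assumption."""
    cfg = copy.deepcopy(base_cfg)
    cfg["model"] = {"name": model_name}
    if prev_log_dir is not None:
        cfg["model"]["backbone_weights"] = str(PurePosixPath(prev_log_dir) / ckpt_name)
    return cfg

base_cfg = {"data": {"dataset": "tus-rec"}}

# (model name, log directory) per stage; names after stage 1 are hypothetical.
stages = [
    ("dualtrack_loc_enc_stg1", "experiment/loc_enc_stg1"),
    ("dualtrack_loc_enc_stg2", "experiment/loc_enc_stg2"),
    ("dualtrack_loc_enc_stg3", "experiment/loc_enc_stg3"),
]

configs = []
prev_log_dir = None
for model_name, log_dir in stages:
    configs.append(make_stage_config(base_cfg, model_name, prev_log_dir))
    prev_log_dir = log_dir  # the next stage initializes from this stage's output

for cfg in configs:
    print(cfg["model"])
```

Each stage must finish before the next one starts, because the next config refers to a checkpoint that only exists once the previous run has written it.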
2. Pretraining Global Encoder
The second step of DualTrack training is to pretrain the global encoder using sparsely sampled subsequences of the ultrasound frames. The global encoder consists of an image backbone followed by a transformer stage with temporal self-attention. Several image backbones are available: CNN, iBOT, MedSAM, and USFM, and the code can easily be adapted to use others. Note that some backbones require pretrained weights or add dependencies. Choose one of the configs in configs/dualtrack_train_tus_rec/global_encoder (we recommend cnn.yaml as a good starting point with no extra dependencies).
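The difference between the small contiguous subsequences used for the local encoder and the sparsely sampled subsequences used for the global encoder can be illustrated with the sketch below. The exact sampling strategy is not described here, so both functions are assumptions: `sample_local_window` draws a random contiguous window, and `sample_sparse_indices` picks evenly spaced frames across the whole sweep.

```python
import random

def sample_local_window(n_frames, length, rng):
    """Pick `length` consecutive frame indices starting at a random offset."""
    if not 1 <= length <= n_frames:
        raise ValueError("length must be between 1 and n_frames")
    start = rng.randrange(n_frames - length + 1)
    return list(range(start, start + length))

def sample_sparse_indices(n_frames, length):
    """Pick `length` evenly spaced frame indices covering the whole sweep,
    always including the first and last frame."""
    if not 2 <= length <= n_frames:
        raise ValueError("length must be between 2 and n_frames")
    return [(i * (n_frames - 1)) // (length - 1) for i in range(length)]

rng = random.Random(0)
local_indices = sample_local_window(n_frames=100, length=16, rng=rng)
sparse_indices = sample_sparse_indices(n_frames=100, length=5)

print(local_indices)   # 16 neighbouring frames: fine-grained motion cues
print(sparse_indices)  # frames spread over the sweep: scan-level context
```

A contiguous window preserves frame-to-frame motion cues, while a sparse sample trades that detail for coverage of the full scan at the same sequence length.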
