DualTrack: Sensorless 3D Ultrasound Needs Local and Global Context

Official repository for "DualTrack: Sensorless 3D Ultrasound Needs Local and Global Context" (MICCAI ASMUS Workshop 2025, arXiv paper), winner of the TUS-REC 2025 challenge.

Model Demo

Abstract

Motivation. 3D Ultrasound is cost-effective and has many clinical applications. AI models can analyze 2D ultrasound scans to infer the scan trajectory and build a 3D image, eliminating the need for expensive and/or cumbersome hardware used in conventional 3D ultrasound.

Method. Two types of information can be used to infer scan trajectory from a 2D ultrasound sequence:

Local features: frame-to-frame motion cues and speckle patterns.
Global features: scan-level context such as anatomical landmarks and the shape/continuity of anatomical structures.

To best exploit these dual, complementary sources of information, we designed a network called DualTrack. DualTrack features a dual-encoder architecture, with separate modules specializing in local and global features, respectively. These features are combined using a powerful fusion module to predict scan trajectory.

Method

Results. On the TUS-REC 2024 benchmark—a large dataset of over 1000 forearm scans with complex trajectory shapes—DualTrack achieved an average error of < 5 mm (a statistically significant 18.3% improvement over prior state-of-the-art).
We’ve since adapted DualTrack to numerous other datasets with excellent results:

| Dataset                        | Avg. Error (mm) |
|--------------------------------|-----------------|
| Carotid artery scans           | 3.4             |
| Thyroid scans                  | 4.9             |
| TUS-REC 2025 Challenge Dataset | 9.2             |

Efficiency. DualTrack is efficient and runs on a consumer GPU in < 0.5 s for a 30-second ultrasound scan.

Highlights

🧭 Dual-encoder design for local and global context
🔗 Robust feature fusion for trajectory prediction
📏 Accurate: < 5 mm error on TUS-REC 2024; strong cross-dataset results
⚡ Fast: sub-second inference on consumer GPUs

Publicly available models:

| Model | Dataset | Avg. GPE Error (mm) | Download link | Config |
|-------|---------|---------------------|---------------|--------|
| DualTrack | TUS-REC 2024 | 4.9 (validation set) | dualtrack_final.pt | configs/model/dualtrack.yaml |
| DualTrack Finetuned (TUS-REC 2025 Challenge winner) | TUS-REC 2025 | 9.2 | dualtrack_ft_tus_rec_2025_v3_best.pt | configs/model/dualtrack_ft_tus_rec_2025.yaml |

Instantiate the model using the following code snippet:

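from omegaconf import OmegaConf
from src.models import get_model

# Load the model config from the table above and point it at the downloaded weights.
cfg_path = 'path/to/config.yaml'
cfg = OmegaConf.load(cfg_path)
cfg.checkpoint = 'path/to/checkpoint.pt'

# The config entries are passed to the model factory as keyword arguments.
model = get_model(**cfg)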

Acknowledgements

We thank the TUS-REC challenge organizing team for putting together the datasets used for training and benchmarking our models! If you find this work interesting, please also check out the TUS-REC 2024 paper and dataset.

Usage

Installation

Create a Python environment with python>=3.10 and install the requirements listed in requirements.txt.

Data Preparation

Data Format

To store a tracked ultrasound sweep, this codebase uses an h5 file with the following keys/data structures:

images: $N \times H \times W$ uint8 array containing the pixel values of each ultrasound image in the sweep. Here, $N$ is the number of timesteps in the sweep, and $H$ and $W$ are the height and width (axial and lateral dimensions) of the ultrasound image.
tracking: $N \times 4 \times 4$ float array containing the sequence $T_0, T_1, \ldots, T_{N-1}$ of tracking transforms. Each $T_i$ is stored as a $4 \times 4$ homogeneous transform matrix that maps from the image coordinate system to the world coordinate system. The image coordinate system is in millimeters relative to the center of the image, with the following orientation for a vector $(x, y, z, 1)$:
dimensions: a single array storing the image dimensions as $(W, H, 1)$
spacing: a single array storing the image spacing (millimeters per pixel) as $(W_{\text{spacing}}, H_{\text{spacing}}, 1)$
pixel_to_image: a single $4 \times 4$ float array containing the transform that maps from the pixel coordinate system to the image coordinate system. The pixel coordinate system has the same orientation as the image coordinate system, but its origin is at the top-left of the image, and its units are pixels rather than millimeters. This transform is used for dense displacement field metrics, which are based on the physical positions of image points.
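
As a rough illustration of this layout, the sketch below assembles the five arrays for a synthetic sweep and maps a pixel to world coordinates. The function names are hypothetical and not part of the codebase. It also assumes that the image-system origin lies at pixel position (W/2, H/2) and that each key is stored as one top-level HDF5 dataset; check both against your own data before relying on it.

```python
import numpy as np


def make_pixel_to_image(width, height, spacing_w, spacing_h):
    """Build a 4x4 transform from pixel coordinates to image coordinates (mm).

    Assumption: the image-system origin sits at (width / 2, height / 2) in
    pixel units; the repository may use a slightly different center convention.
    """
    scale = np.diag([spacing_w, spacing_h, 1.0, 1.0])  # pixels -> millimeters
    shift = np.eye(4)
    shift[0, 3] = -0.5 * width * spacing_w   # move the origin from the top-left ...
    shift[1, 3] = -0.5 * height * spacing_h  # ... to the image center
    return shift @ scale


def make_sweep_arrays(images, tracking, spacing_w, spacing_h):
    """Collect the five arrays of the sweep format in a dict keyed by name."""
    n, h, w = images.shape
    assert images.dtype == np.uint8, "images must be uint8"
    assert tracking.shape == (n, 4, 4), "one 4x4 transform per frame"
    return {
        "images": images,
        "tracking": np.asarray(tracking, dtype=float),
        "dimensions": np.array([w, h, 1]),
        "spacing": np.array([spacing_w, spacing_h, 1.0]),
        "pixel_to_image": make_pixel_to_image(w, h, spacing_w, spacing_h),
    }


def write_sweep(path, arrays):
    """Write the arrays to an .h5 file, one dataset per key (assumed layout)."""
    import h5py  # imported lazily so the helpers above only need NumPy

    with h5py.File(path, "w") as f:
        for key, value in arrays.items():
            f.create_dataset(key, data=value)


def pixel_to_world(arrays, frame_idx, u, v):
    """Map pixel (u, v) of one frame to homogeneous world coordinates."""
    p = np.array([u, v, 0.0, 1.0])
    return arrays["tracking"][frame_idx] @ arrays["pixel_to_image"] @ p


# Synthetic sweep: 3 blank frames of 640 x 480 pixels, identity tracking.
images = np.zeros((3, 480, 640), dtype=np.uint8)
tracking = np.tile(np.eye(4), (3, 1, 1))
arrays = make_sweep_arrays(images, tracking, spacing_w=0.2, spacing_h=0.2)

# With identity tracking, the image center maps to the world origin.
print(pixel_to_world(arrays, 0, 320, 240))
```

Calling write_sweep("sweep.h5", arrays) would then produce a file with the five keys listed above.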

If you have a collection of .h5 files in this format, it is easy to create and register a "dataset" with the code base. To prepare a dataset for training and evaluation, first create a .csv file containing at least 4 columns:

an index column
sweep_id, a unique id for each sweep
processed_sweep_path, the .h5 filepath corresponding to the sweep
split, one of [train, val], indicating whether the sweep should be used for training or validation
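
A metadata file with these four columns could be generated along the following lines. This is only a sketch: the helper name is hypothetical, the rule of sending the last 20% of the sorted files to val is purely illustrative, and leaving the index column header blank (as pandas does when writing a DataFrame) is an assumption.

```python
import csv
from pathlib import Path


def write_metadata_csv(h5_dir, csv_path, val_fraction=0.2):
    """Write one CSV row per .h5 sweep found in h5_dir.

    The last val_fraction of the sorted files is assigned to 'val';
    this split rule is only an illustration.
    """
    paths = sorted(Path(h5_dir).glob("*.h5"))
    n_val = int(round(len(paths) * val_fraction))
    n_train = len(paths) - n_val
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        # Blank header for the index column, as pandas writes it (assumption).
        writer.writerow(["", "sweep_id", "processed_sweep_path", "split"])
        for i, path in enumerate(paths):
            split = "train" if i < n_train else "val"
            # The file name without extension serves as the unique sweep id.
            writer.writerow([i, path.stem, str(path.resolve()), split])
    return len(paths)


# Example call (placeholder paths):
# write_metadata_csv("/path/to/h5_sweeps", "/path/to/metadata.csv")
```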

Finally, you should register your dataset by creating a file (or adding to a file) located at data/datasets.yaml with the following format:

tus-rec: 
    data_csv_path: /path/to/metadata.csv

my-dataset-2: 
    data_csv_path: "..."

Now, the dataset will be registered with the codebase. You can test this by running:

from src.datasets.sweeps_dataset_v2 import SweepsDataset
ds = SweepsDataset(name='tus-rec')
print(ds[0]['images'].shape)  # shape of the loaded sweep: (N_timesteps, H, W)

TUS-REC To DualTrack Format Conversion

If your data comes from the TUS-REC Challenge, you can convert it into our format with the script scripts/data/convert_tus_rec_format_to_dualtrack_format.py. You only need to prepare a .csv file pointing to the TUS-REC challenge input files. For command-line help on running the script, run:

python scripts/data/convert_tus_rec_format_to_dualtrack_format.py -h

Feel free to raise a GitHub issue if you run into problems with this script.

Run Model Training and Evaluation

DualTrack uses the train.py script for training and evaluate.py script for evaluation, for example:

python train.py -c path/to/config --log_dir="experiment/v0"
python evaluate.py -c path/to/config --log_dir="experiment/eval/v0"

Note: Training scripts will generate a log directory where checkpoints (best/last) will be saved. Certain experiments will use the checkpoints of a previous experiment to initialize components of the model.

Training configurations are found in the folder configs/dualtrack_train_tus_rec/, and evaluation configurations are found in the folder configs/dualtrack_evaluation. A typical config looks like the following:

model:
  name: dualtrack_loc_enc_stg1

data: # dataset options
  version: local_encoder 
  dataset: tus-rec # <- use the name you registered your dataset with
  sequence_length_train: 16
  augmentations: true

train: # training options
  lr: 0.0001
  epochs: 5000
  warmup_epochs: 0
  weight_decay: 0.001
  batch_size: 16
  val_every: 100

seed: 0
device: cuda
use_amp: true

logger: wandb # could be tensorboard, or console if not using wandb
logger_kw:
  wandb_project: dualtrack # logger specific options

debug: false

Training DualTrack

Training DualTrack involves three main steps:

Pretrain the local encoder
Pretrain the global encoder
Train the final model

1. Pretraining Local Encoder

Pretraining the local encoder happens in three stages:

Pretraining step 1 - we pretrain the 3D CNN backbone on small subsequences of images for 5000 epochs (this should take 4-5 days on an NVIDIA A40 GPU). Use this config.

Pretraining step 2 - we add a ViT stage for frame-wise spatial self-attention on top of the frozen CNN backbone from step 1 using this config. You will need to edit the model.backbone_weights field to point to the best checkpoint from the step 1 experiment.

Pretraining step 3 - we add a temporal attention stage and pretrain it on top of the frozen CNN + ViT model from step 2 using this config. As in step 2, edit model.backbone_weights, this time to point to the best checkpoint from the step 2 experiment.

2. Pretraining Global Encoder

The second step of training DualTrack is to pretrain the global encoder using sparsely sampled subsequences of the ultrasound frames. The global encoder consists of an image backbone followed by a transformer stage with temporal self-attention. Several options are available for the image backbone: CNN, iBOT, MedSAM, and USFM, and the code can easily be adapted to use other backbones. Note that some backbones require pretrained weights or add dependencies. Choose one of the configs in configs/dualtrack_train_tus_rec/global_encoder (we recommend cnn.yaml as a good starting point with no extra dependencies).
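
The difference between local and global pretraining comes down to how frame indices are drawn from a sweep. The sketch below contrasts a contiguous window, as in local-encoder pretraining on small subsequences, with a sparse draw that spans the whole sweep. The uniform random sampling rule and both function names are assumptions for illustration; the sampler in this codebase may differ.

```python
import random


def sample_dense_window(n_frames, length, rng):
    """Return `length` consecutive frame indices starting at a random offset."""
    start = rng.randint(0, n_frames - length)  # randint includes both endpoints
    return list(range(start, start + length))


def sample_sparse_subsequence(n_frames, length, rng):
    """Return `length` distinct, sorted frame indices drawn from the whole sweep."""
    return sorted(rng.sample(range(n_frames), length))


rng = random.Random(0)
dense = sample_dense_window(n_frames=500, length=16, rng=rng)
sparse = sample_sparse_subsequence(n_frames=500, length=16, rng=rng)

print(dense[-1] - dense[0])    # a dense window always covers length - 1 = 15 frames
print(sparse[-1] - sparse[0])  # a sparse draw usually covers most of the sweep
```

With the same number of frames per sample, the sparse draw sees scan-level context such as the overall trajectory shape, while the dense window only sees frame-to-frame motion.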
