TrackTention
Tracktention: Leveraging Point Tracking to Attend Videos Faster and Better
Zihang Lai, Andrea Vedaldi
Visual Geometry Group, University of Oxford

A plug-and-play transformer layer to turn image-based models into state-of-the-art video models using point tracking.
🧠 Summary
Tracktention is a novel architectural module that improves temporal consistency in video tasks like depth estimation and colorization. It leverages modern point trackers to explicitly align features across frames using attention — converting powerful image-based models into robust, temporally aware video models with minimal overhead.
🔧 Features
- Tracktention Layer: Enhances existing ViT and ConvNet backbones with motion-aware temporal reasoning.
- Plug-and-Play: Easily integrates into existing models like Depth Anything.
- Lightweight: Only ~17M additional parameters with minimal runtime overhead.
- State-of-the-Art: Outperforms leading video models in depth prediction and video colorization benchmarks.
🧬 Method
Tracktention consists of:
- Attentional Sampling: Pool features from image tokens to track tokens using cross-attention.
- Track Transformer: Propagate features along tracks for temporal consistency.
- Attentional Splatting: Redistribute processed track tokens back to image tokens.

We use CoTracker3 to generate point tracks.
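
The three steps above can be illustrated with a small NumPy sketch. This is not the released implementation (the official code is not yet available); it is a simplified illustration under several assumptions: attention weights come only from a Gaussian bias on the distance between track points and token positions, there are no learned projections or multiple heads, and the result is added back to the image features as a residual. All function names, the `sigma` parameter, and the tensor layouts are hypothetical. The point tracks, which would come from a tracker such as CoTracker3, are replaced here by random coordinates.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def distance_bias(query_xy, key_xy, sigma=1.0):
    # Assumed attention logits: negative squared distance, shape (Q, K).
    d2 = ((query_xy[:, None, :] - key_xy[None, :, :]) ** 2).sum(-1)
    return -d2 / (2.0 * sigma ** 2)

def attentional_sampling(feats, grid_xy, tracks_xy, sigma=1.0):
    # feats: (T, N, C), grid_xy: (N, 2), tracks_xy: (T, M, 2) -> (T, M, C)
    # Each track point pools features from nearby image tokens in its frame.
    return np.stack([
        softmax(distance_bias(tracks_xy[t], grid_xy, sigma)) @ feats[t]
        for t in range(feats.shape[0])
    ])

def track_transformer(track_tokens):
    # track_tokens: (T, M, C). Self-attention over time, one track at a time.
    x = track_tokens.transpose(1, 0, 2)                       # (M, T, C)
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(x.shape[-1])  # (M, T, T)
    return (softmax(scores) @ x).transpose(1, 0, 2)           # (T, M, C)

def attentional_splatting(track_tokens, grid_xy, tracks_xy, sigma=1.0):
    # Image tokens attend to nearby track tokens in the same frame -> (T, N, C)
    return np.stack([
        softmax(distance_bias(grid_xy, tracks_xy[t], sigma)) @ track_tokens[t]
        for t in range(track_tokens.shape[0])
    ])

def tracktention_layer(feats, grid_xy, tracks_xy, sigma=1.0):
    tokens = attentional_sampling(feats, grid_xy, tracks_xy, sigma)
    tokens = track_transformer(tokens)
    update = attentional_splatting(tokens, grid_xy, tracks_xy, sigma)
    return feats + update  # residual connection (an assumption of this sketch)

# Toy input: T frames of an H x W token grid with C channels, and M tracks.
rng = np.random.default_rng(0)
T, H, W, C, M = 4, 6, 6, 8, 5
feats = rng.normal(size=(T, H * W, C))
ys, xs = np.mgrid[0:H, 0:W]
grid_xy = np.stack([xs.ravel(), ys.ravel()], axis=-1).astype(float)
tracks_xy = rng.uniform(0, W - 1, size=(T, M, 2))  # stand-in for tracker output

out = tracktention_layer(feats, grid_xy, tracks_xy)
print(out.shape)  # (4, 36, 8)
```

Because the output has the same shape as the input features, a layer of this form can in principle sit between two blocks of an image model without changing the surrounding code, which is the sense in which the module is plug-and-play.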
🧪 Usage
Note: Usage instructions will be provided once the codebase is officially released.
📄 Citation
If you use this code or Tracktention in your research, please cite:
@inproceedings{lai2025tracktention,
  title={Tracktention: Leveraging Point Tracking to Attend Videos Faster and Better},
  author={Zihang Lai and Andrea Vedaldi},
  booktitle={CVPR},
  year={2025}
}
🌐 Project Page
