KV-Tracker
Real-Time Pose Tracking with Transformers
arXiv 2025
- Marwan Taher
- Ignacio Alzugaray
- Kirill Mazur
- Xin Kong
- Andrew J. Davison
Dyson Robotics Lab, Imperial College London
Live Demo
Real-time pose tracking demonstration showing KV-Tracker in action.
Abstract
Multi-view 3D geometry networks offer a powerful prior but are prohibitively slow for real-time applications. We propose a novel way to adapt them for online use, enabling real-time 6-DoF pose tracking and online reconstruction of objects and scenes from monocular RGB videos.
Our method rapidly selects and manages a set of images as keyframes to map a scene or object via π3 [32] with full bidirectional attention. We then cache the global self-attention block’s key-value (KV) pairs and use them as the sole scene representation for online tracking. This yields up to a 15× speedup during inference without risk of drift or catastrophic forgetting. Our caching strategy is model-agnostic and can be applied to other off-the-shelf multi-view networks without retraining.
We demonstrate KV-Tracker on both scene-level tracking and the more challenging task of on-the-fly object tracking and reconstruction without depth measurements or object priors. Experiments on the TUM RGB-D, 7-Scenes, Arctic and OnePose datasets show strong performance while maintaining high frame rates of up to ∼30 FPS.
Explainer
Our method caches key-value pairs from the global self-attention block for efficient real-time tracking.
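As a rough illustration of the caching idea, the following Python/NumPy sketch uses a toy single-head attention layer with random weights. `GlobalAttentionLayer`, `build_cache` and `track` are hypothetical names; this is neither the π3 architecture nor the authors' code. Mapping runs full bidirectional attention over the keyframe tokens, the keyframes' keys and values are stored, and a new frame's tokens then attend to that cache. Concatenating the query frame's own keys and values to the cache is an assumption of this sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention: q is (Tq, d), k and v are (Tk, d).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

class GlobalAttentionLayer:
    """Toy single-head global self-attention layer (hypothetical, not the pi3 architecture)."""

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        # Random projection weights stand in for trained parameters.
        self.w_q, self.w_k, self.w_v = (
            rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3)
        )

    def full(self, tokens):
        # Mapping: full bidirectional attention over all tokens at once.
        q, k, v = tokens @ self.w_q, tokens @ self.w_k, tokens @ self.w_v
        return attention(q, k, v)

    def build_cache(self, keyframe_tokens):
        # Store only the keyframes' keys and values; this is the whole scene state kept.
        return keyframe_tokens @ self.w_k, keyframe_tokens @ self.w_v

    def track(self, query_tokens, cache):
        # Tracking: the new frame's queries attend to the cached K/V plus its own K/V.
        k_cache, v_cache = cache
        q = query_tokens @ self.w_q
        k = np.concatenate([k_cache, query_tokens @ self.w_k], axis=0)
        v = np.concatenate([v_cache, query_tokens @ self.w_v], axis=0)
        return attention(q, k, v)

dim, tokens_per_frame, n_keyframes = 16, 8, 4
rng = np.random.default_rng(1)
keyframes = rng.standard_normal((n_keyframes * tokens_per_frame, dim))
query = rng.standard_normal((tokens_per_frame, dim))

layer = GlobalAttentionLayer(dim)
cache = layer.build_cache(keyframes)
out_cached = layer.track(query, cache)

# For a single layer, the query rows match full attention over keyframes + query.
out_full = layer.full(np.concatenate([keyframes, query], axis=0))[-tokens_per_frame:]
print(np.allclose(out_cached, out_full))  # True
```

For a single layer the two paths agree exactly, because the keyframe keys and values do not depend on the query frame. In a deeper network a cache would be kept per layer, and the cached keyframe states are computed without ever seeing the query frame, so the outputs differ from full attention; in exchange, each new frame avoids recomputing the keyframe tokens and the map stays fixed.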
Demos
Runtime Analysis
Frames per second (FPS) throughput comparison: processing N frames with full bidirectional attention vs. processing a single query frame with KV-cache from N frames.
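A back-of-the-envelope count of global-attention score entries hints at why the cached path scales better. This sketch assumes every frame yields the same number of tokens (1024 is a hypothetical value) and that the query frame attends to the N cached frames plus itself. It ignores per-frame layers, prediction heads and memory traffic, so it is not a model of the measured FPS.

```python
def attention_pairs_full(n_frames, tokens_per_frame):
    # Full bidirectional attention: every token attends to every token of all N frames.
    total = n_frames * tokens_per_frame
    return total * total

def attention_pairs_cached(n_frames, tokens_per_frame):
    # KV-cache: only the query frame's tokens issue queries; the keys are
    # the N cached frames plus the query frame itself (an assumption).
    return tokens_per_frame * (n_frames + 1) * tokens_per_frame

t = 1024  # hypothetical number of tokens per frame
for n in (4, 8, 16, 32):
    ratio = attention_pairs_full(n, t) / attention_pairs_cached(n, t)
    print(f"N={n:2d}: {ratio:.1f}x fewer attention scores")
```

Under these assumptions the ratio is N²/(N + 1), which grows roughly linearly with N and does not depend on the token count. The per-frame layers add a further factor of about N in favour of the cached path, since it processes one frame instead of N.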
Citation
@misc{taher2025kvtracker,
title={KV-Tracker: Real-Time Pose Tracking with Transformers},
author={Marwan Taher and Ignacio Alzugaray and Kirill Mazur and Xin Kong and Andrew J. Davison},
year={2025},
eprint={},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
Acknowledgements
Research presented here has been supported by Dyson Technology Ltd.
The website template was borrowed from Mip-NeRF.