To endow models with greater understanding of physics and motion, it is useful to enable them to perceive how solid surfaces move and deform in real scenes. This can be formalized as Tracking-Any-Point (TAP), which requires an algorithm to track any point corresponding to a solid surface in a video, potentially densely in space and time. Large-scale ground-truth training data for TAP is only available in simulation, which currently offers a limited variety of objects and motion. In this work, we demonstrate how large-scale, unlabeled, uncurated real-world data can improve a TAP model with minimal architectural changes, using a self-supervised student-teacher setup. We demonstrate state-of-the-art performance on the TAP-Vid benchmark, surpassing previous results by a wide margin: for example, TAP-Vid-DAVIS performance improves from 61.3% to 67.4%, and TAP-Vid-Kinetics from 57.2% to 62.5%.
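To make the student-teacher setup concrete, below is a minimal, hedged sketch of the bootstrapping idea: a teacher produces pseudo-label tracks on the raw clip, a student is trained on an augmented copy whose coordinate mapping is known exactly (here a simple horizontal flip), and the teacher follows the student via an exponential moving average. The `TinyTracker` model, the flip augmentation, the Huber loss, and the EMA rate are illustrative placeholders, not the paper's architecture or hyperparameters; the paper's setup also uses richer transformations and filters unreliable pseudo-labels, which are omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyTracker(nn.Module):
    """Stand-in point tracker: maps (video, query points) -> per-frame tracks."""
    def __init__(self, channels=8):
        super().__init__()
        self.encoder = nn.Conv3d(3, channels, kernel_size=3, padding=1)
        self.head = nn.Linear(channels, 2)   # per-frame (dx, dy) offsets

    def forward(self, video, queries):
        # video: [B, 3, T, H, W]; queries: [B, N, 2] in (x, y) pixel coordinates.
        feats = self.encoder(video).mean(dim=(3, 4))      # [B, C, T]
        offsets = self.head(feats.transpose(1, 2))        # [B, T, 2]
        # Each track starts at its query location plus a per-frame offset.
        return queries[:, None] + offsets[:, :, None]     # [B, T, N, 2]

def hflip_video(video):
    return torch.flip(video, dims=[-1])                   # flip along width

def hflip_coords(points, width):
    out = points.clone()
    out[..., 0] = width - 1 - out[..., 0]                 # x -> W - 1 - x
    return out

student, teacher = TinyTracker(), TinyTracker()
teacher.load_state_dict(student.state_dict())
for p in teacher.parameters():
    p.requires_grad_(False)
opt = torch.optim.Adam(student.parameters(), lr=1e-4)

def bootstrap_step(video, queries, ema=0.99):
    width = video.shape[-1]
    with torch.no_grad():
        pseudo = teacher(video, queries)                  # pseudo-labels on the clean clip
    # The student sees an augmented copy; since the flip's coordinate map is known,
    # the teacher's tracks can be transported into the student's frame and matched.
    pred = student(hflip_video(video), hflip_coords(queries, width))
    loss = F.huber_loss(pred, hflip_coords(pseudo, width))
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():                                 # teacher follows the student via EMA
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(ema).add_(ps, alpha=1.0 - ema)
    return loss.item()

video = torch.randn(1, 3, 8, 64, 64)                      # one clip: 8 frames of 64x64
queries = torch.rand(1, 16, 2) * 63                       # 16 query points in pixel coords
print(bootstrap_step(video, queries))
```

Because the supervision comes from the teacher's own predictions, no ground-truth tracks are needed on the real-world clips; only the consistency under a known spatial transformation is enforced.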
Figure: predicted tracks overlaid on real videos (▲ BootsTAPIR, ■ TAPIR, + Ground-truth).
Figure: side-by-side video comparison of TAPIR, BootsTAPIR, and BootsTAPIR-Libero.
Figure: side-by-side video comparison of TAPIR and BootsTAPIR.
TAPIR is the foundational visual perception model for this work. It provides fast and accurate tracking of any point in a video, and has also enabled video generation applications.
RoboTAP, ATM, and Track2Act demonstrate how TAP can transform few-shot learning in robotics.
VideoDoodles shows how TAP can enable intuitive animations on top of real videos.
OmniMotion, SpaTracker, and VGGSfM show how TAP can contribute to 3D scene understanding.
@article{doersch2024bootstap,
  author  = {Carl Doersch and Pauline Luc and Yi Yang and Dilara Gokay and Skanda Koppula and Ankush Gupta and Joseph Heyward and Ignacio Rocco and Ross Goroshin and João Carreira and Andrew Zisserman},
  title   = {{BootsTAP}: Bootstrapped Training for Tracking Any Point},
  journal = {arXiv},
  year    = {2024},
}