BootsTAP: Bootstrapped Training
for Tracking-Any-Point

Google DeepMind, University of Oxford

Abstract

To endow models with greater understanding of physics and motion, it is useful to enable them to perceive how solid surfaces move and deform in real scenes. This can be formalized as Tracking-Any-Point (TAP), which requires the algorithm to track any point corresponding to a solid surface in a video, potentially densely in space and time. Large-scale ground-truth training data for TAP is only available in simulation, which currently offers a limited variety of objects and motion. In this work, we demonstrate how large-scale, unlabeled, uncurated real-world data can improve a TAP model with minimal architectural changes, using a self-supervised student-teacher setup. We demonstrate state-of-the-art performance on the TAP-Vid benchmark, surpassing previous results by a wide margin: for example, TAP-Vid-DAVIS performance improves from 61.3% to 67.4%, and TAP-Vid-Kinetics from 57.2% to 62.5%.
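To make the student-teacher idea concrete, below is a minimal, hypothetical sketch of one bootstrapping step on unlabeled video: a teacher predicts pseudo-label tracks on the clean clip, and the student is trained to reproduce them on a corrupted view of the same clip. This is only an illustration of the general idea; the actual BootsTAP recipe differs in important details (affine transforms of the student's view with correspondingly transformed pseudo-labels, additional corruptions, confidence weighting, an EMA teacher), and all names here are placeholders rather than the released API.

import torch

class DummyTracker(torch.nn.Module):
    """Stand-in for a TAP model mapping (video, queries) -> (tracks, visibility)."""
    def __init__(self):
        super().__init__()
        self.offset = torch.nn.Parameter(torch.zeros(2))

    def forward(self, video, queries):
        b, t = video.shape[0], video.shape[1]
        n = queries.shape[1]
        # Trivially repeat each query's (y, x) over time, plus a learned offset.
        base = queries[:, None, :, 1:].expand(b, t, n, 2)
        return base + self.offset, torch.ones(b, t, n)

def self_training_step(student, teacher, video, queries, optimizer):
    """One bootstrapping step. video: [B,T,H,W,3]; queries: [B,N,3] as (t, y, x)."""
    with torch.no_grad():
        # Teacher produces pseudo-label tracks and visibility on the clean clip.
        pseudo_tracks, pseudo_vis = teacher(video, queries)

    # Give the student a corrupted view of the same clip (photometric noise here;
    # spatial transforms would also require transforming pseudo_tracks).
    corrupted = (video + 0.05 * torch.randn_like(video)).clamp(0.0, 1.0)
    student_tracks, _ = student(corrupted, queries)

    # Consistency loss: match the teacher's tracks where the teacher says visible.
    weight = pseudo_vis.unsqueeze(-1)
    loss = (weight * (student_tracks - pseudo_tracks).abs()).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)

if __name__ == "__main__":
    student, teacher = DummyTracker(), DummyTracker()
    opt = torch.optim.Adam(student.parameters(), lr=1e-4)
    video = torch.rand(1, 8, 256, 256, 3)
    queries = torch.rand(1, 16, 3) * torch.tensor([0.0, 255.0, 255.0])  # (t, y, x)
    print(self_training_step(student, teacher, video, queries, opt))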

Video Summary

TAP-Vid DAVIS: We plot TAPIR and BootsTAPIR points alongside their ground truth, sampling query points using the query-first approach. The main improvements come from more accurate occlusion prediction and more precise localization. In addition, we observe that BootsTAPIR is more robust to large scale changes than the baseline TAPIR model. In the displays, we don't plot points that the methods mark as occluded. Where the ground truth is marked visible, we include a line segment connecting it to the prediction, showing the scale of the error. If a method predicts a point as visible when the ground truth is occluded, we plot the prediction with an x. Note that some points begin very close to object boundaries, which means the methods occasionally track the wrong object. For this and other visualizations, the algorithms track on videos downsampled to 256-by-256; we plot results at the original resolution only for display.
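As a concrete illustration of this plotting convention, here is a hypothetical matplotlib sketch for a single frame. The array names and shapes are assumptions made for illustration, not the actual visualization code.

import matplotlib.pyplot as plt

def plot_frame(frame, pred, pred_vis, gt, gt_vis):
    """frame: [H,W,3] image; pred, gt: [N,2] (x, y); pred_vis, gt_vis: [N] bools."""
    plt.imshow(frame)
    for i in range(len(pred)):
        if not pred_vis[i]:
            continue  # points the method marks as occluded are not plotted
        if gt_vis[i]:
            # Dot for the prediction, plus a segment to the ground truth that
            # shows the scale of the error.
            plt.plot([pred[i, 0], gt[i, 0]], [pred[i, 1], gt[i, 1]], '-', linewidth=1)
            plt.plot(pred[i, 0], pred[i, 1], 'o', markersize=4)
        else:
            # Predicted visible while the ground truth is occluded: mark with an x.
            plt.plot(pred[i, 0], pred[i, 1], 'x', markersize=6)
    plt.axis('off')
    plt.show()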
RoboTAP: We show some examples from the RoboTAP dataset where BootsTAPIR improves performance. We use the same plotting scheme as for TAP-Vid-DAVIS, as there is ground truth available in this dataset.
Libero: In this experiment, we apply TAPIR (left), BootsTAPIR (center), and a BootsTAPIR model further finetuned on Libero (BootsTAPIR-Libero) to Libero videos. This dataset has no ground truth, so we apply the algorithms to query points sampled in a semi-dense grid on the first frame. We see that BootsTAP on internet data improves performance, especially scale invariance, but training specifically on Libero improves results even further.
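For reference, sampling a semi-dense grid of query points on the first frame can be done roughly as follows. The (t, y, x) query layout follows the TAP-Vid convention, and the stride is an illustrative assumption rather than the exact value used here.

import numpy as np

def first_frame_grid_queries(height, width, stride=16):
    """Build a semi-dense grid of (t, y, x) query points on frame 0."""
    ys, xs = np.meshgrid(
        np.arange(stride // 2, height, stride),
        np.arange(stride // 2, width, stride),
        indexing='ij')
    t = np.zeros_like(ys)  # every query sits on the first frame
    return np.stack([t, ys, xs], axis=-1).reshape(-1, 3).astype(np.float32)

queries = first_frame_grid_queries(256, 256)  # e.g. for a 256x256 video
print(queries.shape)  # (256, 3) with stride 16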

Video Comparisons

RoboCAT-NIST: In this study, we apply both TAPIR (left) and BootsTAPIR (right) to very challenging sequences of NIST gear insertion, as viewed from the gripper camera. Due to the lack of texture and the rotational symmetry, the correspondence is difficult to perceive, leading to many failures from the original TAPIR model. We select query points on the gears by automatically choosing a grid of red pixels on the first frame, producing a set of points on task-relevant objects in the spirit of RoboTAP (note that this selection process is imperfect, and we expect real-world query point selection to be more sophisticated; this is done simply to prevent clutter in the display).
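The automatic query selection described above could be sketched roughly as below: keep grid locations whose first-frame pixel looks red. The stride and the color threshold are illustrative assumptions, not the exact procedure used.

import numpy as np

def red_grid_queries(first_frame, stride=12, red_margin=60):
    """first_frame: [H, W, 3] uint8 RGB image; returns queries as (t, y, x)."""
    h, w, _ = first_frame.shape
    queries = []
    for y in range(stride // 2, h, stride):
        for x in range(stride // 2, w, stride):
            r, g, b = first_frame[y, x].astype(np.int32)
            # Crude "red" test: the red channel clearly dominates green and blue.
            if r - max(g, b) > red_margin:
                queries.append((0, y, x))  # query on frame 0
    return np.array(queries, dtype=np.float32)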

Some other cool projects using TAP

TAPIR is the foundational visual perception model for this work. It provides fast and accurate tracking of any point in a video, and has some cool video generation applications.

RoboTAP, ATM, and Track2Act demonstrate how TAP can transform few-shot learning in robotics.

VideoDoodles shows how TAP can enable intuitive animations on top of real videos.

OmniMotion, SpaTracker, and VGGSfM show how TAP can contribute to 3D scene understanding.

BibTeX

@article{doersch2024bootstap,
    author    = {Carl Doersch and Pauline Luc and Yi Yang and Dilara Gokay and Skanda Koppula and Ankush Gupta and Joseph Heyward and Ignacio Rocco and Ross Goroshin and João Carreira and Andrew Zisserman},
    title     = {{BootsTAP}: Bootstrapped Training for Tracking Any Point},
    journal   = {arXiv},
    year      = {2024},
}