To endow models with greater understanding of physics and motion, it is useful to enable them to perceive how solid surfaces move and deform in real scenes. This can be formalized as Tracking-Any-Point (TAP), which requires an algorithm to track any point corresponding to a solid surface in a video, potentially densely in space and time. Large-scale ground-truth training data for TAP is only available in simulation, which currently offers a limited variety of objects and motion. In this work, we demonstrate how large-scale, unlabeled, uncurated real-world data can improve a TAP model with minimal architectural changes, using a self-supervised student-teacher setup. We demonstrate state-of-the-art performance on the TAP-Vid benchmark, surpassing previous results by a wide margin: for example, TAP-Vid-DAVIS performance improves from 61.3% to 67.4%, and TAP-Vid-Kinetics from 57.2% to 62.5%.
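To make the student-teacher setup concrete, below is a minimal, hedged sketch of the bootstrapping idea: a teacher produces pseudo-label tracks on the raw clip, a student is trained on an augmented copy whose coordinate mapping is known exactly (here a simple horizontal flip), and the teacher follows the student via an exponential moving average. The `TinyTracker` model, the flip augmentation, the Huber loss, and the EMA rate are illustrative placeholders, not the paper's architecture or hyperparameters; the paper's setup also uses richer transformations and filters unreliable pseudo-labels, which are omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyTracker(nn.Module):
    """Stand-in point tracker: maps (video, query points) -> per-frame tracks."""
    def __init__(self, channels=8):
        super().__init__()
        self.encoder = nn.Conv3d(3, channels, kernel_size=3, padding=1)
        self.head = nn.Linear(channels, 2)   # per-frame (dx, dy) offsets

    def forward(self, video, queries):
        # video: [B, 3, T, H, W]; queries: [B, N, 2] in (x, y) pixel coordinates.
        feats = self.encoder(video).mean(dim=(3, 4))      # [B, C, T]
        offsets = self.head(feats.transpose(1, 2))        # [B, T, 2]
        # Each track starts at its query location plus a per-frame offset.
        return queries[:, None] + offsets[:, :, None]     # [B, T, N, 2]

def hflip_video(video):
    return torch.flip(video, dims=[-1])                   # flip along width

def hflip_coords(points, width):
    out = points.clone()
    out[..., 0] = width - 1 - out[..., 0]                 # x -> W - 1 - x
    return out

student, teacher = TinyTracker(), TinyTracker()
teacher.load_state_dict(student.state_dict())
for p in teacher.parameters():
    p.requires_grad_(False)
opt = torch.optim.Adam(student.parameters(), lr=1e-4)

def bootstrap_step(video, queries, ema=0.99):
    width = video.shape[-1]
    with torch.no_grad():
        pseudo = teacher(video, queries)                  # pseudo-labels on the clean clip
    # The student sees an augmented copy; since the flip's coordinate map is known,
    # the teacher's tracks can be transported into the student's frame and matched.
    pred = student(hflip_video(video), hflip_coords(queries, width))
    loss = F.huber_loss(pred, hflip_coords(pseudo, width))
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():                                 # teacher follows the student via EMA
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(ema).add_(ps, alpha=1.0 - ema)
    return loss.item()

video = torch.randn(1, 3, 8, 64, 64)                      # one clip: 8 frames of 64x64
queries = torch.rand(1, 16, 2) * 63                       # 16 query points in pixel coords
print(bootstrap_step(video, queries))
```

Because the supervision comes from the teacher's own predictions, no ground-truth tracks are needed on the real-world clips; only the consistency under a known spatial transformation is enforced.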
Figure: predicted tracks overlaid on real videos (▲ BootsTAPIR, ■ TAPIR, + Ground-truth).
Figure: side-by-side video comparison of TAPIR, BootsTAPIR, and BootsTAPIR-Libero.
Figure: side-by-side video comparison of TAPIR and BootsTAPIR.
TAPIR is the foundational visual perception model for this work. It provides fast and accurate tracking of any point in a video, and has also enabled video generation applications.
RoboTAP, ATM, and Track2Act demonstrate how TAP can transform few-shot learning in robotics.
VideoDoodles shows how TAP can enable intuitive animations on top of real videos.
OmniMotion, SpaTracker, and VGGSfM show how TAP can contribute to 3D scene understanding.
@article{doersch2024bootstap,
  author  = {Carl Doersch and Pauline Luc and Yi Yang and Dilara Gokay and Skanda Koppula and Ankush Gupta and Joseph Heyward and Ignacio Rocco and Ross Goroshin and João Carreira and Andrew Zisserman},
  title   = {{BootsTAP}: Bootstrapped Training for Tracking Any Point},
  journal = {arXiv},
  year    = {2024},
}