† Work done during an internship at NVIDIA.
Increasing the efficiency of trajectory annotation from videos has the potential to enable the next generation of data-hungry tracking algorithms to thrive on large-scale datasets. Despite the importance of this task, there are currently very few works exploring how to efficiently label tracking datasets comprehensively. In this work, we introduce SPAM, a video label engine that provides high-quality labels with minimal human intervention. SPAM is built around two key insights: (i) most tracking scenarios can be easily resolved. To take advantage of this, we utilize a pre-trained model to generate high-quality pseudo-labels, reserving human involvement for a smaller subset of more difficult instances; (ii) the spatiotemporal dependencies of track annotations across time can be elegantly and efficiently handled through graphs. Therefore, we use a unified graph formulation to address the annotation of both detections and identity association for tracks across time. Based on these insights, SPAM produces high-quality annotations at a fraction of the ground-truth labeling cost. We demonstrate that trackers trained on SPAM labels achieve performance comparable to those trained on human annotations while requiring only 3–20% of the human labeling effort. Hence, SPAM paves the way towards highly efficient labeling of large-scale tracking datasets.
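To make insight (ii) concrete, the sketch below illustrates one simple way such a unified graph could be constructed: detections become nodes, and candidate identity associations between consecutive frames become edges. This is not the paper's implementation; the function names, the IoU-based linking rule, and the threshold are illustrative assumptions only.

```python
def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def build_track_graph(frames, iou_thresh=0.3):
    """frames: list of per-frame detection lists. Returns (nodes, edges).

    Nodes are (frame_idx, det_idx) pairs; edges connect detections in
    consecutive frames whose boxes overlap enough to be association
    candidates. Annotation then reduces to verifying uncertain nodes
    (detections) and edges (identity links) in one shared structure."""
    nodes = [(t, i) for t, dets in enumerate(frames) for i in range(len(dets))]
    edges = []
    for t in range(len(frames) - 1):
        for i, a in enumerate(frames[t]):
            for j, b in enumerate(frames[t + 1]):
                if iou(a, b) >= iou_thresh:
                    edges.append(((t, i), (t + 1, j)))
    return nodes, edges

frames = [
    [(0, 0, 10, 10), (50, 50, 60, 60)],  # frame 0: two detections
    [(1, 1, 11, 11), (80, 80, 90, 90)],  # frame 1: one match, one new object
]
nodes, edges = build_track_graph(frames)
# 4 nodes; 1 edge linking the two overlapping boxes across frames
```

Casting both detection verification and identity association as operations on one graph is what lets a single uncertainty measure drive the annotation effort for both sub-tasks.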
Overview of the SPAM training and annotation pipeline. (a) Initial model training on synthetic data. (b) Application of SPAM to generate pseudo-labels on a real dataset without incurring manual annotation costs, followed by self-training on these pseudo-labels. (c) Real dataset labeling using pseudo-labels and an uncertainty-based active learning approach.
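Stage (c) above hinges on spending the limited manual budget where the model is least certain. A minimal sketch of such an uncertainty-based split follows; the function name, the confidence scores, and the budget value are hypothetical placeholders, not details from the paper.

```python
def annotate_with_budget(scored_predictions, budget):
    """Split model predictions into a human-annotation queue and a set of
    accepted pseudo-labels, spending the manual budget on the most
    uncertain predictions first.

    scored_predictions: list of (sample_id, confidence) pairs.
    budget: number of samples humans may label."""
    # Rank least-confident first so the budget targets hard cases.
    ranked = sorted(scored_predictions, key=lambda p: p[1])
    to_human = [sid for sid, _ in ranked[:budget]]
    pseudo = [sid for sid, _ in ranked[budget:]]
    return to_human, pseudo

preds = [("a", 0.98), ("b", 0.42), ("c", 0.91), ("d", 0.55)]
humans, pseudo = annotate_with_budget(preds, budget=1)
# "b" (lowest confidence) goes to a human; the rest keep their pseudo-labels
```

The design choice here mirrors insight (i): easy cases are resolved by the pre-trained model for free, and human effort concentrates on the small uncertain subset.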
Evaluation of SPAM labels on the MOT17, MOT20, and DanceTrack test sets under varying annotation budgets, by training ByteTrack and GHOST on annotations generated by our method. We first confirm that our results on the validation set generalize to the test set, and that we reach GT performance with 3.3% of the annotation effort on MOT17. MOT20 and DanceTrack are known to be more challenging datasets, which is confirmed by the fact that our labels require 10% and 20% of the original annotation effort, respectively, to reach GT performance. This clearly shows the potential of properly leveraging synthetic data, pseudo-labels from a strong model, and active learning on graph hierarchies to label multi-object tracking data.