Toronto AI Lab

Learning to Evaluate Perception Models Using Planner-Centric Metrics

Jonah Philion, Amlan Kar, Sanja Fidler

NVIDIA, Vector Institute, University of Toronto

CVPR 2020

Variants of accuracy and precision are the gold standard by which the computer vision community measures progress of perception algorithms. One reason for the ubiquity of these metrics is that they are largely task-agnostic; we generally seek zero false negatives and zero false positives. The downside of these metrics is that, at worst, they penalize all incorrect detections equally without conditioning on the task or scene, and, at best, heuristics must be chosen to ensure that different mistakes count differently. In this paper, we propose a principled metric for 3D object detection designed specifically for the task of self-driving. The core idea behind our metric is to isolate the task of object detection and measure the impact the produced detections would have on the downstream task of driving. Without being hand-designed to do so, our metric penalizes many of the mistakes that other metrics penalize by design. In addition, our metric downweights detections based on additional factors, such as the distance from a detection to the ego car and the speed of the detection, in intuitive ways that other detection metrics do not. For human evaluation, we generate scenes in which standard metrics and our metric disagree and find that humans side with our metric 79% of the time.
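
In symbols, the metric, which the paper calls the planning KL-divergence (PKL), compares the distribution over future ego poses that a pretrained planner f produces when conditioned on the ground-truth objects against the one it produces when conditioned on the detector's outputs; a perfect detector achieves a PKL of zero. The definition below paraphrases the paper's notation:

```latex
% PKL: total KL divergence, over the planning horizon T, between the planner's
% forecast of the future ego pose x_{t+\Delta} conditioned on the true objects
% o_{\le t} and its forecast conditioned on the detections \hat{o}_{\le t}.
\mathrm{PKL}(\hat{o}) \;=\; \sum_{0 < \Delta \le T}
  D_{\mathrm{KL}}\!\Big(
    f\big(x_{t+\Delta} \mid o_{\le t}\big) \,\Big\|\,
    f\big(x_{t+\Delta} \mid \hat{o}_{\le t}\big)
  \Big)
```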


Paper

Jonah Philion, Amlan Kar, Sanja Fidler

Learning to Evaluate Perception Models Using Planner-Centric Metrics

CVPR 2020. (poster)

[preprint] [bibtex] [code]

Follow-Up With Motional

Yiluan Guo, Holger Caesar, Oscar Beijbom, Jonah Philion, Sanja Fidler

The Efficacy of Neural Planning Metrics: A Meta-Analysis of PKL on nuScenes

IROS 2020 Workshop on Benchmarking Progress in Autonomous Driving. (poster)

[preprint] [bibtex]


Synopsis of Results

Human evaluation We generate pairs of detection mistakes such that, within each pair, NDS (the nuScenes Detection Score) considers one set of mistakes worse while PKL (our planning KL-divergence) considers the other worse. We then ask humans to choose which set of mistakes is more dangerous. In 79% of the generated scenarios, humans side with our metric, PKL.
Sided with NDS    Sided with PKL    Scenes    Responses
21%               79%               75        730

Comparison to hand-designed metrics On the left, we show that PKL largely agrees with NDS under canonical noise models such as translation error. On the right, we show that while NDS penalizes all false negatives equally, independent of context, PKL penalizes a false negative more strongly if the missed object is close to the ego vehicle or moving at high speed.
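
As a concrete illustration of this experiment, below is a minimal, self-contained sketch; it is not the authors' evaluation code. `toy_planner` is a stand-in that scores candidate future ego positions by distance to the nearest object, and `pkl` mirrors the structure of the real metric on top of it, so the sweep shows how a planner-centric score responds to growing translation noise.

```python
import numpy as np

# Candidate future ego positions: a 41 x 41 grid, 40 m ahead and +/- 20 m laterally.
GRID = np.stack(np.meshgrid(np.linspace(-20, 20, 41),
                            np.linspace(0, 40, 41), indexing="ij"), axis=-1)
GRID = GRID.reshape(-1, 2)

def toy_planner(objects):
    """Stand-in planner: a softmax over grid cells that avoids nearby objects."""
    if len(objects) == 0:
        logits = np.zeros(len(GRID))
    else:
        dists = np.linalg.norm(GRID[:, None, :] - objects[None, :, :], axis=-1)
        logits = dists.min(axis=1)  # prefer cells far from the nearest object
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

def pkl(gt_objects, det_objects):
    """KL divergence between plans conditioned on GT objects vs. detections."""
    p, q = toy_planner(gt_objects), toy_planner(det_objects)
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(0)
gt = rng.uniform([-20, 0], [20, 40], size=(15, 2))   # 15 true objects (meters)

for sigma in [0.0, 0.25, 0.5, 1.0, 2.0]:             # translation noise std-dev
    noisy = gt + rng.normal(0.0, sigma, size=gt.shape)
    print(f"sigma = {sigma:4.2f} m -> toy PKL = {pkl(gt, noisy):.4f}")
```

The toy score is zero when detections match the ground truth exactly and grows, on average, as the detections drift, the same qualitative trend the left panel shows for the real PKL.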


False Negative/False Positive Sensitivity We visualize the "importance" of correctly detecting each true object in the scene by removing that object and re-evaluating PKL (left); the ego vehicle is shown in green. To visualize the sensitivity of PKL to false positives, we place a false positive at each position on a grid around the ego vehicle and evaluate PKL (right). Qualitatively, the "worst" false negatives involve objects close to the ego vehicle, and the "worst" false positives occur in regions where the ego vehicle is likely to travel in the future.
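
Both sensitivity probes reduce to small loops over the same machinery. Continuing the sketch above (it reuses `toy_planner`, `pkl`, `GRID`, and `gt`, so it is illustrative rather than the paper's code), a leave-one-out pass scores each possible false negative, and inserting a hallucinated detection at each grid cell yields the false-positive heatmap:

```python
# False-negative importance: drop each true object in turn and measure how far
# the plan shifts when the detector "misses" it.
fn_importance = [pkl(gt, np.delete(gt, i, axis=0)) for i in range(len(gt))]
worst_fn = int(np.argmax(fn_importance))
print(f"most costly miss: object {worst_fn} "
      f"(toy PKL = {fn_importance[worst_fn]:.4f})")

# False-positive sensitivity: hallucinate one extra detection at every grid
# cell and record the induced score, giving a heatmap over positions.
fp_heatmap = np.array([pkl(gt, np.vstack([gt, cell[None, :]]))
                       for cell in GRID])
worst_cell = GRID[int(np.argmax(fp_heatmap))]
print(f"worst false positive near {worst_cell} (toy PKL = {fp_heatmap.max():.4f})")
```

These loops show how the two maps are produced; the paper's figures report the corresponding values under the real planner, where the worst misses sit near the ego vehicle and the worst hallucinations sit where the ego vehicle is likely to drive.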


Supplementary