Modern deep learning systems require huge data sets to achieve impressive performance, but there is little guidance on how much or what kind of data to collect. Over-collecting data incurs unnecessary present costs, while under-collecting may incur future costs and delay workflows. We propose a new paradigm for modeling the data collection workflow as a formal optimal data collection problem that allows designers to specify performance targets, collection costs, a time horizon, and penalties for failing to meet the targets. Additionally, this formulation generalizes to tasks requiring multiple data sources, such as labeled and unlabeled data used in semi-supervised learning. To solve our problem, we develop Learn-Optimize-Collect (LOC), which minimizes expected future collection costs. Finally, we numerically compare our framework to the conventional baseline of estimating data requirements by extrapolating from neural scaling laws. We significantly reduce the risks of failing to meet desired performance targets on several classification, segmentation, and detection tasks, while maintaining low total collection costs.
There is an industry folklore that 87% of AI systems never make it to production, and data collection is a fundamental part of the challenge: a 2019 survey found that 51% of enterprise teams struggled to collect enough data for their models.
Given a target performance V*, an initial set of q_0 points, a cost-per-sample c, a maximum number of collection rounds T, and a penalty P for failing to meet the target, we must determine how much data q_t to collect in each round so as to minimize the total cost plus penalty. In each round, we collect more data, retrain, and re-evaluate our model, stopping once we reach the target or hit the final round.
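To make the loop concrete, here is a minimal sketch in Python, assuming an illustrative `policy` that picks q_t and an `evaluate` callable that retrains on n points and returns the model's score; neither name comes from the paper.

```python
def run_collection(policy, evaluate, V_star, q0, c, T, P):
    """Simulate T rounds of collection; return total cost plus any penalty.

    policy(t, n)  -> q_t, how much data to collect in round t (assumed)
    evaluate(n)   -> validation score after retraining on n points (assumed)
    """
    n, spent = q0, 0.0
    for t in range(1, T + 1):
        if evaluate(n) >= V_star:      # target already met: stop collecting
            return spent
        q_t = policy(t, n)             # decide how much data to collect
        n += q_t                       # collect the data
        spent += c * q_t               # pay the per-sample cost
    # after the final round, pay the penalty if the target is still missed
    return spent if evaluate(n) >= V_star else spent + P
```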
In each round, we first estimate the probability distribution over how much data we will need by bootstrap resampling fits of different scaling laws and fitting a density estimation (DE) model to the resulting samples. We then solve a differentiable optimization problem that minimizes the likelihood of not collecting enough data (obtained from the DE model) plus the total collection cost.
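A hedged sketch of one such round: we bootstrap power-law fits v(n) = a*n^b to the (dataset size, score) pairs observed so far, invert each fit at V* to sample the data requirement D*, fit a Gaussian KDE over those samples, and pick the collection amount by a bounded scalar search. The power-law form, the KDE, and the scalar search are stand-ins for the paper's scaling-law family, DE model, and differentiable solver.

```python
import numpy as np
from scipy.optimize import curve_fit, minimize_scalar
from scipy.stats import gaussian_kde

def estimate_requirements(sizes, scores, V_star, n_boot=500, seed=0):
    """Bootstrap scaling-law fits to sample D*, the data needed to hit V*.

    sizes, scores : 1-D numpy arrays of past (dataset size, score) pairs
    """
    rng = np.random.default_rng(seed)
    power_law = lambda n, a, b: a * n**b
    d_star = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(sizes), len(sizes))   # resample the points
        (a, b), _ = curve_fit(power_law, sizes[idx], scores[idx],
                              p0=(1.0, 0.5), maxfev=10000)
        d_star.append((V_star / a) ** (1.0 / b))        # invert a*n**b = V*
    return np.asarray(d_star)

def plan_round(sizes, scores, V_star, n_now, c, P):
    """Pick q minimizing c*q + P * Pr(D* > n_now + q) under the KDE."""
    kde = gaussian_kde(estimate_requirements(sizes, scores, V_star))
    shortfall = lambda q: kde.integrate_box_1d(n_now + q, np.inf)
    objective = lambda q: c * q + P * shortfall(q)
    res = minimize_scalar(objective, bounds=(0.0, 10.0 * n_now),
                          method="bounded")
    return res.x
```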
The optimization framework naturally generalizes to more complex settings in which multiple types of data arrive from different sources at different costs, for example labeled and unlabeled data in semi-supervised learning; the sketch below illustrates the extension.
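Roughly, a vector q of per-source amounts replaces the scalar decision, assuming linear per-source costs; `prob_shortfall` here is a hypothetical stand-in for a joint density over the per-source requirements.

```python
import numpy as np

def multi_source_objective(q, c, P, prob_shortfall):
    """c . q + P * Pr(the collected mix is still insufficient).

    q : per-source collection amounts, e.g. (labeled, unlabeled)
    c : per-source unit costs, e.g. labeling costs far more than scraping
    prob_shortfall : callable estimating Pr(target missed | q) (assumed)
    """
    return float(np.dot(c, q)) + P * prob_shortfall(q)
```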
For each data set and task, we fix the number of rounds T and sweep the performance target V* to assess how well a policy decides how much data to collect. We compare LOC against the intuitive baseline of extrapolating a single neural scaling law to estimate the data requirement. We evaluate two metrics: the risk of failing to meet the performance target, and the total cost of collection.
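For illustration only, the sweep might be harnessed as below, where `run(V_star)` is a hypothetical callable that plays out one full collection episode and reports whether the target was met and what it cost.

```python
import numpy as np

def evaluate_policy(run, targets):
    """Sweep V* and report (failure rate, mean total cost) for a policy."""
    results = [run(v) for v in targets]                 # one episode per V*
    failure_rate = np.mean([not met for met, _ in results])
    mean_cost = np.mean([cost for met, cost in results])
    return failure_rate, mean_cost
```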
We also consider two new problems with multiple data types. First, we consider classification on CIFAR-100 where the first 50 classes cost more to collect than the remaining 50 (e.g., as in long-tailed learning). Second, we consider segmentation on BDD100K where we can pseudo-label images from an additional unlabeled source to augment training.
@InProceedings{Mahmood_2022_Optimizing,
  author    = {Mahmood, Rafid and Lucas, James and Alvarez, Jose M. and Fidler, Sanja and Law, Marc T.},
  title     = {Optimizing Data Collection for Machine Learning},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  month     = {November},
  year      = {2022}
}