NVIDIA Research Toronto AI Lab
Optimizing Data Collection for Machine Learning


University of Toronto
Vector Institute
NeurIPS 2022


Modern deep learning systems require huge data sets to achieve impressive performance, but there is little guidance on how much or what kind of data to collect. Over-collecting data incurs unnecessary present costs, while under-collecting may incur future costs and delay workflows. We propose a new paradigm for modeling the data collection workflow as a formal optimal data collection problem that allows designers to specify performance targets, collection costs, a time horizon, and penalties for failing to meet the targets. Additionally, this formulation generalizes to tasks requiring multiple data sources, such as labeled and unlabeled data used in semi-supervised learning. To solve our problem, we develop Learn-Optimize-Collect (LOC), which minimizes expected future collection costs. Finally, we numerically compare our framework to the conventional baseline of estimating data requirements by extrapolating from neural scaling laws. We significantly reduce the risks of failing to meet desired performance targets on several classification, segmentation, and detection tasks, while maintaining low total collection costs.

Main Problem

There is an industry folklore that 87% of AI systems never make it to production, and data collection is a fundamental challenge: a 2019 survey found that 51% of enterprise teams struggled to collect enough data for their models.

Given a target performance V*, an initial set of q0 points, a cost-per-sample c, a maximum number of collection rounds T, and a penalty P for failing to meet the target, we must determine how much data qt to collect in each round so as to minimize the total costs and penalties. In each round, we collect more data and re-evaluate our model until we reach the target or the final round.
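The round-based workflow above can be sketched as a simple loop. Here `policy` and `evaluate` are hypothetical placeholders for a collection policy and a model re-training/evaluation step; this is an illustration of the problem setup, not the paper's code.

```python
def collect(q0, c, P, T, v_star, policy, evaluate):
    """Run up to T collection rounds and return the total cost incurred.

    q0: initial dataset size, c: cost per sample,
    P: penalty for missing the target v_star after the final round.
    """
    q = q0
    for t in range(1, T + 1):
        if evaluate(q) >= v_star:      # target met: stop collecting
            return c * (q - q0)
        q = policy(q, t)               # decide how much data to hold next round
    # after the final round, pay the penalty if the target is still unmet
    cost = c * (q - q0)
    if evaluate(q) < v_star:
        cost += P
    return cost
```

A doubling policy, for example, keeps collecting until the (toy) evaluation curve crosses the target or rounds run out.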


In each round, we first estimate the probability distribution of how much data we need by bootstrap resampling different scaling laws and fitting a density estimation (DE) model. We then solve a differentiable optimization problem to minimize the likelihood of not collecting enough data (obtained from the DE model) plus the total collection cost.

1) Learning the data requirement distribution

2) Optimizing how much data to collect
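A minimal sketch of these two steps, under illustrative assumptions: performance is modeled by a power-law scaling curve v(q) = 1 - a*q^(-b), bootstrap refits of that curve give samples of the data requirement D, and a grid search minimizes the expected cost c*q + P*Pr[D > q]. The power-law form and the grid search are our assumptions, standing in for the paper's DE model and differentiable optimization.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_power_law(qs, vs):
    # linear fit of log(1 - v) = log(a) - b * log(q)
    slope, intercept = np.polyfit(np.log(qs), np.log(1 - vs), 1)
    return np.exp(intercept), -slope           # a, b

def requirement_samples(qs, vs, v_star, n_boot=200):
    # bootstrap resample the (size, score) statistics and invert each
    # refit scaling law at the target to sample the data requirement D
    samples = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(qs), len(qs))
        a, b = fit_power_law(np.asarray(qs)[idx], np.asarray(vs)[idx])
        samples.append((a / (1 - v_star)) ** (1 / b))  # solve 1 - a*q^-b = v*
    return np.array(samples)

def optimal_q(samples, c, P, grid):
    # expected cost: collection cost plus penalty times miss probability
    costs = [c * q + P * np.mean(samples > q) for q in grid]
    return grid[int(np.argmin(costs))]
```

With a large penalty P relative to c, the minimizer shifts toward over-collecting, which is exactly the risk-averse behavior the optimization is meant to capture.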

Extensions to Multiple Data Sources

The optimization framework naturally generalizes to more complex settings such as when we have multiple types of data arriving from different sources. Consider:

  • Semi-supervised learning: we can train with labeled and unlabeled data sets, where collecting labeled data incurs an additional cost over unlabeled.
  • Long-tail learning: some classes may be harder to collect, and therefore more expensive, than others.
  • Domain adaptation and synthetic-to-real: source (synthetic) training data can be collected more easily than target (real) data.
In each case, we have different categories of data that incur different costs-per-sample of collection. Although we need data from each category, we may still achieve our performance targets by collecting more of the cheaper data and less of the costly data. The learning and optimization problems can be adapted to this setting.
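The multi-source trade-off can be illustrated with a toy search over mixtures of two data types: keep the cheapest combination that still meets the target. The frontier function `enough(q1, q2)` is a hypothetical stand-in for the learned requirement model, and exhaustive search replaces the paper's optimization.

```python
def cheapest_mix(c1, c2, enough, grid1, grid2):
    """Return (cost, q1, q2) for the cheapest mixture meeting the target.

    c1, c2: per-sample costs of the two data types (e.g. labeled vs. unlabeled);
    enough(q1, q2): True if the mixture is predicted to reach the target.
    """
    best = None
    for q1 in grid1:               # costly data (e.g. labeled, rare classes)
        for q2 in grid2:           # cheap data (e.g. unlabeled, synthetic)
            if enough(q1, q2):
                cost = c1 * q1 + c2 * q2
                if best is None or cost < best[0]:
                    best = (cost, q1, q2)
    return best
```

When the cheap data substitutes well for the costly data, the optimum predictably leans on the cheap source, mirroring the semi-supervised and synthetic-to-real settings above.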


Experiments

For each data set and task, we fix the number of rounds T and then sweep the performance target V* to see how well a policy decides how much data to collect. We compare LOC against the intuitive baseline of using a single neural scaling law to estimate how much data is needed. We evaluate on two metrics:

  • Failure rate: For each data set, task, and T, how often a policy fails to collect enough data to meet the target.
  • Cost ratio: For each data set, task, and T, the average ratio of the incurred cost c(qT - q0) to the minimum possible cost c(q* - q0), where q* is the absolute minimum amount of data needed to reach the performance target. In other words, this measures the relative sub-optimality of a policy with respect to the optimization problem.
We also consider two new problems with multiple data types. First, we consider classification on CIFAR-100 where the first 50 classes cost more than the second 50 classes (e.g., as in long-tail learning). Second, we consider segmentation on BDD100K where we can pseudo-label data from an additional unlabeled source to augment training.
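The two metrics above can be computed from per-run records as follows; the record fields (`qT`, `met_target`) are illustrative names we introduce here, not the paper's evaluation code.

```python
def metrics(runs, q0, c, q_star):
    """Failure rate over all runs; average cost ratio over successful runs.

    Each run is a dict with the final collected size "qT" and a boolean
    "met_target"; q_star is the oracle minimum size reaching the target.
    """
    successes = [r for r in runs if r["met_target"]]
    failure_rate = 1 - len(successes) / len(runs)
    ratios = [c * (r["qT"] - q0) / (c * (q_star - q0)) for r in successes]
    cost_ratio = sum(ratios) / len(ratios) if ratios else float("nan")
    return failure_rate, cost_ratio
```

Note that the cost ratio conditions on success, so the two metrics jointly capture the under- vs. over-collection trade-off.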


Citation

@inproceedings{mahmood2022optimizing,
    author    = {Mahmood, Rafid and Lucas, James and Alvarez, Jose M. and Fidler, Sanja and Law, Marc T.},
    title     = {Optimizing Data Collection for Machine Learning},
    booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
    month     = {November},
    year      = {2022}
}