Toronto AI Lab

BigDatasetGAN: Synthesizing ImageNet with Pixel-wise Annotations

Daiqing Li 1
Huan Ling1,2,3
Seung Wook Kim 1,2,3
Karsten Kreis1
Adela Barriuso
Sanja Fidler1,2,3
Antonio Torralba4

1NVIDIA
2University of Toronto
3Vector Institute
4MIT

[Paper]   [ArXiv]   [BibTeX]   [Code]   [Dataset]


BigDatasetGAN overview: (1) We sample a few images per class from BigGAN and manually annotate them with masks. (2) We train a feature interpreter branch on top of BigGAN's and VQGAN's features on this data, turning these GANs into generators of labeled data. (3) We sample large synthetic datasets from BigGAN & VQGAN. (4) We use these datasets for training segmentation models.

Our synthesized pixel-wise labeled ImageNet dataset. We sample both images and masks for each of the 1k ImageNet classes.

Annotating images with pixel-wise labels is a time-consuming and costly process. Recently, DatasetGAN showcased a promising alternative: synthesizing a large labeled dataset via a generative adversarial network (GAN) by exploiting a small set of manually labeled, GAN-generated images. Here, we scale DatasetGAN to the class diversity of ImageNet. We take image samples from the class-conditional generative model BigGAN trained on ImageNet, and manually annotate 5 images per class, for all 1k classes. By training an effective feature segmentation architecture on top of BigGAN, we turn BigGAN into a labeled dataset generator. We further show that VQGAN can similarly serve as a dataset generator, leveraging the already annotated data. We create a new ImageNet benchmark by labeling an additional set of 8k real images and evaluate segmentation performance in a variety of settings. Through an extensive ablation study, we show big gains from leveraging a large generated dataset to train different supervised and self-supervised backbone models on pixel-wise tasks. Furthermore, we demonstrate that using our synthesized datasets for pre-training leads to improvements over standard ImageNet pre-training on several downstream datasets, such as PASCAL-VOC, MS-COCO, Cityscapes and chest X-ray, as well as on downstream tasks (detection and segmentation). Our benchmark will be made public, and we will maintain a leaderboard for this challenging task.




News



CVPR Presentation

 




Methods

Architecture of BigDatasetGAN based on BigGAN. We augment BigGAN with a segmentation branch using BigGAN's features. We exploit the rich semantic features of generative models in order to synthesize paired data, segmentation masks and images, turning generative models into dataset generators.
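As a rough illustration of this design, the sketch below shows a feature-interpreter segmentation branch that projects intermediate generator feature maps to a common resolution and predicts per-pixel logits. The module names, channel widths, and fusion scheme are illustrative assumptions, not the exact architecture released with the paper.

# Minimal sketch (not the released code) of a feature-interpreter segmentation
# branch on top of frozen generator features. Channel widths, module names and
# the fusion scheme are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureInterpreter(nn.Module):
    """Predicts per-pixel class logits from intermediate generator feature maps."""

    def __init__(self, feature_channels, num_classes, hidden=128):
        super().__init__()
        # One 1x1 projection per feature resolution to a shared channel width.
        self.proj = nn.ModuleList(
            [nn.Conv2d(c, hidden, kernel_size=1) for c in feature_channels]
        )
        self.head = nn.Sequential(
            nn.Conv2d(hidden * len(feature_channels), hidden, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, num_classes, kernel_size=1),
        )

    def forward(self, features, out_size):
        # Upsample every projected feature map to the output resolution and
        # concatenate along channels before the prediction head.
        ups = [
            F.interpolate(p(f), size=out_size, mode="bilinear", align_corners=False)
            for p, f in zip(self.proj, features)
        ]
        return self.head(torch.cat(ups, dim=1))

# Usage: hook per-block activations while sampling from the frozen GAN, then
# supervise the interpreter with the few manually annotated masks.
interpreter = FeatureInterpreter(feature_channels=[1536, 768, 384, 192], num_classes=2)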



Dataset Analysis

We provide analyses of our synthesized datasets compared to the real annotated ImageNet samples. We compare image and label quality in terms of distribution metrics using the real annotated dataset as reference. We also compare various label statistics and perform shape analysis on labeled polygons in terms of shape complexity and diversity.

Dataset analysis. We report image & mask statistics across our datasets. We compute image and label quality using FID and KID, with the Real-annotated dataset as reference. IN: instance count per image; MI: ratio of mask area to image area; BI: ratio of the mask's tight bounding box area to image area; MB: ratio of mask area to the area of its tight bounding box; PL: polygon length (polygon normalized to width and height of 1); SC: shape complexity, measured by the number of points in a simplified polygon; SD: shape diversity, measured by the mean pair-wise Chamfer distance per class, averaged across classes.
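As a concrete reference for the SD statistic, the snippet below computes the mean pair-wise Chamfer distance between normalized boundary polygons of one class. It is a sketch of the metric as defined above, not the exact evaluation code.

# Illustrative computation of the SD (shape diversity) statistic: mean pair-wise
# Chamfer distance between normalized boundary polygons of a single class.
import itertools
import numpy as np

def chamfer(a, b):
    """Symmetric Chamfer distance between two (N, 2) point sets."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def shape_diversity(polygons):
    """Mean pair-wise Chamfer distance over all polygons of one class.

    Each polygon is an (N, 2) array normalized to a unit bounding box,
    matching the PL/SD normalization described in the table caption.
    """
    pairs = list(itertools.combinations(polygons, 2))
    return float(np.mean([chamfer(a, b) for a, b in pairs])) if pairs else 0.0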


Examples from our datasets: Real-annotated (real ImageNet subset labeled manually), Synthetic-annotated (BigGAN’s samples labeled manually), and the synthetic BigGAN-sim and VQGAN-sim datasets. Notice the high quality of the synthetically sampled labeled examples.


Mean shapes from our BigGAN-sim dataset. For our 100k BigGAN-sim dataset, each class has around 100 samples. We crop the mask from the segmentation label and run k-means with 5 clusters to extract the major modes of the selected ImageNet class shapes.
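The sketch below outlines this procedure: crop and resize each foreground mask, then cluster with k-means (k = 5) so that the cluster centers act as the class's mean shapes. The cropping and resizing details are assumptions; the exact preprocessing used for the figure may differ.

# Sketch of the mean-shape extraction for one ImageNet class.
import numpy as np
from sklearn.cluster import KMeans

def mean_shapes(masks, k=5, size=64):
    """masks: list of binary (H, W) arrays for one class."""
    crops = []
    for m in masks:
        ys, xs = np.nonzero(m)
        if len(ys) == 0:
            continue
        crop = m[ys.min():ys.max() + 1, xs.min():xs.max() + 1].astype(np.float32)
        # Nearest-neighbor resize of the cropped mask to a fixed grid.
        yi = np.linspace(0, crop.shape[0] - 1, size).astype(int)
        xi = np.linspace(0, crop.shape[1] - 1, size).astype(int)
        crops.append(crop[np.ix_(yi, xi)].ravel())
    km = KMeans(n_clusters=k, n_init=10).fit(np.stack(crops))
    # Each cluster center, reshaped, is one "mean shape" of the class.
    return km.cluster_centers_.reshape(k, size, size)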



ImageNet Segmentation Benchmark

We introduce a benchmark with a suite of segmentation challenges using our Synthetic-annotated dataset (5k) as training set and evaluate on our Real-annotated held-out dataset (8k). Specifically, we evaluate performance for (1) two individual classes (dog and bird), (2) foreground/background (FG/BG) segmentation evaluated across all 1k classes, and (3) multi-class semantic segmentation for various subsets of classes.
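Segmentation performance throughout the benchmark is reported as mean intersection-over-union (mIoU). The snippet below gives the standard confusion-matrix formulation of this metric; it is a generic implementation, not the benchmark's exact evaluation script.

# Generic mIoU computation via an accumulated confusion matrix.
import numpy as np

def mean_iou(preds, gts, num_classes):
    """preds, gts: iterables of (H, W) integer label maps."""
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    for p, g in zip(preds, gts):
        idx = g.ravel() * num_classes + p.ravel()
        conf += np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)
    inter = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    return (inter / np.maximum(union, 1)).mean()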

ImageNet pixel-wise benchmark. Numbers are mIoU. We compare various methods on several tasks, with supervised and self-supervised pre-training. We use ResNet-50 for all methods and ablate the use of synthetic datasets for three of them. FG/BG evaluates binary segmentation across all classes; MC-N columns evaluate multi-class segmentation performance in setups with N classes. Adding synthetic datasets improves performance by a large margin. BigGAN-off and BigGAN-on compare the offline & online sampling strategies.



ImageNet Segmentation Visualization


Qualitative results on MC-128. We visualize predictions (second column) of DeepLab trained on our BigGAN-sim dataset, compared to ground-truth annotations (third column). The final row shows typical failure cases, which include objects with multiple parts, thin structures, or complicated scenes.



ImageNet Segmentation vs. Classification Analysis


Top-5 analysis of the ImageNet benchmark. Text below each image indicates: class name, FG/BG segmentation measured in mIoU, and classification accuracy of a ResNet-50 pre-trained on ImageNet. Top row: the Top-5 best predictions of DeepLabv3 trained on the BigGAN-sim dataset for the FG/BG task, compared to ground-truth annotations. Bottom row: the Top-5 worst predictions. Typical failure cases include small objects and thin structures. Interestingly, classes that are hard to segment, such as basketball and bow, are not necessarily hard to classify.



ImageNet Segmentation Ablation Study



Ablating synthetic dataset size. We fix the model to a ResNet-50 backbone and compare performance as we increase the synthetic dataset size. The model trained on a 22k synthetic dataset outperforms the same model trained on the 2k human-annotated dataset. Another 7 points are gained when further increasing the synthetic data size from 22k to 220k. Here, 2M is the total number of samples synthesized through our online sampling strategy.

Ablating backbone size. We scale up the backbone from ResNet-50 to ResNet-101 and ResNet-152. We supervise with 2k human-annotated labels (red) and with our BigGAN-sim dataset (green), which is 100x larger. BigGAN-sim supervision leads to consistent improvements, especially for larger models.



Downstream Tasks Performance

We propose a simple architecture design to jointly train model backbones with contrastive learning and with supervision from our synthetic datasets as a pre-training step. Here we show transfer learning results for dense prediction tasks on MS-COCO, PASCAL-VOC and Cityscapes, as well as chest X-ray segmentation in the medical domain.
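A minimal sketch of such a joint objective is shown below: an InfoNCE-style contrastive loss on two augmented views plus a pixel-wise cross-entropy on synthetic (image, mask) pairs. The head designs, loss weight, and temperature are illustrative assumptions, not the exact recipe used for our pre-training.

# Hedged sketch of joint contrastive + synthetic-mask pre-training (one step).
import torch
import torch.nn.functional as F

def joint_pretrain_step(backbone, proj_head, seg_head, views, syn_images, syn_masks,
                        optimizer, lam=1.0, tau=0.2):
    # Contrastive branch: proj_head is assumed to globally pool backbone features
    # and project them to an embedding vector.
    v1, v2 = views
    z1 = F.normalize(proj_head(backbone(v1)), dim=1)
    z2 = F.normalize(proj_head(backbone(v2)), dim=1)
    logits = z1 @ z2.t() / tau                        # positives lie on the diagonal
    targets = torch.arange(z1.size(0), device=z1.device)
    loss_con = F.cross_entropy(logits, targets)

    # Supervised branch: dense logits from synthetic images, supervised by GAN masks.
    seg_logits = seg_head(backbone(syn_images))
    seg_logits = F.interpolate(seg_logits, size=syn_masks.shape[-2:],
                               mode="bilinear", align_corners=False)
    loss_seg = F.cross_entropy(seg_logits, syn_masks)

    loss = loss_con + lam * loss_seg
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()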

MS-COCO object detection & instance segmentation. Using our synthetic data during pre-training improves object detection performance by 0.4 AP^bb and instance segmentation by 0.3 AP^mk in the 1x training schedule. When training longer with the 2x schedule, our synthetic data consistently helps, improving task performance by 0.3 AP^bb and 0.2 AP^mk.




PASCAL VOC detection & semantic segmentation. For detection, we train on the trainval07+12 set and evaluate on test2007. For semantic segmentation, we train on train_aug 2012 and evaluate on val 2012. Results are averaged over 5 individual trials.





Semi-supervised chest X-ray segmentation with a frozen backbone. Performance numbers are mIoU. When using our synthetic dataset, we match the performance of the supervised and self-supervised pre-trained networks with only 1% and 5% of the labels, respectively, and achieve a big gain when using 100% of the data. Numbers are averaged over 3 independent trials.

Cityscapes instance and semantic segmentation. Training with our BigGAN-sim dataset improves AP^mk by 0.3 points on the instance segmentation task over the baseline model. However, we do not see a significant performance boost on the semantic segmentation task.
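For reference, the sketch below shows one way to implement the frozen-backbone protocol: the pre-trained encoder stays fixed and only a light segmentation head is fit on the labeled fraction of X-rays. The head design and training schedule are assumptions for illustration, not our exact setup.

# Sketch of the frozen-backbone semi-supervised protocol.
import torch
import torch.nn as nn
import torch.nn.functional as F

def train_frozen_backbone(backbone, num_classes, labeled_loader, epochs=50, lr=1e-3):
    for p in backbone.parameters():        # backbone weights stay fixed
        p.requires_grad = False
    backbone.eval()

    head = nn.Conv2d(2048, num_classes, kernel_size=1)   # assumes ResNet-50 C5 features
    opt = torch.optim.Adam(head.parameters(), lr=lr)

    for _ in range(epochs):
        for images, masks in labeled_loader:              # e.g. 1%, 5%, or 100% of labels
            with torch.no_grad():
                feats = backbone(images)                   # (B, 2048, H/32, W/32)
            logits = F.interpolate(head(feats), size=masks.shape[-2:],
                                   mode="bilinear", align_corners=False)
            loss = F.cross_entropy(logits, masks)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head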



Paper

BigDatasetGAN:
Synthesizing ImageNet with Pixel-wise Annotations


Daiqing Li, Huan Ling, Seung Wook Kim,
Karsten Kreis, Adela Barriuso, Sanja Fidler, Antonio Torralba

[Paper]      [Benchmark and dataset] (coming soon)

For feedback and questions please reach out to Daiqing Li and Huan Ling.





Citation

If you find this work useful for your research, please consider citing it as:

@inproceedings{bigDatasetGAN,
  title         = {BigDatasetGAN: Synthesizing ImageNet with Pixel-wise Annotations},
  author        = {Daiqing Li and Huan Ling and Seung Wook Kim and Karsten Kreis and
                   Adela Barriuso and Sanja Fidler and Antonio Torralba},
  eprint        = {2201.04684},
  archivePrefix = {arXiv},
  year          = {2022}
}
    
See prior work on using GANs for downstream tasks, which BigDatasetGAN builds on:
DatasetGAN
@inproceedings{zhang21,
  title     = {DatasetGAN: Efficient Labeled Data Factory with Minimal Human Effort},
  author    = {Zhang, Yuxuan and Ling, Huan and Gao, Jun and Yin, Kangxue and Lafleche, Jean-Francois and
               Barriuso, Adela and Torralba, Antonio and Fidler, Sanja},
  booktitle = {CVPR},
  year      = {2021}
}
    
SemanticGAN
@inproceedings{semanticGAN,
  title     = {Semantic Segmentation with Generative Models: Semi-Supervised Learning and Strong Out-of-Domain Generalization},
  author    = {Li, Daiqing and Yang, Junlin and Kreis, Karsten and Torralba, Antonio and Fidler, Sanja},
  booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2021}
}
    




Dataset Visualization

Here we show random samples from the human-annotated datasets, Real-annotated (a real ImageNet subset labeled manually) and Synthetic-annotated (BigGAN’s samples labeled manually), as well as from the synthetic BigGAN-sim and VQGAN-sim datasets generated by BigGAN and VQGAN. We also show a side-by-side comparison between the BigGAN-sim and VQGAN-sim datasets.
Examples from the Real-annotated dataset. We visualize both the segmentation masks as well as the boundary polygons.




Examples from the Synthetic-annotated dataset. We visualize both the segmentation masks as well as the boundary polygons.




Examples from the BigGAN-sim random samples. We visualize both the segmentation masks as well as the boundary polygons.




Examples from the VQGAN-sim random samples. We visualize both the segmentation masks as well as the boundary polygons.




BigGAN-sim vs VQGAN-sim. We select the same classes in each row for both BigGAN-sim and VQGAN-sim for easy comparison. Compared to BigGAN-sim, the VQGAN-sim samples are more diverse in terms of object scale, pose, and background. However, BigGAN-sim has better label quality than VQGAN-sim, whose labels in some cases have holes and are noisy.


BigGAN-sim per-class samples
VQGAN-sim per-class samples