Toronto AI Lab

DreamTeacher: Pretraining Image Backbones with Deep Generative Models

Daiqing Li *1
Huan Ling *1,2,3
Amlan Kar 1,2,3
David Acuna 1,2,3
Seung Wook Kim 1,2,3
Karsten Kreis 1
Antonio Torralba 4
Sanja Fidler 1,2,3

1NVIDIA
2University of Toronto
3Vector Institute
4MIT

*Equal contribution.

ICCV 2023

ArXiv · BibTeX


We propose the DreamTeacher framework for distilling knowledge from a pre-trained generative network onto a target image backbone, as a generic way of pre-training without labels. We investigate feature distillation and, optionally, label distillation when task-specific labels are available. DreamTeacher outperforms existing self-supervised methods on a variety of benchmarks.
 
Qualitative visualization on the label-efficient semantic segmentation benchmark. We visualize predictions from DreamTeacher with an ADM generator, mix-distilled into a ConvNX-B backbone trained with as few as 20 labeled images. The predictions are accurate even on thin structures such as a cat's whiskers.




Abstract

In this work, we introduce DreamTeacher, a self-supervised feature representation learning framework that utilizes generative networks for pre-training downstream image backbones. We propose to distill knowledge from a trained generative model into standard image backbones that have been well engineered for specific perception tasks. We investigate two types of knowledge distillation: 1) distilling learned generative features onto target image backbones, as an alternative to pre-training these backbones on large labeled datasets such as ImageNet, and 2) distilling labels obtained from generative networks with task heads onto the logits of target backbones. We perform extensive analyses across multiple generative models, dense prediction benchmarks, and pre-training regimes. We empirically find that DreamTeacher significantly outperforms existing self-supervised representation learning approaches across the board. Unsupervised ImageNet pre-training with DreamTeacher leads to significant improvements over ImageNet classification pre-training on downstream datasets, showcasing generative models, and diffusion models in particular, as a promising approach to representation learning on large, diverse datasets without requiring manual annotation.




Paper

DreamTeacher:
Pretraining Image Backbones with Deep Generative Models


Daiqing Li*, Huan Ling*, Amlan Kar, David Acuna,
Seung Wook Kim, Karsten Kreis, Antonio Torralba, Sanja Fidler

ArXiv
BibTeX


For feedback and questions, please reach out to Huan Ling and Sanja Fidler.









Methods



DreamTeacher architecture: a feature regression module (FR) maps and fuses multi-scale features of a (CNN) image backbone. We supervise the FR with features from the generator's decoding network. We optionally add a feature interpreter to the generator to train a task head with supervised labels, which is then used to supervise the image backbone with a label distillation loss.
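The sketch below illustrates the feature distillation idea in PyTorch-style code. The module and function names (FeatureRegression, feature_distillation_loss), the 1x1-conv projection design, and the MSE loss on channel-normalized features are illustrative assumptions rather than the exact DreamTeacher implementation; the teacher features are assumed to come from the frozen decoder of a pre-trained generative model such as ADM.

import torch.nn as nn
import torch.nn.functional as F

class FeatureRegression(nn.Module):
    # Maps multi-scale backbone features to the channel widths of the frozen
    # generative teacher's decoder features (layer choices are illustrative,
    # not the paper's exact design).
    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        self.proj = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(s, t, kernel_size=1, bias=False),
                nn.BatchNorm2d(t),
                nn.ReLU(inplace=True),
                nn.Conv2d(t, t, kernel_size=3, padding=1),
            )
            for s, t in zip(student_channels, teacher_channels)
        ])

    def forward(self, feats):
        return [p(f) for p, f in zip(self.proj, feats)]

def feature_distillation_loss(student_feats, teacher_feats):
    # Per-level regression of channel-normalized teacher features.
    # Teacher features are detached: only the backbone and FR receive gradients.
    loss = 0.0
    for fs, ft in zip(student_feats, teacher_feats):
        ft = ft.detach()
        if fs.shape[-2:] != ft.shape[-2:]:
            # match spatial resolution if backbone and generator strides differ
            fs = F.interpolate(fs, size=ft.shape[-2:], mode="bilinear",
                               align_corners=False)
        loss = loss + F.mse_loss(F.normalize(fs, dim=1), F.normalize(ft, dim=1))
    return loss / len(student_feats)

In the full method, the teacher features for each training image are produced by the pre-trained (and frozen) generative model, and the loss above is backpropagated only into the image backbone and the FR module.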


Transfer Learning Results


Comparison of DreamTeacher with SoTA self-supervised methods on ImageNet classification and COCO instance segmentation. All baselines, including ADM, are pre-trained on ImageNet-1k without labels. For a fair comparison, both our method and the baselines follow the iBOT fine-tuning setting. Our DT pre-training objective is highlighted as generative (GEN), in contrast to contrastive (CL) and masked image modeling (MIM) objectives. *Our effective epochs include 400 epochs of generative model training and 200 epochs of feature distillation training.


ResNet-50 results on ImageNet classification and COCO instance segmentation. For ImageNet classification, we follow SparK's fine-tuning setting at resolution 224 and report top-1 accuracy (Acc) on the ImageNet val set. For COCO, Mask R-CNN with a ResNet50-FPN backbone is fine-tuned for 12 or 24 epochs (1x or 2x schedule), following the same setup as SparK. *Our effective epochs include 400 epochs of generative model training and 200 epochs of feature distillation training.


Label-efficient Semantic Segmentation Benchmark


Label-efficient semantic segmentation benchmark. We compare our DreamTeacher (DT) with various representation learning baselines. Our DT-mix.distil. with a ResNet-101 backbone (only 43M parameters) outperforms all baselines, some of which use 10x the number of parameters. Our method with a ConvNX-B backbone achieves a new SoTA without using any extra data, i.e., IN1k-1M or IN21k-14M.
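The mix-distillation objective referenced above combines feature distillation with a label distillation term. Below is a minimal sketch of one plausible soft-label term; the function name, the temperature, and the weighting against the feature loss are assumptions for illustration, with the teacher logits assumed to come from a task head (feature interpreter) trained on the frozen generator's features using the few available labels.

import torch.nn.functional as F

def label_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soft-label distillation for dense prediction: KL divergence between
    # temperature-softened teacher and student class distributions.
    # Shapes are (B, C, H, W); teacher logits are detached.
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=1)
    p_teacher = F.softmax(teacher_logits.detach() / t, dim=1)
    # the t^2 factor keeps gradient magnitudes comparable across temperatures
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)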


In-the-wild Label-efficient Segmentation Visualization


Semantic segmentation trained with only 30 labeled images: LSUN-horse with 21 classes. Qualitative results of our ConvNX-B model pre-trained with DreamTeacher feature distillation on unlabeled LSUN-horse images.

Semantic segmentation trained with only 30 labeled images: LSUN-cat with 15 classes. Qualitative results of our ConvNX-B model pre-trained with DreamTeacher feature distillation on unlabeled LSUN-cat images.

Semantic segmentation trained with only 40 labeled images: LSUN-bedroom with 28 classes. Qualitative results of our ConvNX-B model pre-trained with DreamTeacher feature distillation on unlabeled LSUN-bedroom images.

Semantic segmentation trained with only 16 labeled images: LSUN-car with 20 classes. Qualitative results of our ConvNX-B model pre-trained with DreamTeacher feature distillation on unlabeled LSUN-car images.

Semantic segmentation trained with only 20 labeled images: FFHQ with 34 classes. Qualitative results of our ConvNX-B model pre-trained with DreamTeacher feature distillation on unlabeled FFHQ images.



Citation

If you find this work useful for your research, please consider citing it as:

@misc{li2023dreamteacher,
      title={DreamTeacher: Pretraining Image Backbones with Deep Generative Models}, 
      author={Daiqing Li and Huan Ling and Amlan Kar and David Acuna and Seung Wook Kim 
                and Karsten Kreis and Antonio Torralba and Sanja Fidler},
      year={2023},
      eprint={2307.07487},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}