We improve diffusion models by distilling into multiple students, allowing (a) improved quality, since each student specializes in a subset of the data, and (b) improved latency, since one-step generation can now use smaller, lower-latency student architectures.
Abstract: Diffusion models achieve high-quality sample generation at the cost of a lengthy multistep inference procedure. To overcome this, diffusion distillation techniques produce student generators capable of matching or surpassing the teacher in a single step. However, the student model’s inference speed is limited by the size of the teacher architecture, preventing real-time generation for computationally heavy applications. This work introduces Multi-Student Distillation (MSD), a framework for distilling a conditional teacher diffusion model into multiple single-step generators. Each student generator is responsible for a subset of the conditioning data, thereby obtaining higher generation quality for the same capacity. Because MSD trains multiple distilled students, it permits smaller model sizes and, therefore, faster inference; it also offers a lightweight quality boost over single-student distillation with the same architecture. We demonstrate that MSD is effective by training multiple same-sized or smaller students for single-step distillation using distribution matching and adversarial distillation techniques. With smaller students, MSD achieves competitive results with faster inference for single-step generation. Using 4 same-sized students, MSD sets a new state-of-the-art for one-step image generation: FID 1.20 on ImageNet-64×64 and 8.20 on zero-shot COCO2014.
Multi-Student Diffusion Distillation for Better One-step Generators
Yanke Song, Jonathan Lorraine, Weili Nie, Karsten Kreis, James Lucas
Recent works have used knowledge distillation to accelerate the sampling of multi-step diffusion models. In particular, distribution matching and adversarial distillation obtain one-step generators with comparable or better performance than the teacher. However, these methods must reuse the teacher's architecture, so they 1) have limited network capacity and quality for the more difficult one-step generation task and 2) cannot run faster than a single forward pass of the teacher. Can we make these models both better and faster? We present Multi-Student Distillation (MSD), which distills multiple student generators from a single teacher. At training time, MSD partitions the input condition set, filters the corresponding data, and assigns each subset to a different student; at inference time, only one student is used. In this way, MSD effectively increases the total model capacity, and therefore performance, without incurring additional inference latency.
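To make the partition-and-route idea concrete, here is a minimal sketch of one possible scheme (a modulo split over class labels). The routing rule, function names, and the `student(noise, label)` call signature are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch (not the authors' released code) of partitioning class
# conditions across K students and routing to a single student at inference.
import torch

NUM_STUDENTS = 4

def partition_id(class_label: int, num_students: int = NUM_STUDENTS) -> int:
    """Map each class label to the student responsible for it (simple modulo split)."""
    return class_label % num_students

def filter_for_student(labels: torch.Tensor, k: int) -> torch.Tensor:
    """Training-time mask: select examples whose condition is assigned to student k."""
    return (labels % NUM_STUDENTS) == k

def generate(students, class_label: int, noise: torch.Tensor) -> torch.Tensor:
    """Inference: only the single responsible student runs, so latency stays
    at one (possibly smaller) forward pass."""
    student = students[partition_id(class_label)]
    return student(noise, torch.tensor([class_label]))  # assumed generator interface
```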
The inference speed of one-step generators is still bounded by their model size, so we also want to distill into smaller students. This introduces additional challenges, such as how to initialize the student. We show that this issue can be resolved by prepending a teacher score matching (TSM) stage, which trains multi-step students to emulate the teacher's score estimates and provides useful weight initializations for the subsequent distillation stages.
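Below is a minimal sketch of what one TSM warm-up step could look like, assuming both teacher and student predict noise given (x_t, t, c) and using an illustrative cosine noise schedule; the network interfaces and the schedule are assumptions, not the paper's exact setup.

```python
# Minimal sketch of a teacher score matching (TSM) update: the smaller student
# regresses its prediction onto the teacher's at random timesteps.
import torch
import torch.nn.functional as F

def tsm_step(student, teacher, x0, cond, optimizer, num_timesteps=1000):
    """One TSM update on a clean batch x0 with conditions cond (placeholders)."""
    b = x0.shape[0]
    t = torch.randint(0, num_timesteps, (b,), device=x0.device)
    noise = torch.randn_like(x0)

    # Illustrative cosine schedule for a variance-preserving forward process.
    alpha_bar = torch.cos(0.5 * torch.pi * t.float() / num_timesteps) ** 2
    alpha_bar = alpha_bar.view(b, 1, 1, 1)
    x_t = alpha_bar.sqrt() * x0 + (1 - alpha_bar).sqrt() * noise

    with torch.no_grad():
        target = teacher(x_t, t, cond)  # teacher's noise/score estimate
    pred = student(x_t, t, cond)        # student mimics the teacher

    loss = F.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```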
By training 4 students with the same architecture as the teacher, we show that MSD improves upon its single-student counterparts, as measured by FID on: 1) class-conditional image generation on ImageNet-64×64 and 2) zero-shot text-to-image generation on COCO2014.
| Method | ImageNet-64×64 FID | COCO2014 FID |
|---|---|---|
| DMD | 2.62 | 11.49 |
| MSD4-DM (ours) | 2.37 | 8.80 |
| DMD2 | 1.28 | 8.35 |
| MSD4-AMD (ours) | 1.20 | 8.20 |
In addition, we achieve competitive generation quality by distilling into smaller students trained from scratch, as shown in the figure at the top of the page.
Citation
Song, Y., Lorraine, J., Nie, W., Kreis, K., & Lucas, J. (2024). Multi-Student Diffusion Distillation for Better One-step Generators.
@inproceedings{song2024multistudent,
  title={Multi-Student Diffusion Distillation for Better One-step Generators},
  author={Song, Yanke and Lorraine, Jonathan and Nie, Weili and Kreis, Karsten and Lucas, James},
  year={2024}
}