TL;DR: We adapt Score Distillation Sampling (SDS), originally developed for text-to-3D generation, to audio diffusion models, allowing us to reuse large pretrained models for new text-guided parametric audio tasks such as source separation, physically informed impact synthesis, and more.
Abstract: We introduce Audio-SDS, a generalization of Score Distillation Sampling (SDS) to text-conditioned audio diffusion models. While SDS was originally designed for text-to-3D generation using image diffusion, its core idea, distilling a powerful generative prior into a separate parametric representation, extends naturally to the audio domain. Leveraging a single pretrained model, Audio-SDS enables a broad range of tasks without requiring specialized datasets. In particular, we demonstrate how Audio-SDS can guide physically informed impact sound simulations, calibrate FM-synthesis parameters for expressive timbre design, and perform prompt-specified source separation. Our findings illustrate the versatility of distillation-based methods across modalities and establish a robust foundation for future work integrating generative priors into audio tasks.
Our proposed framework, Audio-SDS, uses a single, large, pretrained text-to-audio diffusion model to power diverse audio tasks, from physically guided impact synthesis to prompt-based source separation, requiring no specialized training or fine-tuning. Below, we give an overview of our method's update:
Our SDS update, originally developed for text-to-3D generation, computes an update for the rendered data in the diffusion model's modality (e.g., image or audio), then propagates that update back through the differentiable simulation to the parameters. We begin with a baseline render, for example from a synthesizer or a physically based simulator, and repeatedly push it toward the user's prompt. Each iteration adds noise to the rendered audio, denoises it conditioned on the prompt, and backpropagates the difference to adjust the underlying parameters. Over many iterations, the diffusion model's rich generative prior is distilled into our audio parameters.
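To make the loop concrete, here is a minimal PyTorch sketch of one iteration. It assumes user-supplied `render`, `encode`, and `denoise` callables standing in for the differentiable renderer and the pretrained latent audio diffusion model, and it simplifies the noise weighting and guidance details.

```python
import torch

def audio_sds_step(params, optimizer, render, encode, denoise, prompt_emb,
                   sigma_min=0.02, sigma_max=1.0, guidance_scale=7.0):
    """One Audio-SDS iteration (hypothetical interfaces): render audio from the
    parameters, noise its latent, denoise conditioned on the prompt, and pull the
    render toward the prompt-aligned denoised estimate."""
    audio = render(params)                       # differentiable render (synthesizer, simulator, ...)
    z = encode(audio)                            # latent of the pretrained audio diffusion model
    sigma = torch.empty(()).uniform_(sigma_min, sigma_max)  # random noise level
    z_noisy = z + sigma * torch.randn_like(z)    # forward diffusion
    with torch.no_grad():                        # SDS: no gradient through the denoiser
        z_uncond = denoise(z_noisy, sigma, None)
        z_cond = denoise(z_noisy, sigma, prompt_emb)
        z_hat = z_uncond + guidance_scale * (z_cond - z_uncond)  # classifier-free guidance
    loss = 0.5 * ((z - z_hat) ** 2).sum()        # gradient matches the distillation update direction
    optimizer.zero_grad()
    loss.backward()                              # flows through encode and render into params
    optimizer.step()
    return float(loss.detach())
```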
|
For our impact synthesizer parameterization, Audio-SDS updates the object and reverb impulse parameters. The final audio is the convolution of the object, reverb, and impact impulses. The impact impulse is simply a delta function (and thus not shown), while the object impulse is parameterized as a sum of damped sinusoids and the reverb impulse as a sum of band-passed white noise.
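As a rough illustration of this parameterization (a sketch, not the exact implementation), the code below assumes 1-D tensors `freqs`, `dampings`, and `amps` for the damped-sinusoid modes and a precomputed `reverb_ir` buffer, which in our setting would itself be built from band-passed white noise.

```python
import torch

def render_impact(freqs, dampings, amps, reverb_ir, sample_rate=44100, duration=1.0):
    """Simplified sketch of the impact parameterization: the object impulse is a
    sum of damped sinusoids, convolved with a reverb impulse response (the impact
    impulse is a delta, so convolving with it changes nothing)."""
    t = torch.arange(int(sample_rate * duration)) / sample_rate
    # Object impulse: sum_i a_i * exp(-d_i * t) * sin(2 * pi * f_i * t)
    modes = amps[:, None] * torch.exp(-dampings[:, None] * t) * torch.sin(2 * torch.pi * freqs[:, None] * t)
    obj_impulse = modes.sum(dim=0)
    # FFT convolution of the object impulse with the reverb impulse response.
    n = obj_impulse.numel() + reverb_ir.numel() - 1
    audio = torch.fft.irfft(torch.fft.rfft(obj_impulse, n) * torch.fft.rfft(reverb_ir, n), n)
    return audio / audio.abs().max().clamp(min=1e-8)
```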
|
Audio demos at iterations 0, 100, and 1000 for the prompts "kick drum, bass, reverb" and "striking metal pot with a wooden spoon". 👆 Click the drum or pot to listen.
|
For FM synthesis, the parameters updated with Audio-SDS are the FM modulation matrix and, for each operator, the frequency ratio, attack, and decay. Below, we visualize these as dials in an FM-synthesis user interface, along with the envelope and audio.
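A simplified sketch of such a differentiable FM render is shown below. It assumes per-operator tensors `ratios`, `attacks`, `decays`, and `out_weights`, a square modulation matrix `fm_matrix`, and a single modulation pass without feedback, so it only approximates a full FM synthesizer.

```python
import torch

def render_fm(base_freq, ratios, fm_matrix, attacks, decays, out_weights,
              sample_rate=44100, duration=1.0):
    """Simplified differentiable FM sketch: each operator is a sinusoid at
    base_freq * ratio with an attack/decay envelope; fm_matrix[i, j] sets how
    strongly operator j phase-modulates operator i (single pass, no feedback)."""
    t = torch.arange(int(sample_rate * duration)) / sample_rate
    env = (1 - torch.exp(-t / attacks[:, None])) * torch.exp(-t / decays[:, None])  # per-operator envelope
    phases = 2 * torch.pi * base_freq * ratios[:, None] * t                         # carrier phases
    modulators = env * torch.sin(phases)                                            # operators without modulation
    operators = env * torch.sin(phases + fm_matrix @ modulators)                    # apply phase modulation
    audio = (out_weights[:, None] * operators).sum(dim=0)                           # mix to mono
    return audio / audio.abs().max().clamp(min=1e-8)
```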
|
Below, we qualitatively contrast results from Audio-SDS with the FM and impact synthesis models, along with quantitative measurements using CLAP, which gauges audio-prompt alignment. The top row is the initialization for each synthesizer. The middle row shows a prompt that both models can represent (both +0.1 CLAP vs. init.), while the final row shows an impact-oriented prompt that the impact synthesizer matches more accurately (final impact +0.3 CLAP vs. final FM).
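For reference, a hedged sketch of computing such a CLAP alignment score with the Hugging Face `transformers` CLAP implementation is given below; the checkpoint name is one publicly available option, not necessarily the one behind our reported numbers.

```python
import torch
from transformers import ClapModel, ClapProcessor

# CLAP expects 48 kHz mono audio (e.g., a 1-D numpy array).
model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

def clap_score(audio_48k, prompt):
    """Cosine similarity between the CLAP audio and text embeddings (higher = better alignment)."""
    inputs = processor(text=[prompt], audios=[audio_48k], sampling_rate=48000,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    a = out.audio_embeds / out.audio_embeds.norm(dim=-1, keepdim=True)
    t = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((a * t).sum(dim=-1))
```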
Source separation is a core problem in audio processing: a mixture of sources is supplied as a single raw audio buffer, and the goal is to decompose it into its constituent sources. Here, we specify a prompt for each source, and the parameters we optimize are the audio buffers of the sources themselves. The update combines an Audio-SDS term for each source's prompt, providing prompt-guided separation per source, with a reconstruction loss forcing the sources to sum to the mixed audio.
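A minimal sketch of this combined update is given below, where `sds_loss_fn` is a hypothetical helper returning a loss whose gradient matches the Audio-SDS update for one source and its prompt (e.g., built like the iteration sketched earlier).

```python
import torch

def separation_step(sources, mixture, prompt_embs, sds_loss_fn, optimizer, recon_weight=1.0):
    """Hypothetical combined update for prompt-guided separation: one Audio-SDS
    term per source (each pulled toward its own prompt) plus a reconstruction
    term forcing the sources to sum to the supplied mixture."""
    # sources: list of per-source audio tensors optimized directly (requires_grad=True).
    sds_loss = sum(sds_loss_fn(src, emb) for src, emb in zip(sources, prompt_embs))
    recon_loss = ((torch.stack(sources).sum(dim=0) - mixture) ** 2).mean()
    loss = sds_loss + recon_weight * recon_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss.detach())
```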
|
Below, we show additional examples of Audio-SDS for prompt-guided source separation. On the left is the provided mixed audio, followed by the separated sources and their prompts, as well as the ground-truth sources for each mixture. Occasional artifacts appear where audio bleeds between channels, but note that we reuse a single pretrained audio diffusion model with no source-separation-specific training.
|
Below, we show a fully automated pipeline on real YouTube audio. We first run the clip through an audio captioner, feed the result to an LLM for prompt suggestions, and then apply Audio-SDS to separate each prompt. Although the results are not perfect, this demo highlights how our approach can scale to in-the-wild recordings, with no manual labeling of sources or specialized datasets required.
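In pseudocode, the pipeline is roughly the following, where `caption_audio`, `suggest_source_prompts`, and `audio_sds_separate` are hypothetical placeholders rather than functions from our release.

```python
def separate_in_the_wild(mixture_audio):
    """Sketch of the automated pipeline; the three helpers below stand in for an
    off-the-shelf audio captioner, an LLM prompt generator, and the Audio-SDS
    separation routine sketched above."""
    caption = caption_audio(mixture_audio)             # e.g., "a band playing with drums and guitar"
    prompts = suggest_source_prompts(caption)          # e.g., ["drum kit", "electric guitar", "bass"]
    return audio_sds_separate(mixture_audio, prompts)  # one separated track per prompt
```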
|
Audio Model Bias: Our results depend on the pretrained Stable Audio Open model. Thus, rare or out-of-domain audio classes (e.g., non-Western instruments, speech, or audio without silence at the end) may yield suboptimal results. Incorporating more comprehensive or fine-tuned diffusion models could overcome this limitation.
|
Citation
Richter-Powell, J., Torralba, A., & Lorraine, J. (2025). Score Distillation Sampling for Audio: Source Separation, Synthesis, and Beyond.

@article{richter2025audiosds,
  title={Score Distillation Sampling for Audio: Source Separation, Synthesis, and Beyond},
  author={Richter-Powell, Jessie and Torralba, Antonio and Lorraine, Jonathan},
  year={2025}
}