TL;DR: We adapt Score Distillation Sampling (SDS), originally developed for text-to-3D generation, to audio diffusion models, allowing us to reuse large pretrained models for new text-guided parametric audio tasks such as source separation, physically informed impact synthesis, and more.
Abstract: We introduce Audio-SDS, a generalization of Score Distillation Sampling (SDS) to text-conditioned audio diffusion models. While SDS was originally designed for text-to-3D generation using image diffusion, its core idea, distilling a powerful generative prior into a separate parametric representation, extends naturally to the audio domain. Leveraging a single pretrained model, Audio-SDS enables a broad range of tasks without requiring specialized datasets. In particular, we demonstrate how Audio-SDS can guide physically informed impact sound simulations, calibrate FM-synthesis parameters for expressive timbre design, and perform prompt-specified source separation. Our findings illustrate the versatility of distillation-based methods across modalities and establish a robust foundation for future work integrating generative priors into audio tasks.
Our proposed framework, Audio-SDS, uses a single, large, pretrained text-to-audio diffusion model to power diverse audio tasks, from physically guided impact synthesis to prompt-based source separation, requiring no specialized training or fine-tuning. Below, we give an overview of our method's update:
Our SDS update, originally developed for text-to-3D generation, computes an update for the rendered data in the diffusion model's modality (e.g., image or audio), then propagates that update back through the differentiable simulation to the parameters. We begin with a baseline render, for example from a synthesizer or a physically based simulator, and repeatedly push it toward the user's prompt. Each iteration adds noise to the rendered audio, denoises it conditioned on the prompt, and backpropagates the difference to adjust the underlying parameters. Over many iterations, the diffusion model's rich generative prior is distilled into our audio parameters.
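To make the loop concrete, here is a minimal PyTorch sketch of one iteration. It assumes user-supplied `render`, `encode`, and `denoise` callables standing in for the differentiable renderer and the pretrained latent audio diffusion model, and it simplifies the noise weighting and guidance details.

```python
import torch

def audio_sds_step(params, optimizer, render, encode, denoise, prompt_emb,
                   sigma_min=0.02, sigma_max=1.0, guidance_scale=7.0):
    """One Audio-SDS iteration (hypothetical interfaces): render audio from the
    parameters, noise its latent, denoise conditioned on the prompt, and pull the
    render toward the prompt-aligned denoised estimate."""
    audio = render(params)                       # differentiable render (synthesizer, simulator, ...)
    z = encode(audio)                            # latent of the pretrained audio diffusion model
    sigma = torch.empty(()).uniform_(sigma_min, sigma_max)  # random noise level
    z_noisy = z + sigma * torch.randn_like(z)    # forward diffusion
    with torch.no_grad():                        # SDS: no gradient through the denoiser
        z_uncond = denoise(z_noisy, sigma, None)
        z_cond = denoise(z_noisy, sigma, prompt_emb)
        z_hat = z_uncond + guidance_scale * (z_cond - z_uncond)  # classifier-free guidance
    loss = 0.5 * ((z - z_hat) ** 2).sum()        # gradient matches the distillation update direction
    optimizer.zero_grad()
    loss.backward()                              # flows through encode and render into params
    optimizer.step()
    return float(loss.detach())
```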
|
For our impact synthesizer parameterization, Audio-SDS updates the object and reverb impulse parameters. The final audio is the convolution of the object, reverb, and impact impulses. The impact impulse is simply a delta function (and thus not shown), while the object impulse is parameterized as a sum of damped sinusoids and the reverb impulse as a sum of band-passed white noise.
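As a rough illustration of this parameterization (a sketch, not the exact implementation), the code below assumes 1-D tensors `freqs`, `dampings`, and `amps` for the damped-sinusoid modes and a precomputed `reverb_ir` buffer, which in our setting would itself be built from band-passed white noise.

```python
import torch

def render_impact(freqs, dampings, amps, reverb_ir, sample_rate=44100, duration=1.0):
    """Simplified sketch of the impact parameterization: the object impulse is a
    sum of damped sinusoids, convolved with a reverb impulse response (the impact
    impulse is a delta, so convolving with it changes nothing)."""
    t = torch.arange(int(sample_rate * duration)) / sample_rate
    # Object impulse: sum_i a_i * exp(-d_i * t) * sin(2 * pi * f_i * t)
    modes = amps[:, None] * torch.exp(-dampings[:, None] * t) * torch.sin(2 * torch.pi * freqs[:, None] * t)
    obj_impulse = modes.sum(dim=0)
    # FFT convolution of the object impulse with the reverb impulse response.
    n = obj_impulse.numel() + reverb_ir.numel() - 1
    audio = torch.fft.irfft(torch.fft.rfft(obj_impulse, n) * torch.fft.rfft(reverb_ir, n), n)
    return audio / audio.abs().max().clamp(min=1e-8)
```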
|
Audio demos at iterations 0, 100, and 1000 for the prompts "kick drum, bass, reverb" and "striking metal pot with a wooden spoon". 👆 Click the drum or pot to listen.
|
For FM synthesis, the parameters updated with Audio-SDS are the FM modulation matrix and, for each operator, the frequency ratio, attack, and decay. Below, we visualize these as dials in an FM-synthesis user interface, along with the envelope and audio.
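A simplified sketch of such a differentiable FM render is shown below. It assumes per-operator tensors `ratios`, `attacks`, `decays`, and `out_weights`, a square modulation matrix `fm_matrix`, and a single modulation pass without feedback, so it only approximates a full FM synthesizer.

```python
import torch

def render_fm(base_freq, ratios, fm_matrix, attacks, decays, out_weights,
              sample_rate=44100, duration=1.0):
    """Simplified differentiable FM sketch: each operator is a sinusoid at
    base_freq * ratio with an attack/decay envelope; fm_matrix[i, j] sets how
    strongly operator j phase-modulates operator i (single pass, no feedback)."""
    t = torch.arange(int(sample_rate * duration)) / sample_rate
    env = (1 - torch.exp(-t / attacks[:, None])) * torch.exp(-t / decays[:, None])  # per-operator envelope
    phases = 2 * torch.pi * base_freq * ratios[:, None] * t                         # carrier phases
    modulators = env * torch.sin(phases)                                            # operators without modulation
    operators = env * torch.sin(phases + fm_matrix @ modulators)                    # apply phase modulation
    audio = (out_weights[:, None] * operators).sum(dim=0)                           # mix to mono
    return audio / audio.abs().max().clamp(min=1e-8)
```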
|
Below, we qualitatively contrast results from Audio-SDS with the FM and impact synthesis models, along with quantitative measurements using CLAP, which gauges audio-prompt alignment. The top row is the initialization for each synthesizer. The middle row shows a prompt that both models can represent (both +0.1 CLAP vs. init.), while the final row shows an impact-oriented prompt that the impact synthesizer matches more accurately (final impact +0.3 CLAP vs. final FM).
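For reference, a hedged sketch of computing such a CLAP alignment score with the Hugging Face `transformers` CLAP implementation is given below; the checkpoint name is one publicly available option, not necessarily the one behind our reported numbers.

```python
import torch
from transformers import ClapModel, ClapProcessor

# CLAP expects 48 kHz mono audio (e.g., a 1-D numpy array).
model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

def clap_score(audio_48k, prompt):
    """Cosine similarity between the CLAP audio and text embeddings (higher = better alignment)."""
    inputs = processor(text=[prompt], audios=[audio_48k], sampling_rate=48000,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    a = out.audio_embeds / out.audio_embeds.norm(dim=-1, keepdim=True)
    t = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((a * t).sum(dim=-1))
```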
Source separation is a core problem in audio processing: a mixture of sources is supplied as a single raw audio buffer, and the goal is to decompose it into its constituent sources. Here, we specify a prompt for each source, and the parameters we optimize are the audio buffers of the sources themselves. The update combines an Audio-SDS term for each source's prompt, providing prompt-guided separation per source, with a reconstruction loss forcing the sources to sum to the mixed audio.
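A minimal sketch of this combined update is given below, where `sds_loss_fn` is a hypothetical helper returning a loss whose gradient matches the Audio-SDS update for one source and its prompt (e.g., built like the iteration sketched earlier).

```python
import torch

def separation_step(sources, mixture, prompt_embs, sds_loss_fn, optimizer, recon_weight=1.0):
    """Hypothetical combined update for prompt-guided separation: one Audio-SDS
    term per source (each pulled toward its own prompt) plus a reconstruction
    term forcing the sources to sum to the supplied mixture."""
    # sources: list of per-source audio tensors optimized directly (requires_grad=True).
    sds_loss = sum(sds_loss_fn(src, emb) for src, emb in zip(sources, prompt_embs))
    recon_loss = ((torch.stack(sources).sum(dim=0) - mixture) ** 2).mean()
    loss = sds_loss + recon_weight * recon_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss.detach())
```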
|
Below, we show additional examples of Audio-SDS for prompt-guided source separation. On the left is the provided mixed audio, followed by the separated sources and their prompts, as well as the ground-truth sources for each mixture. Occasional artifacts appear where audio bleeds between channels, but note that we reuse a single pretrained audio diffusion model with no source-separation-specific training.
|
Below, we show a fully automated pipeline on real YouTube audio. We first run the clip through an audio captioner, feed the result to an LLM for prompt suggestions, and then apply Audio-SDS to separate each prompt. Although the results are not perfect, this demo highlights how our approach can scale to in-the-wild recordings, with no manual labeling of sources or specialized datasets required.
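In pseudocode, the pipeline is roughly the following, where `caption_audio`, `suggest_source_prompts`, and `audio_sds_separate` are hypothetical placeholders rather than functions from our release.

```python
def separate_in_the_wild(mixture_audio):
    """Sketch of the automated pipeline; the three helpers below stand in for an
    off-the-shelf audio captioner, an LLM prompt generator, and the Audio-SDS
    separation routine sketched above."""
    caption = caption_audio(mixture_audio)             # e.g., "a band playing with drums and guitar"
    prompts = suggest_source_prompts(caption)          # e.g., ["drum kit", "electric guitar", "bass"]
    return audio_sds_separate(mixture_audio, prompts)  # one separated track per prompt
```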
|
Audio Model Bias: Our results depend on the pretrained Stable Audio Open model. Thus, rare or out-of-domain audio classes (e.g., non-Western instruments, speech, or audio without silence at the end) may yield suboptimal results. Incorporating more comprehensive or fine-tuned diffusion models could overcome this limitation.
|
Citation
Richter-Powell, J., Torralba, A., & Lorraine, J. (2025). Score Distillation Sampling for Audio: Source Separation, Synthesis, and Beyond.

@article{richter2025audiosds,
  title={Score Distillation Sampling for Audio: Source Separation, Synthesis, and Beyond},
  author={Richter-Powell, Jessie and Torralba, Antonio and Lorraine, Jonathan},
  year={2025}
}