UALM: Unified Audio Language Model for Understanding, Generation, and Reasoning

Published: October 08, 2025

Paper Code

Authors: Jinchuan Tian (equal), Sang-gil Lee (equal), Zhifeng Kong (equal), Sreyan Ghosh, Arushi Goel, Chao-Han Huck Yang, Wenliang Dai, Zihan Liu, Hanrong Ye, Shinji Watanabe, Mohammad Shoeybi, Bryan Catanzaro, Rafael Valle, Wei Ping (Project Lead)

Posted: Zhifeng Kong

Overview

Recent advances in the audio language modeling (ALM) domain tackle audio understanding and text-to-audio generation as separate tasks. Very few studies attempt to unify these tasks – an essential step toward advanced multimodal reasoning. This paper introduces Unified Audio Language Model (UALM), which aims to unify audio understanding, text-to-audio generation, and multimodal reasoning in a single model. To achieve this goal, we first present UALM-Gen, a text-to-audio language model that directly predicts audio tokens and is comparable to state-of-the-art diffusion-based models. We then demonstrate, using proper data blending, training recipes, and inference techniques, that our single UALM model matches the quality of state-of-the-art specialized models in audio understanding, text-to-audio generation, and text reasoning. Furthermore, we present UALM-Reason, a multimodal reasoning model that utilizes both text and audio in the intermediate thinking steps to facilitate complex generation tasks. To our knowledge, this is the first demonstration in audio research of cross-modal generative reasoning, with its effectiveness confirmed by subjective evaluations.

Model Architecture

UALM is based on a decoder-only LLM architecture. Audio inputs are handled as in LLaVA and Audio Flamingo 3: our pre-trained Whisper encoder computes audio features, and an MLP then maps them to audio embeddings. The audio outputs are X-Codec tokens (8-layer RVQ tokens).
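The input path described above can be sketched with NumPy. This is only an illustration of the encoder-features-to-MLP-to-embeddings idea, not the actual UALM code: the dimensions, the random weights, the ReLU activation, the two-layer depth, and the choice to place audio embeddings before text embeddings are all assumptions, and the random arrays stand in for real encoder outputs and text token embeddings.

```python
import numpy as np

def mlp_adapter(features, w1, b1, w2, b2):
    """Map encoder features (frames, enc_dim) to LLM-sized embeddings (frames, llm_dim)."""
    hidden = np.maximum(features @ w1 + b1, 0.0)  # ReLU is an illustrative choice
    return hidden @ w2 + b2

rng = np.random.default_rng(0)
ENC_DIM, HID_DIM, LLM_DIM = 16, 32, 24  # hypothetical sizes

# Randomly initialised adapter weights (a trained model would learn these).
w1, b1 = rng.normal(size=(ENC_DIM, HID_DIM)), np.zeros(HID_DIM)
w2, b2 = rng.normal(size=(HID_DIM, LLM_DIM)), np.zeros(LLM_DIM)

audio_features = rng.normal(size=(10, ENC_DIM))  # stand-in for 10 frames of encoder output
text_embeddings = rng.normal(size=(5, LLM_DIM))  # stand-in for 5 embedded text tokens

audio_embeddings = mlp_adapter(audio_features, w1, b1, w2, b2)

# The decoder-only LLM then consumes a single sequence of embeddings.
llm_input = np.concatenate([audio_embeddings, text_embeddings], axis=0)
print(llm_input.shape)
```

The point of the adapter is that audio frames end up in the same embedding space as text tokens, so the LLM needs no architectural change to accept them.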

To efficiently sample audio tokens, we use the delay pattern following MusicGen. Let $A_{n,t}$ be the audio token at time frame $t$ and layer $n$. At step $s$, we predict all 8 tokens $\{A_{1,s}, A_{2,s-1}, \cdots, A_{8,s-7}\}$ in parallel.
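As a minimal sketch of the delay pattern, the code below shifts each RVQ layer right by its layer index so that one column of the shifted grid holds exactly the tokens predicted together at one step. Layers are 0-indexed here, so row `n` corresponds to layer $n+1$ in the notation above. The function names and the `PAD` placeholder value are hypothetical and are not taken from the UALM or MusicGen code.

```python
import numpy as np

PAD = -1  # hypothetical placeholder for grid positions that hold no token

def apply_delay_pattern(codes: np.ndarray, pad: int = PAD) -> np.ndarray:
    """Shift row n (0-indexed layer) right by n steps.

    codes:   (num_layers, num_frames), codes[n, t] is the token of layer n at frame t.
    returns: (num_layers, num_frames + num_layers - 1), where column s holds
             codes[n, s - n] for every layer n whose frame index s - n is valid.
    """
    num_layers, num_frames = codes.shape
    delayed = np.full((num_layers, num_frames + num_layers - 1), pad, dtype=codes.dtype)
    for n in range(num_layers):
        delayed[n, n:n + num_frames] = codes[n]
    return delayed

def undo_delay_pattern(delayed: np.ndarray) -> np.ndarray:
    """Invert apply_delay_pattern and recover the (num_layers, num_frames) grid."""
    num_layers, num_steps = delayed.shape
    num_frames = num_steps - num_layers + 1
    return np.stack([delayed[n, n:n + num_frames] for n in range(num_layers)])

# Toy example: 8 layers, 4 frames, tokens numbered 0..31 row by row.
codes = np.arange(32).reshape(8, 4)
delayed = apply_delay_pattern(codes)
restored = undo_delay_pattern(delayed)
```

With this layout, generating a clip of $T$ frames takes $T + 7$ decoding steps rather than $8T$, at the cost of a few padded positions at the start and end of the sequence.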

As X-Codec only produces 16 kHz mono audio and may have codec artifacts, we further train an Enhancement VAE to improve the quality to 48 kHz stereo.

UALM-Gen

One of the major challenges is to support text-to-audio generation within the LLM framework, as recent state-of-the-art models are mostly latent diffusion models. UALM-Gen tackles this problem with three techniques: data scaling, classifier-free guidance (CFG) in the LLM, and DPO. We find these to be critical for high-quality audio generation. UALM-Gen is a 1.5B LLM that predicts X-Codec audio tokens and matches state-of-the-art diffusion models such as our ETTA model.
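To make the CFG idea concrete for an autoregressive model, the sketch below combines next-token logits from a text-conditioned pass and an unconditional pass. It assumes the common logit-space formulation (also used by MusicGen); the exact formulation and guidance scale used by UALM-Gen are not stated here, so the function name, the toy logits, and the scale of 3.0 are illustrative assumptions.

```python
import numpy as np

def cfg_logits(cond_logits, uncond_logits, guidance_scale):
    """Extrapolate from the unconditional logits toward the conditional ones.

    guidance_scale = 0 gives the unconditional logits, 1 gives the conditional
    logits, and values above 1 amplify the effect of the text prompt.
    """
    cond_logits = np.asarray(cond_logits, dtype=float)
    uncond_logits = np.asarray(uncond_logits, dtype=float)
    return uncond_logits + guidance_scale * (cond_logits - uncond_logits)

def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Toy 3-token vocabulary: the prompt favours token 0.
cond = np.array([2.0, 0.0, -1.0])    # logits with the text prompt
uncond = np.array([1.0, 0.5, -1.0])  # logits with the prompt dropped

guided = cfg_logits(cond, uncond, guidance_scale=3.0)
p_cond = softmax(cond)
p_guided = softmax(guided)
```

In this toy case the guided distribution puts more mass on token 0 than the conditional distribution alone, which is the intended effect: tokens that the prompt makes more likely are boosted further. In practice each decoding step requires two forward passes (or one batched pass over both inputs).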

UALM

We then combine training data of all modalities and train our 7B unified model, UALM, on audio generation, audio understanding, and text-only tasks. We upweight the audio generation data due to the difficulty of this task, and apply an additional warmup stage before full finetuning.
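One simple way to upweight a task in a data blend is to sample the task of each training example in proportion to a per-task weight. The sketch below illustrates that idea only; the task names and the weights (2:1:1) are hypothetical and do not reflect the actual UALM blend ratios, and the warmup stage is not modeled.

```python
import random
from collections import Counter

# Hypothetical blend: audio generation is sampled twice as often as the other tasks.
BLEND = {"audio_generation": 2.0, "audio_understanding": 1.0, "text_only": 1.0}

def make_blend_sampler(weights, seed=0):
    """Return a function that draws a task name with probability proportional to its weight."""
    rng = random.Random(seed)  # local RNG so the draw sequence is reproducible
    names = list(weights)
    probs = [weights[name] for name in names]

    def sample_task():
        return rng.choices(names, weights=probs, k=1)[0]

    return sample_task

sample_task = make_blend_sampler(BLEND, seed=0)
counts = Counter(sample_task() for _ in range(10_000))
fractions = {name: counts[name] / 10_000 for name in BLEND}
print(fractions)
```

With these illustrative weights, roughly half of the sampled examples come from audio generation and about a quarter from each of the other two tasks.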

UALM achieves impressive audio generation as shown above, along with audio understanding and text problem-solving abilities comparable to those of domain-expert models. It is worth noting that prior unified understanding and generation models in the vision domain, such as Liquid and Chameleon, have degraded text abilities (MMLU). UALM retains the text abilities of its base LLM, showing the success of unified training.

UALM-Reason

UALM-Reason unlocks more complex abilities that we call multimodal reasoning: the ability to reason beyond the text domain. UALM-Reason supports three multimodal reasoning abilities with a focus on audio generation:

Enrichment: the model enriches a short caption into a complex and detailed caption before generating audio;
Dialogue: the model chats with the user and progressively builds a complex caption according to the user's requests before generating audio;
Self-reflection: the model listens to its own output, and generates an improved version of it.

These abilities show a deep synergy between understanding and generation, marking a significant step towards higher-level intelligence in multimodal models.

Citation

@misc{tian2025ualm,
  title={UALM: Unified Audio Language Model for Understanding, Generation, and Reasoning},
  author={Tian, Jinchuan and Lee, Sang-gil and Kong, Zhifeng and Ghosh, Sreyan and Goel, Arushi and Yang, Chao-Han Huck and Dai, Wenliang and Liu, Zihan and Ye, Hanrong and Watanabe, Shinji and Shoeybi, Mohammad and Catanzaro, Bryan and Valle, Rafael and Ping, Wei},
  year={2025}
}

UALM Demonstration

UALM-Gen & UALM: Music Generation
UALM-Gen & UALM: Sound Effect Generation
UALM-Reason: Enrichment with Imaginary Prompt
UALM-Reason: Dialogue
UALM-Reason: Self-Reflection
UALM-Reason: Audio Generation Controllability
UALM-Reason: Audio Understanding Reasoning

UALM-Gen & UALM: Music Generation

Sample 1

Description: Electronic music that has a constant melody throughout with accompanying instruments used to supplement the melody which can be heard in possibly a casual setting

Groundtruth	UALM-Gen	UALM

ETTA	Stable Audio Open	MusicGen

MAGNeT	AudioLDM	TangoFLUX

Sample 2

Description: Delicate orchestral music with a magical Christmas feel

Groundtruth	UALM-Gen	UALM

ETTA	Stable Audio Open	MusicGen

MAGNeT	AudioLDM	TangoFLUX

Sample 3

Description: Relaxing jazz music with soothing melody that contains brass instruments and various keyboards

Groundtruth	UALM-Gen	UALM

ETTA	Stable Audio Open	MusicGen

MAGNeT	AudioLDM	TangoFLUX

Sample 4

Description: A slow paced arty electronic track that features a strange tuned guitar

Groundtruth	UALM-Gen	UALM

ETTA	Stable Audio Open	MusicGen

MAGNeT	AudioLDM	TangoFLUX

Sample 5

Description: Contemporary trendy optimistic indie pop, with dirty drums, happy guitar comping and synthesizer solo

Groundtruth	UALM-Gen	UALM

ETTA	Stable Audio Open	MusicGen

MAGNeT	AudioLDM	TangoFLUX