UALM: Unified Audio Language Model for Understanding, Generation, and Reasoning

Published:

Paper    Code

Author: Jinchuan Tian (equal), Sang-gil Lee (equal), Zhifeng Kong (equal), Sreyan Ghosh, Arushi Goel, Chao-Han Huck Yang, Wenliang Dai, Zihan Liu, Hanrong Ye, Shinji Watanabe, Mohammad Shoeybi, Bryan Catanzaro, Rafael Valle, Wei Ping (Project Lead)

Posted: Zhifeng Kong

Overview

Recent advances in the audio language modeling (ALM) domain tackle audio understanding and text-to-audio generation as separate tasks. Very few studies attempt to unify these tasks – an essential step toward advanced multimodal reasoning. This paper introduces Unified Audio Language Model (UALM), which aims to unify audio understanding, text-to-audio generation, and multimodal reasoning in a single model. To achieve this goal, we first present UALM-Gen, a text-to-audio language model that directly predicts audio tokens and is comparable to state-of-the-art diffusion-based models. We then demonstrate, using proper data blending, training recipes, and inference techniques, that our single UALM model matches the quality of state-of-the-art specialized models in audio understanding, text-to-audio generation, and text reasoning. Furthermore, we present UALM-Reason, a multimodal reasoning model that utilizes both text and audio in the intermediate thinking steps to facilitate complex generation tasks. To our knowledge, this is the first demonstration in audio research of cross-modal generative reasoning, with its effectiveness confirmed by subjective evaluations.

Model Architecture

UALM is based on a decoder-only LLM architecture. The method for audio inputs is similar to LLaVA and Audio Flamingo 3, where our pre-trained Whisper encoder is applied to compute audio features followed by an MLP to compute audio embeddings. The audio outputs are X-Codec tokens (8-layer RVQ tokens).


To efficiently sample audio tokens, we use the delay pattern following MusicGen. Let $A_{n,t}$ be the audio token at time frame $t$ and layer $n$. At step $s$, we predict all 8 tokens $\{A_{1,s}, A_{2,s-1}, \cdots, A_{8,s-7}\}$ in parallel.
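The indexing above can be made concrete with a small sketch. It is illustrative only and is not taken from the UALM code: the function names, the `pad` placeholder, and the toy token values are assumptions, and layers are 0-based here, so layer `n` in the code corresponds to layer $n+1$ in the text.

```python
from typing import List, Optional


def apply_delay_pattern(
    tokens: List[List[int]], pad: Optional[int] = None
) -> List[List[Optional[int]]]:
    """Rearrange tokens[n][t] (layer n, frame t; both 0-based) into decoding steps.

    steps[s][n] holds tokens[n][s - n], i.e. layer n is delayed by n steps.
    Positions that fall outside the clip are filled with `pad`.
    """
    num_layers, num_frames = len(tokens), len(tokens[0])
    num_steps = num_frames + num_layers - 1  # extra steps flush the delayed layers
    return [
        [tokens[n][s - n] if 0 <= s - n < num_frames else pad for n in range(num_layers)]
        for s in range(num_steps)
    ]


def undo_delay_pattern(
    steps: List[List[Optional[int]]], num_frames: int
) -> List[List[Optional[int]]]:
    """Invert apply_delay_pattern: recover tokens[n][t] from steps[t + n][n]."""
    num_layers = len(steps[0])
    return [[steps[t + n][n] for t in range(num_frames)] for n in range(num_layers)]


# Toy example: 8 RVQ layers, 4 frames; each token value encodes (layer, frame).
tokens = [[100 * n + t for t in range(4)] for n in range(8)]
steps = apply_delay_pattern(tokens)
```

Each row of `steps` is what one decoding step emits in parallel, so a clip of $T$ frames takes $T + 7$ steps rather than $8T$ sequential predictions.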


As X-Codec only produces 16kHz mono audio and may have codec artifacts, we further train an Enhancement VAE to improve the quality to 48kHz stereo.

UALM-Gen

One of the major challenges is to support text-to-audio generation within the LLM framework, as recent state-of-the-art models are mostly latent diffusion models. UALM-Gen tackles this problem with data scaling, support for classifier-free guidance (CFG) in the LLM, and DPO. We find all three to be critical for high-quality audio generation. UALM-Gen is a 1.5B LLM that predicts X-Codec audio tokens and matches state-of-the-art diffusion models such as our ETTA model.
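As a rough illustration of two of these ingredients, the sketch below shows the commonly used logit-space form of classifier-free guidance for an autoregressive model and the standard DPO loss for one preference pair. This page does not give UALM-Gen's exact formulation, so the function names, the guidance scale, and `beta` are assumptions.

```python
import math
from typing import List


def cfg_logits(cond: List[float], uncond: List[float], scale: float) -> List[float]:
    """Mix next-token logits from a text-conditioned and an unconditioned pass.

    scale = 1 returns the conditional logits; scale > 1 pushes the
    distribution further toward what the text condition favors.
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # hypothetical value, not from the source
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


guided = cfg_logits(cond=[2.0, 0.0], uncond=[1.0, 0.0], scale=3.0)
```

In this form, guidance costs two forward passes per decoding step (with and without the text prompt), which is the usual price of CFG at inference time.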



UALM

We then combine training data of all modalities and train our 7B unified model, UALM, on audio generation, audio understanding, and text-only tasks. We upweight the audio generation data due to the difficulty of this task, and apply an additional warmup stage before full finetuning.
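One simple way to picture upweighting in a data blend is to scale each source's share of the sampling distribution. The sketch below is an assumption-laden illustration: the source names, sizes, and the 2x factor are made up, and this page does not state the actual blend ratios or the details of the warmup stage.

```python
import random
from typing import Dict


def blend_probabilities(
    sizes: Dict[str, int], upweight: Dict[str, float]
) -> Dict[str, float]:
    """Turn per-source sizes and upweight factors into sampling probabilities."""
    weighted = {name: sizes[name] * upweight.get(name, 1.0) for name in sizes}
    total = sum(weighted.values())
    return {name: w / total for name, w in weighted.items()}


def sample_source(probs: Dict[str, float], rng: random.Random) -> str:
    """Pick the data source for the next training example."""
    names = list(probs)
    return rng.choices(names, weights=[probs[n] for n in names], k=1)[0]


# Hypothetical blend: three equally sized sources, audio generation upweighted 2x.
sizes = {"audio_generation": 100, "audio_understanding": 100, "text_only": 100}
probs = blend_probabilities(sizes, upweight={"audio_generation": 2.0})
next_source = sample_source(probs, random.Random(0))
```

With these toy numbers, audio generation accounts for half of the sampled examples and the other two sources for a quarter each.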

UALM achieves impressive audio generation as shown above, and its audio understanding and text problem-solving abilities are comparable to those of specialized models in each domain. It is worth noting that prior unified understanding and generation models in the vision domain, such as Liquid and Chameleon, have degraded text abilities (MMLU). UALM retains text abilities on par with its base LLM, showing the success of unified training.



UALM-Reason

UALM-Reason unlocks more complex abilities that we call multimodal reasoning: the ability to reason beyond the text domain. UALM-Reason supports three multimodal reasoning abilities with a focus on audio generation:

  • Enrichment: the model enriches a short caption into a complex and detailed caption before generating audio;
  • Dialogue: the model chats with the user and progressively creates a complex caption per the user’s request before generating audio;
  • Self-reflection: the model listens to its own output, and generates an improved version of it.

These abilities show a deep synergy between understanding and generation, marking a significant step towards higher-level intelligence in multimodal models.
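The self-reflection ability can be pictured as a short generate, listen, critique, regenerate loop, mirroring the stages shown in the self-reflection demos on this page. The sketch below is only a control-flow illustration: in UALM-Reason a single model plays every role, whereas here each stage is a hypothetical callable and audio is stood in for by strings.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ReflectionTrace:
    rich_caption: str    # enriched plan derived from the user prompt
    initial_audio: str   # first generation (string stand-in for audio)
    self_analysis: str   # the model's description of what it actually produced
    critique: str        # mismatch between the plan and the self-analysis
    refined_audio: str   # second generation conditioned on the critique


def self_reflect(
    prompt: str,
    enrich: Callable[[str], str],
    generate: Callable[[str], str],
    analyze: Callable[[str], str],
    critique: Callable[[str, str], str],
    refine: Callable[[str, str, str], str],
) -> ReflectionTrace:
    """Run one round of self-reflection with caller-supplied stages."""
    rich_caption = enrich(prompt)
    initial_audio = generate(rich_caption)
    self_analysis = analyze(initial_audio)
    feedback = critique(rich_caption, self_analysis)
    refined_audio = refine(rich_caption, initial_audio, feedback)
    return ReflectionTrace(rich_caption, initial_audio, self_analysis, feedback, refined_audio)


# Stub stages so the sketch runs end to end; none of these are real UALM calls.
trace = self_reflect(
    "drone",
    enrich=lambda p: "rich:" + p,
    generate=lambda c: "audio<" + c + ">",
    analyze=lambda a: "heard:" + a,
    critique=lambda intended, heard: "mismatch(" + intended + " vs " + heard + ")",
    refine=lambda c, a, f: "audio<" + c + "|" + f + ">",
)
```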




Citation

@misc{tian2025ualm,
  title={UALM: Unified Audio Language Model for Understanding, Generation, and Reasoning},
  author={Tian, Jinchuan and Lee, Sang-gil and Kong, Zhifeng and Ghosh, Sreyan and Goel, Arushi and Yang, Chao-Han Huck and Dai, Wenliang and Liu, Zihan and Ye, Hanrong and Watanabe, Shinji and Shoeybi, Mohammad and Catanzaro, Bryan and Valle, Rafael and Ping, Wei},
  year={2025}
}

UALM Demonstration


UALM-Gen & UALM: Music Generation

Sample 1

Description: Electronic music that has a constant melody throughout with accompanying instruments used to supplement the melody which can be heard in possibly a casual setting

Groundtruth, UALM-Gen, UALM
ETTA, Stable Audio Open, MusicGen
MAGNeT, AudioLDM, TangoFLUX

Sample 2

Description: Delicate orchestral music with a magical Christmas feel

Groundtruth, UALM-Gen, UALM
ETTA, Stable Audio Open, MusicGen
MAGNeT, AudioLDM, TangoFLUX

Sample 3

Description: Relaxing jazz music with soothing melody that contains brass instruments and various keyboards

Groundtruth, UALM-Gen, UALM
ETTA, Stable Audio Open, MusicGen
MAGNeT, AudioLDM, TangoFLUX

Sample 4

Description: A slow paced arty electronic track that features a strange tuned guitar

Groundtruth, UALM-Gen, UALM
ETTA, Stable Audio Open, MusicGen
MAGNeT, AudioLDM, TangoFLUX

Sample 5

Description: Contemporary trendy optimistic indie pop, with dirty drums, happy guitar comping and synthesizer solo

Groundtruth, UALM-Gen, UALM
ETTA, Stable Audio Open, MusicGen
MAGNeT, AudioLDM, TangoFLUX

UALM-Gen & UALM: Sound Effect Generation

Sample 1

Description: A series of rapid gunshots firing alongside footsteps running on concrete as a man groans while a muffled heart beats in the background

Groundtruth, UALM-Gen, UALM
ETTA, Stable Audio Open, AudioGen
MAGNeT, AudioLDM, TangoFLUX

Sample 2

Description: A man talking as metal clanks together followed by footsteps on grass as insects buzz in the background

Groundtruth, UALM-Gen, UALM
ETTA, Stable Audio Open, AudioGen
MAGNeT, AudioLDM, TangoFLUX

Sample 3

Description: Water lightly splashing as a bird chirps and wind blows into a microphone

Groundtruth, UALM-Gen, UALM
ETTA, Stable Audio Open, AudioGen
MAGNeT, AudioLDM, TangoFLUX

Sample 4

Description: An engine running followed by the engine revving and tires screeching

Groundtruth, UALM-Gen, UALM
ETTA, Stable Audio Open, AudioGen
MAGNeT, AudioLDM, TangoFLUX

Sample 5

Description: Vibrations from a sewing machine followed by clicks and a man speaking

Groundtruth, UALM-Gen, UALM
ETTA, Stable Audio Open, AudioGen
MAGNeT, AudioLDM, TangoFLUX

UALM-Reason: Enrichment with Imaginary Prompt

Example 1: Peaceful Countryside

**User:** I'm working on a scene that needs a very peaceful and relaxing outdoor setting. I want it to evoke a sense of calm and tranquility, like a perfect summer morning in the countryside.
**Assistant:** Okay, to create that relaxing and tranquil countryside soundscape, I will use these elements: **1. Ambient Nature:** I'll start with a continuous base layer of a gentle outdoor soundscape. This will create a sense of space and natural quietude to define the environment. **2. Distinct Bird Chirp:** Layered on top of the ambient sound, I will add a clear, short chirp of a small songbird, appearing briefly. **3. Insect Hum:** Finally, I will include a low, soft, continuous insect hum, similar to the sound of crickets or cicadas in the distance, to enrich the background ambiance.
**Rich caption:**
- **Keywords:** Bird chirp, Insect hum, Ambient nature
- **Layout:** The audio will start with the continuous, soft sounds of an ambient nature environment. A distinct bird chirp will occur within this continuous background, followed by a light insect hum.
- **Description:**
  - **Bird chirp:** The clear, short, and distinct chirp of a small bird, likely a songbird, appearing briefly.
  - **Insect hum:** A low, soft, and continuous hum, indicative of distant or small insects like crickets or cicadas.
  - **Ambient nature:** A gentle and open outdoor soundscape, with a sense of space and natural quietude.

Generated Audio:


Example 2: Garden After Shower

**User:** I need an ambient soundscape that evokes a quiet, contemplative mood, perhaps suitable for a scene in a park or garden after a light shower. I want it to feel natural and peaceful, with a hint of mystery or anticipation.
**Assistant:** Okay, to create that quiet, contemplative, and slightly mysterious mood in a park or garden after a shower, I will structure the audio with the following elements: **• Rain:** The background will be established with the gentle, continuous sound of light rain falling on various surfaces, providing a soft and consistent ambiance. **• Footsteps:** Shortly after the rain begins, the sound of soft, rhythmic footsteps will start, suggesting someone is walking slowly on a slightly wet or uneven surface. **• Rustling:** Simultaneously with the rain and footsteps, I will include a brief, light rustling sound, similar to paper or leaves being gently moved or disturbed.
**Rich caption:**
- **Keywords:** Rain, Footsteps, Rustling
- **Layout:** The audio will start with the sound of light rain falling, which will continue throughout. Shortly after, the distinct sound of footsteps will begin, followed by a brief rustling sound, all occurring simultaneously with the rain.
- **Description:**
  - **Rain:** The gentle, continuous sound of light rain falling on various surfaces, creating a soft, consistent background ambiance.
  - **Footsteps:** The soft, rhythmic sound of footsteps, suggesting someone walking slowly on a slightly wet or uneven surface.
  - **Rustling:** A brief, light rustling sound, like paper or leaves being gently moved or disturbed.

Generated Audio:


Example 3: Meditation Dreamscape

**User:** I need a calming soundscape for a meditation app. Something that feels peaceful and reflective, almost like a dream.
**Assistant:** Okay, to create that peaceful and reflective dreamscape, I will structure the audio using the following elements: **1. Foundation:** The audio will begin with a sustained synth melody. The melody will be smooth and ethereal, with a slightly melancholic tone and a moderate tempo, establishing a tranquil mood. **2. Layered Elements:** - **Bell Sounds:** Concurrent with the synth melody, clear, high-pitched bell or chime sounds will be introduced, adding a layer of sparkle and delicate texture to the overall sound. - **Light Percussion:** To provide a gentle, underlying pulse, I will add subtle, rhythmic percussion. This will likely be a gentle shaker or a light hi-hat, offering a soft, unobtrusive rhythm that supports the meditative atmosphere.
**Rich caption:**
- **Keywords:** Synth melody, Bell sounds, Light percussion
- **Layout:** The audio will start with a sustained synth melody. Bell sounds will join concurrently, and light percussion will be added to provide a steady rhythm.
- **Description:**
  - **Synth melody:** A smooth, ethereal synth melody with a sustained, slightly melancholic tone and a moderate tempo.
  - **Bell sounds:** Clear, high-pitched bell or chime sounds that play alongside the synth, adding sparkle.
  - **Light percussion:** Subtle, rhythmic percussion, possibly a gentle shaker or light hi-hat, providing a soft pulse.

Generated Audio:


Example 4: Parade Music

**User:** I need a piece of music for a parade scene. It should be high-energy and celebratory, something that really captures the excitement of a street performance.
**Assistant:** Okay, to create that high-energy, celebratory parade music, I will structure the audio as follows: **Brass Band Music:** The audio will begin with the sound of a lively brass band playing an upbeat, rhythmic melody. The music will feature prominent trumpets and trombones, creating a full and energetic sound. **Percussion:** Shortly after the brass band music begins, I will introduce the sound of percussion. This will consist of a steady, driving beat, likely from a drum kit, providing a strong rhythmic foundation for the music. The percussion will continue throughout the piece, accompanying the brass band.
**Rich caption:**
- **Keywords:** Brass band music, Percussion
- **Layout:** The audio will start with the sound of a brass band playing a lively tune. Percussion will join in shortly after, accompanying the brass band throughout.
- **Description:**
  - **Brass band music:** The sound of a lively brass band playing an upbeat, rhythmic melody. The music will feature prominent trumpets and trombones, creating a full and energetic sound.
  - **Percussion:** The sound of percussion, likely a drum kit, providing a steady, driving beat that supports the brass band music.

Generated Audio:


Example 5: Theatrical Climax

**User:** I need an audio piece for a theatrical scene, something that builds to a powerful emotional climax. Think grand and sweeping.
**Assistant:** Okay, to create this grand and emotionally climactic theatrical scene, I will construct the audio as follows: **1. Orchestral Music:** The piece will begin with building, grand orchestral music. This will be a dramatic soundscape with a strong rhythmic pulse and rising intensity, incorporating elements of brass, strings, and percussion. **2. Male Singing:** After the orchestral music has established itself, a male voice will begin singing over the continuous music. This will be a clear, powerful male tenor voice, singing in a dramatic, operatic style and with a distinct vibrato.
**Rich caption:**
- **Keywords:** Orchestral music, Male singing
- **Layout:** The audio will begin with a building orchestral music piece. A male voice will then start singing over the continuous music.
- **Description:**
  - **Orchestral music:** The sound of grand, dramatic orchestral music with a strong rhythmic pulse and a rising intensity, including brass, strings, and percussion.
  - **Male singing:** A clear, powerful male tenor voice singing in a dramatic, operatic style, with a distinct vibrato.

Generated Audio:


Example 6: Playground Energy

**User:** I'm looking for an audio track that captures the energy of a school playground during recess or a busy public park on a sunny day. It should feel vibrant and full of life.
**Assistant:** Okay, I will create a vibrant and lively atmosphere by layering distinct audio elements. The overall structure will be as follows: **1. Children's Voices:** The primary sound element will be the distinct voices of children. These will be a mix of high-pitched, excited, and playful sounds from multiple children, including speech and other playful noises, with some instances of overlapping voices. **2. Background Crowd Noise:** Layered beneath the children's voices, I will include a continuous, low-level murmur of general crowd noise. This will consist of many people talking and moving in the background, helping to create a lively and populated environment.
**Rich caption:**
- **Keywords:** Children's voices, Crowd noise
- **Layout:** The audio will feature the distinct sound of children's voices layered over a general background of crowd noise.
- **Description:**
  - **Children's voices:** A mix of high-pitched, excited, and playful voices of several children speaking and making sounds, sometimes overlapping.
  - **Crowd noise:** The continuous, low-level murmur of many people talking and moving in a public space, creating a lively atmosphere.

Generated Audio:


Example 7: Retro Gaming Nostalgia

**User:** I'm working on a podcast segment about the nostalgia of classic video games. I need audio that captures the feeling of friends hanging out and enjoying those old games together.
**Assistant:** Okay, to capture that feeling of friends enjoying classic video games, I'll combine the following elements: **• Retro game music:** A cheerful, repetitive, and simple melodic tune, reminiscent of music from old video games, will play continuously in the background, setting the scene. **• Male speech:** A male voice with a medium pitch and an excited, somewhat animated tone will speak intermittently over the music. **• Male laughter:** Following the male speech, there will be a short, high-pitched, and very expressive burst of laughter from a man to emphasize the fun and camaraderie.
**Rich caption:**
- **Keywords:** Retro game music, Male speech, Male laughter
- **Layout:** The audio will feature continuous retro game music. Over this music, a male voice will speak intermittently, followed by a short burst of male laughter.
- **Description:**
  - **Retro game music:** A cheerful, repetitive, and simple melodic tune, reminiscent of music from old video games, playing continuously in the background.
  - **Male speech:** A man speaking with a medium pitch and an excited, somewhat animated tone.
  - **Male laughter:** A short, high-pitched, and very expressive burst of laughter from a man.

Generated Audio:


Example 8: Fitness App Advertisement

**User:** I'm putting together a high-energy advertisement for a new fitness app. I need something with a modern, edgy feel that will really pump people up and get them motivated. Think powerful visuals and a fast-paced narrative.
**Assistant:** Okay, I will create a soundscape to support your high-energy, edgy fitness app advertisement, focusing on a motivational feel. Here's the audio plan: **1. Electronic Music:** I will start with driving, uptempo electronic music. This track will feature a strong, repetitive drum beat, prominent synthesizers, and a slightly distorted texture to give it a high-energy feel. The music will be continuous and designed to build in intensity over time. **2. Male Speech:** A male voice will overlay the electronic music. This voice will have a slightly rough quality and will deliver lines in a rhythmic style, akin to chanting or rapping. The delivery style will be matched to the energy of the music to create a cohesive, motivational experience.
**Rich caption:**
- **Keywords:** Electronic music, Male speech
- **Layout:** The audio will start with electronic music featuring a driving beat. A male voice will then speak over the continuous music, which gradually increases in intensity.
- **Description:**
  - **Electronic music:** Uptempo electronic music with a strong, repetitive drum beat, synthesizers, and a slightly distorted, high-energy feel. The music will build in intensity.
  - **Male speech:** A male voice, slightly rough in quality, will speak in a rhythmic, chanting, or rap-like manner, matching the energy of the music.

Generated Audio:


Example 9: Racing Game Commentary

**User:** I'm working on a racing game and need audio for an exciting moment during a race. I want the player to feel the adrenaline and hear the intensity of the competition, with commentary adding to the thrill.
**Assistant:** Okay, I will generate the audio to convey an exciting moment in a race with accompanying commentary: **1. Initial Sound Event:** The audio will begin with the prominent sound of a race car engine. This will be a continuous, very loud, high-pitched whine and roar, representing a powerful engine at high speed and performance. **2. Overlapping Sound Event:** While the race car engine sound continues in the background, a male voice will start speaking. The male speech will consist of a man speaking with an excited, slightly strained voice, likely due to the surrounding noise and his emotion. The pitch of his voice will be medium.
**Rich caption:**
- **Keywords:** Race car engine, Male speech
- **Layout:** The audio will start with the loud, high-pitched sound of a race car engine. A male voice will then begin speaking, while the race car engine sound continues in the background.
- **Description:**
  - **Race car engine:** The continuous, very loud, high-pitched whine and roar of a powerful race car engine, indicating high speed and performance.
  - **Male speech:** A man speaking with an excited, slightly strained voice, likely due to the surrounding noise and his emotion. His pitch is medium.

Generated Audio:


UALM-Reason: Dialogue

Example 1: Rock Concert

**User:** I need audio of rock music with audience cheering.
**Assistant:** Okay, is there any singing involved, and how is it arranged in the audio?
**User:** Yes, there's group singing within the music. The audience cheering is present concurrently, especially at the end.
**Assistant:** Got it. Can you give me more details about the rock music and the singing?
**User:** The rock music is loud and fast-paced with electric guitars. The group singing features multiple male voices in an anthemic style. Now start the audio generation.
**Assistant:** I'll generate the audio based on your specifications:
**Rich caption:**
- **Keywords:** Rock music, Group singing, Audience cheering
- **Layout:** The audio will feature energetic rock music continuously. Group singing will occur within the music, and the sound of an audience cheering will be present concurrently, especially at the end.
- **Description:**
  - **Rock music:** A loud, fast-paced rock song with a strong beat, prominent electric guitars, and a powerful, driving rhythm.
  - **Group singing:** Multiple male voices singing together in a melodic, anthemic style, with a slightly raw, passionate quality.
  - **Audience cheering:** The enthusiastic sound of a crowd cheering and shouting, indicating an excited live performance atmosphere.

Generated Audio:


Example 2: Music with Singing

**User:** I need audio of music with someone singing.
**Assistant:** What kind of music, and what kind of singing?
**User:** The music should be upbeat and rhythmic, with a clear melody played on guitar and drums. A male voice will be singing along.
**Assistant:** Got it. Can you describe the male voice in more detail?
**User:** The male voice sings lyrics in a clear, medium-pitched tone, harmonizing with the instrumental music. Now start the audio generation.
**Assistant:** I'll generate the audio based on your specifications:
**Rich caption:**
- **Keywords:** Music, Male singing
- **Layout:** The audio will feature continuous music throughout. A male voice will be heard singing along with the music.
- **Description:**
  - **Music:** The sound of upbeat, rhythmic music with a clear melody played on instruments like a guitar and drums. It has a cheerful and consistent tempo.
  - **Male singing:** A male voice singing lyrics in a clear, medium-pitched tone, harmonizing with the instrumental music.

Generated Audio:


Example 3: Acoustic Guitar Track

**User:** I need an audio track with acoustic guitar, percussion, and male vocals.
**Assistant:** Okay, in what order should these elements appear?
**User:** The guitar starts first, then the percussion joins, and finally the male voice begins singing over both.
**Assistant:** Got it. Can you describe the guitar sound in more detail?
**User:** It's a clean, bright acoustic guitar playing a repeating riff. Now start the audio generation.
**Assistant:** I'll generate the audio based on your specifications:
**Rich caption:**
- **Keywords:** Acoustic guitar, Percussion beat, Male singing
- **Layout:** The audio will start with a clear acoustic guitar melody, quickly joined by a steady, rhythmic percussion beat. A male voice will then begin singing, layered over the continuous music.
- **Description:**
  - **Acoustic guitar:** The sound of a clean, bright acoustic guitar playing a melodic, repeating riff.
  - **Percussion beat:** A regular, driving drum beat, typical of pop or rock music, providing a consistent rhythm.
  - **Male singing:** A male voice singing melodically, with a medium pitch and a clear, slightly energetic tone.

Generated Audio:


Example 4: Car Engine

**User:** I need audio of a car engine.
**Assistant:** What actions should the engine be performing?
**User:** It needs to crank, start, idle, and rev. Also, there should be a male voice.
**Assistant:** Okay. Can you describe the timing of these sounds?
**User:** The engine cranks first, then starts and idles. The male voice speaks while the engine is idling and revving, before it fades out. Now start the audio generation.
**Assistant:** I'll generate the audio based on your specifications:
**Rich caption:**
- **Keywords:** Engine cranking, Engine starting, Engine idling, Engine revving, Male speech
- **Layout:** The audio will begin with an engine cranking, followed by the engine starting and settling into an idle. A male voice will then speak briefly, while the engine sound continues, occasionally revving before fading out.
- **Description:**
  - **Engine cranking:** The sound of a vehicle engine turning over several times, struggling to start.
  - **Engine starting:** The engine successfully ignites and begins to run.
  - **Engine idling:** The continuous, low rumble of a vehicle engine running steadily.
  - **Engine revving:** The engine's RPMs increasing significantly, then decreasing back to an idle.
  - **Male speech:** A single male voice speaking in a clear, medium-pitched tone, conveying information.

Generated Audio:


Example 5: Harmonica Music

**User:** I need a short audio clip of harmonica music.
**Assistant:** Sure! Is there a specific kind of melody you'd like the harmonica to play?
**User:** Yes, it should be a lively melody.
**Assistant:** Great! Can you tell me more about the harmonica's sound itself?
**User:** The harmonica should have a warm and slightly breathy quality, playing a clear, bright tune. The melody should be cheerful and upbeat, with quick, flowing notes, creating a joyful atmosphere. Now start the audio generation.
**Assistant:** I'll generate the audio based on your specifications:
**Rich caption:**
- **Keywords:** Harmonica music, Lively melody
- **Layout:** The audio will feature continuous harmonica music, playing a lively melody throughout the clip.
- **Description:**
  - **Harmonica music:** The sound of a harmonica playing a clear, bright tune. The instrument will have a warm, slightly breathy quality.
  - **Lively melody:** I will generate a cheerful and upbeat melody, characterized by quick, flowing notes, creating a joyful atmosphere.

Generated Audio:


Example 6: Music and Vehicle

**User:** I want to generate audio of music playing and a vehicle passing by.
**Assistant:** Okay, can you describe the music you want?
**User:** I want electronic music with a strong, repetitive beat, a slightly dark, energetic feel, a clear melody, and a rhythm. Then a vehicle will pass by.
**Assistant:** Understood. How does the vehicle sound, and how does it interact with the music?
**User:** The vehicle, likely a car or truck, passes quickly from left to right with a distinct engine rumble and tire noise. It briefly pauses or overlaps with the music. Then the music continues. Now start the audio generation.
**Assistant:** I'll generate the audio based on your specifications:
**Rich caption:**
- **Keywords:** Music, Vehicle passing
- **Layout:** The audio will start with music playing. A vehicle will then pass by, causing a brief pause or overlay with the music, and then the music will continue.
- **Description:**
  - **Music:** I will generate electronic music with a strong, repetitive beat and a slightly dark, energetic feel. The music will have a clear melody and rhythm.
  - **Vehicle passing:** The sound of a vehicle, likely a car or truck, passing quickly from left to right with a distinct engine rumble and tire noise.

Generated Audio:


Example 7: Walking on Gravel

**User:** I need the sound of someone walking.
**Assistant:** What surface are they walking on?
**User:** They are walking on gravel. Now start the audio generation.
**Assistant:** I'll generate the audio based on your specifications:
**Rich caption:**
- **Keywords:** Footsteps, Gravel
- **Layout:** The audio will continuously feature the sound of footsteps walking on a gravel surface.
- **Description:**
  - **Footsteps:** I will generate the sound of steady, rhythmic footsteps, suggesting a person walking at a moderate pace.
  - **Gravel:** The distinct, crisp crunching sound of small, loose stones underfoot, indicating a gravel path or surface.

Generated Audio:


Example 8: Human Speech and Engines

**User:** I need an audio scene with human speech and engine sounds.
**Assistant:** What kind of human speech and engine sounds are you looking for?
**User:** I want multiple male voices, talking in a conversational tone, and a consistent engine hum. Also, a vehicle should pass by towards the end.
**Assistant:** Got it. Could you describe the passing vehicle sound in more detail?
**User:** It should have the distinct sound of an engine accelerating and quickly passing by, causing a brief, louder roar. Now start the audio generation.
**Assistant:** I'll generate the audio based on your specifications:
**Rich caption:**
- **Keywords:** Human speech, Engine hum, Vehicle passing
- **Layout:** The audio will start with the sound of human speech, which will be accompanied by a continuous engine hum. Towards the end, a vehicle will pass by with a distinct engine sound.
- **Description:**
  - **Human speech:** Multiple male voices speaking, with a conversational, sometimes excited, tone. Their pitch is generally in the medium range.
  - **Engine hum:** A low, steady rumble of an engine, consistent throughout most of the clip, suggesting a vehicle is idling or moving slowly in the background.
  - **Vehicle passing:** The distinct sound of an engine accelerating and a vehicle quickly passing by, causing a brief, louder roar.

Generated Audio:


Example 9: Bird Calls and Traffic

**User:** I need an audio track with bird calls and some background noise.
**Assistant:** What kind of background noise would you like?
**User:** I want it to be the sound of distant traffic, very faint and steady. The bird calls should be continuous.
**Assistant:** Understood. Can you describe the bird calls in more detail?
**User:** They should sound like multiple small birds, with high-pitched chirps and melodic tweets, like various bird songs. Now start the audio generation.
**Assistant:** I'll generate the audio based on your specifications:
**Rich caption:**
- **Keywords:** Bird calls, Distant traffic
- **Layout:** The audio will feature the continuous sound of bird calls, with a faint, steady hum of distant traffic present throughout.
- **Description:**
  - **Bird calls:** I will generate the sound of multiple small birds making high-pitched chirps and melodic tweets, resembling various bird songs.
  - **Distant traffic:** I will generate a very low, consistent rumble of vehicles from a far distance, blending into the background.

Generated Audio:


Example 10: Musical Piece

**User:** I want some audio of a musical piece.
**Assistant:** What instruments are involved and what are their roles?
**User:** It features a plucked string instrument playing a melody, a wind instrument joining in with its own melody, and some subtle percussion.
**Assistant:** Got it. Could you describe the sound of each instrument?
**User:** The plucked string is bright and gentle, like a guitar. The wind instrument sounds airy and smooth, like a pan flute. The percussion is light and rhythmic, almost like shakers. Now start the audio generation.
**Assistant:** I'll generate the audio based on your specifications:
**Rich caption:**
- **Keywords:** Plucked string melody, Wind instrument melody, Percussion
- **Layout:** The audio will start with a plucked string instrument playing a melody. A wind instrument will then join in, playing its own melody over the continuous string sound. Light percussion will accompany both instruments throughout the piece.
- **Description:**
  - **Plucked string melody:** The clear, bright sound of a plucked string instrument, like an acoustic guitar or ukulele, playing a gentle, flowing melody.
  - **Wind instrument melody:** A smooth, airy melody played by a woodwind instrument, possibly a pan flute, with a distinct, breathy tone, harmonizing with the strings.
  - **Percussion:** The subtle, rhythmic sound of light percussion, providing a steady, gentle beat with a shakers-like quality.

Generated Audio:


UALM-Reason: Self-Reflection

Example 1: Electronic Drone

**User:** Electronic drone with rhythmic pulsing changes.
**Assistant (Initial Generation):** I'll generate an electronic drone with rhythmic pulsing changes.
**Initial Generation Details:**
- **Keywords:** Electronic drone, Pulsing changes
- **Layout:** The audio will feature a continuous electronic drone that undergoes rhythmic pulsing changes in its sound qualities.
- **Description:**
  - **Electronic drone:** I will generate a sustained, low-to-mid frequency electronic hum, resembling a musical synth pad. It will have a slightly hollow, ethereal quality.
  - **Pulsing changes:** Over the continuous drone, I will introduce a rhythmic, almost 'wobbling' or 'pulsing' effect, where the drone's texture and possibly its volume or filter sweep gently and consistently change, creating a sense of movement without distinct breaks.

Initial Generated Audio:

**Self-Analysis of Generated Audio:**
- **Keywords:** Electronic music, Rhythmic pulse
- **Layout:** The audio will continuously feature electronic music with a prominent rhythmic pulse throughout its duration.
- **Description:**
  - **Electronic music:** I will generate a sustained, atmospheric electronic music sound, characterized by a smooth, evolving tone.
  - **Rhythmic pulse:** A clear, steady, medium-speed rhythmic pulse will be present, providing a consistent beat to the electronic music.

**Critique:** The generated audio has a prominent rhythmic pulse separate from the electronic music. It should be corrected to have the electronic drone itself undergo rhythmic pulsing changes in its sound qualities, such as texture, volume, or filter sweep.

Final Refined Audio:


Example 2: Chaotic Bells

**User:** Chaotic and rapid ringing of large metallic bells.
**Assistant (Initial Generation):** I'll generate chaotic and rapid ringing of large metallic bells.
**Initial Generation Details:**
- **Keywords:** Bells
- **Layout:** The audio will consist solely of the continuous sound of bells ringing.
- **Description:**
  - **Bells:** I will generate the sound of multiple large, metallic bells ringing rapidly and chaotically, creating a sustained, resonant, and somewhat clanging musical texture.

Initial Generated Audio:

**Self-Analysis of Generated Audio:**
- **Keywords:** Bells ringing
- **Layout:** The audio will feature a continuous sound of bells ringing throughout.
- **Description:**
  - **Bells ringing:** I will generate the clear, melodic sound of multiple metal bells, possibly church bells, ringing in a distinct, harmonious pattern.

**Critique:** The generated audio has bells ringing in a distinct, harmonious pattern. It should be corrected to have bells ringing rapidly and chaotically, creating a sustained, resonant, and somewhat clanging musical texture.

Final Refined Audio:


Example 3: Insect Buzzing

**User:** Insect buzzing loudly, like a fly flying close by.
**Assistant (Initial Generation):** I'll generate the sound of an insect buzzing loudly, like a fly flying close by.
**Initial Generation Details:**
- **Keywords:** Insect buzzing
- **Layout:** The audio will feature the continuous sound of an insect buzzing throughout the clip.
- **Description:**
  - **Insect buzzing:** I will generate the sound of a large insect, like a fly, buzzing loudly and flying close by, with distinct wing flaps and a sustained, vibrant hum.

Initial Generated Audio:

**Self-Analysis of Generated Audio:**
- **Keywords:** Insect buzzing
- **Layout:** I will generate a continuous sound of an insect buzzing from beginning to end.
- **Description:**
  - **Insect buzzing:** The continuous, high-pitched hum of multiple flying insects, creating a steady, slightly fluctuating buzzing sound.

**Critique:** The generated audio has the sound of multiple flying insects buzzing. It should be corrected to have the sound of a single large insect, like a fly, buzzing loudly and flying close by, with distinct wing flaps and a sustained, vibrant hum.

Final Refined Audio:


Example 4: Female Speech with Typing

**User:** Female speech with keyboard typing in the background.
**Assistant (Initial Generation):** I'll generate female speech with keyboard typing in the background.
**Initial Generation Details:**
- **Keywords:** Female speech, Keyboard typing
- **Layout:** The audio will feature a female voice speaking continuously, with the sound of keyboard typing happening concurrently in the background.
- **Description:**
  - **Female speech:** A young female voice speaking at a moderate pace in a conversational and slightly earnest tone.
  - **Keyboard typing:** The sound of individual keys being pressed on a computer keyboard, creating light, distinct clicks.

Initial Generated Audio:

**Self-Analysis of Generated Audio:**
- **Keywords:** Keyboard typing, Female speech
- **Layout:** The audio will feature the distinct sound of keyboard typing, which will be consistently present. A female voice will speak intermittently over the typing sounds.
- **Description:**
  - **Keyboard typing:** The crisp, regular sound of mechanical keyboard keys being pressed at a moderate pace, suggesting active typing.
  - **Female speech:** A soft, calm female voice speaking in a low to medium pitch, with a clear and steady rhythm, as if reading or dictating.

**Critique:** The generated audio has intermittent female speech over consistent keyboard typing. It should be corrected to have continuous female speech with keyboard typing in the background.

Final Refined Audio:


Example 5: Wind and Car

**User:** Fast wind with a car passing quickly.
**Assistant (Initial Generation):** I'll generate the sound of fast wind with a car passing quickly.
**Initial Generation Details:**
- **Keywords:** Fast wind, Car passing
- **Layout:** The audio will begin with the strong sound of fast wind. As the wind continues, a car will quickly pass by, contributing to the overall sense of movement and speed.
- **Description:**
  - **Fast wind:** I will generate the sound of strong, gusty wind with a distinct whooshing quality, suggesting high speed and pressure.
  - **Car passing:** The sound of a car quickly moving past the listener, characterized by a brief, low-pitched engine rumble and tire noise that rapidly approaches and then recedes.

Initial Generated Audio:

**Self-Analysis of Generated Audio:**
- **Keywords:** Car pass-by, Tire screech
- **Layout:** The audio will begin with a car accelerating rapidly. A sharp tire screech will occur shortly after, as the car passes by at high speed.
- **Description:**
  - **Car pass-by:** I will generate the sound of a powerful car engine, starting from a moderate hum, then rapidly accelerating and passing by quickly with a rising and falling pitch.
  - **Tire screech:** I will generate a distinct, high-pitched screeching sound, characteristic of tires skidding or turning sharply on pavement.

**Critique:** The generated audio has a tire screech. It should be corrected to have the sound of fast wind at the beginning of the audio. The car passing should also be characterized by a brief, low-pitched engine rumble and tire noise that rapidly approaches and then recedes, instead of a tire screech.

Final Refined Audio:

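The five self-reflection examples above follow the same sequence: plan a rich caption from the prompt, generate audio, describe what was actually generated, critique the mismatch, and generate again. A minimal Python sketch of that control flow follows. Every name in it (`self_reflect`, `ReflectionTrace`, and the five callables) is hypothetical and chosen for illustration only; UALM-Reason performs these steps within a single model, and this sketch does not reflect its actual interface.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ReflectionTrace:
    """Everything produced during one self-reflection pass."""
    intended_caption: str   # rich caption planned from the user prompt
    initial_audio: Any      # first generation attempt
    observed_caption: str   # rich caption describing what was actually generated
    critique: str           # mismatch between intended and observed captions
    final_audio: Any        # refined generation


def self_reflect(
    prompt: str,
    plan: Callable[[str], str],
    generate: Callable[[str], Any],
    describe: Callable[[Any], str],
    critique: Callable[[str, str], str],
    refine: Callable[[str, str], Any],
) -> ReflectionTrace:
    # All five callables are hypothetical stand-ins for model behavior.
    intended = plan(prompt)                  # "Initial Generation Details"
    initial_audio = generate(intended)       # "Initial Generated Audio"
    observed = describe(initial_audio)       # "Self-Analysis of Generated Audio"
    feedback = critique(intended, observed)  # "Critique"
    final_audio = refine(intended, feedback) # "Final Refined Audio"
    return ReflectionTrace(intended, initial_audio, observed, feedback, final_audio)


# Toy stand-ins: "audio" is just a string so the control flow can be run end to end.
trace = self_reflect(
    prompt="Insect buzzing loudly, like a fly flying close by.",
    plan=lambda p: "single large insect buzzing",
    generate=lambda caption: "audio<multiple insects buzzing>",
    describe=lambda audio: audio[len("audio<"):-1],
    critique=lambda want, got: f"has '{got}', should be '{want}'",
    refine=lambda caption, feedback: f"audio<{caption}>",
)
print(trace.critique)
```

The toy stand-ins mimic Example 3: the first attempt contains several insects, the self-analysis reports that, and the critique states the gap that the refined generation should close.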

UALM-Reason: Audio Generation Controllability

Group 1: Quantity and Spatial Control

Demonstrating control over the number of sound sources and their spatial distance.

| Multiple Sources | Single Source | Distant Source |
| --- | --- | --- |
| “Many dogs bark” | “A dog barks” | “A dog barks in the far distance” |

Group 2: Temporal Ordering Control

Demonstrating control over the temporal sequence of different sound events.

| Applause First | Water First, Speech Later |
| --- | --- |
| “Applause and a male singing voice. Applause is heard earlier” | “Water sound and female speech. Speech is heard later” |

Group 3: Recording Quality Control

Demonstrating control over the recording quality and fidelity of the generated audio.

| High-Quality Recording | Low-Quality Recording |
| --- | --- |
| “High-quality female singing voice” | “Female singing voice from a bad recording device” |

UALM-Reason: Audio Understanding Reasoning

Example 1: Transport Scenario

Audio Input:

**User:** In what scenario does this sound occur? Choose the correct option from the following options: (A) On a plane (B) On a ship (C) In a car (D) On a train.
**Assistant:** The presence of mechanical sounds, the mention of landing, and the statement about heading back to the gate suggest that the audio is from a vehicle that experiences mechanical issues. The context points to a plane, as it is the most common mode of transport where such issues would lead to a return to the gate. Therefore, the answer is **(A) On a plane**.

Example 2: Sport Identification

Audio Input:

**User:** Determine what sport is being described in the audio? Choose the correct option from the following options: (A) Golf (B) Tennis (C) Baseball (D) Bowling.
**Assistant:** The phrases 'bogey', 'right shoulder really high', 'sling that ball in from right to left' and 'get that ball in' are indicative of a sport where a player aims to get a ball into a hole. Therefore, the answer is **(A) Golf**.

Example 3: Cat Behavior

Audio Input:

**User:** What are the two cats doing in the video? Choose the correct option from the following options: (A) Grooming each other (B) Fighting (C) Playing together (D) Sitting together resting.
**Assistant:** The audio features sounds of meowing and hissing, which are typically associated with cats interacting in a confrontational or playful manner. The presence of a metal object being moved suggests an active environment, further supporting the idea of interaction rather than passive rest. Therefore, the most likely scenario is: **(B) Fighting**.

Example 4: Music Rhythm Change

Audio Input:

**User:** At which second does the rhythm of the music suddenly change? Choose the correct option from the following options: (A) 5 (B) 6 (C) 3 (D) 4.
**Assistant:** The music starts with a steady beat, but around the 3-second mark, there's a noticeable shift in the rhythm, suggesting a change in the musical structure or arrangement. Therefore, the answer is: **(C) 3**.

Example 5: Speaker Age Identification

Audio Input:

**User:** Is the second speaker in the audio a child or an adult? Choose the correct option from the following options: (A) Adult (B) Child.
**Assistant:** The audio features a female voice asking a question about wishes, followed by a response about toys. The speech characteristics indicate the speaker is a child, specifically between the ages of 4-6 years old. Therefore, the second speaker is a child. Therefore, the answer is **(B) Child**.