Audio Flamingo 3
Advancing Audio Intelligence with Fully Open Large Audio Language Models
[Paper] [Code] [Website Demo]
Audio Flamingo 3 - 7B: [Gradio] [Checkpoints]
Audio Flamingo 3-Chat - 7B: [Gradio] [Checkpoints]
Audio Flamingo 3 - Datasets: [AudioSkills-XL] [LongAudio-XL] [AF-Think] [AF-Chat]
Authors: Arushi Goel★, Sreyan Ghosh★, Jaehyeon Kim, Sonal Kumar, Zhifeng Kong, Sang-gil Lee, Chao-Han Huck Yang, Ramani Duraiswami, Dinesh Manocha, Rafael Valle, Bryan Catanzaro
Posted: Zhifeng Kong
★ Equal contribution, Alphabetical order
Overview


In this paper, we introduce Audio Flamingo 3, a fully open-source Large Audio Language Model (LALM) with state-of-the-art performance in audio understanding and reasoning across 20+ benchmarks. In addition, AF3 brings several novel capabilities, including multi-turn, multi-audio chat, on-demand thinking, voice-to-voice interaction, and long-context audio reasoning (up to 10 minutes).
- AF3 introduces key capabilities including: (i) long-context audio QA (extending beyond sounds and including speech), and (ii) flexible, on-demand thinking, enabling the model to generate concise, CoT-style reasoning steps when prompted.
- We also present AF3-Chat, a fine-tuned variant of AF3 designed for multi-turn, multi-audio chat and voice-to-voice interaction.
- We propose novelties in data curation, audio encoder representation learning, and training strategies. Because AF3 is fully open, we release our code, training recipes, and 4 new datasets to promote research in this space.
Expert Reasoning and Long Audio Understanding
We propose AudioSkills-XL, a large-scale audio question-answering (AQA) dataset designed to develop (large) audio-language models on expert-level reasoning and problem-solving tasks over short audio clips (≤30 seconds). It expands upon the original AudioSkills collection (proposed in Audio Flamingo 2) by adding approximately 4.5 million new QA pairs, resulting in a total of ~10 million diverse examples. AudioSkills-XL focuses on seven primary skills for sounds and music:
- Temporal Reasoning: Understanding temporal relationships in audio (order, attribute changes, referring, grounding).
- Attribute Identification: Recognizing specific event properties (e.g., loudness, speaker gender).
- Counting: Quantifying occurrences of target sounds at varying difficulty levels.
- Contextual Sound Event Reasoning: Inferring the purpose or cause of a sound in its acoustic context.
- Contextual Speech Event Reasoning: Explaining spoken utterances in relation to surrounding sounds or dialogue.
- Information Extraction: Pulling out detailed facts, entities, or responses from audio content.
- General Reasoning: Addressing complex questions that combine multiple reasoning skills.
and 6 primary skills for speech:
- Sarcasm Identification: Inferring sarcasm from speech by analyzing content, tone, and emotional cues.
- Emotional State Reasoning: Identifying a speaker’s emotion, reasoning about its cause, and explaining any emotion flips.
- Topic Relationship Reasoning: Determining how two ideas or topics relate within the conversation.
- Information Extraction (IE): Needle QA, Causal QA, Response QA, and Topic QA for extracting specific facts, causes, responses, or main topics.
- Summarization: Producing a concise summary of the speech content.
- Order: Temporal Order, Temporal Attribute, Temporal Referring, and Temporal Grounding to locate and sequence topics over time.
We also propose LongAudio-XL, a large-scale long-audio question-answering (AQA) dataset designed to develop (large) audio-language models on reasoning and problem-solving tasks over long audio clips (30 seconds to 10 minutes). It expands upon the original LongAudio collection (proposed in Audio Flamingo 2) by adding approximately 1 million new QA pairs for long speech, resulting in a total of ~1.25 million diverse examples. LongAudio-XL focuses on six primary skills for sounds and music:
- Captioning: Generate comprehensive descriptions of long audio, capturing key events and the overall context.
- Plot QA: Answer questions about the audio’s narrative or storyline, reasoning over temporal and causal relationships.
- Temporal QA: Identify when events occur and how they relate in time, including sequencing, overlap, and attribute changes.
- Needle QA: Locate and reason about a specific “needle” segment within a longer audio “haystack,” ensuring answers reference that segment.
- Subscene QA: Answer questions about a distinct subscene in the audio, requiring focus on localized events and details.
- General QA: Address broad, open-ended questions spanning multiple events or themes, demonstrating overall comprehension.
and 6 primary skills for speech, similar to those in AudioSkills-XL.
On-demand Thinking
We propose AF-Think, a novel dataset designed to develop on-demand thinking capabilities in audio-language models. It consists of 500K QA pairs, where each question is designed to require a chain-of-thought (CoT) reasoning process. Each example augments an original QA pair with a short thought prefix that precedes the answer and a special suffix on the question that triggers thinking only when requested. AF-Think examples are sampled from both AudioSkills-XL and LongAudio-XL to cover diverse audio lengths and reasoning skills.
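The prefix-and-suffix construction can be sketched as follows. This is only an illustration under stated assumptions: the `THINK_SUFFIX` string, the `QAPair` class, and the helper names are hypothetical and are not taken from the AF-Think release, whose actual trigger text is not given here.

```python
from dataclasses import dataclass

# Hypothetical trigger text (assumption); the real AF-Think suffix may differ.
THINK_SUFFIX = "Please think step by step before answering."


@dataclass
class QAPair:
    question: str
    answer: str


def build_prompt(question: str, think: bool) -> str:
    """Append the thinking suffix only when reasoning is requested."""
    return f"{question} {THINK_SUFFIX}" if think else question


def build_target(thought: str, answer: str, think: bool) -> str:
    """Place the short thought before the answer when thinking is on."""
    return f"{thought} {answer}" if think else answer


def to_think_example(pair: QAPair, thought: str) -> QAPair:
    """Turn a plain QA pair into a thinking-style training example."""
    return QAPair(
        question=build_prompt(pair.question, think=True),
        answer=build_target(thought, pair.answer, think=True),
    )


base = QAPair("How many unique sounds are present in this audio?", "(C) 4")
example = to_think_example(base, "The audio contains four distinct sound events.")
```

Keeping the plain pair and the thinking pair side by side in training data is one way a model could learn to reason only when the suffix is present.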
Multi-turn, Multi-audio Chat
We also propose AF-Chat, a high-quality fine-tuning dataset of ~75K multi-turn, multi-audio conversations (avg. 4.6 clips & 6.2 turns; range 2–8 clips & 2–10 turns) spanning speech, environmental sounds, and music.
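To make the clip and turn statistics concrete, here is a minimal sketch of how such a conversation could be represented and checked against the reported ranges. The `Turn` and `Conversation` classes are hypothetical and do not reflect the released AF-Chat schema; the sketch also assumes that one turn is a single user-assistant exchange.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Turn:
    user: str
    assistant: str
    # Set only when this turn introduces a new audio clip (assumption).
    audio_path: Optional[str] = None


@dataclass
class Conversation:
    turns: List[Turn] = field(default_factory=list)

    def num_clips(self) -> int:
        """Count the turns that introduce a new audio clip."""
        return sum(1 for t in self.turns if t.audio_path is not None)

    def in_reported_range(self) -> bool:
        """Check the ranges quoted above: 2-8 clips and 2-10 turns."""
        return 2 <= self.num_clips() <= 8 and 2 <= len(self.turns) <= 10


def mean_stats(conversations: List[Conversation]) -> Tuple[float, float]:
    """Return (average clips, average turns) over a non-empty list."""
    n = len(conversations)
    avg_clips = sum(c.num_clips() for c in conversations) / n
    avg_turns = sum(len(c.turns) for c in conversations) / n
    return avg_clips, avg_turns


conv = Conversation([
    Turn("What makes this narration calm?", "The steady tone.", "clip1.wav"),
    Turn("Do the nature sounds help?", "Yes, they add warmth."),
    Turn("How does his tone change the meaning?", "It adds disgust.", "clip2.wav"),
    Turn("Would music mask it?", "It could soften it."),
    Turn("How does a sigh communicate?", "Through length and force.", "clip3.wav"),
])
```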
Samples
Emergent Audio Understanding
Audio Flamingo 3 understands audio with unusual sounds never seen during training, showing emergent capabilities.
Audio | Instruction | Ground-truth | Audio Flamingo 3 |
---|---|---|---|
🔊 | What is surprising about the relationship between the barking and the music? | Dogs barking in sync with the music | The barking and music are synchronized, creating a rhythmic and harmonious interaction. |
🔊 | What makes this saxophone performance unlike typical saxophone sounds? | Aside from “Interstellar Spaces”, Saxophones do not bark nor meow | The saxophone is played in a way that resembles a dog howling, creating a unique and unusual sound. |
🔊 | What makes this combination of acoustic and ambient sounds unlikely to appear together? | Unlikely existence of banjo and rain sounds at the same time. | The contrast between the lively music and the sound of raindrops creates an unusual auditory experience. |
Long audio and speech understanding
Below we show how Audio Flamingo 3 outperforms baselines on long audio and speech understanding, for audio up to 10 minutes in length.
Audio | Ground-truth | Prev. SOTA | Audio Flamingo 3 |
---|---|---|---|
🔊 Instruction: During the sequence with a man walking down a hallway, what indicates a rising tension in the audio? | The orchestra playing a waltz with a crescendo movement by timpani at the end builds tension. | The sound of a gun being cocked and fired indicates a rising tension in the audio. | The orchestra plays a dramatic tune with a crescendo, featuring a string section and timpani, creating a suspenseful atmosphere. |
🔊 Instruction: How does the tone of female speech change throughout the audio? (A) Sad to happy (B) Happy to sad (C) Neutral to happy (D) Happy to neutral. | (A) Sad to happy | (B) Happy to sad | (A) Sad to happy |
🔊 Instruction: In the conversation provided, is the comment ‘Wow! Spinning that sounds like fun.’ made by the speaker sarcastic? | Yes, the comment is sarcastic. The tone of ‘Wow!’ followed by ‘Spinning that sounds like fun.’ in response to someone being spun off the bed suggests sarcasm, as being spun off the bed is typically not considered fun. | No, the comment is not sarcastic; it is an expression of genuine surprise and amusement. | Yes, the comment is sarcastic. The speaker uses a tone of disgust to express that the situation was not enjoyable. |
🔊 Instruction: Why does one speaker express disbelief about the possibility of scoring 222 points in a football game? | The speaker is surprised because they primarily watch pro football, where such high scores are unheard of. | The speaker considers it highly unlikely for a college team to score that many points in a game, especially against a lower-ranked team. | The speaker finds it hard to believe because they are more accustomed to watching professional football, where such high scores are less common. |
🔊 Instruction: What is the specific reason given for why the huddle was originally invented in American football? | The huddle was invented because of a deaf quarterback named Paul D. Hubbard. | The huddle was originally invented because of a deaf quarterback. | The huddle was originally invented because of a deaf quarterback named Paul D. Hubbard. |
Chain-of-Thought Reasoning
Audio Flamingo 3 supports flexible, on-demand reasoning, steering the model to correct predictions with accurate reasoning steps.
Audio | Instruction | Ground-truth | AF3 Think |
---|---|---|---|
🔊 | How many unique sounds are present in this audio? (A) 2 (B) 3 (C) 4 (D) 5 | (C) 4 | The audio contains distinct sound events: bird chirping, cat meowing, ice cracking, and horse neighing. These are four separate sound types. Therefore, the answer is: (C) 4 |
🔊 | What sport are the people in the audio doing? (A) Running (B) Playing badminton (C) Playing football (D) Playing tennis | (D) Playing tennis | The audio contains sounds of a ball being hit and bounced, along with the sound of shoes squeaking on a hard surface. These sounds are characteristic of a sport played on a hard court with a racket or shuttlecock. Tennis and badminton are both such sports. The presence of a ball being hit and bounced suggests a sport where a racket is used to hit a ball, which is more indicative of tennis than badminton. Therefore, the most likely sport is tennis. |
🔊 | How many students are in this class? (A) 3 (B) 2 (C) 5 (D) 1 | (D) 1 | The speaker mentions “just you and me,” indicating a small class size. The presence of a teacher and the sound of a door suggest a classroom setting. Therefore, the most plausible answer is: (D) 1 |
🔊 | Why is the philosopher’s name mentioned in the lyrics? (A) To express a sense of nostalgia (B) To indicate that language cannot express clearly, satirizing the inversion of black and white in the world (C) To add depth and complexity to the lyrics (D) To showcase the wisdom and influence of the philosopher | (B) To indicate that language cannot express clearly, satirizing the inversion of black and white in the world | The lyrics mention the philosopher’s name in a context where language is described as being “muddled” and “inverted,” suggesting a critique of language’s ability to accurately convey meaning. This implies that the philosopher’s name is used to highlight the limitations of language in expressing complex ideas. Therefore, the most plausible reason is: (B). |
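When scoring responses like those in the table above, the final option letter has to be pulled out of free-form reasoning text. The helper below is a rough sketch of one way to do that. It is an assumption for illustration, not the evaluation code used for Audio Flamingo 3, and it handles only option letters A to D written in parentheses.

```python
import re
from typing import Optional

# Matches option markers such as "(A)" or "(D)".
_CHOICE = re.compile(r"\(([A-D])\)")


def extract_choice(response: str) -> Optional[str]:
    """Return the last option letter in the response, or None if absent.

    The last match is used because the reasoning may mention other
    options before settling on a final answer.
    """
    matches = _CHOICE.findall(response)
    return matches[-1] if matches else None


def is_correct(response: str, ground_truth: str) -> bool:
    """Compare the predicted letter with the first letter in the ground truth."""
    gold = _CHOICE.search(ground_truth)
    pred = extract_choice(response)
    return gold is not None and pred == gold.group(1)


cot = ("The audio contains distinct sound events: bird chirping, cat meowing, "
       "ice cracking, and horse neighing. Therefore, the answer is: (C) 4")
no_letter = "Therefore, the most likely sport is tennis."
```

A response that names the answer without an option letter, like the tennis example, returns `None` here and would need a text-matching fallback or manual review.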
Short Audio Understanding
Audio Flamingo 3 not only achieves state-of-the-art results on audio understanding benchmarks but also understands audio in depth.
Audio | Instruction | Ground-truth | Prev. SOTA | Audio Flamingo 3 |
---|---|---|---|---|
🔊 | How many times did the cat meowing sound appear? Choose the correct option from the following options: (A) 1 (B) 2 (C) 3 (D) 4 | (A) 1 | (B) 2 | (A) 1 |
🔊 | The following audio contains someone whistling a melody. Which nursery rhyme is being whistled? Choose the correct option from the following options: (A) Baa Baa Black Sheep (B) Mary Had a Little Lamb (C) Twinkle Twinkle Little Star (D) London Bridge Is Falling Down | (B) Mary Had a Little Lamb | Without the actual audio, it’s impossible to determine the specific nursery rhyme being whistled. Each nursery rhyme has a unique melody, but without hearing the melody, we cannot identify it. | (B) Mary Had a Little Lamb |
🔊 | Where is the person most likely around? (A) country road (B) parking lot (C) shopping mall (D) highway | (A) country road | (B) parking lot | (A) country road |
Multi-audio Multi-turn Chat
Users can chat with Audio Flamingo 3, ask questions about multiple audio clips, and refer back to earlier clips in context.
Sound Dialogue
🎧 Audio Clip 1
User: The narration here feels calm and almost poetic. What aspects of the audio help create such a peaceful and uplifting vibe overall?
Ground-truth: The peaceful vibe is shaped by several elements working together. The narrator speaks in a steady, neutral tone, without any sharp changes in pitch or emotion, which immediately sets a calm mood. The background melody is uplifting but gentle, and the addition of bird chirps reinforces the sense of warmth and tranquility. All these sounds blend to create an atmosphere reminiscent of a serene transition between seasons.
Audio Flamingo 3: The peaceful and uplifting vibe comes from several elements working together. The narrator’s voice is steady and neutral, with a gentle pacing that avoids any urgency or tension. The background music is soft and melodic, which adds warmth and serenity. Additionally, the natural sounds of birds chirping reinforce a sense of tranquility and connection to nature, making the whole scene feel harmonious and optimistic.
User: Does the background music and nature sounds make the information more memorable or emotionally impactful?
Ground-truth: Absolutely. The background music and nature sounds don’t just fill space—they support the narrator’s message by evoking feelings of renewal and positivity. These auditory cues can help the listener connect emotionally, making the information more memorable and giving it a sense of place and time.
Audio Flamingo 3: Yes, the background music and nature sounds play a big role in making the information more memorable and emotionally impactful. The gentle music sets a positive mood, while the birdsong creates an immersive, serene atmosphere. These elements help the listener associate the content with feelings of peace and contentment, making the message about the changing seasons more vivid and emotionally resonant.
🎧 Audio Clip 2
User: Switching gears, the man in this next audio says “I did,” but his tone is very different—almost negative. How does the way he speaks change the meaning?
Ground-truth: The tone here is key. Even though the words are simple, the man’s delivery—marked by a sense of disgust or reluctance—adds emotional weight. The negative tone suggests regret, annoyance, or a sense of being bothered. Unlike the calm narration earlier, this short response feels tense and emotionally charged, changing the listener’s perception of the situation.
Audio Flamingo 3: The man’s tone is key here. His voice carries a sense of disgust or reluctance, which completely shifts the meaning of the phrase “I did.” Instead of sounding proud or neutral, his delivery makes it feel like he’s admitting something unpleasant or unwanted. The lack of background music and the presence of a car engine hum also add a sense of isolation or discomfort, reinforcing the negative emotion.
User: Would his disgust be as clear if there were background music or sound effects, like in the narration clip?
Ground-truth: If there were uplifting music or peaceful sounds in the background, it might soften or even mask his disgust, making it less apparent. In contrast, the lack of background sound here lets the tone of his voice stand out, making the emotional content unmistakable.
Audio Flamingo 3: If there were background music or sound effects, it could potentially soften or even contradict the sense of disgust in his voice. For example, gentle music might make his words seem less harsh, while tense or discordant sounds could amplify the negativity. In the narration clip, the absence of distracting sounds lets the listener focus entirely on the speaker’s tone, making the emotion more direct and unmistakable.
🎧 Audio Clip 3
User: In this audio, a man sighs with clear disgust. How does a sigh communicate so much without words?
Ground-truth: A sigh can convey a lot through its sound alone—especially the way it’s delivered. The length, intensity, and breathiness of the sigh, paired with subtle vocalizations or facial tension (even if unheard), can express feelings like frustration, disappointment, or disgust. Here, the disgusted sigh communicates emotional state even more clearly than words might.
Audio Flamingo 3: A sigh is a powerful nonverbal cue. The way he exhales, the length, and the intensity of the sound can all convey frustration, disappointment, or disgust. Even without words, the listener can pick up on the emotional weight through the rhythm and force of the sigh. It’s a universal signal that often communicates more than a spoken phrase, especially when paired with a tone of voice that matches the emotion.
Music Dialogue
🎧 Music Clip 1
User: This track feels really peaceful and introspective. What elements make it feel so calming and meditative?
Ground-truth: The calming and meditative quality here comes from several factors: the gentle, fingerpicked acoustic guitar creates a soft foundation, while the slow tempo allows plenty of space between phrases. The lo-fi production adds warmth, and the vocal delivery is very restrained, almost whispered. The harmonies are simple and minor, which adds to the introspective vibe. Altogether, these choices keep the mood serene and unhurried.
Audio Flamingo 3: The calming effect comes from several factors: the gentle acoustic instrumentation, the slow and steady tempo, and the soft, expressive vocal delivery. The arrangement leaves plenty of space, with minimal percussion and subtle melodic lines, which helps create a tranquil, reflective mood. The use of major key harmonies also adds to the sense of warmth and serenity.
🎧 Music Clip 2
User: Switching gears, this one is super energetic and synthetic. If I wanted to remix the calming folk piece into something closer to this, what would you suggest?
Ground-truth: To reimagine the peaceful folk song in the style of this energetic electronic track, start by increasing the tempo and layering in electronic drums with a steady, driving beat. Replace or double the acoustic guitar with bright synth arpeggios and pulsing bass. You could chop the original vocals and process them with effects like reverb or vocoder to match the electronic vibe. The contrast between the folk melody and the electro-pop rhythms will give it a fresh, danceable energy.
Audio Flamingo 3: To transform the folk track into a more energetic, synth-driven style, start by increasing the tempo and introducing electronic drums with a punchy, danceable groove. Replace or layer the acoustic instruments with bright, rhythmic synths and arpeggiators. You could also process the vocals with effects like vocoder or light autotune for a more electronic feel. The key is to maintain the melodic core but reframe it with bold, synthetic textures and a driving beat.
User: Would I need to change the chord progression, or can I keep the original harmonies?
Ground-truth: You can generally keep the original chord progression, especially if you want to preserve the song’s emotional core. However, adding some brighter chords or incorporating major-key elements, as in the electronic piece, can help match the upbeat feel. You might also want to experiment with reharmonizing certain sections to better fit the new tempo and groove.
Audio Flamingo 3: You can often keep the original chord progression, especially if it’s strong and memorable. However, you might want to simplify or reharmonize certain sections to fit the new style. Sometimes, adding extended or suspended chords with synths can enhance the energy while preserving the song’s emotional core.
🎧 Music Clip 3
User: This funk instrumental is super tight and rhythmic. Could I blend elements from it into either of the other two tracks?
Ground-truth: Absolutely! The syncopated grooves and punchy horn lines from the funk track could add rhythmic interest and energy to either the folk or electronic song. For the folk piece, try layering subtle funk guitar or percussion beneath the acoustic elements. For the electronic track, you could sample horn stabs or use funky basslines to deepen the groove. Just make sure the rhythmic complexity complements the existing structure—this will help create a seamless fusion.
Audio Flamingo 3: Absolutely! For the folk track, you could introduce syncopated bass lines and percussive guitar strumming inspired by funk, giving it a groovier, more upbeat feel. For the electronic track, layering funky guitar riffs or slap bass under the synths would add organic movement and complexity. In both cases, the rhythmic interplay and tight arrangements from funk can inject new life and danceability into the other genres.
Voice-to-voice conversation
Audio Flamingo 3 can answer spoken user queries directly with speech, supporting voice-to-voice interaction.
Audio Instruction | Ground-truth | AF3 Voice |
---|---|---|
🎧 Audio Instruction 1: What are the names of some famous actors that started their careers on Broadway? | Some famous actors that started their careers on Broadway include: Hugh Jackman, Meryl Streep, Denzel Washington, Julia Roberts, Christopher Walken, Anthony Rapp, Audra McDonald, Nathan Lane, Sarah Jessica Parker and Lin-Manuel Miranda. | 🎧 AF3 Voice 1 |
🎧 Audio Instruction 2: Hi, my sister and her girlfriends want me to play kickball with them. Can you explain how the game is played, so they don’t take advantage of me? | Kickball is a game similar to baseball, but with a large rubber ball instead of a bat and a ball. The game is usually played with two teams of six players each. Each team has three bases and a home plate. The players on the kicking team line up at home plate and take turns kicking the ball. The object of the game is to score runs by running around all three bases and back to home plate without being tagged out by the defense. The team with the most runs at the end of the game is the winner. | 🎧 AF3 Voice 2 |
🎧 Audio Instruction 3: What is some cool music from the 1920s? | Some cool music from the 1920s includes jazz classics like Duke Ellington’s “Mood Indigo,” Louis Armstrong’s “West End Blues,” and Bessie Smith’s “Down Hearted Blues.” Other popular tunes from the era include Fats Waller’s “Ain’t Misbehavin,” George Gershwin’s “Rhapsody in Blue,” and Irving Berlin’s “Puttin’ On the Ritz.” | 🎧 AF3 Voice 3 |
🎧 Audio Instruction 4: How did US states get their names? | US states get their names from a variety of sources, including Native American tribes, Spanish explorers, British colonists, and even presidents. For example, the state of Alabama was named after the Native American tribe that lived in the area, while the state of Florida gets its name from the Spanish explorer, Ponce de Leon, who explored the area in the 1500s. Other states are named after English kings (like Virginia, named after England’s “Virgin Queen,” Queen Elizabeth I) or presidents (like Washington, named after George Washington) | 🎧 AF3 Voice 4 |
AF-Whisper
Prior work in audio representation learning typically treats speech, sounds, and music as separate modalities, and LALMs often rely on distinct encoders for each. Using separate encoders for LALMs increases model complexity, introduces frame-rate mismatches, and can lead to training instability. To address this, we propose AF-Whisper, a unified audio encoder trained with a simple yet effective representation learning strategy to model all three audio types.
Specifically, we use the pre-trained Whisper large-v3 encoder, attach it to a standard Transformer decoder, and train using the audio captioning task with the next-token-prediction objective. To achieve this, we generate a natural language caption for each audio, describing its speech, sound, and music content. We choose Whisper as the backbone due to its existing speech understanding capabilities and its dense, high-resolution audio features, which are more informative than those from models like CLAP.
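The next-token-prediction objective used for captioning can be illustrated with a small, framework-free sketch. The function names and the toy logits are assumptions made for illustration; a real setup would take the logits from a Transformer decoder attending to the Whisper encoder features, which is not modeled here.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized vector."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def next_token_loss(step_logits: Sequence[Sequence[float]],
                    caption_ids: Sequence[int]) -> float:
    """Mean negative log-likelihood of each caption token given its prefix.

    step_logits[t] holds the decoder scores produced after reading
    caption_ids[: t + 1], so it is scored against caption_ids[t + 1].
    """
    targets = caption_ids[1:]  # shift by one: predict the next token
    if len(step_logits) != len(targets):
        raise ValueError("need exactly one logit vector per predicted token")
    nll = [-log_softmax(logits)[tok] for logits, tok in zip(step_logits, targets)]
    return sum(nll) / len(nll)


# Toy caption of three token ids over a vocabulary of size 4.
caption = [0, 1, 2]
uniform = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
confident = [[0.0, 10.0, 0.0, 0.0], [0.0, 0.0, 10.0, 0.0]]
loss_uniform = next_token_loss(uniform, caption)
loss_confident = next_token_loss(confident, caption)
```

With uniform scores the loss equals the log of the vocabulary size, and it falls toward zero as the decoder places more mass on the correct next token.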

Curriculum Training
Audio Flamingo 3 is trained with a 5-stage curriculum strategy as shown in the figure below.

Benchmark Results
Audio Flamingo 3 outperforms prior SOTA models including GAMA, Audio Flamingo, Audio Flamingo 2, Qwen-Audio, Qwen2-Audio, Qwen2.5-Omni, LTU, LTU-AS, SALMONN, AudioGPT, Gemini Flash v2, Gemini Pro v1.5, Gemini Pro v2.5, and GPT-4o-audio on a number of understanding and reasoning benchmarks.
- Audio Flamingo 3 has SOTA audio understanding and reasoning abilities, including long audio understanding.

- Audio Flamingo 3 achieves competitive/SOTA performance on the ASR benchmarks.

- Audio Flamingo 3 achieves SOTA performance on the multi-turn, multi-audio chat and voice-to-text benchmarks.

Citation
- Audio Flamingo
@inproceedings{kong2024audio,
  title={Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities},
  author={Kong, Zhifeng and Goel, Arushi and Badlani, Rohan and Ping, Wei and Valle, Rafael and Catanzaro, Bryan},
  booktitle={International Conference on Machine Learning},
  pages={25125--25148},
  year={2024},
  organization={PMLR}
}
- Audio Flamingo 2
@inproceedings{ghosh2025audio,
  title={Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities},
  author={Ghosh, Sreyan and Kong, Zhifeng and Kumar, Sonal and Sakshi, S and Kim, Jaehyeon and Ping, Wei and Valle, Rafael and Manocha, Dinesh and Catanzaro, Bryan},
  booktitle={Forty-second International Conference on Machine Learning},
  year={2025},
  url={https://openreview.net/forum?id=xWu5qpDK6U}
}
- Audio Flamingo 3
@article{goel2025audio,
  title={Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models},
  author={Goel, Arushi and Ghosh, Sreyan and Kim, Jaehyeon and Kumar, Sonal and Kong, Zhifeng and Lee, Sang-gil and Yang, Chao-Han Huck and Duraiswami, Ramani and Manocha, Dinesh and Valle, Rafael and Catanzaro, Bryan},
  journal={arXiv preprint arXiv},
  year={2025}
}