Audio Flamingo 2

Published: February 13, 2025

An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities

[Paper] [Code]

Audio Flamingo 2 - 0.5B: [Gradio] [Checkpoints]

Audio Flamingo 2 - 1.5B: [Gradio] [Checkpoints]

Audio Flamingo 2 - 3B: [Gradio] [Checkpoints]

Authors: Sreyan Ghosh, Zhifeng Kong, Sonal Kumar, S Sakshi, Jaehyeon Kim, Wei Ping, Rafael Valle, Dinesh Manocha, Bryan Catanzaro

Posted by: Zhifeng Kong

Overview

Understanding non-speech sounds, non-verbal speech, and music is essential for real-world applications such as detecting anomalies in industrial environments, recognizing emotional cues, and improving assistive technologies for people with impairments. While Large Language Models (LLMs) have demonstrated remarkable reasoning capabilities through language, extending these systems to comprehend audio is key to building intelligent systems that can reason with contextual auditory cues. Verbal speech, inherently tied to language, has benefited significantly from LLM advancements. However, the potential to enhance perception and reasoning over non-verbal audio remains largely under-explored.

In this paper, we introduce Audio Flamingo 2, an Audio-Language Model with advanced long-audio understanding and reasoning capabilities. Audio Flamingo 2 achieves state-of-the-art performance on more than 20 benchmarks with only a 3B-parameter small language model.

We introduce two datasets to advance this field: AudioSkills for expert audio reasoning and LongAudio for long audio understanding.
Audio Flamingo 2 has advanced audio understanding and reasoning capabilities. In particular, it has expert audio reasoning abilities and can understand audio up to 5 minutes long.
Audio Flamingo 2 outperforms larger and proprietary LALMs across 20+ benchmarks, despite being smaller (3B) and trained exclusively on public datasets.

Expert Reasoning and Long Audio Understanding

We propose AudioSkills, a high-quality, skill-specific synthetic dataset designed to prioritize the development of reasoning and problem-solving abilities. This dataset is carefully curated to ensure diversity and relevance, grounded in the hypothesis that expert reasoning emerges from the mastery of various relevant skills and world knowledge. The skills include: Temporal Reasoning, Attribute Identification, Counting, Contextual Sound Event Reasoning, Contextual Speech Event Reasoning, Information Extraction, and General Reasoning.

We propose LongAudio and LongAudioBench for training and evaluating long audio understanding. LongAudio consists of over 80K unique audio clips and approximately 263K AQA pairs, with audio drawn from open long-video datasets. LongAudio supports six tasks: Captioning, Plot QA, Temporal QA, Needle QA, Subscene QA, and General QA. LongAudioBench has 2,429 expert human-annotated instances across these tasks.
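To make the task structure concrete, here is a minimal sketch of how LongAudio-style AQA pairs could be represented and tallied per task. The six task names come from the description above; the record fields (`audio_id`, `question`, `answer`) and the helper `count_by_task` are hypothetical and are not the released dataset schema.

```python
from dataclasses import dataclass
from enum import Enum


class LongAudioTask(Enum):
    """The six LongAudio task types named on this page."""
    CAPTIONING = "Captioning"
    PLOT_QA = "Plot QA"
    TEMPORAL_QA = "Temporal QA"
    NEEDLE_QA = "Needle QA"
    SUBSCENE_QA = "Subscene QA"
    GENERAL_QA = "General QA"


@dataclass(frozen=True)
class AQAPair:
    """One audio question-answer pair (field names are assumptions)."""
    audio_id: str        # hypothetical identifier of the source audio
    task: LongAudioTask  # which of the six tasks this pair belongs to
    question: str
    answer: str


def count_by_task(pairs):
    """Return a dict mapping every task to the number of pairs it has."""
    counts = {task: 0 for task in LongAudioTask}
    for pair in pairs:
        counts[pair.task] += 1
    return counts
```

Because one audio can carry several pairs, grouping by `audio_id` in the same way would give the number of questions per audio.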

Samples

Emergent Ability

Question: What is unusual about the audio?

Ground Truth: The audio contains combination of music and hammer beating which is unusual.

LTU: There is nothing unusual.

Qwen2-Audio-Instruct: The input audio contains a dark electronic piece with synthesizers and drums that has an energetic and experimental mood, instrumental only, and includes bass and percussion.

GAMA: The audio clip does not contain any unusual sounds. It is a typical example of a musical composition with drums and percussion instruments playing.

Gemini Pro v1.5: The unusual thing about the audio is that it sounds like someone is **shredding paper by hand**, but at a very **fast and consistent pace**. It's so fast and rhythmic that it almost sounds mechanical, but the slight variations and imperfections suggest a human hand. It's the speed and consistency that makes it stand out.

Audio Flamingo (chat): The audio seems like a remix of a popular song, which might make it interesting. However, the combination of the rhythmic scratching and rhythmic tapping sounds and the vibrating techno beat suggests a somewhat unusual arrangement.

Audio Flamingo 2 (ours): The unusual aspect of the audio is the combination of the music and the sound of the mechanisms.

Sound Captioning (AudioCaps)

Question: how would you describe the scene from the audio?

Ground Truth: Water pouring from a faucet and draining into a pipe while a young girl talks followed by a brush scrubbing then a person spitting.

LTU: An woman is taking whike a water flowing in the background.

Qwen2-Audio-Instruct: The background sounds include domestic home sounds, water tap or faucet noise, sounds of various mechanisms, generic impact sounds, and respiratory sounds such as breathing.

GAMA: Water flowing while a woman talking can be heard.

Gemini Pro v1.5: The audio depicts a person seemingly frustrated or annoyed. They say, "The last time you did the first thing, same thing," suggesting someone repeated a mistake or undesirable action. A sigh or exasperated exhale follows, along with a muttered, unintelligible word (possibly a curse word, given the tone and context). The overall impression is one of exasperation and perhaps a bit of resignation.

Audio Flamingo (chat): The audio seems to depict a scene involving a woman speaking, some mechanical equipment running in the background, and the sound of water trickling or dripping intermittently. There are also some generic impact sounds and breathing sounds.

Audio Flamingo 2 (ours): Water running and a woman speaking.

Question: what is the soundscape in this audio?

Ground Truth: Traffic hums and beeps with revving engines and a man speaking nearby.

LTU: An idle car engine is heard.

Qwen2-Audio-Instruct: The audio contains the sound of an engine accelerating and revving, noticeable as medium engine sounds of mid-frequency active throughout the duration of the audio, coupled with brief instances of a clock ticking at specific intervals. Additionally, there is a speech segment where a male voice in Mandarin can be heard saying '坐等他开走啊'.

GAMA: Sound of men talking and car engine

Gemini Pro v1.5: The soundscape primarily features the sounds of a vehicle, likely a car or truck. This is evident through what sounds like engine noise, perhaps struggling or accelerating. There's also a distinct vocalization, likely a person expressing surprise or concern, possibly related to the vehicle's performance. The tone suggests a potential problem. There might be other subtle sounds present, but these are the dominant elements.

Audio Flamingo (chat): The audio mainly consists of the sounds of a man speaking, the hum of a motor vehicle, breathing, ticking, and a generic impact sound.

Audio Flamingo 2 (ours): A man speaks and a vehicle engine runs.

Music Captioning (MusicCaps)

Question: summarize the music content in a sentence.

Ground Truth: A female vocalist sings this spirited latin song.the tempo is medium fast with an animated piano accompaniment, rhythmic congas and bongos, funky bass line and bright trumpet harmony. The song is lively, spirited, upbeat, catchy, vivacious, vibrant and peppy with a salsa dance groove. This song is Latin Pop.

LTU: A female voice is heard singing, possibly a dance song.

Qwen2-Audio-Instruct: The music features a solo guitar playing blues with slides between chord changes and a syncopated bassline, creating a relaxed atmosphere.

GAMA: A salsa song with a vibrant and upbeat rhythm and a male voice singing in the background with a trumpet playing a melody in the foreground. The song is lively, energetic, and joyful.

Gemini Pro v1.5: The lyrics express the singer's strong liking for the music being played.

Audio Flamingo (chat): The song has a salsa groove with a male voice on lead. It has a powerful female voice backing him.

Audio Flamingo 2 (ours): This is a salsa music piece. There is a brass section playing the main melody. The rhythm is played by the acoustic drums and the bass guitar. The atmosphere is upbeat and joyful. This piece could be playing in the background at a latin american social media company.

Question: provide a caption for the music.

Ground Truth: The low quality recording features an arpeggiated steel acoustic guitar playing alongside acoustic rhythm guitar chord progression. The recording is noisy and in mono and it sounds like a cartoon song.

LTU: Music featuring guitar playing can be heard.

Qwen2-Audio-Instruct: The music is in C major, with a time signature of 4/4 and a tempo of 103.45 bpm.

GAMA: A guitar is playing.

Gemini Pro v1.5: Several options depending on the mood you want to convey: **Formal:** "Enjoying a lively piece of traditional Klezmer music." **Informal/Fun:** "This Klezmer tune is making me want to dance!" or "Getting my groove on with some Klezmer!" **Descriptive:** "The frenetic energy of this Klezmer piece is infectious!" or "Listening to the beautiful clarinet melodies of this Klezmer song." **If you know the song/artist:** "[Song Title] by [Artist Name] - a Klezmer classic!" **If you're at a performance:** "Live Klezmer music is the best!" or "Soaking up the atmosphere at this amazing Klezmer concert."

Audio Flamingo (chat): The song has a bluesy melody on the acoustic guitar.

Audio Flamingo 2 (ours): This is a live performance of a blues rock piece. There is an electric guitar playing a solo using the tapping technique. The piece has a blues feel to it. The atmosphere is groovy. This piece could be playing in the background at a rock bar.

Long Audio Understanding & Reasoning (LongAudioBench)

Question: What sound is heard towards the end of the audio? Choose the correct option among the options below: (a) a man dancing (b) a man speaking in English (c) a man kicking a door (d) continuous background noise

Ground Truth: (b) a man speaking in English

LTU: A man kicking the door can be heard towards the end of the audio.

Qwen2-Audio-Instruct: (C) A man kicking a door

GAMA: (a) a man dancing

Gemini Pro v1.5: The correct answer is (b). A man's voice clearly says, "It's over here... Don't lose me." There's no other sound effect that matches the other choices.

Audio Flamingo (chat): A man speaking in English. The sound of a door being kicked can be heard towards the end of the audio.

Audio Flamingo 2 (ours): (b) a man speaking in English

Question: When is the song with a male vocalist heard? Choose the correct option among the options below: (A) Towards the beginning (B) In the middle (C) Towards the end

Ground Truth: (B) In the middle

LTU: Male vocalist can be heard throughout the audio.

Qwen2-Audio-Instruct: The male vocalist can be heard at the beginning of the track. Therefore, the answer is (A).

GAMA: (A) Beginning

Gemini Pro v1.5: The piano melody begins right at the start of the clip. So the answer is (A).

Audio Flamingo (chat): End

Audio Flamingo 2 (ours): (B) middle

Question: What happens between the background noise and the clicking sounds?

Ground Truth: A man and a woman are heard talking after the background noise and before the clicking sounds.

LTU: A man is talking.

Qwen2-Audio-Instruct: Various background noises and occasional speech as someone moves around a cluttered room handling objects.

GAMA: The audio clip does not provide enough information to determine what happens between the background noise and the clicking sounds.

Gemini Pro v1.5: Between the background noise and the clicking sounds, the person in the video is searching for a scrapbook. They are looking through drawers, bags, and other containers, rummaging around and making the rustling and clicking sounds in the process. They are having a conversation with someone off-camera about the scrapbook and where the pictures might be. They express uncertainty about the scrapbook’s location and wonder if it’s been moved.

Audio Flamingo (chat): Between the background noise and the clicking sounds, there's a brief segment where someone can be heard speaking. This might indicate that someone is performing some activity that involves the use of a device, possibly a computer or a calculator.

Audio Flamingo 2 (ours): A man speaks in English, followed by a woman speaking in English.

Question: Give a description of the atmosphere in the audio.

Ground Truth: Intense video game battle with dynamic sound effects, background music, and intermittent male commentary.

LTU: The atmosphere of the audio is filled with excitement and impact sounds.

Qwen2-Audio-Instruct: The atmosphere is intense and action-packed, with an emphasis on conflict and determination.

GAMA: The atmosphere in the audio is one of excitement and anticipation, as the player is about to win a jackpot and receive a prize.

Gemini Pro v1.5: An intense video game battle unfolds, filled with vivid sound effects that capture the chaos and energy of combat. The rapid, driving music heightens the tension, while periodic male voice snippets provide guidance or comments, anchoring the player in a thrilling, fast-paced experience.

Audio Flamingo (chat): Intense dynamic sound, fast-paced music, and male voice snippets.

Audio Flamingo 2 (ours): Intense video game battle with dynamic sound effects, fast-paced music, and occasional male speech.

Question: What musical elements accompany the announcement of the SAG Awards nominations?

Ground Truth: An instrumental theme with electric guitar, keyboard chords, bass guitar pattern, and drum beat.

LTU: There is no information about the musical elements and SAG Awards nominations in the audio.

Qwen2-Audio-Instruct: An instrumental theme with electric guitar, keyboard chords, bass guitar pattern, and drum beat.

GAMA: The audio clip does not provide information about any musical elements accompanying the announcement.

Gemini Pro v1.5: The audio clip includes some upbeat, triumphant-sounding music accompanying the announcement of the *Bombshell* cast's nomination. It sounds orchestral/instrumental, possibly with horns.

Audio Flamingo (chat): The announcement of the SAG Awards nominations is accompanied by music, narration from an announcer, and sound effects, which all contribute to a specific mood of excitement and anticipation.

Audio Flamingo 2 (ours): A male voice announces the nominations over a slow-tempo instrumental piece with a groovy bass line and a simple drum beat.

Question: What indicates a shift in the speaker's mood during the audio?

Ground Truth: The speaker's voice turns from happy to disgusted when discussing unusual methods.

LTU: The background sound turns suspenseful, which indicates a shift in the mood of the speaker.

Qwen2-Audio-Instruct: The transition from no music to music indicates a shift in the speaker's mood.

GAMA: The shift in the speaker's mood is not specified in the given information. It is possible that the speaker's tone or inflection changes, but without further context, it is difficult to determine the exact shift in mood.

Gemini Pro v1.5: The shift in the speaker's mood is indicated by the interruption and side conversations happening while he's giving his sermon. We hear whispers, someone saying "Stop it," and requests to have the disruptive element taken outside. These indicate the speaker is likely becoming increasingly frustrated and perhaps embarrassed by the disruption. The final exasperated sigh further emphasizes the shift from confident preaching to annoyance and exasperation.

Audio Flamingo (chat): After the speaker finishes speaking, there is a sharp, brief sound that could be a gasp or a sigh, which could indicate surprise or relief. However, without specific details, it's hard to confirm if that's the case.

Audio Flamingo 2 (ours): The male voice transitions from a neutral tone to a disgusted mood while discussing the man's actions.

MMAU test-mini

Question: What specific sounds can be distinguished from the audio clip? (A) Car honking and raindrops (B) Whistling and chatter (C) Helicopter and impact sounds (D) Birds chirping and water flowing

Ground Truth: (C) Helicopter and impact sounds.

LTU: The source of the sound is likely whistling and chatter.

Qwen2-Audio-Instruct: The specific sounds that can be distinguished are birds chirping and water flowing.

GAMA: b) Whistling and chatter

Gemini Pro v1.5: The audio clip contains the distinct sounds of a helicopter and what sounds like an impact or crash. So the answer is (C).

Audio Flamingo (chat): The audio clip primarily features the sounds of a vehicle horn and a heavy engine, accompanied by other human sounds such as bird whistling and chatter.

Audio Flamingo 2 (ours): (C) Helicopter and impact sounds

Question: Which chord is played right before the last chord in the audio? (A) C#:maj7/1, (B) F#:maj7/1, (C) G#:7/1, (D) A#:min7/1.

Ground Truth: (B) F#:maj7/1.

LTU: The chord playing in the audio is not mentioned.

Qwen2-Audio-Instruct: The chord played right before the last chord in the audio is A# major (A#maj7/1).

GAMA: The chord playing in the audio is a) C#:maj7/1

Gemini Pro v1.5: The chord played right before the last chord is a G#7 chord. So the correct answer is (C).

Audio Flamingo (chat): An intro riff featuring a single chord from the acoustic-guitar is strummed before the last chord in the audio.

Audio Flamingo 2 (ours): B) F#:maj7/1

Expert Reasoning (MuChoMusic)

Question: Which instrument performs the long melancholic solo section? Options: (A) Vocals (B) Guitar (C) Drum groove (D) Emotional male vocal delivery.

Ground Truth: (B) Guitar

LTU: Emotional male vocal delivery pperforms the long melancholic solo section.

Qwen2-Audio-Instruct: The answer is D: Emotional male vocal delivery.

GAMA: (D) Emotional male vocal delivery

Gemini Pro v1.5: The long melancholic solo is played on a guitar. So the answer is (B).

Audio Flamingo (chat): The vocals performs the long melancholic solo section.

Audio Flamingo 2 (ours): (B) Guitar

Question: Which two genres are primarily combined in this music piece? Options: (A) Techno and Trance (B) EDM and House (C) Folk and Rock (D) Electronic Disco.

Ground Truth: (B) EDM and House

LTU: (A) Techno and Trance

Qwen2-Audio-Instruct: The answer is D: Emotional male vocal delivery.

GAMA: The primary combination of genres in this music piece is (D) Electronic Disco, which encompasses elements of both techno and trance.

Gemini Pro v1.5: The primary genres combined in this piece are (A) Techno and Trance.

Audio Flamingo (chat): The two genres that are primarily combined in this music piece are techno and trance.

Audio Flamingo 2 (ours): (b) edm and house

Expert Reasoning (CompA-R-test)

Question: Given the continuous presence of mechanisms sound and the children's laughter, what kind of children's play could be occurring?

Ground Truth: Based on the sound of mechanisms and children's laughter and speech, they might be playing on a swing or slide.

LTU: The children might be playing with toys.

Qwen2-Audio-Instruct: The children might be engaged in a game that involves running, jumping, or hide-and-seek, which often results in laughter and physical activity.

GAMA: It is possible that the children are playing with toys or engaging in physical activities that involve the use of mechanisms, such as riding a bicycle or playing with a ball that has a mechanism inside it.

Gemini Pro v1.5: The children may be playing a game that includes running and jumping, frequently sparking laughter.

Audio Flamingo (chat): The children might be playing games which involve physical activity and laughter like hide-and-seek.

Audio Flamingo 2 (ours): The children could be playing with playground equipment like swings or sea-saw, as indicated by the mechanism sounds and their laughter.

Question: Based on the audio, determine the likely activity the child might be engaged in during the singing.

Ground Truth: The child might be partaking in a playful or creative activity that involves singing, possibly a game or a sing-along scenario.

LTU: It is probable that the child is simultaneously engaged in both singing and playing.

Qwen2-Audio-Instruct: The child is likely singing and playing at the same time.

GAMA: The child might be singing in a classroom or at home, as the audio does not provide enough information to determine the specific activity. However, the presence of a child singing suggests that the child might be engaged in a musical activity, such as singing a song or practicing a vocal exercise.

Gemini Pro v1.5: The child could be in a classroom environment, actively engaged in a group-based activity that involves singing and playful interactions. This might include chanting songs, clapping rhythms, or using simple instruments, fostering creativity, cooperation, and enjoyment while promoting social and musical development.

Audio Flamingo (chat): The child’s voice and accompanying music suggest a playful or creative activity.

Audio Flamingo 2 (ours): The child might be engaged in a playful or creative activity, as suggested by the presence of a child's voice and the accompanying music.

AF-CLAP

Representations in current CLAP models struggle with compositional reasoning and linguistic variations in captions. We introduce an improved version of CLAP called AF-CLAP, where we (1) construct a large-scale, high-quality training dataset, and (2) improve the training objective for better representational quality and robustness.

For each audio-caption pair, we construct linguistically varied captions with identical semantics and composition, and regard these as additional positives. We then generate caption variations with modified temporal or attribute compositions, and regard these as additional negatives. Our improved contrastive loss considers these additional positives and negatives, which leads to more human-aligned representation and better results (in both representation learning and audio understanding).
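The idea of adding extra positives and negatives to a contrastive loss can be sketched for a single audio clip as follows. This is an illustrative InfoNCE-style formulation with multiple positives, not necessarily the exact AF-CLAP objective; the function names, the cosine similarity, and the temperature value are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def multi_positive_contrastive_loss(audio, positives, negatives, temperature=0.1):
    """Average negative log-likelihood of each positive caption.

    positives: embeddings of the original caption and its paraphrases.
    negatives: embeddings of captions with altered temporal or attribute
               composition (and, in practice, other in-batch captions).
    """
    pos_logits = [cosine(audio, p) / temperature for p in positives]
    neg_logits = [cosine(audio, n) / temperature for n in negatives]
    all_logits = pos_logits + neg_logits
    # Log-sum-exp with max subtraction for numerical stability.
    peak = max(all_logits)
    log_denom = peak + math.log(sum(math.exp(x - peak) for x in all_logits))
    return -sum(x - log_denom for x in pos_logits) / len(pos_logits)
```

The loss falls when the audio embedding is close to every paraphrase and far from the composition-altered captions, which is the behavior the paragraph above describes.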

Curriculum Training

Audio Flamingo 2 is trained with a 3-stage curriculum.

Pre-training: we train the transformation and cross-attention layers on audio of up to 30 seconds, using the pre-training dataset.
Fine-tuning: we train all layers except the LLM layers on audio of up to 1.5 minutes, using the fine-tuning dataset.
Long fine-tuning: we train the transformation and cross-attention layers on audio of up to 5 minutes, using the LongAudio dataset.
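The three stages above can be written down as a small configuration table. This is only a sketch: the module-group names (in particular `audio_encoder`) and the helper `trainable_flags` are hypothetical, since this page names only the transformation, cross-attention, and LLM layers, and the actual training code may organize parameters differently.

```python
from dataclasses import dataclass

# Hypothetical module groups; "audio_encoder" stands in for whatever
# non-LLM components exist besides the two layer types named on this page.
MODULE_GROUPS = ("audio_encoder", "transformation", "cross_attention", "llm")


@dataclass(frozen=True)
class Stage:
    name: str
    audio_seconds: int  # audio length used in this stage
    dataset: str
    trainable: tuple    # module groups that receive gradient updates


CURRICULUM = (
    Stage("pre-training", 30, "pre-training dataset",
          ("transformation", "cross_attention")),
    # "All but LLM layers": every group except the language model.
    Stage("fine-tuning", 90, "fine-tuning dataset",
          tuple(g for g in MODULE_GROUPS if g != "llm")),
    Stage("long fine-tuning", 300, "LongAudio",
          ("transformation", "cross_attention")),
)


def trainable_flags(stage):
    """Map each module group to True (updated) or False (frozen)."""
    return {group: group in stage.trainable for group in MODULE_GROUPS}


for stage in CURRICULUM:
    print(stage.name, stage.audio_seconds, trainable_flags(stage))
```

In a framework such as PyTorch, flags like these would typically be applied by toggling `requires_grad` on each group's parameters before a stage begins.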

Additional ablations on the curriculum training can be found below.

Benchmark Results

Audio Flamingo 2 outperforms prior SOTA models including GAMA, Audio Flamingo, Qwen-Audio, Qwen2-Audio, LTU, LTU-AS, SALMONN, AudioGPT, Gemini Flash v2, Gemini Pro v1.5, and GPT-4o-audio on a number of understanding and reasoning benchmarks.

Audio Flamingo 2 has SOTA foundational audio understanding abilities.

Audio Flamingo 2 has SOTA audio reasoning and long audio understanding abilities.

Citation

Audio Flamingo

@inproceedings{kong2024audio,
title={Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities},
author={Kong, Zhifeng and Goel, Arushi and Badlani, Rohan and Ping, Wei and Valle, Rafael and Catanzaro, Bryan},
booktitle={International Conference on Machine Learning},
pages={25125--25148},
year={2024},
organization={PMLR}
}