Audio Flamingo 2
An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities
Paper Code & Model Checkpoints
Authors: Sreyan Ghosh, Zhifeng Kong, Jaehyeon Kim, Wei Ping, Rafael Valle, Bryan Catanzaro
Posted: Zhifeng Kong
Overview


Understanding non-speech sounds, non-verbal speech, and music is essential for real-world applications such as detecting anomalies in industrial environments, recognizing emotional cues, and improving assistive technologies for the impaired. While Large Language Models (LLMs) have demonstrated remarkable reasoning capabilities through language, extending these systems to comprehend audio is key to building intelligent systems capable of reasoning with contextual auditory cues. Verbal speech, inherently tied to language, has benefited significantly from LLM advancements. However, the potential to enhance perception and reasoning over non-verbal audio remains largely under-explored.
In this paper, we introduce Audio Flamingo 2, an Audio-Language Model with advanced long-audio understanding and reasoning capabilities. Audio Flamingo 2 achieves state-of-the-art performance on more than 20 benchmarks with only a 3B-parameter small language model.
- We introduce two datasets, AudioSkills for expert audio reasoning and LongAudio for long audio understanding, to advance this field.
- Audio Flamingo 2 has advanced audio understanding and reasoning capabilities. In particular, it has expert audio reasoning abilities and can understand long audio of up to 5 minutes.
- Audio Flamingo 2 outperforms larger and proprietary LALMs across 20+ benchmarks, despite being smaller (3B) and trained exclusively on public datasets.
Expert Reasoning and Long Audio Understanding
We propose AudioSkills, a high-quality, skill-specific synthetic dataset designed to prioritize the development of reasoning and problem-solving abilities. This dataset is carefully curated to ensure diversity and relevance, grounded in the hypothesis that expert reasoning emerges from the mastery of various relevant skills and world knowledge. The skills include: Temporal Reasoning, Attribute Identification, Counting, Contextual Sound Event Reasoning, Contextual Speech Event Reasoning, Information Extraction, and General Reasoning.
We propose LongAudio and LongAudioBench for training and evaluating long audio understanding. LongAudio consists of over 80K unique audios and approximately 263K AQA pairs, with audios drawn from open long-video datasets. LongAudio supports six tasks: Captioning, Plot QA, Temporal QA, Needle QA, Subscene QA, and General QA. LongAudioBench has 2429 expert human-annotated instances across these tasks.
Samples
Emergent Ability
Ground Truth: The audio contains combination of music and hammer beating which is unusual. |
LTU: There is nothing unusual. |
Qwen2-Audio-Instruct: The input audio contains a dark electronic piece with synthesizers and drums that has an energetic and experimental mood, instrumental only, and includes bass and percussion. |
GAMA: The audio clip does not contain any unusual sounds. It is a typical example of a musical composition with drums and percussion instruments playing. |
Gemini: The unusual thing about the audio is that it sounds like someone is **shredding paper by hand**, but at a very **fast and consistent pace**. It's so fast and rhythmic that it almost sounds mechanical, but the slight variations and imperfections suggest a human hand. It's the speed and consistency that makes it stand out. |
Audio Flamingo (chat): The audio seems like a remix of a popular song, which might make it interesting. However, the combination of the rhythmic scratching and rhythmic tapping sounds and the vibrating techno beat suggests a somewhat unusual arrangement. |
Audio Flamingo 2 (ours): The unusual aspect of the audio is the combination of the music and the sound of the mechanisms. |
Sound Captioning (AudioCaps)
Ground Truth: Water pouring from a faucet and draining into a pipe while a young girl talks followed by a brush scrubbing then a person spitting. |
LTU: An woman is taking whike a water flowing in the background. |
Qwen2-Audio-Instruct: The background sounds include domestic home sounds, water tap or faucet noise, sounds of various mechanisms, generic impact sounds, and respiratory sounds such as breathing. |
GAMA: Water flowing while a woman talking can be heard. |
Gemini: The audio depicts a person seemingly frustrated or annoyed. They say, "The last time you did the first thing, same thing," suggesting someone repeated a mistake or undesirable action. A sigh or exasperated exhale follows, along with a muttered, unintelligible word (possibly a curse word, given the tone and context). The overall impression is one of exasperation and perhaps a bit of resignation. |
Audio Flamingo (chat): The audio seems to depict a scene involving a woman speaking, some mechanical equipment running in the background, and the sound of water trickling or dripping intermittently. There are also some generic impact sounds and breathing sounds. |
Audio Flamingo 2 (ours): Water running and a woman speaking. |
Ground Truth: Traffic hums and beeps with revving engines and a man speaking nearby. |
LTU: An idle car engine is heard. |
Qwen2-Audio-Instruct: The audio contains the sound of an engine accelerating and revving, noticeable as medium engine sounds of mid-frequency active throughout the duration of the audio, coupled with brief instances of a clock ticking at specific intervals. Additionally, there is a speech segment where a male voice in Mandarin can be heard saying '坐等他开走啊'. |
GAMA: Sound of men talking and car engine |
Gemini: The soundscape primarily features the sounds of a vehicle, likely a car or truck. This is evident through what sounds like engine noise, perhaps struggling or accelerating. There's also a distinct vocalization, likely a person expressing surprise or concern, possibly related to the vehicle's performance. The tone suggests a potential problem. There might be other subtle sounds present, but these are the dominant elements. |
Audio Flamingo (chat): The audio mainly consists of the sounds of a man speaking, the hum of a motor vehicle, breathing, ticking, and a generic impact sound. |
Audio Flamingo 2 (ours): A man speaks and a vehicle engine runs. |
Music Captioning (MusicCaps)
Ground Truth: A female vocalist sings this spirited latin song.the tempo is medium fast with an animated piano accompaniment, rhythmic congas and bongos, funky bass line and bright trumpet harmony. The song is lively, spirited, upbeat, catchy, vivacious, vibrant and peppy with a salsa dance groove. This song is Latin Pop. |
LTU: A female voice is heard singing, possibly a dance song. |
Qwen2-Audio-Instruct: The music features a solo guitar playing blues with slides between chord changes and a syncopated bassline, creating a relaxed atmosphere. |
GAMA: A salsa song with a vibrant and upbeat rhythm and a male voice singing in the background with a trumpet playing a melody in the foreground. The song is lively, energetic, and joyful. |
Gemini: The lyrics express the singer's strong liking for the music being played. |
Audio Flamingo (chat): The song has a salsa groove with a male voice on lead. It has a powerful female voice backing him. |
Audio Flamingo 2 (ours): This is a salsa music piece. There is a brass section playing the main melody. The rhythm is played by the acoustic drums and the bass guitar. The atmosphere is upbeat and joyful. This piece could be playing in the background at a latin american social media company. |
Ground Truth: The low quality recording features an arpeggiated steel acoustic guitar playing alongside acoustic rhythm guitar chord progression. The recording is noisy and in mono and it sounds like a cartoon song. |
LTU: Music featuring guitar playing can be heard. |
Qwen2-Audio-Instruct: The music is in C major, with a time signature of 4/4 and a tempo of 103.45 bpm. |
GAMA: A guitar is playing. |
Gemini: Several options depending on the mood you want to convey: **Formal:** "Enjoying a lively piece of traditional Klezmer music." **Informal/Fun:** "This Klezmer tune is making me want to dance!" or "Getting my groove on with some Klezmer!" **Descriptive:** "The frenetic energy of this Klezmer piece is infectious!" or "Listening to the beautiful clarinet melodies of this Klezmer song." **If you know the song/artist:** "[Song Title] by [Artist Name] - a Klezmer classic!" **If you're at a performance:** "Live Klezmer music is the best!" or "Soaking up the atmosphere at this amazing Klezmer concert." |
Audio Flamingo (chat): The song has a bluesy melody on the acoustic guitar. |
Audio Flamingo 2 (ours): This is a live performance of a blues rock piece. There is an electric guitar playing a solo using the tapping technique. The piece has a blues feel to it. The atmosphere is groovy. This piece could be playing in the background at a rock bar. |
LongAudioBench
Ground Truth: (b) a man speaking in English |
LTU: A man kicking the door can be heard towards the end of the audio. |
Qwen2-Audio-Instruct: (C) A man kicking a door |
GAMA: (a) a man dancing |
Gemini: The correct answer is (b). A man's voice clearly says, "It's over here... Don't lose me." There's no other sound effect that matches the other choices. |
Audio Flamingo (chat): A man speaking in English. The sound of a door being kicked can be heard towards the end of the audio. |
Audio Flamingo 2 (ours): (b) a man speaking in English |
Ground Truth: A man and a woman are heard talking after the background noise and before the clicking sounds. |
LTU: A man is talking. |
Qwen2-Audio-Instruct: Various background noises and occasional speech as someone moves around a cluttered room handling objects. |
GAMA: The audio clip does not provide enough information to determine what happens between the background noise and the clicking sounds. |
Gemini: Between the background noise and the clicking sounds, the person in the video is searching for a scrapbook. They are looking through drawers, bags, and other containers, rummaging around and making the rustling and clicking sounds in the process. They are having a conversation with someone off-camera about the scrapbook and where the pictures might be. They express uncertainty about the scrapbook’s location and wonder if it’s been moved. |
Audio Flamingo (chat): Between the background noise and the clicking sounds, there's a brief segment where someone can be heard speaking. This might indicate that someone is performing some activity that involves the use of a device, possibly a computer or a calculator. |
Audio Flamingo 2 (ours): A man speaks in English, followed by a woman speaking in English. |
Ground Truth: A man's voice turns fearful amidst the continuous background noise and animal sounds. |
LTU: The sad music conveys a fearful moment in the audio. |
Qwen2-Audio-Instruct: The change from birds chirping to a loud, high-pitched tone suggests a moment of fear. |
GAMA: The suddenness of the human voice and the high pitch of the tone. |
Gemini: The sharp gasp or intake of breath at [00:00:48] indicates a sudden moment of fear. |
Audio Flamingo (chat): The ominous and threatening nature of the deep resonant bell sound, along with the low-pitched tones of the wind, all contribute to a sudden moment of fear in the audio. |
Audio Flamingo 2 (ours): A male voice speaks in a fearful tone amidst the background noise and bird chirping. |
Ground Truth: An instrumental theme with electric guitar, keyboard chords, bass guitar pattern, and drum beat. |
LTU: There is no information about the musical elements and SAG Awards nominations in the audio. |
Qwen2-Audio-Instruct: An instrumental theme with electric guitar, keyboard chords, bass guitar pattern, and drum beat. |
GAMA: The audio clip does not provide information about any musical elements accompanying the announcement. |
Gemini: The audio clip includes some upbeat, triumphant-sounding music accompanying the announcement of the *Bombshell* cast's nomination. It sounds orchestral/instrumental, possibly with horns. |
Audio Flamingo (chat): The announcement of the SAG Awards nominations is accompanied by music, narration from an announcer, and sound effects, which all contribute to a specific mood of excitement and anticipation. |
Audio Flamingo 2 (ours): A male voice announces the nominations over a slow-tempo instrumental piece with a groovy bass line and a simple drum beat. |
Ground Truth: The speaker's voice turns from happy to disgusted when discussing unusual methods. |
LTU: The background sound turns suspenseful, which indicates a shift in the mood of the speaker. |
Qwen2-Audio-Instruct: The transition from no music to music indicates a shift in the speaker's mood. |
GAMA: The shift in the speaker's mood is not specified in the given information. It is possible that the speaker's tone or inflection changes, but without further context, it is difficult to determine the exact shift in mood. |
Gemini: The shift in the speaker's mood is indicated by the interruption and side conversations happening while he's giving his sermon. We hear whispers, someone saying "Stop it," and requests to have the disruptive element taken outside. These indicate the speaker is likely becoming increasingly frustrated and perhaps embarrassed by the disruption. The final exasperated sigh further emphasizes the shift from confident preaching to annoyance and exasperation. |
Audio Flamingo (chat): After the speaker finishes speaking, there is a sharp, brief sound that could be a gasp or a sigh, which could indicate surprise or relief. However, without specific details, it's hard to confirm if that's the case. |
Audio Flamingo 2 (ours): The male voice transitions from a neutral tone to a disgusted mood while discussing the man's actions. |
MMAU test-mini
Ground Truth: (C) Helicopter and impact sounds. |
LTU: The source of the sound is likely whistling and chatter. |
Qwen2-Audio-Instruct: The specific sounds that can be distinguished are birds chirping and water flowing. |
GAMA: b) Whistling and chatter |
Gemini: The audio clip contains the distinct sounds of a helicopter and what sounds like an impact or crash. So the answer is (C). |
Audio Flamingo (chat): The audio clip primarily features the sounds of a vehicle horn and a heavy engine, accompanied by other human sounds such as bird whistling and chatter. |
Audio Flamingo 2 (ours): (C) Helicopter and impact sounds |
Ground Truth: (B) F#:maj7/1. |
LTU: The chord playing in the audio is not mentioned. |
Qwen2-Audio-Instruct: The chord played right before the last chord in the audio is A# major (A#maj7/1). |
GAMA: The chord playing in the audio is a) C#:maj7/1 |
Gemini: The chord played right before the last chord is a G#7 chord. So the correct answer is (C). |
Audio Flamingo (chat): An intro riff featuring a single chord from the acoustic-guitar is strummed before the last chord in the audio. |
Audio Flamingo 2 (ours): B) F#:maj7/1 |
Expert Reasoning (MuChoMusic)
Ground Truth: (B) Guitar |
LTU: Emotional male vocal delivery pperforms the long melancholic solo section. |
Qwen2-Audio-Instruct: The answer is D: Emotional male vocal delivery. |
GAMA: (D) Emotional male vocal delivery |
Gemini: The long melancholic solo is played on a guitar. So the answer is (B). |
Audio Flamingo (chat): The vocals performs the long melancholic solo section. |
Audio Flamingo 2 (ours): (B) Guitar |
Ground Truth: (B) EDM and House |
LTU: (A) Techno and Trance |
Qwen2-Audio-Instruct: The answer is D: Emotional male vocal delivery. |
GAMA: The primary combination of genres in this music piece is (D) Electronic Disco, which encompasses elements of both techno and trance. |
Gemini: The primary genres combined in this piece are (A) Techno and Trance. |
Audio Flamingo (chat): The two genres that are primarily combined in this music piece are techno and trance. |
Audio Flamingo 2 (ours): (b) edm and house |
AF-CLAP
Representations in current CLAP models struggle with compositional reasoning and linguistic variations in captions. We introduce an improved version of CLAP called AF-CLAP, where we (1) construct a large-scale, high-quality training dataset, and (2) improve the training objective for better representational quality and robustness.
For each audio-caption pair, we construct linguistically varied captions with identical semantics and composition, and treat these as additional positives. We then generate caption variations with modified temporal or attribute compositions, and treat these as additional negatives. Our improved contrastive loss takes these additional positives and negatives into account, which leads to more human-aligned representations and better results (in both representation learning and audio understanding).
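To make the role of the additional positives and negatives concrete, here is a minimal sketch of a contrastive loss for a single audio clip. It is an illustration under stated assumptions, not the AF-CLAP objective itself: the function name `multi_positive_contrastive_loss`, the temperature of 0.07, the toy 2-D embeddings, and the choice to average one InfoNCE-style term per positive are all hypothetical. A batch-level loss would also use the other captions in the batch as negatives, which is omitted here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def multi_positive_contrastive_loss(audio, positives, negatives, temperature=0.07):
    """Assumed loss form: each positive caption is contrasted against all
    negative captions for the same audio, and the per-positive terms are
    averaged. This is a sketch, not the published AF-CLAP loss."""
    pos_logits = [cosine(audio, p) / temperature for p in positives]
    neg_logits = [cosine(audio, n) / temperature for n in negatives]
    terms = []
    for pos in pos_logits:
        logits = [pos] + neg_logits
        # Log-sum-exp with the maximum subtracted for numerical stability.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        terms.append(log_denominator - pos)  # equals -log softmax(pos)
    return sum(terms) / len(terms)


# Toy embeddings (made up): paraphrases sit near the audio embedding,
# composition-altered captions sit far from it.
audio = [1.0, 0.0]
paraphrases = [[0.9, 0.1], [0.8, 0.2]]        # additional positives
altered_captions = [[0.1, 0.9], [0.0, 1.0]]   # additional negatives

aligned = multi_positive_contrastive_loss(audio, paraphrases, altered_captions)
confused = multi_positive_contrastive_loss(audio, altered_captions, paraphrases)
print(f"aligned={aligned:.4f} confused={confused:.4f}")
```

In this toy setup the loss is small when paraphrases lie close to the audio embedding and altered captions lie far away, and large when the roles are swapped, which is the behavior the additional positives and negatives are meant to encourage.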


Curriculum Training
Audio Flamingo 2 is trained with a 3-stage curriculum.
- Pre-training: we train the transformation and cross-attention layers on audio of up to 30 seconds, using the pre-training dataset.
- Fine-tuning: we train all layers except the LLM layers on audio of up to 1.5 minutes, using the fine-tuning dataset.
- Long fine-tuning: we train the transformation and cross-attention layers on audio of up to 5 minutes, using the LongAudio dataset.
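The three stages can be summarized as a small configuration table. The sketch below is only a way to organize the information stated above: the `Stage` class, the module group names, and in particular the presence of an `audio_encoder` group are assumptions made for illustration and do not come from the released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    max_audio_seconds: int
    dataset: str
    trainable: frozenset  # module groups updated in this stage


# Hypothetical module grouping; the names are illustrative only.
ALL_GROUPS = frozenset({"audio_encoder", "transformation", "cross_attention", "llm"})
ADAPTER_GROUPS = frozenset({"transformation", "cross_attention"})

CURRICULUM = [
    Stage("pre-training", 30, "pre-training dataset", ADAPTER_GROUPS),
    # "All but LLM layers": every group except the language model.
    Stage("fine-tuning", 90, "fine-tuning dataset", ALL_GROUPS - {"llm"}),
    Stage("long fine-tuning", 300, "LongAudio", ADAPTER_GROUPS),
]

for stage in CURRICULUM:
    frozen = sorted(ALL_GROUPS - stage.trainable)
    print(f"{stage.name}: up to {stage.max_audio_seconds}s, "
          f"train={sorted(stage.trainable)}, frozen={frozen}")
```

Written this way, it is easy to see that the language model stays frozen in every stage and that the audio length grows from 30 seconds to 5 minutes across the curriculum.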
Additional ablations on the curriculum training can be found below.

Benchmark Results
Audio Flamingo 2 has SOTA foundational audio understanding abilities.

Audio Flamingo 2 has SOTA audio reasoning and long audio understanding abilities.
