Projects

Audio Flamingo 3

Published: July 09, 2025

We present Audio Flamingo 3 (AF3), a fully open state-of-the-art (SOTA) large audio-language model that advances reasoning and understanding across speech, sound, and music. AF3 introduces: (i) AF-Whisper, a unified audio encoder trained using a novel strategy for joint representation learning across all 3 modalities of speech, sound, and music; (ii) flexible, on-demand thinking, allowing the model to do chain-of-thought-type reasoning before answering; (iii) multi-turn, multi-audio chat; (iv) long audio understanding and reasoning (including speech) up to 10 minutes; and (v) voice-to-voice interaction. To enable these capabilities, we propose several large-scale training datasets curated using novel strategies, including AudioSkills-XL, LongAudio-XL, AF-Think, and AF-Chat, and train AF3 with a novel five-stage curriculum-based training strategy. Trained on only open-source audio data, AF3 achieves new SOTA results on over 20+ (long) audio understanding and reasoning benchmarks, surpassing both open-weight and closed-source models trained on much larger datasets.

Nemotron-CrossThink: Scaling Self-Learning beyond Math Reasoning

Published: April 29, 2025

We propose Nemotron-CrossThink, a framework for training AI models to reason across diverse real-world domains by curating multi-domain data, applying structured reward templates, and blending data, leading to significant accuracy and efficiency gains in both math and non-math reasoning.

Training Math Reasoning Model with Reinforcement Learning

Published: April 23, 2025

We present AceMath-RL-Nemotron-8B, a math reasoning model trained using reinforcement learning.

Nemotron-MIND: Math Informed syNthetic Dialogues for Pretraining LLMs

Published: April 23, 2025

We propose a novel large-scale and diverse Math Informed syNthetic Dialogue (Nemotron-MIND) generation method that improves the mathematical reasoning ability of LLMs. Compared to pretraining just on raw data, a model pretrained on data generated by Nemotron-MIND shows significant boost in mathematical reasoning (GSM8K: +13.42%, MATH: +2.30%), including superior performance in specialized knowledge (MMLU: +4.55%) and general purpose reasoning tasks (General Reasoning: +2.51%).

From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models

Published: April 14, 2025

We introduce UltraLong-8B, a suite of models based on Llama3.1-8B-Instruct, designed to support 1M, 2M, and 4M context windows.

Nemotron-CORTEXA: Enhancing LLM Agents for Software Engineering Tasks via Improved Localization and Solution Diversity

Published: April 09, 2025

Nemotron-CORTEXA is a software engineering agent that can resolve real-world Github issues. Nemotron-CORTEXA resolves 68.20% of the issues from the SWE-bench Verified set, while only consuming $3.28 in LLM inference calls per problem, achieving a new state-of-the-art in resolution rate.

Nemotron-H: A Family of Accurate, Efficient Hybrid Mamba-Transformer Models

Published: March 21, 2025

Nemotron-H is a series of hybrid Mamba-Transformer models which offer either better or on-par accuracy and improved inference speed (up to 3x) compared to other similarly-sized state-of-the-art open-sourced pure Transformer models (e.g., Qwen-2.5-7B/72B and Llama-3.1-8B/70B).

DLSS 4: Transforming Real-Time Graphics with AI

Published: March 13, 2025

DLSS 4 is an advanced AI-driven rendering suite designed to elevate real-time gaming across three key pillars: image quality, frame rate, and input latency. It introduces Multi Frame Generation (MFG) to quadruple performance, transformer-based architectures for Ray Reconstruction and Super Resolution to boost fidelity, and FrameWarp to reduce latency via real-time reprojection. Critically, DLSS 4 focuses on seamless adoption, enabling rapid integration into existing games so GeForce users can benefit immediately. This practical emphasis narrows the research domain but ensures that DLSS 4 delivers substantial, tangible value in modern production environments, setting a new standard for visual fidelity and responsiveness in gaming.

Audio Flamingo 2

Published: February 13, 2025

Understanding and reasoning over non-speech sounds and music are crucial for both humans and AI agents to interact effectively with their environments. In this paper, we introduce Audio Flamingo 2, an Audio-Language Model (ALM) with advanced long-audio understanding and reasoning capabilities. Audio Flamingo 2 achieves the state-of-the-art performance across over 20 benchmarks, with only a 3B parameter small language model. In addition, we propose training and test sets for long audio understanding capabilities – namely LongAudio and LongAudioBench – to advance this field, and Audio Flamingo 2 is the first ALM that can understand long audio up to 5 minutes. We confirm the efficacy of our method through extensive evaluations and ablation studies.

A2SB: Audio-to-Audio Schrodinger Bridges

Published: January 17, 2025

Audio in the real world may be perturbed due to numerous factors, causing the audio quality to be degraded. The following work presents an audio restoration model tailored for high-res music at 44.1kHz. Our model, Audio-to-Audio Schrodinger Bridges (A2SB), is capable of both bandwidth extension (predicting high-frequency components) and inpainting (re-generating missing segments). Critically, A2SB is end-to-end without need of a vocoder to predict waveform outputs, able to restore hour-long audio inputs, and trained on permissively licensed music data. A2SB is capable of achieving state-of-the-art bandwidth extension and inpainting quality on several out-of-distribution music test sets.

AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling

Published: December 19, 2024

We introduce AceMath, a family of frontier math reasoning models that set new state-of-the-art performance on math reasoning benchmarks. AceMath outperforms both leading open-access models (e.g., Qwen2.5-Math-72B-Instruct) and proprietary models (e.g., GPT-4o (2024-08-06) and Claude 3.5 Sonnet (2024-10-22)).

Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset

Published: December 03, 2024

We are excited to release Nemotron-CC, our high quality Common Crawl based 6.3 trillion tokens dataset (4.4T globally deduplicated original tokens and 1.9T synthetically generated tokens).

Elucidating the Design Space of Text-to-Audio Models

Published: October 25, 2024

ETTA is a text-to-audio model trained on publicly available audio datasets with synthetic captions. ETTA significantly outperforms open-sourced baseline models and is comparable to models trained on proprietary data. Furthermore, ETTA has an improved ability to generate creative audio using imaginative text prompts.

NVLM: Open Frontier-Class Multimodal LLMs

Published: September 17, 2024

We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary models (e.g., GPT-4o) and open-access models (e.g., Llama 3-V 405B and InternVL 2). Remarkably, NVLM 1.0 shows improved text-only performance over its LLM backbone after multimodal training.

P-Flow: A Fast and Data-Efficient Zero-Shot TTS through Speech Prompting

Published: October 24, 2023

P-Flow is a fast and data-efficient zero-shot text-to-speech model that uses speech prompting mechanism for speaker adaptation.

Progressive Learning of 3D Reconstruction Network from 2D GAN Data

Published: March 28, 2023

Recommended citation: Aysegul Dundar, Jun Gao, Andrew Tao, Bryan Catanzaro, Progressive Learning of 3D Reconstruction Network from 2D GAN Data.

RADMMM: Multilingual Multiaccented Multispeaker TTS with RADTTS

Published: January 20, 2023

We present an multi-lingual multi-accented multi-speaker (MMM) speech synthesis system extending on our previous work with RADTTS, RADTTS++ and Alignment Learning Framework. Our method doesn’t rely on data with speaker(s) speaking multiple languages and allows generating speech in a desired language seen by the model with the proper accent while retaining the characteristics of an individual voice.

BigVGAN: A Universal Neural Vocoder with Large-Scale Training

Published: June 10, 2022

we present BigVGAN, a universal neural vocoder. It’s trained only on speech data but shows extraordinary zero-shot generalization ability for non-speech vocalizations (laughter, applaud), singing voices, music, instrumental audio that are even recorded in varied noisy environment!

Fine Detailed Texture Learning for 3D Meshes with Generative Models

Published: March 17, 2022

Recommended citation: Aysegul Dundar, Jun Gao, Andrew Tao, Bryan Catanzaro, Fine Detailed Texture Learning for 3D Meshes with Generative Models, arXiv:2203.09362, 2022. https://arxiv.org/abs/2203.09362

Speech Denoising in the Waveform Domain with Self-Attention

Published: February 01, 2022

We present CleanUNet, a speech denoising model on the raw waveform. It is based on an encoder-decoder architecture combined with several self-attention blocks to refine its bottleneck representations, which is crucial to obtain good results. It outperforms the state-of-the-art models in terms of denoised speech quality from various objective and subjective evaluation metrics.

One TTS Alignment to Rule Them All

Published: August 20, 2021

We present an unsupervised alignment learning framework that learns speech-text alignments online in text to speech models. We showcase this alignment learning framework can be applied to any TTS model removing the dependency of TTS systems on external aligners. It also enhances the speech quality as evaluated by human evaluators.

RAD-TTS: Parallel Flow-Based TTS with Robust Alignment Learning and Diverse Synthesis

Published: August 16, 2021

RAD-TTS is a parallel flow-based generative network for text-to-speech synthesis which does not rely on external aligners to learn speech-text alignments and supports diversity in generated speech by modeling speech rhythm as a separate generative distribution.

Long-Short Transformer: Efficient Transformers for Language and Vision

Published: July 29, 2021

Long-Short Transformer is an efficient self-attention mechanism for modeling long sequences with linear complexity for both language and vision tasks

View Generalization for Single Image Textured 3D Models

Published: June 13, 2021

Recommended citation: Anand Bhattad, Aysegul Dundar, Guilin Liu, Andrew Tao, Bryan Catanzaro, View Generalization for Single Image Textured 3D Models, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition(CVPR) 2021.

MegatronLM’s Supercharged V1.0

Published: May 15, 2020

We release version 1.0 of Megatron which makes the training of large NLP models even faster and sustains 62.4 teraFLOPs in the end-to-end training that is 48% of the theoretical peak FLOPS for a single GPU in a DGX2-H server.

Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis

Published: May 12, 2020

Flowtron is an autoregressive flow-based generative network for text-to-speech synthesis with direct control over speech variation and style transfer

Mellotron: Multispeaker expressive voice synthesis by conditioning on rhythm, pitch and global style tokens

Published: October 23, 2019

Mellotron is a multispeaker voice synthesis model that can make a voice emote and sing without emotive or singing training data

Unsupervised Video Interpolation Using Cycle Consistency

Published: September 26, 2019

We propose unsupervised techniques to synthesize high frame rate videos directly from low frame rate videos using cycle consistency. We also introduce a pseudo-supervised loss term that enforces the interpolated frames to be consistent with predictions of a pre-trained interpolation model. The pseudo-supervised loss term, used together with cycle consistency, can effectively adapt a pre-trained model to a new target domain. We show results that significantly reduce the domain gap problem in video frame interpolation.

Recommended citation: Fitsum A. Reda, Deqing Sun, Aysegul Dundar, Mohammad Shoeybi, Guilin Liu, Kevin J. Shih, Andrew Tao, Jan Kautz, Bryan Catanzaro, "Unsupervised Video Interpolation Using Cycle Consistency". In ICCV 2019. https://arxiv.org/abs/1906.05928

MegatronLM: Training Billion+ Parameter Language Models Using GPU Model Parallelism

Published: August 13, 2019

We train an 8.3 billion parameter transformer language model with 8-way model parallelism and 64-way data parallelism on 512 GPUs, making it the largest transformer based language model ever trained at 24x the size of BERT and 5.6x the size of GPT-2

Partial Convolution based Padding

Published: December 10, 2018

Recommended citation: Guilin Liu, Kevin J. Shih, Ting-Chun Wang, Fitsum A. Reda, Karan Sapra, Zhiding Yu, Andrew Tao, Bryan Catanzaro, Partial Convolution based Padding, arXiv:1811.11718, 2018. https://arxiv.org/abs/1811.11718

Image Inpainting for Irregular Holes Using Partial Convolutions

Published: December 09, 2018

Recommended citation: Guilin Liu, Fitsum A. Reda, Kevin J. Shih, Ting-Chun Wang, Andrew Tao, Bryan Catanzaro, Image Inpainting for Irregular Holes Using Partial Convolutions, Proceedings of the European Conference on Computer Vision (ECCV) 2018. https://arxiv.org/abs/1804.07723

Improving Semantic Segmentation via Video Propagation and Label Relaxation

Published: December 05, 2018

This paper shows how to scale up training sets for semantic segmentation by using video prediction-based data synthesis method. Our proposed joint propagation strategy and boundary relaxation technique can alleviate the label noise in the synthesized samples and lead to state-of-the-art performance on three benchmark datasets Cityscapes, CamVid and KITTI.

Recommended citation: Yi Zhu, Karan Sapra, Fitsum A. Reda, Kevin J. Shih, Shawn Newsam, Andrew Tao and Bryan Catanzaro, Improving Semantic Segmentation via Video Propagation and Label Relaxation, arXiv:1812.01593, 2018. https://arxiv.org/abs/1812.01593

WaveGlow: a Flow-based Generative Network for Speech Synthesis

Published: October 29, 2018

WaveGlow is an invertible neural network that can generate high quality speech efficiently from mel-spectrograms.

SDCNet: Video Prediction Using Spatially Displaced Convolution

Published: September 08, 2018

SDCNet is a 3D convolutional neural network proposed for frame prediction. The model takes as input a sequence of past frames and their inter-frame optical flows and generates a per-pixel kernel and motion vector. A future frame is then synthesised by sampling past frames guided by the motion vectors and weighted by the learned kernels.

Recommended citation: Fitsum A. Reda, Guilin Liu, Kevin J. Shih, Robert Kirby, Jon Barker, David Tarjan, Andrew Tao, Bryan Catanzaro, SDCNet: Video Prediction Using Spatially Displaced Convolution. ECCV 2018. https://arxiv.org/abs/1811.00684

Large Scale Language Modeling: Converging on 40GB of Text in Four Hours

Published: August 03, 2018

This paper shows how to do large scale distributed, large batch, mixed precision training of language models with investigations into the successes and limitations of large batch training on publicly available language datasets.

Recommended citation: Raul Puri, Robert Kirby, Nikolai Yakovenko, Bryan Catanzaro, Large Scale Language Modeling: Converging on 40GB of Text in Four Hours. arXiv. 2018. https://arxiv.org/abs/1808.01371

Malware Detection by Eating a Whole EXE

Published: February 02, 2018

This paper shows how to do whole binary classification for malware detection with a convolutional neural network. Done in collaboration with researchers at the University of Maryland.

Recommended citation: Edward Raff, Jon Barker, Jared Sylvester, Robert Brandon, Bryan Catanzaro, Charles Nicholas, Malware Detection by Eating a Whole EXE. arXiv. 2017. http://arxiv.org/abs/1710.09435

NVIDIA ADLR

Projects