Large Language Model Based Generative Error Correction: A Challenge and Baselines for Speech Recognition, Speaker Tagging, and Emotion Recognition

Given recent advances in generative AI technology, a key question is how large language models (LLMs) can enhance acoustic modeling tasks using text decoding results from a frozen, pretrained automatic speech recognition (ASR) model. To explore new capabilities in language modeling for speech processing, we introduce the generative speech transcription error correction (GenSEC) challenge. This challenge comprises three post-ASR language modeling tasks: (i) post-ASR transcription correction, (ii) speaker tagging, and (iii) emotion recognition.
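
As a concrete illustration of post-ASR error correction (task (i)), the sketch below builds a prompt from an ASR N-best list and asks an LLM to emit a corrected transcript. The prompt wording and the `llm` callable are assumptions for illustration, not the challenge's official baseline.

    # Hedged sketch of N-best post-ASR error correction; the prompt format
    # and the `llm` interface are illustrative assumptions.
    def correct_transcript(nbest, llm):
        """Fuse an ASR N-best list into a corrected transcript.

        nbest: hypothesis strings from a frozen ASR model, best first.
        llm:   any callable mapping a prompt string to a completion string.
        """
        hyps = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(nbest))
        prompt = (
            "Below are N-best hypotheses from a speech recognizer.\n"
            f"{hyps}\n"
            "Output the most likely true transcription, fixing recognition errors:"
        )
        return llm(prompt).strip()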

Towards Neural Scaling Laws for Time Series Foundation Models

Scaling laws offer valuable insights into the design of time series foundation models (TSFMs). However, previous research has largely focused on the scaling laws of TSFMs for in-distribution (ID) data, leaving their out-of-distribution (OOD) scaling behavior and the influence of model architectures less explored. In this work, we examine two common TSFM architectures (encoder-only and decoder-only Transformers) and investigate their scaling behavior on both ID and OOD data. These models are trained and evaluated across varying parameter counts, ...

Audio Large Language Models Can Be Descriptive Speech Quality Evaluators

An ideal multimodal agent should be aware of the quality of its input modalities. Recent advances have enabled large language models (LLMs) to incorporate auditory systems for handling various speech-related tasks. However, most audio LLMs remain unaware of the quality of the speech they process. This limitation arises because speech quality evaluation is typically excluded from multi-task training due to the lack of suitable datasets. To address this, we introduce the first ...
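
To make the data gap concrete, a descriptive quality-evaluation corpus might pair each utterance with a numeric rating and a free-text account of its degradations. The record schema and all field names below are hypothetical, not the paper's actual format.

    # Hypothetical record for a descriptive speech-quality dataset;
    # the schema and field names are illustrative assumptions.
    import json

    record = {
        "audio": "utt_0001.wav",
        "mos": 3.2,  # mean opinion score, 1 (bad) to 5 (excellent)
        "description": "Audible background hum; speech intelligible "
                       "but slightly muffled in the second half.",
    }
    print(json.dumps(record, indent=2))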

UniWav: Towards Unified Pre-training for Speech Representation Learning and Generation

Pre-training and representation learning have been playing an increasingly important role in modern speech processing. Nevertheless, different applications have relied on different foundation models, since predominant pre-training techniques are designed for either discriminative or generative tasks. In this work, we make the first attempt at building a unified pre-training framework for both types of tasks in speech.
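
One way such a unified framework can be pictured is a shared encoder trained jointly with a discriminative objective (masked token prediction) and a generative one (feature reconstruction). The module interfaces and the 0.5 weighting below are assumptions for illustration, not UniWav's actual design.

    # Sketch of a joint discriminative + generative pre-training loss.
    # Encoder/decoder/head interfaces and the loss weighting are assumed.
    import torch.nn.functional as F

    def unified_loss(encoder, token_head, decoder, speech, mask, targets):
        # speech: (B, T, F) features; mask: (B, T) bool; targets: (B, T) ids
        hidden = encoder(speech)                  # shared representation
        disc = F.cross_entropy(token_head(hidden)[mask], targets[mask])
        recon = decoder(hidden)                   # generative reconstruction
        return disc + 0.5 * F.l1_loss(recon, speech)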

Mitigating Covariate Shift in Imitation Learning for Autonomous Vehicles Using Latent Space Generative World Models

We propose the use of latent space generative world models to address the covariate shift problem in autonomous driving. A world model is a neural network capable of predicting an agent's next state given past states and actions. By leveraging a world model during training, the driving policy effectively mitigates covariate shift without requiring an excessive amount of training data.
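
The mechanism can be sketched as unrolling the policy inside the learned latent dynamics, so the policy is trained on the state distribution it induces itself rather than only on logged expert states. The world_model and policy interfaces below, and the MSE imitation loss, are illustrative assumptions.

    # Sketch of closed-loop imitation training through a latent world model.
    # The world_model/policy interfaces and the MSE loss are assumptions.
    import torch.nn.functional as F

    def rollout_loss(world_model, policy, z0, expert_actions):
        """z0: (B, D) initial latent; expert_actions: (B, H, A) demonstrations.

        Each step conditions on the policy's own predicted state, so training
        covers the distribution the policy will actually visit at test time.
        """
        z, loss, horizon = z0, 0.0, expert_actions.shape[1]
        for t in range(horizon):
            action = policy(z)                      # act from predicted state
            loss = loss + F.mse_loss(action, expert_actions[:, t])
            z = world_model(z, action)              # predicted next latent
        return loss / horizon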

MambaVision: A Hybrid Mamba-Transformer Vision Backbone

We propose a novel hybrid Mamba-Transformer backbone, MambaVision, specifically tailored for vision applications. Our core contribution includes redesigning the Mamba formulation to enhance its capability for efficient modeling of visual features. Through a comprehensive ablation study, we demonstrate the feasibility of integrating Vision Transformers (ViT) with Mamba. Our results show that equipping the Mamba architecture with self-attention blocks in the final layers greatly improves its capacity to capture long-range spatial dependencies.
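
The described layout, sequence-mixing blocks with self-attention reserved for the final layers of a stage, can be sketched as follows. MixerBlock is a simple gated-convolution stand-in for a real Mamba block (the selective-SSM machinery is omitted), so the whole sketch is illustrative rather than the paper's architecture.

    # Illustrative hybrid stage: mixer blocks first, attention blocks last.
    import torch
    import torch.nn as nn

    class MixerBlock(nn.Module):            # stand-in for a Mamba block
        def __init__(self, dim):
            super().__init__()
            self.norm = nn.LayerNorm(dim)
            self.conv = nn.Conv1d(dim, dim, 3, padding=1, groups=dim)
            self.gate = nn.Linear(dim, dim)
        def forward(self, x):               # x: (B, N, C) token sequence
            h = self.norm(x)
            h = self.conv(h.transpose(1, 2)).transpose(1, 2)
            return x + h * torch.sigmoid(self.gate(x))

    class AttnBlock(nn.Module):             # self-attention for final layers
        def __init__(self, dim, heads=8):
            super().__init__()
            self.norm = nn.LayerNorm(dim)
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        def forward(self, x):
            h = self.norm(x)
            return x + self.attn(h, h, h)[0]

    class HybridStage(nn.Module):
        """First depth - n_attn mixer blocks, then n_attn attention blocks."""
        def __init__(self, dim, depth=8, n_attn=2):
            super().__init__()
            blocks = [MixerBlock(dim) for _ in range(depth - n_attn)]
            blocks += [AttnBlock(dim) for _ in range(n_attn)]
            self.blocks = nn.Sequential(*blocks)
        def forward(self, x):
            return self.blocks(x)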

Marco: Configurable Graph-Based Task Solving and Multi-AI Agents Framework for Hardware Design

Hardware design presents numerous challenges stemming from its complexity and advancing technologies. These challenges result in longer turnaround time (TAT) for optimizing performance, power, area, and cost (PPAC) during synthesis, verification, physical design, and reliability loops. Large Language Models (LLMs) have shown remarkable capacity to comprehend and generate natural language at a massive scale, leading to many potential applications and benefits across various domains.
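
A configurable graph of agents can be pictured as a task DAG whose nodes are LLM-backed steps executed in dependency order. The node names, example flow, and dispatch scheme below are hypothetical, not Marco's actual API.

    # Sketch of graph-based multi-agent task solving; node names and the
    # agent interface are illustrative assumptions.
    from graphlib import TopologicalSorter

    def run_flow(graph, agents, spec):
        """graph: {task: set of prerequisite tasks}; agents: {task: callable}."""
        results = {}
        for task in TopologicalSorter(graph).static_order():
            deps = {d: results[d] for d in graph.get(task, set())}
            results[task] = agents[task](spec, deps)   # one LLM-backed step
        return results

    # Example: a small generate -> lint -> verify flow for an RTL block.
    flow = {"rtl_gen": set(), "lint": {"rtl_gen"}, "verify": {"rtl_gen", "lint"}}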

Alán Aspuru-Guzik

Senior Director of Quantum Chemistry.

aspuru@nvidia.com | Toronto, Canada

My research at NVIDIA focuses on the intersection of Quantum Computing, Artificial Intelligence, and Chemical applications.

-- Near-term quantum algorithm development

-- Algorithms for fast quantum chemistry simulation on quantum and classical computers

-- Self-driving laboratories and robotics for chemical automation

-- Chemical generative models for materials design and drug discovery