Investigating End-to-End ASR Architectures for Long Form Audio Transcription

This paper presents an overview and evaluation of several end-to-end ASR models on long-form audio. We study three categories of Automatic Speech Recognition (ASR) models based on their core architecture: (1) convolutional, (2) convolutional with squeeze-and-excitation, and (3) convolutional with attention. We selected one ASR model from each category and evaluated its word error rate, maximum audio length, and real-time factor on a variety of long-form audio benchmarks: Earnings-21 and Earnings-22, CORAAL, and TED-LIUM3.
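The two headline metrics are simple to compute. Below is a minimal sketch, assuming the third-party jiwer package for word error rate and a generic model.transcribe call as a placeholder for whichever ASR model is being evaluated:

```python
import time
import jiwer  # third-party package providing a standard WER computation


def evaluate_long_form(model, audio_path, audio_duration_sec, reference_text):
    """Transcribe one long-form recording and report WER and real-time factor."""
    start = time.perf_counter()
    hypothesis = model.transcribe(audio_path)  # placeholder transcription API
    elapsed = time.perf_counter() - start

    # Word error rate: word-level edit distance between reference and hypothesis.
    wer = jiwer.wer(reference_text, hypothesis)

    # Real-time factor: processing time divided by audio duration
    # (RTF below 1.0 means the model runs faster than real time).
    rtf = elapsed / audio_duration_sec
    return wer, rtf
```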

SteerLM: Attribute Conditioned SFT as an (User-Steerable) Alternative to RLHF

Model alignment with human preferences is an essential step in making Large Language Models (LLMs) helpful and consistent with human values. It typically consists of supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) stages. However, RLHF faces inherent limitations stemming from a complex training setup and its tendency to align the model with implicit values that end users cannot control at run-time.
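To make the attribute-conditioned SFT idea concrete, here is an illustrative sketch of how a training example might prepend explicit attribute values to the prompt so that users can steer them at inference time. The template, tags, and attribute names below are assumptions for illustration, not the exact SteerLM format:

```python
def format_attribute_conditioned_example(prompt, response, attributes):
    """Build one SFT example whose target response is conditioned on explicit
    attribute values (illustrative template, not the exact SteerLM format)."""
    # Attributes are rendered as plain text, e.g. "quality:9,toxicity:0",
    # so they can be changed at run-time to steer the model's behavior.
    attribute_str = ",".join(f"{name}:{value}" for name, value in attributes.items())
    return (
        f"<prompt>{prompt}</prompt>\n"
        f"<attributes>{attribute_str}</attributes>\n"
        f"<response>{response}</response>"
    )


example = format_attribute_conditioned_example(
    prompt="Explain photosynthesis to a child.",
    response="Plants use sunlight to turn air and water into food.",
    attributes={"quality": 9, "toxicity": 0},
)
```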

NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails

NeMo Guardrails is an open-source toolkit for easily adding programmable guardrails to LLM-based conversational systems. Guardrails (or rails for short) are a specific way of controlling the output of an LLM, such as not talking about topics considered harmful, following a predefined dialogue path, using a particular language style, and more. There are several mechanisms that allow LLM providers and developers to add guardrails that are embedded into a specific model at training time, e.g., model alignment. In contrast, programmable rails are applied at runtime and can be defined independently of the underlying model.
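A minimal usage sketch based on the toolkit's documented Python entry points is shown below; the "./config" directory is a placeholder for a rails configuration (Colang flows plus YAML model settings), and exact return formats may vary between versions:

```python
from nemoguardrails import LLMRails, RailsConfig

# Load a rails configuration from a local directory ("./config" is a placeholder).
config = RailsConfig.from_path("./config")
rails = LLMRails(config)

# The rails wrap the underlying LLM call: off-topic or unsafe requests can be
# intercepted before or after generation according to the configured flows.
response = rails.generate(messages=[
    {"role": "user", "content": "Hello! What can you help me with?"}
])
print(response["content"])
```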

Stateful Conformer with Cache-based Inference for Streaming Automatic Speech Recognition

In this paper, we propose an efficient and accurate streaming speech recognition model based on the FastConformer architecture. We adapted the FastConformer architecture for streaming applications by: (1) constraining both the look-ahead and past contexts in the encoder, and (2) introducing an activation caching mechanism that lets the non-autoregressive encoder operate autoregressively during inference. The proposed model is designed to eliminate the accuracy disparity between training and inference time that is common to many streaming models.
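The caching idea can be sketched as a chunk-wise inference loop that carries a fixed-size cache of past encoder activations from one chunk to the next, so per-step cost stays constant regardless of audio length. The function names below are placeholders for the real model components, not the actual NeMo API:

```python
def stream_transcribe(encoder_step, decoder_step, audio, chunk_size, cache_size):
    """Illustrative chunk-wise streaming loop with an activation cache.

    `encoder_step(chunk, cache)` returns (encoded_chunk, new_cache) and
    `decoder_step(encoded_chunk)` returns text; both are placeholders.
    """
    cache = None  # holds past-context activations between chunks
    transcript = []
    for start in range(0, len(audio), chunk_size):
        chunk = audio[start:start + chunk_size]
        # The encoder sees only the current chunk (bounded look-ahead) plus a
        # bounded cache of past activations (bounded left context).
        encoded, cache = encoder_step(chunk, cache)
        # Keep only the most recent context frames (assumes time is the last axis).
        cache = cache[..., -cache_size:]
        transcript.append(decoder_step(encoded))
    return "".join(transcript)
```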

A Chat about Boring Problems: Studying GPT-Based Text Normalization

Text normalization - the conversion of text from written to spoken form - is traditionally assumed to be an ill-formed task for language modeling. In this work, we argue otherwise. We empirically show the capacity of Large Language Models (LLMs) for text normalization in few-shot scenarios. Combining self-consistency reasoning with linguistically informed prompt engineering, we find that LLM-based text normalization achieves error rates approximately 40% lower than production-level normalization systems.
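A minimal sketch of the few-shot plus self-consistency recipe is shown below: sample several candidate normalizations at a nonzero temperature and keep the most frequent one. The prompt wording and the llm_generate call are placeholders, not the paper's exact prompts:

```python
from collections import Counter

FEW_SHOT_PROMPT = (
    "Convert written-form text to spoken form.\n"
    "Written: The meeting is at 3:30 PM on 12/05.\n"
    "Spoken: the meeting is at three thirty p m on december fifth\n"
    "Written: {text}\n"
    "Spoken:"
)


def normalize_with_self_consistency(llm_generate, text, num_samples=5):
    """Sample several candidate normalizations and keep the most frequent one.

    `llm_generate(prompt, temperature)` is a placeholder for any LLM call
    returning a single sampled completion string.
    """
    prompt = FEW_SHOT_PROMPT.format(text=text)
    candidates = [llm_generate(prompt, temperature=0.7).strip()
                  for _ in range(num_samples)]
    # Self-consistency: the normalization agreed on by the most samples wins.
    return Counter(candidates).most_common(1)[0][0]
```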

Chenhui Deng

Chenhui Deng is currently a Research Scientist at NVIDIA, where he focuses on leveraging graph-based machine learning techniques for circuit problems. He earned his PhD in Electrical and Computer Engineering from Cornell University in 2024. His research lies at the intersection of Machine Learning, Spectral Graph Theory, Electronic Design Automation, and VLSI.

Novel Transformer Model Based Clustering Method for Standard Cell Design Automation

Standard cells are essential components of modern digital circuit designs. As process technologies advance beyond 5nm, routability issues have worsened due to the decreasing number of routing tracks (RTs), the increasing number and complexity of design rules, and strict patterning rules. The standard cell design automation framework can automatically design standard cell layouts, but it struggles to resolve the severe routability issues at advanced nodes.