Research Notes

February 8, 2024
in Announcements, NVIDIA Technical Blog
5 min read

New Standard for Speech Recognition and Translation from the NVIDIA NeMo Canary Model

Our team is thrilled to announce Canary, a multilingual model that sets a new standard in speech-to-text recognition and translation. Read more about it in our team's post on the NVIDIA Techblog.

January 31, 2024
in Announcements, NVIDIA Technical Blog
5 min read

Turbocharge ASR Accuracy and Speed with NVIDIA NeMo Parakeet-TDT

Our team is thrilled to announce the latest addition to the Parakeet family — Parakeet TDT. Learn more in our team's post on the NVIDIA Techblog.

January 3, 2024
in Announcements, NVIDIA Technical Blog
5 min read

Announcing NVIDIA NeMo Parakeet ASR Models for Pushing the Boundaries of Speech Recognition

We announce the release of Parakeet, a family of state-of-the-art automatic speech recognition (ASR) models. Learn more in our team's post on the NVIDIA Techblog.

September 28, 2023
in Technical deep-dive
5 min read

Training NeMo RNN-T Models Efficiently with Numba FP16 Support

In Automatic Speech Recognition research, the RNN Transducer (RNN-T) is a sequence-to-sequence model well known for achieving state-of-the-art transcription accuracy in both offline and real-time (also known as "streaming") speech recognition applications. RNN-T models are also notorious for their high memory requirements. In this blog post we explain why they have this reputation and how NeMo lets you sidestep many of these memory issues, including how to make use of Numba’s recently added FP16 support.
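One commonly cited source of this memory cost is the joint network output, which holds one score per batch item, audio frame, target position, and vocabulary entry. The sketch below only works through that arithmetic. The function name and all sizes are illustrative assumptions, not figures from the post or from NeMo.

```python
def joint_tensor_bytes(batch, frames, target_len, vocab, bytes_per_elem):
    """Bytes needed for an RNN-T joint output of shape (B, T, U + 1, V + 1)."""
    # U + 1 covers the start-of-sequence step; V + 1 covers the blank token.
    return batch * frames * (target_len + 1) * (vocab + 1) * bytes_per_elem


# Hypothetical sizes chosen only for illustration.
B, T, U, V = 16, 500, 100, 1024

fp32_bytes = joint_tensor_bytes(B, T, U, V, bytes_per_elem=4)
fp16_bytes = joint_tensor_bytes(B, T, U, V, bytes_per_elem=2)

print(f"FP32 joint tensor: {fp32_bytes / 2**30:.2f} GiB")
print(f"FP16 joint tensor: {fp16_bytes / 2**30:.2f} GiB")
```

Because the tensor grows with the product of all four dimensions, storing it in FP16 instead of FP32 halves this one term. Gradients and other activations add to the total and are not counted here.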

August 14, 2023
in Technical deep-dive
15 min read

How does forced alignment work?

In this blog post we will explain how you can use an Automatic Speech Recognition (ASR) model¹ to match up the text spoken in an audio file with the time when it is spoken. Once you have this information, you can do downstream tasks such as:

creating subtitles such as in the video below² or in the Hugging Face space
obtaining durations of tokens or words to use in Text To Speech or speaker diarization models
splitting long audio files (and their transcripts) into shorter ones. This is especially useful when making datasets for training new ASR models, since audio files that are too long will not be able to fit onto a single GPU during training.³
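As a rough illustration of the matching step, the sketch below runs a Viterbi search over per-frame log-probabilities, assuming a CTC-style model with a blank token. This is a minimal sketch, not the implementation used in the post or in NeMo. The function names, the toy vocabulary, and the peaked probabilities are all hypothetical.

```python
import math


def ctc_forced_align(log_probs, tokens, blank=0):
    """Return the most likely token id (or blank) for every frame.

    log_probs: list of T frames, each a list of V log-probabilities.
    tokens: the known target token ids, without blanks.
    Assumes there are enough frames to emit every token.
    """
    # Interleave blanks with the targets: [blank, t1, blank, t2, ..., blank].
    states = [blank]
    for tok in tokens:
        states += [tok, blank]
    T, S = len(log_probs), len(states)

    neg_inf = float("-inf")
    score = [[neg_inf] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]

    # A path may start in the leading blank or in the first token.
    score[0][0] = log_probs[0][states[0]]
    if S > 1:
        score[0][1] = log_probs[0][states[1]]

    for t in range(1, T):
        for s in range(S):
            # Allowed moves: stay, advance by one state, or skip the blank
            # that sits between two different tokens.
            candidates = [s]
            if s >= 1:
                candidates.append(s - 1)
            if s >= 2 and states[s] != blank and states[s] != states[s - 2]:
                candidates.append(s - 2)
            best = max(candidates, key=lambda p: score[t - 1][p])
            score[t][s] = score[t - 1][best] + log_probs[t][states[s]]
            back[t][s] = best

    # A valid path ends in the last token or in the trailing blank.
    s = S - 1
    if S > 1 and score[T - 1][S - 2] > score[T - 1][S - 1]:
        s = S - 2
    path = [s]
    for t in range(T - 1, 0, -1):
        s = back[t][s]
        path.append(s)
    path.reverse()
    return [states[s] for s in path]


def peaked(index, vocab=3, p=0.9):
    """Toy frame: probability p on one entry, the rest shared equally."""
    q = (1 - p) / (vocab - 1)
    return [math.log(p if v == index else q) for v in range(vocab)]


# Six toy frames whose most likely entries are blank, 1, 1, blank, 2, blank.
log_probs = [peaked(i) for i in [0, 1, 1, 0, 2, 0]]
alignment = ctc_forced_align(log_probs, tokens=[1, 2])
print(alignment)  # [0, 1, 1, 0, 2, 0]
```

Multiplying each frame index by the model's frame duration turns this per-frame alignment into start and end times for every token.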

August 14, 2023
in Announcements
2 min read

Introducing NeMo Forced Aligner

Today we introduce NeMo Forced Aligner: a NeMo-based tool for forced alignment.

NFA allows you to obtain token-level, word-level and segment-level timestamps for words spoken in an audio file. NFA produces timestamp information in a variety of output file formats, including subtitle files, which you can use to create videos such as the one below¹:

June 7, 2023
in Papers
10 min read

Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition

The Conformer architecture, introduced by Gulati et al., has become a standard architecture not only for Automatic Speech Recognition but also for other tasks such as Spoken Language Understanding and Speech Translation, and it serves as a backbone for Self Supervised Learning for various downstream tasks. While Conformer models are highly accurate on each of these tasks and can be extended to others, they are also very computationally expensive. This is due to the quadratic complexity of the attention mechanism, which makes it difficult to train and run inference on long sequences. Such sequences are the norm for these models because of the granular stride of audio pre-processors (commonly Mel spectrograms with a 10 millisecond stride, or even the raw audio signal in certain models). Furthermore, the memory requirement of quadratic attention significantly limits the audio duration that can be processed during inference.
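To make the quadratic growth concrete, the sketch below estimates the size of a single self-attention score matrix for a given audio duration. The 10 millisecond stride comes from the text above. The function name, the subsampling factor of 4, and the 4 bytes per element are assumptions chosen for illustration only.

```python
def attention_scores_bytes(duration_s, stride_ms=10.0, subsampling=4, bytes_per_elem=4):
    """Bytes for one (frames x frames) attention score matrix."""
    # Frames from the pre-processor, reduced by an assumed subsampling
    # factor before they reach the attention layers.
    frames = int(duration_s * 1000 / stride_ms) // subsampling
    return frames * frames * bytes_per_elem


for minutes in (1, 10):
    size = attention_scores_bytes(minutes * 60)
    print(f"{minutes:>2} min of audio: {size / 2**20:.1f} MiB per head, per layer")
```

Under these assumptions, audio that is 10 times longer needs 100 times more memory for the score matrix, and that cost is repeated for every attention head and layer.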

January 1, 2023
in NVIDIA Technical Blog
2 min read

NeMo on the NVIDIA Technical blog in 2023

The following blog posts have been published by the NeMo team on the NVIDIA Technical blog in 2023.

January 2023

Based on work accepted to SLT 2022:

December 31, 2022
in NVIDIA Technical Blog
5 min read

NeMo on the NVIDIA Technical blog in 2022

The following blog posts were published by the NeMo team on the NVIDIA Technical blog in 2022.

August 2022

Solving Automatic Speech Recognition Deployment Challenges

September 2022

Simplifying Model Development and Building Models at Scale with PyTorch Lightning and NGC

Based on work accepted to Interspeech 2022:

August 7, 2022
2 min read

NeMo Blog Posts and Announcements

NVIDIA NeMo is a conversational AI toolkit that supports multiple domains such as Automatic Speech Recognition (ASR), Text to Speech generation (TTS), Speaker Recognition (SR), Diarization (SDR), Natural Language Processing (NLP), Neural Machine Translation (NMT) and much more. NVIDIA RIVA has long been the toolkit that enables efficient deployment of NeMo models. In recent months, NeMo Megatron has added support for training and inference on large language models (up to 1 trillion parameters!).

As NeMo becomes capable of more advanced tasks, such as p-tuning / prompt tuning of NeMo Megatron models, domain adaptation of ASR models using Adapter modules, customizable generative TTS models and much more, we introduce this website as a collection of blog posts and announcements for:

Technical deep dives into NeMo's capabilities
Presenting state-of-the-art research results
Announcing new capabilities and domains of research that our team will work on

Visit NVIDIA NeMo to get started