Technical deep-dive

September 28, 2023
in Technical deep-dive
5 min read

Training NeMo RNN-T Models Efficiently with Numba FP16 Support

In Automatic Speech Recognition research, the RNN Transducer (RNN-T) is a sequence-to-sequence model well known for achieving state-of-the-art transcription accuracy in both offline and real-time (also known as "streaming") speech recognition applications. RNN-T models are also notorious for their high memory requirements. In this blog post we will explain why they have this reputation and how NeMo allows you to side-step many of the memory issues, including how to make use of Numba’s recent addition of FP16 support.
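
To give a rough sense of scale, here is a back-of-the-envelope sketch of how large the RNN-T joint output can get. The joint output has one entry per (utterance, audio frame, target token, vocabulary item), so it grows as the product of all four dimensions. The batch size, frame count, token count, and vocabulary size below are illustrative assumptions rather than figures from the post, and `joint_tensor_bytes` is a hypothetical helper, not a NeMo function.

```python
def joint_tensor_bytes(batch, frames, tokens, vocab, bytes_per_element):
    """Bytes needed for a joint output of shape (batch, frames, tokens + 1, vocab + 1)."""
    # +1 on the token axis for the start-of-sequence step,
    # +1 on the vocabulary axis for the blank symbol.
    return batch * frames * (tokens + 1) * (vocab + 1) * bytes_per_element


# Assumed, illustrative sizes: 16 utterances, 400 encoder frames,
# 100 target tokens, 1024 vocabulary entries.
fp32 = joint_tensor_bytes(16, 400, 100, 1024, bytes_per_element=4)
fp16 = joint_tensor_bytes(16, 400, 100, 1024, bytes_per_element=2)

print(f"FP32: {fp32 / 2**30:.2f} GiB")
print(f"FP16: {fp16 / 2**30:.2f} GiB")
```

Under these assumptions, storing the same tensor in FP16 takes half the space it needs in FP32, which is one reason half-precision support matters for this model family.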

August 14, 2023
in Technical deep-dive
15 min read

How does forced alignment work?

In this blog post we will explain how you can use an Automatic Speech Recognition (ASR) model¹ to match up the text spoken in an audio file with the time when it is spoken. Once you have this information, you can do downstream tasks such as:

creating subtitles such as in the video below² or in the Hugging Face space
obtaining durations of tokens or words to use in Text To Speech or speaker diarization models
splitting long audio files (and their transcripts) into shorter ones. This is especially useful when making datasets for training new ASR models, since audio files that are too long will not fit onto a single GPU during training.³
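
As a minimal sketch of the last task, suppose forced alignment has already produced a start and end time for every word. The snippet below greedily groups consecutive words into segments that stay under a maximum duration. The `Word` class, the `split_into_segments` helper, and the timestamps are all hypothetical and only illustrate the idea; they are not part of NeMo.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


def split_into_segments(words, max_duration):
    """Greedily group consecutive words into segments of at most max_duration seconds.

    Returns a list of (segment_start, segment_end, transcript) tuples.
    A single word longer than max_duration is kept as its own segment.
    """
    segments = []
    current = []
    for word in words:
        # Close the current segment if adding this word would exceed the limit.
        if current and word.end - current[0].start > max_duration:
            segments.append(current)
            current = []
        current.append(word)
    if current:
        segments.append(current)
    return [
        (seg[0].start, seg[-1].end, " ".join(w.text for w in seg))
        for seg in segments
    ]


# Made-up word timings standing in for forced-alignment output.
words = [
    Word("hello", 0.0, 0.4),
    Word("world", 0.5, 0.9),
    Word("this", 5.0, 5.2),
    Word("is", 5.3, 5.4),
    Word("NeMo", 5.5, 6.0),
]
segments = split_into_segments(words, max_duration=2.0)
for start, end, text in segments:
    print(f"{start:.1f}s to {end:.1f}s: {text}")
```

Each resulting tuple gives the time range to cut from the audio together with the matching piece of transcript. A real pipeline would probably also prefer to cut at long silences rather than purely by duration.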