Technical deep-dive

Training NeMo RNN-T Models Efficiently with Numba FP16 Support

In the field of Automatic Speech Recognition research, the RNN Transducer (RNN-T) is a type of sequence-to-sequence model that is well known for achieving state-of-the-art transcription accuracy in both offline and real-time (a.k.a. "streaming") speech recognition applications. RNN-T models are also notorious for their high memory requirements. In this blog post we will explain why they have this reputation, and how NeMo lets you side-step many of these memory issues, including how to make use of Numba’s recent addition of FP16 support.
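As a minimal illustration of what Numba's FP16 support looks like, the sketch below runs a toy CUDA kernel directly on float16 arrays. This is not NeMo's RNN-T loss kernel; it assumes a recent Numba release with CUDA FP16 support and an available NVIDIA GPU, and the kernel and variable names are illustrative only.

```python
import numpy as np
from numba import cuda


@cuda.jit
def scale_add_fp16(x, y, out, alpha):
    # Elementwise out[i] = alpha * x[i] + y[i], computed on float16 data.
    i = cuda.grid(1)
    if i < x.shape[0]:
        out[i] = alpha * x[i] + y[i]


n = 1 << 20
x = cuda.to_device(np.random.rand(n).astype(np.float16))
y = cuda.to_device(np.random.rand(n).astype(np.float16))
out = cuda.device_array(n, dtype=np.float16)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
# Keep alpha in float16 so the whole computation stays in half precision.
scale_add_fp16[blocks, threads_per_block](x, y, out, np.float16(0.5))
print(out.copy_to_host()[:5])
```

Keeping the inputs, intermediates, and outputs in float16 roughly halves the memory footprint of the kernel's working set compared to float32, which is the same lever the RNN-T loss can pull when FP16 is available.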


How does forced alignment work?

In this blog post we will explain how you can use an Automatic Speech Recognition (ASR) model [1] to match up the text spoken in an audio file with the times at which it is spoken. Once you have this information, you can perform downstream tasks such as:

  • creating subtitles, such as in the video below [2] or in the Hugging Face space (see the sketch after this list)

  • obtaining durations of tokens or words for use in Text-to-Speech or speaker diarization models

  • splitting long audio files (and their transcripts) into shorter ones. This is especially useful when creating datasets for training new ASR models, since audio files that are too long will not fit onto a single GPU during training. [3]
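To make the subtitle use case concrete, here is a small sketch that converts word-level alignments into an SRT subtitle file. The alignment format used here (a list of word, start-second, end-second tuples) and the helper names are assumptions for illustration, not the output format of any particular alignment tool.

```python
# Turn hypothetical word-level alignments (word, start_sec, end_sec)
# into an SRT subtitle file, one cue per aligned word.

def to_srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 00:00:01,500."""
    millis = int(round(seconds * 1000))
    hours, rem = divmod(millis, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def write_srt(alignments, path):
    """Write one SRT cue per (word, start, end) tuple."""
    with open(path, "w", encoding="utf-8") as f:
        for idx, (word, start, end) in enumerate(alignments, start=1):
            f.write(f"{idx}\n")
            f.write(f"{to_srt_timestamp(start)} --> {to_srt_timestamp(end)}\n")
            f.write(f"{word}\n\n")


# Example with three aligned words (timings are made up for illustration).
write_srt(
    [("hello", 0.32, 0.61), ("world", 0.70, 1.05), ("again", 1.20, 1.58)],
    "subtitles.srt",
)
```

The same per-word (or per-token) start and end times can also be aggregated into segment boundaries for the dataset-splitting use case above.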