2023-08-14 · 2 minute read

Introducing NeMo Forced Aligner

Today we introduce NeMo Forced Aligner: a NeMo-based tool for forced alignment.

NFA allows you to obtain token-level, word-level and segment-level timestamps for words spoken in an audio file. NFA produces timestamp information in a variety of output file formats, including subtitle files, which you can use to create videos such as the one below¹:

Video with words highlighted according to word alignment timestamps obtained with NFA

Ways to get started:

Try out our HuggingFace Space demo to quickly test NFA in various languages.
Follow along with our step-by-step NFA "how-to" tutorial.
Learn more about how forced alignment works in this explainer tutorial.

You can also download NFA from the NeMo repository.

You can use NFA timestamps to:

Split audio files into shorter segments
Generate token- or word-level subtitles, like in our HuggingFace Space
Train token/word duration components of text-to-speech or speaker diarization models
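
As an illustration of the first two uses, here is a minimal Python sketch that turns word-level timestamps into SRT subtitle entries and into cut points for shorter audio segments. It assumes the timestamps are available as CTM-style lines (`<utt_id> <channel> <start> <duration> <word>`, times in seconds). That layout, the 0.5-second pause threshold, and the names `Word`, `parse_ctm_line`, `srt_time`, `words_to_srt` and `split_at_pauses` are assumptions made for this example, not part of NFA itself.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


def parse_ctm_line(line: str) -> Word:
    # Assumed layout: <utt_id> <channel> <start> <duration> <word>
    _utt_id, _channel, start, duration, text = line.strip().split(maxsplit=4)
    start_s = float(start)
    return Word(text=text, start=start_s, end=start_s + float(duration))


def srt_time(seconds: float) -> str:
    # SRT timestamps have the form HH:MM:SS,mmm
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: list[Word]) -> str:
    # One subtitle entry per word; entries are separated by a blank line.
    blocks = [
        f"{i}\n{srt_time(w.start)} --> {srt_time(w.end)}\n{w.text}\n"
        for i, w in enumerate(words, start=1)
    ]
    return "\n".join(blocks)


def split_at_pauses(words: list[Word], min_pause: float = 0.5) -> list[tuple[float, float]]:
    # Start a new segment whenever the silence between two words
    # is at least min_pause seconds; return (start, end) per segment.
    segments: list[list[Word]] = []
    for word in words:
        if segments and word.start - segments[-1][-1].end < min_pause:
            segments[-1].append(word)
        else:
            segments.append([word])
    return [(seg[0].start, seg[-1].end) for seg in segments]


# Hypothetical alignment output, for demonstration only.
demo_lines = [
    "utt1 1 0.25 0.50 Betty",
    "utt1 1 0.80 0.40 bought",
    "utt1 1 2.00 0.50 butter",
]
demo_words = [parse_ctm_line(line) for line in demo_lines]
print(words_to_srt(demo_words))
print(split_at_pauses(demo_words))
```

The `(start, end)` pairs returned by `split_at_pauses` can be passed to any audio library to cut the file. Grouping several words into one subtitle entry would work the same way as the pause-based segmentation.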

NFA alignment timestamps can be based on reference text that you provide, or on a transcription produced by a NeMo speech-to-text (ASR) model. NFA works on audio in 14+ languages: it works with any of the 14 (and counting) languages for which there is an open-source NeMo speech-to-text model checkpoint, or you can train your own ASR model for a new language.
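
To make the two modes concrete, the sketch below builds input entries with and without reference text. It assumes the JSON-lines manifest convention used by NeMo speech tools, in which each line is a JSON object with an `audio_filepath` field and an optional `text` field. The helper name `manifest_line` and the file paths are hypothetical, and the exact fields NFA expects should be checked against its documentation.

```python
import json
from typing import Optional


def manifest_line(audio_filepath: str, text: Optional[str] = None) -> str:
    # Build one JSON-lines entry. "text" is included only when
    # reference text is supplied; otherwise the transcript is
    # expected to come from a speech-to-text model.
    entry = {"audio_filepath": audio_filepath}
    if text is not None:
        entry["text"] = text
    return json.dumps(entry, ensure_ascii=False)


# Hypothetical paths, for demonstration only.
with_reference = manifest_line("audio/poem.wav", "Betty Botta bought some butter")
without_reference = manifest_line("audio/interview.wav")

print(with_reference)
print(without_reference)
```

Writing one such line per audio file into a single text file gives a manifest that covers a whole dataset.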

The NFA forced alignment pipeline

¹ The video shows an excerpt from 'The Jingle Book' by Carolyn Wells: a reading of the poem "The Butter Betty Bought", taken from a LibriVox recording of the book. We used NeMo Forced Aligner to generate the subtitle files for the video. The text was adapted from Project Gutenberg. Both the original audio and the text are in the public domain.