2023-09-28 · 5 minute read

Training NeMo RNN-T Models Efficiently with Numba FP16 Support

In Automatic Speech Recognition research, the RNN Transducer (RNN-T) is a type of sequence-to-sequence model that is well known for achieving state-of-the-art transcription accuracy in both offline and real-time (a.k.a. "streaming") speech recognition applications. RNN-T models are also notorious for their high memory requirements. In this blog post we explain why they have this reputation and how NeMo lets you sidestep many of the memory issues, including how to make use of Numba’s recent addition of FP16 support.

What’s so great about Transducer models?

As we mentioned, RNN-T models (often called just “Transducer” models, since they don’t need to use an RNN) have been shown to achieve state-of-the-art results for accurate, streaming speech recognition. RNN-Ts can also handle sequences longer than those they were trained on, as well as out-of-vocabulary words, both of which are common problems in speech recognition.

If you want to learn more about Transducer models, we recommend the excellent blog post Sequence-to-sequence learning with Transducers.

**Figure 1.** The RNN-Transducer architecture. The audio sequence 'x' is passed through the encoder network, and the text sequence 'y' is passed to the prediction network. The outputs of both networks are combined in the joint network.

Why do Transducer models consume a lot of memory?

A significant drawback of the Transducer architecture is the vast GPU memory required during training. As discussed in Sequence-to-sequence learning with Transducers, the output of the joint network (which is the final step before the softmax, see Figure 1) in the transducer is a 4-dimensional tensor which occupies significant amounts of memory. The size of this tensor (both its activations and its gradients) can be calculated as follows:

\[\textnormal{Joint tensor size} = \: B \times T \times U \times V \times 2 \times 4 \,\, \textnormal{bytes}\]

Here, \(B\) is the batch size, \(T\) is the audio sequence length, \(U\) is the text sequence length and \(V\) is the vocabulary size. We multiply by 2 so we get the size of both the activations and the gradients. We then multiply by 4 because we assume an FP32 datatype (and a single FP32 value occupies 4 bytes).

The audio waveform signal is commonly converted to 100 Hz spectrogram frames, which means each second of audio corresponds to 100 audio frames. Thus, for a single 20-second audio clip with about 100 subwords in its transcript, and a vocabulary of 1024 subword tokens, the size of the tensor would be ~1.6 Gigabytes:

\[\textnormal{Joint tensor size} = \: B \times \phantom{....}T\phantom{....}\times \phantom{.}U\phantom{.} \times \phantom{..}V\phantom{.} \times 2 \times 4 \,\, \textnormal{bytes}\phantom{=1.6 \textnormal{ Gigabytes}}\] \[\phantom{\textnormal{Joint tensor size}} = \: 1 \times (20 \times 100) \times 100 \times 1024 \times 2 \times 4 \,\, \textnormal{bytes}=1.6 \textnormal{ Gigabytes}\]

This number is for a single audio sample. If we use a larger batch size for training, e.g. 10, we will quickly run out of memory even on 16 GB GPUs. Also, remember that this is just the size of the joint network tensor: additional memory is required to hold the model itself, and to calculate the activations and gradients of the rest of the network!
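
The arithmetic above is easy to script. The following minimal sketch simply evaluates the formula; the function name `joint_tensor_bytes` and its parameter names are chosen here for illustration and are not part of NeMo.

```python
def joint_tensor_bytes(B, T, U, V, bytes_per_value=4):
    """Size in bytes of the joint tensor's activations plus gradients."""
    # Factor of 2: one copy for the activations, one for the gradients.
    return B * T * U * V * 2 * bytes_per_value


# One 20-second clip at 100 frames per second, 100 subwords, 1024-token vocabulary.
single = joint_tensor_bytes(B=1, T=20 * 100, U=100, V=1024)
print(f"B=1:  {single / 1e9:.2f} GB")  # B=1:  1.64 GB

# The same clip shape at batch size 10.
batch = joint_tensor_bytes(B=10, T=20 * 100, U=100, V=1024)
print(f"B=10: {batch / 1e9:.2f} GB")  # B=10: 16.38 GB
```

At batch size 10 the joint tensor alone comes to roughly 16.4 GB, which already exceeds a 16 GB GPU before the rest of the model is counted.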

Enter Numba support for FP16 datatype

As of the Numba 0.57 release, the FP16 datatype is supported natively. Using it, we can effectively halve the memory requirement of the joint network tensor described above and support larger batch sizes with almost no changes to our NeMo workflow!
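
To get a feel for what halving buys, the sketch below estimates the largest batch size whose joint tensor fits in a fixed memory budget, using the size formula from the previous section with 4 bytes per value for FP32 and 2 for FP16. The helper name `max_batch_size` and the 16 GiB budget are illustrative assumptions, and the estimate ignores the memory needed by the rest of the model.

```python
def max_batch_size(budget_bytes, T, U, V, bytes_per_value):
    """Largest B such that B * T * U * V * 2 * bytes_per_value fits in the budget."""
    per_sample = T * U * V * 2 * bytes_per_value  # activations + gradients
    return budget_bytes // per_sample


budget = 16 * 1024**3  # hypothetical 16 GiB budget for the joint tensor alone
T, U, V = 20 * 100, 100, 1024  # the 20-second, 100-subword example

fp32_batch = max_batch_size(budget, T, U, V, bytes_per_value=4)
fp16_batch = max_batch_size(budget, T, U, V, bytes_per_value=2)
print(fp32_batch, fp16_batch)  # 10 20
```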

NeMo uses Numba to just-in-time compile CUDA kernels written in Python, which efficiently compute the RNN-T loss (a computation that requires manipulating the joint network tensor). This means that users only need Numba installed on their system to train RNN-T models, with no explicit compilation of C++ / CUDA code. Furthermore, since the kernels are written in Python, researchers can easily modify them to develop advanced features such as FastEmit, and even other extensions to the Transducer loss, such as Token-and-Duration Transducers.
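
To make "manipulation of the joint network tensor" concrete, here is a slow, pure-Python reference of the standard RNN-T forward recursion for a single utterance. It is a sketch for intuition only, not NeMo's Numba kernel: the function names are hypothetical, there is no batching or gradient computation, and it assumes the joint output has already been passed through a log-softmax and has shape T x (U + 1) x V (one extra text position for the state before any label is emitted).

```python
import math


def logsumexp2(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def rnnt_loss_reference(log_probs, labels, blank=0):
    """Negative log-likelihood of `labels` for one utterance.

    log_probs[t][u][k] is the log-probability of token k at audio frame t
    after u labels have been emitted.
    """
    T, U = len(log_probs), len(labels)
    alpha = [[-math.inf] * (U + 1) for _ in range(T)]
    alpha[0][0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t == 0 and u == 0:
                continue
            # Arrive by emitting blank (advance one audio frame) ...
            from_blank = alpha[t - 1][u] + log_probs[t - 1][u][blank] if t > 0 else -math.inf
            # ... or by emitting the next label (advance one text position).
            from_label = alpha[t][u - 1] + log_probs[t][u - 1][labels[u - 1]] if u > 0 else -math.inf
            alpha[t][u] = logsumexp2(from_blank, from_label)
    # Every alignment ends with a final blank from the last lattice node.
    return -(alpha[T - 1][U] + log_probs[T - 1][U][blank])


# Toy input: T=2 frames, one label, V=2 tokens, uniform probabilities.
uniform = [[[math.log(0.5)] * 2 for _ in range(2)] for _ in range(2)]
loss = rnnt_loss_reference(uniform, labels=[1])
print(round(loss, 4))  # 1.3863
```

In the toy case there are two valid alignments, each with probability 0.5³, so the loss is -log(0.25). The recursion reads every entry of the T x (U + 1) slice it touches, which is why the full joint tensor has to be kept in memory during training.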

Prerequisites

PyTorch 1.13.1+
NVIDIA NeMo 1.20.0+
Numba 0.57+ (conda install numba=0.57.1 -c conda-forge)
CUDA Python
CUDA 11.8 (installed as part of cudatoolkit)

It is preferable to install these libraries in a Conda environment (Python 3.10) for correct dependency resolution.

The following snippet can be used to install the requirements:

conda create -n nemo -c pytorch -c nvidia -c conda-forge python=3.10 numba=0.57.1 cudatoolkit=11.8 cuda-python=11.8 pytorch torchvision torchaudio pytorch-cuda=11.8 cython
conda activate nemo
pip install "nemo-toolkit[all]>=1.20.0"

Enabling Numba FP16 Support in NeMo

Set the Numba environment variable: export NUMBA_CUDA_USE_NVIDIA_BINDING=1
Set the NeMo environment variable: export STRICT_NUMBA_COMPAT_CHECK=0
Check if installation is successful by using the following snippet:

from nemo.core.utils import numba_utils

# Should be True
print(numba_utils.numba_cuda_is_supported(numba_utils.__NUMBA_MINIMUM_VERSION_FP16_SUPPORTED__))

# Should also be True
print(numba_utils.is_numba_cuda_fp16_supported())
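
When exporting shell variables is inconvenient, for example in a notebook, the same two variables can be set from Python. This is a minimal sketch that assumes both libraries read the variables when they are first imported, so the assignments are placed before any Numba or NeMo import.

```python
import os

# Assumption: these are read at import time, so set them first.
os.environ["NUMBA_CUDA_USE_NVIDIA_BINDING"] = "1"
os.environ["STRICT_NUMBA_COMPAT_CHECK"] = "0"

# Import NeMo (and thus Numba) only after this point, e.g.:
# from nemo.core.utils import numba_utils
```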

Train a Transducer ASR model with FP16

With the above environment flags set, and the latest Numba version installed, NeMo supports training with FP16 loss out of the box. For a tutorial on how to set up and train a Transducer ASR model, please refer to the NeMo ASR with Transducers tutorial.

The only change necessary to use the FP16 loss is to specify trainer.precision=16 in the NeMo model config.

Measuring Memory and Compute Improvements

We devised a simple benchmarking script that measures the memory usage when computing the RNN-T loss (with gradients enabled) for various combinations of inputs which are common during training on the Librispeech speech recognition dataset. The script used can be found in this Gist.

We assume that we are training a Conformer or Fast Conformer Transducer model, which performs 4x or 8x audio signal reduction respectively. For Librispeech, the longest audio file is approximately 17 seconds, which becomes approximately 200 timesteps after 8x reduction. We check memory consumption for both character tokenization (\(V\)=28) and subword tokenization (\(V\)=1024). Depending on the tokenization, the transcript text may be between 80 and 250 tokens long, but we take a conservative limit of 100 to 200 tokens. In addition to the output tensor, the benchmarking script takes into account the memory consumption of the activations and gradients of the intermediate layer of the joint network, which has a shape of \(B \times T \times U \times H\). We set \(H\)=640 (a common value for this parameter in the literature).
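
For a rough sense of scale before looking at measurements, the setup above can be turned into a back-of-the-envelope estimate. The sketch below is not the benchmarking script: it only adds the \(B \times T \times U \times H\) intermediate layer to the joint tensor formula from earlier in the post. The function name `joint_memory_bytes` and the batch size of 16 are arbitrary choices made for illustration.

```python
def joint_memory_bytes(B, T, U, V, H=640, bytes_per_value=4):
    """Activations + gradients of the joint output (V) and its intermediate layer (H)."""
    output = B * T * U * V
    intermediate = B * T * U * H
    return (output + intermediate) * 2 * bytes_per_value


B, T = 16, 200  # B=16 is a hypothetical batch size; T=200 follows the text
for V in (28, 1024):
    for U in (100, 200):
        fp32 = joint_memory_bytes(B, T, U, V, bytes_per_value=4)
        fp16 = joint_memory_bytes(B, T, U, V, bytes_per_value=2)
        print(f"V={V:4d} U={U}: FP32 {fp32 / 1e9:.2f} GB, FP16 {fp16 / 1e9:.2f} GB")
```

Because every term scales linearly with the number of bytes per value, this estimate predicts exactly half the memory for FP16; measured savings can differ, since a real run also allocates buffers that the formula does not model.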

Results

We show the results of the benchmarking in the graph below. You can see that using FP16 effectively halves the memory cost of an RNN-T model (compared with FP32), which keeps memory usage relatively low as the values of the parameters \(B\), \(T\), \(U\) and \(V\) increase.

**Figure 2.** Plot of GPU memory usage under FP16 vs. FP32 for a given combination of batch size (B), timesteps (T), text length (U), vocabulary size (V) and the hidden dimension (H) of the RNN-Transducer joint network.

Note that NVIDIA NeMo has several other mechanisms to significantly reduce peak memory consumption, such as Batch Splitting. Combined with FP16 support in Numba, these allow us to train even larger ASR models with a Transducer loss.

Conclusion

Numba FP16 support alleviates one of the crucial issues of RNN-Transducer training: memory usage. This unlocks efficient streaming speech recognition model training for a wider audience of researchers and developers. With a simple installation step, users are empowered to train and fine-tune their own speech recognition solutions on commonly available GPUs.

Users can learn more about Numba, and how to leverage it for high-performance computing in Python, in the Numba 5-minute guide. Furthermore, NeMo users can read up on how to perform speech recognition with many models and losses in the NeMo ASR documentation.