Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition
The Conformer architecture, introduced by Gulati et al., has become a standard architecture not only for Automatic Speech Recognition but also for other tasks such as Spoken Language Understanding and Speech Translation, and it serves as a backbone for Self-Supervised Learning on various downstream tasks. While Conformer models are highly accurate on each of these tasks and can be extended to others, they are also computationally expensive. This is due to the quadratic complexity of the attention mechanism, which makes training and inference difficult on long sequences; such sequences arise because of the fine-grained stride of audio pre-processors (commonly Mel spectrograms, or even raw audio in some models, with a 10 millisecond stride). Furthermore, the memory requirement of quadratic attention also significantly limits the audio duration that can be processed during inference.
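As a rough illustration of why this matters, the sketch below estimates the memory consumed by the attention score matrices alone, as a function of audio duration at a 10 ms frame rate. The head count and numeric precision are assumed for illustration and are not taken from the paper, and the estimate ignores any subsampling inside the encoder; it is meant only to show the quadratic growth.

# Back-of-the-envelope estimate (illustrative, not from the paper):
# with a 10 ms pre-processor stride, T seconds of audio yield T * 100
# frames, and self-attention materializes an L x L score matrix per head.

def attention_memory_gib(audio_seconds: float,
                         stride_ms: float = 10.0,
                         num_heads: int = 8,       # assumed head count
                         bytes_per_elem: int = 4,  # assumes fp32 scores
                         ) -> float:
    """Approximate memory for one layer's attention score matrices."""
    seq_len = int(audio_seconds * 1000 / stride_ms)  # frames after pre-processing
    elems = num_heads * seq_len ** 2                 # one L x L matrix per head
    return elems * bytes_per_elem / 2**30

for seconds in (10, 60, 600):
    print(f"{seconds:>4d} s of audio -> seq_len={seconds * 100:>6d}, "
          f"~{attention_memory_gib(seconds):8.2f} GiB of attention scores")

Running this yields roughly 0.03 GiB for 10 s of audio but over 100 GiB for 10 minutes, which is why the quadratic term, rather than model size, becomes the binding constraint on inference duration.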