Conformer without Convolutions
We analyze the weights of a trained speech-to-text neural network and discover a surprising amount of structure in the temporal convolutions. Based on these observations, we propose to remove the learnable temporal convolutions entirely and replace them with fixed averaging and shift operations, which have no learnable parameters and open the way to significantly faster implementations. Applied to the state-of-the-art Conformer, Squeezeformer, and FastConformer models, this improves WER by 0.12%, 0.62%, and 0.20% respectively, while reducing the computational cost.
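To make the idea concrete, below is a minimal sketch of what replacing a learnable depthwise temporal convolution with fixed, parameter-free operations could look like. The window size, the choice of one-frame shifts, and the even channel split are illustrative assumptions, not the paper's exact design.

```python
# Hedged sketch: fixed temporal mixing with no learnable parameters,
# standing in for a depthwise temporal convolution. All specific choices
# (window=3, +/-1 frame shifts, three-way channel split) are assumptions.
import torch
import torch.nn.functional as F


def fixed_temporal_mixing(x: torch.Tensor, window: int = 3) -> torch.Tensor:
    """x: (batch, channels, time). Returns a tensor of the same shape.

    One third of the channels is shifted one frame toward the past, one
    third one frame toward the future, and the remaining channels are
    replaced by a local moving average -- all fixed operations.
    """
    b, c, t = x.shape
    third = c // 3

    # Shift a block of channels so each frame sees the previous frame,
    # zero-padding the newly exposed position.
    past = F.pad(x[:, :third, :-1], (1, 0))

    # Shift another block so each frame sees the next frame.
    future = F.pad(x[:, third:2 * third, 1:], (0, 1))

    # Replace the rest with a fixed moving average over `window` frames.
    rest = x[:, 2 * third:, :]
    avg = F.avg_pool1d(rest, kernel_size=window, stride=1,
                       padding=window // 2, count_include_pad=False)

    return torch.cat([past, future, avg], dim=1)


if __name__ == "__main__":
    feats = torch.randn(2, 96, 50)   # (batch, channels, frames)
    out = fixed_temporal_mixing(feats)
    print(out.shape)                 # torch.Size([2, 96, 50])
```

Because every operation above is a fixed shift or average, it carries no parameters and can be fused or precomputed, which is where the claimed speed advantage would come from.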