Conformer without Convolutions
We analyze the weights of a trained speech-to-text neural network and discover a surprising amount of structure in the temporal convolutions. Based on these observations, we propose to remove the learnable temporal convolutions entirely and replace them with fixed averaging and shift operations, which have no learnable parameters and open the way to significantly faster implementations. Applied to the state-of-the-art Conformer, Squeezeformer, and FastConformer models, this improves WER by 0.12%, 0.62%, and 0.20% respectively, while reducing the computational cost.
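To make the idea concrete, below is a minimal sketch of what replacing a learnable depthwise temporal convolution with fixed, parameter-free operations could look like. The window size, the choice of one-frame shifts, and the even channel split are illustrative assumptions, not the paper's exact design.

```python
# Hedged sketch: fixed temporal mixing with no learnable parameters,
# standing in for a depthwise temporal convolution. All specific choices
# (window=3, +/-1 frame shifts, three-way channel split) are assumptions.
import torch
import torch.nn.functional as F


def fixed_temporal_mixing(x: torch.Tensor, window: int = 3) -> torch.Tensor:
    """x: (batch, channels, time). Returns a tensor of the same shape.

    One third of the channels is shifted one frame toward the past, one
    third one frame toward the future, and the remaining channels are
    replaced by a local moving average -- all fixed operations.
    """
    b, c, t = x.shape
    third = c // 3

    # Shift a block of channels so each frame sees the previous frame,
    # zero-padding the newly exposed position.
    past = F.pad(x[:, :third, :-1], (1, 0))

    # Shift another block so each frame sees the next frame.
    future = F.pad(x[:, third:2 * third, 1:], (0, 1))

    # Replace the rest with a fixed moving average over `window` frames.
    rest = x[:, 2 * third:, :]
    avg = F.avg_pool1d(rest, kernel_size=window, stride=1,
                       padding=window // 2, count_include_pad=False)

    return torch.cat([past, future, avg], dim=1)


if __name__ == "__main__":
    feats = torch.randn(2, 96, 50)   # (batch, channels, frames)
    out = fixed_temporal_mixing(feats)
    print(out.shape)                 # torch.Size([2, 96, 50])
```

Because every operation above is a fixed shift or average, it carries no parameters and can be fused or precomputed, which is where the claimed speed advantage would come from.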