 # Investigating End-to-End ASR Architectures for Long Form Audio Transcription


This paper presents an overview and evaluation of several end-to-end Automatic Speech Recognition (ASR) models on long-form audio. We study three categories of ASR models based on their core architecture: (1) convolutional, (2) convolutional with squeeze-and-excitation, and (3) convolutional models with attention. We selected one ASR model from each category and evaluated Word Error Rate, maximum audio length, and real-time factor for each model on a variety of long-audio benchmarks: Earnings-21 and Earnings-22, CORAAL, and TED-LIUM3. The model from the category of self-attention with local attention and global token has the best accuracy compared to the other architectures. We also compared models with CTC and RNNT decoders and showed that CTC-based models are more robust and efficient than RNNT-based models on long-form audio.
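Two of the metrics named above, Word Error Rate and real-time factor, can be illustrated with a minimal Python sketch. It assumes the conventional definitions: WER as word-level edit distance divided by the number of reference words, and real-time factor as processing time divided by audio duration. The function names are hypothetical and this is not the paper's evaluation code; text normalization, which real benchmarks apply before scoring, is left out.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count.

    Assumes whitespace tokenization with no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only one previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration (assumed definition).

    Values below 1.0 mean transcription runs faster than real time.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds
```

For example, scoring the hypothesis "the cat sit on mat" against the reference "the cat sat on the mat" counts one substitution and one deletion over six reference words, and a 600-second recording transcribed in 30 seconds has a real-time factor of 0.05.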



 ## Authors



Nithin Rao Koluguri (NVIDIA)

Samuel Kriman (NVIDIA)

Georgy Zelenfroind (NVIDIA)

Somshubra Majumdar (NVIDIA)

Dima Rekesh (NVIDIA)

Vahid Noroozi (NVIDIA)

Jagadeesh Balam (NVIDIA)

Boris Ginsburg (NVIDIA)

 

 

 ## Publication Date



Monday, September 18, 2023

 

 ## Research Area



[Speech Processing](/research-area/speech-processing)

 

 

 ## External Links



[Paper](https://arxiv.org/abs/2309.09950)