## Speech Processing

### Associated Publications

 

### 2026 

[TimeOmni-1: Incentivizing Complex Reasoning with Time Series in Large Language Models](/publication/2026-04_timeomni-1-incentivizing-complex-reasoning-time-series-large-language-models)

Tong Guan, [Huck Yang](/person/huck-yang), Sabato Marco Siniscalchi, Qingsong Wen, Ming Jin, Shirui Pan



[ICLR](https://openreview.net/forum?id=kOIclg7muL)









### 2025 

[VoiceNoNG: Robust High-Quality Speech Editing Model without Hallucinations](/index.php/publication/2025-08_voicenong-robust-high-quality-speech-editing-model-without-hallucinations)

[Sung-Feng Huang](/index.php/person/sung-feng-huang), Heng-Cheng Kuo, Zhehuai Chen, Xuesong Yang, Pin-Jui Ku, Ante Jukić, [Huck Yang](/index.php/person/huck-yang), Yu Tsao, [Frank Wang](/index.php/person/frank-wang), Hung-yi Lee, [Szu-Wei Fu](/index.php/person/szu-wei-fu)



[Interspeech 2025](https://www.interspeech2025.org/home)









[UniWav: Towards Unified Pre-training for Speech Representation Learning and Generation](/index.php/publication/2025-04_uniwav-towards-unified-pre-training-speech-representation-learning-and)

Alexander H. Liu, Sang-gil Lee, [Huck Yang](/index.php/person/huck-yang), Yuan Gong, [Frank Wang](/index.php/person/frank-wang), James R. Glass, Rafael Valle



[ICLR 2025](https://openreview.net/forum?id=yj9lLwMjnE)









[Audio Large Language Models Can Be Descriptive Speech Quality Evaluators](/publication/2025-04_audio-large-language-models-can-be-descriptive-speech-quality-evaluators)

Chen Chen, Yuchen Hu, Siyin Wang, Helin Wang, Zhehuai Chen, Chao Zhang, [Huck Yang](/person/huck-yang), Eng Siong Chng



[ICLR 2025](https://openreview.net/forum?id=U42TkrEDzb)









### 2024 

[Large Language Model Based Generative Error Correction: A Challenge and Baselines for Speech Recognition, Speaker Tagging, and Emotion Recognition](/publication/2024-12_large-language-model-based-generative-error-correction-challenge-and-baselines)

[Huck Yang](/person/huck-yang), Taejin Park, Yuan Gong, Yuanchao Li, Zhehuai Chen, Yen-Ting Lin, Chen Chen, Yuchen Hu, Kunal Dhawan, Piotr Zelasko, Chao Zhang, Yun-Nung Chen, Yu Tsao, Jagadeesh Balam, Boris Ginsburg, Shinji Watanabe, Andreas Stolcke



[SLT 2024](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10832176)









[Self-Taught Recognizer: Toward Unsupervised Adaptation for Speech Foundation Models](/index.php/publication/2024-12_self-taught-recognizer-toward-unsupervised-adaptation-speech-foundation-models)

Yuchen Hu, Chen Chen, [Huck Yang](/index.php/person/huck-yang), Chengwei Qin, Pin-Yu Chen, Eng Siong Chng, Chao Zhang



[NeurIPS](https://arxiv.org/pdf/2405.14161)









[Detecting the Undetectable: Assessing the Efficacy of Current Spoof Detection Methods Against Seamless Speech Edits](/index.php/publication/2024-12_detecting-undetectable-assessing-efficacy-current-spoof-detection-methods)

[Sung-Feng Huang](/index.php/person/sung-feng-huang), Heng-Cheng Kuo, Zhehuai Chen, Xuesong Yang, [Huck Yang](/index.php/person/huck-yang), Yu Tsao, [Frank Wang](/index.php/person/frank-wang), Hung-yi Lee, [Szu-Wei Fu](/index.php/person/szu-wei-fu)



[IEEE SLT 2024](https://2024.ieeeslt.org/)









[FastAdaSP: Multitask-Adapted Efficient Inference for Large Speech Language Model](/publication/2024-11_fastadasp-multitask-adapted-efficient-inference-large-speech-language-model)

Yichen Lu, Jiaqi Song, [Huck Yang](/person/huck-yang), Shinji Watanabe



[EMNLP](https://aclanthology.org/2024.emnlp-industry.33.pdf)









[Bayesian Example Selection Improves In-Context Learning for Speech, Text, and Visual Modalities](/publication/2024-11_bayesian-example-selection-improves-context-learning-speech-text-and-visual)

Siyin Wang, [Huck Yang](/person/huck-yang), Ji Wu, Chao Zhang



[EMNLP](https://arxiv.org/pdf/2404.14716)









[GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators](/index.php/publication/2024-08_gentranslate-large-language-models-are-generative-multilingual-speech-and)

Yuchen Hu, Chen Chen, [Huck Yang](/index.php/person/huck-yang), Ruizhe Li, Zhehuai Chen, Eng Siong Chng



[ACL 2024](https://arxiv.org/pdf/2402.06894)









[Large Language Models are Efficient Learners of Noise-Robust Speech Recognition](/index.php/publication/2024-05_large-language-models-are-efficient-learners-noise-robust-speech-recognition)

Yuchen Hu, Chen Chen, [Huck Yang](/index.php/person/huck-yang), Ruizhe Li, Chao Zhang, Pin-Yu Chen, Eng Siong Chng



[ICLR 2024](https://iclr.cc/Conferences/2024)









[It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition](/index.php/publication/2024-05_it-s-never-too-late-fusing-acoustic-information-large-language-models-automatic)

Chen Chen, Ruizhe Li, Yuchen Hu, Sabato Marco Siniscalchi, Pin-Yu Chen, Eng Siong Chng, [Huck Yang](/index.php/person/huck-yang)



[ICLR 2024](https://iclr.cc/Conferences/2024)









[A Chat about Boring Problems: Studying GPT-Based Text Normalization](/publication/2024-03_chat-about-boring-problems-studying-gpt-based-text-normalization)

Yang Zhang, Travis M. Bartley, Mariana Graterol-Fuenmayor, Vitaly Lavrukhin, Evelina Bakhturina, Boris Ginsburg



[ICASSP](https://ieeexplore.ieee.org/xpl/conhome/10445798/proceeding)









[Fast Entropy-Based Methods of Word-Level Confidence Estimation for End-to-End Automatic Speech Recognition](/publication/2024-01_fast-entropy-based-methods-word-level-confidence-estimation-end-end-automatic)

Aleksandr Laptev, Boris Ginsburg



[IEEE](https://ieeexplore.ieee.org/abstract/document/10022960)









### 2023 

[Stateful Conformer with Cache-based Inference for Streaming Automatic Speech Recognition](/publication/2023-12_stateful-conformer-cache-based-inference-streaming-automatic-speech-recognition)

Vahid Noroozi, Somshubra Majumdar, Ankur Kumar, Jagadeesh Balam, Boris Ginsburg













[HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models](/index.php/publication/2023-12_hyporadise-open-baseline-generative-speech-recognition-large-language-models)

Chen Chen, Yuchen Hu, [Huck Yang](/index.php/person/huck-yang), Sabato Marco Siniscalchi, Pin-Yu Chen, Eng Siong Chng



[NeurIPS 2023](https://openreview.net/forum?id=cAjZ3tMye6)









[Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition](/index.php/publication/2023-12_whispering-llama-cross-modal-generative-error-correction-framework-speech)

Srijith Radhakrishnan, [Huck Yang](/index.php/person/huck-yang), Sumeer Khan, Rohit Kumar, Narsis Kiani, David Gomez-Cabrero, Jesper Tegnér



[EMNLP](https://aclanthology.org/2023.emnlp-main.618/)









[Investigating End-to-End ASR Architectures for Long Form Audio Transcription](/index.php/publication/2023-09_investigating-end-end-asr-architectures-long-form-audio-transcription)

Nithin Rao Koluguri, Samuel Kriman, Georgy Zelenfroind, Somshubra Majumdar, Dima Rekesh, Vahid Noroozi, Jagadeesh Balam, Boris Ginsburg













[NeMo Forced Aligner and its application to word alignment for subtitle generation](/publication/2023-08_nemo-forced-aligner-and-its-application-word-alignment-subtitle-generation)

Elena Rastorgueva, Vitaly Lavrukhin, Boris Ginsburg



[Interspeech](https://www.isca-archive.org/interspeech_2023/rastorgueva23_interspeech.html)









[Confidence-based Ensembles of End-to-End Speech Recognition Models](/publication/2023-06_confidence-based-ensembles-end-end-speech-recognition-models)

Igor Gitman, Vitaly Lavrukhin, Aleksandr Laptev, Boris Ginsburg



[Interspeech](https://interspeech2023.org/)









[Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](/publication/2023-05_fast-conformer-linearly-scalable-attention-efficient-speech-recognition)

Dima Rekesh, Nithin Rao Koluguri, Samuel Kriman, Somshubra Majumdar, Vahid Noroozi, He Huang, Oleksii Hrinchuk, Krishna Puvvada, Ankur Kumar, Jagadeesh Balam, Boris Ginsburg













[Efficient Sequence Transduction by Jointly Predicting Tokens and Durations](/publication/2023-04_efficient-sequence-transduction-jointly-predicting-tokens-and-durations)

Hainan Xu, Fei Jia, Somshubra Majumdar, He Huang, Shinji Watanabe, Boris Ginsburg













### 2022 

[Accidental Learners: Spoken Language Identification in Multilingual Self-Supervised Models](/publication/2022-11_accidental-learners-spoken-language-identification-multilingual-self-supervised)

Travis M. Bartley, Fei Jia, Krishna C. Puvvada, Samuel Kriman, Boris Ginsburg













[Multi-blank Transducers for Speech Recognition](/publication/2022-11_multi-blank-transducers-speech-recognition)

Hainan Xu, Fei Jia, Somshubra Majumdar, Shinji Watanabe, Boris Ginsburg













[Adapter-Based Extension of Multi-Speaker Text-to-Speech Model for New Speakers](/index.php/publication/2022-11_adapter-based-extension-multi-speaker-text-speech-model-new-speakers)

Cheng-Ping Hsieh, Subhankar Ghosh, Boris Ginsburg













[A Compact End-to-End Model with Local and Global Context for Spoken Language Identification](/publication/2022-10_compact-end-end-model-local-and-global-context-spoken-language-identification)

Fei Jia, Nithin Rao Koluguri, Jagadeesh Balam, Boris Ginsburg



[Interspeech](https://interspeech2023.org/)









[Damage Control During Domain Adaptation for Transducer Based Automatic Speech Recognition](/publication/2022-10_damage-control-during-domain-adaptation-transducer-based-automatic-speech)

Somshubra Majumdar, Shantanu Acharya, Vitaly Lavrukhin, Boris Ginsburg



[IEEE](https://ieeexplore.ieee.org/abstract/document/10023219)









[Thutmose Tagger: Single-pass neural model for Inverse Text Normalization](/index.php/publication/2022-07_thutmose-tagger-single-pass-neural-model-inverse-text-normalization)

Alexandra Antonova, Evelina Bakhturina, Boris Ginsburg













[TitaNet: Neural Model for Speaker Representation with 1D Depth-Wise Separable Convolutions and Global Context](/index.php/publication/2022-05_titanet-neural-model-speaker-representation-1d-depth-wise-separable)

Nithin Rao Koluguri, Taejin Park, Boris Ginsburg



[IEEE](https://ieeexplore.ieee.org/abstract/document/9746806)









[Shallow Fusion of Weighted Finite-State Transducer and Language Model for Text Normalization](/publication/2022-03_shallow-fusion-weighted-finite-state-transducer-and-language-model-text)

Evelina Bakhturina, Yang Zhang, Boris Ginsburg













### 2021 

[Mixer-TTS: non-autoregressive, fast and compact text-to-speech model conditioned on language model embeddings](/publication/2021-10_mixer-tts-non-autoregressive-fast-and-compact-text-speech-model-conditioned)

Oktai Tatanov, Stanislav Beliaev, Boris Ginsburg













[A Unified Transformer-based Framework for Duplex Text Normalization](/publication/2021-08_unified-transformer-based-framework-duplex-text-normalization)

Tuan Manh Lai, Yang Zhang, Evelina Bakhturina, Boris Ginsburg, Heng Ji













[CarneliNet: Neural Mixture Model for Automatic Speech Recognition](/publication/2021-07_carnelinet-neural-mixture-model-automatic-speech-recognition)

Aleksei Kalinov, Somshubra Majumdar, Jagadeesh Balam, Boris Ginsburg













[TalkNet: Non-Autoregressive Depth-Wise Separable Convolutional Model for Speech Synthesis](/publication/2021-04_talknet-non-autoregressive-depth-wise-separable-convolutional-model-speech)

Stanislav Beliaev, Boris Ginsburg













[TalkNet 2: Non-Autoregressive Depth-Wise Separable Convolutional Model for Speech Synthesis with Explicit Pitch and Duration Prediction](/publication/2021-04_talknet-2-non-autoregressive-depth-wise-separable-convolutional-model-speech)

Stanislav Beliaev, Boris Ginsburg













[NeMo Inverse Text Normalization: From Development To Production](/publication/2021-04_nemo-inverse-text-normalization-development-production)

Yang Zhang, Evelina Bakhturina, Kyle Gorman, Boris Ginsburg













[A Toolbox for Construction and Analysis of Speech Datasets](/publication/2021-04_toolbox-construction-and-analysis-speech-datasets)

Evelina Bakhturina, Vitaly Lavrukhin, Boris Ginsburg













[Citrinet: Closing the Gap between Non-Autoregressive and Autoregressive End-to-End Models for Automatic Speech Recognition](/publication/2021-04_citrinet-closing-gap-between-non-autoregressive-and-autoregressive-end-end)

Somshubra Majumdar, Jagadeesh Balam, Oleksii Hrinchuk, Vitaly Lavrukhin, Vahid Noroozi, Boris Ginsburg













[SPGISpeech: 5,000 Hours of Transcribed Financial Audio for Fully Formatted End-to-End Speech Recognition](/publication/2021-04_spgispeech-5000-hours-transcribed-financial-audio-fully-formatted-end-end)

Patrick K. O’Neill, Vitaly Lavrukhin, Somshubra Majumdar, Vahid Noroozi, Yuekai Zhang, Oleksii Kuchaiev, Jagadeesh Balam, Yuliya Dovzhenko, Keenan Freyberg, Michael D. Shulman, Boris Ginsburg, Shinji Watanabe, Georg Kucsko



[Interspeech](https://www.isca-archive.org/interspeech_2021/oneill21_interspeech.html)









[Hi-Fi Multi-Speaker English TTS Dataset](/publication/2021-04_hi-fi-multi-speaker-english-tts-dataset)

Evelina Bakhturina, Vitaly Lavrukhin, Boris Ginsburg, Yang Zhang













### 2020 

[MarbleNet: Deep 1D Time-Channel Separable Convolutional Neural Network for Voice Activity Detection](/publication/2020-10_marblenet-deep-1d-time-channel-separable-convolutional-neural-network-voice)

Fei Jia, Somshubra Majumdar, Boris Ginsburg



[IEEE](https://ieeexplore.ieee.org/abstract/document/9414470)









[Improving Noise Robustness of an End-to-End Neural Model for Automatic Speech Recognition](/index.php/publication/2020-10_improving-noise-robustness-end-end-neural-model-automatic-speech-recognition)

Jagadeesh Balam, Jocelyn Huang, Vitaly Lavrukhin, Slyne Deng, Somshubra Majumdar, Boris Ginsburg













[SpeakerNet: 1D Depth-wise Separable Convolutional Network for Text-Independent Speaker Recognition and Verification](/publication/2020-10_speakernet-1d-depth-wise-separable-convolutional-network-text-independent)

Nithin Rao Koluguri, Jason Li, Vitaly Lavrukhin, Boris Ginsburg













[Cross-Language Transfer Learning and Domain Adaptation for End-to-End Automatic Speech Recognition](/publication/2020-05_cross-language-transfer-learning-and-domain-adaptation-end-end-automatic-speech)

Jocelyn Huang, Oleksii Kuchaiev, Patrick O’Neill, Vitaly Lavrukhin, Jason Li, Adriana Flores, Georg Kucsko, Boris Ginsburg













[MatchboxNet - 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition](/publication/2020-04_matchboxnet-1d-time-channel-separable-convolutional-neural-network-architecture)

Somshubra Majumdar, Boris Ginsburg



[Interspeech](http://www.interspeech2020.org/index.php?m=content&c=index&a=show&catid=337&id=993)









### 2019 

[Correction of Automatic Speech Recognition with Transformer Sequence-To-Sequence Model](/index.php/publication/2019-10_correction-automatic-speech-recognition-transformer-sequence-sequence-model)

Oleksii Hrinchuk, Mariya Popova, Boris Ginsburg



[IEEE](https://ieeexplore.ieee.org/abstract/document/9053051)









[QuartzNet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions](/publication/2019-10_quartznet-deep-automatic-speech-recognition-1d-time-channel-separable)

Samuel Kriman, Stanislav Beliaev, Boris Ginsburg, Jocelyn Huang, Oleksii Kuchaiev, Vitaly Lavrukhin, Ryan Leary, Jason Li, Yang Zhang













[Jasper: An End-to-End Convolutional Neural Acoustic Model](/index.php/publication/2019-04_jasper-end-end-convolutional-neural-acoustic-model)

Jason Li, Vitaly Lavrukhin, Boris Ginsburg, Ryan Leary, Oleksii Kuchaiev, Jonathan M. Cohen, Huyen Nguyen, Ravi Teja Gadde













 

 



### Researchers

 

[Huck Yang](/index.php/person/huck-yang)



[Matthijs Van keirsbilck](/person/matthijs-van-keirsbilck)



[Sung-Feng Huang](/index.php/person/sung-feng-huang)