VoiceNoNG: Robust High-Quality Speech Editing Model without Hallucinations

Voicebox and VoiceCraft are currently the most representative models for non-autoregressive and autoregressive speech editing, respectively. Although both generate high-quality speech edits, we identify their limitations: Voicebox struggles to edit speech with background audio, while VoiceCraft suffers from a hallucination-like problem. To maintain speech quality across varying audio scenarios and address the hallucination issue, we introduce VoiceNoNG, which combines the strengths of both model frameworks. VoiceNoNG uses a latent flow-matching framework to model the pre-quantization features of a neural codec. The vector quantizer in the neural codec provides additional robustness against minor prediction errors from the editing model, enabling VoiceNoNG to achieve state-of-the-art performance in both objective and subjective evaluations under diverse audio conditions.
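The robustness argument can be illustrated with a minimal sketch: a vector quantizer snaps any continuous latent to its nearest codebook entry, so a prediction that deviates slightly from the target latent still resolves to the same discrete code. The codebook values and dimensions below are made up for illustration and do not reflect the actual codec used in VoiceNoNG.

```python
import numpy as np

# Hypothetical illustration: a VQ codebook with 8 entries of 4-dim latents.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))

def quantize(z: np.ndarray) -> int:
    """Return the index of the codebook entry nearest to latent z."""
    return int(np.argmin(np.linalg.norm(codebook - z, axis=1)))

target = codebook[3]                        # the "correct" latent
noisy = target + 0.01 * rng.normal(size=4)  # small prediction error

# The quantizer absorbs the small error: both map to the same code.
assert quantize(target) == quantize(noisy) == 3
```

Because codebook entries are well separated relative to the perturbation, minor continuous-domain errors from the flow-matching model vanish after quantization, which is the robustness property the abstract attributes to the neural codec's vector quantizer.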

Authors

Heng-Cheng Kuo (National Taiwan University)
Zhehuai Chen (NVIDIA)
Xuesong Yang (NVIDIA)
Pin-Jui Ku (NVIDIA)
Ante Jukić (NVIDIA)
Yu Tsao (Academia Sinica)
Hung-yi Lee (National Taiwan University)

Publication Date