VoiceNoNG: Robust High-Quality Speech Editing Model without Hallucinations

Voicebox and VoiceCraft are currently the most representative non-autoregressive and autoregressive speech editing models, respectively. Although both can generate high-quality speech edits, each has a limitation: Voicebox struggles to edit speech with background audio, while VoiceCraft suffers from a hallucination-like problem. To maintain speech quality across varying audio scenarios and address the hallucination issue, we introduce VoiceNoNG, which combines the strengths of both frameworks. VoiceNoNG uses a latent flow-matching framework to model the pre-quantization features of a neural codec. The vector quantizer in the neural codec provides additional robustness against minor prediction errors from the editing model, enabling VoiceNoNG to achieve state-of-the-art performance in both objective and subjective evaluations under diverse audio conditions.