Speech Processing | NVIDIA Research Taiwan

VoiceNoNG: High-Quality Speech Editing Model without Hallucinations

Voicebox and VoiceCraft are the current most representative models for non-autoregressive and autoregressive speech editing, respectively. Although both of them can generate high-quality speech edits, we identify their limitations: Voicebox is not …

Audio Large Language Models Can Be Descriptive Speech Quality Evaluators

An ideal multimodal agent should be aware of the quality of its input modalities. Recent advances have enabled large language models (LLMs) to incorporate auditory systems for handling various speech-related tasks. However, most audio LLMs remain …

Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks

Multimodal foundation models, such as Gemini and GPT-4, have revolutionized human-machine interactions by seamlessly integrating various forms of data. Developing a universal spoken language model that comprehends a wide range of natural language …

UniWav: Towards Unified Pre-training for Speech Representation Learning and Generation

Pre-training and representation learning have been playing an increasingly important role in modern speech processing. Nevertheless, different applications have been relying on different foundation models, since predominant pre-training techniques …

An Investigation of Incorporating Mamba for Speech Enhancement

This work aims to study a scalable state-space model (SSM), Mamba, for the speech enhancement (SE) task. We exploit a Mamba-based regression model to characterize speech signals and build an SE system upon Mamba, termed SEMamba. We explore the …

Detecting the Undetectable: Assessing the Efficacy of Current Spoof Detection Methods Against Seamless Speech Edits

Neural speech editing advancements have raised concerns about their misuse in spoofing attacks. Traditional partially edited speech corpora primarily focus on cut-and-paste edits, which, while maintaining speaker consistency, often introduce …

The VoiceMOS Challenge 2024: Beyond Speech Quality Prediction

We present the third edition of the VoiceMOS Challenge, a scientific initiative designed to advance research into automatic prediction of human speech ratings. There were three tracks. The first track was on predicting the quality of ``zoomed-in'' …

Desta: Enhancing speech language models through descriptive speech-text alignment

Recent speech language models (SLMs) typically incorporate pre-trained speech models to extend the capabilities from large language models (LLMs). In this paper, we propose a Descriptive Speech-Text Alignment approach that leverages speech captioning …