 # UniWav: Towards Unified Pre-training for Speech Representation Learning and Generation

  ![](/sites/default/files/styles/wide/public/publications/Screenshot%202025-03-06%20at%201.19.03%20AM.png?itok=LtGop__6)

Pre-training and representation learning play an increasingly important role in modern speech processing. Nevertheless, different applications rely on different foundation models, since predominant pre-training techniques are designed for either discriminative or generative tasks. In this work, we make the first attempt at building a unified pre-training framework for both types of tasks in speech. We show that with the appropriate design choices for pre-training, one can jointly learn a representation encoder and a generative audio decoder that can be applied to both types of tasks. We propose UniWav, an encoder-decoder framework designed to unify pre-training for representation learning and generative tasks. On speech recognition, text-to-speech, and speech tokenization, UniWav achieves performance comparable to that of existing foundation models, each trained for a specific task. Our findings suggest that a single general-purpose foundation model for speech can replace multiple task-specific foundation models, reducing the overhead and cost of pre-training.


 ## Authors


Alexander H. Liu (NVIDIA)

Sang-gil Lee (NVIDIA)

[Huck Yang](/person/huck-yang)

Yuan Gong (XAI)

[Frank Wang](/person/frank-wang)

James R. Glass (MIT)

Rafael Valle (NVIDIA)

 
 ## Publication Date


Tuesday, April 15, 2025

 
 ## Published in


[ICLR 2025](https://openreview.net/forum?id=yj9lLwMjnE)

 
 ## Research Area


[Generative AI](/research-area/generative-ai)

[Speech Processing](/research-area/speech-processing)