SpeakerNet: 1D Depth-wise Separable Convolutional Network for Text-Independent Speaker Recognition and Verification

We propose SpeakerNet - a new neural architecture for speaker recognition and speaker verification tasks. It is composed of residual blocks with 1D depth-wise separable convolutions, batch-normalization, and ReLU layers. This architecture uses x-vector based statistics pooling layer to map variable-length utterances to a fixed-length embedding (q-vector). SpeakerNet-M is a simple lightweight model with just 5M parameters. It doesn't use voice activity detection (VAD) and achieves close to state-of-the-art performance scoring an Equal Error Rate (EER) of 2.10% on the VoxCeleb1 cleaned and 2.29% on the VoxCeleb1 trial files.

Authors

Nithin Rao Koluguri (NVIDIA)

Jason Li (NVIDIA)

Vitaly Lavrukhin (NVIDIA)

Boris Ginsburg (NVIDIA)

Publication Date

Friday, October 23, 2020

Research Area

Speech Processing

External Links

Paper