SLMGAN: Exploiting Speech Language Model Representations for Unsupervised Zero-Shot Voice Conversion in GANs

Abstract

In recent years, large-scale pre-trained speech language models (SLMs) have demonstrated remarkable advancements in various generative speech modeling applications, such as text-to-speech synthesis, voice conversion, and speech enhancement. These applications typically involve mapping text or speech inputs to pre-trained SLM representations, from which target speech is decoded. This paper introduces a new approach, SLMGAN, to leverage SLM representations for discriminative tasks within the generative adversarial network (GAN) framework, specifically for voice conversion. Building upon StarGANv2-VC, we add our novel SLM-based WavLM discriminators on top of the mel-based discriminators along with our newly designed SLM feature matching loss function, resulting in an unsupervised zero-shot voice conversion system that does not require text labels during training. Subjective evaluation results show that SLMGAN outperforms existing state-of-the-art zero-shot voice conversion models in terms of naturalness and achieves comparable similarity, highlighting the potential of SLM-based discriminators for related applications.
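The abstract does not give the form of the SLM feature matching loss. As a rough illustration of the general feature-matching idea used in GAN training, the sketch below computes the mean absolute difference between intermediate features extracted from real and converted speech, averaged over layers. The function name, the plain-list feature format, and the per-layer averaging are all assumptions made for illustration; the lists stand in for WavLM layer activations and this is not the paper's actual loss.

```python
from typing import Sequence


def feature_matching_loss(
    real_feats: Sequence[Sequence[float]],
    fake_feats: Sequence[Sequence[float]],
) -> float:
    """Generic feature-matching loss (hypothetical helper, not from the paper).

    Each argument is a list of per-layer feature vectors. For every layer,
    take the mean absolute difference between real and generated features,
    then average those per-layer values over all layers.
    """
    if len(real_feats) != len(fake_feats):
        raise ValueError("real and generated features must have the same number of layers")
    if not real_feats:
        raise ValueError("at least one layer of features is required")

    total = 0.0
    for real, fake in zip(real_feats, fake_feats):
        if len(real) != len(fake) or not real:
            raise ValueError("each layer must have matching, non-empty feature vectors")
        # Mean L1 distance for this layer.
        total += sum(abs(r - f) for r, f in zip(real, fake)) / len(real)

    # Average over layers so the scale does not grow with network depth.
    return total / len(real_feats)


# Toy features standing in for two layers of activations (assumed values).
real = [[1.0, 2.0], [0.0]]
fake = [[1.0, 4.0], [1.0]]
loss = feature_matching_loss(real, fake)
```

In a real system the features would be tensors produced by a network on batches of audio, and the generator would be trained to minimize this quantity alongside the adversarial loss.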

Author and article information

Publication date Created: 18 July 2023

Article

arXiv ID: 2307.09435

SO-VID: 94cb15b3-87b6-48c6-b1c4-6179e2899f4c

License:

http://arxiv.org/licenses/nonexclusive-distrib/1.0/

Custom metadata

Comments WASPAA 2023

Categories eess.AS cs.AI cs.SD

ScienceOpen disciplines: Artificial intelligence, Graphics & Multimedia design, Electrical engineering
