A Convolutional Deep Markov Model for Unsupervised Speech Representation Learning

Abstract

Probabilistic Latent Variable Models (LVMs) provide an alternative to self-supervised learning approaches for linguistic representation learning from speech. LVMs admit an intuitive probabilistic interpretation in which the latent structure shapes the information extracted from the signal. Even though LVMs have recently seen renewed interest due to the introduction of Variational Autoencoders (VAEs), their use for speech representation learning remains largely unexplored. In this work, we propose the Convolutional Deep Markov Model (ConvDMM), a Gaussian state-space model with non-linear emission and transition functions modelled by deep neural networks. This unsupervised model is trained using black-box variational inference. A deep convolutional neural network is used as an inference network for structured variational approximation. When trained on a large-scale speech dataset (LibriSpeech), ConvDMM produces features that significantly outperform multiple self-supervised feature extraction methods on linear phone classification and recognition on the Wall Street Journal dataset. Furthermore, we find that ConvDMM complements self-supervised methods like Wav2Vec and PASE, improving on the results achieved with any of the methods alone. Lastly, we find that ConvDMM features enable learning better phone recognizers than any other features in an extreme low-resource regime with few labeled training examples.
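To make the phrase "Gaussian state-space model with non-linear emission and transition functions" concrete, the following is a minimal sketch of ancestral sampling from such a model. It is not the ConvDMM architecture: the small tanh networks, the fixed noise scales `z_std` and `x_std`, the dimensions, and all function names are assumptions chosen for illustration, and the convolutional inference network and variational training are omitted entirely.

```python
import math
import random


def init_layer(rng, n_in, n_out):
    """Random weights and zero biases for one dense layer (illustrative only)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(layer, vec, activation=None):
    """Apply y = W vec + b, optionally followed by an elementwise activation."""
    weights, biases = layer
    out = [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, biases)]
    return [activation(o) for o in out] if activation else out


def make_net(rng, n_in, n_hidden, n_out):
    """Two-layer tanh network standing in for a non-linear transition or emission."""
    hidden = init_layer(rng, n_in, n_hidden)
    output = init_layer(rng, n_hidden, n_out)
    return lambda vec: dense(output, dense(hidden, vec, math.tanh))


def sample_sequence(rng, transition, emission, z_dim, steps, z_std=0.1, x_std=0.1):
    """Sample z_1 ~ N(0, I), then x_t ~ N(g(z_t), x_std^2 I), z_{t+1} ~ N(f(z_t), z_std^2 I)."""
    z = [rng.gauss(0.0, 1.0) for _ in range(z_dim)]  # initial latent state
    latents, frames = [], []
    for _ in range(steps):
        latents.append(z)
        # Emission: the observed frame depends only on the current latent state.
        frames.append([rng.gauss(m, x_std) for m in emission(z)])
        # Transition: the next latent state depends only on the current one (Markov).
        z = [rng.gauss(m, z_std) for m in transition(z)]
    return latents, frames


# Hypothetical sizes, chosen only to keep the example small.
Z_DIM, X_DIM, HIDDEN, STEPS = 4, 8, 16, 5
rng = random.Random(0)
transition = make_net(rng, Z_DIM, HIDDEN, Z_DIM)
emission = make_net(rng, Z_DIM, HIDDEN, X_DIM)
latents, frames = sample_sequence(rng, transition, emission, Z_DIM, STEPS)
print(len(latents), len(latents[0]), len(frames), len(frames[0]))  # 5 4 5 8
```

In this sketch the latent sequence `latents` plays the role of the learned representation, while `frames` stands in for acoustic feature vectors. In a trained model the noise scales would typically also be network outputs rather than constants, and an inference network would map observed frames back to a distribution over the latents.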

Author and article information

Publication date (created): 03 June 2020

Article

arXiv ID: 2006.02547

SO-VID: 03364260-7129-4294-9059-80dbddde12c3

License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/

Custom metadata

Comments: Submitted to Interspeech 2020

Categories: eess.AS, cs.CL, cs.LG, cs.SD

ScienceOpen disciplines: Theoretical computer science, Artificial intelligence, Graphics & Multimedia design, Electrical engineering
