Is Open Access

From Semi-supervised to Almost-unsupervised Speech Recognition with Very-low Resource by Jointly Learning Phonetic Structures from Audio and Text Embeddings

Preprint

Author(s): Yi-Chen Chen, Sung-Feng Huang, Hung-yi Lee, Lin-shan Lee

Publication date (created): 10 April 2019

Abstract

Producing a large amount of annotated speech data for training ASR systems remains difficult for the more than 95% of the world's languages that are low-resourced. However, we note that human infants begin learning a language from the sounds (or phonetic structures) of a small number of exemplar words, and "generalize" that knowledge to other words without hearing a large amount of data. We initiate some preliminary work in this direction. Audio Word2Vec is used to learn the phonetic structures from spoken words (signal segments), while another autoencoder is used to learn the phonetic structures from text words. The relationship between the two can be learned jointly, or separately after both are well trained. This relationship can then be used for speech recognition with very low resources. In initial experiments on the TIMIT dataset, only 2.1 hours of speech data (in which 2500 spoken words were annotated and the rest left unlabeled) gave a word error rate of 44.6%, and this number can be reduced to 34.2% if 4.1 hours of speech data (in which 20000 spoken words were annotated) are given. These results are not yet satisfactory, but they are a good starting point.
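The abstract reports its results as word error rates but does not say which scoring tool was used. As a reminder of the standard definition only, word error rate is the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below illustrates that definition; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative sketch: tokenization is plain whitespace splitting,
    which is an assumption, not the paper's scoring setup.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # (initially none) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors while the denominator is the reference length, a word error rate computed this way can exceed 100% for a hypothesis much longer than its reference.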

Author and article information

arXiv ID: 1904.05078

SO-VID: 1bbbb423-b1f1-4df2-ba19-bf66bb9fbcfc

License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/

Categories: cs.CL, cs.SD, eess.AS

ScienceOpen disciplines: Theoretical computer science, Electrical engineering, Graphics & Multimedia design
