      Recurrent Neural Network Transducer for Audio-Visual Speech Recognition

      Preprint

          Abstract

          This work presents a large-scale audio-visual speech recognition system based on a recurrent neural network transducer (RNN-T) architecture. To support the development of such a system, we built a large audio-visual (A/V) dataset of segmented utterances extracted from public YouTube videos, yielding 31k hours of audio-visual training content. The performance of audio-only, visual-only, and audio-visual systems is compared on two large-vocabulary test sets: a set of utterance segments from public YouTube videos called YTDEV18 and the publicly available LRS3-TED set. To highlight the contribution of the visual modality, we also evaluated the performance of our system on the YTDEV18 set artificially corrupted with background noise and overlapping speech. To the best of our knowledge, our system significantly improves on the state of the art on the LRS3-TED set.
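
The abstract does not say how the YTDEV18 audio was corrupted. As an illustration only, the sketch below shows one common way to add an interfering signal to a clean utterance at a chosen signal-to-noise ratio (SNR). The function name `mix_at_snr`, its parameters, and the looping of short noise clips are assumptions made for this example, not details taken from the paper.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the mixture has the requested SNR in dB.

    Hypothetical helper for illustration; not the authors' procedure.
    Both inputs are sequences of float samples at the same sample rate.
    """
    if not speech:
        raise ValueError("speech must be non-empty")
    if not noise:
        raise ValueError("noise must be non-empty")
    # Loop the noise clip so that it covers the whole utterance.
    tiled = [noise[i % len(noise)] for i in range(len(speech))]
    speech_power = sum(x * x for x in speech) / len(speech)
    noise_power = sum(x * x for x in tiled) / len(tiled)
    if noise_power == 0.0:
        raise ValueError("noise must not be silent")
    # SNR_dB = 10 * log10(P_speech / (gain^2 * P_noise)); solve for gain.
    gain = math.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, tiled)]


# Usage with synthetic signals standing in for real audio.
clean = [math.sin(0.1 * i) for i in range(1000)]
interference = [math.cos(0.37 * i) for i in range(300)]
noisy = mix_at_snr(clean, interference, snr_db=10.0)
```

Under the same assumptions, passing a second utterance as `noise` would give a simple form of overlapping speech; lower `snr_db` values make the audio harder to recognize and leave more for the visual stream to contribute.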

                Author and article information

                Date: 08 November 2019
                arXiv ID: 1911.04890
                License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
                To be presented at the 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2019)
                eess.AS cs.CL cs.CV cs.LG cs.SD

                Computer vision & Pattern recognition, Theoretical computer science, Artificial intelligence, Graphics & Multimedia design, Electrical engineering
