
Semantic keyword spotting by learning from images and speech

Preprint


Abstract

We consider the problem of representing semantic concepts in speech by learning from untranscribed speech paired with images of scenes. This setting is relevant in low-resource speech processing, robotics, and human language acquisition research. We use an external image tagger to generate soft labels, which serve as targets for training a neural model that maps speech to keyword labels. We introduce a newly collected data set of human semantic relevance judgements and an associated task, semantic keyword spotting, where the goal is to search for spoken utterances that are semantically relevant to a given text query. Without seeing any text, the model trained on parallel speech and images achieves a precision of almost 60% on its top ten semantic retrievals. Compared to a model trained on transcriptions, our model matches human judgements better by some measures, especially in retrieving non-verbatim semantic matches.
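To make the setup concrete, the sketch below shows one way the pipeline in the abstract could be implemented: an external image tagger supplies per-keyword probabilities for the image paired with each utterance, and a speech network is trained to predict those probabilities from audio alone. Everything here is an assumption for illustration, not the authors' implementation — the convolutional encoder SpeechEncoder, the max-pooling over time, the binary cross-entropy against soft targets, and the sizes VOCAB_SIZE and N_MELS are plausible choices invented for this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative constants -- the real vocabulary and feature sizes are
# assumptions, not values taken from the paper.
VOCAB_SIZE = 1000   # number of keyword types the image tagger can emit
N_MELS = 40         # acoustic features (e.g. log-mel filterbanks) per frame


class SpeechEncoder(nn.Module):
    """Maps a batch of speech feature sequences to per-keyword logits."""

    def __init__(self, n_mels=N_MELS, hidden=512, vocab=VOCAB_SIZE):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=9, padding=4),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=9, padding=4),
            nn.ReLU(),
        )
        self.out = nn.Linear(hidden, vocab)

    def forward(self, feats):
        # feats: (batch, n_mels, n_frames)
        h = self.conv(feats)          # (batch, hidden, n_frames)
        h = h.max(dim=2).values       # pool over time -> (batch, hidden)
        return self.out(h)            # (batch, vocab) keyword logits


def train_step(model, optimizer, feats, soft_labels):
    """One update. soft_labels (batch, vocab) are the image tagger's
    per-keyword probabilities for the image paired with each utterance;
    binary cross-entropy with these soft targets is one plausible loss."""
    logits = model(feats)
    loss = F.binary_cross_entropy_with_logits(logits, soft_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


def spot_keyword(model, feats_all, query_id, k=10):
    """Semantic keyword spotting: score every utterance for the query
    keyword and return the indices of the k highest-scoring utterances."""
    with torch.no_grad():
        scores = torch.sigmoid(model(feats_all))[:, query_id]
    return scores.topk(k).indices
```

Under this reading, the P@10 figure reported in the abstract is the fraction of the ten utterances returned by spot_keyword that human annotators judged semantically relevant to the query.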


Author and article information

Preprint posted: 05 October 2017
arXiv ID: 1710.01949
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Comments: 13 pages, 3 figures, 4 tables
Subjects: cs.CL, cs.CV
