Semantic keyword spotting by learning from images and speech

Abstract

We consider the problem of representing semantic concepts in speech by learning from untranscribed speech paired with images of scenes. This setting is relevant in low-resource speech processing, robotics, and human language acquisition research. We use an external image tagger to generate soft labels, which serve as targets for training a neural model that maps speech to keyword labels. We introduce a newly collected data set of human semantic relevance judgements and an associated task, semantic keyword spotting, where the goal is to search for spoken utterances that are semantically relevant to a given text query. Without seeing any text, the model trained on parallel speech and images achieves a precision of almost 60% on its top ten semantic retrievals. Compared to a model trained on transcriptions, our model matches human judgements better by some measures, especially in retrieving non-verbatim semantic matches.
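The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the speech model emits one sigmoid score per keyword, that the image tagger's soft labels lie in [0, 1], and that training minimises a summed cross-entropy between the two; the abstract does not specify the loss. The function names and toy data are hypothetical. The sketch also shows how precision on the top-k retrievals (k = 10 in the abstract) could be computed against human relevance judgements.

```python
import math


def soft_label_loss(predicted, soft_targets, eps=1e-12):
    """Summed cross-entropy between per-keyword sigmoid outputs and soft targets.

    Assumption: both sequences hold one value in [0, 1] per keyword.
    """
    total = 0.0
    for p, t in zip(predicted, soft_targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total


def rank_utterances(scores_by_utterance, keyword):
    """Return utterance ids sorted by descending model score for the query keyword."""
    return sorted(
        scores_by_utterance,
        key=lambda utt: scores_by_utterance[utt].get(keyword, 0.0),
        reverse=True,
    )


def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the top-k retrieved utterances judged relevant."""
    top = ranked_ids[:k]
    if not top:
        return 0.0
    return sum(1 for utt in top if utt in relevant_ids) / len(top)


# Toy, made-up scores: utterance id -> {keyword: model score}.
scores = {
    "utt1": {"beach": 0.9, "dog": 0.1},
    "utt2": {"beach": 0.2, "dog": 0.8},
    "utt3": {"beach": 0.7, "dog": 0.3},
    "utt4": {"beach": 0.4},
}
# Hypothetical human judgements: utterances semantically relevant to "beach".
relevant_to_beach = {"utt1", "utt4"}

ranked = rank_utterances(scores, "beach")
print(ranked)
print(precision_at_k(ranked, relevant_to_beach, k=2))
```

In semantic keyword spotting, the relevant set would include utterances that never contain the query word itself, which is why the judgements are collected from human annotators instead of being derived from transcriptions.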


Author and article information

Journal

Publication date (created): 05 October 2017

Article

arXiv ID: 1710.01949

SO-VID: 5c5fdb80-c100-4a58-bde5-95a4a2a114a8

License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/

Custom metadata

Comments: 13 pages, 3 figures, 4 tables

Categories: cs.CL, cs.CV
