DORi: Discovering Object Relationship for Moment Localization of a
  Natural-Language Query in Video

There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

Abstract

This paper studies the task of temporal moment localization in a long untrimmed video using natural language query. Given a query sentence, the goal is to determine the start and end of the relevant segment within the video. Our key innovation is to learn a video feature embedding through a language-conditioned message-passing algorithm suitable for temporal moment localization which captures the relationships between humans, objects and activities in the video. These relationships are obtained by a spatial sub-graph that contextualizes the scene representation using detected objects and human features conditioned in the language query. Moreover, a temporal sub-graph captures the activities within the video through time. Our method is evaluated on three standard benchmark datasets, and we also introduce YouCookII as a new benchmark for this task. Experiments show our method outperforms state-of-the-art methods on these datasets, confirming the effectiveness of our approach.

Related collections

Author and article information

Journal

Publication date Created: 13 October 2020

Article

ArXiV ID: 2010.06260

SO-VID: f84249fc-2ef7-4dbc-9716-9069ae95d89f

License:

http://arxiv.org/licenses/nonexclusive-distrib/1.0/

History

Custom metadata

Categories cs.CV

ScienceOpen disciplines: Computer vision & Pattern recognition

Data availability:

ScienceOpen disciplines: Computer vision & Pattern recognition

DORi: Discovering Object Relationship for Moment Localization of a Natural-Language Query in Video

Read this article at

Abstract

Related collections

Recursive Rule based Visual Categorization

Author and article information

Journal

Article

History

Custom metadata

Comments

Comment on this article

Similar content 26