Open Access

      Spatio-Temporal Dynamics and Semantic Attribute Enriched Visual Encoding for Video Captioning

      Preprint


          Abstract

Automatic generation of video captions is a fundamental challenge in computer vision. Recent techniques typically employ a combination of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) for video captioning. These methods mainly focus on tailoring sequence learning through RNNs for better caption generation, whereas off-the-shelf visual features are borrowed from CNNs. We argue that careful design of visual features for this task is equally important, and present a visual feature encoding technique to generate semantically rich captions using Gated Recurrent Units (GRUs). Our method embeds rich temporal dynamics in visual features by hierarchically applying the Short Fourier Transform to CNN features of the whole video. It additionally derives high-level semantics from an object detector to enrich the representation with spatial dynamics of the detected objects. The final representation is projected to a compact space and fed to a language model. By learning a relatively simple language model comprising two GRU layers, we establish a new state-of-the-art on the MSVD and MSR-VTT datasets for the METEOR and ROUGE_L metrics.
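
The abstract outlines two encoding ideas: a hierarchical Fourier transform over the temporal axis of per-frame CNN features, and a two-layer GRU decoder over a compact projection of the result. Below is a minimal PyTorch sketch of both, not the authors' implementation; the clip-splitting scheme, the number of retained coefficients, and all layer sizes are illustrative assumptions.

    import torch
    import torch.nn as nn

    def hierarchical_fourier_encoding(frame_feats, levels=3, keep=8):
        # frame_feats: (T, D) tensor of per-frame CNN features.
        # At level l the video is split into 2**l equal clips; for each clip
        # we keep the magnitudes of the first `keep` temporal FFT coefficients
        # and concatenate everything into one fixed-length vector.
        T, D = frame_feats.shape
        pieces = []
        for level in range(levels):
            for clip in frame_feats.chunk(2 ** level, dim=0):
                mag = torch.fft.rfft(clip, dim=0).abs()   # FFT over time
                if mag.shape[0] < keep:                   # zero-pad short clips
                    mag = torch.cat([mag, mag.new_zeros(keep - mag.shape[0], D)])
                pieces.append(mag[:keep].reshape(-1))
        return torch.cat(pieces)                          # ((2**levels - 1) * keep * D,)

    class CaptionDecoder(nn.Module):
        # Two stacked GRU layers that decode a caption from the visual code.
        def __init__(self, visual_dim, vocab_size, embed_dim=300, hidden=512):
            super().__init__()
            self.project = nn.Linear(visual_dim, hidden)  # compact projection
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.gru = nn.GRU(embed_dim, hidden, num_layers=2, batch_first=True)
            self.out = nn.Linear(hidden, vocab_size)

        def forward(self, visual_code, tokens):
            # visual_code: (B, visual_dim); tokens: (B, L) word indices.
            h0 = torch.tanh(self.project(visual_code))    # (B, hidden)
            h0 = h0.unsqueeze(0).repeat(2, 1, 1)          # init both GRU layers
            logits, _ = self.gru(self.embed(tokens), h0)
            return self.out(logits)                       # (B, L, vocab_size)

With ResNet-style features (D = 2048) and the defaults above, the encoding is a 56 x 2048-dimensional vector, which the linear projection compresses before decoding; the object-detector semantics described in the abstract would be concatenated to visual_code ahead of that projection.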

Most cited references (17)


          Sequence to Sequence -- Video to Text


            Video Captioning With Attention-Based LSTM and Semantic Consistency


A Survey on Learning to Hash

Nearest neighbor search is the problem of finding the data points in a database whose distances to a query point are smallest. Learning to hash is one of the major solutions to this problem and has been widely studied in recent years. In this paper, we present a comprehensive survey of learning to hash algorithms, categorize them by the manner in which they preserve similarities (pairwise similarity preserving, multiwise similarity preserving, implicit similarity preserving, and quantization), and discuss their relations. We treat quantization separately from pairwise similarity preserving because its objective function is very different, even though, as we show, quantization can be derived from preserving pairwise similarities. In addition, we present evaluation protocols and a general performance analysis, and point out that quantization algorithms perform superiorly in terms of search accuracy, search time cost, and space cost. Finally, we introduce a few emerging topics.
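
As a toy illustration of the search problem this survey addresses (not code from the survey itself), the sketch below uses random-hyperplane hashing, one instance of the implicit similarity-preserving family: binary codes are compared by cheap Hamming distance to shortlist candidates, which are then re-ranked exactly. All sizes are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    database = rng.normal(size=(10_000, 128))   # 10k points, 128-dim
    query = rng.normal(size=128)

    planes = rng.normal(size=(64, 128))         # 64 random hyperplanes -> 64-bit codes
    db_codes = (database @ planes.T) > 0        # boolean code per database point
    q_code = (planes @ query) > 0

    hamming = (db_codes != q_code).sum(axis=1)  # Hamming distance to the query code
    candidates = np.argsort(hamming)[:10]       # shortlist the 10 closest codes
    # re-rank the small candidate set with exact Euclidean distance
    best = candidates[np.argmin(np.linalg.norm(database[candidates] - query, axis=1))]
    print("approximate nearest neighbor index:", best)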

                Author and article information

Published: 26 February 2019 (preprint)
arXiv: 1902.10322
Record ID: a1c95cc5-660c-4a16-aa43-780681cfd485

License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/

Custom metadata: Accepted in CVPR-2019
Subject: cs.CV

Computer vision & Pattern recognition
