
      Watch It Twice: Video Captioning with a Refocused Video Encoder

      Preprint


          Abstract

          With the rapid growth of video data and the increasing demands of applications such as intelligent video search and assistance for visually impaired people, the video captioning task has recently received considerable attention in the computer vision and natural language processing communities. State-of-the-art video captioning methods concentrate on encoding temporal information, but they lack effective ways to remove irrelevant temporal information and they also neglect spatial details. In particular, an RNN encoding module that processes the video in a single temporal order can be misled by irrelevant temporal information, especially when that information appears at the beginning of the encoding. In addition, neglecting spatial information leads to confusion about the relationships among words and to a loss of detail. Therefore, in this paper we propose a novel recurrent video encoding method and a novel visual spatial feature for the video captioning task. The recurrent encoding module encodes the video twice, guided by a predicted key frame, to avoid the irrelevant temporal information that often occurs at the beginning and the end of a video. The novel spatial features represent the spatial information in different regions of a video and enrich the details of a caption. Experiments on two benchmark datasets show the superior performance of the proposed method.
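
          Only the abstract is available here, so the sketch below is one plausible reading of the "watch it twice" idea rather than the authors' implementation: a first GRU pass scores frames to predict a key frame, and a second pass re-encodes the frame features starting from the key-frame state, so irrelevant content at the start and end of the video has less influence. All class, variable, and dimension choices (RefocusedVideoEncoder, 2048-d frame features, 512-d hidden state) are illustrative assumptions.

          import torch
          import torch.nn as nn

          class RefocusedVideoEncoder(nn.Module):
              def __init__(self, feat_dim=2048, hidden_dim=512):
                  super().__init__()
                  self.first_pass = nn.GRU(feat_dim, hidden_dim, batch_first=True)
                  self.second_pass = nn.GRU(feat_dim, hidden_dim, batch_first=True)
                  self.key_frame_scorer = nn.Linear(hidden_dim, 1)

              def forward(self, frame_feats):
                  # frame_feats: (batch, num_frames, feat_dim) CNN features, one per sampled frame
                  states1, _ = self.first_pass(frame_feats)            # (B, T, H) first watch
                  scores = self.key_frame_scorer(states1).squeeze(-1)  # (B, T) per-frame relevance
                  key_idx = scores.argmax(dim=1)                       # predicted key frame per video
                  # Second ("refocused") watch: initialise the GRU from the hidden state at the
                  # key frame, anchoring the re-encoding on the salient part of the video.
                  batch_idx = torch.arange(frame_feats.size(0))
                  init = states1[batch_idx, key_idx].unsqueeze(0)      # (1, B, H)
                  states2, final = self.second_pass(frame_feats, init.contiguous())
                  return states2, final.squeeze(0)                     # per-frame states, video summary

          # Usage: encode 2 videos, each with 30 frames of 2048-d features.
          encoder = RefocusedVideoEncoder()
          video = torch.randn(2, 30, 2048)
          states, summary = encoder(video)
          print(states.shape, summary.shape)  # torch.Size([2, 30, 512]) torch.Size([2, 512])

          Note that a hard argmax over frame scores is not differentiable; the paper's key-frame predictor presumably uses its own supervision or a soft attention, which this sketch does not model.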


                Author and article information

                Date: 21 July 2019
                Article: 1907.12905 (arXiv)
                Record ID: 8ba3fc00-93b1-4762-881f-ee4412113553
                License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
                Category: cs.CV
                Subject: Computer vision & Pattern recognition
