      Is Open Access

      Watch, Listen and Tell: Multi-modal Weakly Supervised Dense Event Captioning

      Preprint


          Abstract

          Multi-modal learning, particularly among imaging and linguistic modalities, has made amazing strides in many high-level fundamental visual understanding problems, ranging from language grounding to dense event captioning. However, much of the research has been limited to approaches that either do not take the audio corresponding to a video into account at all, or that model the audio-visual correlations in service of sound or sound source localization. In this paper, we present evidence that audio signals can carry a surprising amount of information when it comes to high-level visual-lingual tasks. Specifically, we focus on the problem of weakly-supervised dense event captioning in videos and show that audio on its own can nearly rival the performance of a state-of-the-art visual model and, combined with video, can improve on the state-of-the-art performance. Extensive experiments on the ActivityNet Captions dataset show that our proposed multi-modal approach outperforms state-of-the-art unimodal methods, and they validate specific feature representation and architecture design choices.


                Author and article information

                Date: 22 September 2019
                arXiv: 1909.09944
                License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
                Custom metadata: ICCV2019, cs.CV
                Subject: Computer vision & Pattern recognition
