
      End-to-End Learning of Motion Representation for Video Understanding

      Preprint


          Abstract

Despite the recent success of end-to-end learned representations, hand-crafted optical flow features are still widely used in video analysis tasks. To fill this gap, we propose TVNet, a novel end-to-end trainable neural network, to learn optical-flow-like features from data. TVNet subsumes a specific optical flow solver, the TV-L1 method, and is initialized by unfolding its optimization iterations as neural layers. TVNet can therefore be used directly without any extra learning. Moreover, it can be naturally concatenated with other task-specific networks to formulate an end-to-end architecture, thus making our method more efficient than current multi-stage approaches by avoiding the need to pre-compute and store features on disk. Finally, the parameters of the TVNet can be further fine-tuned by end-to-end training. This enables TVNet to learn richer and task-specific patterns beyond exact optical flow. Extensive experiments on two action recognition benchmarks verify the effectiveness of the proposed approach. Our TVNet achieves better accuracies than all compared methods, while being competitive with the fastest counterpart in terms of feature extraction time.
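
The core trick the abstract describes — unfolding the optimization iterations of the TV-L1 solver into network layers whose constants become trainable parameters — can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' TVNet: the update below is a plain gradient step on a linearized brightness-constancy term, with blurring standing in for the TV-L1 primal-dual smoothness updates, and all names (UnrolledStep, UnrolledFlowNet, tau, lambd) are hypothetical.

# A minimal sketch of unfolding an optical-flow solver into neural layers.
# NOT the authors' TVNet: the update rule and names are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

def spatial_gradients(img):
    # Central-difference gradients, as a classical solver would compute them.
    kx = torch.tensor([[[[-0.5, 0.0, 0.5]]]], dtype=img.dtype, device=img.device)
    ky = kx.transpose(2, 3)
    return F.conv2d(img, kx, padding=(0, 1)), F.conv2d(img, ky, padding=(1, 0))

class UnrolledStep(nn.Module):
    # One solver iteration rewritten as a layer: the solver's constants
    # become trainable parameters, initialized to their classical values.
    def __init__(self, tau=0.25, lambd=0.15):
        super().__init__()
        self.tau = nn.Parameter(torch.tensor(tau))      # step size
        self.lambd = nn.Parameter(torch.tensor(lambd))  # data-term weight

    def forward(self, flow, i1, i2):
        # Linearized brightness-constancy residual for single-channel frames:
        # rho = I2 - I1 + <grad I2, flow>.
        gx, gy = spatial_gradients(i2)
        rho = i2 - i1 + gx * flow[:, 0:1] + gy * flow[:, 1:2]
        # Gradient step on lambd * rho^2 with respect to the flow.
        flow = flow - self.tau * 2.0 * self.lambd * rho * torch.cat([gx, gy], dim=1)
        # Blurring as a crude stand-in for the TV regularizer.
        return F.avg_pool2d(flow, 3, stride=1, padding=1)

class UnrolledFlowNet(nn.Module):
    # A fixed number of unrolled iterations. With the initial values the stack
    # mimics the solver out of the box; fine-tuning the parameters end to end
    # lets it drift toward richer, task-specific motion features.
    def __init__(self, n_iters=10):
        super().__init__()
        self.steps = nn.ModuleList(UnrolledStep() for _ in range(n_iters))

    def forward(self, i1, i2):
        flow = i1.new_zeros(i1.shape[0], 2, *i1.shape[2:])
        for step in self.steps:
            flow = step(flow, i1, i2)
        return flow  # optical-flow-like features, shape (B, 2, H, W)

# Every operation is differentiable, so a recognition head can be stacked on
# the output and the whole pipeline trained jointly.
f1, f2 = torch.rand(2, 1, 64, 64), torch.rand(2, 1, 64, 64)
print(UnrolledFlowNet(n_iters=5)(f1, f2).shape)  # torch.Size([2, 2, 64, 64])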

Most cited references (14)

HMDB: A large video database for human motion recognition


DeepFlow: Large Displacement Optical Flow with Deep Matching


Temporal Segment Networks: Towards Good Practices for Deep Action Recognition

Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident. This paper aims to discover the principles to design effective ConvNet architectures for action recognition in videos and to learn these models given limited training samples. Our first contribution is the temporal segment network (TSN), a novel framework for video-based action recognition, which is based on the idea of long-range temporal structure modeling. It combines a sparse temporal sampling strategy and video-level supervision to enable efficient and effective learning using the whole action video. The other contribution is our study on a series of good practices in learning ConvNets on video data with the help of the temporal segment network. Our approach obtains state-of-the-art performance on the datasets of HMDB51 (69.4%) and UCF101 (94.2%). We also visualize the learned ConvNet models, which qualitatively demonstrates the effectiveness of the temporal segment network and the proposed good practices.
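
As a rough sketch of the sparse sampling plus segmental consensus scheme this abstract describes — one snippet drawn from each of K equal-length segments, with the video-level loss applied to the average of the snippet scores — consider the following. The helper names (segment_sample, SegmentConsensus) and the flat-feature backbone interface are assumptions for illustration, not the authors' API.

# A minimal sketch of sparse temporal sampling with segmental consensus,
# in the spirit of TSN. Names and interfaces here are hypothetical.
import torch
import torch.nn as nn

def segment_sample(n_frames, k_segments):
    # One random frame index per equal-length segment (assumes
    # n_frames >= k_segments).
    bounds = torch.linspace(0, n_frames, k_segments + 1).long().tolist()
    return [int(torch.randint(lo, hi, (1,)))
            for lo, hi in zip(bounds[:-1], bounds[1:])]

class SegmentConsensus(nn.Module):
    # Video-level prediction = average of snippet-level class scores, so a
    # single video-level loss supervises all sampled snippets jointly.
    def __init__(self, backbone, n_classes, feat_dim):
        super().__init__()
        self.backbone = backbone            # any image model: frames -> features
        self.fc = nn.Linear(feat_dim, n_classes)

    def forward(self, snippets):            # (B, K, C, H, W)
        b, k = snippets.shape[:2]
        feats = self.backbone(snippets.flatten(0, 1))   # (B*K, feat_dim)
        scores = self.fc(feats).view(b, k, -1)
        return scores.mean(dim=1)           # consensus over the K segments

# Toy usage with a stand-in backbone that returns flat features.
toy = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64))
model = SegmentConsensus(toy, n_classes=51, feat_dim=64)
idx = segment_sample(n_frames=300, k_segments=3)   # e.g. [41, 137, 259]
clips = torch.rand(2, 3, 3, 32, 32)                # (B, K=3, C, H, W)
print(model(clips).shape)                          # torch.Size([2, 51])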

Author and article information

Date: 02 April 2018
arXiv ID: 1804.00413
Record ID: 7e3ecd41-123d-4906-889f-32da441c92a0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/ (Open Access)
Comments: CVPR 2018 spotlight. The first two authors contributed equally to this paper.
Subject: cs.CV
