      Is Open Access

      Research on Video Captioning Based on Multifeature Fusion

research-article
      Computational Intelligence and Neuroscience
      Hindawi


          Abstract

To address the problems that existing video captioning models attend to incomplete information and generate insufficiently accurate descriptions, a video captioning model that fuses image, audio, and motion optical-flow features is proposed. Models pretrained on a variety of large-scale datasets are used to extract video frame features, motion information, audio features, and video sequence features. An embedding layer based on the self-attention mechanism is designed to embed each single-modality feature and learn its parameters. Two schemes, joint representation and cooperative representation, are then used to fuse the multimodal feature vectors output by the embedding layer, so that the model can attend to different targets in the video and to their interactions, which effectively improves captioning performance. Experiments are carried out on the large-scale MSR-VTT and LSMDC datasets. On the MSR-VTT benchmark, the model obtains scores of 0.443, 0.327, 0.619, and 0.521 under the BLEU4, METEOR, ROUGEL, and CIDEr metrics, respectively. The results show that the proposed method effectively improves the performance of the video captioning model, and its evaluation indexes are improved over those of the comparison models.
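The pipeline the abstract describes (per-modality self-attention embedding followed by multimodal fusion) can be sketched in NumPy. This is a minimal illustration, not the paper's architecture: the feature dimensions, the mean-pooling step, and concatenation as the "joint representation" fusion are all assumptions made here for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def embed_modality(feats, d_model):
    """Single-modality embedding layer: linear projections plus one
    self-attention pass, then mean-pooling over the time axis."""
    d_in = feats.shape[1]
    Wq = rng.standard_normal((d_in, d_model)) / np.sqrt(d_in)
    Wk = rng.standard_normal((d_in, d_model)) / np.sqrt(d_in)
    Wv = rng.standard_normal((d_in, d_model)) / np.sqrt(d_in)
    Q, K, V = feats @ Wq, feats @ Wk, feats @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d_model))  # (T, T) attention weights
    return (attn @ V).mean(axis=0)              # (d_model,) pooled embedding

# Toy per-modality feature sequences (T frames x feature dim);
# the dimensions are placeholders, not the paper's extractor sizes.
image  = rng.standard_normal((8, 2048))   # e.g. CNN frame features
motion = rng.standard_normal((8, 1024))   # e.g. optical-flow features
audio  = rng.standard_normal((8, 128))    # e.g. audio features

d_model = 512
embeddings = [embed_modality(m, d_model) for m in (image, motion, audio)]

# Joint representation: concatenate the per-modality embeddings
# into one multimodal vector for the caption decoder.
joint = np.concatenate(embeddings)
print(joint.shape)  # (1536,)
```

Concatenation keeps each modality's information separate in the fused vector; a cooperative-representation scheme would instead project the modalities into a shared space before combining them.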


Most cited references: 40


          Deep Residual Learning for Image Recognition


            Long Short-Term Memory

Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter's (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient-based method called long short-term memory (LSTM). Truncating the gradient where this does no harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O(1). Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, backpropagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.
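The gating mechanism this abstract describes (multiplicative gates opening and closing access to the constant error flow through the cell state) can be shown as a toy single-cell LSTM step in NumPy. The sizes, initialization, and single stacked weight matrix are arbitrary choices for the sketch, not the original formulation's exact parameterization.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, b):
    """One LSTM step: input (i), forget (f), and output (o) gates
    regulate access to the cell state c, the 'constant error carousel'."""
    H = h.shape[0]
    z = W @ np.concatenate([x, h]) + b          # (4*H,) gate pre-activations
    i, f, o = (sigmoid(z[k * H:(k + 1) * H]) for k in range(3))
    g = np.tanh(z[3 * H:])                      # candidate cell update
    c = f * c + i * g                           # gated cell-state update
    h = o * np.tanh(c)                          # gated hidden output
    return h, c

rng = np.random.default_rng(1)
H, D = 4, 3                                     # hidden and input sizes (toy)
W = rng.standard_normal((4 * H, D + H)) * 0.1   # stacked gate weights
b = np.zeros(4 * H)

h, c = np.zeros(H), np.zeros(H)
for t in range(5):                              # run over a toy input sequence
    h, c = lstm_step(rng.standard_normal(D), h, c, W, b)
print(h.shape)  # (4,)
```

Because the forget gate multiplies the previous cell state rather than passing it through a squashing nonlinearity, gradients along `c` can survive many steps, which is what lets LSTM bridge the long time lags the abstract claims.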

              Going deeper with convolutions

Author and article information

Journal: Computational Intelligence and Neuroscience (Comput Intell Neurosci), Hindawi
ISSN: 1687-5265; 1687-5273
Published: 28 April 2022; Volume 2022, Article ID 1204909
Affiliations: School of Computer and Communication, Lanzhou University of Technology, Lanzhou, Gansu, China
Academic Editor: Le Sun
Author ORCID: https://orcid.org/0000-0002-6069-0678
DOI: 10.1155/2022/1204909
PMC: 9071958
Copyright © 2022 Hong Zhao et al.

This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

History: 22 March 2022; 9 April 2022; 15 April 2022
Funding: National Natural Science Foundation of China (Award IDs 62166025, 51668043); Science and Technology Project of Gansu Province (Award ID 21YF5GA073); Gansu Educational Science and Technology Innovation Project (Award IDs 2021CXZX-511, 2021CXZX-512)
Categories: Research Article; Neurosciences