
      Quantifying and Alleviating the Language Prior Problem in Visual Question Answering

      Preprint


          Abstract

Benefiting from advances in computer vision, natural language processing, and information retrieval, visual question answering (VQA), which aims to answer questions about an image or a video, has received considerable attention over the past few years. Although some progress has been made, several studies have pointed out that current VQA models are heavily affected by the language prior problem: they tend to answer questions based on co-occurrence patterns between question keywords (e.g., "how many") and answers (e.g., "2") rather than by understanding the image and the question. Existing methods attempt to address this problem either by balancing the biased datasets or by forcing models to better understand images; however, the former yields only marginal improvements and the latter can even degrade performance. Another important issue is the lack of a metric that quantifies the extent of the language prior effect, which severely hinders progress on related techniques. In this paper, we address these problems from two perspectives. First, we design a metric to quantitatively measure the language prior effect of VQA models; our empirical studies demonstrate its effectiveness. Second, we propose a regularization method (a score regularization module) that enhances current VQA models by alleviating the language prior problem while also boosting backbone performance. The score regularization module adopts a pair-wise learning strategy, which drives VQA models to answer a question by reasoning over the image rather than by relying on question-answer patterns observed in the biased training set. The module can be flexibly integrated into various VQA models.
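
The abstract does not spell out the score regularization module, but its pair-wise learning idea can be illustrated with a minimal sketch. Assuming a PyTorch backbone that produces per-answer scores for an (image, question) pair and, separately, for the question without (or with a mismatched) image, a margin-based ranking term can require the grounded score of the correct answer to exceed the prior-only score. All names and hyperparameters below (PairwiseScoreRegularizer, margin, weight) are illustrative assumptions, not the authors' implementation.

# Illustrative sketch only: the paper's score regularization module is not
# specified here, so this assumes a margin-based pair-wise ranking loss that
# rewards the image-grounded score over a question-only (prior) score.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PairwiseScoreRegularizer(nn.Module):
    """Hypothetical pair-wise regularizer layered on top of a VQA backbone."""

    def __init__(self, margin: float = 0.2, weight: float = 1.0):
        super().__init__()
        self.margin = margin   # required gap between grounded and prior-only scores
        self.weight = weight   # strength of the regularization term

    def forward(self, grounded_scores, prior_scores, answer_idx):
        # grounded_scores: [B, num_answers] scores from (image, question)
        # prior_scores:    [B, num_answers] scores from the question alone
        #                  (or with a shuffled/mismatched image)
        # answer_idx:      [B] index of the ground-truth answer
        s_pos = grounded_scores.gather(1, answer_idx.unsqueeze(1)).squeeze(1)
        s_neg = prior_scores.gather(1, answer_idx.unsqueeze(1)).squeeze(1)
        # Hinge term: the grounded score should beat the prior-only score by a margin,
        # penalizing answers that are recoverable from the question alone.
        reg = F.relu(self.margin - (s_pos - s_neg)).mean()
        return self.weight * reg

# Usage sketch: total loss = standard VQA loss + pair-wise regularization.
# vqa_loss = F.binary_cross_entropy_with_logits(grounded_scores, targets)
# loss = vqa_loss + regularizer(grounded_scores, prior_scores, answer_idx)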


                Author and article information

Posted: 13 May 2019
arXiv ID: 1905.04877
ScienceOpen record ID: b48df8b3-dc50-4ae4-b8fe-cb17f8e5cf5b
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/ (open access)
arXiv categories: cs.CV, cs.CL, cs.IR
Subject areas: Computer vision & Pattern recognition; Theoretical computer science; Information & Library science
