Open Access

      Overcoming Language Priors in Visual Question Answering with Adversarial Regularization

      Preprint
Sainandan Ramakrishnan, Aishwarya Agrawal, Stefan Lee


          Abstract

          Modern Visual Question Answering (VQA) models have been shown to rely heavily on superficial correlations between question and answer words learned during training, such as overwhelmingly reporting the type of room as kitchen or the sport being played as tennis, irrespective of the image. Most alarmingly, this shortcoming is often not well reflected during evaluation because the same strong priors exist in test distributions; however, a VQA system that fails to ground questions in image content would likely perform poorly in real-world settings. In this work, we present a novel regularization scheme for VQA that reduces this effect. We introduce a question-only model that takes as input the question encoding from the VQA model and must leverage language biases in order to succeed. We then pose training as an adversarial game between the VQA model and this question-only adversary, discouraging the VQA model from capturing language biases in its question encoding. Further, we leverage this question-only model to estimate the increase in model confidence after considering the image, which we maximize explicitly to encourage visual grounding. Our approach is a model-agnostic training procedure that is simple to implement. We show empirically that it can improve performance significantly on a bias-sensitive split of the VQA dataset for multiple base models, achieving state-of-the-art on this task. Further, on standard VQA tasks, our approach shows a significantly smaller drop in accuracy than existing bias-reducing VQA models.
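          To make the training scheme concrete, below is a minimal PyTorch sketch of the two regularizers the abstract describes: a question-only adversary trained through a gradient-reversal layer, and a confidence-increase term. The names (GradReverse, QuestionOnlyAdversary, regularized_vqa_loss, lambda_q, lambda_h) are illustrative rather than taken from the authors' released code, and the exact form of the confidence term is an assumption based on the abstract's wording.

          import torch
          import torch.nn as nn
          import torch.nn.functional as F

          class GradReverse(torch.autograd.Function):
              # Identity on the forward pass; negates (and scales) the gradient on backward.
              @staticmethod
              def forward(ctx, x, lambd):
                  ctx.lambd = lambd
                  return x.view_as(x)

              @staticmethod
              def backward(ctx, grad_output):
                  return -ctx.lambd * grad_output, None

          def grad_reverse(x, lambd=1.0):
              return GradReverse.apply(x, lambd)

          class QuestionOnlyAdversary(nn.Module):
              # Predicts the answer from the question encoding alone, so it can
              # only succeed by exploiting language priors.
              def __init__(self, q_dim, num_answers):
                  super().__init__()
                  self.net = nn.Sequential(
                      nn.Linear(q_dim, q_dim), nn.ReLU(), nn.Linear(q_dim, num_answers)
                  )

              def forward(self, q_enc):
                  return self.net(q_enc)

          def entropy(logits):
              # Shannon entropy of the softmax distribution, one value per example.
              log_p = F.log_softmax(logits, dim=-1)
              return -(log_p.exp() * log_p).sum(dim=-1)

          def regularized_vqa_loss(vqa_logits, q_enc, adversary, answers,
                                   lambda_q=1.0, lambda_h=1.0):
              # Standard answer-classification loss for the full (image + question) model.
              loss_vqa = F.cross_entropy(vqa_logits, answers)

              # Adversarial game: the adversary minimizes this loss, but the gradient-
              # reversal layer flips its gradient w.r.t. q_enc, so the VQA question
              # encoder is simultaneously pushed to strip language biases from q_enc.
              q_only_logits = adversary(grad_reverse(q_enc, lambda_q))
              loss_adv = F.cross_entropy(q_only_logits, answers)

              # Confidence-increase term (assumed form): reward the full model for
              # being more confident (lower entropy) than the question-only predictor.
              # The detach keeps this term from feeding gradients into the adversary.
              confidence_gain = entropy(q_only_logits.detach()) - entropy(vqa_logits)

              return loss_vqa + loss_adv - lambda_h * confidence_gain.mean()

          Both the base VQA model and the adversary are updated jointly; because of the reversal layer, a single backward pass trains the adversary to exploit language priors while pushing the question encoder to discard them.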


                Author and article information

                08 October 2018
                arXiv: 1810.03649
                http://arxiv.org/licenses/nonexclusive-distrib/1.0/
                Custom metadata
                NIPS 2018. 11 pages (with references), 4 figures, 2 tables
                cs.CV
                Computer vision & Pattern recognition
