
      Supervised Multimodal Bitransformers for Classifying Images and Text

      Preprint


          Abstract

          Self-supervised bidirectional transformer models such as BERT have led to dramatic improvements in a wide variety of textual classification tasks. The modern digital world is increasingly multimodal, however, and textual information is often accompanied by other modalities such as images. We introduce a supervised multimodal bitransformer model that fuses information from text and image encoders, and obtain state-of-the-art performance on various multimodal classification benchmark tasks, outperforming strong baselines, including on hard test sets specifically designed to measure multimodal performance.
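
          The abstract describes the model only as fusing information from text and image encoders. One concrete reading is the PyTorch sketch below, which projects pooled image features into the text transformer's token-embedding space so that bidirectional self-attention fuses the modalities. It assumes a Hugging Face-style BERT encoder (accepting `inputs_embeds` and exposing `get_input_embeddings()`) and a CNN trunk returning a spatial feature map; all names, dimensions, and the pooling scheme are illustrative assumptions, not the authors' exact implementation.

          ```python
          import torch
          import torch.nn as nn

          class MultimodalBitransformerSketch(nn.Module):
              """Hypothetical sketch of a supervised multimodal bitransformer.

              Assumes `text_encoder` is a Hugging Face-style BERT model and
              `image_encoder` is a CNN trunk returning a (B, C, H, W) feature
              map. Dimensions and pooling are illustrative, not the authors'
              exact design.
              """

              def __init__(self, text_encoder, image_encoder,
                           img_feat_dim=2048, hidden_dim=768,
                           num_image_tokens=3, num_classes=2):
                  super().__init__()
                  self.text_encoder = text_encoder
                  self.image_encoder = image_encoder
                  self.num_image_tokens = num_image_tokens
                  # Project each pooled image feature into the token-embedding space.
                  self.img_proj = nn.Linear(img_feat_dim, hidden_dim)
                  self.classifier = nn.Linear(hidden_dim, num_classes)

              def forward(self, input_ids, image):
                  # Look up text-token embeddings from the transformer's own table.
                  tok_emb = self.text_encoder.get_input_embeddings()(input_ids)

                  # Pool the CNN feature map into a few vectors, one per image "token".
                  feats = self.image_encoder(image)                   # (B, C, H, W)
                  pooled = nn.functional.adaptive_avg_pool2d(
                      feats, (1, self.num_image_tokens))              # (B, C, 1, N)
                  img_emb = self.img_proj(pooled.flatten(2).transpose(1, 2))

                  # Fuse at the input: image and text embeddings share one sequence,
                  # so the pretrained bidirectional self-attention mixes modalities.
                  # (A fuller model would likely also give the image positions their
                  # own segment embeddings.)
                  fused = torch.cat([img_emb, tok_emb], dim=1)
                  out = self.text_encoder(inputs_embeds=fused).last_hidden_state
                  return self.classifier(out[:, 0])  # classify from the first token
          ```

          Fusing before the encoder, rather than averaging per-modality predictions afterward, lets every attention layer weigh image and text evidence jointly, which is one plausible reason such models can win on test sets built to require genuinely multimodal reasoning.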


          Most cited references (7)


          • CNN Features Off-the-Shelf: An Astounding Baseline for Recognition


          • Learning and Transferring Mid-level Image Representations Using Convolutional Neural Networks


          • Multimodal Machine Learning: A Survey and Taxonomy

            Our experience of the world is multimodal - we see objects, hear sounds, feel texture, smell odors, and taste flavors. Modality refers to the way in which something happens or is experienced and a research problem is characterized as multimodal when it includes multiple such modalities. In order for Artificial Intelligence to make progress in understanding the world around us, it needs to be able to interpret such multimodal signals together. Multimodal machine learning aims to build models that can process and relate information from multiple modalities. It is a vibrant multi-disciplinary field of increasing importance and with extraordinary potential. Instead of focusing on specific multimodal applications, this paper surveys the recent advances in multimodal machine learning itself and presents them in a common taxonomy. We go beyond the typical early and late fusion categorization and identify broader challenges that are faced by multimodal machine learning, namely: representation, translation, alignment, fusion, and co-learning. This new taxonomy will enable researchers to better understand the state of the field and identify directions for future research.
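          The early-versus-late fusion categorization that the survey takes as its starting point is easy to make concrete. The snippet below (toy tensors and heads, purely illustrative stand-ins for real encoder outputs) contrasts the two: early fusion lets one model see all modalities' features at once, while late fusion combines per-modality decisions.

          ```python
          import torch
          import torch.nn as nn

          # Hypothetical stand-ins for per-modality encoder outputs.
          text_vec = torch.randn(8, 128)   # batch of 8 text feature vectors
          image_vec = torch.randn(8, 256)  # batch of 8 image feature vectors

          # Early fusion: concatenate features first, learn one joint classifier.
          early = nn.Sequential(nn.Linear(128 + 256, 64), nn.ReLU(), nn.Linear(64, 2))
          early_logits = early(torch.cat([text_vec, image_vec], dim=-1))

          # Late fusion: separate per-modality classifiers, combined at decision time.
          text_head, image_head = nn.Linear(128, 2), nn.Linear(256, 2)
          late_logits = (text_head(text_vec) + image_head(image_vec)) / 2

          print(early_logits.shape, late_logits.shape)  # torch.Size([8, 2]) for both
          ```

          The bitransformer sketched earlier pushes the early-fusion idea further: the modalities interact at every self-attention layer rather than at a single concatenation point.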

                Author and article information

                Published: 06 September 2019
                arXiv ID: 1909.02950

                License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/

                Custom metadata: Rejected from EMNLP 2019
                Subjects: cs.CL, cs.CV, cs.LG, stat.ML

                Keywords: Computer vision & Pattern recognition, Theoretical computer science, Machine learning, Artificial intelligence
