      Is Open Access

      Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond

      Preprint

          Abstract

          We introduce an architecture to learn joint multilingual sentence representations for 93 languages, belonging to more than 30 different language families and written in 28 different scripts. Our system uses a single BiLSTM encoder with a shared BPE vocabulary for all languages, which is coupled with an auxiliary decoder and trained on publicly available parallel corpora. This enables us to learn a classifier on top of the resulting sentence embeddings using English annotated data only, and transfer it to any of the 93 languages without any modification. Our approach sets a new state-of-the-art on zero-shot cross-lingual natural language inference for all the 14 languages in the XNLI dataset but one. We also achieve very competitive results in cross-lingual document classification (MLDoc dataset). Our sentence embeddings are also strong at parallel corpus mining, establishing a new state-of-the-art in the BUCC shared task for 3 of its 4 language pairs. Finally, we introduce a new test set of aligned sentences in 122 languages based on the Tatoeba corpus, and show that our sentence embeddings obtain strong results in multilingual similarity search even for low-resource languages. Our PyTorch implementation, pre-trained encoder and the multilingual test set will be freely available.
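The multilingual similarity search mentioned in the abstract can be illustrated with a minimal sketch. It assumes cosine similarity as the metric and aligned sentence pairs that share an index, neither of which the abstract specifies. The three-dimensional vectors are hand-made placeholders rather than outputs of the authors' encoder, and the function names `cosine`, `nearest_neighbor`, and `similarity_search_error` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbor(query, candidates):
    """Index of the candidate most similar to the query."""
    return max(range(len(candidates)), key=lambda i: cosine(query, candidates[i]))


def similarity_search_error(src_embs, tgt_embs):
    """Fraction of source sentences whose nearest target sentence
    is not the aligned one (alignment is assumed to be by index)."""
    errors = sum(
        1 for i, q in enumerate(src_embs) if nearest_neighbor(q, tgt_embs) != i
    )
    return errors / len(src_embs)


# Placeholder embeddings (assumption): in practice these would come from a
# shared multilingual encoder applied to sentences in two languages.
src = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
tgt = [[0.9, 0.2, 0.1], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9]]

error_rate = similarity_search_error(src, tgt)
print(f"similarity search error rate: {error_rate:.2f}")
```

Because every language is mapped into the same embedding space, the same search routine applies to any language pair without language-specific changes; only the encoder that produces the vectors matters.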

                Author and article information

                Date: 26 December 2018
                arXiv: 1812.10464
                Record ID: 6568408d-0c4b-4819-bec7-7ac5885c16aa
                License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
                arXiv categories: cs.CL cs.AI cs.LG
                Subjects: Theoretical computer science, Artificial intelligence
