      Is Open Access

      NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages

      Preprint

          Abstract

          Natural language processing (NLP) has a significant impact on society via technologies such as machine translation and search engines. Despite its success, NLP technology is widely available only for high-resource languages such as English and Chinese, and it remains inaccessible to many languages because data resources and benchmarks are unavailable. In this work, we focus on developing resources for languages in Indonesia. Although Indonesia is the second most linguistically diverse country, most of its languages are categorized as endangered and some are even extinct. We develop the first-ever parallel resource for 10 low-resource languages in Indonesia. Our resource includes datasets, a multi-task benchmark, and lexicons, as well as a parallel Indonesian-English dataset. We provide extensive analyses and describe the challenges of creating such resources. We hope that our work can spark NLP research on Indonesian and other underrepresented languages.

          Author and article information

          Journal
          31 May 2022
          Article
          2205.15960

          http://creativecommons.org/licenses/by-sa/4.0/

          History
          Custom metadata
          Preprint
          cs.CL

          Theoretical computer science
