      Is Open Access

      Word-level machine translation for bag-of-words text analysis: Cheap, fast, and surprisingly good

      research-article


          Abstract

          The quality of automated machine translation is rapidly approaching that of professional human translation. However, the best methods remain costly in terms of money, computational resources, and/or time, particularly when applied to large volumes of text. In contrast, word-level translation is both free and fast, simply mapping each word in a source language deterministically to a target language. This paper demonstrates that high-quality word-level translation dictionaries can be generated cheaply and easily, and that they produce translations that can serve reliably as inputs into some of the most common automated text analysis methods. It advances the field on two fronts: it assesses different techniques for creating word-level translation dictionaries, and it systematically compares the similarity of word-level translations against those produced by either state-of-the-art neural machine translation or professional human translation. Comparisons are performed for three common text analysis tasks (sentiment analysis, dictionary-based content analysis, and topic modeling) across a total of eleven different source languages and two target languages (English and French). Across all languages and tasks, word-level dictionaries perform sufficiently well to make them an attractive alternative when resource constraints make neural machine translation inaccessible. The translation dictionaries, as well as the code used to generate and validate them, are available on GitHub.
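
          The word-level approach described in the abstract (each source word mapped deterministically to one target word, with the result fed to a bag-of-words method) can be illustrated with a minimal sketch. This is not the authors' code: the toy German-to-English dictionary `TOY_DICT`, the function name `translate_bag_of_words`, the regex tokenizer, and the choice to pass unknown words through unchanged are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical toy German-to-English dictionary (illustration only);
# a real word-level dictionary would cover the whole corpus vocabulary.
TOY_DICT = {
    "der": "the",
    "hund": "dog",
    "ist": "is",
    "sehr": "very",
    "gut": "good",
}

# Simple word tokenizer; an assumption of this sketch, not the paper's method.
TOKEN_RE = re.compile(r"\w+")


def translate_bag_of_words(text, dictionary):
    """Map each source token to its single target word and count the results.

    Word order is discarded, which is harmless for bag-of-words analysis.
    Tokens missing from the dictionary are kept unchanged (an assumption).
    """
    tokens = TOKEN_RE.findall(text.lower())
    return Counter(dictionary.get(token, token) for token in tokens)


bag = translate_bag_of_words("Der Hund ist sehr, sehr gut.", TOY_DICT)
print(bag)
```

          Because the output is only a table of word counts, it can be passed directly to a sentiment lexicon, a content-analysis dictionary, or a topic model's document-term matrix, which is why losing word order and grammar matters less for these tasks.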


          Author and article information

          Journal
          CCR
          Computational Communication Research
          Amsterdam University Press (Amsterdam)
          2665-9085
          2023, volume 5, issue 2, page 1
          Affiliations
          William & Mary
          Article
          10.5117/CCR2023.2.8.VAND
          © The author(s)

          This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

          Categories
          Article

          machine translation, computational social science, text-as-data, word embeddings
