8
views
0
recommends
+1 Recommend
0 collections
    0
    shares
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

      Preprint
      ,

      Read this article at

      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          This paper describes SentencePiece, a language-independent subword tokenizer and detokenizer designed for Neural-based text processing, including Neural Machine Translation. It provides open-source C++ and Python implementations for subword units. While existing subword segmentation tools assume that the input is pre-tokenized into word sequences, SentencePiece can train subword models directly from raw sentences, which allows us to make a purely end-to-end and language independent system. We perform a validation experiment of NMT on English-Japanese machine translation, and find that it is possible to achieve comparable accuracy to direct subword training from raw sentences. We also compare the performance of subword training and segmentation with various configurations. SentencePiece is available under the Apache 2 license at https://github.com/google/sentencepiece.

          Related collections

          Author and article information

          Journal
          19 August 2018
          Article
          1808.06226
          50b01db4-d4af-466b-be8c-0ccd48958b2b

          http://arxiv.org/licenses/nonexclusive-distrib/1.0/

          History
          Custom metadata
          Accepted as a demo paper at EMNLP2018
          cs.CL

          Theoretical computer science
          Theoretical computer science

          Comments

          Comment on this article