      Open Access

      MorphPiece: Moving away from Statistical Language Representation

      Preprint


          Abstract

          Tokenization is a critical part of modern NLP pipelines. However, contemporary tokenizers for Large Language Models are based on statistical analysis of text corpora, with little consideration of linguistic features. We propose a linguistically motivated tokenization scheme, MorphPiece, which is based partly on morphological segmentation of the underlying text. A GPT-style causal language model trained on this tokenizer (called MorphGPT) shows superior convergence compared to the same architecture trained with a standard BPE tokenizer. Specifically, we obtain language modeling performance comparable to that of a model six times larger. Additionally, we evaluate MorphGPT on a variety of NLP tasks in supervised and unsupervised settings and find superior performance across the board compared to the GPT-2 model.
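          The abstract describes a hybrid scheme: words are segmented morphologically where possible, with a statistical (BPE-style) tokenizer as the fallback. The following minimal Python sketch illustrates that idea only; the lexicon, the `bpe_fallback` stand-in, and all names here are illustrative assumptions, not the paper's actual implementation.

          ```python
          # Illustrative MorphPiece-style tokenizer sketch (hypothetical):
          # words found in a morpheme lexicon take the linguistic path;
          # everything else falls back to a statistical segmenter.

          # Toy morpheme lexicon; the real scheme would use a full
          # morphological segmentation resource.
          MORPHEME_LEXICON = {
              "unhappiness": ["un", "happi", "ness"],
              "walking": ["walk", "ing"],
              "cats": ["cat", "s"],
          }

          def bpe_fallback(word):
              """Stand-in for a trained BPE segmenter: emits character
              pieces with a continuation marker, purely for illustration."""
              return [word[0]] + [f"##{c}" for c in word[1:]]

          def morphpiece_tokenize(text):
              tokens = []
              for word in text.lower().split():
                  if word in MORPHEME_LEXICON:
                      tokens.extend(MORPHEME_LEXICON[word])  # linguistic path
                  else:
                      tokens.extend(bpe_fallback(word))      # statistical path
              return tokens

          print(morphpiece_tokenize("unhappiness cats zebra"))
          # → ['un', 'happi', 'ness', 'cat', 's', 'z', '##e', '##b', '##r', '##a']
          ```

          The design point this sketch captures is that the vocabulary mixes linguistically meaningful morphemes with statistically derived subwords, rather than relying on corpus statistics alone.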


          Author and article information

          Journal
          14 July 2023
          Article
          arXiv:2307.07262
          b3e06975-f87e-4b6b-b887-98caa526bc4c

          License: http://creativecommons.org/licenses/by-nc-sa/4.0/

          History
          Custom metadata
          9 pages (excluding references and appendices), 5 figures
          cs.CL

          Theoretical computer science
