      Open Access

      MorphPiece: Moving away from Statistical Language Representation

      Preprint


          Abstract

          Tokenization is a critical part of modern NLP pipelines. However, contemporary tokenizers for Large Language Models are based on statistical analysis of text corpora, with little consideration of linguistic features. We propose a linguistically motivated tokenization scheme, MorphPiece, which is based partly on morphological segmentation of the underlying text. A GPT-style causal language model trained on this tokenizer (called MorphGPT) shows superior convergence compared to the same architecture trained with a standard BPE tokenizer. Specifically, we obtain language modeling performance comparable to that of a model six times larger. Additionally, we evaluate MorphGPT on a variety of NLP tasks in supervised and unsupervised settings and find superior performance across the board compared to the GPT-2 model.
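          The abstract describes a hybrid scheme: words are segmented morphologically where possible, with a statistical (BPE-style) tokenizer as the fallback. The following minimal Python sketch illustrates that idea only; the lexicon, the `bpe_fallback` stand-in, and all names here are illustrative assumptions, not the paper's actual implementation.

          ```python
          # Illustrative MorphPiece-style tokenizer sketch (hypothetical):
          # words found in a morpheme lexicon take the linguistic path;
          # everything else falls back to a statistical segmenter.

          # Toy morpheme lexicon; the real scheme would use a full
          # morphological segmentation resource.
          MORPHEME_LEXICON = {
              "unhappiness": ["un", "happi", "ness"],
              "walking": ["walk", "ing"],
              "cats": ["cat", "s"],
          }

          def bpe_fallback(word):
              """Stand-in for a trained BPE segmenter: emits character
              pieces with a continuation marker, purely for illustration."""
              return [word[0]] + [f"##{c}" for c in word[1:]]

          def morphpiece_tokenize(text):
              tokens = []
              for word in text.lower().split():
                  if word in MORPHEME_LEXICON:
                      tokens.extend(MORPHEME_LEXICON[word])  # linguistic path
                  else:
                      tokens.extend(bpe_fallback(word))      # statistical path
              return tokens

          print(morphpiece_tokenize("unhappiness cats zebra"))
          # → ['un', 'happi', 'ness', 'cat', 's', 'z', '##e', '##b', '##r', '##a']
          ```

          The design point this sketch captures is that the vocabulary mixes linguistically meaningful morphemes with statistically derived subwords, rather than relying on corpus statistics alone.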


          Author and article information

          Journal
          14 July 2023
          Article
          arXiv:2307.07262
          b3e06975-f87e-4b6b-b887-98caa526bc4c

          License: http://creativecommons.org/licenses/by-nc-sa/4.0/

          History
          Custom metadata
          9 pages (excluding references and appendices), 5 figures
          cs.CL

          Theoretical computer science
