Investigating an approach for low resource language dataset creation, curation and classification: Setswana and Sepedi

Abstract

Recent advances in Natural Language Processing have been a boon for well-represented languages in terms of available curated data and research resources. One of the challenges for low-resourced languages is the lack of clear guidelines on the collection, curation and preparation of datasets for different use-cases. In this work, we create two datasets of news headlines (i.e. short text) for Setswana and Sepedi, and define a news topic classification task on them. We document our work and present baselines for classification. We also investigate a data augmentation approach, better suited to low-resource languages, to improve the performance of the classifiers.

Author and article information

Publication date Created: 18 February 2020

Article

arXiv ID: 2003.04986

SO-VID: 434a0c81-04d0-4095-976e-1bdb15c46063

License: http://creativecommons.org/licenses/by-sa/4.0/

Custom metadata

Comments: Submitted to Resources for African Indigenous Languages (RAIL) at LREC 2020

Categories: cs.CL cs.LG stat.ML

ScienceOpen disciplines: Theoretical computer science, Machine learning, Artificial intelligence
