Is Open Access

Kurdish News Dataset Headlines (KNDH) through multiclass classification

data-paper

Author(s): Soran Badawi^a, Ari M. Saeed^b,*, Sara A. Ahmed^c, Peshraw Ahmed Abdalla^b, Diyari A. Hassan^d

Publication date (Electronic): 13 April 2023

Journal: Data in Brief

Publisher: Elsevier

Keywords: Kurdish text classification, News headline dataset, Natural language processing, Text pre-processing

Abstract

The rapid growth of technology has massively increased the amount of available text data. This data can be mined and used for numerous natural language processing (NLP) tasks, particularly text classification. A core part of text classification is collecting the data needed to train a good predictive model. This paper presents the Kurdish News Dataset Headlines (KNDH) for text classification. The dataset consists of 50,000 news headlines distributed equally among five classes (Social, Sport, Health, Economic, and Technology), with 10,000 headlines per class. The share of headlines contributed by each channel differs, while the number of samples is the same for every category. The headlines were collected from 34 distinct channels: 8 channels for economics, 14 for health, 18 for science, 15 for social, and 5 for sport. Since these per-class counts sum to 60, some channels contribute to more than one class. The dataset is preprocessed using the Kurdish Language Processing Toolkit (KLPT) for tokenizing, spell-checking, stemming, and preprocessing.
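
The balanced five-class layout described in the abstract can be illustrated with a short Python sketch. It is not the authors' code and does not use KLPT: the rows are placeholder strings standing in for real headlines, and the helper names `is_balanced` and `stratified_split` are hypothetical. It only shows how one might confirm the 10,000-per-class balance and keep that balance when holding out a test set.

```python
import random
from collections import Counter, defaultdict

# The five class labels and the per-class size stated in the abstract.
CLASSES = ["Social", "Sport", "Health", "Economic", "Technology"]
PER_CLASS = 10_000


def is_balanced(labels, classes=CLASSES, per_class=PER_CLASS):
    """Return True if exactly the given classes occur, each per_class times."""
    counts = Counter(labels)
    return set(counts) == set(classes) and all(
        counts[c] == per_class for c in classes
    )


def stratified_split(rows, test_fraction=0.2, seed=0):
    """Split (headline, label) rows so every class keeps the same test ratio."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        cut = int(round(len(group) * test_fraction))
        test.extend(group[:cut])
        train.extend(group[cut:])
    return train, test


# Placeholder rows (assumption): the real dataset holds Kurdish headlines.
rows = [
    (f"headline {i} ({label})", label)
    for label in CLASSES
    for i in range(PER_CLASS)
]
train, test = stratified_split(rows)
print(is_balanced([label for _, label in rows]))
print(len(train), len(test))
```

Splitting per class rather than over the whole list means the held-out set keeps the equal class distribution of the full dataset, so per-class accuracy figures stay directly comparable.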

Author and article information

Contributors

Soran Badawi

Journal

Journal ID (nlm-ta): Data Brief

Journal ID (iso-abbrev): Data Brief

Title: Data in Brief

Publisher: Elsevier

ISSN (Electronic): 2352-3409

Publication date PMC-release: 13 April 2023

Publication date Collection: June 2023

Publication date (Electronic): 13 April 2023

Volume: 48

Electronic Location Identifier: 109120

Affiliations

[a] Language Center, Charmo University, KRG, Chamchamal, Kurdistan, Iraq

[b] Computer Science Department, University of Halabja, KRG, Halabja, Kurdistan, Iraq

[c] Department of Computer Science, Komar University of Science and Technology, Sulaymaniyah, Kurdistan Region, Iraq

[d] Faculty of Engineering & Computer Science, Qaiwan International University, Sulaymaniyah, Kurdistan Region, Iraq

Author notes

[*] Corresponding author.

Article

Publisher Item ID: S2352-3409(23)00239-1

Publisher ID: 109120

DOI: 10.1016/j.dib.2023.109120

PMC ID: 10147969

SO-VID: f019e8d1-4ced-47df-8df5-51f85477a3de

Copyright © 2023 The Author(s)

License:

This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

History

Date received: 26 January 2023

Date revision received: 5 March 2023

Date accepted: 29 March 2023

Categories

Subject: Data Article

Keywords: Kurdish text classification, news headline dataset, natural language processing, text pre-processing
