      Is Open Access

      Kurdish News Dataset Headlines (KNDH) through multiclass classification

      data-paper


          Abstract

          The rapid growth of technology has massively increased the amount of available text data. This data can be mined and used for numerous natural language processing (NLP) tasks, particularly text classification. A core step in text classification is collecting data from which a good predictive model can be trained. This paper presents the Kurdish News Dataset Headlines (KNDH) for text classification. The dataset consists of 50,000 news headlines distributed equally among five classes (Social, Sport, Health, Economic, and Technology), with 10,000 headlines per class. The share of headlines drawn from each channel differs, while the number of samples is equal for every category. The headlines were collected from 34 distinct channels, with each class drawing on a subset of them: 8 channels for economics, 14 for health, 18 for science, 15 for social, and 5 for sport. The dataset is processed with the Kurdish Language Processing Toolkit (KLPT), which is used for tokenization, spell-checking, stemming, and preprocessing.
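
          The balanced layout described in the abstract (five classes, 10,000 headlines each) can be checked and preserved when splitting the data for training. The following is a minimal sketch under stated assumptions, not the authors' pipeline: the placeholder rows stand in for the real headlines, and the function names, the 80/20 split, and the random seed are all hypothetical choices made for illustration.

```python
import random
from collections import Counter, defaultdict

# Class names as listed in the abstract.
CLASSES = ["Social", "Sport", "Health", "Economic", "Technology"]


def make_placeholder_rows(per_class):
    """Hypothetical stand-in for the real KNDH file: (headline, label) pairs."""
    return [
        (f"headline {i} ({label})", label)
        for label in CLASSES
        for i in range(per_class)
    ]


def class_counts(rows):
    """Count how many headlines carry each label."""
    return Counter(label for _, label in rows)


def is_balanced(rows):
    """True when every class present has the same number of samples."""
    counts = class_counts(rows)
    return len(set(counts.values())) == 1


def stratified_split(rows, test_fraction=0.2, seed=0):
    """Split per class so train and test keep the original class proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)

    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        cut = int(round(len(group) * test_fraction))
        test.extend(group[:cut])
        train.extend(group[cut:])
    return train, test


# Assumption: 10,000 placeholder rows per class, mirroring the stated sizes.
rows = make_placeholder_rows(per_class=10_000)
train_rows, test_rows = stratified_split(rows)
```

          With real data, `make_placeholder_rows` would be replaced by code that loads the published headlines and labels; splitting per class rather than over the whole pool keeps both partitions balanced, so plain accuracy remains a meaningful metric.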


                Author and article information

                Contributors
                Journal
                Data in Brief (Data Brief)
                Publisher: Elsevier
                ISSN: 2352-3409
                13 April 2023
                June 2023
                Volume: 48
                Article number: 109120
                Affiliations
                [a] Language Center, Charmo University, KRG, Chamchamal, Kurdistan, Iraq
                [b] Computer Science Department, University of Halabja, KRG, Halabja, Kurdistan, Iraq
                [c] Department of Computer Science, Komar University of Science and Technology, Sulaymaniyah, Kurdistan Region, Iraq
                [d] Faculty of Engineering & Computer Science, Qaiwan International University, Sulaymaniyah, Kurdistan Region, Iraq
                Author notes
                [*] Corresponding author.
                Article
                PII: S2352-3409(23)00239-1 (article number 109120)
                DOI: 10.1016/j.dib.2023.109120
                PMC ID: 10147969
                ScienceOpen ID: f019e8d1-4ced-47df-8df5-51f85477a3de
                © 2023 The Author(s)

                This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

                History
                Received: 26 January 2023
                Revised: 5 March 2023
                Accepted: 29 March 2023
                Categories
                Data Article

                Kurdish text classification, news headline dataset, natural language processing, text pre-processing
