      Improved semi-supervised learning technique for automatic detection of South African abusive language on Twitter

      research-article


          Abstract

          Semi-supervised learning is a potential solution for improving training data in low-resourced abusive language detection contexts such as South African abusive language detection on Twitter. However, existing semi-supervised learning methods have been skewed towards small amounts of labelled data with a small feature space. This paper therefore presents a semi-supervised learning technique that improves the distribution of training data by assigning labels to unlabelled data based on majority voting over different feature sets of labelled and unlabelled data clusters. The technique is applied to South African English corpora consisting of labelled and unlabelled abusive tweets. The proposed technique is compared with state-of-the-art self-learning and active learning techniques based on syntactic and semantic features. The performance of these techniques with Logistic Regression, Support Vector Machine and Neural Network classifiers is evaluated. The proposed technique, with accuracy and F1-score of 0.97 and 0.95, respectively, outperforms existing semi-supervised learning techniques. The learning curves show that the proposed technique used the training data more efficiently than existing techniques. Overall, n-gram syntactic features with a Logistic Regression classifier record the highest performance. The paper concludes that the proposed semi-supervised learning technique effectively detected implicit and explicit South African abusive language on Twitter.

          CATEGORIES: • Computing ~ Natural language processing and sentiment analysis • Computing methodologies ~ Text classification and information extraction
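
          The labelling idea in the abstract, majority voting over several feature sets, can be illustrated with a minimal sketch. This is not the authors' implementation: the three feature sets, the toy tweets, and the function names are all hypothetical, and the paper's clustering step is replaced here by a simpler nearest-class-centroid assignment (cosine similarity) purely to keep the example short and dependency-free.

```python
from collections import Counter
from math import sqrt


def word_ngrams(text, n):
    """Bag of word n-grams for one text."""
    tokens = text.lower().split()
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def char_ngrams(text, n=3):
    """Bag of character n-grams for one text."""
    s = text.lower()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


# Hypothetical feature sets; the paper's actual syntactic and semantic
# features are not specified in the abstract.
FEATURE_SETS = {
    "word_unigrams": lambda t: word_ngrams(t, 1),
    "word_bigrams": lambda t: word_ngrams(t, 2),
    "char_trigrams": char_ngrams,
}


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def class_centroids(labelled, extract):
    """Sum the feature counts of the labelled texts per class (stand-in for clusters)."""
    centroids = {}
    for text, label in labelled:
        centroids.setdefault(label, Counter()).update(extract(text))
    return centroids


def pseudo_label(labelled, unlabelled):
    """Each feature set casts one vote per unlabelled text; the majority label wins."""
    votes = [[] for _ in unlabelled]
    for extract in FEATURE_SETS.values():
        centroids = class_centroids(labelled, extract)
        for i, text in enumerate(unlabelled):
            feats = extract(text)
            votes[i].append(max(centroids, key=lambda c: cosine(feats, centroids[c])))
    return [Counter(v).most_common(1)[0][0] for v in votes]


# Toy data, invented for illustration only.
labelled = [
    ("you are a total idiot", "abusive"),
    ("shut up you stupid fool", "abusive"),
    ("have a lovely day everyone", "clean"),
    ("thanks for the kind help", "clean"),
]
unlabelled = ["you stupid idiot", "what a lovely kind day"]

print(pseudo_label(labelled, unlabelled))  # ['abusive', 'clean']
```

          Using an odd number of feature sets with two classes avoids tied votes; the newly labelled texts could then be added to the training data before fitting a classifier such as Logistic Regression.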

                Author and article information

                Journal
                South African Computer Journal (SACJ)
                South African Institute of Computer Scientists and Information Technologists (SAICSIT) (Grahamstown, Eastern Cape, South Africa)
                ISSN: 1015-7999, 2313-7835
                December 2020
                Volume 32, Issue 2, Pages 56-79
                Affiliations
                [01] Department of Computer Science and Informatics, University of the Free State, South Africa. OriolaO@ufs.ac.za
                Article
                S2313-78352020000200005 S2313-7835(20)03200200005
                DOI: 10.18489/sacj.v32i2.847
                7ea600e1-5347-42ad-bc3a-6c1cfafef517

                This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

                History
                Received: 30 May 2020
                Accepted: 23 October 2020
                Page count
                Figures: 0, Tables: 0, Equations: 0, References: 13, Pages: 24
                Product

                SciELO South Africa

                Categories
                Research Papers (General)

                machine learning, semi-supervised learning, abusive language, South Africa, Twitter
