      Is Open Access

      The Power of Communities: A Text Classification Model with Automated Labeling Process Using Network Community Detection

      Preprint

          Abstract

          Text classification is one of the most critical areas in machine learning and artificial intelligence research. It is widely used in business applications such as conversational intelligence systems, news article categorization, sentiment analysis, emotion detection systems, and the many recommendation systems used in daily life. One problem with supervised text classification models is that their performance depends heavily on the quality of the data labeling, which is typically done by humans. In this study, we propose a new network community detection-based approach to automatically label and classify text data into multiclass value spaces. Specifically, we build a network with sentences as the nodes and pairwise cosine similarities between the TF-IDF vector representations of the sentences as the link weights. We use the Louvain method to detect the communities in the sentence network. We train and test support vector machine and random forest models on both the human-labeled data and the data labeled by network community detection. Results showed that models trained on the data labeled by network community detection outperformed the models trained on the human-labeled data by 2.68-3.75% in classification accuracy. Our method may help the development of more accurate conversational intelligence systems and other text classification systems.
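The network construction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the whitespace tokenizer, the smoothed IDF formula, the choice to drop zero-similarity pairs, and the function names (`tfidf_vectors`, `cosine_similarity`, `build_sentence_network`) are all assumptions made for illustration, and the toy sentences are hypothetical. The sketch stops at the weighted sentence network; Louvain community detection and classifier training are left out and would normally be handled by a graph library and a machine learning library.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(sentences):
    """Return one sparse TF-IDF vector (dict term -> weight) per sentence."""
    # Assumption: lowercase whitespace tokenization.
    tokenized = [s.lower().split() for s in sentences]
    n_docs = len(tokenized)
    # Document frequency: number of sentences containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Assumption: smoothed IDF, log((1 + N) / (1 + df)) + 1.
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_sentence_network(sentences):
    """Return weighted links {(i, j): similarity} between sentence indices."""
    vectors = tfidf_vectors(sentences)
    edges = {}
    for i, j in combinations(range(len(sentences)), 2):
        sim = cosine_similarity(vectors[i], vectors[j])
        # Assumption: pairs with zero similarity get no link.
        if sim > 0:
            edges[(i, j)] = sim
    return edges


# Hypothetical toy sentences, not data from the study.
sentences = [
    "book a flight to tokyo",
    "book a flight to paris",
    "play some jazz music",
    "play some rock music",
]
edges = build_sentence_network(sentences)
for (i, j), weight in sorted(edges.items()):
    print(i, j, round(weight, 3))
```

On this toy input only the two travel sentences and the two music sentences share terms, so the network has two disconnected links. A community detection method applied to such a network would assign each connected group its own label, which is the role the abstract gives to the Louvain method.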


                Author and article information

                Date: 25 September 2019
                arXiv: 1909.11706
                License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
                Custom metadata: 14 pages, 6 figures, 1 table. Submitted for NetSci-X 2020 Tokyo
                Categories: cs.CL cs.IR cs.SI
                Subjects: Social & Information networks, Theoretical computer science, Information & Library science
