      Is Open Access

      Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric

      research-article
      PLoS ONE
      Public Library of Science


          Abstract

          Data imbalance is frequently encountered in biomedical applications. Resampling techniques can be used in binary classification to tackle this issue. However, such solutions are undesirable when the number of samples in the small class is limited. Moreover, the use of inadequate performance metrics, such as accuracy, leads to poor generalization because classifiers tend to predict the largest class. One good approach to this issue is to optimize performance metrics that are designed to handle data imbalance. The Matthews Correlation Coefficient (MCC) is widely used in bioinformatics as a performance metric. We are interested in developing a new classifier based on the MCC metric to handle imbalanced data. We derive an optimal Bayes classifier for the MCC metric using an approach based on the Fréchet derivative. We show that the proposed algorithm has the theoretical property of consistency. Using simulated data, we verify the correctness of our optimality result by searching the space of all possible binary classifiers. The proposed classifier is evaluated on 64 datasets covering a wide range of data imbalance. We compare both classification performance and CPU efficiency for three classifiers: 1) the proposed algorithm (MCC-classifier), 2) the Bayes classifier with a default threshold (MCC-base), and 3) imbalanced SVM (SVM-imba). The experimental evaluation shows that MCC-classifier performs close to SVM-imba while being simpler and more efficient.
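
The abstract's central point, that accuracy rewards predicting the majority class while MCC does not, can be illustrated with a small sketch. This is not the authors' MCC-classifier: the brute-force threshold search below is only a stand-in for their derived optimal rule, and the toy labels, scores, and function names (`mcc`, `confusion`, `accuracy`, `best_mcc_threshold`) are hypothetical choices made for illustration.

```python
import math

def confusion(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def mcc(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0 when a row or column of the table is empty.
    return (tp * tn - fp * fn) / denom if denom else 0.0

def accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

def best_mcc_threshold(y_true, scores):
    """Brute-force stand-in: try each observed score as a threshold."""
    best_t, best_m = None, -1.0
    for t in sorted(set(scores)):
        y_pred = [1 if s >= t else 0 for s in scores]
        m = mcc(*confusion(y_true, y_pred))
        if m > best_m:
            best_t, best_m = t, m
    return best_t, best_m

# Hypothetical imbalanced toy data: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
scores = [0.05] * 90 + [0.30] * 5 + [0.20, 0.35, 0.40, 0.45, 0.60]

# Always predicting the majority class looks good on accuracy only.
all_negative = [0] * 100
acc_majority = accuracy(y_true, all_negative)
mcc_majority = mcc(*confusion(y_true, all_negative))

# Default 0.5 threshold versus the MCC-maximizing threshold.
default_pred = [1 if s >= 0.5 else 0 for s in scores]
mcc_default = mcc(*confusion(y_true, default_pred))
threshold, mcc_tuned = best_mcc_threshold(y_true, scores)

print(f"majority class: accuracy={acc_majority:.2f}, MCC={mcc_majority:.2f}")
print(f"threshold 0.50: MCC={mcc_default:.3f}")
print(f"threshold {threshold:.2f}: MCC={mcc_tuned:.3f}")
```

On this toy data the majority-class predictor scores high accuracy but zero MCC, and moving the decision threshold away from 0.5 raises MCC. That is the general motivation for a metric-aware decision rule; the paper's own derivation of the optimal rule is not reproduced here.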

                Author and article information

                Contributors
                Role: Editor
                Journal
                PLoS ONE
                Public Library of Science (San Francisco, CA, USA)
                ISSN: 1932-6203
                Published: 2 June 2017
                Volume 12, Issue 6, e0177678
                Affiliations
                [1] Systems Biology Department, Sidra Medical and Research Center, Doha, Qatar
                [2] Laboratoire Cedric, CNAM, Paris, France
                [3] Clinical Research Center, Sidra Medical and Research Center, Doha, Qatar
                Tianjin University, China
                Author notes

                Competing Interests: The authors have declared that no competing interests exist.

                • Conceptualization: SB FJ ME.

                • Data curation: SB.

                • Formal analysis: SB FJ.

                • Investigation: SB FJ.

                • Methodology: SB FJ ME.

                • Software: SB.

                • Validation: SB.

                • Visualization: SB.

                • Writing – original draft: SB FJ ME.

                • Writing – review & editing: SB FJ.

                Article
                Manuscript ID: PONE-D-17-00175
                DOI: 10.1371/journal.pone.0177678
                PMCID: PMC5456046
                PMID: 28574989
                84c3feeb-932b-4c8d-8f36-18b8588de233
                © 2017 Boughorbel et al.

                This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

                History
                Received: 3 January 2017
                Accepted: 30 April 2017
                Page count
                Figures: 5, Tables: 8, Pages: 17
                Funding
                This work was supported by Qatar Foundation.
                Categories
                Research Article
                Computer and Information Sciences > Artificial Intelligence > Machine Learning > Support Vector Machines
                Physical Sciences > Mathematics > Applied Mathematics > Algorithms
                Research and Analysis Methods > Simulation and Modeling > Algorithms
                Physical Sciences > Mathematics > Applied Mathematics > Algorithms > Machine Learning Algorithms
                Research and Analysis Methods > Simulation and Modeling > Algorithms > Machine Learning Algorithms
                Computer and Information Sciences > Artificial Intelligence > Machine Learning > Machine Learning Algorithms
                Physical Sciences > Mathematics > Operator Theory > Kernel Functions
                Physical Sciences > Mathematics > Probability Theory > Statistical Distributions > Distribution Curves
                Research and Analysis Methods > Database and Informatics Methods > Bioinformatics
                Research and Analysis Methods > Simulation and Modeling
                Biology and Life Sciences > Neuroscience > Cognitive Science > Cognitive Psychology > Learning > Learning Curves
                Biology and Life Sciences > Psychology > Cognitive Psychology > Learning > Learning Curves
                Social Sciences > Psychology > Cognitive Psychology > Learning > Learning Curves
                Biology and Life Sciences > Neuroscience > Learning and Memory > Learning > Learning Curves
                Custom metadata
                The data used in this work are publicly available and are gathered in the following repository: https://github.com/bsabri/mcc_classifier/.
