
      Feature Selection for High-Dimensional and Imbalanced Biomedical Data Based on Robust Correlation Based Redundancy and Binary Grasshopper Optimization Algorithm




          Training a machine learning algorithm on an imbalanced data set is an inherently challenging task. It becomes even more demanding when samples are limited but features are numerous (high dimensionality). High-dimensional, imbalanced data sets pose severe challenges in many real-world applications, such as biomedical data sets. Numerous researchers have investigated either imbalanced classes or high-dimensional data sets and proposed various methods. Nonetheless, few approaches reported in the literature address the intersection of the high-dimensional and imbalanced-class problems, owing to their complicated interactions. Lately, feature selection has become a well-known technique used to overcome this problem by selecting discriminative features that represent both the minority and majority classes. This paper proposes a new method called Robust Correlation Based Redundancy and Binary Grasshopper Optimization Algorithm (rCBR-BGOA); rCBR-BGOA employs an ensemble of multi-filters coupled with the Correlation-Based Redundancy method to select optimal feature subsets. A binary Grasshopper optimisation algorithm (BGOA) is used to cast the feature selection process as an optimisation problem and to select the best (near-optimal) combination of features from the majority and minority classes. The obtained results, supported by proper statistical analysis, indicate that rCBR-BGOA can improve classification performance for high-dimensional and imbalanced data sets in terms of the G-mean and Area Under the Curve (AUC) performance metrics.
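For illustration, the G-mean metric used in the paper is the geometric mean of sensitivity (minority-class recall) and specificity (majority-class recall). The sketch below is not code from the paper; the toy labels and the 8-to-2 class ratio are invented for the example.

```python
import math

def g_mean(y_true, y_pred, minority=1):
    """Geometric mean of sensitivity and specificity, the metric the
    paper uses to assess classifiers on imbalanced data."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == minority and p == minority)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != minority and p != minority)
    pos = sum(1 for t in y_true if t == minority)
    neg = len(y_true) - pos
    sensitivity = tp / pos if pos else 0.0
    specificity = tn / neg if neg else 0.0
    return math.sqrt(sensitivity * specificity)

# Toy imbalanced set: 8 majority samples, 2 minority samples.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 7 + [1] + [1, 0]   # one false positive, one missed minority
print(round(g_mean(y_true, y_pred), 3))  # sqrt(0.5 * 0.875) ≈ 0.661
```

Because G-mean multiplies the two per-class recalls, a classifier that ignores the minority class scores zero even if its raw accuracy is high, which is why it is preferred over accuracy here.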


          Most cited references (76)


          A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis.

          Cancer diagnosis is one of the most important emerging clinical applications of gene expression microarray technology. We are seeking to develop a computer system for powerful and reliable cancer diagnostic model creation based on microarray data. To keep a realistic perspective on clinical applications we focus on multicategory diagnosis. To equip the system with the optimum combination of classifier, gene selection and cross-validation methods, we performed a systematic and comprehensive evaluation of several major algorithms for multicategory classification, several gene selection methods, multiple ensemble classifier methods and two cross-validation designs, using 11 datasets spanning 74 diagnostic categories, 41 cancer types and 12 normal tissue types. Multicategory support vector machines (MC-SVMs) are the most effective classifiers in performing accurate cancer diagnosis from gene expression data. The MC-SVM techniques by Crammer and Singer, Weston and Watkins, and one-versus-rest were found to be the best methods in this domain. MC-SVMs outperform other popular machine learning algorithms, such as k-nearest neighbors, backpropagation and probabilistic neural networks, often to a remarkable degree. Gene selection techniques can significantly improve the classification performance of both MC-SVMs and other non-SVM learning algorithms. Ensemble classifiers do not generally improve performance of the best non-ensemble models. These results guided the construction of a software system, GEMS (Gene Expression Model Selector), that automates high-quality model construction and enforces sound optimization and performance estimation procedures. This is the first such system to be informed by a rigorous comparative analysis of the available algorithms and datasets. GEMS is available for download from http://www.gems-system.org for non-commercial use.
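The one-versus-rest decomposition the survey highlights trains one binary classifier per class and predicts the class whose classifier is most confident. A minimal sketch follows; the three per-class scorers are hypothetical stand-ins for trained binary classifiers (e.g. SVM decision functions), and the class names and centres are invented for the example.

```python
def ovr_predict(x, scorers):
    """One-versus-rest aggregation: return the class whose binary
    scorer assigns the highest confidence to sample x."""
    return max(scorers, key=lambda cls: scorers[cls](x))

# Toy 1-D "expression level" example: each stand-in scorer is most
# confident near its own class centre.
centres = {"tumor_A": 0.0, "tumor_B": 5.0, "normal": 10.0}
scorers = {cls: (lambda x, c=c: -abs(x - c)) for cls, c in centres.items()}

print(ovr_predict(4.2, scorers))  # nearest centre is tumor_B
```

Replacing the toy scorers with real decision functions (one per diagnostic category) yields the multicategory scheme described above without requiring any inherently multiclass learner.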

            Grasshopper Optimisation Algorithm: Theory and application


              Solving Multiclass Learning Problems via Error-Correcting Output Codes

              Multiclass learning problems involve finding a definition for an unknown function f(x) whose range is a discrete set containing k > 2 values (i.e., k "classes"). The definition is acquired by studying collections of training examples of the form [x_i, f(x_i)]. Existing approaches to multiclass learning problems include direct application of multiclass algorithms such as the decision-tree algorithms C4.5 and CART, application of binary concept learning algorithms to learn individual binary functions for each of the k classes, and application of binary concept learning algorithms with distributed output representations. This paper compares these three approaches to a new technique in which error-correcting codes are employed as a distributed output representation. We show that these output representations improve the generalization performance of both C4.5 and backpropagation on a wide range of multiclass learning tasks. We also demonstrate that this approach is robust with respect to changes in the size of the training sample, the assignment of distributed representations to particular classes, and the application of overfitting avoidance techniques such as decision-tree pruning. Finally, we show that, like the other methods, the error-correcting code technique can provide reliable class probability estimates. Taken together, these results demonstrate that error-correcting output codes provide a general-purpose method for improving the performance of inductive learning programs on multiclass problems.
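The error-correcting output code idea summarized above can be sketched in a few lines: each class is assigned a binary codeword, one binary learner predicts each bit, and the predicted bit vector is decoded to the nearest codeword. The 5-bit codebook below is illustrative, not taken from the paper; its minimum Hamming distance of 3 lets it correct any single mispredicted bit.

```python
# Hypothetical 5-bit codewords for three classes (minimum pairwise
# Hamming distance 3, so one bit error is correctable).
CODEBOOK = {
    "A": (1, 1, 1, 1, 1),
    "B": (0, 0, 0, 1, 1),
    "C": (1, 0, 1, 0, 0),
}

def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return sum(x != y for x, y in zip(a, b))

def ecoc_decode(bits):
    """Assign the class whose codeword is nearest in Hamming distance."""
    return min(CODEBOOK, key=lambda cls: hamming(CODEBOOK[cls], bits))

# One binary learner misfires (last bit of B's codeword is flipped),
# yet decoding still recovers class B.
print(ecoc_decode((0, 0, 0, 1, 0)))  # prints "B"
```

In practice each bit position is a separate binary classifier trained on a relabeled copy of the data, and longer codewords buy more error correction at the cost of more learners.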

                Author and article information

                Genes (Basel)
                27 June 2020 (July 2020 issue)
                Volume 11, Issue 7
                [1 ]Department of Computer Sciences, Yusuf Maitama Sule University, 700222 Kofar Nassarawa, Kano, Nigeria
                [2 ]School of Computer Sciences, Universiti Sains Malaysia, 11800 Gelugor, Malaysia; zuri@usm.my
                Author notes
                [* ]Correspondence: abdulrauf770@student.usm.my ; Tel.: +60-111-317-0481 or +60-194-004-327
                © 2020 by the authors.

                Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license ( http://creativecommons.org/licenses/by/4.0/).


