Feature selection algorithm for high dimensional biomedical data classification based on redundancy removal

High dimensional biomedical data contain thousands of features, and accurate identification of the main features in such data can be used to classify related data. However, a large number of irrelevant or redundant features usually impairs classification accuracy seriously. To solve this problem, a new feature selection algorithm based on redundancy removal is proposed in this study. First, two redundancy criteria are determined from vertical relevance and horizontal relevance. Second, an approximate redundancy feature framework based on mutual information (MI) is defined to remove redundant and irrelevant features. Finally, to evaluate the effectiveness of the proposed method, comparison experiments against classic feature selection algorithms are conducted using K-nearest neighbour (KNN) classifiers, and the results show that our algorithm can effectively improve classification accuracy.


INTRODUCTION
High dimensional data analysis (Tamaresis et al., 2014) is a very active research area, especially for cancer data (Lee et al., 2013) and mental illness data (Jiang et al., 2017; Li et al., 2017). High dimensional data usually contain many weakly relevant or irrelevant features. If all features are treated equally, the computational complexity and the accuracy of prediction can be seriously affected. Therefore, feature selection is considered an essential procedure in high dimensional data processing.
Feature selection (Saeys et al., 2007) refers to selecting relevant features while removing irrelevant and redundant ones. As an important part of knowledge discovery technology, feature selection can effectively improve the computing speed of the subsequent prediction algorithm, enhance the compactness of the prediction model, and increase the generalization ability of the corresponding model. Additionally, a major purpose of feature selection for high dimensional data is to overcome the curse of dimensionality (Li et al., 2016; Zhang et al., 2018).
In general, the process of feature selection consists of two main components (Mafarja et al., 2018): a search strategy and an evaluation criterion. Evaluation criteria can be categorized into wrapper methods and filter methods. The wrapper method (Chrysostomou et al., 2017) evaluates the superiority or inferiority of a candidate feature subset while keeping the classification algorithm unchanged, and the corresponding classification accuracy is adopted as the index for selecting the optimal feature subset. The feature selection process must be executed again whenever the classification algorithm is changed, so the complexity is high, especially for high dimensional data. In the filter method (Hancer et al., 2018; Lei et al., 2018), the search of the feature space depends on the intrinsic correlation of the data itself rather than on the classification algorithm. The filter method is increasingly attractive because of its simplicity and speed, and is therefore more widely applied than the wrapper method.
Based on the above discussion, a filter feature selection method is proposed in this paper. First, four kinds of boundary extremes are analyzed, and two redundancy criteria are proposed. Then, to quantify the redundancy criteria, the core module based on mutual information (MI) (Estevez et al., 2009) is proposed: the definition of the approximate redundancy feature. Finally, the experiments are presented.
The remainder of this article is organized as follows. Section 2 provides basic concepts related to this research. A feature selection algorithm based on redundancy removal is proposed in Section 3. Section 4 describes our experimental design and results. Finally, Section 5 concludes the work of this study.

BASIC CONCEPTS
In order to facilitate follow-up research, some basic concepts (John et al., 1994) used in this study are listed as follows.

(i) Strong relevance: F_i is a strongly relevant feature iff P(C | F_i, S_i) ≠ P(C | S_i), where S_i = F \ {F_i} denotes the set of all features except F_i.

(ii) Weak relevance: F_i is a weakly relevant feature iff it is not strongly relevant and there exists a subset S'_i ⊂ S_i such that P(C | F_i, S'_i) ≠ P(C | S'_i).

(iii) Irrelevance: F_i is an irrelevant feature iff it is neither strongly nor weakly relevant, i.e. P(C | F_i, S'_i) = P(C | S'_i) for all S'_i ⊆ S_i, where P is a probability measure.
Strong relevance shows that a feature is very important for classification accuracy, so it cannot be removed arbitrarily. Weak relevance indicates that a feature can sometimes contribute to improving prediction accuracy. Irrelevance indicates that a feature is useless for improving classification accuracy, so it can be deleted directly.
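To make these three categories concrete, the following Python sketch (our own illustration, not the authors' code; all variable names are hypothetical) builds a toy dataset with two duplicated informative features and one noise feature, and scores each against the class with mutual information.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
n = 5000

f1 = rng.integers(0, 2, n)   # informative: determines the class below
f2 = f1.copy()               # exact duplicate of f1
f3 = rng.integers(0, 2, n)   # pure noise, independent of the class
c = f1                       # class attribute C

for name, f in [("f1", f1), ("f2", f2), ("f3", f3)]:
    print(name, mutual_info_score(c, f))
# f1 and f2 each carry full information about C (MI ≈ log 2 ≈ 0.69 nats),
# but because f2 duplicates f1, each is only *weakly* relevant given the
# other: P(C | f1, f2) = P(C | f2). f3 has MI ≈ 0 and is irrelevant.
```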

Redundancy criterion
A redundancy criterion based on correlation is proposed to lay the foundation for further feature selection. Based on the three basic concepts in Section 2, the redundancy of a feature F_i is analyzed under four extreme-value combinations of R_{i,c} (the relevance between a feature F_i and the class attribute C) and R_{i,j} (the relevance between a pair of features F_i and F_j, i ≠ j). The four extreme cases are shown in Table 1, from which it is easy to draw the following conclusions:

Conclusion 1: R_{i,c} is large, which means that F_i contains much information about C, and R_{i,j} is also large, which means that the correlation between F_i and F_j is strong. In this case it is difficult to determine whether the feature F_i is redundant.
Conclusion 2: R_{i,j} is small, which means that the correlation between F_i and F_j is weak. Hence F_j cannot replace F_i; in other words, regardless of the size of R_{i,c}, the feature F_i is not redundant.

Conclusion 3: R_{i,c} is small, which means that F_i contains little information about C, while R_{i,j} is large, which means that the correlation between F_i and F_j is strong. In this case, the feature F_i is redundant with high probability, and this probability increases as R_{i,j} increases.

Conclusion 4: R_{i,j} is small, which means that the correlation between F_i and F_j is weak. Consistent with Conclusion 2, regardless of the size of R_{i,c}, the feature F_i is not redundant.
Based on the above four conclusions, two redundancy criteria can be obtained:

Criterion 1: when R_{i,j} is large, whether F_i is redundant is uncertain.

Criterion 2: when R_{i,j} is small, regardless of the size of R_{i,c}, the feature F_i is not redundant.
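As a minimal sketch of these two criteria (our own illustration, with hypothetical cut-offs for "small" and "large"), the decision over a pair (R_{i,c}, R_{i,j}) can be written as:

```python
def redundancy_status(r_ic: float, r_ij: float,
                      small: float = 0.3, large: float = 0.7) -> str:
    """Apply the two redundancy criteria to one (R_ic, R_ij) pair.

    `small` and `large` are hypothetical cut-offs, not values from the paper.
    """
    if r_ij <= small:                 # Criterion 2: weak F_i-F_j correlation,
        return "not redundant"        # regardless of the size of R_ic
    if r_ij >= large:                 # Criterion 1: strong F_i-F_j correlation,
        return "possibly redundant"   # resolved by the approximate-redundancy test
    return "undetermined"

print(redundancy_status(0.9, 0.1))    # -> not redundant
print(redundancy_status(0.2, 0.9))    # -> possibly redundant
```

Cases left "possibly redundant" or "undetermined" are handled by the approximate redundancy test defined in the next subsection.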

Approximate redundancy feature
Assume that R_{i,c} of feature F_i is very close to R_max (the maximum value of R_{i,c}); this indicates that F_i contains a lot of information about the class attribute C. In this condition, only if the value of R_{i,j} is large enough can F_i be considered an approximate redundancy feature; otherwise it cannot be considered redundant. The reason is that F_i plays an important role in improving classification accuracy and cannot be removed lightly as redundant. By contrast, assume that R_{i,c} of feature F_i is not very close to R_max; this indicates that F_i contains relatively less information about C. In this condition, as long as the value of R_{i,j} is relatively large, F_i is considered an approximate redundancy feature, because F_i does not play a main role in improving classification accuracy. Beyond these two conditions, F_i is removed as an approximate irrelevance feature when the difference between R_{i,c} and R_max is quite large. Based on the above analysis and discussion, the approximate redundancy feature is formally described in Definition 1.

Definition 1 (approximate redundancy feature): Let F_i and F_j be any pair of correlated features with R_{j,c} ≥ R_{i,c}.

(i) If R_max − R_{i,c} ≤ δ (R_{i,c} is very close to R_max, with δ > 0 a small tolerance), F_i is an approximate redundancy feature iff R_{i,j} ≥ θ_1, where θ_1 is a threshold close to the maximum correlation.

(ii) If R̄ ≤ R_{i,c} < R_max − δ, F_i is an approximate redundancy feature iff R_{i,j} ≥ θ_2, where θ_2 < θ_1.

Here R̄ is the mean value of R_{i,c} over all N features, that is, R̄ = (1/N) Σ_k R_{k,c}. In addition, Definition 1 shows that F_j can be approximated as an alternative for F_i.
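The sketch below shows one possible greedy selection pass built on Definition 1. It is our own reading rather than the authors' code: relevances are estimated with normalized mutual information, and eps, theta1, and theta2 are hypothetical tuning parameters standing in for δ, θ_1, and θ_2.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score as nmi

def select_features(X, y, eps=0.05, theta1=0.9, theta2=0.7):
    """Greedy redundancy removal in the spirit of Definition 1.

    X: (n_samples, n_features) array of discrete features; y: class labels.
    eps, theta1, and theta2 are hypothetical tuning parameters.
    """
    n_feat = X.shape[1]
    # R_{i,c}: normalized MI between each feature and the class, in [0, 1]
    r_c = np.array([nmi(y, X[:, i]) for i in range(n_feat)])
    r_max, r_mean = r_c.max(), r_c.mean()

    # drop approximately irrelevant features (R_{i,c} far below R_max)
    candidates = sorted((i for i in range(n_feat) if r_c[i] >= r_mean),
                        key=lambda i: -r_c[i])

    selected = []
    for i in candidates:
        # features near R_max need the stricter threshold theta1 to be dropped
        th = theta1 if (r_max - r_c[i]) <= eps else theta2
        # F_i is approximately redundant if some already-kept F_j (whose
        # R_{j,c} >= R_{i,c} by the descending sort) correlates strongly with it
        if all(nmi(X[:, i], X[:, j]) < th for j in selected):
            selected.append(i)
    return selected
```

Because the kept set is scanned in descending R_{i,c} order, the condition R_{j,c} ≥ R_{i,c} of Definition 1 holds automatically for every comparison.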

Correlation calculation
A nonlinear correlation measure based on MI is applied in this paper, because high dimensional data in the real world usually exhibit nonlinear relationships. The correlation between any pair of variables (X, Y) is calculated as the information gain in formula (6):

IG(X;Y) = H(X) + H(Y) − H(X,Y)  (6)

where the entropy H(X) and the joint entropy H(X,Y) are calculated on the basis of formulas (7) and (8):

H(X) = −Σ_x p(x) log p(x)  (7)

H(X,Y) = −Σ_x Σ_y p(x,y) log p(x,y)  (8)

To unify the scale of the data and to reduce the effect of extreme values, each IG(X;Y) is normalized to the range [0, 1] using formula (9):

NI(X;Y) = 2 · IG(X;Y) / (H(X) + H(Y))  (9)
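The quantities in formulas (6)–(9) can be estimated directly from empirical frequencies. The sketch below (our own illustration for discrete variables, not the authors' code) implements them; the symmetric normalization in normalized_ig is one common way to map IG(X;Y) into [0, 1].

```python
import numpy as np
from collections import Counter

def entropy(x):
    """H(X) = -sum p(x) log2 p(x), estimated from frequencies (formula (7))."""
    n = len(x)
    return -sum((c / n) * np.log2(c / n) for c in Counter(x).values())

def joint_entropy(x, y):
    """H(X,Y) = -sum p(x,y) log2 p(x,y) (formula (8))."""
    n = len(x)
    return -sum((c / n) * np.log2(c / n) for c in Counter(zip(x, y)).values())

def information_gain(x, y):
    """IG(X;Y) = H(X) + H(Y) - H(X,Y) (formula (6))."""
    return entropy(x) + entropy(y) - joint_entropy(x, y)

def normalized_ig(x, y):
    """Normalize IG(X;Y) to [0, 1] via symmetric uncertainty (formula (9))."""
    hx, hy = entropy(x), entropy(y)
    return 0.0 if hx + hy == 0 else 2 * information_gain(x, y) / (hx + hy)

x = [0, 0, 1, 1, 0, 1]
y = [0, 0, 1, 1, 1, 1]
print(information_gain(x, y), normalized_ig(x, y))
```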

Performance evaluation
In this paper, classification accuracy and the number of selected features are the two indicators used to design the performance evaluation function (Hu et al., 2016; Chuang et al., 2008), shown in formula (10):

Eval = w_1 · Acc + w_2 · (N − n) / N  (10)

where w_1 and w_2 are predefined weight coefficients that adjust the relative importance of the two indicators (set to 0.999 and 0.001, respectively, in this study), n is the number of selected features, and N is the total number of features. Acc is the classification accuracy defined in formula (11):

Acc = C_num / (C_num + I_num)  (11)

where C_num and I_num are the numbers of correctly and incorrectly classified samples, respectively.
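Formulas (10) and (11) combine into a one-line score; a minimal sketch with the paper's weights w_1 = 0.999 and w_2 = 0.001 (the function names are ours):

```python
def accuracy(c_num: int, i_num: int) -> float:
    """Acc = C_num / (C_num + I_num), formula (11)."""
    return c_num / (c_num + i_num)

def evaluate(c_num: int, i_num: int, n_selected: int, n_total: int,
             w1: float = 0.999, w2: float = 0.001) -> float:
    """Performance evaluation, formula (10): reward high accuracy
    and a small number of selected features."""
    return w1 * accuracy(c_num, i_num) + w2 * (n_total - n_selected) / n_total

# e.g. 92 of 100 test samples correct using 25 of 2000 features:
print(evaluate(92, 8, 25, 2000))
```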

Data description
Five well-known biomedical datasets (Table 2) were used to evaluate the performance of our proposed algorithm. These datasets cover three aspects of disease (cancer) diagnosis, such as gene expression and serum mass spectrometry. The data dimensionality ranges from 2,000 to 10,000. The first two datasets were taken from the Kent Ridge Biomedical repository (Li & Liu, 2004), and the remaining datasets were taken from the UCI repository (Asuncion & Newman, 2007).

Experimental procedure
We designed and conducted the following experiments: three high dimensional biomedical datasets were analyzed with our proposed algorithm, Relief (a filter method based on nearest-neighbour distance) (Kononenko, 2004), and maximum relevance minimum redundancy (mRMR, a filter method based on MI) (Peng et al., 2005) under the same conditions. Here, the same conditions means that a random forest (RF, numTrees = 10) (Zhang et al., 2018) was adopted as the classifier to evaluate classification accuracy in every case.
10-fold cross validation was adopted to evaluate classification accuracy: each dataset was stratified into 10 folds, of which 9 folds were used as the training sample and the remaining fold constituted the testing sample. All experiments were implemented in Matlab 2017a.
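The original experiments were run in Matlab, but the protocol is easy to mirror. The hedged Python sketch below reproduces the setup with scikit-learn's stratified 10-fold cross-validation and a 10-tree random forest, using stand-in synthetic data in place of the Table 2 datasets:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# stand-in data; the paper uses the biomedical datasets of Table 2
X, y = make_classification(n_samples=200, n_features=2000,
                           n_informative=20, random_state=0)

clf = RandomForestClassifier(n_estimators=10, random_state=0)    # numTrees = 10
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # 9 train / 1 test

scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
print(f"mean accuracy {scores.mean():.4f} +/- {scores.std():.4f}")
```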

Results
For the three datasets in Section 4.1, we conducted the experiments described in Section 4.2. Three main statistical indicators are compared and analyzed in Table 3: (1) Mean (%), the mean performance; (2) Std, the standard deviation; and (3) MeanFN, the mean number of selected features. Boldface marks the best experimental result.
From Table 3, we can observe the following: (1) our algorithm obtained the best Mean among the three feature selection algorithms, with best Means of 92.01%, 82.99%, and 85.67%, respectively. In addition, the maximum Mean improvement of our proposed algorithm over the full feature set was 13.80%.
(2) For two out of the three experimental results, the Std obtained by our proposed algorithm is smaller than that of the other two algorithms.
(3) All three feature selection algorithms can effectively reduce the feature dimension, and the dimensionality reduction achieved by our proposed algorithm is the most pronounced. In addition, Figure 1, obtained by statistical analysis of Table 3, shows the average attribute value (avg(Mean)). From the comparison results, we can observe that our proposed algorithm is superior to the other two algorithms.

CONCLUSIONS
In this study, the relationship between two kinds of correlation (the correlation between features and classes, and the correlation between pairs of features) is established to eliminate redundant features. Because the identification of completely redundant features is difficult to realize, we analyze four kinds of boundary conditions between R_{i,c} and R_{i,j}, and then propose two redundancy criteria. On this basis, approximate redundancy features are defined. Finally, we have proposed a new feature selection algorithm based on redundancy removal for high dimensional data classification.

Figure 1:
The average attribute value (avg(Mean))

ACKNOWLEDGMENT
This work was supported by the National Basic Research Program of China (International S&T Cooperation of MOST [2013DFA11140]); the Young Scholar Fund of Lanzhou Jiaotong University [2016004]; and the Teaching and Reform Project of Lanzhou Jiaotong University [JGY201841].

Table 1:
Four cases of extreme values

Table 2:
High dimensional datasets

Table 3:
Comparison of experimental results on different datasets