Emotion Recognition from Speech Signals by Using Evolutionary Algorithm and Empirical Mode Decomposition

Emotional state strongly affects people's health and job performance, so a system that can continuously and automatically monitor people's emotions is worth developing. Speech is the most common means of communication between humans and carries emotional features, which makes emotion recognition from speech an important topic. In this paper, we propose a strategy for emotion recognition from speech that combines an evolutionary algorithm (EA) with Empirical Mode Decomposition (EMD) to improve the emotion recognition rate. First, emotional speech signals are decomposed into several Intrinsic Mode Functions (IMFs) by the EMD process. The emotional part of a speech signal is then extracted from these IMFs: weighted IMFs obtained from EMD are combined and passed to the subsequent recognition process. One goal of this paper is therefore to find the optimal weight for each IMF so that the combined signal makes the recognition results as accurate as possible. The weights are trained by an evolutionary algorithm, which has performed well in many studies on optimal design. Mel-Frequency Cepstral Coefficients (MFCCs) are then computed and used as the features for emotion recognition. An open database, the eNTERFACE 2005 emotion database, provides the training and testing data for the experiments.


INTRODUCTION
It is important to detect people's emotions in their daily lives. Emotional state strongly affects health, quality of life and job performance. Hence, a system that can continuously and automatically monitor people's emotions is worth developing. Moreover, speech is the most commonly used means of communication between humans and always carries some emotional features, so emotion recognition from speech becomes all the more important. Although this is a new research area, there is already some literature on the topic; see for example (Schuller, Rigoll & Lang 2003, Fragopanagos & Taylor 2005, Cen et al. 2010, Wu, Falk & Chan 2011, Petrusihin 2000, Tato et al. 2002, Yacoub et al. 2003, Schuller, Rigoll & Lang 2004, New, Foo & Silva 2003) and the references therein. However, in the existing literature, the emotion recognition rate from speech is low and far from sufficient for practical applications. Hence, how to improve the recognition rate is an open problem, and research on emotional speech recognition still has room for innovation. On the other hand, EMD was first proposed by Prof. Huang in combination with the Hilbert Transform (HT) to analyse nonlinear and nonstationary time series. The combination of EMD and HT is therefore referred to as the Hilbert Huang Transform (HHT) (Huang 1996). The advantage of EMD over other frequency-domain transformations is that the components decomposed from a mixed signal are related to specific physical sources. This allows us to examine the physical phenomena of a signal through the components obtained by EMD. Initially, EMD was applied to signal analysis in geoscience, strength analysis of material structures, trend analysis of the stock market, etc.
More recently, EMD has been applied in the measurement and enhancement of speech signals (Gales & Young 1995, Tsao & Lee 2009), biomedical signal analysis (Windmann & Haeb-Umbach 2009), earthquake signal analysis (Huang 1996) and speech feature capture (Li et al. 2008), etc. For speech feature capture, the paper (Wang, Li & Zhang 2006) first uses EMD to decompose the speech signal into several IMFs, and then uses Rényi entropy to calculate each IMF's energy. The features are then obtained from these energies. Given the impressive performance of EMD in feature extraction, this paper applies EMD to derive features for the emotional components of speech.
The strategy for emotion feature extraction in this paper is to combine weighted IMFs obtained from the EMD process and then compute the emotion features of the resulting signal by using MFCC. The design of the weight for each IMF is therefore the key to emotion feature extraction, and finding the optimal weight for each IMF is one of the goals of this paper. Since the computing power of computers has been enormously enhanced in recent years, evolutionary computation is widely applied in many areas of engineering; for a recent review of these applications, see e.g. (Uesaka & Kawamata 2003). The most popular evolutionary algorithm is the genetic algorithm (GA). GA is based on the Darwinian "Survival of the Fittest" strategy. Each individual in the population represents a potential solution to the problem at hand. Individuals compete and mate with each other in order to produce stronger individuals. Much recent research has pointed out that GA is efficient in solving optimization problems (Haupt & Haupt 1998, Karamalis, Kanatas & Constantinou 2009, Doong, Lai & Wu 2007, Pan & Lai 2008, Lai & Chang 2009, Tsai, Huang & Chan 2011). Hence, in this paper, the weights for each IMF are trained by GA to find an optimal weighted combination of IMFs, with the aim of increasing the recognition precision. The proposed method is shown in Figure 1.

EMPIRICAL MODE DECOMPOSITION (EMD)
This study applies EMD to decompose an emotional speech signal into emotional parts and non-emotional parts. In this section, the procedure for performing EMD is introduced. The purpose of EMD is to sift the original non-stationary data series until the final data series is stationary. The main step of the EMD operation is to divide a speech signal into several intrinsic mode functions (IMFs). The conditions that an IMF must satisfy can be described as follows (Huang 1996).
1. Over the whole data series, the number of local extrema and the number of zero crossings must be equal or differ at most by one.
2. At any point, the mean value of the envelope defined by the local maxima and the envelope defined by the local minima is zero.
The procedure of EMD for a data series or a signal is as follows. Cubic splines (Mathews & Fink 2004) are used to generate the upper and lower envelopes of the signal while finding the IMFs. Suppose that the original signal is $X(t)$.
Step 3: Repeat Step 1 and Step 2 to find $imf_2(t)$.
Step 4: Repeat Step 3 to find the subsequent IMFs as follows.
This step comes to an end when the signal $r_n(t)$ is constant or is a monotone function. After the EMD procedure (Step 1 to Step 4) is finished, the following decomposition of $X(t)$ is obtained: $X(t) = \sum_{i=1}^{n} imf_i(t) + r_n(t)$. However, it is hard to satisfy the second condition of an IMF in practical applications, since a zero mean value of the envelopes for all time $t$ is almost impossible. Hence, a looser condition is introduced to replace the second condition of an IMF. To construct this looser condition, an index for the mean value of the envelopes is computed and a threshold for this index is assigned. In general, the threshold is assigned in the range 0.2 to 0.3. Moreover, the index can be calculated through the following equation (Huang 1996).
In practical applications, the second condition of an IMF is then replaced by the looser condition: the index $SD_{ik}$ should be smaller than the assigned threshold. That is, once $SD_{ik}$ falls below the assigned threshold, the second condition of the IMF is regarded as satisfied, the iteration for the $i$th IMF ends, and we get a new IMF.
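The stopping test described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the index equation is not reproduced in this text, so `sd_index` uses the form commonly cited for Huang's sifting process (the normalized squared difference between two consecutive sifting results), and the function names, the small `eps` guard and the default threshold of 0.25 are illustrative choices.

```python
import math


def count_extrema(x):
    """Count strict local maxima and minima of a sampled series."""
    n = 0
    for i in range(1, len(x) - 1):
        is_max = x[i] > x[i - 1] and x[i] > x[i + 1]
        is_min = x[i] < x[i - 1] and x[i] < x[i + 1]
        if is_max or is_min:
            n += 1
    return n


def count_zero_crossings(x):
    """Count sign changes between consecutive samples."""
    return sum(1 for a, b in zip(x, x[1:]) if a * b < 0)


def satisfies_condition_1(x):
    """IMF condition 1: extrema and zero crossings differ by at most one."""
    return abs(count_extrema(x) - count_zero_crossings(x)) <= 1


def sd_index(h_prev, h_curr, eps=1e-12):
    """Assumed SD form: sum over t of (h_prev - h_curr)^2 / h_prev^2.

    eps is an illustrative guard against division by zero.
    """
    return sum((p - c) ** 2 / (p * p + eps) for p, c in zip(h_prev, h_curr))


def sifting_may_stop(h_prev, h_curr, threshold=0.25):
    """Looser stopping rule: condition 1 holds and SD is below the threshold."""
    return satisfies_condition_1(h_curr) and sd_index(h_prev, h_curr) < threshold


# A pure tone behaves like an IMF; the same tone with a DC offset does not.
tone = [math.sin(2 * math.pi * 3 * i / 200 + 0.3) for i in range(200)]
shifted = [v + 2.0 for v in tone]
```

The threshold default sits inside the 0.2 to 0.3 range mentioned in the text; any value in that range could be substituted.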

DISCRETE HIDDEN MARKOV MODEL
An HMM is used for emotion recognition from speech in this paper. According to the type of probability distribution used, HMMs can be categorised into the Continuous Hidden Markov Model (CHMM) and the Discrete Hidden Markov Model (DHMM). The DHMM provides more stable recognition results and faster training with almost the same level of recognition accuracy as the CHMM. Therefore, the DHMM is adopted for emotion recognition in this paper. Some useful features for emotion recognition are selected to train the DHMM. The resulting HMM can become a reliable computer-assisted tool for monitoring people's emotions in the future. Figure 2 illustrates the training process of the DHMM. First, the matrices A, B and π, which describe the DHMM, are randomized at the initial step. Then, the features obtained from the EMD of emotional speech are quantized through a trained codebook. The quantized features are used as the observation data of the DHMM. The corresponding probability for every observation can be found from the elements of the matrices A, B and π. By using these probabilities, we can run the Viterbi Algorithm (Blunsom 2004) to update the matrices A, B and π until their elements converge. In a DHMM, the hidden states are unobservable while the outputs (i.e., the observations) of each state are observable. Each hidden state has a probability distribution over the possible output tokens. Therefore, the sequence of output tokens generated by the DHMM gives some information about the sequence of states. For clarification, the relations between the features, the observations and the hidden states of the DHMM are depicted in Figure 3. In the following, the definitions and the detailed training method of the parameters in the DHMM are introduced (Blunsom 2004). The update formulas involve the number of training data. By using the updated A, B and π, we run the Viterbi Algorithm again.
The above steps are repeated until the matrices converge. The training process for a DHMM is then completed. During the recognition process, the probability of the observations is calculated based on the trained model.
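As an illustration of the recognition step, the sketch below scores a quantized observation sequence against one DHMM per emotion and picks the best-scoring model. It is a sketch under assumptions rather than the paper's exact formula: the scoring equation is not reproduced in this text, so the Viterbi best-path log-probability is used here as one common choice (the forward algorithm would be another). The function names and the toy model parameters are hypothetical.

```python
import math


def _log(p):
    """Natural log that maps zero probability to -inf instead of raising."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi_log_score(obs, pi, A, B):
    """Log-probability of the best hidden-state path for a non-empty sequence.

    obs: list of codebook indices; pi: initial state probabilities;
    A[r][s]: transition probability r -> s; B[s][o]: probability of symbol o in state s.
    """
    n_states = len(pi)
    delta = [_log(pi[s]) + _log(B[s][obs[0]]) for s in range(n_states)]
    for o in obs[1:]:
        delta = [
            max(delta[r] + _log(A[r][s]) for r in range(n_states)) + _log(B[s][o])
            for s in range(n_states)
        ]
    return max(delta)


def classify(obs, models):
    """Return the label of the model (pi, A, B) with the highest score."""
    return max(models, key=lambda label: viterbi_log_score(obs, *models[label]))


# Toy two-state models over a two-symbol codebook (values are illustrative only).
models = {
    "anger": ([0.6, 0.4], [[0.7, 0.3], [0.4, 0.6]], [[0.9, 0.1], [0.8, 0.2]]),
    "sadness": ([0.5, 0.5], [[0.5, 0.5], [0.5, 0.5]], [[0.1, 0.9], [0.2, 0.8]]),
}
```

Working in the log domain avoids numerical underflow on long observation sequences, which is why the sketch adds log-probabilities instead of multiplying probabilities.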

GENETIC ALGORITHM
To explain the proposed strategy clearly, this section briefly introduces the genetic algorithm (GA). It is well known that GA solves optimization problems through the evolution of chromosomes from generation to generation. The evolution processes are selection, crossover, and mutation. The key point is therefore how to define the chromosomes for the problem at hand. A chromosome is composed of multiple genes. The number of genes in a chromosome is determined by the number of parameters to be designed in the problem. In general, a GA with longer chromosomes needs fewer generations to obtain an optimal solution, but it takes more computing time for each evolution step. A GA with shorter chromosomes takes less computing time for each evolution step, but it may converge to a local optimum. Consequently, the number of genes in a chromosome should be carefully determined. Moreover, there are three ways of encoding genes: binary encoding, real number encoding, and symbol encoding. The encoding is chosen according to the type of the parameters in the problem. Figure 4 shows the three different ways of encoding genes in a chromosome.
In this paper, a chromosome is made up of twelve features which are calculated from MFCC for emotional speech after the EMD process. Real number encoding is used here, since all the features calculated by MFCC are real-valued.
A GA is applied in this paper to find the optimal weights for the IMFs obtained from the EMD process on emotional speech. Each chromosome is encoded here as a vector of weights $w_i = [w_{i1}, w_{i2}, \ldots, w_{in}]$, in which the index $i$ denotes the $i$th generation and $n$ denotes the number of genes in a chromosome. The GA starts by randomizing the chromosomes in the first generation. Then, the fitness function is calculated for each chromosome. In this paper, the fitness function is defined simply as the emotion recognition rate. According to "Survival of the Fittest", chromosomes with better fitness values are selected by Roulette Wheel Selection for evolving the next generation. Then, a linear crossover is applied to the selected chromosomes. The mutation process is performed after the crossover process. In the mutation process, an offset for the mutation probability is used. A random value is generated for each gene to determine whether the mutation will be performed. If the random value for a gene is larger than the offset, the gene mutates. The flowchart in Figure 5 shows the procedure of GA in this paper.
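The loop described above (random initialization, fitness evaluation, Roulette Wheel Selection, linear crossover, offset-based mutation) can be sketched as follows. This is an illustrative sketch, not the authors' code. Several details are assumptions: the linear crossover is taken to be a random convex combination of two parents, the offset is taken to be one minus the mutation rate so that a gene mutates when its random value exceeds the offset, mutated genes are resampled uniformly in [0, 1], and `toy_fitness` is a hypothetical stand-in for the emotion recognition rate. The default population size, gene count, mutation rate and survival rate follow Table 1.

```python
import random


def roulette_select(population, fitnesses, k, rng):
    """Pick k chromosomes with probability proportional to fitness."""
    if sum(fitnesses) <= 0:
        return [rng.choice(population) for _ in range(k)]
    return rng.choices(population, weights=fitnesses, k=k)


def linear_crossover(p1, p2, rng):
    """Assumed form: child = a * p1 + (1 - a) * p2 with a random a in [0, 1)."""
    a = rng.random()
    return [a * x + (1 - a) * y for x, y in zip(p1, p2)]


def mutate(chrom, offset, rng):
    """Resample a gene when its random value is larger than the offset."""
    out = []
    for gene in chrom:
        if rng.random() > offset:
            out.append(rng.random())
        else:
            out.append(gene)
    return out


def run_ga(fitness, n_genes=5, pop_size=32, generations=50,
           mutation_rate=0.03, survival_rate=0.5, seed=0):
    """Return the best chromosome seen in any evaluated generation and its fitness."""
    rng = random.Random(seed)
    offset = 1.0 - mutation_rate  # assumption: offset = 1 - mutation rate
    pop = [[rng.random() for _ in range(n_genes)] for _ in range(pop_size)]
    best, best_fit = None, float("-inf")
    for _ in range(generations):
        fits = [fitness(c) for c in pop]
        for c, f in zip(pop, fits):
            if f > best_fit:
                best, best_fit = list(c), f
        n_keep = max(2, int(pop_size * survival_rate))
        parents = roulette_select(pop, fits, n_keep, rng)
        children = [linear_crossover(rng.choice(parents), rng.choice(parents), rng)
                    for _ in range(pop_size - n_keep)]
        pop = [mutate(c, offset, rng) for c in parents + children]
    return best, best_fit


def toy_fitness(weights, target=(0.9, 0.7, 0.5, 0.3, 0.1)):
    """Hypothetical stand-in for the recognition rate, with values in (0, 1]."""
    return 1.0 / (1.0 + sum((w - t) ** 2 for w, t in zip(weights, target)))


best_weights, best_rate = run_ga(toy_fitness)
```

In the real system each fitness evaluation would require training and testing the recognizer, so the number of generations and the population size dominate the running time.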

ALGORITHM FOR PROPOSED METHOD
In order to make the illustration of the proposed strategy clearer, an algorithm is introduced as follows.
1. Find the IMFs for each training emotional speech signal by using the EMD method.
2. Find the optimal weight for each IMF by using GA.
3. Add all weighted IMFs to obtain a new speech signal.
4. Compute MFCCs on the speech signal obtained in Step 3 to get the emotional feature vectors of the speech.
5. Use the feature vectors obtained in Step 4 to train the HMM.
6. Calculate the recognition rate. If the recognition rate satisfies the target, stop the training stage; otherwise go back to Step 2.
7. Go to the testing stage.
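Step 3 of the algorithm above amounts to a weighted sum of the IMFs, sample by sample. A minimal Python sketch follows; the function name `combine_weighted_imfs` and the list-of-lists signal representation are illustrative assumptions.

```python
def combine_weighted_imfs(imfs, weights):
    """Return the signal s(t) = sum_i weights[i] * imfs[i][t].

    imfs: list of equally long sample lists, one per IMF.
    weights: one weight per IMF (e.g. a GA chromosome).
    """
    if len(imfs) != len(weights):
        raise ValueError("need exactly one weight per IMF")
    length = len(imfs[0])
    if any(len(imf) != length for imf in imfs):
        raise ValueError("all IMFs must have the same number of samples")
    return [sum(w * imf[t] for w, imf in zip(weights, imfs)) for t in range(length)]


# Two short toy IMFs combined with weights 0.5 and 0.1.
combined = combine_weighted_imfs([[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]], [0.5, 0.1])
```

Setting all weights to one gives the plain sum of the supplied IMFs, and setting a weight to zero removes that IMF from the signal before feature extraction.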

EXPERIMENTAL RESULTS
In the experiments, the recordings in the emotional speech database eNTERFACE 2005 are used as the training and testing data for the proposed system. The recordings in the database were collected from 35 males and 9 females and are all in English. The database contains speech with 6 different types of emotions: Anger, Disgust, Fear, Happiness, Sadness, and Surprise. There are a total of 1295 recordings in AVI format in the database. The recordings are transformed into WAV format before they are processed. The recordings in WAV format are sampled with a sampling rate of 256 samples, a sample length of 16 bits, and a single (mono) channel. In this experiment, a 10-fold cross-validation is adopted. The experiment consists of ten different runs, each with a different combination of the recordings. Each run is performed for six emotions with 40 recordings for training and 20 recordings for testing.
For feature extraction from the emotional speech, the EMD process is first applied to the speech signals. Then, the weighted IMFs obtained from the EMD process are added to form a new signal. Thereafter, MFCC features are calculated from the new signal. In this experiment, the number of IMFs in the EMD process is set to 5 and the dimension of the MFCC feature vectors is set to 12. Moreover, the GA parameters used in this experiment for finding the optimal weights of the IMFs are given in Table 1. Concerning the training of the HMM, a codebook is first trained by the k-means method. The matrices A, B and π in the HMM are then calculated from the codebook. One HMM is trained for each of the six considered emotions. That is, an HMM is trained by all the training data for a specific emotion.
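The quantization step that turns MFCC vectors into discrete HMM observations can be sketched as a nearest-codeword lookup. The sketch below assumes that the codebook has already been trained (for example by k-means) and that squared Euclidean distance is used; the function name `quantize` and the toy two-dimensional codebook are illustrative, whereas the experiment uses 12-dimensional MFCC vectors.

```python
def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codeword.

    features: list of feature vectors; codebook: list of codewords of the
    same dimension. Ties go to the lower codeword index.
    """
    def sqdist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    return [
        min(range(len(codebook)), key=lambda k: sqdist(f, codebook[k]))
        for f in features
    ]


# Toy example: three 2-D vectors quantized against a two-codeword codebook.
symbols = quantize([[1, -1], [9, 12], [4, 4]], [[0, 0], [10, 10]])
```

The resulting index sequence is what a discrete HMM consumes as its observation sequence.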
Figure 6 shows the evolution of the GA in the proposed method for one of the ten experiments. The emotion recognition rate converges at about the 75th generation. The average recognition rates for the ten experiments are shown in Table 2. To illustrate the advantage of the proposed method for feature extraction, Table 3 lists the average recognition rates for various combinations of IMFs. The case in which the MFCCs are calculated without EMD pre-processing is also included in the table. The proposed method, in which the IMFs are weighted with GA-trained weights before feature extraction, reaches a better recognition rate than the other combinations in Table 3. This confirms the performance of the proposed method. However, Table 4, which gives the average recognition rates of the ten experiments for each emotion, shows that the average recognition rates for three emotions (disgust, fear and surprise) are lower than those of the other emotions. This is because the features of these three emotions are similar to each other, which causes misclassification among them.
Besides, to verify the performance of the proposed method, Figure 7 compares the recognition rates for the ten experiments among the proposed method, the method using MFCC without EMD, and the methods in the papers (Lenzi, Mendes & Da Silva 2000, He et al. 2011). It can be seen that the proposed method outperforms the other three methods in the figure.

Table 1: GA parameters used in the experiment.
Number of chromosomes in a generation: 32
Number of genes in each chromosome: 5
Method of crossover: Linear crossover
Mutation rate: 0.03
Survival rate: 0.5

Figure 6: Evolutions of GA for emotion recognition in the proposed method.

CONCLUSIONS
This paper proposes a strategy that uses EMD, GA and HMM to recognise six types of emotions in the eNTERFACE 2005 database. The experiments in this paper demonstrate the good performance of the proposed method compared to the methods in some existing papers. However, the recognition rates in the experiments are still low, especially for three emotions: disgust, fear and surprise. This is very likely due to the fact that the features of these three emotions are similar to each other. Hence, other methods, for example SVM, Fuzzy Logic or Neural Networks, may be integrated into the proposed method to distinguish the features more precisely.