End-to-End Emotion Recognition using Peripheral Physiological Signals

The study of emotion recognition from physiological signals has grown considerably in recent decades. Early studies used traditional machine learning classification to estimate either discrete emotions or combinations of arousal and valence. However, these approaches require feature engineering prior to classification, such as Ensemble Empirical Mode Decomposition (EEMD) analysis for electroencephalography (EEG) signals and statistical calculations for peripheral signals. In addition, participants needed to wear several sensors for emotions to be predicted. This study investigates whether arousal and valence can be predicted from a single peripheral signal using deep learning. The Galvanic Skin Response (GSR), Respiration (RSP), Blood Volume Pulse (BVP) and Temperature (Temp) signals from the DEAP dataset are used. The signals are downsampled to approximately three hertz (Hz) and input to a convolutional neural network (CNN) to predict arousal and valence. GSR, RSP and BVP produced similar F1 and accuracy results. BVP had F1 scores of 0.673 and 0.632 and accuracies of 63.5% and 61.1% for arousal and valence respectively. RSP had F1 scores of 0.677 and 0.669 and accuracies of 61.3% and 64.2% for arousal and valence respectively. GSR had F1 scores of 0.699 and 0.663 and accuracies of 62.5% and 60.2% for arousal and valence respectively. Using raw signals and examining the peripheral signals individually, we identified which sensors show the best potential for further research towards real-world emotion classification using non-invasive sensors.


INTRODUCTION
With the advancement of artificial intelligence (AI), machine learning and deep learning techniques, the study of emotion recognition from physiological signals has grown considerably in recent decades. Emotion recognition research involves the selection of participants and the triggering of different emotions using stimuli such as video clips, music, or images which the participant experiences. The participants wear sensors that collect data capturing their response to the emotional stimulus. Basic inputs such as ElectroCardioGraphy (ECG), ElectroEncephaloGraphy (EEG), Blood Volume Pulse (BVP), Galvanic Skin Response (GSR), and Body Temperature (BT) can be measured, and powerful processors can be used to calculate parameters such as Heart Rate (HR), Heart Rate Variability (HRV) and InterBeat Intervals (IBI) from these basic inputs. Studies initially collected EEG signal data in a lab environment, while more recently wearable sensors are being used. In emotion recognition research, AI models are applied to the sensor outputs. What is measured as an emotion varies between studies. Affect is the general sense of feeling that a person experiences. It has two features: valence (V), which is how pleasant/positive or unpleasant/negative a person feels, and arousal (A), which is how active or passive a person feels. Discrete emotions such as the six widely used basic emotions (happiness, fear, surprise, anger, sadness, and disgust) can be mapped to a V-A quadrant (LVLA, LVHA, HVLA, HVHA), as shown in Fig. 1, where low is negative valence and high is positive valence. In research, emotion measurements have been presented as combinations of A and V, as the four quadrants of emotion (HVHA, LVHA, LVLA, HVLA), and as attempts to identify discrete emotions. To classify the emotions, features are engineered from the sensor outputs before being input to the AI models.
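The quadrant labels used throughout this paper are simply the concatenation of binary valence and arousal levels; as a trivial illustration (the function name is ours, not from the study):

```python
def quadrant(valence_high: bool, arousal_high: bool) -> str:
    """Map binary valence/arousal levels to one of the four
    emotion quadrants: LVLA, LVHA, HVLA or HVHA."""
    v = "HV" if valence_high else "LV"   # high/low valence
    a = "HA" if arousal_high else "LA"   # high/low arousal
    return v + a
```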
However, while feature engineering is necessary when using traditional classifiers such as k-nearest neighbour (kNN), linear discriminant analysis (LDA), logistic regression (LR), and support vector machines (SVM), deep learning networks such as convolutional neural networks (CNN) and long short-term memory (LSTM) networks should derive features themselves, thus removing the need for manual feature engineering. Table 1 divides feature engineering into three terms: extracted, for features engineered from EEG signals; statistical, for features engineered from non-EEG peripheral signals; and raw, for signals that are not feature engineered. The table also shows the signal and model used and what was measured as the emotion. This study aims to further this research by removing the feature engineering step and classifying A and V from individual peripheral signals only. The main contributions of this work are as follows:
• Automatic feature extraction from downsampled raw peripheral physiological signal data for the prediction of emotion;
• Exploration of the possibility of using a single peripheral physiological signal for the prediction of arousal or valence.
Shu et al. (2018) identified several challenges in emotion recognition based on physiological signals. Before discussing traditional machine learning and deep learning approaches, they identified three stages of data preparation that influence emotion classification: emotion elicitation, data acquisition, and data preprocessing.

Emotion Elicitation
In a lab setting, emotion elicitation is the triggering of emotions. Participants experience a stimulus, usually video and/or different types of music, to trigger these emotions, as done by Koelstra et al. (2011), Katsigiannis and Ramzan (2017), Miranda-Correa et al. (2018) and Subramanian et al. (2016) to create the corresponding emotion recognition databases DEAP, DREAMER, AMIGOS and ASCERTAIN. A systematic review of emotion recognition using wearables (2,424 papers, with 2 percent used in the final analysis) identified six stages for emotion recognition in a real-life environment, the first being the emotional model used. The reviewers note that, as different studies use different emotion models, it is impossible to compare results. Barrett (2017) proposes that emotion, at a particular moment, is predicted from experiences learned over a lifetime. Siegel et al. (2018) find that there is no one-to-one mapping between an emotion category and a specific autonomic nervous system (ANS) response pattern, the ANS being the component of the peripheral nervous system that regulates involuntary physiological processes including heart rate, blood pressure and respiration. Focusing on measuring brain activity, Saarimäki et al. (2016) and Clark-Polner et al. (2017) disagree on the fundamental understanding of the neural basis of emotion and on whether brain activity patterns can be classified into, for example, the six basic emotions (disgust, fear, happiness, sadness, anger, and surprise). Other studies predicting affect from physiological signals score high and low values for arousal and valence (HA, HV, LA, LV), as done by Koelstra et al. (2011); combinations thereof giving the four quadrants of emotion HVHA, LVHA, LVLA and HVLA, as done by Katsigiannis and Ramzan (2017) and Miranda-Correa et al. (2018); or both, as done by Alazrai et al. (2018). However, it is widely accepted that the ANS responds to different emotion stimuli. Goshvarpour et al. (2017) found in their investigation of emotion recognition from ECG and GSR that valence was more detectable than arousal. Vijayakumar et al. (2020) showed that arousal outperformed valence for all models and all physiological signals, including GSR, RSP, BVP and skin temperature. In this study, both arousal and valence will be evaluated for each of these four physiological signals.

Data Acquisition and Datasets
Data acquisition involves connecting different types of sensors to human participants to measure physiological activity. Invasive sensors, such as EEG electrodes (12 or more) to measure brain activity, a conventional 12-lead ECG to measure heart rate and a finger clip to measure pulse, were used in the creation of the DEAP dataset by Koelstra et al. (2011). These sensors are being replaced with low-cost off-the-shelf sensors. The Emotiv EPOC wireless EEG headset and the wireless SHIMMER ECG sensor were used to compile the DREAMER dataset by Katsigiannis and Ramzan (2017) and the EEG-, ECG- and GSR-based AMIGOS dataset by Miranda-Correa et al. (2018). Other off-the-shelf sensors were used to measure EEG, ECG and GSR in the ASCERTAIN dataset by Subramanian et al. (2016). The Empatica wristband was used to detect optimal user experience from BVP and Electrodermal Activity (EDA), an umbrella term for GSR, by Maier et al. (2019). Further enhancing the non-invasiveness of sensors, although not yet used in studies to detect emotion, sensors embedded in textiles have been used to measure respiration by Piuzzi et al. (2020) and to measure HR, skin temperature and respiration, as highlighted in a review by Angelucci et al. (2021). Datasets from several of these studies are available for further research and can be briefly described as follows:
• DEAP: Koelstra et al. (2011) collected data from 40 sensors on 32 participants who scored arousal, valence, liking and dominance on a Likert-like scale of 1 to 9 while watching 40 60-second videos;
• DREAMER: Katsigiannis and Ramzan (2017) collected data from EEG and ECG sensors on 23 participants who scored valence, arousal, and dominance;
• ASCERTAIN: Subramanian et al. (2016) collected data from EEG, ECG, GSR and facial expression using off-the-shelf sensors on 58 participants who scored emotional ratings and personality scales;
• K-EmoCon: Kim and André (2008) had 32 subjects participate in paired live debates while wearing off-the-shelf sensor devices: the Empatica E4 for pulse rate, 3-axis acceleration, body temperature, EDA and derived measures (HR, IBI); the Polar H7 Bluetooth heart rate sensor; and the NeuroSky MindWave with two dry sensors for EEG signals. There were 16 live 10-minute debates, and arousal and valence labels were assigned by observers at 5-second intervals.
As this study is primarily interested in peripheral physiological signals, the DEAP dataset was chosen for two main reasons: it has four peripheral physiological responses, compared to fewer than four in the other studies, and both arousal and valence labels are provided.

Data Preprocessing
The outputs from the sensors are analog signals which vary in both amplitude and frequency for the different physiological measurements. One earlier study used a wavelet dictionary to extract matching pursuit (MP) coefficients from seven subjects, which were then statistically analysed for significance, concluding that on average GSR is a better predictor of valence than of arousal and that the classification of both valence and arousal using ECG was better than with GSR. In this study, four of the peripheral signals from the DEAP dataset will be used. The GSR signal data was collected by measuring galvanic skin response on the left middle and ring fingers. A respiration belt was used to measure RSP, a plethysmograph on the left thumb was used to measure BVP, and a temperature sensor on the left little finger was used to measure temperature. The signals were then preprocessed by downsampling the data to 128 Hz, segmenting it into 60-second trials and removing a 3-second pre-trial baseline, resulting in the preprocessed Python data files provided by DEAP. These preprocessed signals form the raw signals that will be further downsampled and input to a CNN, which will itself extract the features prior to classification of both arousal and valence.

METHODOLOGY
Design of the CNN involved outlining a process with the steps of data preparation, data modelling, classification and validation. Hyperparameters were optimised manually. First, initial parameters were set for the model similar to those used by Maier et al. (2019): four convolutional layers (32 filters, kernel size 3) with ReLU activations alternated with max pooling layers, and a final dense layer with one unit for one class and a softmax activation. The hyperparameters were then tuned over several test runs until a model was found that did not overfit. Fig. 2 shows the accuracy and loss for each of the four sampled raw signals (GSR, respiration, BVP, temperature) for the final chosen hyperparameters, which are shown in Fig. 3.
Cross-validation was used to assess the generalisation ability of the predictive model, using the k-fold cross-validation method with k set to 10. Analysis of the dataset found that the arousal and valence samples per binary class were not balanced, as shown in Fig. 4. Therefore, a stratified k-fold cross-validator was used, which samples in such a way that the class proportions in each fold reflect the proportions in the full training set.
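As an illustrative sketch (not the authors' exact code), the stratified 10-fold split can be set up with scikit-learn; the array shapes match the prepared data, but the values and labels below are synthetic stand-ins:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X = rng.standard_normal((1280, 201, 1))    # samples x timesteps x features
y = (rng.random(1280) > 0.6).astype(int)   # imbalanced binary labels

# StratifiedKFold keeps each fold's class balance close to the overall balance.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
fold_ratios = [y[test].mean() for _, test in skf.split(X, y)]
```

Because the split is stratified, the class-1 proportion in every test fold stays within a fraction of a percent of the overall proportion, unlike a plain `KFold` on imbalanced labels.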

Data Preparation
The preprocessed Python version of the DEAP dataset was used. It comprises 32 participant .dat files, each with a data array of shape 40 × 40 × 8064 (video/trial × channel × data) and a labels array of shape 40 × 4 (video/trial × label). In the DEAP dataset, the data channels have been downsampled from the original 512 Hz signals to 128 Hz. The channels comprise 32 Electroencephalogram (EEG) channels, horizontal and vertical Electrooculography (EOG) channels, two Electromyography (EMG) channels, and one channel each for GSR, RSP, BVP and Temperature. The labels comprise valence, arousal, dominance and liking, each a floating point number in the range 1 to 9. Data preparation was executed in the Google Colaboratory hosted Jupyter notebook service. To prepare the data channels for input to the CNN Conv1D model, the 32 .dat files were read into the notebook. The EEG, EOG and EMG channels were removed from the data array, resulting in a data array of shape 1280 × 4 × 8064 (samples × channels × timesteps), derived from 32 participants in 40 video trials (1280 samples), 4 channels, and 63-second signals sampled at 128 Hz (8064 readings). Each of the four remaining channels (GSR, Respiration, BVP and Temperature) was rescaled to the range [0, 1]. The 8064 timesteps represent a 63-second sensed signal sampled at the rate of 128 Hz (samples per second). By further sampling the 8064 timestep readings, taking every 40th reading, each signal was reduced to 201 timesteps, in effect equivalent to sampling the original signal at approximately 3 Hz. As shown in Fig. 5 for one sample on each of the four channels, the integrity of the signal is maintained after downsampling. Finally, for each of the four channels, the array was reshaped to (None, 201, 1) in the Conv1D format (batch size, timesteps, features), whereby with batch size set to None it can be varied in the model.
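These preparation steps can be sketched as follows, using a synthetic array in place of one participant's .dat file (the real files are unpickled with Python's pickle module); the peripheral channel indices (36-39) follow the DEAP channel layout:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal((40, 40, 8064))   # trials x channels x timesteps

peripheral = data[:, 36:40, :]               # GSR, RSP, BVP, Temp channels
# Min-max rescale each trial's channels to the range [0, 1].
lo = peripheral.min(axis=2, keepdims=True)
hi = peripheral.max(axis=2, keepdims=True)
scaled = (peripheral - lo) / (hi - lo)

# Keep every 40th reading: 8064 timesteps -> 201 (roughly 3 Hz).
sampled = scaled[:, :, 39::40]
# Reshape one channel (here GSR) to the Conv1D (batch, timesteps, features) form.
gsr = sampled[:, 0, :].reshape(-1, 201, 1)
```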
To prepare the labels, dominance and liking were removed, and the values for the remaining labels, arousal and valence, which ranged from 1 to 9, were converted to binary values 0 and 1 using a threshold of 4.5 for binary classification.
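The label binarisation amounts to a single thresholding step; the example ratings below are made up:

```python
import numpy as np

# Columns: valence, arousal, dominance, liking (DEAP label order), ratings 1-9.
labels = np.array([[7.1, 3.0, 5.0, 6.2],
                   [2.4, 8.8, 1.0, 4.0]])
va = labels[:, :2]                 # drop dominance and liking
binary = (va > 4.5).astype(int)    # threshold at 4.5 -> classes 0 and 1
```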

Proposed Model
The CNN is a deep learning class of Artificial Neural Network (ANN) which has been used successfully on visual imagery, where the data has two dimensions for images or three dimensions for video. The physiological signal data is one dimensional and can be used with the CNN Conv1D layer. The input shape, as prepared in the previous section, for the CNN Conv1D time series problem is a 3-dimensional array comprising samples, timesteps and features. There are 1280 samples, comprising the 40 trials done by each of the 32 participants. There are 201 timesteps for each signal and one feature per signal. The CNN model is sequential. There are three convolutional layers, all with a kernel size of three and having 8, 16 and 32 filters respectively. These are alternated with max pooling by 3. After the convolutions and max pooling, the features are flattened to one fully connected layer, which is followed by a dense layer with one class, as arousal and valence were classified separately. The Adam optimisation algorithm was used with its default settings. To estimate the generalisation performance, we conducted a 10-fold cross validation using a stratified k-fold validator to preserve the class proportions in the dataset.
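The described architecture can be sketched in Keras as follows; details not stated in the text, such as the activation functions and the sigmoid output for the single binary class, are assumptions:

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(201, 1)),              # 201 timesteps, 1 feature
    layers.Conv1D(8, 3, activation="relu"),    # kernel size 3 throughout
    layers.MaxPooling1D(3),
    layers.Conv1D(16, 3, activation="relu"),
    layers.MaxPooling1D(3),
    layers.Conv1D(32, 3, activation="relu"),
    layers.MaxPooling1D(3),
    layers.Flatten(),
    layers.Dense(1, activation="sigmoid"),     # arousal OR valence, one at a time
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```

The same model is trained twice per signal, once with the binary arousal labels and once with the binary valence labels, matching the single-class output layer.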

RESULTS AND DISCUSSION
We set out to predict arousal and valence from peripheral physiological signals. The GSR, Respiration, BVP and Temperature channels from the DEAP dataset were downsampled and input to a CNN which itself extracted the features. In an earlier study (2017), using low-cost off-the-shelf sensors to sense EEG and ECG signals, classification using a traditional SVM with an RBF kernel achieved accuracies of 62.57% using only the ECG device and 62.49% for arousal using only EEG signals. In our methodology, four of the peripheral signals from the DEAP dataset were used. These raw signals were downsampled and the CNN itself extracted the features for classification of both arousal and valence. As shown in Table 3, the best accuracy was achieved for RSP at 64.2% for valence, followed by an accuracy of 63.5% for arousal using the BVP signal. GSR, at 62.5%, was second best for arousal, followed by RSP at 61.3%. For valence, BVP was second best at 61.1%, followed closely by GSR at 60.2%. Finally, the temperature signal had the lowest accuracy for both arousal (56.5%) and valence (53.2%).

CONCLUSION
In this paper, the use of raw peripheral physiological signals for emotion recognition has been investigated. Using the DEAP dataset, each of the GSR, RSP, BVP and temperature signals was downsampled to approximately 3 Hz. A CNN then automatically extracted features from each signal separately, and each was then used to classify arousal and valence. The accuracies and F1 scores, when compared to other studies, indicate that a single raw peripheral physiological signal is not sufficient for predicting arousal or valence using our current CNN model. Moreover, we found that the GSR, RSP and BVP signals performed similarly, with the temperature signal performing worst. Our future work will further fine-tune the hyperparameters and include regularisation methods to improve the network's performance. It will investigate classification of each of the four peripheral signals into the four quadrants of emotion HVHA, LVHA, LVLA and HVLA. Further, the GSR, RSP and BVP peripheral signals will be combined to classify arousal and valence separately, as well as classifying the four emotion quadrants.

ACKNOWLEDGEMENT
This work is funded by the Technological University of the Shannon: Midlands President's Doctoral Scholarship 2020.