A novel cough audio segmentation framework for COVID-19 detection

Despite its potential, Machine Learning has played little role in the present pandemic, due to the lack of data (i.e., there were not many COVID-19 samples in the early stage). Thus, this paper proposes a novel cough audio segmentation framework that may be applied on top of existing COVID-19 cough datasets to increase the number of samples and to filter out noise and uninformative data. We demonstrate the efficiency of our framework on two popular open datasets.


INTRODUCTION
One of the biggest roadblocks to scientific progress is the restriction of access to data. In the context of Machine Learning (ML) and Deep Learning (DL) methodologies, research teams and scientists alike cannot train models without data.
When the COVID-19 pandemic took the world by storm in 2020, researchers worldwide raced to develop Artificial Intelligence (AI) solutions to accompany existing detection methods, including antigen, molecular, and serological testing (such as PCR and Lateral Flow Tests) and medical imaging (such as chest X-ray scans and lung ultrasound). Studies have shown that coughs contain unique characteristics, making it possible to diagnose COVID-19 via analysis of coughs (Han et al., 2021; Pahar et al., 2021). However, existing respiratory datasets were not adequate for examining the feasibility of AI solutions in a clinical setting, for three reasons:
• Existing audio datasets containing cough sounds, such as the Google Audio Set (Gemmeke et al., 2017), do not specify the diagnoses or pathologies of the coughs; the lack of clinical metadata therefore makes them inappropriate.
• Datasets that do specify them often have a sample size that is too small, especially for data-hungry ML architectures. For example, NoCoCoDa is a database of reflex cough sounds selected from media interviews of COVID-19 patients, but it contains only 10 unique subjects (Cohen-McFarlane et al., 2020).
As a consequence, two and a half years into the pandemic, there is still no cough-based COVID-19 detection tool available to the wider public at the time of writing.

Paper's contributions
This paper makes the following contributions:
• We assess the current challenges of open cough COVID-19 datasets.
• We propose a novel cough audio segmentation framework that may be applied on top of existing cough datasets to increase the number of samples.
• We demonstrate the efficiency of our proposed framework.
The remainder of the paper is organised as follows. Section 2 discusses the inadequacy of existing cough sound datasets, the collection of crowdsourced cough sound datasets, and the open sourcing of some of these datasets following open data initiatives and standardisation to promote open research. Sections 3 and 4 discuss our contribution of a proof-of-concept (POC) novel audio pre-processing and segmentation tool, and the experimentation and evaluation of said tool on two open-source COVID-19 cough sound datasets. Lastly, Sections 5 and 6 discuss related work in the field of cough segmentation, and outline conclusions and future work.

DATASETS OF COUGH SOUNDS
In the two years since the start of the pandemic, a number of crowdsourced audio datasets containing cough, breath, and speech sounds have been collected by various research teams via smartphone and web applications (see Table 1).
The University of Cambridge amassed 53,449 audio samples, of which 2,106 are COVID-19 positive, from 36,116 participants via their smartphone app in their dataset named COVID-19 Sounds (Xia et al., 2021). This dataset is only available to academic institutions due to its sensitive nature.
Another curated dataset, by the Wadhwani Institute of Artificial Intelligence, Mumbai, consists of 3,117 coughs from 1,039 individuals (Bagad et al., 2020). This dataset was intended to be publicly accessible, but unfortunately was not released due to legal constraints.
The Government of Buenos Aires, Argentina, collected a clinically validated dataset called IATos v1 of 5,884 coughs from 2,821 individuals via WhatsApp used in a clinical setting (Pizzo et al., 2021). This dataset was one of the few that are in the dataset, producing a probability of likeliness that a cough is present (1.0 being likely, 0.0 being unlikely) (Orlandic et al., 2021). This can be used to filter data during pre-processing.
The Virufy AI Research Group are currently collecting cough sounds from multiple avenues, including via a smartphone app and clinical collection in hospitals (Chaudhari et al., 2020). This dataset has not been released (save for a small sample of clinically validated coughs: 121 coughs from 16 participants, of which 48 coughs are COVID-19 positive (Chaudhari et al., 2020)), and the current number of samples is kept private, though the team has stated that they intend to open source their datasets eventually. Additionally, the Virufy Common Data Format (CDF) was developed to standardise open-source datasets, i.e., to have the same column names (see Table 2), with an emphasis on clinical data, such as actual RT-PCR test results (or inferred results where data is not, or only partially, clinically validated) (Chaudhari et al., 2020).
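As an illustration of how the CDF columns support pre-processing, the following minimal sketch filters metadata rows by cough_detected and picks a label from pcr_test_result, falling back to pcr_test_result_inferred. The function name, the 0.8 cut-off, and the fallback rule are assumptions made for this example only; they are not part of the CDF itself.

```python
from typing import Dict, List


def filter_cdf_rows(rows: List[Dict], min_cough_prob: float = 0.8) -> List[Dict]:
    """Keep cough recordings that are likely real coughs and have a usable label.

    The 0.8 threshold and the fallback to the inferred label are assumptions.
    """
    kept = []
    for row in rows:
        if row.get("audio_type") != "cough":
            continue  # skip speech recordings
        if float(row.get("cough_detected", 0.0)) < min_cough_prob:
            continue  # probably not a cough
        label = row.get("pcr_test_result")
        if label in (None, "", "untested"):
            # Fall back to the inferred diagnosis when no PCR result exists.
            label = row.get("pcr_test_result_inferred")
        if label not in ("positive", "negative"):
            continue  # drop recovered/untested rows in this sketch
        kept.append({**row, "label": label})
    return kept


# Hypothetical metadata rows using CDF column names.
example_rows = [
    {"audio_path": "a.wav", "audio_type": "cough", "cough_detected": 0.95,
     "pcr_test_result": "positive"},
    {"audio_path": "b.wav", "audio_type": "cough", "cough_detected": 0.30,
     "pcr_test_result": "negative"},
    {"audio_path": "c.wav", "audio_type": "cough", "cough_detected": 0.90,
     "pcr_test_result": "untested", "pcr_test_result_inferred": "negative"},
    {"audio_path": "d.wav", "audio_type": "speech", "cough_detected": 0.99,
     "pcr_test_result": "positive"},
]
usable_rows = filter_cdf_rows(example_rows)
```

Because every CDF-formatted dataset shares the same column names, a filter of this kind could be reused across sources without per-dataset parsing code.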

AUDIO SEGMENTATION
Suppose we have an algorithm that isolates each cough in an audio sample into its own audio file. An original sample with n=10 coughs is then split into 10 cough samples, increasing the sample count 10-fold. Necessary Digital Signal Processing (DSP) techniques, such as normalisation and silence removal, may be performed to remove uninformative data. This is the basis of our contribution. We believe that increasing the number of samples in a dataset may theoretically improve baseline accuracies (following the Law of Large Numbers) for data-hungry DL models such as neural networks, but also for general ML models if the data is of good quality.

Cough event detection
To segment a time-series in a non-overlapping manner, the starting point of the event to isolate needs to be identified. This is known as onset detection, where extreme changes in an audio signal are located. The event here is a cough, which in its most common form is physiologically characterised by three phases. During the first, inspiratory phase, air is drawn into the lungs. This is followed by the compressive phase, characterised by forced expiratory effort against the closed glottis. Finally, during the expiratory or expulsive phase, there is an opening of the glottis and rapid outflow of air. This sudden release of turbulent expiratory airflow is the signature sound that the human ear would recognise as a cough.
However, the difficulty in cough segmentation comes from the subtle differences between various cough types. A peal cough, for example, is a cough that often occurs in epochs, or bouts, defined as a cluster of two or more cough sounds, separated from the next by an interval of no more than 2 seconds. An epoch is therefore where the initial inspiratory phase is followed by a series of further compressive phases associated with glottal closure, sometimes with additional inspirations. Additionally, another type of cough does not have a third phase, meaning that every energy peak is an individual cough (Cohen-McFarlane et al., 2020). The difficulty in cough detection is therefore developing an algorithm that can identify and accurately segment all of these cough types in their entirety, rather than the individual phases (Cohen-McFarlane et al., 2020). An analysis and a summary of how our algorithm handles this variation are given in Sections 4 and 6 respectively.

Proposed segmentation framework
For cough onset detection, we select Root Mean Square Energy (RMSE) as the underlying feature, as it may be interpreted as the signal's amplitude or loudness (Meister et al., 2021). RMSE is computed from the Short-Term Energy (STE), which is the energy of a signal corresponding to the total magnitude of the signal. Loudness is an intuitive indicator of a cough event when background noise levels are relatively low. The audio feature in Equation 1 describes a signal's energy as a function of its amplitude x in frame n.
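To make the relationship between STE and RMSE concrete, the sketch below assumes the standard definitions, namely that the STE of a frame is the sum of its squared amplitudes and that RMSE is the square root of the STE divided by the frame length N. The frame and hop sizes and all function names are illustrative choices, not values taken from the framework.

```python
import math
from typing import List, Sequence


def frame_signal(samples: Sequence[float], frame_length: int,
                 hop_length: int) -> List[Sequence[float]]:
    """Split a signal into frames; a trailing partial frame is dropped."""
    last_start = len(samples) - frame_length
    return [samples[s:s + frame_length]
            for s in range(0, last_start + 1, hop_length)]


def short_term_energy(frame: Sequence[float]) -> float:
    """STE: sum of squared amplitudes within one frame."""
    return sum(x * x for x in frame)


def rms_energy(frame: Sequence[float]) -> float:
    """RMSE: square root of the mean squared amplitude of one frame."""
    return math.sqrt(short_term_energy(frame) / len(frame))


# Toy signal: silence, a burst of amplitude 0.5, then silence again.
toy_signal = [0.0] * 4 + [0.5, -0.5, 0.5, -0.5] + [0.0] * 4
toy_frames = frame_signal(toy_signal, frame_length=4, hop_length=4)
toy_rmse = [rms_energy(f) for f in toy_frames]  # one value per frame
```

In this toy case the middle frame has an RMSE equal to the burst amplitude while the silent frames are zero, which is the property that makes RMSE usable as a loudness curve for thresholding.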
Given the RMSE, the Python audio-processing package librosa identifies any extreme changes in the signal's energy levels. We set a threshold such that the peak amplitude of a cough event is detected, but not the peak amplitudes of background noise events, such as doors closing, vehicular traffic, and conversation. However, one challenge with this approach is the possibility that the beginning of some cough events may be excluded. One attempt to mitigate this is to use backtracking to move the onset index back by a very small amount of time, i.e., 10 milliseconds. As a trade-off between code complexity and performance, we treat the second cough's onset as the first cough's offset. We also avoid detecting onsets in the last few frames of a sample. The intuition is that some audio files have a percussive peak at the end of the recording (e.g., the button press of the recording device). The algorithm has a number of settings that can be configured manually, divided between constants and parameters. These are described in Table 3.
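The steps above can be sketched in plain Python as follows. This is not the librosa-based implementation used by the framework; it is a simplified illustration that assumes an onset is a frame where the RMSE rises through the threshold, and whose argument names only loosely mirror the settings in Table 3. All numeric values are made up for the example.

```python
from typing import List, Sequence, Tuple


def detect_onsets(rmse: Sequence[float], threshold: float, min_distance: int,
                  trim_frames: bool = True, end_frames: int = 1) -> List[int]:
    """Return frame indices where RMSE rises through the threshold."""
    usable = len(rmse) - end_frames if trim_frames else len(rmse)
    onsets: List[int] = []
    for i in range(max(usable, 0)):
        crosses = rmse[i] >= threshold and (i == 0 or rmse[i - 1] < threshold)
        spaced = not onsets or i - onsets[-1] >= min_distance
        if crosses and spaced:
            onsets.append(i)
    return onsets


def onsets_to_segments(onsets: Sequence[int], hop_length: int, sample_rate: int,
                       backtrack_s: float, total_samples: int
                       ) -> List[Tuple[int, int]]:
    """Convert onset frames to (start, end) sample ranges.

    Each start is moved back by backtrack_s seconds; the next cough's start
    doubles as the current cough's end.
    """
    shift = int(backtrack_s * sample_rate)
    starts = [max(0, f * hop_length - shift) for f in onsets]
    ends = starts[1:] + [total_samples]
    return list(zip(starts, ends))


# Made-up RMSE curve: two coughs and a click in the final frame.
toy_curve = [0.0, 0.1, 0.6, 0.7, 0.1, 0.0, 0.5, 0.6, 0.1, 0.0, 0.9]
toy_onsets = detect_onsets(toy_curve, threshold=0.4, min_distance=2)
toy_segments = onsets_to_segments(toy_onsets, hop_length=512,
                                  sample_rate=22050, backtrack_s=0.01,
                                  total_samples=len(toy_curve) * 512)
```

With trimming enabled, the loud final frame is ignored, which mirrors the intent of skipping button-press clicks; disabling trim_frames in this toy example would report a third onset at the last frame.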
The cough samples differ in amplitude; thus, we normalise each sample to the range (−1, 1) to compensate. A cough has an average duration of 350ms, with an initial peak frequency of 400Hz, a secondary peak of highest continuous frequency at 4kHz, and frequency components of sound spread up to 20kHz (Murata et al., 1998).
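A minimal sketch of rescaling a sample to the range (−1, 1) is shown below. It assumes a linear min-max rescaling, which is one common way to achieve this; the exact formula used by the framework is not stated here, and the function name is hypothetical.

```python
from typing import List, Sequence


def min_max_normalise(samples: Sequence[float], low: float = -1.0,
                      high: float = 1.0) -> List[float]:
    """Linearly rescale samples so the minimum maps to low and the maximum to high."""
    lo, hi = min(samples), max(samples)
    if hi == lo:
        # A constant signal carries no dynamic range; map it to the midpoint.
        return [(low + high) / 2.0] * len(samples)
    scale = (high - low) / (hi - lo)
    return [low + (x - lo) * scale for x in samples]


# A quiet recording whose amplitudes span only a small part of the range.
quiet = [0.0, 0.1, -0.2, 0.3]
normalised = min_max_normalise(quiet)
```

After rescaling, quiet and loud recordings occupy the same amplitude range, so a single RMSE threshold can be applied across samples.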
To properly capture the characteristic information of the coughs, many studies re-sample the audio to 8kHz, as it has been observed that most of the frequency information of a cough sound lies between 350Hz and 4kHz (Shin et al., 2008), and, according to the Nyquist sampling theorem, the sampling rate must be at least double the highest frequency of interest to avoid aliasing. However, whilst 8kHz may be adequate for feature extraction and the generation of spectrograms for classification, we wish to create a new dataset of audible isolated coughs. Therefore, we select a higher sampling rate of 22.05kHz.

EXPERIMENTAL RESULTS
This section demonstrates the performance of our proposed cough segmentation framework on two open cough COVID-19 datasets.

Evaluation metrics
According to Murata et al. (1998), cough events last between 0.2 and 1 seconds. Therefore, we will analyse how many isolated coughs fall within this range; for those outside the range, we assume either that they have not been correctly segmented, or that background noise or silence was not adequately removed. Additionally, we will use the Signal-to-Noise Ratio (SNR) to assess the reduction of background noise, which we consider uninformative data. SNR, given in Equation 2, compares the level of the foreground sound, in this case a cough event, with the level of the background noise. It is measured on a logarithmic decibel scale, where the higher the SNR score, the higher the quality of the sample.
$$\mathrm{SNR}_{\mathrm{dB}} = 20 \cdot \log_{10}\left(\frac{\mathrm{amplitude}_{\mathrm{signal}}}{\mathrm{amplitude}_{\mathrm{noise}}}\right) \tag{2}$$
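The two metrics can be sketched as below. How the signal and noise amplitudes are estimated is not specified in this section, so the sketch assumes that RMS amplitude is used and that the cough and noise portions are supplied separately; the function names are hypothetical.

```python
import math
from typing import Sequence


def rms_amplitude(samples: Sequence[float]) -> float:
    """Root-mean-square amplitude of a block of samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def snr_db(signal: Sequence[float], noise: Sequence[float]) -> float:
    """SNR in decibels from the ratio of signal to noise amplitude (Equation 2)."""
    noise_rms = rms_amplitude(noise)
    signal_rms = rms_amplitude(signal)
    if noise_rms == 0.0:
        return math.inf  # no measurable noise
    if signal_rms == 0.0:
        return -math.inf  # no measurable signal
    return 20.0 * math.log10(signal_rms / noise_rms)


def within_cough_duration(num_samples: int, sample_rate: int,
                          min_s: float = 0.2, max_s: float = 1.0) -> bool:
    """Check whether a segment's duration lies in the 0.2 to 1 second range."""
    duration = num_samples / sample_rate
    return min_s <= duration <= max_s


# A signal ten times larger in amplitude than the noise is 20 dB above it.
toy_snr = snr_db([0.5, -0.5, 0.5, -0.5], [0.05, -0.05, 0.05, -0.05])
```

A tenfold amplitude ratio corresponds to 20dB, so each additional 20dB in the reported scores represents another factor of ten between cough and background amplitude.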

Experiments using COVID-19 open data
Prior to running the segmentation procedure, we perform some diagnostic procedures to analyse the performance of the onset detection algorithm. Firstly, we demonstrate the difference between using Max Normalisation, where the sample is normalised to the range (0, 1), and Min Max Normalisation, where the sample is normalised to the range (−1, 1), with the Root Mean Square Energy (RMSE) and Short-Term Energy (STE) of the signal graphed over the oscillogram of the signal. Max Normalisation is a common choice in the literature, but we propose using Min Max Normalisation, as it performs better when the sample contains any of the three cough types: 2-phase, 3-phase, or peal. It worked particularly well with peal coughs, as demonstrated in Figure 2 with coughs from the COUGHVID dataset and in Figure 3 with coughs from the Coswara (heavy recordings) dataset. When Max Normalisation was used, the third phase of a 3-phase cough, or the epochs of a peal cough, were incorrectly identified as individual coughs. This did not occur when Min Max Normalisation was used.
Next, we set the constants and tune the parameters of the onset detection algorithm (see Table 3). Table 4 outlines the constants used for all datasets. For the COUGHVID and Coswara datasets, Table 5 lists the optimal parameters for onset detection. With parameter tuning done, the onset detection algorithm is run on 100 samples each from the COUGHVID and Coswara (heavy recordings) datasets, and the resulting oscillograms of 10 coughs from COUGHVID and 10 from Coswara are shown in Figures 4 and 5.
Detected onsets are displayed on the oscillograms as red vertical lines. We observed that the algorithm worked well with both 2-phase and 3-phase coughs from both datasets, but could be hypersensitive to peal coughs, though this occurs more in Coswara than in COUGHVID. Whilst this is not a major issue, as the cough epoch still contains informative data, some data is lost from not having the full peal cough isolated in its entirety.
We found that our algorithm handles moderately noisy conditions well. Most background noise events were not classed as a cough, though with some exceptionally loud noises, misclassification did occur. We go into more detail about how the effects of this can be mitigated in Section 6. The next step is to test the full segmentation pipeline, comprising onset detection, segmentation, pre-processing, and export procedures, on the COUGHVID and Coswara datasets. We used a subset of 100 samples from each dataset in our segmentation experiment. The isolated audio samples were exported to WAV format with a sampling rate of 22.05kHz and a bit-depth of 16 bits. Our segmentation algorithm performed well: for the COUGHVID dataset, almost 90% of isolated samples were of high quality and under 25% of isolated samples were discarded; for the Coswara dataset, over 80% of isolated samples were of high quality and under 30% of isolated samples were discarded.

Summary of results
To summarise, we evaluate the datasets produced from COUGHVID and Coswara using the metrics outlined above, graphed as a comparison between the original and isolated datasets.
Firstly, looking at the COUGHVID dataset, the mean duration in our subset of 100 samples is 8.4 seconds, decreasing to 0.9 seconds for the isolated cough dataset. For Coswara, the mean duration in our subset of 100 samples is 6 seconds, decreasing to 0.8 seconds for the isolated cough dataset. Therefore, we hypothesise that the majority of isolated cough samples were correctly segmented for both datasets, as the mean cough durations are both within the assumed range of 0.2 to 1 seconds. This is further supported by Figures 6 and 8. As for the cough SNRs, the mean SNR of our subset of 100 COUGHVID samples is -76dB, compared with -19dB for the isolated cough dataset. The mean SNR of our subset of 100 Coswara samples is -74dB, compared with -19dB for the isolated cough dataset. For both datasets the latter is significantly higher, with Figure 7 and Figure 9 showing some isolated cough samples even reaching an approximate SNR of 10dB (this occurred more often in Coswara). Therefore, we hypothesise that the signal quality of the isolated cough samples is better than that of the original samples, which include uninformative data, i.e., background noise.
Lastly, we analyse the class distributions of both the COUGHVID and Coswara subsets of 100 samples, as well as of the isolated cough datasets produced from them, to see whether the segmentation procedure has had an effect on the class imbalance (as both datasets are imbalanced). From Figures 10 and 11, we see that the class imbalance has improved by 2.35% and 6.33% respectively. Our understanding is that the segmentation procedure may positively impact the class imbalance in two ways: by segmenting an original positive sample into multiple positive samples, which increases the frequency of samples in the positive class, and by decreasing the proportion of the negative class via quality filtering (negative class samples are predominant and more likely to be discarded for low quality).
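The mechanism described above can be illustrated with a small calculation. The counts below are purely hypothetical and are not results from COUGHVID or Coswara; they only show how the positive share can change when each original sample is replaced by the isolated coughs that survive quality filtering.

```python
from typing import List, Sequence, Tuple


def positive_share(labels: Sequence[str]) -> float:
    """Fraction of samples labelled positive."""
    if not labels:
        raise ValueError("labels must not be empty")
    return sum(1 for label in labels if label == "positive") / len(labels)


def expand_labels(samples: Sequence[Tuple[str, int]]) -> List[str]:
    """Repeat each original label once per isolated cough kept after filtering."""
    expanded: List[str] = []
    for label, kept_segments in samples:
        expanded.extend([label] * kept_segments)
    return expanded


# Hypothetical subset: 2 positive and 8 negative recordings, where positive
# recordings retain 4 isolated coughs each and negative recordings retain 2.
hypothetical = [("positive", 4)] * 2 + [("negative", 2)] * 8
share_before = positive_share([label for label, _ in hypothetical])
share_after = positive_share(expand_labels(hypothetical))
```

In this made-up case the positive share rises from 20% to about 33%; the direction and size of the change in practice depend on how many coughs each class retains.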

RELATED WORK
Cough segmentation to support COVID-19 detection is still in its infancy. Nevertheless, there is relevant work in other respiratory research, as follows.
Amrulloh et al. (2015) propose a complex automated cough extraction methodology. The process begins with noise reduction using a High Pass Filter (HPF) and a Power Subtraction Filter (PSF). Next, the signal is split into n sub-blocks, from which the features Mel-Frequency Cepstral Coefficients (MFCCs), formant frequency, Zero Crossing Rate (ZCR), Non-Gaussianity Score (NGS), and Shannon entropy are extracted. The feature matrix is then input into a trained Artificial Neural Network classifier to differentiate sub-blocks into cough and non-cough classes. Smoothing using a moving average filter is followed by thresholding and cough event identification.
Infante et al. (2017) propose a method where the data is first smoothed by applying local regression using a 2nd-degree polynomial model. A peak detection algorithm is applied to the smoothed signal, and then each peak is analysed individually. The zero-crossing of the first derivative is used to determine the cough onset, and the slope of the trailing edge of the cough is used to determine the offset.

CONCLUSION AND FUTURE WORK
In this paper, we have proposed a novel audio segmentation framework for cough-based COVID-19 detection. In optimal conditions, where cough events are evenly spaced with low background noise, our framework worked well on both 3-phase and 2-phase coughs, correctly isolating, for example, a sample containing n=3 coughs into three distinct audio files, each containing a single cough with normalised amplitude and no uninformative data. However, our algorithm still has limitations at this POC stage. It is hypersensitive to some peal coughs characterised as epochs, or clusters of two or more cough sounds, identifying each as an individual cough of between 0.1 and 0.5 seconds. This was because some frames within the epoch have a high enough RMSE value to exceed the threshold, thereby being classified as a standalone cough. Increasing the threshold addressed this problem, at the cost of some incorrectly identified onsets. Adapting the algorithm to identify cough clusters as opposed to individual cough sounds could mitigate the limitation of hypersensitive onset detection, but this is outside the scope of this paper and is considered potential future work.

Figure 1: An oscillogram, spectrogram, and Melspectrogram of an audio sample containing three 2-phase coughs from the Coswara dataset.

Figure 2: RMSE and STE graphed over an oscillogram of a 3-phase and peal cough from the COUGHVID dataset.

Figure 6: Comparison of cough durations from COUGHVID and the isolated cough dataset.

Figure 7: Comparison of cough SNRs from COUGHVID and the isolated cough dataset.

Figure 8: Comparison of cough durations from Coswara heavy and the isolated cough dataset.

Figure 9: Comparison of cough SNRs from Coswara heavy and the isolated cough dataset.

Figure 10: Class distributions of COUGHVID and isolated cough datasets.

Table 2: The Virufy Common Data Format (CDF) defines a column structure for dataset metadata. Benefits of standardisation include faster and more efficient pre-processing of data, data filtering via cough_detected, and more emphasis on clinical data such as pcr_test_result for clinically validated and pcr_test_result_inferred for partially clinically validated data.
cough_detected: Probability that the audio file contains an actual cough submission.
audio_path: File path to the audio file containing patient's cough submission.
audio_type: Either cough or speech.
age: Age of patient.
biological_sex: Sex at birth of patient. This can be male, female, or NaN.
reported_gender: Reported gender of patient.
submission_date: Date cough was submitted by patient.
pcr_test_date: Date PCR test for presence of COVID-19 was taken.
pcr_result_date: Date test result from the PCR test for presence of COVID-19 was received.
respiratory_condition: Boolean indicator of whether patient suffers from a respiratory condition.
fever_or_muscle_pain: Boolean indicator of whether patient was suffering from a fever or muscle pain.
pcr_test_result: Result of patient's PCR test for presence of COVID-19. This can be positive, negative, recovered, or untested.
pcr_test_result_inferred: Best guess of a patient's COVID-19 diagnosis based on information specific to the dataset source. This can be positive, negative, recovered, or untested.
covid_symptoms: Boolean indicator of whether patient was experiencing COVID-19 symptoms.

accuracies (following the Law of Large Numbers) for data hungry DL models such as neural networks, but also general ML models if the data is of good quality.

Both datasets have a large variation in sample duration (0.2-10 seconds), number of coughs per sample (1-15 individual coughs), and background noise levels. Thus, the proposed algorithm intends to isolate individual coughs from each sample, regardless of the sample duration, in turn removing any background noise or silence present in the sample.

Table 3: A list of configurable constants and parameters used by the cough segmentation algorithm. Note that the constants can only be changed in the source code, whereas the parameters can be set at runtime with each instance of the cough segmentation tool created by the user.
The threshold of Root Mean Square Energy from which onsets are detected.
Minimum Distance (int): The minimum distance between detected high RMSE frames.
Backtrack (float): The number of seconds to subtract from onset times.
Trim Frames (bool): A toggle to turn on functionality for trimming the last onset frames.
End Frames (int): The number of onset frames to trim from the end of the audio file.

Table 4: The selected values for the constants of the onset detection algorithm.

Table 5: The optimal parameters for the onset detection algorithm with the 2 datasets.