Speech Intelligibility in Adverse Conditions in Recorded Virtual Auditory Environments

The purpose of this investigation was to examine the effects of presentation mode on speech intelligibility in adverse listening conditions as signal-to-noise ratio (SNR) was systematically varied in anechoic and reverberant environments. Speech intelligibility scores were obtained from 21 normally hearing listeners using a nonsense syllable test. The syllables were recorded in three environments (mono anechoic, spatial anechoic, and spatial reverberant) at three SNRs (0, +5, and +9 dB) using two simultaneous interfering sound sources. The findings indicate that (a) percent-correct performance was about 40% lower with the traditional diotic presentation than with a virtual presentation, and (b) performance in the virtual reverberant environment was about 5% lower than in the virtual anechoic environment.


Introduction
Spatial hearing is undoubtedly one of the most complex of all biological abilities. Understanding the processes underlying spatial hearing presents a major challenge to psychoacoustic research. Interest is becoming widespread, because this ability is critical to applied and basic research in disciplines such as communications, entertainment, architectural design, rehabilitation, and safety. This study focuses on one important aspect of spatial hearing as it relates to safety and communication. Specifically, the experiment is concerned with speech recognition in clinical and virtual adverse conditions.
Understanding speech in adverse conditions is an extremely important and challenging task for the human auditory system. During daily conversations, most people possess the ability to "tune out" interfering noises that emanate from various directions, focusing instead on signals of interest. When adverse conditions disrupt speech perception, miscommunication is usually a temporary, albeit annoying, inconvenience because most conversations offer ample opportunity to repeat words or phrases not initially understood. However, the opportunity to clarify disappears in occupations such as the military, surveillance, telecommunication, entertainment, and security, where communication is conducted in unique and demanding circumstances.
In these occupational settings, the environment imposes severe challenges to the accurate understanding of speech. Civilian industrial or military personnel must routinely accomplish demanding missions relying heavily on concise, clear communication in adverse conditions [1][2][3]. For example, military personnel must simultaneously monitor several radio channels or decipher speech in the presence of competing talkers and/or background noise. Soldiers often conduct these listening tasks in the reverberant surroundings of tanks, tactical operation environments, communication centers, intelligence support vehicles, and so forth. Unfortunately, research that examines communication in real-world complex environments is extremely limited. Furthermore, qualifying auditory examinations for personnel are usually conducted in artificial environments and may not evaluate real-world communication abilities important for safe and effective job performance. Traditionally, many clinical and psychoacoustic experiments are conducted under headphones in sound-treated or anechoic chambers, often using stimuli such as pure tones, bands of noise, or synthetic speech, which do not occur in nature. While these environments and stimuli are more easily controlled and provide a baseline for optimum performance, they can only approximate real-world performance. Auditory images produced by traditional recordings and headphone presentation do not contain the natural ear-specific auditory cues produced with a free-field presentation. Consequently, such stimuli are perceived as originating from inside the head, halfway between the two ears. Yet headphone speech test results are often taken as an accurate portrayal of real-world performance.
The real-world or free-field acoustic characteristics of a sound measured at the eardrum depend on the source itself and the pathway the sound must travel to the ear. For example, the length of the path will affect the phase of the signal in relation to the source. Additionally, the head's position in the pathway will produce a head shadow and change the intensity of the signal relative to the intensity of the source. Furthermore, if a listener is wearing safety equipment or protective headgear, the signal acoustics are further modified. Moreover, the external ear's filtering effects also influence the intensity and phase of the signal relative to the sound source in a very individualized manner: the outer ear and ear canal behave as a filter and thus alter the frequency composition of the signal. These complex spectral auditory cues are thought to be responsible for externalization of sound images. The auditory input, processing, and subsequent perceptual performance are further complicated by adverse environments such as background noise, multiple competing talkers, or reverberation.
Signal processing technology now allows reproduction of the complex spectral effects of the path from a sound source to the eardrum. In fact, digital signal processing can employ filters to reproduce the same spectral response that an individual would receive at his or her two eardrums. These techniques can achieve a three-dimensional (3-D) quality of sound under headphones and simulate realistic virtual spatial locations and listening environments [4][5][6][7][8]. When filtered appropriately, sounds are heard as externalized in headphone presentation. In other words, listening through traditional headphones gives the listener the illusion that stimuli emanate from external locations, as in a natural listening environment. Auditory images presented in this manner are commonly referred to as 3-D audio or virtual auditory displays.
The purpose of the present research was to examine the effects of traditional and virtual adverse conditions (noise and reverberation) on speech intelligibility. This study examined the effect of presentation method on speech intelligibility in conditions of noise and reverberation as signal-to-noise ratios were systematically varied. The investigation also evaluated the influence of talker gender on speech intelligibility in anechoic and reverberant environments as a function of signal-to-noise ratio and presentation method. By employing carefully controlled stimuli and environmental factors, the experimenters attempted to bridge the gap between previous laboratory experiments and the practical challenges faced by personnel who must communicate in real-world settings. Additionally, the results of this study bear on whether existing clinical methods that evaluate speech communication in noise should be replaced by techniques that employ, rather than eliminate, natural binaural acoustic cues.

Review of the Literature

Binaural Factors Underlying Speech Perception
Even though most people hear signals very well with headphones, free-field binaural hearing offers several advantages, including localization of signals in space, signal separation, and signal enhancement in noise and reverberation. Binaural hearing allows a listener to selectively attend to one direction and tune out sounds coming from other directions that could interfere (the cocktail party effect). Spatial hearing may also offer solutions to communication difficulties experienced by radio operators, tankers, and surveillance personnel. The simultaneous monitoring of several communication channels is usually accomplished using diotic or monotic headphones inside a small mobile communication center. Current military communication equipment designs do not employ stereophonic, dichotic, or virtual listening technologies, although future combat military operational plans include the visual "virtual battlefield." Another possible contributor to increased speech intelligibility in free-field listening, as opposed to monaural headphone listening, is the central auditory system's ability to suppress noise internally in some binaural listening conditions based on interaural differences. This ability, called binaural release from masking or the masking level difference (MLD), is predominately a laboratory phenomenon, but it indicates that the auditory system can internally improve the SNR if the interaural differences for the signal and masker are different [73,74]. It is reasonable to consider that even though the MLD is a threshold measurement and listening to conversations at a party is a supra-threshold task, the same underlying binaural mechanisms may operate in both situations.

Speech Perception in Virtual Environments
Listeners generally perceive real-world sounds as externalized. In contrast, traditional binaural headphone presentation of sounds generally produces images that are perceived inside the head. Externalization of sounds presented in the free field (and their generally accurate localization) occurs because of the complex filtering of the acoustic signal produced by a listener's head, torso, and pinnae. Because such filtering effects are not present in the traditional headphone presentation of signals, the latter mode of presentation does not result in externalization of sound images. Several studies in recent years have attempted to simulate under headphones the actual acoustic signals that would arrive at the eardrums of the listener if the sound source were in a specific location relative to the listener [4][5][6][7][8]. This is done by filtering the inputs to the two ears with the so-called "head-related transfer functions," or HRTFs, the source-to-eardrum transfer functions (one for each ear) specific to the particular source position being simulated. The result, when this technique is applied correctly, is that the listener wearing headphones perceives the sound as externalized to the appropriate location in space. In other words, listening through headphones gives the listener the illusion that stimuli emanate from external locations, as in a natural listening environment. The three-dimensional spatial effect thus achieved under headphones is called "virtual" auditory space, and the presentation mode is often referred to as "virtual listening" or "virtual or 3-D auditory displays." Investigators are now employing 3-D or virtual technology to study speech communication and intelligibility in a variety of adverse conditions [2,4,20,36,94].
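The HRTF filtering described above reduces, per ear, to a convolution of the source signal with a head-related impulse response (HRIR, the time-domain counterpart of the HRTF). The sketch below is illustrative only: the HRIR values are toy numbers chosen to mimic a near/far ear pair, not measured data.

```python
import numpy as np

def render_virtual_source(mono, hrir_left, hrir_right):
    """Spatialize a mono signal by convolving it with a pair of
    head-related impulse responses (HRIRs), one per ear.  A different
    HRIR pair is required for each simulated source position.
    Both HRIRs are assumed to have the same length."""
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    return np.stack([left, right])  # 2 x N binaural signal

# Toy example: a click rendered with made-up HRIRs in which the
# near ear receives the sound immediately and strongly, while the
# far ear receives it delayed and attenuated.
click = np.zeros(8)
click[0] = 1.0
hrir_l = np.array([0.9, 0.3, 0.0])   # near ear: strong, immediate
hrir_r = np.array([0.0, 0.5, 0.2])   # far ear: delayed, attenuated
binaural = render_virtual_source(click, hrir_l, hrir_r)
```

Presented over headphones, the interaural time and level differences encoded this way are what produce the externalized "virtual" image; real systems use measured HRIR pairs for each simulated position.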
The task of measuring numerous individual HRTFs (i.e., one for each ear and for each spatial location that must be simulated) is expensive and time-consuming, as well as technologically challenging. An alternative method to achieve externalization involves recording experimental signals and noise through microphones placed in the ear canals of KEMAR (Knowles Electronics Manikin for Acoustic Research), thus capturing the signals arriving at KEMAR's eardrums. The two-channel recordings are subsequently presented to the subject's two ears through insert earphones (Fig. 1). Recordings of this nature incorporate the effects of the head and pinnae on the incoming signals and reproduce the interaural differences as they exist in free-field listening conditions [32]. However, any method that employs a generic manikin will not always produce realistically externalized sounds. A listener presented with KEMAR-recorded signals may not externalize them in the same way as if the signals were filtered by his or her own HRTFs (e.g., because KEMAR's pinna shapes differ from the listener's individual pinna shapes).
In summary, these studies have demonstrated that the ability to understand speech in noisy and reverberant environments is complex and depends upon numerous variables. The cues provided by monaural listening appear to contribute to speech intelligibility to a greater degree than do binaural cues; chief among these cues is the relative energy of the signal and noise (SNR). However, the advantages of spatial hearing provided by binaural listening emerge in adverse conditions such as low SNRs, reverberation, or a combination of the two. The literature indicates that angular separation in real or virtual listening conditions improves speech intelligibility, but no study has compared the effects of noise and reverberation on intelligibility across talker gender between traditional clinical evaluation methods and recordings that employ generic binaural cues. Additionally, many clinical instruments that examine speech intelligibility, as well as communication headsets, present speech in noise diotically, ignoring the advantages of spatial hearing. The present study compares performance with traditional diotic presentation to performance that incorporates the binaural advantages afforded by listening with two ears in more of a real-world environment.

Subjects
Twenty-one adult native speakers of American English with bilaterally normal hearing were paid volunteers in this study. Subjects' ages ranged from 21 to 38 years. Nineteen subjects were female and two were male. Subjects initially passed a pure-tone audiometric and immittance screening test confirming normal middle ear function and auditory thresholds. In an effort to avoid Idiopathic Discriminatory Dysfunction (IDD), no subject who reported (during the initial interview) disturbed speech intelligibility in background noise was included [98]. Frequent breaks were allowed within a session to avoid fatigue.

Testing Environment and Stimuli
Subjects were tested in groups of three at three separate listening stations (Fig. 1), with each subject seated individually in a lighted, sound-insulated station. During all sessions, each subject occupied the same individual listening station and wore the same ER-4 insert earphones.

Figure 1: Recording and Presentation Environment
Nonsense syllables were chosen as the speech tokens to be identified. This choice was made in an attempt to control the monaural factors underlying speech perception in an auditory-only presentation. For example, it is well documented that clearly spoken speech is easier to understand than conversational speech, even for normal-hearing adults, particularly in noise and reverberation [44]. Nonsense syllables ensured that articulation and speaking style did not introduce unwanted sources of variance. This measure was also selected because of the high correlation between nonsense syllable recognition and audiometric configuration, and because of its sensitivity to the adverse influence of reverberation [72]. Additionally, the brief nature of nonsense syllables closely duplicates the brief communications found in a secure military environment.
The experimental speech stimuli were the Nonsense Syllable Test (NST) [99]; the syllables were taken from recordings of the UCLA Nonsense Syllable Test [100]. They consisted of combinations of 23 consonants and three vowels (/a/, /u/, /i/), with the consonant in either initial (CV) or final (VC) position. Some consonants do not naturally occur in English in the initial or final position; these combinations were therefore excluded from the syllable set. The syllables were digitally recorded, calibrated, and standardized for both a male and a female talker. For a given run, the 129 syllables were randomly presented without replacement.
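The inventory arithmetic can be made concrete: 23 consonants x 3 vowels x 2 positions gives 138 candidate tokens, so 9 tokens (three consonant/position pairs, each crossed with the three vowels) must have been excluded to arrive at 129. The sketch below uses placeholder consonant labels and an arbitrary, hypothetical exclusion set, since the paper does not list the excluded combinations.

```python
import random

VOWELS = ["a", "u", "i"]
CONSONANTS = [f"c{i:02d}" for i in range(23)]  # placeholders, not real phonemes

# Hypothetical: three consonant/position pairs excluded because they do
# not occur in English in that position (identities chosen arbitrarily).
EXCLUDED = {("c00", "CV"), ("c01", "VC"), ("c02", "VC")}

syllables = []
for c in CONSONANTS:
    for v in VOWELS:
        if (c, "CV") not in EXCLUDED:
            syllables.append(c + v)  # consonant-initial (CV)
        if (c, "VC") not in EXCLUDED:
            syllables.append(v + c)  # consonant-final (VC)

def make_run(rng=random):
    """One experimental run: all 129 syllables in random order,
    drawn without replacement."""
    return rng.sample(syllables, k=len(syllables))
```

Each of the 22 recordings described below corresponds to one such randomized ordering of the full 129-token set.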
The competing noise was digitally recorded multi-talker babble obtained from Auditect of St. Louis. To ensure an uncorrelated, constant SNR with minimal envelope fluctuation, two separate digital 20-talker babble recordings were used. Multi-talker babble was the most effective masker for the speech tokens, as it provided a worst-case listening scenario (i.e., a masker with a long-term speech spectrum and little or no temporal variation in the envelope). This spectrum mirrors the long-term rms 1/3-octave-band idealized speech spectrum derived from ANSI S3.5-1969. Speech recordings were made in an anechoic chamber (6 m x 6 m x 6 m). The syllable lists consisted of computer-generated CV and VC nonsense syllables randomly combined from all initial/final consonant tokens in all three vowel environments for a male and a female talker. The CV and VC stimuli were presented from a single loudspeaker in the center of the loudspeaker array, and two independent babble sources were presented continuously through two different loudspeakers positioned at approximately ±45° azimuth, 1.8 m from KEMAR. A 1000 Hz calibration tone was also recorded and subsequently used to set stimulus levels appropriately. For the virtual conditions, all stimuli (nonsense syllables plus babble) were recorded through the KEMAR manikin's two ears while it was positioned in the center of the anechoic chamber facing 0° azimuth. Etymotic ER-11 microphones were placed at the position of KEMAR's eardrums. Equalization filters were used to compensate for the frequency response of the ER-4 insert earphones that subjects used. For the diotic control condition, KEMAR was removed from the chamber, and a single-channel recording was made with a single microphone at the position of KEMAR's head.
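The earphone equalization step can be illustrated with a frequency-domain inverse filter. This is a minimal sketch, not the study's actual filter design: it assumes circular convolution and a small regularization constant, whereas practical equalizers also limit inverse gain and manage phase explicitly.

```python
import numpy as np

def equalize(recorded, transducer_ir, eps=1e-6):
    """Compensate a recording for a playback transducer's frequency
    response by regularized spectral division (Wiener-style inverse).
    `transducer_ir` is zero-padded to the recording length, so the
    inversion is exact only under circular convolution."""
    n = len(recorded)
    H = np.fft.rfft(transducer_ir, n)   # transducer response
    R = np.fft.rfft(recorded)
    # Divide out H, with eps preventing blow-up near spectral nulls.
    return np.fft.irfft(R * np.conj(H) / (np.abs(H) ** 2 + eps), n)
```

Applied to each ear's KEMAR recording before playback, a filter of this general kind removes the insert earphone's coloration so that the signal at the subject's eardrum approximates what KEMAR's microphones captured.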
Each of the 129 syllables, spoken by the two talkers, was digitally stored in 9 different experimental conditions: 6 virtual presentation environments (2 virtual environments [virtual anechoic vs. virtual reverberant] x 3 SNRs [0, +5, and +9 dB]) plus 3 diotic conditions (3 SNRs), for both talkers. Additionally, 2 practice recordings (diotic and virtual) were made in an anechoic quiet environment for each talker, yielding a total of 22 recordings, each consisting of a different random ordering of the 129 syllables. This resulted in 2838 recorded syllables that were later presented to subjects according to the random schedule described below. When the stimuli were played back to the subjects, the level of the target syllables was set such that the average level was 65 dB SPL. The babble presentation level was set (via the recorded calibration tone) to 56, 60, or 65 dB SPL to achieve SNRs of +9, +5, or 0 dB, respectively.
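The level bookkeeping for the playback SNRs is simple: with the target fixed at 65 dB SPL, the required babble level is the target level minus the desired SNR. A sketch, using the values from the text:

```python
TARGET_LEVEL_DB_SPL = 65  # average level of the target syllables

def babble_level(snr_db, target_db=TARGET_LEVEL_DB_SPL):
    """Babble presentation level (dB SPL) needed for a given SNR
    when the target level is held fixed."""
    return target_db - snr_db

# Reproduces the three babble levels used in the study:
levels = {snr: babble_level(snr) for snr in (9, 5, 0)}
```

Fixing the target and moving only the masker keeps the speech audibly identical across SNR conditions, so any performance change is attributable to the masker level alone.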
Subjects were presented the stimuli in one of two listening conditions: (a) in the 3-D or virtual condition, KEMAR's left- and right-ear recordings were presented to the subject's left and right ears, respectively, thus reproducing the interaural information and spectral cues provided by KEMAR's torso, head, and pinnae; and (b) in the diotic or mono condition, the single-channel (microphone) recording was presented identically to the subject's left and right ears. Diotic presentation was employed only for the anechoic condition, to replicate typical recordings used in clinical evaluations. Reverberant room conditions were created by positioning acoustically reflective panels within the anechoic chamber around KEMAR.

Procedure
The subjects were seated comfortably in the auditory lab wearing ER-4 insert earphones. All testing presented the speech targets at 65 dB SPL. This level was well above threshold, while the multi-talker babble at all SNRs did not exceed uncomfortably loud or noise-hazardous levels. The subjects were given an open-response version of the test, with answer sheets containing a numbered set of blank lines on which to write each syllable as it was heard. This open-response task was used to gather the broadest possible information about speech perception in the absence of any response constraints. Subjects were tested in seven groups of three; the three subjects within a group were always presented the same stimuli in the same conditions, and the order of conditions and talker gender was counterbalanced across groups. Syllables were presented in lists of 129, with each syllable chosen randomly without replacement. The condition and talker were held constant for each 129-syllable list. After each syllable presentation there was a 5000-ms pause, during which the subject recorded his or her response. A list comprised two blocks of syllables from either the male or the female recording; the first block of 64 syllables was completed before the second block of 65 syllables began, and the talker's gender remained constant throughout the entire 129-syllable list. One half of the subject groups began with the male talker recordings, while the other half began with syllables spoken by the female talker. Short breaks were given between syllable blocks, with longer breaks between lists. Each 129-syllable list required approximately 20 minutes. All subjects practiced listening to the recorded stimuli in all conditions for a minimum of 3 hours. Each subject was required to pass a "qualification" test with 100% correct performance on stimuli presented in the diotic anechoic quiet condition; no subject could proceed until this level of proficiency was achieved.

Practice
Three practice sessions consisting of at least two 64-syllable blocks for all 22 conditions were administered prior to data collection. Subjects were allowed to learn the task and reach asymptotic performance before data collection began. The 1408 practice tokens exceeded the recommended minimum of 250 needed to minimize procedural learning effects [35]. Additionally, a 10-syllable warm-up trial was provided at the beginning of each condition.

Overall Analysis
All response scoring was blind. The number of nonsense syllables correctly identified in each listening condition was tabulated for each subject. The mean raw scores and their standard deviations for each condition are shown in Table 1. However, raw score means do not adequately describe stimulus intelligibility: traditional speech testing normally evaluates performance on a percent-correct scale. Therefore, the raw scores were converted to percent correct in the figures to illustrate listener ability in a more meaningful manner.
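The conversion from raw scores to the percent-correct scale used in the figures, along with per-condition summary statistics, can be sketched as:

```python
import statistics

TOTAL_SYLLABLES = 129  # tokens per list, as described in the Method

def percent_correct(raw_score, total=TOTAL_SYLLABLES):
    """Convert a raw syllable-identification count to percent correct."""
    return 100.0 * raw_score / total

def condition_summary(raw_scores, total=TOTAL_SYLLABLES):
    """Mean and standard deviation of percent-correct scores
    across subjects for one listening condition."""
    pct = [percent_correct(r, total) for r in raw_scores]
    return statistics.mean(pct), statistics.stdev(pct)
```

Because every condition uses the same 129-token denominator, the transformation is linear and does not alter the outcome of the analyses run on raw scores.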

Table 1. Raw Score Means and Standard Deviations for All Conditions
An overall three-factor [3 presentation methods (diotic, virtual anechoic, and virtual reverberant) x 3 SNRs (0, +5, and +9 dB) x 2 talker genders (male vs. female)] repeated-measures analysis of variance (ANOVA) was conducted. The experimental score, the number of correctly identified nonsense syllables (out of a possible 129 tokens), was the dependent variable for all statistical treatments. All statements of statistical significance were based on α = 0.05.
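The repeated-measures logic can be illustrated with its one-way building block: the variance attributable to subjects is removed from the error term before forming the F ratio. A minimal sketch follows (the full three-factor partitioning adds interaction terms, which are omitted here).

```python
import numpy as np

def rm_anova_oneway(scores):
    """One-way repeated-measures ANOVA.
    scores: (n_subjects, k_conditions) array of raw scores.
    Returns (F, df_effect, df_error)."""
    n, k = scores.shape
    grand = scores.mean()
    ss_total = ((scores - grand) ** 2).sum()
    # Between-subjects variability is partialled out of the error term.
    ss_subjects = k * ((scores.mean(axis=1) - grand) ** 2).sum()
    ss_conditions = n * ((scores.mean(axis=0) - grand) ** 2).sum()
    ss_error = ss_total - ss_subjects - ss_conditions
    df_effect, df_error = k - 1, (k - 1) * (n - 1)
    F = (ss_conditions / df_effect) / (ss_error / df_error)
    return F, df_effect, df_error
```

With n = 21 subjects and k = 3 levels, this yields the F(2, 40) degrees of freedom that appear in the SNR effects reported below.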
All three main effects were significant: presentation method, SNR, and talker gender.

The Effects of Presentation Method and SNR
The main effects of presentation method and SNR are apparent in Figure 2, where the data are collapsed across talker gender. Intelligibility was significantly lower in the diotic presentation method than in either virtual (anechoic or reverberant) condition. Clearly, subjects had greater difficulty identifying the nonsense syllables when they were recorded and presented diotically. It can also be seen that performance improved with increasing SNR. The predictable response pattern to SNR found in diotic presentations also appears in the virtual presentations, as scores behaved in a similar manner.
Considering the two virtual conditions, performance (collapsed across genders) in the reverberant environment was degraded compared to the anechoic condition except at the most favorable SNR (+9 dB). To test the significance of this effect, another three-factor [2 environments (anechoic vs. reverberant) x 3 SNRs (0, +5, and +9 dB) x 2 talker genders] repeated-measures analysis of variance was conducted on the raw scores from the virtual conditions. Significant main effects of environment [F(1,20) = 39.95, p < .001], SNR [F(2,40) = 662.03, p < .001], and talker gender [F(1,20) = 378.183, p < .001] were obtained. The only significant interaction was between SNR and environment [F(2,40) = 9.5, p < .001]. Thus, the small performance difference (5%) at 0 and +5 dB SNR was statistically significant, but performance in the anechoic and reverberant conditions at +9 dB SNR was essentially the same.

The Effects of Talker Gender
The effect of talker gender was also significant. Subjects identified female nonsense syllables with greater accuracy than male tokens in both virtual recordings at all three SNRs. This was not the case for the stimuli recorded in the diotic mode. In the most adverse listening condition (0 dB SNR), intelligibility exhibited the same trend as in the virtual recordings, with higher identification accuracy for female tokens. However, in the more favorable diotic listening conditions (+5 and +9 dB SNR) there was little if any difference in the identification of male versus female tokens.

The Effect of Virtual Presentation Mode on Identification Performance
The results of this study clearly indicate that, in the presence of multi-talker noise positioned at two locations (±45°), virtual nonsense syllables recorded through a manikin (in either anechoic or reverberant environments) produced higher identification scores than identical nonsense syllables recorded diotically. The benefit from the virtual presentation, collapsed across talkers and SNRs, was approximately 41%. This improvement is somewhat higher than the benefit (15%-28%) noted by previous investigators [20,36]. In a similar manner, Bronkhorst and Plomp [32] also pre-recorded speech in an anechoic room through a manikin in an attempt to measure the effect of spatially separating a single competing noise on speech intelligibility. In that study, the level required for subjects to achieve an overall 50% intelligibility threshold was measured by varying the signal while maintaining the same noise level. Separating the noise and speech in auditory space by 90° enabled subjects to obtain 50% correct intelligibility when the signal was 10.1 dB softer than when the noise and signal were located at the same position. In a more recent study that employed virtual signals created with individualized HRTFs, Koehnke and Besing [4] used a similar adaptive threshold technique, but varied the noise while keeping the speech signal constant. They reported an even larger advantage (13.7 dB) when the single competing noise and the stimuli were separated by 90°.
It is difficult to directly compare the performance improvement in the present experiment (where SNR was held constant) to that of the studies of Bronkhorst and Plomp [32] and Koehnke and Besing [4], who varied SNR to find the 50% performance level. However, the results are in qualitative agreement: in all three studies, performance improved when signal and noise were spatially separated (as in the virtual condition) compared to when signal and noise came from the same location (as in the diotic condition).
It should be noted that the aforementioned studies employed only one noise source; the improvement may differ with multiple noise sources. Subjects in the present study were able to identify very difficult stimulus sets even though competing noise sources simultaneously originated from not one but two spatially separated locations. Additionally, the competing noise used in this study was 20-talker babble. Multiple talkers tend to degrade speech intelligibility more than random noise, owing to the similarity between the spectra and modulations of the babble and those of the target speech. The improvement in intelligibility was corroborated by subjects' reports of their ability to better understand the syllables in the virtual listening conditions.

The Effects of Reverberation on Performance in Virtual Environments
It is generally accepted that speech perception in noise is degraded by reverberation. In the present study, the virtual reverberant environment (collapsed across SNRs) significantly decreased speech intelligibility compared to the virtual anechoic recordings (average difference: 3.4%). The decrease in intelligibility for these normal-hearing listeners was less than expected, although generally consistent with previous investigations of reverberation and noise in real and virtual environments [4,36]. One possible reason that reverberation had a relatively small effect on performance may be methodological. In the present study, the nonsense syllables were presented in the traditional manner for the UCLA nonsense syllable test (without a carrier phrase). Houtgast and Steeneken [101] showed that the effect of reverberation on speech intelligibility was smaller for tests without carrier phrases than for tests with them. This difference may arise because a carrier phrase precedes the speech target, increasing the reflected energy that can serve as a masker. Speech tests without carrier phrases, on the other hand, present the speech target immediately into the reverberant environment, so listeners hear the target token without the persistent reflected speech energy a carrier phrase would introduce. To the extent this difference holds, the present results may actually underestimate the influence of reverberation on meaningful material.
Even with spatial separation and the absence of a carrier phrase, the results indicate that persistent reverberant energy did significantly degrade performance. This delayed or reflected energy can influence important aspects of a speech signal and interfere with intelligibility by distorting the original signal [102]. Low frequencies are typically absorbed less efficiently by reflective surfaces. Because vowel (low-frequency) sounds are usually more intense than consonant (high-frequency) sounds, the reflected vowel phonemes tend to mask consonant manner and place information, particularly in the VC combinations. As found in this study, noise and reverberation combine synergistically to degrade speech recognition.
The effect of reverberation in this study has important military relevance. For example, soldiers conduct military operations in urban terrain through house-to-house and room-to-room searches of areas of interest. Likewise, communication in tactical operations centers (TOCs) occurs in rooms, tents, or enclosures with similar or higher reverberation times. The reflections in a reverberant environment may be thought of as a type of masking noise, although the effects of noise and reverberation are perceptually different. Soldiers can easily detect the presence of noise, whereas the presence of a highly reverberant field is often not apparent. The subtle effects of reverberation may not be evident to an untrained listener, yet they may result in miscommunication as the ability to understand speech is degraded. Because reverberation can adversely affect communication in these enclosed environments, soldiers must be made aware of the possibility of decreased communication ability in reverberant surroundings.

The Effects of Signal-to-Noise Ratio (SNR) on Performance
SNR appeared to influence intelligibility in a similar manner for all presentation methods (virtual or diotic; see Fig. 3). Subject performance increased as SNR increased, although absolute identification remained poor in the diotic conditions at all SNRs. For example, identification of nonsense syllables doubled (15.95% to 31.82%) in the diotic presentation as SNR improved from 0 dB to +9 dB. Likewise, increasing the SNR from 0 dB to +9 dB also significantly improved intelligibility in the virtual conditions, increasing scores an average of 24.2% and 29.7% in the virtual anechoic and virtual reverberant environments, respectively. Apparently, the auditory system behaves similarly with respect to signal-to-noise ratio in diotic and virtual presentation modes, although absolute intelligibility is very different.

Effects of Talker Gender on Identification Performance
A major finding of the present study was that syllables spoken by this female talker were identified at a higher rate than those spoken by this male talker, at least in the virtual environments. This difference was not as evident for diotic presentation. Specifically, female tokens (collapsed across all SNRs) were more easily identified than male tokens in the virtual anechoic (difference: 12.2%) and virtual reverberant (difference: 11.6%) presentations. However, in the diotic condition, the advantage for female tokens occurred only at the most difficult SNR (0 dB), and there the difference was only 5.1%. As SNR increased, the female-talker advantage decreased to 2.0%. Finally, at the highest SNR (+9 dB), subjects identified male tokens with slightly higher accuracy (1.6%) than female tokens. Hence, it appears that for these stimuli the advantage for female-spoken tokens was evident when the signal was separated from the competing talkers in auditory space (i.e., under virtual listening conditions); under diotic conditions there was generally little or no advantage for the female talker. These results are similar to those reported by Ericson and McKinley [20], in whose investigations intelligibility among competing talkers improved more dramatically for female targets when spatially separated than when presented diotically. These results may indicate that the intelligibility of female speech can be improved by incorporating spatial separation in communication systems. An attempt was made to explain the gender effect by computing the rms energy of each token. Although the rms energy was slightly higher for the female tokens than for the male tokens in the // vowel context (which could possibly account for some of the performance differences in the virtual conditions), this explanation is not entirely satisfactory, because it would predict a female advantage in the diotic conditions as well; as described earlier, there was little if any gender effect in the diotic condition at the more favorable SNRs. Several investigators have examined the spectral and aerodynamic vocal characteristics of male and female talkers and confirmed parameter differences between genders across speech conditions [103-105]. Generally, female speakers are perceived to be breathier than male speakers, and the degree of glottal closure appears to be related to the perception of breathiness. Sodersten et al. [105] reported a significant correlation between perceived breathiness and the relative amplitude of the first harmonic. Further study is required to determine which of these acoustic measures (if any) might underlie the consistent female token advantage in virtual listening conditions.
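The rms-energy comparison mentioned above is a standard computation over a token's samples; a minimal sketch (the 0 dB reference here is full scale, an assumption, since the paper does not state the reference used):

```python
import numpy as np

def rms_level_db(token, ref=1.0):
    """Root-mean-square level of a speech token's samples, in dB
    relative to `ref` (full scale by assumption)."""
    samples = np.asarray(token, dtype=float)
    rms = np.sqrt(np.mean(samples ** 2))
    return 20.0 * np.log10(rms / ref)
```

Computing this per token for the male and female recordings would reproduce the energy comparison described: a consistent level difference between talkers would show up directly in dB.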

Figure 2. Effect of Signal-to-Noise Ratio on Speech Intelligibility by Condition

Figure 3. Effect of Gender by Signal-to-Noise Ratio and Condition