
      The Modulation Transfer Function for Speech Intelligibility



          We systematically determined which spectrotemporal modulations in speech are necessary for comprehension by human listeners. Speech comprehension has been shown to be robust to spectral and temporal degradations, but the specific relevance of particular degradations is arguable due to the complexity of the joint spectral and temporal information in the speech signal. We applied a novel modulation filtering technique to recorded sentences to restrict acoustic information quantitatively and to obtain a joint spectrotemporal modulation transfer function for speech comprehension, the speech MTF. For American English, the speech MTF showed the criticality of low modulation frequencies in both time and frequency. Comprehension was significantly impaired when temporal modulations <12 Hz or spectral modulations <4 cycles/kHz were removed. More specifically, the MTF was bandpass in temporal modulations and low-pass in spectral modulations: temporal modulations from 1 to 7 Hz and spectral modulations <1 cycles/kHz were the most important. We evaluated the importance of spectrotemporal modulations for vocal gender identification and found a different region of interest: removing spectral modulations between 3 and 7 cycles/kHz significantly increases gender misidentifications of female speakers. The determination of the speech MTF furnishes an additional method for producing speech signals with reduced bandwidth but high intelligibility. Such compression could be used for audio applications such as file compression or noise removal and for clinical applications such as signal processing for cochlear implants.
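The modulation-filtering idea described above can be sketched in a few lines: take the 2-D Fourier transform of a (log-)spectrogram, whose axes are then temporal modulation (Hz) and spectral modulation (cycles/kHz), zero the components outside a passband, and invert. This is a minimal numpy illustration, not the authors' implementation; the function name and the default cutoffs (keeping up to 7 Hz and 1 cycle/kHz, the core region identified above) are assumptions for the example.

```python
import numpy as np

def modulation_filter_spectrogram(spec, dt, df, wt_max=7.0, wf_max=1.0):
    """Low-pass filter the joint modulation spectrum of a (log-)spectrogram.

    spec : 2-D array, shape (n_freq, n_time)
    dt   : spacing of spectrogram frames in s  -> temporal modulations in Hz
    df   : spacing of spectrogram rows in kHz  -> spectral modulations in cyc/kHz
    Cutoffs are illustrative; the study maps comprehension over many such bands.
    """
    mod = np.fft.fft2(spec)                    # joint modulation spectrum
    wt = np.fft.fftfreq(spec.shape[1], d=dt)   # temporal modulation axis (Hz)
    wf = np.fft.fftfreq(spec.shape[0], d=df)   # spectral modulation axis (cyc/kHz)
    keep = (np.abs(wf)[:, None] <= wf_max) & (np.abs(wt)[None, :] <= wt_max)
    return np.real(np.fft.ifft2(mod * keep))   # filtered spectrogram
```

Resynthesizing a waveform from the filtered spectrogram additionally requires phase reconstruction, which this sketch omits.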

          Author Summary

          The sound signal of speech is rich in temporal and frequency patterns. These fluctuations of power in time and frequency are called modulations. Despite their acoustic complexity, spoken words remain intelligible after drastic degradations in either time or frequency. To fully understand the perception of speech and to be able to reduce speech to its most essential components, we need to completely characterize how modulations in amplitude and frequency contribute together to the comprehensibility of speech. Hallmark research distorted speech in either time or frequency but described the arbitrary manipulations in terms limited to one domain or the other, without quantifying the remaining and missing portions of the signal. Here, we use a novel sound filtering technique to systematically investigate the joint features in time and frequency that are crucial for understanding speech. Both the modulation-filtering approach and the resulting characterization of speech have the potential to change the way that speech is compressed in audio engineering and how it is processed in medical applications such as cochlear implants.

      Most cited references (40)


          Speech recognition with primarily temporal cues.

          Nearly perfect speech recognition was observed under conditions of greatly reduced spectral information. Temporal envelopes of speech were extracted from broad frequency bands and were used to modulate noises of the same bandwidths. This manipulation preserved temporal envelope cues in each band but restricted the listener to severely degraded information on the distribution of spectral energy. The identification of consonants, vowels, and words in simple sentences improved markedly as the number of bands increased; high speech recognition performance was obtained with only three bands of modulated noise. Thus, the presentation of a dynamic temporal pattern in only a few broad spectral regions is sufficient for the recognition of speech.
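The noise-vocoding manipulation described above (per-band temporal envelopes modulating noises of the same bandwidths) can be sketched as follows. The band edges (three bands, echoing the three-band result), the FFT-based band splitting, and the Hilbert-envelope extraction are illustrative assumptions, not the study's exact processing.

```python
import numpy as np

def analytic_envelope(x):
    """Amplitude envelope via the analytic signal (FFT-based Hilbert transform)."""
    n = len(x)
    h = np.zeros(n)
    h[0] = 1.0
    h[1:(n + 1) // 2] = 2.0
    if n % 2 == 0:
        h[n // 2] = 1.0
    return np.abs(np.fft.ifft(np.fft.fft(x) * h))

def noise_vocode(x, fs, edges=(100, 800, 1500, 4000), seed=0):
    """Replace each band's fine structure with envelope-modulated noise."""
    rng = np.random.default_rng(seed)
    n = len(x)
    freqs = np.abs(np.fft.fftfreq(n, d=1.0 / fs))
    X = np.fft.fft(x)
    out = np.zeros(n)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (freqs >= lo) & (freqs < hi)    # band-pass by FFT masking
        band = np.real(np.fft.ifft(X * mask))  # band-limited speech
        env = analytic_envelope(band)          # its temporal envelope
        noise = np.real(np.fft.ifft(np.fft.fft(rng.standard_normal(n)) * mask))
        out += env * noise                     # only the envelope cue survives
    return out
```

The output keeps each band's temporal envelope but carries essentially no information about the fine distribution of spectral energy within a band.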

            Speech recognition in noise as a function of the number of spectral channels: comparison of acoustic hearing and cochlear implants.

             Speech recognition was measured as a function of spectral resolution (number of spectral channels) and speech-to-noise ratio in normal-hearing (NH) and cochlear-implant (CI) listeners. Vowel, consonant, word, and sentence recognition were measured in five normal-hearing listeners, ten listeners with the Nucleus-22 cochlear implant, and nine listeners with the Advanced Bionics Clarion cochlear implant. Recognition was measured as a function of the number of spectral channels (noise bands or electrodes) at signal-to-noise ratios of +15, +10, +5, and 0 dB, and in quiet. Performance with three different speech processing strategies (SPEAK, CIS, and SAS) was similar across all conditions, and improved as the number of electrodes increased (up to seven or eight) for all conditions. For all noise levels, vowel and consonant recognition with the SPEAK speech processor did not improve with more than seven electrodes, while for normal-hearing listeners, performance continued to increase up to at least 20 channels. Speech recognition on more difficult speech materials (word and sentence recognition) showed a marginally significant increase in Nucleus-22 listeners from seven to ten electrodes. The average implant score on all processing strategies was poorer than scores of NH listeners with similar processing. However, the best CI scores were similar to the normal-hearing scores for that condition (up to seven channels). CI listeners with the highest performance level increased in performance as the number of electrodes increased up to seven, while CI listeners with low levels of speech recognition did not increase in performance as the number of electrodes was increased beyond four. These results quantify the effect of number of spectral channels on speech recognition in noise and demonstrate that most CI subjects are not able to fully utilize the spectral information provided by the number of electrodes used in their implant.

              Effect of temporal envelope smearing on speech reception.

               R. Drullman, J. Festen, R. Plomp (1994)
              The effect of smearing the temporal envelope on the speech-reception threshold (SRT) for sentences in noise and on phoneme identification was investigated for normal-hearing listeners. For this purpose, the speech signal was split up into a series of frequency bands (width of 1/4, 1/2, or 1 oct) and the amplitude envelope for each band was low-pass filtered at cutoff frequencies of 0, 1/2, 1, 2, 4, 8, 16, 32, or 64 Hz. Results for 36 subjects show (1) a severe reduction in sentence intelligibility for narrow processing bands at low cutoff frequencies (0-2 Hz); and (2) a marginal contribution of modulation frequencies above 16 Hz to the intelligibility of sentences (provided that lower modulation frequencies are completely present). For cutoff frequencies above 4 Hz, the SRT appears to be independent of the frequency bandwidth upon which envelope filtering takes place. Vowel and consonant identification with nonsense syllables were studied for cutoff frequencies of 0, 2, 4, 8, or 16 Hz in 1/4-oct bands. Results for 24 subjects indicate that consonants are more affected than vowels. Errors in vowel identification mainly consist of reduced recognition of diphthongs and of confusions between long and short vowels. In case of consonant recognition, stops appear to suffer most, with confusion patterns depending on the position in the syllable (initial, medial, or final).
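The per-band envelope-smearing manipulation described above (low-pass filtering each band's amplitude envelope while keeping its fine structure) can be sketched like this. The Hilbert-based envelope extraction, the zero-phase FFT filtering, and the small `eps` guard are illustrative assumptions, not the study's exact procedure.

```python
import numpy as np

def lowpass_fft(x, fs, cutoff):
    """Zero-phase low-pass: zero all FFT components above `cutoff` Hz."""
    freqs = np.abs(np.fft.fftfreq(len(x), d=1.0 / fs))
    return np.real(np.fft.ifft(np.fft.fft(x) * (freqs <= cutoff)))

def smear_envelope(band, fs, cutoff_hz, eps=1e-8):
    """Low-pass one band's amplitude envelope, keeping its fine structure."""
    n = len(band)
    h = np.zeros(n)                        # analytic-signal (Hilbert) weights
    h[0] = 1.0
    h[1:(n + 1) // 2] = 2.0
    if n % 2 == 0:
        h[n // 2] = 1.0
    env = np.abs(np.fft.ifft(np.fft.fft(band) * h))        # amplitude envelope
    smeared = np.clip(lowpass_fft(env, fs, cutoff_hz), 0.0, None)
    return band * smeared / (env + eps)    # swap the smeared envelope in
```

Applied per frequency band (1/4, 1/2, or 1 octave wide in the study) with cutoffs between 0 and 64 Hz, this removes temporal modulations above the cutoff while leaving slower ones intact.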

                Author and article information

                Role: Editor
                Journal: PLoS Comput Biol (PLoS Computational Biology)
                Publisher: Public Library of Science (San Francisco, USA)
                Published: 6 March 2009
                Volume: 5, Issue: 3
                Author affiliations:
                [1] Helen Wills Neuroscience Institute, University of California Berkeley, Berkeley, California, United States of America
                [2] Department of Psychology, University of California Berkeley, Berkeley, California, United States of America
                Editor affiliation: University College London, United Kingdom
                Author notes

                Conceived and designed the experiments: TME FET. Performed the experiments: TME FET. Analyzed the data: TME FET. Contributed reagents/materials/analysis tools: TME FET. Wrote the paper: TME FET.

                Copyright © 2009 Elliott, Theunissen. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
                Pages: 14
                Categories: Research Article; Computational Biology; Neuroscience/Behavioral Neuroscience; Neuroscience/Sensory Systems; Neuroscience/Experimental Psychology; Quantitative & Systems Biology

