      Is Open Access

      The Natural Statistics of Audiovisual Speech

      research-article


          Abstract

          Humans, like other animals, are exposed to a continuous stream of signals, which are dynamic, multimodal, extended, and time varying in nature. This complex input space must be transduced and sampled by our sensory systems and transmitted to the brain, where it can guide the selection of appropriate actions. To simplify this process, it has been suggested that the brain exploits statistical regularities in the stimulus space. Tests of this idea have largely been confined to unimodal signals and natural scenes. One important class of multisensory signals for which a quantitative input space characterization is unavailable is human speech. We do not understand what signals our brain has to actively piece together from an audiovisual speech stream to arrive at a percept versus what is already embedded in the signal structure of the stream itself. In essence, we do not have a clear understanding of the natural statistics of audiovisual speech. In the present study, we identified the following major statistical features of audiovisual speech. First, we observed robust correlations and close temporal correspondence between the area of the mouth opening and the acoustic envelope. Second, we found the strongest correlation between the area of the mouth opening and vocal tract resonances. Third, we observed that both the area of the mouth opening and the voice envelope are temporally modulated in the 2–7 Hz frequency range. Finally, we show that the timing of mouth movements relative to the onset of the voice is consistently between 100 and 300 ms. We interpret these data in the context of recent neural theories of speech, which suggest that speech communication is a reciprocally coupled, multisensory event, whereby the outputs of the signaler are matched to the neural processes of the receiver.
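          The first and last of these features lend themselves to a simple computational sketch. The snippet below is not the authors' actual pipeline: the Hilbert-envelope low-pass cutoff and the lag window are illustrative assumptions. It shows one way to extract a wideband acoustic envelope and find the lag at which a mouth-area time series best correlates with it.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def acoustic_envelope(audio, fs, cutoff=10.0):
    """Wideband amplitude envelope: magnitude of the analytic signal,
    low-pass filtered (the 10 Hz cutoff is an assumption, not the paper's value)."""
    env = np.abs(hilbert(audio))
    b, a = butter(2, cutoff, btype="low", fs=fs)
    return filtfilt(b, a, env)

def lagged_corr(x, y, lag):
    """Pearson r between x[t] and y[t + lag] (lag in samples)."""
    if lag >= 0:
        a, b = x[:len(x) - lag], y[lag:]
    else:
        a, b = x[-lag:], y[:len(y) + lag]
    return np.corrcoef(a, b)[0, 1]

def peak_lag(x, y, fs, max_lag_s=0.5):
    """Peak cross-correlation between two time series and the lag (s) at which it occurs."""
    lags = list(range(-int(max_lag_s * fs), int(max_lag_s * fs) + 1))
    rs = [lagged_corr(x, y, lag) for lag in lags]
    i = int(np.argmax(rs))
    return rs[i], lags[i] / fs
```

          Applied to a mouth-area trace and the envelope of the accompanying audio (resampled to a common rate), `peak_lag` would return a consistent visual-lead lag if the signals are structured as described above.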

          Author Summary

          When we watch someone speak, how much work is our brain actually doing? How much of this work is facilitated by the structure of speech itself? Our work shows that not only are the visual and auditory components of speech tightly locked (obviating the need for the brain to actively bind such information), but this temporal coordination also has a distinct rhythm between 2 and 7 Hz. Furthermore, during speech production, the onset of the voice occurs with a delay of between 100 and 300 ms relative to the initial, visible movements of the mouth. These temporal parameters of audiovisual speech are intriguing because they match known properties of neuronal oscillations in the auditory cortex. Thus, given what we already know about the neural processing of speech, the natural features of audiovisual speech signals seem to be optimally structured for their interactions with ongoing brain rhythms in receivers.
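          As a rough illustration of the 2–7 Hz point, one could summarize a signal's modulation spectrum by the fraction of its power falling in that band. This is a sketch, not the paper's analysis; the Welch window length is an arbitrary choice.

```python
import numpy as np
from scipy.signal import welch

def band_power_fraction(env, fs, lo=2.0, hi=7.0):
    """Fraction of modulation power in [lo, hi] Hz (Welch PSD, mean removed)."""
    f, pxx = welch(env - env.mean(), fs=fs, nperseg=min(len(env), 4 * int(fs)))
    band = (f >= lo) & (f <= hi)
    return pxx[band].sum() / pxx.sum()
```

          A mouth-area or voice-envelope series dominated by syllabic-rate modulation, as reported above, would yield a fraction close to 1.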


          Most cited references (42)


          Is neocortex essentially multisensory?

          Although sensory perception and neurobiology are traditionally investigated one modality at a time, real-world behaviour and perception are driven by the integration of information from multiple sensory sources. Mounting evidence suggests that the neural underpinnings of multisensory integration extend into early sensory processing. This article examines the notion that neocortical operations are essentially multisensory. We first review what is known about multisensory processing in higher-order association cortices and then discuss recent anatomical and physiological findings in presumptive unimodal sensory areas. The pervasiveness of multisensory influences on all levels of cortical processing compels us to reconsider thinking about neural processing in unisensory terms. Indeed, the multisensory nature of most, possibly all, of the neocortex forces us to abandon the notion that the senses ever operate independently during real-world cognition.

            Phase patterns of neuronal responses reliably discriminate speech in human auditory cortex.

            How natural speech is represented in the auditory cortex constitutes a major challenge for cognitive neuroscience. Although many single-unit and neuroimaging studies have yielded valuable insights about the processing of speech and matched complex sounds, the mechanisms underlying the analysis of speech dynamics in human auditory cortex remain largely unknown. Here, we show that the phase pattern of theta band (4-8 Hz) responses recorded from human auditory cortex with magnetoencephalography (MEG) reliably tracks and discriminates spoken sentences and that this discrimination ability is correlated with speech intelligibility. The findings suggest that an approximately 200 ms temporal window (period of theta oscillation) segments the incoming speech signal, resetting and sliding to track speech dynamics. This hypothesized mechanism for cortical speech analysis is based on the stimulus-induced modulation of inherent cortical rhythms and provides further evidence implicating the syllable as a computational primitive for the representation of spoken language.

              Speech recognition with primarily temporal cues.

              Nearly perfect speech recognition was observed under conditions of greatly reduced spectral information. Temporal envelopes of speech were extracted from broad frequency bands and were used to modulate noises of the same bandwidths. This manipulation preserved temporal envelope cues in each band but restricted the listener to severely degraded information on the distribution of spectral energy. The identification of consonants, vowels, and words in simple sentences improved markedly as the number of bands increased; high speech recognition performance was obtained with only three bands of modulated noise. Thus, the presentation of a dynamic temporal pattern in only a few broad spectral regions is sufficient for the recognition of speech.
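              The manipulation described above is the classic noise-vocoding procedure: divide speech into frequency bands, extract each band's temporal envelope, and use it to modulate noise of the same bandwidth. A minimal sketch follows; the band edges, log spacing, and filter order are illustrative assumptions, not the original study's parameters.

```python
import numpy as np
from scipy.signal import butter, hilbert, sosfiltfilt

def noise_vocode(speech, fs, n_bands=3, lo=100.0, hi=4000.0):
    """Crude noise vocoder: each band's Hilbert envelope modulates
    noise filtered into the same band (band edges are assumptions)."""
    edges = np.geomspace(lo, hi, n_bands + 1)  # log-spaced band edges
    rng = np.random.default_rng(0)
    out = np.zeros_like(speech, dtype=float)
    for f1, f2 in zip(edges[:-1], edges[1:]):
        sos = butter(4, [f1, f2], btype="band", fs=fs, output="sos")
        band = sosfiltfilt(sos, speech)
        env = np.abs(hilbert(band))                          # temporal envelope
        noise = sosfiltfilt(sos, rng.standard_normal(len(speech)))
        out += env * noise                                   # envelope-modulated noise
    return out
```

              The output preserves each band's temporal envelope while discarding fine spectral detail, which is the degradation under which the high recognition performance above was measured.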

                Author and article information

                Contributors
                Role: Editor
                Journal
                PLoS Comput Biol
                plos
                ploscomp
                PLoS Computational Biology
                Public Library of Science (San Francisco, USA)
                1553-734X
                1553-7358
                July 2009
                17 July 2009
                Volume: 5
                Issue: 7
                e1000436
                Affiliations
                [1 ]Neuroscience Institute, Princeton University, Princeton, New Jersey, United States of America
                [2 ]Department of Psychology, Princeton University, Princeton, New Jersey, United States of America
                [3 ]GIPSA Lab, Grenoble, France
                [4 ]Department of Ecology & Evolutionary Biology, Princeton University, Princeton, New Jersey, United States of America
                University College London, United Kingdom
                Author notes

                Conceived and designed the experiments: CC AAG. Performed the experiments: CC AT. Analyzed the data: CC AT. Contributed reagents/materials/analysis tools: CC SS AC. Wrote the paper: CC AAG.

                Article
                Manuscript ID: 09-PLCB-RA-0354R3
                DOI: 10.1371/journal.pcbi.1000436
                PMC: 2700967
                PMID: 19609344
                Chandrasekaran et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
                History
                Received: 7 April 2009
                Accepted: 9 June 2009
                Page count
                Pages: 18
                Categories
                Research Article
                Neuroscience/Cognitive Neuroscience
                Neuroscience/Motor Systems
                Neuroscience/Sensory Systems
                Neuroscience/Theoretical Neuroscience
                Computer Science/Natural and Synthetic Vision
                Neuroscience/Natural and Synthetic Vision

                Quantitative & Systems biology
