
      A Comparison of Approaches to Timbre Descriptors in Music Information Retrieval and Music Psychology

      Journal of New Music Research
      Informa UK Limited



Most cited references (53)


          Rapid task-related plasticity of spectrotemporal receptive fields in primary auditory cortex.

          We investigated the hypothesis that task performance can rapidly and adaptively reshape cortical receptive field properties in accord with specific task demands and salient sensory cues. We recorded neuronal responses in the primary auditory cortex of behaving ferrets that were trained to detect a target tone of any frequency. Cortical plasticity was quantified by measuring focal changes in each cell's spectrotemporal response field (STRF) in a series of passive and active behavioral conditions. STRF measurements were made simultaneously with task performance, providing multiple snapshots of the dynamic STRF during ongoing behavior. Attending to a specific target frequency during the detection task consistently induced localized facilitative changes in STRF shape, which were swift in onset. Such modulatory changes may enhance overall cortical responsiveness to the target tone and increase the likelihood of 'capturing' the attended target during the detection task. Some receptive field changes persisted for hours after the task was over and hence may contribute to long-term sensory memory.

            The Modulation Transfer Function for Speech Intelligibility

Introduction

Human speech, like most animal vocalizations, is a complex signal whose amplitude envelope fluctuates timbrally in frequency and rhythmically in time. Horizontal cross-sections of the speech spectrogram, as in Figure 1A, describe the time-varying envelope for a particular frequency, while vertical cross-sections at various time points show spectral contrasts, or variation in the spectral envelope shape (Audio S1). Indeed, the structure in the spectrogram of speech is not characterized by isolated spectrotemporal events but instead by sinusoidal patterns that extend in time and frequency over larger time windows and many frequency bands. It is well known that these patterns carry important phonological information, such as syllable boundaries in the time domain, formant and pitch information in the spectral domain, and formant transitions in the spectrotemporal domain as a whole [1].

In order to quantify the power in these temporal and spectral modulations, the two-dimensional (2D) Fourier transform of the spectrogram can be analyzed to obtain the modulation power spectrum (MPS) of speech [2],[3]. In this study, we first repeated this analysis using a time-frequency representation that emphasized differences in formant structure and pitch structure. Then we used a novel filtering method to investigate which spectral and temporal modulation frequencies were the most important for speech intelligibility. In this manner we obtained the speech modulation transfer function (speech MTF). We were then able to compare the speech MTF with the speech MPS in order to interpret the effect of modulation filters on perception of linguistic features of speech.

Figure 1 (doi:10.1371/journal.pcbi.1000302.g001). Component spectrotemporal modulations make up the modulation spectrum. (A) Spectrogram of a control condition sentence, "The radio was playing too loudly," reveals the acoustic complexity of speech (Audio S1). All supporting sound files have been compressed as .mp3 files for the purpose of publication; original .wav files were used as stimuli. (B) Example spectrotemporal modulation patterns circled in the sentence (A) can be described as a time-varying weighted sum of component modulations. (C) The MPS shows the spectral and temporal modulation power in 100 sentences. The outer, middle, and inner black contour lines delineate the modulations contained in 95%, 90%, and 85% of the modulation power, respectively. Down-sweeps in frequency appear in the right quadrant, whereas upward drifts in frequency are in the left quadrant. Slower temporal changes lie near zero on the axis, while faster changes result in higher temporal modulations towards the left and right of the graph.

Our study both complements and unifies previous speech perception experiments that have shown speech intelligibility to depend on both spectral and temporal modulation cues, yet to be surprisingly robust to significant spectral or temporal degradations. Speech can be understood with either very coarse spectral information [4]–[8] or very coarse temporal information [9]–[11]. Our goal was to unify spectral and temporal degradation experiments by performing both types of manipulations in the same space, namely, the space of joint spectrotemporal modulations given by the speech MPS. The approach makes advances in the rigor of signal processing, in the specificity of the manipulations allowed, and in the comparison with speech signal statistics.
First, the approach depicts visually and quantifies the concomitant effects that temporal manipulations have on the spectral structure of the signal, and that spectral filtering has on temporal structure. Second, the technique offers the possibility of notch filtering in the spectral modulation domain, which had not been done before. Whereas degradation by low-pass filtering can reveal the minimum spectral or temporal resolution required for comprehension, notch filtering can distinguish more limited regions of spectrotemporal modulations that differ in levels of importance for comprehension. Third, the modulation filtering technique can be used to target specific joint spectral and temporal modulations. In this study, this advantage was exploited in a two-step filtering procedure to measure the effects of precise temporal and spectral degradations in the range of modulations most important for intelligibility. In this procedure, we first removed potentially redundant information in higher spectral and temporal modulations, and then applied notch spectral or temporal filters within the remaining modulation space. Finally, we were able to compare the results of the speech filtering experiments to the MPS of speech, in order to make an initial characterization of the speech MTF in humans. As far as we know, this is the first such comparison using a linear frequency axis and a modulation transfer function obtained directly from speech intelligibility experiments. The resultant speech MTF could be used to design more optimal speech compression such as that required by cochlear implants.

Neurophysiological research on animal perception of modulations inspired our study. While the cochlea and peripheral auditory neurons represent acoustic signals in a time-frequency decomposition (a cochleogram), higher auditory neurons acquire novel response properties that are best described by tuning sensitivity to temporal amplitude modulations and spectral amplitude modulations (reviewed in [12] and [13]). By designing human psychophysical experiments using the same representations used in neurophysiological research, we can begin to link brain mechanisms and human perception.

Speech signals carry information about a speaker's emotion and identity in addition to the message content. As a final thrust of investigation, we tested whether modulations corresponding to acoustic features embedded in the speech signal enabled listeners to detect the gender of the speaker. Vocal gender identity has been shown to depend on some spectral features in common with, and some distinct from, the spectral features conferring speech intelligibility [14],[15].

Results

Spectrotemporal modulations underlying speech intelligibility and gender recognition were assessed in psychophysical experiments using sentences in which spectrotemporal modulations had been systematically filtered. Since our psychophysical experiments were in large part inspired by our analysis of the spectrotemporal modulations of speech, we begin by reporting the resulting modulation space. We describe the characteristics of the MPS of speech and emphasize which characteristics are general to natural sounds, which are general to animal vocal communication signals, and which are specific to human speech. The goal of the psychophysical experiments was to determine the subset of perceptible modulations that contribute exceptionally to speech intelligibility.
Modulation Power Spectrum of Speech

The MPS of American English (Figure 1C) was calculated from a corpus of 100 sentences (see Materials and Methods). This speech modulation spectrum shares key features observed in other natural sounds. As in all natural sounds, most of the power is found at low modulation frequencies and decays along the modulation axes following a power law [3]. Moreover, as is typical of animal vocalizations, the MPS is not separable: most of the energy in high spectral modulations occurs only at low temporal modulation, and most high temporal modulation power is found at low spectral modulation [3],[16]. This characteristic non-separability of the MPS is due to the fact that animal vocalizations contain two kinds of sounds: short sounds with little spectral structure but fast temporal changes (contributing power along the x-axis at intermediate to high temporal frequencies), and slow sounds with rich spectral structure (found along the y-axis at intermediate to high spectral frequencies). In normal speech, this grouping of sounds corresponds roughly to the vocalic (slow sounds with spectral structure, produced with phonation) and non-vocalic acoustic contrasts (fast sounds with less spectral structure, produced without phonation). Animal vocalizations and human speech do have sound elements at intermediate spectrotemporal modulations, but these have less power (in other words, are less frequent) than expected from the power (or average occurrence) of spectral or temporal modulations taken separately, reflecting the non-separability of the MPS.

An additional aspect of human speech is that modulations separate into three independent areas of energy along the axis of spectral modulation, at low temporal modulation (Figures 1C and 2). First, the triangular energy area at the lower spectral modulation frequencies corresponds to the coarse spectral amplitude fluctuations imposed by the upper vocal tract, namely the formants and formant transitions (labeled in Figure 2B). The other two areas of spectral modulation energy, found at higher levels, correspond to the harmonic structure of vocalic phones produced by the glottal pulse; this energy diverges into two areas because of the difference in pitch between the low male voice (highest spectral modulations) and the higher female voice (more intermediate spectral modulations). The lower register of the male voice produces higher spectral modulations because of the finer spacing of harmonics over that low fundamental. Equivalent pitches corresponding to the spectral modulations are labeled parenthetically in white on the y-axis of Figure 2.

The MPS can also be estimated from time-frequency representations that have a logarithmic frequency axis (see Materials and Methods, and Figure S1). Although log-frequency representations are better models of the auditory periphery, the linear-frequency representation is more useful for describing the harmonic structure present in sounds. For example, the separation of the spectral structure of vocalic phones into three regions is a property that is observed only in the linear frequency representation (Figure S1).

Figure 2 (doi:10.1371/journal.pcbi.1000302.g002). Spectral modulations differ in male and female speech. (A,B) The MPS of the 50 corpus sentences spoken by males (A), and of the 50 spoken by females (B), with black contour lines as in Figure 1.
White parenthetical labels on the y-axes of (A) and (B) show related frequencies demarcating the male and female vocal registers; they correspond to spectral modulations based on harmonic spacing. (C,D) Modulation filters that resulted in misidentification of the speaker's gender. (C) The speech MPS for female speakers is overlaid with the boundaries of the low-pass spectrotemporal filter. In this condition, speaker gender was misidentified in a quarter of the sentences, with 91% of those errors being females misidentified as male. (D) The same female speech MPS overlaid with a notch filter that removed modulations from 3 to 7 cycles/kHz. Of the 21% gender errors in this condition, 95% were female speakers misidentified as male.

Thus, in the speech MPS with linear frequency, not only do vocalic and non-vocalic sounds occupy different regions within the modulation space, but the spectral modulations for vocalic sounds corresponding to formants and to male and female pitch occupy distinct regions. Also, human speech is symmetric between positive and negative temporal modulation frequencies, showing that there is equal power for upward frequency modulations (Figure 1C, left quadrant) and downward frequency modulations (right quadrant).

Psychophysical Experiments in Spectrotemporal Modulation Filtering

Our modulation filtering methodology allowed us not only to rigorously degrade speech within its spectral and temporal structure but also to relate the results of the degradation to acoustic features of the signal that are important for different percepts, as described above. Our psychophysical experiments are organized in three sections. We first report results from the two sets of modulation filters applied to the whole spectrotemporal modulation spectrum of speech, low-pass filters and notch filters, which indicated a subset of modulations critical for speech understanding, thereafter designated the "core" modulations. Subsequently, we report results from notch filters applied to sentences containing only core modulations, further refining our identification of crucial spectrotemporal modulations.

Low-pass modulation filtering

We scored the number of words reported correctly from sentences with low-pass filtered spectral or temporal modulations (see Materials and Methods for the modulation filtering procedure) at cutoff frequencies roughly logarithmically distributed across the speech MPS (Figure 3). Sentences were embedded in noise and played back at 3 different levels of signal-to-noise ratio (SNR). Comprehension dropped off significantly at around a 4 cycles/kHz low-pass spectral cutoff frequency, and at 12 Hz in the temporal domain, with a further significant decrease at 6 Hz. Gray shading in the thumbnails of the modulation spectrum shows the modulations of speech that were low-pass filtered spectrally (Figure 3A) or temporally (Figure 3B). The line graphs (Figure 3C and 3D) show mean±s.e. performance on the sentence comprehension test for the spectral and the temporal conditions, at the three SNRs. Spectrograms of the example sentence from Figure 1 show extreme spectral smearing (0.5 cycles/kHz, Figure 3E, Audio S2) and temporal smearing (3 Hz, Figure 3F, Audio S3), in addition to the spectral smearing (4 cycles/kHz, Figure 3G, Audio S4) and temporal smearing (12 Hz, Figure 3H, Audio S5) conditions at which comprehension decreased significantly in comparison to control.

Figure 3 (doi:10.1371/journal.pcbi.1000302.g003). Comprehension of low-pass modulation filtered sentences.
(A,B) Grayed areas of thumbnails show spectrotemporal modulations removed by low-pass modulation filtering in the spectral (A) or temporal (B) domain. Units and axis ranges are the same as in Figure 2. Each thumbnail represents a stimulus set analyzed in (C,D). (C,D) Mean±s.e. performance in transcribing words from the low-pass modulation filtered sentences. Cutoff frequencies on the x-axes of the two graphs are presented in units appropriate to the spectral or temporal domain, but could equally well be viewed on one continuous scale in either unit. Symbols show SNR levels. Dashed line shows control performance at +2 dB SNR; dotted line shows control performance at −3 dB SNR. Points at cutoff frequencies which share no capital letters in common (above line plots) are significantly different (repeated measures ANOVA, Bonferroni post-hoc correction, p < 0.05).

[…] (> 0.5 cycles/kHz).

There are therefore important differences between the MTF obtained by measuring detections of ripple sounds in noise and the one obtained by performing notch filtering operations on speech. While humans might be equally good at detecting low and intermediate spectral modulations, the lower ones carry more information for speech intelligibility. The intermediate modulations should carry more information for other auditory tasks such as pitch perception. While animal models of speech perception remain a stretched analogy, models of animal sensitivity to relevant modulations hold more immediate potential. The shape of our speech MTF also resembles the MTFs that have been obtained for mammalian [31] and avian [32] high-level auditory neurons. This correlation between the power of the spectrotemporal modulations in speech (the speech MPS), the MTF resulting from tests of speech intelligibility, the MTF derived from detection of synthetic sounds [2], and the tuning properties of auditory neuron ensembles suggests a match between the speech signal and the receiver. The most informative modulations in speech, and in other animal communication signals, occur in regions of the modulation spectrum where humans show high sensitivity and where animals' high-level auditory neurons have the highest gain [13],[33],[34].

We also examined the role of modulations in the task of recognizing the gender of a speaker. The MPSs of male and female speech differ in the frequency rate at which power is concentrated in the higher spectral modulations (Figure 2). In our MPS representation, the pitch-associated spectral frequencies of male and female speakers showed a bimodal distribution: the two modes correspond to the glottal action of the vocal cords pulsing at ∼150 Hz in adult male speakers and above 200 Hz in females [22]. The spectral notch filter that removed the high spectral modulation power unique to the female voice confused listeners' percept of gender, such that half of the female stimuli notch filtered between 3 and 7 cycles/kHz sounded male to subjects. Control stimuli containing only the core modulations, which likewise lack the female-specific modulation power, similarly confused listeners. We conclude that modulations between 3 and 7 cycles/kHz give rise to the percept of female vocal pitch. It is interesting that removal of the modulations underlying the male vocal register did not appear to detract from perception of speaker masculinity.
Although fundamental frequencies provide the basis for gender recognition particularly in vowels [35], it has also been shown that the fundamental and the second formant frequency are equally good predictors of speaker gender [36]. The lower spectral modulations could therefore carry additional gender information, but this acoustic distinction fails to explain the bias toward male identification. Alternatively, the perception of vocal masculinity could depend more on gender-specific articulatory behaviors, which account for social "dialectal" gender cues distinguishing even pre-pubescent speakers [37].

Our results have implications for speech representation purposes including compression, cochlear implant design, and speech recognition by machines. In both speech compression applications and signal processing for cochlear implants, the redundancy of the speech signal allows a reduction in the bandwidth of the channel through which the signal is represented. For this purpose, limiting spectral resolution has been a favorite solution, both because of the robustness of the signal to such degradations [6],[29] and because of engineering design constraints for cochlear implants. However, in noisy environments, additional spectral information results in significant improvement in speech hearing [20],[25]. Our approach provides a guided solution: after determining the speech MTF, one can selectively reduce the bandwidth of the signal by first representing key spectral modulations and then systematically including the most important adjacent spectrotemporal modulations to capture the greatest overall space within constraints, as illustrated in cartoon form in Figure 6 (see also [2]). Our initial experiment with gender identification, together with research in music perception [38], shows that the most advantageous solution will depend on the task and the desired percept. Finally, the speech MTF could also be used as a template for filtering out broadband noise: a modulation filtering procedure can be used to emphasize the modulations important for speech and to de-emphasize all others. Both the speech compression and the speech filtering operations require a decomposition of the sound in terms of spectrotemporal modulations, as well as a re-synthesis. These are not particularly simple operations (see Materials and Methods), but with advances in signal processing they will become possible in real time. After all, a similar operation appears to happen in real time in the auditory system [12],[21],[39].

Materials and Methods

Ethics Statement

Subjects gave written consent as approved by the Committee for the Protection of Human Subjects at the University of California, Berkeley.

Subjects

Native American-English speakers of mixed gender (20 in the low-pass experiment, aged 18–34 yr; 17 in the notch experiment, aged 18–36 yr) volunteered to participate in the study. Audiograms showed that their hearing thresholds were normal from 30 to 15,000 Hz; one subject was excluded due to high-frequency hearing loss.

Stimulus Materials

Acoustically clean recordings of spoken sentences were obtained from the soundtrack of the Iowa Audiovisual Speech Perception videotape [40]. The soundtrack was digitized at a 32 kHz sampling rate in our laboratory using TDT System II hardware. This corpus consists of 100 short complete sentences read without emotion by six adult male and female American-English speakers.
See Figure 1 for the spectrogram of one example, "The radio was playing too loudly." The corpus has been used in previous studies of speech perception [5],[6]. The original speech sentences were normalized for power. The synthetic degraded speech signals were generated from this original set by a novel filtering procedure performed on the log spectrogram, as described below.

The Modulation Power Spectrum

The modulation power spectrum (MPS) of a sound is the amplitude spectrum of the 2D Fourier transform of a time-frequency representation of the sound's pressure waveform [3]. The MPS can be estimated for a single sound (e.g., one sentence) or for an ensemble of sounds (e.g., 50 sentences from adult male speakers). In our analysis, the time-frequency representation is the log amplitude of a spectrogram obtained with Gaussian windows. Gaussian windows are used because of their symmetry in time-frequency and because they result in time-frequency representations that are more easily invertible [41]. As in cepstral analysis [23], the logarithm of the amplitude of the spectrogram is used to disentangle multiplicative spectral or temporal modulations into separate terms. For example, in speech sounds, the spectral modulations that constitute the formants in vowels (timbre) separate from those that constitute the pitch of the voice (Figure 2B). The MPS is then the amplitude squared as a function of the Fourier pairs of the time and frequency axes of this log-amplitude spectrographic representation. We call these two axes the temporal modulations (in Hz) and the spectral modulations (in cycles/kHz). One of these two axes must have positive and negative frequency modulations to distinguish upward frequency modulations (e.g., cos(ω_s f − ω_t t)) from downward modulations (e.g., cos(ω_s f + ω_t t)). By convention, we use positive and negative temporal modulations.

The time-frequency resolution scale of the spectrogram (given by the width of the Gaussian window) determines the upper bounds of the temporal and spectral modulations in an inverse relationship, as a result of the uncertainty principle or time-frequency tradeoff. The time-frequency scale must therefore be chosen carefully so that the modulation frequencies of interest are covered. The choice of time-frequency scale can be made in a somewhat systematic fashion by using a value for which the shape of the modulation spectrum does not change very much; at such values, most of the energy in the modulation spectrum is far from the boundaries determined by the time-frequency tradeoff [3]. For analyzing our original and filtered signals, we used a time-frequency scale given by a Gaussian window of 10 ms in the time domain, or 16 Hz in the frequency domain. For obtaining the MPS of sound ensembles, sounds in their spectrographic representation were divided into segments of 1 s, and the MPS of each segment was estimated before averaging to obtain a power density function. The boundaries of the modulation spectrum at the time-frequency scale of 10 ms–16 Hz are 50 Hz and 31 cycles/kHz. At this time-frequency scale, approximately 90% of the power in the modulation spectrum was found at temporal modulations below 25 Hz and spectral modulations below 16 cycles/kHz, justifying the choice (Figure 1). Moreover, the temporal and spectral modulation cutoffs correspond approximately to the critical modulation frequency at which amplitude-modulated tones and noise start to promote a pitch percept [33].
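This computation is compact enough to sketch in code. The following Python/NumPy sketch is an illustration, not the authors' Matlab implementation: the 16 kHz sampling rate, the window support, the log floor, and the omission of the 1 s ensemble segmentation are all assumptions.

```python
# Minimal MPS sketch for a single sentence (illustrative only).
# Assumed: `sound` is a mono float array sampled at fs = 16 kHz.
import numpy as np
from scipy.signal import stft
from scipy.signal.windows import gaussian

def modulation_power_spectrum(sound, fs=16000, sigma_t=0.010):
    # Gaussian window: sigma_t = 10 ms in time corresponds to
    # sigma_f = 1 / (2 * pi * sigma_t) ~ 16 Hz in frequency.
    hop = int(fs * sigma_t)                  # 10 ms hop -> 50 Hz temporal bound
    win = gaussian(6 * hop, std=hop)         # +/- 3 sigma support (an assumption)
    f, t, spec = stft(sound, fs=fs, window=win,
                      nperseg=len(win), noverlap=len(win) - hop)
    log_spec = np.log(np.abs(spec) + 1e-10)  # log amplitude, as in cepstral analysis

    # MPS = squared amplitude of the 2D FFT of the log spectrogram.
    mps = np.abs(np.fft.fftshift(np.fft.fft2(log_spec))) ** 2
    # Axes: spectral modulation in cycles/kHz (the ~16 Hz row spacing gives
    # a ~31 cycles/kHz bound), signed temporal modulation in Hz.
    smod = np.fft.fftshift(np.fft.fftfreq(len(f), d=f[1] - f[0])) * 1000.0
    tmod = np.fft.fftshift(np.fft.fftfreq(len(t), d=sigma_t))
    return smod, tmod, mps
```

With this sign convention, the left and right halves of the temporal modulation axis separate upward from downward frequency sweeps, as in Figure 1C.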
Thus, when we use this particular time-frequency scale, the temporal modulation frequencies analyzed are perceived predominantly as temporal changes, while higher temporal modulations (those above 50 Hz), which would mediate a percept of pitch, are found along the spectral modulation axis. Using wider frequency filters would cause spectral modulation power that is plotted high on the ordinate (e.g., 5 cycles/kHz, corresponding to a 200 Hz pitch) to appear instead at a correspondingly high temporal modulation (200 Hz) on the abscissa. For the modulation filtering operation described below, we used other time-frequency scales, adapted to each filter's cutoff frequencies, which improved the required spectrogram-inversion step in that process.

The MPS can be obtained from a time-frequency decomposition with a linear frequency axis (yielding spectral modulations in units of cycles/kHz) or from a decomposition with a log frequency axis (yielding spectral modulations in units of cycles/octave). The log frequency axis is a better model of the decomposition that occurs in the auditory periphery, but we found the linear frequency scale to be a better decomposition for describing sounds that have harmonic structure. We suggest that higher-level neurons may be equally well described as representing either linear or log scale frequency [42]. In any case, both representations are useful. To be able to compare our results to other published work, we additionally obtained the speech MPS and psychometric curves using the log-frequency representation. These results are shown in Figure S1.

Synthesis of Degraded Speech

The sentences were degraded by a novel modulation filtering procedure. In brief, the sound is first represented in its spectrographic form using a log spectrogram calculated with Gaussian windows, as described above. Then a new log spectrogram is obtained by a 2D filtering operation. This filtering operation is performed in the Fourier domain of the modulation amplitude and phase in the following way. First, the 2D FFT of the log spectrogram is calculated. Then the amplitudes of the specific temporal and spectral modulations that we want to filter out are set to zero. The inverse 2D FFT yields the desired filtered log spectrogram. After exponentiation, the spectrogram is then inverted using an iterative spectrogram-inversion algorithm [43]. We then verified the procedure by calculating the spectrogram and MPS of the filtered sound. For a measure of the errors introduced by spectrogram inversion, we squared the differences between the desired spectrogram and the spectrogram obtained, divided by the desired spectrogram power, and summed the resulting values over time and frequency. Across the 100 stimulus sentences in the control condition, the residual error at the end of 20 algorithm iterations averaged 2.5%. When the 100 sentences were low-pass filtered in one step to create stimuli with only the core modulations, the average residual error after the 20 algorithm iterations was 6.3%. The modulation filtering was written in Matlab using modified code from Malcolm Slaney's Auditory Toolbox for the spectrogram inversion routine [44]. The complete program is available on request. The iterative method improves upon earlier overlap-and-add methods, which had to compensate for the retention of phase information that unintentionally preserves some spectral information targeted for removal [7],[8].
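The filtering and inversion loop can likewise be sketched. In the Python sketch below (an illustration under stated assumptions, not the published Matlab program), the stop-bands are (low, high) pairs in cycles/kHz and Hz, and librosa's Griffin-Lim routine stands in for the iterative inversion algorithm of [43]; window shape and amplitude scaling are simplified.

```python
# Illustrative modulation filtering on a log-amplitude spectrogram.
# `log_spec`: rows = frequencies (spacing df, in Hz), columns = time
# frames (hop dt, in seconds), e.g. from the previous sketch.
import numpy as np
import librosa

def filter_modulations(log_spec, df, dt, spec_stop=None, temp_stop=None):
    amp = np.fft.fft2(log_spec)
    smod = np.abs(np.fft.fftfreq(log_spec.shape[0], d=df)) * 1000.0  # cycles/kHz
    tmod = np.abs(np.fft.fftfreq(log_spec.shape[1], d=dt))           # Hz
    if spec_stop is not None:  # e.g. (3.0, 7.0): the notch behind the gender result
        amp[(smod >= spec_stop[0]) & (smod <= spec_stop[1]), :] = 0.0
    if temp_stop is not None:  # e.g. (12.0, np.inf): temporal low-pass at 12 Hz
        amp[:, (tmod >= temp_stop[0]) & (tmod <= temp_stop[1])] = 0.0
    return np.real(np.fft.ifft2(amp))        # the filtered log spectrogram

def resynthesize(filtered_log_spec, hop, win_length, n_iter=20):
    # Exponentiate, then invert the magnitude spectrogram by iterative
    # phase retrieval (Griffin-Lim here, standing in for ref. [43]).
    mag = np.exp(filtered_log_spec)
    return librosa.griffinlim(mag, n_iter=n_iter, hop_length=hop,
                              win_length=win_length)

def inversion_error(desired_mag, achieved_mag):
    # Residual error as defined above: summed squared difference divided
    # by the power of the desired spectrogram.
    return np.sum((desired_mag - achieved_mag) ** 2) / np.sum(desired_mag ** 2)
```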
For the low-pass modulation filtering procedure, the time-frequency scale of the spectrogram was adjusted to the desired modulation-frequency cutoffs of the filter. For example, if the amplitude of spectral modulation frequencies above 2 cycles/kHz was to be set to zero, then using a time-frequency scale in which spectral modulations were represented only up to values approaching 2 cycles/kHz gave better results. In this example, one could use a time-frequency window of 1.25 ms–128 Hz to obtain an MPS with boundaries at 402 Hz and 3.9 cycles/kHz. Such adjustments made the inversion process much more efficient.

Moreover, for low-pass filtering only, one can take this procedure to the extreme and calculate the spectrogram at a time-frequency scale that corresponds exactly to the modulation-frequency cutoff of the filter. In that case, the spectrogram does not require any additional filtering, and the spectrogram-inversion routine can be bypassed altogether. One can instead obtain the filtered sounds directly by taking the amplitude envelope in each frequency band of the spectrogram and using it to modulate a narrowband signal of the same bandwidth and center frequency but unit amplitude. These unit-amplitude narrowband signals can be obtained from Gaussian white noise decomposed through the same spectrographic filter bank [45] or, equivalently, by generating them directly using an analytic signal representation [46]. In the analytic representation, the amplitude is set to 1 and the instantaneous phase is random but band-limited so that the instantaneous frequency remains within the band. In this study, this direct method was used to generate the low-pass modulation-filtered sentences (a sketch follows at the end of this subsection); the modulation filtering that involved notch or band-stop filtering was done with the complete spectrogram filtering and inversion procedure.

In the direct method, the frequency cutoff for temporal frequencies is inversely related to the frequency cutoff for spectral frequencies, but the conjugate boundary was always far from the limits considered here. For example, a 49 Hz low-pass temporal filter had a conjugate spectral frequency cutoff of 32 cycles/kHz, and any temporal filtering with a cutoff frequency below 49 Hz has a spectral modulation cutoff higher than 32 cycles/kHz (Figure 3A and 3B). Because of this relationship, panels C and D of Figure 3 could be merged into one plot that would show a unimodal (inverted-U) psychometric curve as a function of a spectrotemporal cutoff (as in Figure S1). More details on these sound synthesis procedures and on time-frequency scale effects can be found in [46] and [3].

A control (unfiltered) speech sentence was obtained by inverting the unfiltered log spectrogram obtained with the 10 ms–16 Hz time-frequency scale (low-pass experiment) or the 5 ms–32 Hz scale (notch experiment). The control sentences sounded very similar to the original sentences and yielded high levels of intelligibility. Errors calculated during resynthesis depend on the bandwidth of the time-frequency scale. Residual errors in the control case of spectrogram inversion without filtering were barely affected by changing the time-frequency scale from 5 ms–32 Hz to 1.25 ms–128 Hz (2.92% vs. 2.52% after 20 iterations, averaged over all 100 sentences).
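The direct method, sketched below for illustration (the band edges, the filter order, and the Butterworth stand-in for the Gaussian filter bank are assumptions): each band envelope of the speech modulates a unit-amplitude carrier whose instantaneous phase is random but band-limited, so the effective temporal modulation cutoff follows from the band bandwidth via the time-frequency tradeoff.

```python
# Illustrative "direct method" for low-pass modulation filtering.
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def direct_lowpass_synthesis(sound, fs, band_edges):
    # band_edges: (low, high) pairs in Hz tiling the speech band; narrower
    # bands imply a lower conjugate temporal modulation cutoff.
    rng = np.random.default_rng(0)
    noise = rng.standard_normal(len(sound))
    out = np.zeros(len(sound))
    for lo, hi in band_edges:
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        env = np.abs(hilbert(sosfiltfilt(sos, sound)))  # band amplitude envelope
        carrier = sosfiltfilt(sos, noise)               # band-limited random phase
        carrier /= np.abs(hilbert(carrier)) + 1e-12     # set amplitude to 1
        out += env * carrier
    return out
```

In principle, tiling the speech band with bands of a given width sets the spectral modulation cutoff directly and the conjugate temporal cutoff implicitly, in the inverse relationship described above.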
Similarly, in the case of temporal and spectral low-pass filtering leaving only the core modulations, changing the time-frequency scale from 5 ms–32 Hz to 1.25 ms–128 Hz made only a minimal difference in the residual errors (5.49% vs. 6.29%). However, in the case of low-pass spectral modulation filtering with a 2 cycles/kHz cutoff, the 128 Hz time-frequency scale would double the residual errors (12.18% vs. 6.41%). Using the 128 Hz time-frequency scale for temporal low-pass filtering with a 6 Hz cutoff would similarly increase the residual error (5.64% vs. 2.02%).

Experimental Procedures

All sounds were presented through headphones (Sennheiser HD265 Linear) to subjects seated in a sound-attenuated chamber. An audiogram from 30 Hz to 15 kHz was obtained initially for each subject using an adaptive staircase procedure (Tucker-Davis Technologies PsychoSig software), and subjects whose thresholds were 20 dB above normal were excluded. For the comprehension test, the sentences were embedded in Gaussian white noise (0–20 kHz). The white noise lasted 6 seconds, and the sentences (filtered and control) started at random times between 300 ms and 2 s after the onset of the noise. The white noise was played at a level of 65 dB SPL (B&K level meter, A-weighting, measured with a headphone coupler from B&K). The modulated speech sentences were played at 3 different levels: 72 dB, 67 dB, and 62 dB SPL (B&K level meter, A-weighting, peak level with slow integration, headphone coupler). The 5 dB attenuation steps were obtained using a programmable attenuator (Tucker-Davis Technologies). The signal-to-noise ratios (SNR) calculated from the SPL measurements of the speech and noise signals were therefore +7, +2, and −3 dB. We also calculated the SNR in terms of the RMS values of the sound pressure waveforms of the noise and speech and found almost identical values (6.7, 1.7, and −3.3 dB). These SNRs were chosen in pilot data to yield complete sigmoidal psychometric tuning curves in the low-pass filtered conditions and almost perfect speech intelligibility in the control condition [47]. Furthermore, these SNRs cover the 3 dB SNR level that presents little difficulty for normal listeners but reduces comprehension in the hearing impaired [48],[49].

Subjects listened to the sentences at their own pace, pressing a button to elicit the next stimulus. They were instructed to type whatever words they heard, followed by whether they perceived the speaker's gender to be male or female. Subjects were asked to guess if necessary, but not to force sentences into making sense if the words did not make sense together. The typed response files were scored for the percentage of words reported correctly, with an algorithm to compensate for small spelling errors. Baseline performance under control conditions at +2 dB SNR was around 90%. During an experiment, each subject heard all 100 sentences in the corpus without repetition, so that each sentence was pseudorandomly assigned to only one normal (control) or filtered condition at one level. The SNR levels and the filtering conditions were presented in pseudorandom order. The notch-filtered sentences were presented only at +2 dB SNR.
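The SNR bookkeeping in this section reduces to a few lines (an illustrative check, not part of the original procedure):

```python
# Speech at 72/67/62 dB SPL against noise at 65 dB SPL gives +7/+2/-3 dB SNR.
print([spl - 65 for spl in (72, 67, 62)])   # -> [7, 2, -3]

# RMS-based SNR on the waveforms themselves; the paper reports 6.7, 1.7,
# and -3.3 dB, close to the SPL-based values.
import numpy as np

def snr_db(speech, noise):
    rms = lambda x: np.sqrt(np.mean(np.square(x)))
    return 20.0 * np.log10(rms(speech) / rms(noise))
```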
Supporting Information

Audio S1. Example sentence under control condition. Mp3 file after conversion from the original wave file of an example stimulus sentence in Figure 1A. No modulation filtering was performed in this condition, which controls for spectrogram inversion. (0.09 MB MPG)

Audio S2. Low-pass modulation filtering at 0.5 cyc/kHz. Mp3 of an example sentence (Figure 3E) with the most extreme spectral modulation filtering (low-pass cutoff of 0.5 cyc/kHz). (0.09 MB MPG)

Audio S3. Low-pass modulation filtering at 3 Hz. Mp3 of the example sentence with the most extreme temporal modulation filtering tested (low-pass cutoff of about 3 Hz; Figure 3F). (0.09 MB MPG)

Audio S4. Low-pass modulation filtering at 4 cyc/kHz. Mp3 of the example sentence with the spectral modulation filtering at which comprehension became significantly worse (cutoff 4 cyc/kHz; Figure 3G). (0.09 MB MPG)

Audio S5. Low-pass modulation filtering at 12 Hz. Mp3 of the example sentence with the temporal modulation filtering at which comprehension became significantly worse (cutoff 12 Hz; Figure 3H). (0.09 MB MPG)

Audio S6. Example sentence with core modulations. Mp3 of the example sentence containing only the core of essential modulations below 7.75 Hz and 3.75 cyc/kHz (Figure 4G). (0.09 MB MPG)

Audio S7. Spectral notch filter producing gender misidentification. Mp3 of the example sentence after spectral modulations between 3 and 7 cyc/kHz were filtered out (Figure 4F). Listeners misreported the gender of about half the female speakers. (0.09 MB MPG)

Figure S1. Modulation power spectrum and performance with logarithmic frequency scale. (A) The top panels of the figure show the modulation power spectrum (MPS) of speech (American English) calculated from a time-frequency representation of the sound using a logarithmic frequency filter bank (log-f). The modulation spectrum is shown for male and female speakers. As was the case for the modulation spectrum estimated with a linear frequency filter bank (Figures 1 and 2), the log-f speech modulation spectrum shows a power-law distribution of energy and some degree of non-separability between spectral and temporal modulations. However, in the linear modulation spectrum the spectral modulations (in cycles/kHz) distribute into clearly separate regions corresponding to pitch and formant energy (Figure 2), whereas in the log-f modulation spectrum the corresponding modulations overlap in a single triangular region below 4 cycles/octave. In addition, in this speech corpus at this time-frequency scale, the harmonic structure of women's vocalic sounds creates a repeated pattern of spectral modulations. The log-f spectrogram was obtained with logarithmically spaced Gaussian filters with a bandwidth of 0.0138 octaves. (B) The line graph replots on a linear spectral modulation axis the comprehension of sentences after log-f low-pass filtering. The resulting psychometric curve includes low-pass filter cutoffs from 1/4 cycles/octave to 256 cycles/octave, but these can be interpreted as low-pass spectral filtering on the left side of the peak and low-pass temporal filtering on the right side of the peak, as follows. The sound-pressure re-synthesis of these sentences used the direct method, in which the filtered amplitude was obtained by decomposing the sound into a set of narrowband signals with the frequency bandwidth given by the modulation frequency cutoff, and the filtered phase was obtained from Gaussian white noise decomposed through the same filter bank [40],[41].
For high modulation frequency cutoffs, because of the time-frequency tradeoff, this method effectively low-pass filters the amplitude envelope. In a log frequency representation, the temporal frequency cutoff depends on the center frequency. We show the corresponding temporal cutoff for the frequency band centered at 1 kHz in parentheses under the relevant x-axis labels. The left side of the figure can therefore be compared to the psychometric curve shown in Figure 3C, and the right side to Figure 3D. The left side shows that speech comprehension remains very good for representations with filter bands as wide as 0.25 octaves (the sigma parameter corresponding to a 2 cycles/octave cutoff [14]) but that it degrades rapidly with wider frequency bands, particularly in noisy conditions. As in our interpretation of the linear frequency results, this steep decline occurs when the spectral modulations that correspond to formants and formant transitions are filtered out. On the right side of the curve, the critical temporal modulation cutoffs are approximately twice as large in this plot as in the linear frequency plot, suggesting that humans cannot easily use the faster temporal information present in filters above 500 Hz to compensate for the loss of that information in the lower frequency bands. (4.37 MB TIF)
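The log-f filter bank behind Figure S1 can also be sketched: Gaussian filters on a log2 frequency axis with the 0.0138-octave bandwidth quoted above (the frequency range and the matrix application to a linear-frequency spectrogram are assumptions).

```python
# Illustrative log-f Gaussian filter bank (bandwidth from the text above).
import numpy as np

def log_f_filterbank(freqs_hz, f_min=100.0, f_max=8000.0, bw_oct=0.0138):
    # One Gaussian filter per center frequency, centers spaced bw_oct apart
    # on a log2 axis, each with a 0.0138-octave bandwidth.
    centers = 2.0 ** np.arange(np.log2(f_min), np.log2(f_max), bw_oct)
    logf = np.log2(np.maximum(np.asarray(freqs_hz, dtype=float), 1.0))
    return np.exp(-0.5 * ((logf[None, :] - np.log2(centers)[:, None]) / bw_oct) ** 2)

# Given a linear-frequency magnitude spectrogram `mag` whose rows correspond
# to `freqs_hz`, the log-f spectrogram is log_f_filterbank(freqs_hz) @ mag;
# its log amplitude feeds the same 2D FFT as before, now yielding spectral
# modulations in cycles/octave.
```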

              Sound texture perception via statistics of the auditory periphery: evidence from sound synthesis.

Rainstorms, insect swarms, and galloping horses produce "sound textures"--the collective result of many similar acoustic events. Sound textures are distinguished by temporal homogeneity, suggesting they could be recognized with time-averaged statistics. To test this hypothesis, we processed real-world textures with an auditory model containing filters tuned for sound frequencies and their modulations, and measured statistics of the resulting decomposition. We then assessed the realism and recognizability of novel sounds synthesized to have matching statistics. Statistics of individual frequency channels, capturing spectral power and sparsity, generally failed to produce compelling synthetic textures; however, combining them with correlations between channels produced identifiable and natural-sounding textures. Synthesis quality declined if statistics were computed from biologically implausible auditory models. The results suggest that sound texture perception is mediated by relatively simple statistics of early auditory representations, presumably computed by downstream neural populations. The synthesis methodology offers a powerful tool for their further investigation.

Author and article information

Journal: Journal of New Music Research
Publisher: Informa UK Limited
ISSN: 0929-8215 (print), 1744-5027 (online)
Publication dates: January 29 2016, February 08 2016
Volume 45, Issue 1, Pages 27–41
DOI: 10.1080/09298215.2015.1132737
© 2016
