Introduction Many animals communicate using sound. Often, brief sounds are produced to warn of danger or mediate aggressive encounters. Some species, however, produce long sequences of vocalizations often called “songs.” Most commonly, these long sequences are generated as a part of courtship. For example, many insects and amphibians  advertise their presence and identity with a single type of utterance—which, depending on the species, might be described as a chirp, click, or whine—repeated several times to form a “phrase,” with silent gaps between phrases. The utterance, its repetition rate, and the number of repetitions in a phrase are characteristic of the species . More complex vocalizations are observed in many birds , as well as in a few mammals such as whales  and bats . These species generate multiple types of sounds organized in more intricate phrases. Rodents produce a variety of social vocalizations, including vocalizations audible to humans, like postpartum sounds and distress calls, as well as ultrasonic vocalizations [5,6]. In mice, ultrasonic vocalizations utilize frequencies higher than 30 kHz , and therefore cannot be detected directly by human ears. A number of studies have shown that mice produce ultrasonic vocalizations in at least two situations: pups produce “isolation calls” when cold or when removed from the nest , and males emit “ultrasonic vocalizations” in the presence of females or when they detect their urinary pheromones [6,9–11]. Most commonly, these sounds have been recorded using a detector with narrow frequency tuning [9,10], which suffices to estimate the amount of vocalization. However, because of its narrow frequency tuning, such a detector does not record the acoustical details of these vocalizations. While numerous studies have focused on the circumstances leading to ultrasound production, few have examined the sounds themselves. 
Sales  observed that these vocalizations consisted of a series of discrete utterances, with species-specific differences in vocalizations. Some diversity was also noted among the utterances within a species [6,7], but it was not determined whether this latter variability was continuous—as in the case, for example, of the “random” variability observed when a single word is spoken many times—or whether the utterances fall into distinct categories. In a recent quantitative study of mouse vocalizations, Liu et al.  studied changes in pup vocalizations during the first 2 wk after birth, and compared these to adult vocalizations. However, this study focused only on the aggregate properties of vocalizations, measuring parameters such as median pitch and call rate, which, if applied to humans, would be more analogous to “voice” than to speech. To date, no study that we know of has examined whether the discrete utterances consist of distinct syllable types, or whether these vocalizations have significant temporal structure. Here, we provide a quantitative description of the ultrasonic vocalizations of the adult male mouse, and show that they display unexpected richness, including several syllable types organized into phrases and motifs. Thus, these vocalizations display the characteristics of song [1,3,13]. Different males, even though genetically identical, show small but significant differences in syllable usage and the temporal structure of their songs. These results indicate that communication among mice may be more complex than previously appreciated. Because of the ubiquity of the mouse for physiological and genetic investigations, these observations may lead to new opportunities in studies of the biological basis of song production and perception. Terminology As the terminology used to describe animal vocalizations is varied, we adopt the following definitions. 
A “syllable” is a unit of sound separated by silence from other sound units ; it may consist of one or more “notes,” continuous markings on a sonogram. A “syllable type” is a category of syllable, observed regularly in the animal's vocalization, distinct from other syllable types. A “phrase” is a sequence of syllables uttered in close succession. A “phrase type” or “motif” is a sequence of several syllables, falling into one or more syllable types, where the entire sequence is observed repeatedly in the animal's vocalization. The term “song” has been used with a variety of connotations, so that Broughton  offers three different definitions of song: a sensu latissimo, a “sound of animal origin which is not both accidental and meaningless,” which includes relatively simple vocalizations often described as “calls”; a sensu stricto, “a series of notes [or syllables], generally of more than one type, uttered in succession and so related as to form a recognizable sequence or pattern in time”; and a sensu strictissimo, “a complete succession of periods or phrases,” in which a song consists of several distinct motifs, often delivered in a characteristic sequence. Results Listening to Ultrasonic Vocalizations To induce ultrasonic vocalizations, male mice of the B6D2F1 strain were presented with sex-specific odors applied on cotton swabs (Figure 1). We tested dilute urine of either sex (BALB/c strain) and mixtures of urine from both sexes. (The correspondence between stimulus identity and vocal response will be reported elsewhere.) We recorded all sounds in the chamber with a microphone with flat frequency response from 20 Hz to 100 kHz. While these vocalizations are well beyond the range of human hearing, we make them audible through two techniques. Most straightforward is to play them back slowly. When slowed 16×, these vocalizations sound like a series of breathy whistles (Audio S1). 
However, slow playback makes it difficult for human listeners to develop an appreciation of the temporal sequence of the vocalizations. Using a phase vocoder algorithm , the pitch of these vocalizations can be dropped several octaves without lengthening the duration of the playback. These pitch-shifted vocalizations are reminiscent of birdsong (Audio S2). Readers are urged to listen to these recordings. Elementary Features of Vocalizations Male mouse ultrasonic vocalizations consisted of a rapid series of “chirp-like” syllables in the 30–110 kHz band (Figure 1). Syllables were of varying duration (approximately 30–200 ms), uttered at rates of about ten per second. Most syllables involved rapid sweeps in frequency, with rates of approximately 1 kHz/ms typical. Over tens of seconds, periods of closely spaced syllables alternated with periods of silence. These features of adult male vocalizations, and their analogs for the isolation calls of mouse pups, have been previously described [7,12]. The microphone recorded a variety of sounds in the test chamber, including noises from movement, gnawing, contact with the cage wall, audible squeaks, and ultrasonic vocalizations. For the purposes of this study, we excluded sounds other than ultrasonic vocalizations. The majority of extraneous sounds fell below 30 kHz, and were excluded by selecting the appropriate frequency band. However, some sounds, particularly brief “snaps,” penetrated into the frequency band of the ultrasonic vocalizations. We developed an automated algorithm to recognize ultrasonic vocalizations in terms of their generic features. Subjectively, the algorithm appears no worse than a well-trained human in identifying these vocalizations (see Materials and Methods; Figure 1). Features of Syllables: Pitch Changes As reported previously , inspection (Figure 1) suggests that some syllables involve relatively sudden, large changes (“jumps”) in frequency. 
To determine whether these frequency jumps are stereotyped or random, we analyzed a collection of 750 syllables uttered by one mouse in a single 210-s trial. We simplified our description of each syllable by extracting the dominant frequency (the “pitch”) as a function of time (Figure 2A). For each syllable, we compared the pitch at one moment with the pitch in the next time bin, approximately 1 ms later. These pitch pairs were pooled for all 750 syllables, resulting in a total of 31,303 consecutive pitch pairs. This analysis (Figure 2B) revealed four distinct clusters of pitch changes. The long cluster along the diagonal corresponds to the gradual shift in pitch occurring at most time points in all syllables. Two distinct off-diagonal clusters reveal large, stereotyped jumps to or from comparatively low frequencies (35–50 kHz). These downward (“d”) and upward (“u”) jumps are often paired in a syllable (see below and insets for Figure 2B), and will be collectively described as “low jumps.” The cluster just below the diagonal, containing transitions from 70–90 kHz down to 55–70 kHz, results from a third type of jump (“high jump,” or “h”). These jumps were often, but not exclusively, associated with a brief “grace note” at the beginning of a syllable (see jump labeled “h” in lower inset, Figure 2B). These pitch jumps were identified in Figure 2B from a single 210-s recording from one mouse. To determine whether these jumps are stereotypic features of the ultrasonic vocalizations of all male mice, we performed the same analysis for a 210-s trial from each of 45 different males. The pitch changes in adjacent time bins are pooled across mice in Figure 2C. Both the number of clusters and their positions and sizes are essentially unchanged, and examples of all three types of jumps were broadly distributed across mice (Figure 2D). Thus, at least for similarly aged males of the B6D2F1 strain, these pitch jumps are a universal feature of ultrasonic vocalizations. 
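The pooling of consecutive pitch pairs and the labelling of jumps described above can be illustrated with a minimal Python sketch. The band edges are taken loosely from the frequency ranges quoted in the text, and the 10-kHz minimum drop required for an "h" jump is a hypothetical threshold; the criteria actually used are those drawn in Figure 2C and are not reproduced here. All function and constant names are illustrative.

```python
# Hypothetical band edges in kHz, read off the ranges quoted in the text.
# The criteria actually used in the paper are those illustrated in Figure 2C.
LOW_BAND = (35.0, 50.0)   # low-frequency band reached by "d" and left by "u"
HIGH_MIN = 65.0           # assumed lower edge of the band that "d" jumps start from
H_FROM = (70.0, 90.0)     # "h" jumps start in this band...
H_TO = (55.0, 70.0)       # ...and land in this one
MIN_H_DROP = 10.0         # assumed minimum drop, so gradual sweeps are not labelled "h"


def in_band(freq, band):
    """True if freq (kHz) lies inside the closed interval band."""
    return band[0] <= freq <= band[1]


def pitch_pairs(pitch):
    """Pair each pitch sample with the sample in the next time bin."""
    return list(zip(pitch[:-1], pitch[1:]))


def classify_pair(f_now, f_next):
    """Label one pitch pair as 'd', 'u', 'h', or '' (no jump)."""
    if f_now >= HIGH_MIN and in_band(f_next, LOW_BAND):
        return "d"
    if in_band(f_now, LOW_BAND) and f_next >= HIGH_MIN:
        return "u"
    if (in_band(f_now, H_FROM) and in_band(f_next, H_TO)
            and f_now - f_next >= MIN_H_DROP):
        return "h"
    return ""


def syllable_type(pitch):
    """Name a syllable by its ordered sequence of jumps; 'SS' if it has none."""
    labels = "".join(classify_pair(a, b) for a, b in pitch_pairs(pitch))
    return labels or "SS"


def pool_pairs(syllables):
    """Pool the consecutive pitch pairs of many syllables, as for Figure 2B."""
    pooled = []
    for pitch in syllables:
        pooled.extend(pitch_pairs(pitch))
    return pooled


# A made-up pitch waveform (kHz) with a grace note, a downward and an upward low jump.
example = [80, 62, 63, 75, 45, 44, 46, 78, 77]
print(syllable_type(example))  # prints "hdu"
```

Plotting the pooled pairs from `pool_pairs` as a scatter of pitch at one time bin against pitch at the next would give the kind of display in which the off-diagonal clusters appear.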
Pitch Jumps and Mechanisms of Sound Production Many syllables with low jumps display both a fundamental frequency and a faint first harmonic during the low-frequency period (Figure 3A; see also Figures 1 and 2A). The frequency of the harmonic is almost precisely twice that of the fundamental, suggesting the involvement of a resonator in the production of these sounds. A priori, this resonator might be the vocal folds of the larynx. However, based on the effect of partial replacement of air with helium, Roberts  argued that these sounds are not produced by the vibration of vocal cords. Instead, he proposed that ultrasound arises from an aerodynamic whistle, and showed that mechanical whistles can produce sounds similar to the examples described by Sales , including pitch jumps. While our recordings appear largely consistent with Roberts's results, several features of these vocalizations indicate that their production is more sophisticated than that of a whistle from a rigid, static pipe. The rigid whistles investigated by Roberts had a characteristic relationship between frequency and fluid velocity . Frequency was fairly stable over a range of velocities, and would suddenly jump to a new frequency at yet higher or lower velocities. In contrast, the pitch of mouse vocalizations is modulated considerably, in both a continuous and discrete (jump) fashion. Despite their stereotyped form, jumps were not obligatory upon reaching a particular frequency. While down-type jumps began from frequencies of 65–80 kHz (see Figure 2B), these frequencies were well-sampled even in syllables that lack these jumps (Figure 3B). Furthermore, if jumps were produced by changes in air velocity, one might expect to see differences in vocal power between cases where jumps do and do not occur. 
In contrast with this expectation, the power distributions of syllables both with and without “d” jumps overlap considerably (Figure 3C), although variability in the mouse's head position and orientation relative to the microphone could obscure a true relationship. Finally, the fine-scale temporal structure of pitch jumps appears to be inconsistent with the nonlinear properties of purely static whistles. During a downward low jump, the pitch of the preceding phase overlaps in time with the pitch in the succeeding phase (Figure 3A), often by 5–10 ms. This behavior is apparently not observed in pitch jumps arising from mode-locking nonlinearities , where changes in pitch are nearly instantaneous. In a few cases, both tones were present simultaneously for longer periods, with one frequency modulated and the other nearly fixed (Figure 3D). In birdsong, similar observations were used by Greenewalt  to posit two sites of sound production—specifically, that birds could independently control the left and right sides of their syrinx. This assertion was later confirmed directly . Examples such as Figure 3D may indicate that mice have at least two sites of ultrasound production. However, the strength of this conclusion is tempered by our incomplete knowledge of the nonlinear properties of aerodynamic whistles . Classifying Syllables into Distinct Types Because pitch jumps exist in three distinct categories, their presence or absence serves as a basis for classifying individual syllables into types. However, it is possible that other features of these vocalizations might also be a basis for classification. We therefore analyzed these syllables using multidimensional scaling, a technique that has been used previously to classify syllables in birdsong . 
Multidimensional scaling provides a method to represent high-dimensional data in lower dimensions; it takes as an input the pairwise “distance” between any two pitch waveforms, and attempts to faithfully represent the set of all pairwise distances by proper placement of points, each representing a single syllable, in two dimensions (Figure 4A). Because inspection suggested that a given syllable type can be uttered quickly or slowly, we first aligned the pairs by warping their time axes to place the pitch waveforms in maximal correspondence with each other (Figure 4A; ). We also used a variant of multidimensional scaling, called isomap , which assembles global information from local features. The isomap analysis revealed the presence of several clusters, indicating distinct syllable types. The most prominent distinction is illustrated in Figure 4B, with an almost perfect correspondence between cluster membership and the presence or absence of low-jump transitions. Closer examination of the cluster representing syllables containing low jumps reveals further clusters within this overall category. An example is shown in Figure 4C, in which syllables again group into types that can be described in terms of their pitch jumps: one distinct cluster contains almost entirely syllables with an “h” jump followed by a “d” jump. Further projections (not shown) confirm the presence of additional clusters, which also correspond to particular sequences of pitch jumps. Therefore, general classification techniques confirm that syllables are naturally grouped by their pitch jumps. In fact, from the isomap analysis we have not found evidence for any other means to categorize syllables; in all cases we have examined, clear isomap clusters correspond to types defined by their sequence of pitch jumps. However, it remains possible that further subtypes exist, but that the isomap analysis fails to reveal these clusters. 
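The time-warping alignment and pairwise distance that feed the multidimensional scaling can be sketched as follows. This is a textbook dynamic time warping with a squared-difference cost, written in plain Python; averaging the squared differences over the length of the warping path is an assumption about how the root mean squared distance is normalized, and the function names are illustrative rather than taken from the original analysis.

```python
import math


def dtw_path(a, b):
    """Align sequences a and b by dynamic time warping.

    Returns a list of (i, j) index pairs matching a[i] with b[j], chosen to
    minimize the summed squared difference along a monotonic path.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Trace back from the end, always stepping to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
    path.reverse()
    return path


def rms_distance(a, b):
    """Root mean squared difference between a and b after time warping."""
    path = dtw_path(a, b)
    total = sum((a[i] - b[j]) ** 2 for i, j in path)
    return math.sqrt(total / len(path))


# The same made-up pitch contour (kHz) uttered quickly and slowly.
fast = [60, 62, 64, 62, 60]
slow = [60, 60, 62, 62, 64, 64, 62, 62, 60, 60]
print(rms_distance(fast, slow))  # prints 0.0
```

A matrix of such distances over all syllable pairs is the kind of input that multidimensional scaling or isomap takes.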
We therefore focused on the simplest syllable type, with no pitch jumps at all. These syllables take a variety of forms, some of which are illustrated in Figure 5A. We noted that many had an oscillatory appearance. We therefore fit each pitch waveform to a sine wave, scaling and shifting both the time and frequency axes for maximal alignment. (We did not permit local time warping, as used in Figure 4.) The quality of the fit could be assessed by scaling and shifting each pitch waveform to a common axis, revealing that the vast majority of these waveforms lie on top of each other, as well as the underlying sine wave (Figure 5B). Based on this result, we call syllables lacking any pitch jumps “sinusoidal sweeps” (SSs). This analysis suggests that the pitch waveforms of SS syllables can be accurately described in terms of five variables (see Materials and Methods): the starting and ending phases, the rate of oscillation, the center frequency, and the pitch sine amplitude. Analysis of these parameters reveals that most SSs begin during (or just before) the rising phase of the sine wave (Figure 5C), and that a large subset terminate at the peak of the sine wave (Figure 5D). There is also a strong inverse relationship between the oscillation rate and the oscillation amplitude (Figure 5E; see example in bottom two waveforms in Figure 5A). An analogous inverse relationship has been found in birdsong, between the trill rate and the amplitude of pitch variation . In birdsong, this relationship has been interpreted as evidence of a performance limit in the rate at which frequency can be modulated by changes in beak conformation. An analogous limit may constrain a mouse's ability to modulate the frequency of its whistle. While syllables are naturally grouped by their pitch jumps, and indeed we have not found any clear means of classifying them in a different way, it remains possible that other groupings exist. 
In particular, short stretches of a recording sometimes seem to provide evidence for further subtypes; an example is shown in the next section (Figure 6A). Table 1 shows a breakdown by prevalence of the most common syllable types in mouse ultrasonic vocalizations. Temporal Sequencing of Syllables In sonograms of mouse vocalizations, complex syllable sequences can be identified: Figure 6A shows an example of a phrase in which three “hdu” syllables with descending low-frequency bands (labeled “A”) are followed by six “hdu” syllables with ascending low-frequency bands (labeled “B”); the phrase is finished off by an “h” syllable (almost a SS, but for the brief grace note), an A-type “hdu,” and an SS (Audio S3). An example of a motif can be seen in Figure 6B, in which a phrase beginning with 2–3 SSs followed by 6–8 “du” syllables is repeated three times. The consistency of this repeated sequence, in the context of the whole, is easily noted in pitch-shifted playbacks (Audio S4). Finally, there are regularities in the syllable types over longer time scales. Figure 6C shows an example of a trial that begins with a series of SSs, has a middle period with many syllables containing low jumps, and ends with repeated blocks of “h” syllables. To determine whether such examples are statistically significant, we investigated the temporal structure of these vocalizations quantitatively in terms of two models of syllable selection. To simplify the analysis, we grouped syllables into only two categories, depending on whether they did (“1”) or did not (“0”) contain one or more low jumps. We considered whether individual syllables might be selected randomly. In the first model, we tested whether the probability of selecting a syllable was based purely on the prevalence of each type, so that each syllable is selected independently of all others. 
In the second model, the selection probability depended on the identity of the previous syllable (Figure 7A): from the data, we calculated the conditional probability $p_{i \to j}$ to choose a syllable of type $j$ after a syllable of type $i$ ($i, j = 0, 1$). We also used a third state (a “gap”) to represent a silent period lasting more than 0.5 s, to ensure that the analyzed state transitions occurred within a phrase. Omitting the gap state from the model did not qualitatively change the results. We then examined the prevalence of all possible three-syllable combinations (see Materials and Methods) in terms of these two models. As shown in Figure 7B, the first model, based purely on prevalence, does a poor job of predicting the distribution of three-syllable combinations (p 95%) of human-identified syllables, with systematic omission occurring only on the faintest and briefest syllables. False positives were encountered so rarely (two clear examples in 10,500 s of recording) that it was difficult to estimate their frequency, but they were clearly rarer than true syllables by several orders of magnitude. The algorithm also identified numerous vocalizations that were initially missed by a human observer, but which proved upon closer inspection to be correctly identified (verified graphically and by audible playback). The algorithm also identified the timing of the beginning and end of each syllable with high accuracy; occasional discrepancies with a human observer arose from interfering sounds or when the beginning or end of the syllable was unusually faint. Pitch was defined as the dominant frequency as a function of time, discarding periods of dropout. Note that pitch was occasionally corrupted by other noises, contributing particularly to background “hash” in Figure 2. The criteria used to define the three pitch jump types “d,” “u,” and “h” are illustrated in Figure 2C. Alignment of pairs of pitch waveforms (see Figure 4) was performed by dynamic time warping.
The distance between any two pitch waveforms was defined as the root mean squared distance between the aligned waveforms. The isomap analysis of pitch waveforms used a neighborhood distance criterion of 3 kHz; in Figure 4 and other such figures, only the largest connected component is shown.
In fitting SSs to a sine wave (see Figure 5), the sine wave was described in terms of the following parameters: $y(t) = \bar{y} + A \sin(2\pi f t + \phi_0)$, where $\phi_0$ is the starting phase, $f$ is the rate of oscillation, $\bar{y}$ is the center frequency, and $A$ is the pitch sine amplitude. The total duration of the syllable, $T$, defines the ending phase $\phi_{\text{end}}$ by $\phi_{\text{end}} = 2\pi f T + \phi_0$.
In analyzing the temporal structure of mouse songs (see Figure 7), the prevalence of a syllable type was defined as follows: if $n_0$ and $n_1$ are the numbers of type 0 and type 1 syllables, respectively, then the prevalence of type 0 (within that trial) is defined as $p_0 = n_0/(n_0 + n_1)$. The prevalences of other types are defined similarly. The prevalence of a particular transition, for example, $p_{1 \to 1}$, is defined analogously in terms of the numbers of each transition type $n_{1 \to 0}$ and $n_{1 \to 1}$ observed during the trial. In Figure 7B, sequences interrupted by a gap were discarded. The expected number of a given three-syllable combination “abc” was calculated as $N p_a p_b p_c$ for the “syllable prevalence” model, and as $N p_a p_{a \to b} p_{b \to c}$ for the transition-probability model, where $N$ is the total number of three-syllable combinations.
The analysis of syllable usage across mice (see Figure 8B) defined the distance between trials in terms of the differences in percentage utilization of each syllable type. More precisely, if $p_{i1}$ is the fraction of syllables of type $i$ used in trial 1, and $p_{i2}$ is the fraction of the same syllable type in trial 2, then $d_{12} = \langle |p_{i1} - p_{i2}| \rangle_i$. The pairwise distances were used to project into two dimensions using isomap, much as schematized for pitch waveforms in Figure 4A (rightmost panel).
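The prevalence, transition, and expected-count definitions above can be made concrete with a short Python sketch. It assumes that each phrase is supplied as a string of "0" and "1" labels already split at gaps, so that only within-phrase transitions and triples are counted; the explicit gap state used in the paper is left out for brevity, and all function names are illustrative.

```python
from collections import Counter


def prevalence(phrases):
    """p_a = n_a / (n_0 + n_1): fraction of syllables of each type."""
    counts = Counter(s for phrase in phrases for s in phrase)
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}


def transition_probs(phrases):
    """p_{a->b}: fraction of transitions out of type a that go to type b."""
    pair_counts = Counter(
        (a, b) for phrase in phrases for a, b in zip(phrase, phrase[1:])
    )
    out_counts = Counter()
    for (a, _), c in pair_counts.items():
        out_counts[a] += c
    return {(a, b): c / out_counts[a] for (a, b), c in pair_counts.items()}


def triple_counts(phrases):
    """Observed counts of every three-syllable combination within a phrase."""
    return Counter(
        phrase[k:k + 3] for phrase in phrases for k in range(len(phrase) - 2)
    )


def expected_counts(phrases):
    """Map each triple 'abc' to (observed, N*p_a*p_b*p_c, N*p_a*p_{a->b}*p_{b->c})."""
    p = prevalence(phrases)
    t = transition_probs(phrases)
    observed = triple_counts(phrases)
    n_total = sum(observed.values())
    types = sorted(p)
    table = {}
    for a in types:
        for b in types:
            for c in types:
                by_prevalence = n_total * p[a] * p[b] * p[c]
                by_transition = n_total * p[a] * t.get((a, b), 0.0) * t.get((b, c), 0.0)
                table[a + b + c] = (observed.get(a + b + c, 0), by_prevalence, by_transition)
    return table


# A made-up phrase: three syllables without low jumps, then three with.
for triple, row in expected_counts(["000111"]).items():
    print(triple, row)
```

Comparing the observed column with each expected column, for example with a chi-squared statistic, is one way to ask which model better accounts for the three-syllable combinations; the specific test used for Figure 7B is not restated here.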
The isomap analysis used a local neighborhood definition consisting of the five closest points; this criterion incorporated all trials into a single connected component. The bootstrap analysis of the spread in transition probabilities across individuals (see Figure 8C) was performed as follows: starting from the median value (calculated separately for each condition, 0→1 or 1→0, and for each mouse), we calculated the absolute deviation for each trial. We then calculated the mean absolute deviation across all mice, conditions, and trials. We compared this mean deviation to the same quantity calculated from synthesized datasets in which the singers' identities were scrambled across trials.
Supporting Information
Audio S1 Slowed (16×) Playback of the 2-s Section Expanded in the Lower Panel of Figure 1 (977 KB WAV).
Audio S2 Pitch-Shifted (16×) Playback of the 2-s Section Expanded in the Lower Panel of Figure 1 (61 KB WAV).
Audio S3 Pitch-Shifted (16×) Playback of the Phrase in Figure 6A (48 KB WAV).
Audio S4 Pitch-Shifted (16×) Playback of a Longer Segment of Song. The triply repeated phrase shown in Figure 6B begins at 40 s into the recording (1.6 MB WAV).
Audio S5 Recording of Juvenile Swamp Sparrow Song. For comparison between bird and mouse songs. Courtesy of Peter Marler (2.4 MB WAV).