Video Mail Retrieval Using Voice: An Overview of the Stage2 System

This paper outlines the Video Mail Retrieval (VMR) project at Cambridge University. The goal of the VMR project is to develop an application for the retrieval of spoken documents in multimedia systems. Speech documents pose a particular problem for retrieval since their contents are unknown. The VMR project seeks to address this problem by combining state-of-the-art speech recognition with established document retrieval technologies to provide an effective and efficient retrieval tool. Experimental results with a small spoken message collection show that retrieval precision is somewhat dependent on the generality of the acoustic modelling used. For talker-dependent acoustic modelling, retrieval performance is around 95% of that observed when text transcriptions of the same files are used. However, even with completely open-user talker-independent acoustic models, retrieval performance of about 75% of text can be obtained.


Introduction
This paper discusses work to date on the Video Mail Retrieval (VMR) project at Cambridge University. Our objective is to develop a novel multimedia application for the retrieval of spoken documents. The project seeks to combine state-of-the-art speech recognition and document retrieval technologies for spoken message retrieval, envisaged as one function among many provided on a workstation equipped with multimedia video facilities.
The paper outlines the problems involved, specific strategies being deployed to overcome these, the current system implementation, and the design and results of our retrieval tests to date. We demonstrate that the straightforward probabilistic methods established for text retrieval can be naturally extended to the speech domain; and also that current speech recognition technology can support good message retrieval performance. Here we present results to support these claims and summarise our work to the end of Stage 2 of the VMR project.
Section 2 presents background details of the VMR project: to provide a context for the subsequent discussion of the distinctive problems to be overcome in speech retrieval, and to motivate our own approach. Section 3 describes the specific objectives of the VMR project, section 4 outlines the project strategy, and section 5 considers problems encountered in spoken document retrieval. Details of our experimental investigations to date are given in section 6. Finally, section 7 comments on our work to date and summarises our planned future research. Much of the material in this paper brings together key results from work previously appearing in [1], [2], [3], and [4].

Background
Recent years have seen a rapid expansion in the availability of multimedia applications, including video conferencing, and video and audio mail. Using these systems can create large archives of material which can pose significant problems since the data is expensive to store and unwieldy to access. A particular problem is that users are unable to find particular stored documents since, unlike text, there is no simple content-based way to search for an individual reference. Manually searching an archive by listening is significantly more time consuming than a similar search of a text archive since audio browsing is much less efficient than visual browsing. Little work has been done in the area of spoken document retrieval, and what has been done has been limited in scope or evaluation. Schäuble and others [5] have proposed a system for spoken document retrieval based on predefined acoustic units, and have considered the effect of term occurrence errors of semantic and acoustic origin, but only by simulation. Their most recent work has included some real speech document retrieval results [6]. However, there are significant differences between this work and that reported here and hence it cannot be compared directly.
The VMR project is addressing these problems in multimedia retrieval and browsing by developing a system to retrieve stored video mail messages using voice indexing. A specific goal of the project is to develop a useful retrieval application for the Medusa multimedia environment installed at Olivetti Research Ltd in Cambridge [7]. The following section describes the salient features of the Medusa system.

The ORL Medusa system
The Medusa Project at ORL is an experimental multimedia system based on multiple streams of digital audio and video sent over a 100 Megabit-per-second switched ATM network [8,9]. The ATM fabric covers all rooms at the laboratory and some 200 network connections are in place. An ATM network has useful characteristics for multimedia transport: the bandwidth is relatively high, the latency is low, and transit time jitter is also low. Multimedia input and output functionality has traditionally been provided by cards plugged into a workstation bus. In contrast, Medusa adopts a more flexible approach of making microphones, speakers, cameras and storage systems into first-class network objects. The Medusa hardware design is modular: various ATM direct peripherals are made by adding specialised option cards to an ATMos card. The ATMos card has an ATM network interface, an ARM processor and up to 32 MBytes of memory. A range of Smart ATM Modules (SAMs) have been developed. These are lunchbox-sized units which plug directly into the ATM network. Figure 1 shows the connection of various Medusa ATM Direct Peripherals to the network. Figure 2 shows a typical Medusa installation.
An Audio SAM supports a set of ATM microphones and speakers. Audio input signals are sampled at 32kHz using 16-bit coding, although higher sampling rates are possible. Stereo audio can be transmitted across the network using two audio channels enabling near CD quality digital audio playback. The Video SAM is an ATM networked video unit to which one or two colour camera heads can be connected. Colour video is output to either ATM networked workstations or a colour LCD VideoTile as shown to the right of the workstation in figure 2. Medusa application programs (including the VMR user interface described in Section 6.5) are written in Tcl/Tk [10], a scripting language with useful built-in X widgets. Tcl has been extended at ORL to include commands for initialising Medusa modules; making connections between them; and reading and writing to the controlling attributes on the modules. In the case of an audio source module the attributes include the sampling rate, the number of sample bits, and the quantisation method.
Several large Tcl applications have been written to demonstrate the multi-stream capabilities of Medusa. The Medusa phone application can show small images from up to four cameras from each installation at each end of the conversation, as well as large images from each end which can be either selected manually or automatically. Agent software can use the output from analyser modules attached to the audio and video sources to automatically select video and audio streams. The Medusa mail application (Figure 3) makes a recording of all four streams of video available from a Medusa system as well as the most appropriate microphone stream. These five media streams are passed through timestamping pipeline modules and recorded onto ATM networked Disc Bricks. During playback, all four recorded views are seen, and one of them can be selected to be shown in large format, as shown in Figure 3.

VMR Project Goals
The VMR project goals should be evident from the preceding sections. The primary research issue is how to integrate text retrieval methods with automated speech recognition technology. However, the final objective is to develop a practical spoken document retrieval system for the ORL Medusa environment. A usable tool must rapidly provide robust high retrieval performance using a practical amount of computation and storage. Because speech recognition is computationally expensive, the only practical way to achieve this is to index spoken documents at the time they are added to the archive. To retrieve documents, the pre-computed indexes can then be rapidly searched to find potentially relevant documents. Identifying potentially relevant documents is only part of the solution; the user must be able to select and play back any desired document. Since audio and video consume orders of magnitude more storage than text, this is a non-trivial problem in itself. Further, it is inefficient to play back entire documents when just a small portion is of interest. A truly useful retrieval application should give the user the ability to identify and play back potentially interesting portions of individual messages. Finally, a useful application should be well-integrated into the environment in which it is to be used, in our case Medusa.

The Project Structure
The VMR project has three phases that encompass progressively less restricted and more realistic conditions for spoken document retrieval. In Stages 1 and 2 (the work reported here) searching depends only on a fixed keyword vocabulary, known in advance of search time. Stage 1 assumed a closed talker community responsible for all messages. Speech recognition was restricted to locating instances of the fixed keyword set in acoustically clean speech from a known talker from a set of 15 talkers. In addition, all talkers provided sufficient examples of their speech so that specialised models could be built. Though this is clearly unrealistic in the long run, it did provide a benchmark for performance in later, less favourable conditions. Stage 2 relaxed the requirement that the talker's identity be known in advance. In addition, more sophisticated retrieval strategies were implemented and additional evaluation material used.
Key results from Stages 1 and 2 are described later in this paper. The third stage, where research is ongoing, extends the search term vocabulary from the small set of known keywords to a potentially open search term vocabulary. An integral part of each stage is the development of the general user interface and ultimately the integration into Medusa.


Problems in Spoken Document Retrieval

Speech Document Indexing Issues
Attempts to retrieve spoken documents encounter similar problems to those associated with text document retrieval but there are further important issues which must be considered. Most obvious among these is that the contents of spoken documents are unknown and hence the initial phase must perform an indexing operation using speech recognition. This speech recognition phase may be carried out in one of two basic ways. Either the speech recognition system may attempt to perform a full transcription of the contents of the documents using a large vocabulary recogniser, or recognition may be restricted to a limited and, hopefully, useful set of indexing terms (or keywords) selected a priori. The recognition used in Stages 1 and 2 of the VMR project is of the second type. In both of these approaches the indexing vocabulary is limited to that of the recogniser. Large vocabulary systems are now available with vocabularies approaching 100,000 words but this should be compared to the 500,000 word vocabularies encountered in text retrieval systems. Many words, particularly proper nouns, cannot be recognised correctly by either type of recognition system since they are outside the domain of its vocabulary. This creates a significant search problem which does not exist in text-based systems, where new document terms are merely added into the inverted file structure.
Additionally, speech recognition is inherently not completely reliable. Even the very best systems will make recognition errors, often arising from variable pronunciation or events outside their domain. The recogniser typically maps out-of-vocabulary words to something in its existing vocabulary, which will inevitably result in a recognition error. Short words are more susceptible to recognition errors than longer ones, both because of their inherently greater confusability and their greater tendency to poor articulation. In any case it is important to realise that good text search terms may not be as useful in the speech domain because of their acoustic properties.

Comparison with Text IR
There are some similarities between term identification for text and spoken document retrieval. For example, in the text case there may be false alarms on search terms with multiple senses. If one of these terms is present in a query it will match any occurrence in a document regardless of the sense used in each case. This has been shown to have minimal effect on retrieval effectiveness except for very short queries [11], but is nevertheless a real issue. Also, there may be misses on search terms which occur in the documents as synonyms of a query term; although of course there are techniques designed to overcome this problem. Finally there may be query-document term matching errors arising from spelling errors in the documents or query, or inappropriate term stemming.
These problems are also potentially present in spoken document retrieval. However, there are two additional sources of potential error similar in effect to those just described: acoustic false alarms, which occur when the speech recogniser hypothesises the presence of a term when none is actually present, and acoustic misses, where the occurrence of a term is not detected by the recogniser.
All these sources of search error may be offset by adding more search terms to the query.

Experimental Investigations
This section outlines experimental strategies on the VMR project. The following subsections describe the experimental message archive, acoustic training data for the speech recognition component, indexing of spoken documents via word spotting, retrieval testing and our prototype video mail retrieval application.

Message Archive
A particular problem which we encountered was the lack of real video mail data for experimental use.
Thus we had to engage in a serious collection construction exercise for our initial retrieval test data. Our first message set, VMR1, was designed to satisfy requirements of both document retrieval and speech recognition. From the document retrieval perspective the database had to consist of messages with the same general properties as could be expected in real video mail messages. But in order to meet the specification for Stage 1 of the project, it also had to consist of messages making natural use of a set of fixed search keywords. At the same time the corpus should have the sort of message similarities and differences that pose challenges for recall and precision typical of expected VMR situations. Messages should also have similar acoustic properties and speaking styles to those found in an operational system, and be of comparable length. A key issue for system assessment is to evaluate the performance of the speech recognition component and to investigate the extent to which word recognition accuracy affects retrieval performance. For this reason all messages were orthographically transcribed, including marking of pauses, disfluencies, and extraneous noises. This detailed transcription can only be done manually and is very expensive to carry out. For this reason the VMR1 archive is a very small collection from the retrieval point of view. In order for this small database to be viable for retrieval research it had to be carefully structured. The structure of the VMR1 archive was derived as follows. Messages were sought on topics within a set of topic categories. Associated with each category was a set of keywords drawn from a small fixed keyword vocabulary from which all search terms used in Stages 1 and 2 of the project must be taken. In addition, since the keyword vocabulary is not very large, a set of other-words were provided for each category as further prompting and potential search vocabulary.
The messages were prompted by using scenarios which stimulated the talker to talk on a topic within a category without constraining them to produce messages strictly tied to pre-specified topics. The prompt for each spontaneous message consisted of the scenario and the keywords and other-words for the category. Talkers were asked to favour the use of the listed keywords and other-words, but not at the expense of construction of realistic messages. They were also not restricted to the keywords precisely as shown to them but could use them in variant word forms: for example the keyword mail might be used in the forms mailed, mails or mailing. The talkers were not shown a complete list of the keywords available, but only those relevant to the current category. However, they were free to use any keyword in any message. The collected messages varied in their individual topics but are clustered round the prompting categories.
The total keyword vocabulary was 35 words, along with a total of 31 related other-words. The keywords were selected manually and contained a mixture of longer more easily recognised words and shorter monosyllabic words. The full list of 35 keywords used was: active, assess, badge, camera, date, display, document, find, indigo, interface, keyword, locate, location, mail, manage, meeting, message, microphone, network, output, Pandora, plan, project, rank, retrieve, score, search, sensor, spotting, staff, time, video, windows, word, workstation. A total of 10 topic categories were defined and a keyword subset associated with each one. The categories were chosen to reflect the anticipated messages of a particular user community, the staff associated with the VMR project. The 10 categories were: spotting, document, output, retrieval, windows, management, badge, Pandora, schedule, equipment. Keywords were assigned manually to the categories as being representative of the topics defined within the category. For example, for the category schedule the following assignment was made: schedule -> manage, project, meeting, plan. Five message prompt scenarios were generated for each category. 15 evenly distributed sets of 4 categories were formed. Each category group was assigned to a knowledgeable talker. The talker then recorded a message in response to each prompt, giving a total of 20 messages for each talker, and a total of 300 in the VMR1 archive.
The average message length was approximately 1 minute, giving a total message archive of around 5 hours. The average number of fixed keywords per message was about 7. A detailed description of VMR1 is contained in [12].

Speech Training Data
Speech recognition systems require acoustic data for the training of recognition models. In Stages 1 and 2 of the VMR project training data was required for fixed keyword models, background models for non-keyword speech, and a model for silence. In Stage 1 individual models were needed for each talker, and so separate training data had to be collected for each model set. In Stage 2 general acoustic recognition models were trained to recognise any talker. This requires a large set of training data collected from many different talkers.

Stage 1 Training Data
Each talker provided the following speech training data: 77 read sentences ("r" data), i.e. sentences containing keywords, constructed such that each keyword occurred a minimum of five times; 170 isolated keywords ("i" data), i.e. 5 occurrences of each of the 35 keywords spoken in isolation; and 150 read sentences ("z" data), i.e. phonetically-rich sentences from the TIMIT corpus [13]. There were a total of about 5 hours of spoken training data collected from the same 15 talkers who generated the experimental message set.

Stage 2 Training Data
For this stage the WSJCAM0 British English spoken corpus was used. This consists of spoken sentences taken from the Wall Street Journal. Data was collected for 100 British English speakers with equal numbers of male and female speakers drawn from a variety of age groups and regional backgrounds. The corpus contains a total of around 12 hours of spoken data. WSJCAM0 was collected at Cambridge University Engineering Department and further details are contained in [14].

Recording Environment
All speech data was recorded in parallel at 16kHz using both the desk microphone from the Medusa system and a Sennheiser HMD 414 close-talking microphone (as used in many current speech recognition systems). The former represents the operational system and the latter functions as an experimental control. The speech recordings were made in a quiet environment.

Word Spotting
Automatically detecting fixed keywords in unconstrained speech is termed "word spotting" [15]; this technology is the basis of the speech recognition in Stages 1 and 2 of the VMR project. The best-performing word spotters are based on hidden Markov model (HMM) methods, as used in successful continuous-speech recognition [16]. A hidden Markov model is a state-based statistical representation of a speech event, typically a word or subword. Different states model differing characteristic speech sounds. A typical subword unit is the phone (sometimes referred to as a phoneme). All words are built from a phone sequence drawn from the set of around 45 distinct phones. Phones vary somewhat with context, i.e. the phones which precede and succeed them. When sufficient training data is available, improved recognition can be achieved by modelling this variation.
There exist efficient algorithms for both training HMM parameters and finding the most likely model sequence given unknown speech input. The HTK tool set developed at Cambridge University [17] is a powerful and flexible set of software tools for developing HMM applications such as the keyword spotting system presented here.

Stage 2 models

A set of talker-independent keyword models was formed using the WSJCAM0 data [14]. (The talker-independent system developed here was trained primarily for native British English talkers.) The keyword models were built using word-internal triphone HMMs. Word-internal triphones model phone context within words, but do not take into account phonetic variation arising from interaction with the previous or following word. These were generalised using a tree-based clustering technique [18]. This training method enables all possible triphones, biphones and monophones to be modelled. Given such a model set, a particular keyword may be easily modelled by concatenating the appropriate sequence of subword models (obtained from a phonetic dictionary). Biphones are used at the beginning and end of the keyword, while triphones model the internal structure. For example, the keyword "find" is represented by the model sequence f+ay f-ay+n ay-n+d n-d. Non-keyword speech is modelled by an unconstrained network of monophones. The Stage 2 keyword recognition network is shown in figure 5.
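The biphone/triphone concatenation just described can be sketched as follows. The function mirrors the "find" example above; the code is an illustrative reconstruction, not part of HTK.

```python
# Sketch: mapping a keyword's phone list to HMM model names, with biphones at
# the word boundaries and word-internal triphones in between. Phone symbols
# follow the paper's example; the function itself is illustrative.

def keyword_model_sequence(phones):
    """Return the subword model names for a keyword's phone sequence."""
    if len(phones) == 1:
        return [phones[0]]                          # one phone: monophone model
    seq = [f"{phones[0]}+{phones[1]}"]              # leading biphone
    for prev, cur, nxt in zip(phones, phones[1:], phones[2:]):
        seq.append(f"{prev}-{cur}+{nxt}")           # internal triphones
    seq.append(f"{phones[-2]}-{phones[-1]}")        # trailing biphone
    return seq

# The paper's example: "find" = /f ay n d/
print(keyword_model_sequence(["f", "ay", "n", "d"]))
# -> ['f+ay', 'f-ay+n', 'ay-n+d', 'n-d']
```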

Keyword Recognition
The speech recognition is performed using the Viterbi algorithm, a standard technique for HMM based speech recognition [16]. The Viterbi algorithm combines the HMM model parameters and spoken data to calculate the most likely state sequence of the HMM models in the recognition vocabulary. The output is the corresponding HMM model sequence.
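The Viterbi recursion can be sketched for a toy discrete-observation HMM. Real recognisers use Gaussian-mixture output densities over acoustic feature frames, and all names and probability values here are illustrative, but the dynamic-programming step is the same.

```python
import math

# Minimal Viterbi sketch: find the most likely state sequence of an HMM given
# an observation sequence. Inputs are log-probabilities: log_init[s],
# log_trans[s][t], log_emit[s][o].

def viterbi(obs, states, log_init, log_trans, log_emit):
    delta = {s: log_init[s] + log_emit[s][obs[0]] for s in states}
    backptrs = []
    for o in obs[1:]:
        prev, delta, ptr = delta, {}, {}
        for t in states:
            # best predecessor state for reaching t at this step
            best = max(states, key=lambda s: prev[s] + log_trans[s][t])
            delta[t] = prev[best] + log_trans[best][t] + log_emit[t][o]
            ptr[t] = best
        backptrs.append(ptr)
    # trace back from the best final state
    last = max(states, key=lambda s: delta[s])
    path = [last]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```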
Word spotting is done with a two-pass recognition procedure [15]. First, Viterbi decoding is performed on a network of just the filler models, yielding a time-aligned sequence of the maximum-likelihood filler monophones and their associated log-likelihood scores. Secondly, another Viterbi decoding pass is done using the appropriate full network (as shown in figures 4 and 5). Putative keyword hits are rescored by normalising each hypothesis score by the average filler score over the keyword interval. This procedure helps take into account variation in hypothesis score arising from changes in speaking style or background conditions.
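The rescoring step can be sketched as follows. The paper does not give the exact form of the normalisation, so the per-frame subtraction used here, and the frame-score input format, are assumptions for illustration only.

```python
# Sketch of second-pass rescoring: a keyword hypothesis's log-likelihood is
# normalised by the average first-pass filler score over the same interval.
# Subtracting the mean filler log-likelihood per frame is one plausible form
# of this normalisation, assumed here for illustration.

def rescore_hit(hit_log_score, filler_frame_scores, start, end):
    """Normalise a keyword hit's log score over frames [start, end)."""
    interval = filler_frame_scores[start:end]
    avg_filler = sum(interval) / len(interval)
    per_frame_hit = hit_log_score / (end - start)
    return per_frame_hit - avg_filler
```

A hit that scores no better per frame than the filler background thus rescores to around zero, while genuinely keyword-like stretches score positively.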
In the Stage 1 system, it was necessary to tune the filler models so that they did not match an undue number of keywords. This problem arose because of the limited training data available for the whole-word keyword models. A satisfactory solution was to introduce filler models of common 3-phone sequences (not contained within any of the keywords) by concatenating three monophone models, and adjusting the word transition penalty to penalise the filler sequences (which must be traversed in groups of three).
A similar problem was observed in the Stage 2 system; however, in this case there were too many false alarms. A solution was to introduce a separate transition penalty on the keyword models as shown in figure 5. It is observed that increasing this penalty dramatically reduces the number of false alarms, while only mildly impacting the number of correctly identified keywords. The net effect is similar to, but better than, increasing a cut-off threshold on keyword scores, such that those with low scores are ignored. An accepted figure-of-merit (FOM) for word spotting is defined as the average percentage of correctly detected keywords as a threshold on the putative keyword scores is varied from one to ten false alarms per keyword per hour. The keyword spotting output was scored by comparing it against time-aligned manual text transcriptions of the documents. A putative hit is counted as a hit if it overlaps more than half of an occurrence of this keyword in the text transcription. Exact time-aligned boundaries of the transcriptions and word spotting output are unlikely to be the same. Word boundaries in the text transcription are determined manually by the transcriber and those from the word spotter are calculated stochastically in the Viterbi decoder. In the speech recognition the Viterbi decoder must form the optimal state sequence for the acoustic data using only the available models. Inevitably this optimal fit will slightly distort the word boundaries. The FOM for model sets in the Stage 1 and 2 word spotting systems are shown in table 1, averaged across both the 15 talkers and the 35 keywords. FOMs for the talker-independent models are taken at the best experimentally-determined transition penalty value.
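One simplified reading of the FOM definition above can be sketched in code. The banding of false alarms into 1 to 10 per keyword per hour, and the input format, are assumptions made for illustration, not the scoring tool actually used.

```python
# Sketch of a figure-of-merit (FOM) computation: sort putative hits by score
# and average the detection rate at operating points of 1..10 false alarms
# per keyword per hour. Inputs are illustrative: putative_hits is a list of
# (score, is_true_hit) pairs, n_true the number of actual keyword occurrences.

def figure_of_merit(putative_hits, n_true, n_keywords, hours):
    hits = sorted(putative_hits, key=lambda h: h[0], reverse=True)
    per_band = n_keywords * hours       # false alarms per 1 FA/kw/hr step
    detected, false_alarms, rates = 0, 0, []
    next_band = per_band
    for score, is_true in hits:
        if is_true:
            detected += 1
        else:
            false_alarms += 1
            while false_alarms >= next_band and len(rates) < 10:
                rates.append(100.0 * detected / n_true)
                next_band += per_band
        if len(rates) == 10:
            break
    while len(rates) < 10:              # fewer than 10 FA bands reached
        rates.append(100.0 * detected / n_true)
    return sum(rates) / 10
```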

Talker Adaptation
VMR1 is realistic in that it contains talkers with non-British accents. For example, one of our talkers is a native speaker of American English. This is problematic when using models trained exclusively on British English talkers since the model parameters will not well represent the acoustic content of the speech of these talkers. In an attempt to ameliorate this problem, and to increase word spotting performance in general, talker adaptation was investigated. In this procedure a small amount of "adaptation" data is used to generate a modified HMM model set which better represents the speech of the individual talker. The approach chosen was maximum likelihood linear regression because it has been shown to improve recognition with a comparatively small amount of adaptation data [19]. This method involves adapting only some of the model parameters (the means of the HMM Gaussian mixtures) to increase the likelihood of the adaptation data given the models.
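The mean-only adaptation idea can be sketched as follows. True MLLR estimates the regression transform by maximum likelihood, weighting by state occupancy; the plain least-squares fit below merely stands in for that estimation to show the shape of the computation, and all data is illustrative.

```python
import numpy as np

# Sketch of MLLR-style mean adaptation: all Gaussian means share one affine
# transform mu' = W mu + b estimated from enrolment data. Here a least-squares
# fit over (original mean, target mean) pairs stands in for the full
# maximum-likelihood estimation used by real MLLR.

def mllr_mean_transform(means, targets):
    """Estimate [W | b] from mean/target pairs and apply it to all means."""
    X = np.hstack([means, np.ones((means.shape[0], 1))])   # extended [mu, 1]
    Wb, *_ = np.linalg.lstsq(X, targets, rcond=None)       # least-squares fit
    return X @ Wb                                          # adapted means
```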
Varying amounts of the VMR Stage 1 training corpus were used as enrolment data for talker-adaptation experiments. Word spotting performance using talker adaptation is shown in table 2. The R13 row used 13 utterances of enrolment data containing in all 2 occurrences of each keyword. The R75 row used the full 75 "r" sentences from the Stage 1 training material, containing 5 utterances of each keyword. Adaptation does not uniformly improve performance for all talkers. However, the average increase is substantial, and is particularly dramatic for our American English talker. As shown in table 2, using a small amount of enrolment data improved the FOM performance substantially.
This type of talker adaptation is referred to as supervised since the correct transcription of the enrolment data is known by the recogniser. In operation this requires a talker to speak some given enrolment text in advance of message recognition, so that models can be suitably adapted beforehand. This may not be possible in practice for the VMR system since it is quite probable that there will be no opportunity to gather the enrolment material for messages from a new talker. We hope to investigate unsupervised adaptation, where the parameters are modified without use of a priori transcriptions.
Further details of our word spotting systems and corresponding experimental results are contained in [2, 3].

Requests and relevance assessments
Our retrieval tests so far have used VMR1 with two different request sets defining two retrieval test collections, VMR1a and VMR1b. The primary purpose of these tests has been to establish that spoken document retrieval is feasible and viable.

VMR1a. Queries were formed from the message prompts used in the database recording. To reduce variations in word form, query words were suffix-stripped to stems using the standard Porter algorithm [20]. Queries were formed from the prompts by selecting those stems also found in a keyword stem list. For example, given the prompt "Your current project is lagging behind schedule. Send a message pointing this out to the other project management staff. Suggest some days and times over the next week when you would be willing to hold a meeting to discuss the situation." the following query was obtained: project messag project manag staff time meet. To obtain relevance assessments, the 6 recorded messages generated in response to each prompt were assumed relevant to the query constructed from that prompt. The 24 other messages in the same category were assumed to be not relevant, even though they are quite likely to contain similar keywords.
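The query-formation step can be sketched as follows. The tiny stem table stands in for the full Porter stemmer used in the paper, and the prompt and keyword-stem list are illustrative rather than drawn from VMR1.

```python
# Sketch of VMR1a query formation: prompt words are reduced to stems, and the
# stems that appear in the keyword-stem list are kept in order of occurrence.
# STEMS is a toy stand-in for the Porter stemming algorithm.

STEMS = {"project": "project", "message": "messag", "management": "manag",
         "staff": "staff", "times": "time", "meeting": "meet"}

def make_query(prompt, keyword_stems):
    words = prompt.lower().replace(".", "").replace(",", "").split()
    return [STEMS[w] for w in words if STEMS.get(w) in keyword_stems]

keyword_stems = {"project", "messag", "manag", "staff", "time", "meet"}
prompt = ("Send a message to the other project management staff. Suggest "
          "some times when you would hold a meeting.")
print(make_query(prompt, keyword_stems))
# -> ['messag', 'project', 'manag', 'staff', 'time', 'meet']
```

Note that, as in the paper's example query, repeated prompt words would yield repeated query terms.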
VMR1b. This is a more realistic set of requests and relevance assessments, collected from the user community that supplied the database messages. A total of 50 requests were collected, 5 for each of the 10 categories used in message collection. These were gathered from 10 users, each of whom generated 5 requests and corresponding relevance assessments. This was achieved by forming 10 unique sets of 5 categories, and assigning each to a user knowledgeable about the categories in that set. For each category a text prompt was formed by combining information given in the 5 message prompts associated with the category.
Users were shown the prompt for the category and asked to compose a natural language request from the information given in the prompt. Users were asked that their request include at least one of the keywords associated with the category.
As for VMR1a, request words were suffix-stripped using the Porter algorithm and search queries were formed by selecting the keyword stems. For example, given the request "In what ways can the windows interface of a workstation be personalised?"
the following query was obtained: window interfac workstat. Ideally, the relevance of all archived messages should be assessed; however this is not practical even for our 300 message archive. A suitable assessment subset was formed by combining the 30 messages in the category to which the original message prompt belonged, plus 5 messages from outside the category having the highest query-message scores (computed using collection frequency weighting (see section 6.4.2) on the VMR1 document archive). Subjects were presented with the transcription of each potentially relevant document in random order and asked to mark it as "relevant", "partially relevant", or "not relevant". The average number of highly relevant documents was 10.8, while 17.2 were judged highly or partially relevant. The following sections report results only for the highly relevant relevance set. A full description of the VMR1b naturalistic request set is contained in [21].
Apart from the greater realism, the main difference between VMR1a and VMR1b is that there were far fewer terms per query for the latter, an average of 2.6 distinct terms, against an average of 4.6 for VMR1a.

Query-Document Matching
Document retrieval experiments compared three forms of document scoring: unweighted (uw) term matching, collection frequency weighting (cfw), and combined weight (cw) which takes into account several factors. The unweighted score is simply the sum of matching terms occurring in both the query and the document. The collection frequency weight is conventional inverse document frequency weighting, computed as

cfw(i) = log(N / n(i))

where cfw(i) is the cfw weight of term i, N is the total number of documents and n(i) is the number of documents in which term i appears. The combined weight incorporates cfw, within-document term frequency, and normalised document length. The cw weight was defined in [22] and derived in [23]; the cw scheme reflects the City University work for TREC [24]. The cw weight for each term in each document is calculated as follows:

cw(i,j) = cfw(i) * tf(i,j) * (K + 1) / (K * ndl(j) + tf(i,j))

where cw(i,j) represents the cw weight of term i in document j, tf(i,j) is the frequency of i in j, and ndl(j) is the normalised document length, calculated as

ndl(j) = dl(j) / (average dl for all documents)

where dl(j) is the total length of j. The combined weight constant K has to be tuned empirically: after testing we set K = 1.
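The three scoring schemes above can be sketched directly from the formulas. Representing documents and queries as lists of pre-stemmed terms is an assumption made for illustration.

```python
import math

# Sketch of the uw, cfw and cw scoring schemes. Documents and queries are
# lists of pre-stemmed terms; docs is the whole collection.

def cfw(term, docs):
    """Collection frequency (inverse document frequency) weight."""
    n_i = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_i)

def cw(term, doc, docs, avg_dl, K=1.0):
    """Combined weight: cfw scaled by term frequency and document length."""
    tf = doc.count(term)
    ndl = len(doc) / avg_dl                 # normalised document length
    return cfw(term, docs) * tf * (K + 1) / (K * ndl + tf)

def score(query, doc, docs, scheme="cw"):
    """Query-document matching score under one of the three schemes."""
    avg_dl = sum(len(d) for d in docs) / len(docs)
    matched = [t for t in set(query) if t in doc]
    if scheme == "uw":
        return len(matched)
    if scheme == "cfw":
        return sum(cfw(t, docs) for t in matched)
    return sum(cw(t, doc, docs, avg_dl) for t in matched)
```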
Table 4: VMR1b text and phonetic text retrieval performance.

Calibration via text retrieval
Retrieval performance for speech documents can be expected to suffer degradation relative to text documents due to either misses or false alarms. The degradation can be measured, when transcribed texts are available, by comparing performance for spoken word spotting results with that for the transcriptions. We used our transcribed corpus to provide us with this performance standard. A particular problem with word spotting is that unrelated acoustic events will often resemble valid keywords. For example, the last part of "hello Kate" is acoustically quite similar to the keyword "locate". Because even the most accurate acoustic models cannot discriminate between homophones, the output of an ideal word spotter that reports all keyword phone sequences provides a more legitimate standard of comparison than text. We simulated this ideal 'phonetic text' performance by scanning phonetic transcriptions of the messages for phone sequences that match those of a keyword. Table 3 shows retrieval performance for the standard transcribed messages (text) and for the ideal phonetic text (phonetic) with collection VMR1a, and table 4 shows that for VMR1b. It can be seen that introducing cfw weighting gives a substantial improvement in performance over the unweighted case, and cw in turn does better than cfw. For VMR1a the text transcription performs better than the phonetic reference; however, the opposite is true for VMR1b. We attribute this phenomenon to stemming inconsistencies between the text transcription and phonetic data. VMR1b in particular is sensitive to this effect due to its very short queries. Due to the small size of the message collection, absolute values and observed differences must be treated with caution.

Spoken message retrieval performance

As described previously, the word spotter outputs a list of putative keyword hits and associated acoustic scores.
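The ideal 'phonetic text' reference described above amounts to scanning each message's phone transcription for exact keyword phone sequences. A minimal sketch follows; the phone symbols are illustrative ARPAbet-style labels, not the actual VMR phone set:

```python
def phonetic_hits(phones, keyword_phones):
    """Count occurrences of a keyword's phone sequence inside a message's
    phone transcription. Like the ideal word spotter, this reports every
    matching sequence, homophones included."""
    k = len(keyword_phones)
    return sum(phones[i:i + k] == keyword_phones
               for i in range(len(phones) - k + 1))

# "hello Kate" ends in a phone sequence shared with "locate",
# so the ideal spotter reports a hit here even though the word is absent.
hello_kate = ["hh", "eh", "l", "ow", "k", "ey", "t"]
```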
It is found that acoustic false alarms frequently score worse than true hits, and hence a score threshold can be applied to remove most of the false alarms. Clearly, it is desirable to choose an operating threshold that optimises retrieval performance in trading false alarms against 'pseudo'-misses, i.e. hits with scores below the threshold (this question is discussed in more detail in [2]). Tables 5, 6 and 7 show spoken document retrieval performance for the three investigated weighting schemes and the different word spotting models at the best a posteriori threshold. In practice, an a priori fixed threshold would be used in an operational system. For simplicity only average precision values are shown in these tables. It is found that precision at the cutoff values shown in tables 3 and 4 follows similar relative performance trends to the average precision observed for different acoustic model sets.
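Thresholding the spotter output, and the a posteriori threshold selection used in these experiments, can be sketched as follows (names are illustrative, and the retrieval metric is passed in as a function):

```python
def apply_threshold(putative_hits, threshold):
    """Keep putative keyword hits whose acoustic score reaches the
    threshold; hits falling below it become 'pseudo'-misses."""
    return [(kw, s) for kw, s in putative_hits if s >= threshold]

def best_threshold(putative_hits, candidates, retrieval_metric):
    """A posteriori selection: sweep candidate thresholds and keep the one
    maximising a retrieval metric. An operational system would instead fix
    the threshold a priori."""
    return max(candidates,
               key=lambda t: retrieval_metric(apply_threshold(putative_hits, t)))
```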
It can be seen from these tables that talker-dependent models produce the best overall retrieval performance values. This was anticipated from the superior word spotting FOM discussed earlier.

Table 7: Retrieval performance with cw scheme.

Talker adaptation can significantly improve retrieval performance. In some cases, talker-adapted models using R75 adaptation data achieve superior retrieval performance to talker-dependent models. The reason for this is not clear; however, it is important to remember that the document set here is very small. Performance trends for the different weighting schemes are similar to those observed for the text standards, with cw again producing the best performance. Note that not only does cw give the best absolute values, but performance relative to the text standards is improved as well.

Figure 6: Number of putative hits versus threshold for talker-dependent models.

Figure 6 illustrates the effect of increasing score threshold on the number of true hits and false alarms for talker-dependent head and desk microphone models. Figure 7 shows spoken document retrieval performance for VMR1a and VMR1b using talker-dependent head-microphone models at different acoustic thresholds for uw, cfw and cw schemes. These figures show that the performance trends observed for the a posteriori best performance thresholds are consistent across the different threshold levels. Also, significantly, as well as achieving the best retrieval performance in absolute terms, the cw scheme is also less sensitive to the choice of threshold than the other schemes. This trend is more pronounced with the VMR1a results, which is probably due to their longer average query length. Figure 8 shows that increasing the keyword transition penalty (shown in figure 5) dramatically reduces the number of false alarms, while only mildly impacting the number of correctly identified keywords.
Figure 9 shows the effect of keyword score thresholding for word spotting output from systems with different keyword transition penalties. It can be seen that there is little difference between the best retrieval performance at the optimal a posteriori threshold. However, the system with fewest false alarms prior to the use of thresholding (transition penalty 100) not only exhibits the best available individual retrieval performance, but is also less sensitive to variation in the acoustic score threshold.

Retrieval results summary
In summary, the following general points can be made. Term weighting schemes developed for text retrieval transfer well to the retrieval of spoken documents. Spoken document retrieval performance of between 75% and 95% of that achieved with text transcriptions can be obtained, depending on the generality of the acoustic recognition models. Unsurprisingly, spoken document retrieval performance is adversely affected by degradation in word spotting performance.

User Interface
We have developed a prototype VMR application that integrates keyword spotting, information retrieval, and video capture/playback capabilities. The audio soundtrack of each message (whether from an existing archive or received as new mail) is passed to the acoustic word spotter. This computes a sequence of putative keyword hits, which is added to an index containing all putative hits for all messages, along with mail header information and pointers to the message data (for playback). Because the computationally intensive word spotting phase is done off-line (as messages are added to the archive), retrieval of archived messages is nearly instantaneous. The VMR user interface is shown in figure 10. The interface shows a scrollable list of all available messages in the user's video mail archive. Various controls let the user "narrow" the list, for example by displaying only those messages from a particular user or received after a particular time. Unsetting a constraint restores the messages hidden by that constraint; multiple constraints can be active at one time, giving the messages selected by a boolean conjunction of the constraints. With no constraints, messages are ranked by origination date, such that the most recently received document is displayed at the top of the list. When the user inputs a search query, the retrieval engine computes the resulting score for each message. The interface then displays a list of messages ranked by score, with the scores shown as bar graphs. Messages with identical scores are ranked by date.
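The narrowing and ranking behaviour described above is straightforward to express. A minimal sketch, with hypothetical message fields and constraint predicates (these names are ours, not the VMR interface's):

```python
def narrow(messages, constraints):
    """Boolean conjunction of active constraints (e.g. a sender filter or a
    received-after date); unsetting a constraint just removes it from the list."""
    return [m for m in messages if all(c(m) for c in constraints)]

def rank(messages, scores=None):
    """With no query, list newest first; with query scores, rank by score,
    breaking ties between identical scores by date (most recent first)."""
    if scores is None:
        return sorted(messages, key=lambda m: m["date"], reverse=True)
    return sorted(messages,
                  key=lambda m: (scores.get(m["id"], 0.0), m["date"]),
                  reverse=True)
```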
To review an individual message, a "video browser" can be activated. The video browser, shown in figure 11, graphically represents the message as a dark horizontal bar, with putative keyword hits displayed as lighter regions. Time runs from left to right, and keyword hits are displayed proportionally to when they occur in a message; for example, keywords at the beginning appear on the left side of the bar. The brightness of a keyword region is proportional to the score computed by the word spotter, so that more likely hits appear brighter and stand out. Portions of the video message can be selected for playback by dragging over part of the bar; this enables the user to selectively audition regions of interest rather than the entire message. A more detailed discussion of the system is contained in [1].

Figure 11: Prototype mail browser.

Conclusions and Future Work
The results obtained in the VMR project so far suggest that spoken document retrieval is a feasible proposition. Retrieval performance of between 75% and 90% of text has been achieved for indexing via word spotting. Of course, the ultimate evaluation is whether users find the retrieval tool useful in day-to-day operation.
Next stage system development in VMR will concentrate on greater integration with the ORL Medusa system and enhancing the user interface, hopefully including input from real users.
The speech recognition component is currently being extended to use a large vocabulary recognition system, and also a phone lattice scan approach developed by James [25] to search for out-of-vocabulary query terms.
Further directions for the project include the development of a system for the automated retrieval of broadcast TV news. Initial results from this work using text subtitles to index the data are very encouraging [26] and we intend to extend this to indexing using speech recognition in the near future.

Acknowledgements
This project is supported by the UK DTI Grant IED4/1/5804 and EPSRC Grant GR/H87629. Olivetti Research Limited is an industrial partner of the VMR project and we are indebted to them for the use of the Medusa system. We wish to thank Julian Odell for the baseline triphone models, and Chris Leggetter and Phil Woodland for the talker-adaptation software. Details of the VMR project are contained in the project web page: http://svr-www.eng.cam.ac.uk/Research/Projects/Video Mail Retrieval Voice/ from which copies of VMR publications can be obtained.