Speech Retrieval Based on Automatic Indexing

We present a system that retrieves audio recordings containing spoken text in response to a given textual query. In particular, we describe indexing methods that automatically describe the content of the recordings. The indexing methods, which are based on phoneme recognition output, take account of speech recognition errors. Additionally, the indexing methods we present are suitable for a language in which many word inflections and compounds may occur. To compare different indexing methods, we have evaluated the retrieval effectiveness on a test collection of 1289 documents and 26 queries. The results show that better effectiveness can be achieved when taking into account the characteristics of the underlying speech recognition system.


Introduction
A speech retrieval system accepts vague queries and performs best-match searches to find speech recordings that are likely to be relevant to the queries. Efficient best-match searches require that the speech recordings are indexed in a previous step. We focus on effective automatic indexing methods that are based on automatic speech recognition.
Automatic indexing of speech recordings is a difficult task for several reasons. One main reason is the limited size of vocabularies of speech recognition systems, which are at least one order of magnitude smaller than the indexing vocabularies of text retrieval systems. Another main problem is the deterioration of the retrieval effectiveness due to speech recognition errors that invariably occur when speech recordings are converted into sequences of language units (e.g. words or phonemes).
We present the design and evaluation of several automatic indexing methods for a prototype speech retrieval system for German speech recordings. Recognition of German speech is challenging because very little training data is available. The German language also encompasses a large number of different word inflections and compounds, which makes the retrieval task more difficult.
Other approaches to speech retrieval focus on English speech recordings. English speech retrieval is less difficult than German speech retrieval because there are fewer word inflections and, in particular, a large amount of training data exists [Lamel et al., 1986], [Garofolo et al., 1990], [LDC, 1992]. At Cambridge University, a video mail retrieval system is being developed that currently accepts 35 query words [Sparck-Jones et al., 1995]. Similar work was also done by David James [1995], who experimented on retrieval of English news broadcasts. At Carnegie Mellon University, a retrieval system is being developed for a digital video library [Hauptmann et al., 1995], [Informedia, 1996].
As our main contribution, we present a family of speech retrieval methods suitable for a language with many different word inflections and for cases where limited training data is available.
This paper is structured as follows. In Section 2, we describe methods to index and retrieve speech documents. An overview of the current prototype retrieval system and the speech recognition component is given in Section 3. In Section 4, we report on experiments made to determine the retrieval effectiveness and to compare the indexing methods. Conclusions are drawn in Section 5.

Indexing Methods for Speech Documents
Indexing is the process of generating document descriptions that contain clues about the content of the documents. A document description consists of (possibly weighted) indexing features that have been identified in the document. In the case of text retrieval, words and phrases are the most common indexing features.

Speech Recognition Component
Indexing speech documents automatically requires speech recognition technology. Ideally, a word recognition system would transcribe the spoken document into text, such that text retrieval methods could be applied. However, this is not feasible for the following reasons.
First, there exist too many different words for recognition. For instance, the TREC document collection contains more than 500,000 different English word stems [Harman, 1994], whereas the vocabulary size of current state-of-the-art recognition systems is about 65,000 word forms [Woodland et al., 1995]. In the case of the German language the situation is even worse because of the vast number of different word inflections.
Second, infrequent words such as person or company names should be in the recognition vocabulary because they are excellent indexing features. However, they are not known at indexing time.
Third, in the case of German, we lack sufficient data to train the acoustic word models used for recognition.
Therefore we use phonemes as the basic units in recognition. Using the Hidden Markov Toolkit developed by Young et al. [1993], we have built a phoneme recogniser, which produces phonemic transcriptions of the speech documents (Figure 1). This intermediate representation of the speech is a suitable basis for indexing, since it is independent of the indexing vocabulary. That is, the transcriptions allow the spotting of any query word, if its phonemic transcription is available. Furthermore, modifications in indexing methods are possible without costly changes in the recognition system. Our phoneme recogniser is described more thoroughly in Section 3.1.

Text: In Algerien hat die Islamische Heilsfront FIS die Bevoelkerung ...
Correct: i n a l g e r i @ n h a t d i i s l a m i sch @ h ei l s f r O n t sil E f i E s d i b e f oe l k E r u N ...
Recognized: i m a E d i E i @ n h a t s t i i s l a m i sch e h ei l ch s v O n t sil E f i E s d e b E sch ei r E k r u N ...
Figure 1: Sample of a radio news sentence: text, correct transcription and phoneme recognition output
There are two major differences between speech and text retrieval, as can be seen in Figure 1. First, word boundaries are difficult to detect, since there is no explicit word delimiter in speech. Second, recognition errors veil the true content of the speech documents, which makes the retrieval task more difficult and thus requires appropriate indexing methods.

Indexing and Retrieval
We first give a formal description of an indexing and retrieval method based on the vector space model. Let $D$ be a collection of speech documents and $q$ a user query. An indexing method is a function that maps a document $d_j \in D$ (query $q$) into a document description vector $\mathbf{d}_j$ (query description vector $\mathbf{q}$). The dimension of $\mathbf{d}_j$ and $\mathbf{q}$ is given by the number of different indexing features in the indexing vocabulary $\Phi = \{\varphi_0, \ldots, \varphi_{m-1}\}$. We will introduce various indexing vocabularies later in this section. The description vectors are defined as

$$\mathbf{d}_j := (a_{0j}, \ldots, a_{m-1,j})^T \qquad (1)$$

$$\mathbf{q} := (b_0, \ldots, b_{m-1})^T \qquad (2)$$

with weights $a_{ij}$ ($b_i$) that represent the relevance of the indexing feature $\varphi_i$ in the document $d_j$ (query $q$).
To generate a list of documents that are ranked in decreasing order of their estimated relevance to the query, the system computes the Retrieval Status Value (RSV) between $q$ and every document $d_j$ using a standard cosine measure [van Rijsbergen, 1979, p. 41]:

$$\mathrm{RSV}(q, d_j) = \frac{\mathbf{q}^T \mathbf{d}_j}{\sqrt{\mathbf{q}^T \mathbf{q}}\,\sqrt{\mathbf{d}_j^T \mathbf{d}_j}}$$

In the following sections we describe two indexing methods for speech documents in more detail. Both indexing methods refer to the general definitions (1) and (2). The input to the indexing methods consists of phonemic transcriptions.
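The cosine ranking described above can be illustrated with a minimal sketch. It assumes that description vectors are stored sparsely as dictionaries mapping indexing features to weights; the function names `rsv` and `rank` are illustrative and do not come from the prototype.

```python
import math

def rsv(q, d):
    """Cosine measure between two sparse description vectors.

    q and d are dicts mapping an indexing feature to its weight
    (b_i for the query, a_ij for the document).
    """
    dot = sum(weight * d.get(feature, 0.0) for feature, weight in q.items())
    norm_q = math.sqrt(sum(weight * weight for weight in q.values()))
    norm_d = math.sqrt(sum(weight * weight for weight in d.values()))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0  # an empty description matches nothing
    return dot / (norm_q * norm_d)

def rank(q, docs):
    """Return document ids in decreasing order of their RSV."""
    return sorted(docs, key=lambda doc_id: rsv(q, docs[doc_id]), reverse=True)
```

Because the measure is normalised by both vector lengths, a long passage does not score higher merely because it contains more features.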

N-Gram Indexing
In the first indexing method, the set of indexing features consists of phoneme N-grams, for a given N. The technique of using N-grams for indexing is well known in searching tasks, e.g. letter-based N-grams in text retrieval [Teufel, 1989], [Cavnar, 1992]. The phonemic transcription of a speech document $d_j$ is simply decomposed into overlapping phoneme N-grams. The example in Figure 1 would yield the indexing features i m a, m a E, ... ($N = 3$). The weight of such an indexing feature $\varphi_i$ in the document $d_j$ is defined as

$$a_{ij} := \mathrm{ff}(\varphi_i, d_j) \cdot \mathrm{idf}(\varphi_i) \qquad (3)$$

where $\mathrm{ff}(\varphi_i, d_j)$ denotes the feature frequency (the number of occurrences of $\varphi_i$ in $d_j$), and $\mathrm{idf}(\varphi_i)$ denotes the inverse document frequency, which is a function of the document frequency $\mathrm{df}(\varphi_i)$, that is, the number of documents containing $\varphi_i$.
A query, in our case entered as natural language text, is first transcribed into a phonemic transcription using a pronunciation dictionary. Thereafter, it is also decomposed into N-grams, and the query description $\mathbf{q}$ is derived according to (2), using the weights

$$b_i := \mathrm{ff}(\varphi_i, q) \cdot \mathrm{idf}(\varphi_i) \qquad (4)$$

where $\mathrm{ff}(\varphi_i, q)$ is the feature frequency of $\varphi_i$ in the query. This simple method is suitable for indexing speech documents for the following reasons. First, it accounts for different word inflections and compounds that are common in German. Although the query and the document may contain different inflections of the same word, it is likely that there are matching N-grams corresponding to the common stem of the word. Second, this method is tolerant to recognition errors, if N is not too large. Any sequence of N correctly recognised phonemes yields a correct N-gram.
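A minimal sketch of the N-gram decomposition and weighting follows. It assumes that a transcription is a string of space-separated phoneme symbols, as in Figure 1. The text only says that idf is a function of the document frequency, so the form log(n / df) used here is an assumption, and all function names are illustrative.

```python
import math
from collections import Counter

def phoneme_ngrams(transcription, n=3):
    """Decompose a space-separated phonemic transcription into overlapping N-grams."""
    phonemes = transcription.split()
    return [tuple(phonemes[i:i + n]) for i in range(len(phonemes) - n + 1)]

def feature_frequencies(transcription, n=3):
    """Feature frequency of every N-gram in one document or query."""
    return Counter(phoneme_ngrams(transcription, n))

def idf_table(doc_frequencies):
    """Assumed idf form: log(number of documents / document frequency)."""
    n_docs = len(doc_frequencies)
    df = Counter()
    for ff in doc_frequencies:
        df.update(ff.keys())  # each document counts once per feature
    return {feature: math.log(n_docs / count) for feature, count in df.items()}

def weights(ff, idf):
    """Weight = feature frequency times idf, as in (3) and (4)."""
    return {feature: count * idf.get(feature, 0.0) for feature, count in ff.items()}
```

For the recognised transcription in Figure 1, `phoneme_ngrams("i m a E d", 3)` starts with the trigrams `("i", "m", "a")` and `("m", "a", "E")`. Query N-grams that never occur in the collection receive weight zero in this sketch.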

Word Matching and Probabilistic Weighting
The second indexing method identifies possible occurrences of query words in the phonemic transcriptions, i.e. the query words are the indexing features. The probability of occurrence serves to calculate the weights in the document description vectors (1). A similar method was applied by Mittendorf et al. [1995], where retrieval is performed on corrupted text obtained by Optical Character Recognition (OCR).
Let $w = (w[1] \ldots w[I])$ be a phonemic transcription of a query word $\varphi_i$ and $t = (t[1] \ldots t[R])$ a phonemic transcription of a document $d_j$. The indexing system first identifies a set of subsequences $\{s_0, \ldots, s_{r-1}\}$ of $t$ that are similar to the query transcription $w$. Let $p_k$ denote the probability that a phonemic sequence $s_k$ corresponds to the query word $\varphi_i$ (we explain the estimation of $p_k$ later in this section). Then, we obtain the expected feature frequency of $\varphi_i$ in $d_j$ by

$$\mathrm{eff}(\varphi_i, d_j) := \sum_{k=0}^{r-1} p_k \qquad (5)$$

To determine the set of subsequences $\{s_0, \ldots, s_{r-1}\}$, we consider all subsequences in $t$ that have length $J \in [I - \Delta, I + \Delta]$. Setting $J := I$ would be too restrictive because of phoneme insertion and deletion errors. From the candidates, we select those $r$ subsequences with maximal occurrence probability, such that the subsequences do not overlap. We are aware that considering all subsequences, which is of time complexity $O(R \cdot I)$, may not be feasible for large collections. We are currently investigating various data structures and methods for faster approximate matching.
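The selection step can be sketched as follows. This is an illustration under stated assumptions, not the prototype's implementation: candidates are picked greedily in decreasing order of probability, the length tolerance is passed as a hypothetical parameter `tol`, and `prob` stands for any function that estimates the occurrence probability of a candidate.

```python
def candidate_spans(doc_len, word_len, tol):
    """All (start, end) spans whose length lies in [word_len - tol, word_len + tol]."""
    spans = []
    for length in range(max(1, word_len - tol), word_len + tol + 1):
        for start in range(0, doc_len - length + 1):
            spans.append((start, start + length))
    return spans

def expected_feature_frequency(t, w, prob, r=3, tol=1):
    """Sum the probabilities of the r best non-overlapping candidate subsequences.

    t and w are lists of phoneme symbols; prob(w, s) returns a value in [0, 1].
    Greedy selection is an assumption made for this sketch.
    """
    scored = sorted(
        ((prob(w, t[a:b]), a, b) for a, b in candidate_spans(len(t), len(w), tol)),
        reverse=True,
    )
    chosen = []
    for p, a, b in scored:
        if len(chosen) == r:
            break
        # keep the candidate only if it overlaps no span chosen so far
        if all(b <= ca or a >= cb for _, ca, cb in chosen):
            chosen.append((p, a, b))
    return sum(p for p, _, _ in chosen)
```

With a toy probability function that returns 1.0 for an exact match and 0.0 otherwise, a document in which the word occurs twice yields an expected feature frequency of 2.0.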
The estimation of an occurrence probability is mainly based on an edit distance computed between the phonemic transcription $w$ and a subsequence $s$. The edit distance $d(w, s) := \delta(I, J)$ is a measure of the dissimilarity of $w$ and $s$, where $\delta(i, j)$ is defined recursively using cost functions for the individual edit operations.
We defined those cost functions by analysing the errors of our phoneme recogniser on the training speech.
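As an illustration only, the following sketch uses the standard dynamic-programming recursion for a weighted edit distance. Whether it matches the recursion used in the prototype is an assumption. The unit costs are placeholders, not the cost functions derived from the recogniser's errors.

```python
def unit_sub(a, b):
    """Placeholder substitution cost: 0 for identical phonemes, 1 otherwise."""
    return 0.0 if a == b else 1.0

def unit_indel(_phoneme):
    """Placeholder insertion/deletion cost."""
    return 1.0

def edit_distance(w, s, sub_cost=unit_sub, ins_cost=unit_indel, del_cost=unit_indel):
    """Weighted edit distance between two phoneme lists via dynamic programming."""
    I, J = len(w), len(s)
    delta = [[0.0] * (J + 1) for _ in range(I + 1)]
    for i in range(1, I + 1):
        delta[i][0] = delta[i - 1][0] + del_cost(w[i - 1])
    for j in range(1, J + 1):
        delta[0][j] = delta[0][j - 1] + ins_cost(s[j - 1])
    for i in range(1, I + 1):
        for j in range(1, J + 1):
            delta[i][j] = min(
                delta[i - 1][j - 1] + sub_cost(w[i - 1], s[j - 1]),  # match or substitution
                delta[i - 1][j] + del_cost(w[i - 1]),                # phoneme of w missing in s
                delta[i][j - 1] + ins_cost(s[j - 1]),                # extra phoneme in s
            )
    return delta[I][J]
```

Using fragments of Figure 1 with unit costs, `h ei l s` against the recognised `h ei l ch s` has distance 1 (one inserted phoneme), and `f r O n t` against `v O n t` has distance 2 (one substitution and one deletion). Costs learned from recogniser confusions would make frequent confusions cheaper than rare ones.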
To achieve good estimates for the occurrence probabilities, we trained a probability estimation function that is based on several parameters. The probability $p$ that a given phonemic transcription $s$ corresponds to $w$ is estimated according to the formula

$$\log \frac{p}{1 - p} = \beta_0 + \beta_1 I + \beta_2 e + \beta_3\, d(w, s),$$

where $I$ is the length of $w$, $e$ is the number of equal phonemes in $w$ and $s$, and $d(w, s)$ is the edit distance. We computed $I$, $e$ and $d(w, s)$ for a set of 355 words and several substrings $s$ on a small training set where $p$ was set to either zero or one by manual checking. Having a set of approximately 11,000 data points, of which 358 (3%) are positive examples, we then applied logistic regression to estimate the coefficients $\beta_0, \ldots, \beta_3$ [Hosmer & Lemeshow, 1989].
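Solving the log-odds formula for p gives the logistic function, which the following sketch evaluates. The coefficient values in the usage note are hypothetical and chosen only for illustration; the fitted coefficients are not reported in the text.

```python
import math

def occurrence_probability(word_len, n_equal, distance, coeffs):
    """Invert log(p / (1 - p)) = c0 + c1*I + c2*e + c3*d(w, s) to obtain p.

    coeffs is a sequence (c0, c1, c2, c3) of regression coefficients.
    """
    z = (coeffs[0]
         + coeffs[1] * word_len
         + coeffs[2] * n_equal
         + coeffs[3] * distance)
    return 1.0 / (1.0 + math.exp(-z))
```

With hypothetical coefficients such as `(-4.0, 0.1, 0.8, -1.5)`, a candidate with a larger edit distance receives a lower probability than one with the same length and number of equal phonemes but a smaller distance.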

Weighting
Given a query $q$, the stop words were first removed. For the remaining words, the expected feature frequencies were calculated according to (5). Finally, the following weights were used in (1) and (2):

$$a_{ij} := \mathrm{eff}(\varphi_i, d_j) \cdot \mathrm{idf}(\varphi_i) \qquad (8)$$

$$b_i := \mathrm{ff}(\varphi_i, q) \cdot \mathrm{idf}(\varphi_i) \qquad (9)$$

where $\mathrm{eff}$ denotes the expected feature frequency and $\mathrm{ff}$ the feature frequency. Note that it is not possible to calculate the idf-values exactly in this approach, since the document frequency is not known. We used approximate idf-values that were derived from a text collection of a similar domain (articles of the Swiss News Agency SDA).

Prototype System Description
Our prototype speech retrieval system provides access to four hours of Swiss radio news, spoken by a single speaker.
The language is High German with a Swiss flavour. Each news unit has a duration of approximately 5 minutes and covers various topics such as special events, politics, sports, business information and weather. Because topic boundaries are difficult to detect, the units have been segmented into fixed-length passages (speech documents) of 20 seconds with an overlap of 10 seconds. This segmentation yields a collection size of 1289 speech documents. Every speech document is an audio recording sampled at 16 kHz with 16 bit resolution. Queries are entered as natural language text. The system maps the input text into a phonemic transcription using a pronunciation dictionary with more than 350,000 entries [Celex, 1993]. Finally, the query is indexed as described in Section 2.
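The fixed-length segmentation can be sketched as below. How the prototype treats the end of a news unit is not stated, so truncating the final passage at the end of the unit is an assumption, and the function name is illustrative.

```python
def passages(duration_s, length_s=20.0, step_s=10.0):
    """Return (start, end) times in seconds of overlapping fixed-length passages.

    Consecutive passages start step_s apart, so 20-second passages with a
    10-second step overlap by 10 seconds. The final passage is truncated at
    the end of the unit (an assumption of this sketch).
    """
    spans = []
    start = 0.0
    while start < duration_s:
        spans.append((start, min(start + length_s, duration_s)))
        if start + length_s >= duration_s:
            break  # the unit is fully covered
        start += step_s
    return spans
```

For a 50-second recording this yields the passages (0, 20), (10, 30), (20, 40) and (30, 50). The overlap ensures that a topic falling on a passage boundary is fully contained in at least one passage, provided it is shorter than 10 seconds.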
Figure 2 shows a screen dump of the current prototype. After a query has been entered, the system returns a list of documents ranked in decreasing order of their RSV. For each document, date, time, and a passage number are shown. The user may select a rank to listen to a specific document.

Speech Recognition
We developed a speaker-dependent phone recogniser using the HTK Toolkit [Young et al., 1993], which is based on continuous density Hidden Markov Models. We trained 52 different acoustic monophone models including a silence model.
We had to generate training material for the recogniser, since we lacked a training set that is in German, speaker specific and phonetically labelled. We collected 30 minutes of radio news together with the textual and phonetic transcripts. Fortunately, we were able to use the English TIMIT speech corpus to initialise the acoustic models for our German speaker. Using these models, we performed Viterbi alignment on the data to determine the phone boundaries [Young et al., 1993], which are necessary to train the acoustic models.
For recognition, we applied Viterbi decoding, which generates the most likely phone sequence for the given input speech. Additionally, a stochastic phone bigram language model was incorporated to avoid the output of unlikely phone sequences. The sequences were further reduced by clustering some of the most similar monophones into 32 phoneme classes.
A preliminary evaluation showed that 72% of the phonemes are recognised correctly. This is only an upper bound for the performance, since evaluation was done on the training set. To build an evaluation set, additional speech would have to be collected, transcribed and labelled.

Experiments
In this section we report on retrieval experiments to evaluate the system's effectiveness and to compare different indexing methods. We set up an IR test collection, consisting of 1289 speech documents (see Section 3) and 26 short text queries that are message titles gathered from the Swiss News Agency. Each query contains eight words on average. An example of such a query is Kroaten stimmten über Unabhängigkeit ab (The Croats voted for independence). To obtain relevance assessments, we let a student read the queries and listen to all the documents.
In the first retrieval experiment we compared the N-gram indexing method described in Section 2.3 for various N.
Figure 3 shows the recall-precision curves for different N-gram methods. Evidently, the trigram and tetragram methods perform much better than bigrams. Using the trigram method, a 225% improvement of average precision over the bigram method can be observed. Bigrams are too short and too common to serve as indexing units. They are not able to discriminate the documents sufficiently. On the other hand, tetragrams would contain much evidence about the underlying text. However, a 3% degradation of average precision compared to trigrams indicates that, in the context of speech retrieval, overly long units are not useful either. This is due to phoneme recognition errors that prevent a document feature from being matched to a query feature. Trigrams thus appear to be a suitable compromise between indexing power and robustness to recognition errors.
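For readers who wish to reproduce this kind of comparison, the following sketch computes two common effectiveness measures from a ranked list and a set of relevance assessments. The text does not state exactly how average precision was computed, so the non-interpolated form used here is an assumption, and the function names are illustrative.

```python
def precision_at_k(ranking, relevant, k=5):
    """Fraction of the first k ranked documents that are relevant."""
    return sum(1 for doc_id in ranking[:k] if doc_id in relevant) / k

def average_precision(ranking, relevant):
    """Non-interpolated average precision (assumed variant).

    Precision is measured at the rank of every relevant document that is
    retrieved and averaged over all relevant documents.
    """
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for position, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant)

def relative_improvement(new, baseline):
    """Relative change in percent, e.g. for comparing average precision values."""
    return 100.0 * (new - baseline) / baseline
```

For a ranking `["a", "b", "c", "d", "e"]` with relevant documents `{"a", "c"}`, the precision at 5 is 0.4 and the average precision is (1/1 + 2/3) / 2.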
The second retrieval experiment compares the word matching method with the trigram method. For the word matching method, we set $r = 3$, i.e. we assume that a query word does not occur more than three times in a document. This is a reasonable assumption for documents of 20 seconds. The graph on the left of Figure 4 shows the results using the weighting scheme defined in Section 2.3. The word matching method achieves an average precision of 0.2008, which is a 32% improvement over the trigram method.
In the other graph of Figure 4, the experiment was repeated without using idf-weights in (3), (4), (8) and (9), respectively. Here, the performance differences are even more significant. Using word matching resulted in a 42% improvement of average precision over trigram indexing. The comparison of both graphs confirms once more that the idf-weights help to improve the retrieval effectiveness. The improvements in average precision are 7% (word matching) and 15% (trigram), respectively. The more significant performance gain in the trigram method can be explained by the presence of stopwords. In N-gram indexing, we did not remove stopwords, with the intention that phrases (e.g. in Italien) might produce additional matches. N-grams occurring in stopwords tend to have a low idf-value. Thus, incorporating idf into the weighting naturally reduces the influence of stopword N-grams.
A clear advantage of the word matching method is that it considers more of the context in the phonemic transcriptions when matching a word. A trigram bears much less context, and no information about neighbouring trigrams is used, because the retrieval is based on the vector space model where the indexing features are assumed to be independent of each other.
On the other hand, the word matching method is computationally more expensive due to the calculation of the expected feature frequencies. However, a phoneme trigram index could be used to restrict the search space when looking for a set of possible subsequences.
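One way such a restriction could work is sketched below; this is an assumption about a possible design, not a description of the prototype. An inverted index maps each phoneme trigram to the documents containing it, and only documents sharing at least `min_shared` trigrams with the query word (a hypothetical threshold) are passed on to the more expensive word matching.

```python
from collections import defaultdict

def build_trigram_index(docs):
    """Map every phoneme trigram to the set of ids of documents containing it.

    docs maps a document id to its list of recognised phoneme symbols.
    """
    index = defaultdict(set)
    for doc_id, phonemes in docs.items():
        for i in range(len(phonemes) - 2):
            index[tuple(phonemes[i:i + 3])].add(doc_id)
    return index

def candidate_documents(index, word_phonemes, min_shared=1):
    """Documents that share at least min_shared distinct trigrams with the word."""
    trigrams = {tuple(word_phonemes[i:i + 3]) for i in range(len(word_phonemes) - 2)}
    shared = defaultdict(int)
    for trigram in trigrams:
        for doc_id in index.get(trigram, ()):
            shared[doc_id] += 1
    return {doc_id for doc_id, count in shared.items() if count >= min_shared}
```

A low threshold keeps the filter tolerant to recognition errors, at the cost of passing more documents to the matching step. Words with fewer than three phonemes produce no trigrams and would need separate handling.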

Conclusions
In this paper we have described different indexing methods for a system that retrieves speech documents in response to given textual queries. As a preprocessing step in indexing the speech documents, we use a phoneme recognition system which produces phonemic transcriptions of the speech. Unlike other approaches that work with a fixed recognition vocabulary, we gain independence between the recognition and the indexing process. In particular, the recognition system does not impose any limitations on the query vocabulary.
Based on phonemic transcriptions, we have described two different indexing methods: an N-gram method and a probabilistic word matching method. Both methods account for the high number of different word inflections and compounds in the German language. For example, the query word Europa (Europe) matches documents containing europäischen (European). Similarly, for a compound like Fussballweltmeisterschaft (football world championship), the system can find matches to the base parts Fussball, Welt and Meisterschaft.
Although less than 72% of the phonemes are recognised correctly, it is still possible to find useful information, because the indexing methods account for recognition errors to a certain extent. The N-gram method does this by regarding a limited context, whereas the word matching method estimates occurrence probabilities by using explicit information about common recognition errors.
We have evaluated the speech retrieval system and both indexing methods on an IR test collection. Using the best indexing method, 24% of the first five documents are relevant on average. The probabilistic word matching method performs better than the N-gram method, because it uses more phoneme context while searching.

Figure 2: Speech Retrieval System Prototype. The query means: Negotiations with the European Community. The system is playing the top ranked document. In this example, all the important query words are uttered.

Figure 3: Recall-precision curves for N-gram indexing methods

Figure 4: Recall-precision curves for the trigram and word matching methods for different weighting schemes