A language model which integrates uncertainty

All information retrieval models are based on the equality between query terms and document terms. This principle assumes that terms are correctly extracted from the documents and the queries. In some contexts, however (such as documents generated by automatic speech recognition or optical character recognition systems), this assumption does not hold. We propose a generalisation of the language-model-based information retrieval model that integrates this dimension. To do so, we introduce two notions: the term certainty value (related to the extraction process) and the pairing between two terms. The pairing between two terms is defined by their relative position (called the concordance) and the area they have in common (called the intersection).


INTRODUCTION
Usually, in information retrieval, a matching function is used to rank the documents of a corpus in response to a query. This function matches the terms of a query with the terms of a document, determining whether each query term is present in the document or not. We can distinguish two cases: term presence (td,j = tq,i) and term absence (td,j ≠ tq,i). To do so, we assume that the terms used in the document representation are equal to the initial terms (i.e. the terms really used in the document).
Ex: Q = {'tomato'} and D1 = {'tomato'}, D2 = {'tomato', 'orange'}, D3 = {'tamato'}. Only the documents D1 and D2, which contain the term of the query Q, are considered relevant for Q. The document D3 is considered non-relevant even though its term 'tamato' is probably a badly recognized 'tomato'.
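The example above can be reproduced with a minimal sketch of strict term matching. The function name `exact_match` and the dictionary layout are illustrative assumptions, not part of the model described in this paper.

```python
# Strict matching: a document is retrieved only if it shares at least
# one term with the query, compared by exact string equality.
def exact_match(query: set, document: set) -> bool:
    return bool(query & document)

query = {"tomato"}
documents = {
    "D1": {"tomato"},
    "D2": {"tomato", "orange"},
    "D3": {"tamato"},  # 'tomato' badly recognized by the extraction process
}

relevant = [name for name, terms in documents.items() if exact_match(query, terms)]
print(relevant)  # D3 is missed because 'tamato' != 'tomato'
```

With exact equality, D3 can never be retrieved, which is the limitation that the "almost present" case is meant to address.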
In some contexts, the assumption described above is not completely true. For example, in the automatic speech recognition (ASR) or optical character recognition (OCR) domains, the system can make transcription errors: badly recognized words, missed words or added words. This is what is called uncertain data. In such contexts, each term is associated with a certainty value provided by the extraction process.
The matching function takes into account whether the terms of the query are present or absent in the document, and we add a new case: the term of the query is almost present (td,j = ~tq,i) in the document.
We call pairing the way two terms are associated. The pairing between two terms is defined by their relative position (called the concordance) and the area they have in common (called the intersection). To take into account the three possible types of equality between a term of a query and a term of a document, we must integrate two notions: term certainty value and pairing (cf. Figure 1).
In this paper, we deal with this problem of uncertain data. First, we present a state of the art on the performance of information retrieval models on uncertain data. Then we define the notion of pairing through the concordance and the intersection between two terms. After describing the document and query representations, we develop our matching function. Finally, we present our first evaluations, which show that handling the term certainty value in the term weighting improves retrieval results.

STATE OF THE ART
The applicability of information retrieval to uncertain data is a problem that arose with the growing number of electronic documents produced from scanned paper documents or from audio recordings transcribed by automatic speech recognition systems. Indeed, many documents have been scanned to be made available online. Documents produced by optical character recognition systems are uncertain because of badly recognized, missed and added words. The same problem occurs for documents generated by automatic speech recognition systems.
Research has compared the performance of several information retrieval systems on uncertain data versus correct data (i.e. data from manual transcription). Taghva [9] compares the performance of a Boolean information retrieval system on OCR documents versus the corresponding correct documents. Croft [3] carries out the same study on a vector space information retrieval system (SMART) and a probabilistic information retrieval system (INQUERY); Mittendorf [5] does the same on a probabilistic information retrieval system. Their conclusion is that uncertain data have no significant effect on retrieval performance for long documents: recall is not affected and average precision decreases by 3%. Average precision is more affected for short documents: it decreases by 10%.
Grangier [4] compares the performance of a probabilistic information retrieval system on ASR documents versus the corresponding correct documents. His conclusion is the same as the one given for OCR documents. The average precision for long documents decreases by 5%.
It is not surprising that results on uncertain data (i.e. OCR or ASR documents) are almost the same as on the corresponding correct documents, because a long document smooths out errors. In fact, the same word can be badly recognized in some places of the document and correctly recognized in others. So, globally, the word is known to be in the document.
On the other hand, short documents are not included in such experiments, even though the bad recognition of a word in such documents has a direct, negative influence on the quality measures (recall, precision). We propose in this paper to integrate not only the absence or presence of each term but also its approximation due to a probably bad extraction.

MATCHING PRINCIPLE
We want to introduce the notion of uncertainty (i.e. certainty value and pairing) in the language model adapted to information retrieval. Usually, the language model principle is based on the probability P(Q | MD) that a query Q is generated by a document model MD [5], under the assumption that the terms of the query (Q = tq,1, tq,2, …, tq,Nq) are independent [7]. The probability that the term tq,i is in the corpus of documents C (i.e. tc,j = tq,i) is used in the matching function to avoid a null probability for the query when a term is absent from the document (i.e. td,j ≠ tq,i): this is a smoothing of the language model. This formulation is based on the probability of a term in a document model.
To take the notion of uncertainty into account, we must add a new dimension: term approximation (i.e. td,j = ~tq,i). To do so, we add a third parameter to the matching function: P(~tq,i | MD). This is the probability of the approximations of the term in the document model MD.
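As a reminder, the usual query likelihood with interpolation smoothing can be written as below. This is the standard textbook form, given here as a sketch and not necessarily in the exact notation of this paper; the symbol \lambda for the interpolation weight is an assumption.

```latex
P(Q \mid M_D) \;=\; \prod_{i=1}^{N_q}
  \Big[ \lambda \, P(t_{q,i} \mid M_D) \;+\; (1 - \lambda) \, P(t_{q,i} \mid C) \Big],
  \qquad 0 \le \lambda \le 1
```

The first term covers term presence in the document and the second covers term presence in the corpus; the approximation probability P(~tq,i | MD) is the third quantity that the proposed model adds to this mixture.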
Before developing each component of the function, we define the concept of pairing.

PAIRING NOTION
Consider two terms x and y of respective lengths nx and ny. These two terms can be paired, i.e. they are close to each other when only their string values and their relative position are taken into account. The pairing is more or less strong:
- according to their concordance, i.e. their relative position;
- according to their intersection.
In the example of Figure 2, we can measure the pairing between x and y by taking into account their concordance, i.e. x begins y: Conc(x, y) = begins, and their intersection (shaded area), i.e. x for the one and y1 for the other: Inter(x, y) = (x, y1).
To describe more precisely the type of equality that exists between two terms, a typology of relations between two terms is necessary. Our concordance typology is based on Allen's relations. Allen's relations make it possible to define the temporal position of all the objects of a document by placing temporal relations between them.

Concordance between two terms
Allen's model is a complete graph where the arcs express relations and the nodes represent the intervals [1]. This model defines thirteen possible relations between two intervals: relations of connectivity (precedes, meets, overlaps), inclusion (begins, during, finishes) and simultaneity (equal), together with their inverses. If we transpose Allen's relations to our context, we get the typology illustrated by Figure 3. We now deal with the concordance between two terms instead of temporal relations. Two terms can concord in different ways; this is what is called the concordance (cf. Figure 3, column 2).

Nature of the concordance between two terms
The set of all concordances between two terms x and y is denoted A: A = {begins, is_began_by, is_included_in, includes, ends, is_ended_by, overlaps, concurs, not_concurs}. We define the function Conc(x, y), which determines the nature of the concordance between the terms x and y ∈ V (V being a closed set of terms).
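A minimal sketch of how Conc could be computed is given below. It assumes that each term occurrence is located by a hypothetical pair of offsets (character positions or time stamps), which the text does not specify, and it uses one plausible mapping of Allen's thirteen relations onto the nine values of A: the four disjoint relations collapse into not_concurs and the two overlap relations into overlaps. The `Span` type and this mapping are illustrative assumptions.

```python
from typing import NamedTuple

class Span(NamedTuple):
    """Hypothetical position of a term: half-open interval [start, end)."""
    start: int
    end: int

def conc(x: Span, y: Span) -> str:
    """Return the nature of the concordance between x and y (assumed mapping)."""
    # No shared area (Allen: precedes, meets and their inverses).
    if x.end <= y.start or y.end <= x.start:
        return "not_concurs"
    # Same extent (Allen: equal).
    if x == y:
        return "concurs"
    # Same start, different end (Allen: starts / started-by).
    if x.start == y.start:
        return "begins" if x.end < y.end else "is_began_by"
    # Same end, different start (Allen: finishes / finished-by).
    if x.end == y.end:
        return "ends" if x.start > y.start else "is_ended_by"
    # Strict inclusion (Allen: during / contains).
    if y.start < x.start and x.end < y.end:
        return "is_included_in"
    if x.start < y.start and y.end < x.end:
        return "includes"
    # Remaining case: partial overlap in either direction.
    return "overlaps"

# Situation of Figure 2: x and y start together and nx < ny.
print(conc(Span(0, 3), Span(0, 6)))
```

The order of the tests matters: disjointness and equality are ruled out first so that the later comparisons only see spans that share some area and differ in at least one bound.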

Value of the concordance
All the possible values of the concordance are defined by ValConc.

Nature of the intersection
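The nature of the intersection between two terms x and y is given by Inter(x, y), the pair of areas that x and y have in common (in the example of Figure 2, Inter(x, y) = (x, y1)).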

Intersection value
We define the intersection value between two terms (x, y) ∈ V² by the function ValInter.
For example, we can have the following situation: Conc(x, y) = concurs, and x and y have a partial equivalence because we are working in a context where data are uncertain. This degree of equivalence of the intersection between the two terms is expressed by ValInter(x, y).
The problem of uncertain data is known in several domains, such as synonym detection and spelling correction; the algorithms used in these domains can also be used to determine ValInter(x, y).
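As an illustration of this remark, the sketch below computes a value for ValInter with a generic string similarity from Python's standard library. The choice of `difflib.SequenceMatcher` is an assumption made for the example; the text does not state which algorithm is used, and any spelling-correction similarity (e.g. an edit distance) could be substituted.

```python
from difflib import SequenceMatcher

def val_inter(x_part: str, y_part: str) -> float:
    """Degree of equivalence, in [0, 1], between the two sides of an intersection.

    Illustrative stand-in for ValInter: 1.0 for identical strings,
    0.0 when the strings share no character.
    """
    return SequenceMatcher(None, x_part, y_part).ratio()

print(val_inter("tomato", "tomato"))
print(val_inter("tomato", "tamato"))
```

With such a measure, 'tomato' and 'tamato' concur with a high but not maximal intersection value, which matches the partial equivalence described above.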

DOCUMENT REPRESENTATION
We describe the document model definition which integrates term certainty.

Term certainty
A certainty value c is associated with each term x ∈ V. This value depends on the extraction process and is given by the function Cert.

Document model
We call MD the unigram model of the document D:
MD = {(td,1, P(td,1)), (td,2, P(td,2)), …, (td,i, P(td,i)), …, (td,Nd, P(td,Nd))} with td,i ∈ V and ∀i, ∀j, i ≠ j ⇒ td,i ≠ td,j
Usually in a language model, a count of the occurrences of the term td,i is used to calculate the probability of td,i in the document model. In a context where terms are associated with a certainty value, the count is based not only on the number of occurrences but also on the certainty values of the terms. The probability of a term in a document thus depends both on the maximum likelihood estimation and on the certainty of extraction.
A query Q is represented as a list of terms with their certainty values:
Q = [(tq,1, c1), (tq,2, c2), (tq,3, c3), …, (tq,Nq, cNq)] with tq,i ∈ V and ci ∈ ℜ+.
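One way to make the count depend on certainty values is sketched below: each occurrence contributes its certainty instead of 1, and the result is normalised over the document. This particular weighting is an assumption made for illustration; the exact formula of the model is not reproduced here. The function name `document_model` and the input layout are hypothetical.

```python
from collections import defaultdict

def document_model(occurrences):
    """Build a unigram model from (term, certainty) pairs, one per occurrence.

    Assumed weighting: P(t) = sum of the certainties of the occurrences of t,
    divided by the sum of the certainties of all occurrences.
    """
    mass = defaultdict(float)
    for term, certainty in occurrences:
        mass[term] += certainty
    total = sum(mass.values())
    if total == 0:
        return {}
    return {term: m / total for term, m in mass.items()}

# Two occurrences of 'tomato' (one uncertain) and one uncertain 'orange'.
model = document_model([("tomato", 1.0), ("tomato", 0.5), ("orange", 0.5)])
print(model)
```

When every certainty equals 1, this reduces to the usual maximum likelihood estimate based on occurrence counts.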

Definitions
As in most language models, we make the following simplification: we assume that the terms of the query are independent. The matching function associated with the language model takes into account the case where the term of the query is present in the document and the case where the term is absent from the document but present in the corpus (thanks to a smoothing function). In our working context, we take into account three dimensions: presence of the term in the document, presence of the term in the corpus and presence of approximations of the term in the document.
To define the matching function, we use the principle of query generation by the document to evaluate the score of a document in response to a query. With a query Q and a document D as defined previously, the matching function follows the principle given earlier (cf. MATCHING PRINCIPLE) and is built from the components defined below. If the term x is not present in the document, Presence(MD, x) = w0, where w0 is the null term.
The more the term x is present in the document D, the higher the probability P(tq,i | MD).

The term is absent in the document:
We use a smoothing function to take into account that a term of the query can be absent from the document and to avoid a null score for the entire query. There are many smoothing functions in the language model domain; we choose the smoothing by interpolation proposed by Hiemstra [5]. The probability of the term tq,i in the corpus C is written P(tq,i | C).
Approximation(MD, x) determines all the terms of the document model MD that can be paired with x. To determine the approximations of a term, we use the Soundex algorithm [11].
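The sketch below shows how Soundex codes can be used to collect the approximations of a term. It implements the classic four-character American Soundex; whether the paper uses this exact variant, and the helper name `approximation` with its set-of-terms argument, are assumptions made for the example.

```python
# Consonant classes of the classic American Soundex.
_CODES = {}
for _digit, _letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                         "4": "l", "5": "mn", "6": "r"}.items():
    for _ch in _letters:
        _CODES[_ch] = _digit

def soundex(term: str) -> str:
    """Return the four-character Soundex code of a term ('' if no letters)."""
    letters = [c for c in term.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    previous = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate two letters with the same code
        digit = _CODES.get(ch, "")  # vowels get no digit and reset 'previous'
        if digit and digit != previous:
            code += digit
        previous = digit
    return (code + "000")[:4]

def approximation(document_terms, x):
    """Terms of the document that share the Soundex code of x but differ from x."""
    target = soundex(x)
    return sorted(t for t in document_terms if t != x and soundex(t) == target)

print(soundex("tomato"), soundex("tamato"))
print(approximation({"tamato", "orange"}, "tomato"))
```

Because Soundex ignores vowels after the first letter, 'tamato' receives the same code as 'tomato' and is returned as an approximation, whereas 'orange' is not.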

Matching function
To combine the three dimensions (presence of the term in the document, presence of the term in the corpus and presence of approximations of the term in the document), we must use two parameters. The first defines the importance of the approximations in comparison with the terms themselves. The second varies the importance of the smoothing function (i.e. the case x ≠ y). It is also necessary to take into account the fact that the extraction of the query terms is uncertain, by introducing Cert(tq,i) in the matching function. The matching function that takes uncertainty into account therefore combines these three probabilities, weighted by the two parameters and by Cert(tq,i).
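To make the combination concrete, here is one possible way to assemble the three probabilities, given as a sketch under stated assumptions and not as the matching function of this paper. The parameter names `alpha` (terms versus approximations) and `lam` (document versus corpus), their default values, and the use of Cert(tq,i) as a weight on each term's log-probability are all hypothetical choices.

```python
import math

def score(query, p_doc, p_approx, p_corpus, alpha=0.7, lam=0.8):
    """Log-score of a document for a query (illustrative combination only).

    query    : list of (term, certainty) pairs
    p_doc    : term -> P(term | MD)
    p_approx : term -> P(~term | MD), probability of its approximations
    p_corpus : term -> P(term | C)
    """
    total = 0.0
    for term, certainty in query:
        # Assumed mixture: exact occurrences versus approximations.
        in_document = alpha * p_doc.get(term, 0.0) + (1 - alpha) * p_approx.get(term, 0.0)
        # Interpolation smoothing with the corpus probability.
        mixed = lam * in_document + (1 - lam) * p_corpus.get(term, 0.0)
        if mixed <= 0.0:
            return float("-inf")  # term unknown everywhere
        # Assumed role of Cert: uncertain query terms weigh less.
        total += certainty * math.log(mixed)
    return total

query = [("tomato", 1.0)]
corpus = {"tomato": 0.25}
s_exact = score(query, {"tomato": 1.0}, {}, corpus)   # term present
s_approx = score(query, {}, {"tomato": 1.0}, corpus)  # only 'tamato' present
s_absent = score(query, {}, {}, corpus)               # nothing related
print(s_exact, s_approx, s_absent)
```

Under these assumptions, a document that contains only an approximation of the query term ranks between a document that contains the term itself and a document that contains neither.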

FIRST EVALUATIONS
To validate the integration of the term certainty value in the term weighting, we evaluated our proposition on the CLEF-2004 collection. This corpus is composed of French newspaper articles ("Le Monde"). There are 47 646 documents and 50 queries in natural language. We compare our proposition [10] with Salton's vector space model [8]. Our term weighting improves precision for the first recall value (cf. Table 1).

CONCLUSION AND PERSPECTIVES
In this paper, we deal with the problem of uncertain data (for example data from an ASR or OCR process). No information retrieval system addresses this problem: there are only observations that retrieval performance is not significantly affected for long documents and is more affected for short documents. We want to take this dimension into account in the matching function. To do so, we introduce two notions: term certainty value and pairing between two terms. In the near future, we will evaluate our model. We also ask whether this method can be seen as a language model smoothing method. The handling of uncertainty can be a solution for words that are absent from a document.

Figure 1. General principle
Figure 2. x and y start together and nx < ny

Figure 3. Concordance and intersection between two terms

Figure 3 (third column) shows the intersection associated with each concordance defined previously.

Some approximations of the term are present in the document: