A Dempster-Shafer Model for Document Retrieval using Noun Phrases

In this paper, we propose a document retrieval system based on natural language processing of documents and queries. We use single terms and term groups as indexing elements to represent documents and queries. The model is formally expressed within the Dempster-Shafer Theory of Evidence. We discuss in detail how we use this theory to represent a document collection, indexing elements, documents and queries. The retrieval function is derived directly from the underlying theory. We then present an implementation of the model. The experimental work carried out is reported last.


Introduction
Document retrieval (DR) systems focus on the problem of retrieving documents relevant to a user's information need represented as a query. In this work, we concentrate on text-based systems. Traditional DR models usually represent documents and queries as a set of keywords. Hence, they cannot handle the diversity of human language. Their simplistic representational approaches cannot, for example, differentiate between a query on "high risk of intoxication" and a query on "risk of high intoxication".
The use of Natural Language Processing (NLP) in DR attempts to overcome such shortcomings [8,9]. NLP techniques are based on the fact that the content of a document or a query is encoded in natural language. They aim to extract accurate linguistic structures that are then used to represent documents and queries, where linguistic structures can vary from noun-phrases to tagged sentences.
In this work, we use shallow NLP techniques to represent documents and queries. Shallow NLP techniques are not domain specific and can be applied to any document collection. Our model is formally expressed within the Dempster-Shafer (D-S) Theory of Evidence [7]. This is a theory of uncertainty that yields structural representations (a set of propositions and their associated beliefs) from the available evidence. The evidence here consists of the indexing elements obtained by applying shallow NLP techniques to documents and queries.
The first section gives a brief introduction to the D-S theory. In the next section, we present the theoretical model. A description of the implementation of the model follows. The experimental work conducted is then reported, followed by an evaluation of the results. Conclusions and future work are discussed in the last section.
DEFINITION 2.1 Let $\Theta$ be a finite non-empty set of mutually exhaustive and exclusive events. The set $\Theta$ is called a frame of discernment.
DEFINITION 2.2 Let $2^\Theta$ be the set of all subsets of the set $\Theta$, including the empty set $\emptyset$ and $\Theta$ itself.
DEFINITION 2.3 Given a frame of discernment $\Theta$, the function $m: 2^\Theta \to [0, 1]$ is called a basic probability assignment (bpa) if:
$$m(\emptyset) = 0 \quad \text{and} \quad \sum_{A \in 2^\Theta} m(A) = 1$$
The bpa represents a source of evidence supporting various subsets $A$ in $2^\Theta$ with value, or "degree of support", $m(A)$. The subsets $A \in 2^\Theta$ such that $m(A) > 0$ are called focal elements.
DEFINITION 2.4 Given a bpa $m: 2^\Theta \to [0, 1]$, a function $Bel: 2^\Theta \to [0, 1]$, called a belief function over $\Theta$, is defined as:
$$Bel(A) = \sum_{B \subseteq A} m(B)$$
The value $Bel(A)$ quantifies the strength of the total belief committed to $A$. In contrast, $m(A)$ quantifies the exact belief committed to $A$.
A particular characteristic of the theory is that a belief of $x$ in an event does not necessarily imply that the belief associated with the negation of the event is $1 - x$ (as happens in probability theory). In the absence of any other evidence to support the negation of the event, the remaining belief is assigned to the frame of discernment (all the possible events), and represents the uncommitted belief.

A logical interpretation of the D-S theory of evidence
The D-S theory has a logical interpretation, which we use in this paper. The events defining the frame of discernment $\Theta$ can be considered as a set of elementary propositions. The non-elementary propositions are the propositions defined as the disjunctions of the elementary propositions. The set $2^\Theta$ is the set of all propositions, whether elementary or not, including true ($\top$) and false ($\bot$).
Using the logical view of the D-S theory, the bpa is defined as $m: 2^\Theta \to [0, 1]$ such that:
$$m(\bot) = 0 \quad \text{and} \quad \sum_{p \in 2^\Theta} m(p) = 1$$
Correspondingly, the belief function $Bel: 2^\Theta \to [0, 1]$ becomes:
$$Bel(q) = \sum_{p \to q} m(p)$$
where $\to$ is the material implication of classical logic. In words, the belief in $q$ depends on the bpa values of the propositions of the frame that imply $q$.
IRSG98
EXAMPLE 2.2 Following our previous example, let $m(e_0) = 0.3$ and $m(e_0 \vee e_1) = 0.4$. The uncommitted belief is captured by assigning a bpa value to the truth proposition $\top$: $m(\top) = 0.3$. All other propositions of $2^\Theta$ have a null bpa value.
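The belief computation can be illustrated with a small sketch that uses the bpa values of Example 2.2. The three-element frame (`e0`, `e1`, `e2`) and the names `FRAME`, `bpa` and `belief` are assumptions made for illustration only; propositions are encoded as sets of elementary propositions, so that implication becomes set inclusion.

```python
# Assumed frame with three elementary propositions (illustrative only).
FRAME = frozenset({"e0", "e1", "e2"})

# bpa values taken from Example 2.2; the truth proposition is the whole frame.
bpa = {
    frozenset({"e0"}): 0.3,        # m(e0)
    frozenset({"e0", "e1"}): 0.4,  # m(e0 v e1)
    FRAME: 0.3,                    # m(T), the uncommitted belief
}

def belief(bpa, proposition):
    """Bel(q): sum of m(p) over the focal elements p that imply q (p is a subset of q)."""
    return sum(mass for focal, mass in bpa.items() if focal <= proposition)

q = frozenset({"e0", "e1"})
bel_q = belief(bpa, q)              # belief committed to e0 v e1
bel_not_q = belief(bpa, FRAME - q)  # belief committed to the negation
```

Under these assumptions, the belief in the proposition and the belief in its negation do not add up to one; the difference is the uncommitted belief held by the truth proposition.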

Description of the Model
In this section we present our model. First we describe the indexing elements upon which the model is based (section 3.1). Based on this, we explain how the document collection is represented as a frame of discernment (section 3.2). We show how documents are represented within the frame (section 3.3). Finally, we describe the query representation (section 3.4) and the retrieval function (section 3.5).

Indexing elements
The representation of documents is based on the following two categories of indexing elements:
Single Terms: These are the standard single terms used in traditional DR systems.
Term Groups: These are groups of words derived from noun-phrases extracted from documents and queries. The NLP tool that extracts the noun-phrases is described in Section 4.
We construct propositions based on word content only. Therefore, the ordering of the words constituting term groups is ignored. Our aim is not to obtain an exact representation of the extracted linguistic structures, but to use them to derive a more precise description of document and query content. However, the model proposed in this paper can be extended to include word ordering.
EXAMPLE 3.1 Suppose that the noun-phrase "red wine" has been extracted from a document. The corresponding term group is {red, wine}. This term group adds a more precise description to the document representation, namely that the document is about "Red AND Wine". This is equivalent to "Wine AND Red" since word ordering is ignored. Suppose instead that the analysis of the document does not yield the above term group, but only the two single terms "red" and "wine". This means that the document is about "Red" and about "Wine", but not necessarily about "Red AND Wine". In practice, this means that the document is about "Red OR Wine".
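Because word ordering is ignored, a term group behaves like an unordered set of words. A minimal sketch, in which the helper name `term_group` is hypothetical and lower-casing is an added assumption:

```python
def term_group(noun_phrase_words):
    """Build an order-insensitive term group from the words of a noun-phrase."""
    return frozenset(word.lower() for word in noun_phrase_words)

red_wine = term_group(["red", "wine"])
wine_red = term_group(["Wine", "Red"])
```

Both calls yield the same term group, which mirrors the equivalence of "Red AND Wine" and "Wine AND Red".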

Document collections as frames
First, we give some preliminary definitions of the indexed document collection.
DEFINITION 3.1 Let $C = \{D_1, \ldots, D_N\}$ be a document collection, where $N$ is the number of documents. Let $S = \{s_1, \ldots, s_{|S|}\}$ be the set of single terms that appear in the document collection $C$, where $|S|$ is the number of single terms in the document collection. Also, let $G = \{g_1, \ldots, g_{|G|}\}$ be the set of term groups in the document collection, where $g_i \subseteq S$.
DEFINITION 3.2 For a document $D_i \in C$, let $S_i \subseteq S$ be the set of single terms that appear in $D_i$. Also let $G_i \subseteq G$ be the set of term groups that appear in $D_i$.
The single sentence of document $D_1$ (Example 3.2) is: "At every stop of our long motorcycle trip, we were drinking red dry wine".

The following set of single terms (underlined in the sentence) is obtained for document $D_1$: $S_1 = \{$stop, long, motorcycl, trip, drink, red, dry, wine$\}$. The set of term groups for document $D_1$ is $G_1 = \{\{$long, motorcycl, trip$\}, \{$red, dry, wine$\}\}$.
For a document collection $C$, a frame of discernment $\Theta$ is constructed based on the set $S$. The elements of the frame are defined as mutually exclusive propositions derived from a boolean combination generated from the set $S$.
DEFINITION 3.3 For the set of single terms $S = \{s_1, \ldots, s_{|S|}\}$ of a document collection $C$, all the $2^{|S|}$ boolean combinational elements are generated using the terms $s \in S$, the negations ($\neg$) of these terms and the boolean conjunction ($\wedge$). These boolean elements represent the elementary propositions of the constructed frame $\Theta$. It can be shown that the number of constructed elementary propositions is $2^{|S|}$.
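The construction of Definition 3.3 can be sketched by enumerating every truth assignment over the single terms. The function names below are hypothetical, and the three terms are those of Example 3.3; the enumeration order is chosen so that the list index matches the subscripts of that example.

```python
from itertools import product

def elementary_propositions(terms):
    """Return all 2^|S| truth assignments over the terms, one per elementary proposition."""
    terms = list(terms)
    return [dict(zip(terms, values))
            for values in product([False, True], repeat=len(terms))]

def render(assignment):
    """Write an elementary proposition as a conjunction of possibly negated terms."""
    return " ∧ ".join(term if value else "¬" + term
                      for term, value in assignment.items())

frame = elementary_propositions(["Red", "Wine", "Long"])
for k, e in enumerate(frame):
    print(f"e{k} = {render(e)}")
```

Explicit enumeration grows exponentially with the number of terms, so it is practical only for toy vocabularies such as this one.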

Document representation
With the document collection modelled as a frame of discernment, the representation of documents is achieved through a set of focal elements (section 3.3.1) and the associated bpa (section 3.3.2) defined on the frame of discernment.

Focal and indexing elements
In the D-S theory, focal elements correspond to propositions for which there is positive evidence. Therefore, focal elements can be used as propositions modelling the indexing elements (single terms and term groups) of a document.
For a document $D_i \in C$, they are defined upon the sets $S_i$ and $G_i$.
DEFINITION 3.4 Every single term $s_j \in S_i$ of a document $D_i \in C$ defines a focal element, namely the proposition $p_j$. Furthermore, every term group $g_k \in G_i$ also defines a focal element, the proposition $p_k = \bigwedge_l p_l$, where each $p_l$ is the proposition associated with the single term $r_l$ for $r_l \in g_k$. $\Phi_i$ is defined as the set that includes all the propositions representing single terms and term groups of the document $D_i$.
The frame of discernment, along with the propositions modelling the indexing elements of the document $D_1$, is shown schematically in Figure 1. The truth proposition $\top$ can be viewed as the disjunction of all the elementary propositions. As explained in section 2.1, this proposition is used to capture the uncommitted belief for the document according to the D-S theory, and may then constitute a focal element.
The propositions used to model a document derive from observed evidence (the sets $S_i$ and $G_i$ for document $D_i$).
Obviously, some propositions have stronger evidence than others. This is represented in the D-S theory via the use of a bpa.

Basic probability assignment
A bpa must be defined for every document D i to capture the exact belief that the various propositions (focal elements) provide a good description of the document content.We compute the bpa values from term statistical characteristics in documents.

The bpa formula considered is:
$$m_i(p_j) = \begin{cases} \dfrac{FREQ_i(p_j)}{TOTFREQ_i} \cdot IDF(N, p_j) & p_j \in \Phi_i \\ 0 & p_j \notin \Phi_i \text{ and } p_j \neq \top \end{cases} \qquad (1)$$
where:
(i) $FREQ_i(p_j)$ is the number of occurrences of the indexing element represented by the proposition $p_j$ in the document $D_i$.
(ii) $TOTFREQ_i = \sum_{p_k \in \Phi_i} FREQ_i(p_k)$ is the total number of occurrences of the indexing elements of the document $D_i$.
(iii) $IDF(N, p_j)$ is the inverse document frequency of the indexing element represented by the proposition $p_j$ in a collection with $N$ documents.
The first part of the formula ($p_j \in \Phi_i$) assigns a positive bpa value to propositions representing indexing elements of $S_i$ and $G_i$. The second part ($p_j \notin \Phi_i$) assigns 0 to all other propositions except for the truth proposition $\top$.
In words, $FREQ_i(p_j) / TOTFREQ_i$ calculates term frequencies (i.e. the number of occurrences normalised by the total number of occurrences). In [10], alternative variants of $FREQ_i(p_j)$ were tested (e.g. different formulations were used to compute the frequency values of term groups), but they failed to give better results than the formulation above.
$$IDF(N, p_j) = \begin{cases} \log_N \dfrac{N}{n(p_j)} & \text{for } p_j \text{ a single term} \\ \min_{p_k \in P_j} \log_N \dfrac{N}{n(p_k)} & \text{for } p_j \text{ a term group} \end{cases} \qquad (5)$$
where:
(i) $n(p_k)$ is the number of documents that contain the indexing element represented by $p_k$.
(ii) $P_j$ is the set of propositions $p_k$ that represent the single terms composing the term group $p_j$.

Formula (2) gives the standard IDF formula. In formulas (3), (4) and (5), the IDF value of single terms is as in Formula (2), whereas the IDF value for term groups is calculated as the maximum, the average and the minimum of the IDF values of the single terms that constitute the term groups, respectively.
Since the logarithms used in our formulas are to base $N$, the IDF values lie in the interval $[0, 1]$. As a result, the D-S restriction that the total bpa always equals one ($\sum_{A \in 2^\Theta} m(A) = 1$) is satisfied (see [10]).
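A minimal sketch of the bpa computation follows, combining the frequency ratio of Formula (1) with the minimum-IDF treatment of term groups. All function names, the `TOP` marker and the toy counts are assumptions made for illustration. The sketch also assumes that the belief left over after all indexing elements have received their value is assigned to the truth proposition as uncommitted belief, and that the collection holds more than one document so that the base-N logarithm is defined.

```python
import math

TOP = "<TOP>"  # hypothetical marker for the truth proposition

def idf(n_docs, doc_freq):
    """Standard IDF with the logarithm to base N, so the value lies in [0, 1]."""
    return math.log(n_docs / doc_freq, n_docs)

def idf_min(n_docs, element, doc_freq):
    """Single terms use the standard IDF; term groups use the minimum over their terms."""
    if isinstance(element, frozenset):
        return min(idf(n_docs, doc_freq[term]) for term in element)
    return idf(n_docs, doc_freq[element])

def bpa(freq, n_docs, doc_freq):
    """Compute m_i for one document from the occurrence counts of its indexing elements."""
    total = sum(freq.values())  # TOTFREQ_i
    masses = {element: (count / total) * idf_min(n_docs, element, doc_freq)
              for element, count in freq.items()}
    masses[TOP] = 1.0 - sum(masses.values())  # assumed: remainder is uncommitted belief
    return masses

# Toy counts (made up): a collection of 1000 documents.
doc_freq = {"red": 100, "wine": 10}
freq = {"red": 2, "wine": 3, frozenset({"red", "wine"}): 1}
m = bpa(freq, 1000, doc_freq)
```

Because every IDF value is at most one and the frequency ratios sum to one, the committed belief never exceeds one, so the remainder given to the truth proposition is never negative.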

Query representation
Queries are represented as propositions defined in terms of the frame of discernment.
DEFINITION 3.6 Let $Q = \{t_1, \ldots, t_q\}$ be the set of indexing elements used in a query, where each $t_k$ can be a single term or a term group. Each query indexing element $t_k$ has an associated proposition $q_k$. The propositions are defined in the same way as for documents (see section 3.3). For a single term that does not appear in the document collection (a term not in $S$), the associated proposition is $\bot$.
DEFINITION 3.7 Let $\Phi_Q$ be the set of propositions representing the query indexing elements of the set $Q$. The query is represented by a proposition $q$ defined as follows:
$$q = \bigvee_{q_k \in \Phi_Q} q_k$$
The disjunction ($\vee$) is used since it is difficult to derive from a natural language query whether a user is seeking, for instance, documents about "red wine" or documents about "red" or "wine", unless the former is found as a term group in the query.
We consider two representations of queries:
Single term queries: Queries where only single terms are used to express the query proposition $q$.
Term group queries: Queries which contain single terms and term groups. Single terms that appear only in term groups, and not as stand-alone single terms in a query, are represented in the query proposition only as part of the term group proposition.
EXAMPLE 3.6 Consider the following query sentence: "Documents about red wine". Based on our previous example, if only single terms are considered, we obtain $Q_1 = \{$document, red, wine$\}$ and the query proposition is $q = \bot \vee p_2 \vee p_3$. The term "document" is represented by the false proposition ($\bot$) because the term is not part of the set $S$. If term groups are used, we have $Q_1 = \{$document, $\{$red, wine$\}\}$, and the query proposition becomes $q = \bot \vee p_4$.
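The query construction can be sketched as follows, with the query proposition stored as a list of disjuncts, each a set of terms read as a conjunction. The function name `query_proposition` is hypothetical. Disjuncts that reduce to the false proposition are simply dropped, since they do not change a disjunction; treating a term group that contains an unknown term as false is an added assumption (a conjunction with a false conjunct is false).

```python
def query_proposition(indexing_elements, vocabulary):
    """Return the query as a list of disjuncts, each a frozenset of conjoined terms."""
    disjuncts = []
    for element in indexing_elements:
        # A single term is a one-word conjunction; a term group is a set of words.
        terms = frozenset([element]) if isinstance(element, str) else frozenset(element)
        if terms <= vocabulary:  # otherwise the disjunct is the false proposition
            disjuncts.append(terms)
    return disjuncts

vocabulary = {"long", "wine", "red"}  # the set S of Example 3.3, lower-cased

single_term_query = query_proposition(["document", "red", "wine"], vocabulary)
term_group_query = query_proposition(["document", {"red", "wine"}], vocabulary)
```

With the vocabulary of the running example, the first query keeps the disjuncts for "red" and "wine", and the second keeps only the term group, matching the two propositions of Example 3.6.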

Retrieval
To estimate the degree of relevance of a document to a query, we use the belief function of the D-S theory. Each document $D_i$ with bpa $m_i$ has an associated belief function $Bel_i$ defined upon $m_i$. The degree of relevance of the document to a query represented by the proposition $q$ is formulated as:
$$Bel_i(q) = \sum_{p \to q} m_i(p)$$
This measure encapsulates the "relevance" of all the propositions used to describe the document content that imply the query formula $q$. If $Bel_i(q) = 0$, the document is not relevant to the query. For a document collection, we use the belief values $Bel_i(q)$ to rank the documents according to their estimated relevance to the query.
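The retrieval function can be sketched without enumerating the frame. The sketch relies on an observation that is not stated in the text: a conjunction of terms implies a disjunction of term conjunctions exactly when one of the disjuncts is a subset of it. Focal elements are stored as sets of terms, the empty set stands for the truth proposition, and all names and bpa values below are made up for illustration.

```python
TOP = frozenset()  # empty conjunction of terms: the truth proposition

def implies(focal, query):
    """True if the focal element (a conjunction of terms) implies the query.

    The query is a list of disjuncts, each a conjunction of terms.
    """
    return any(disjunct <= focal for disjunct in query)

def belief(bpa, query):
    """Bel_i(q): total bpa of the focal elements that imply q."""
    return sum(mass for focal, mass in bpa.items() if implies(focal, query))

def rank(collection, query):
    """Documents with non-zero belief, sorted by decreasing Bel_i(q)."""
    scores = {doc: belief(bpa, query) for doc, bpa in collection.items()}
    return sorted(((doc, score) for doc, score in scores.items() if score > 0),
                  key=lambda pair: pair[1], reverse=True)

# Illustrative bpa values (not taken from the experiments).
collection = {
    "D1": {frozenset({"red"}): 0.2, frozenset({"wine"}): 0.3,
           frozenset({"red", "wine"}): 0.1, TOP: 0.4},
    "D2": {frozenset({"long"}): 0.5, TOP: 0.5},
}
single_term_query = [frozenset({"red"}), frozenset({"wine"})]
term_group_query = [frozenset({"red", "wine"})]
ranking = rank(collection, single_term_query)
```

In this toy collection the single term query collects the belief of every focal element of D1 that mentions "red" or "wine", whereas the term group query collects only the belief of the term group itself; the uncommitted belief never contributes.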

Implementation
We use a part-of-speech tagger and a noun-phrase extractor for the extraction of noun-phrases from the document collections and queries. The NLP software used in this implementation was designed and implemented at the Language Technology Group of the Human Communication Research Centre, at the University of Edinburgh [2]. The tagger achieves 96-98% accuracy if all the words in the text are found in the tagger's lexicon, and 88-92% if unknown words appear in the text.
The DR system used for indexing and retrieval is based on a heavily modified version of the SIRE system [6]. Stop words [11] were removed from the extracted single terms and term groups, which were then stemmed [4]. Thus a phrase like "the long motorcycle trip" yields the term group "long motorcycl trip". Noun-phrases reduced to stand-alone terms because of stop word removal were not treated as noun-phrases, but as single terms.

Experiments and Evaluation
For the evaluation of the model, several experiments were performed. In this section we first describe the document collections used for the experiments, followed by the model evaluation.

Document collections
The Cranfield-1400 collection was used first to perform a large number of initial experiments because it is small compared with modern document collections (Wall Street Journal, Financial Times, etc.). The final experiments were performed on the Financial Times (FT) collection.

The Cranfield-1400 collection
This collection contains 1400 aeronautical abstracts and 225 queries. Two empty documents can be found in the collection. Usually, these documents are ignored. In this work, they are considered valid and are represented with one focal element, the truth proposition $\top$, with a bpa value of 1.
Table 1 shows some statistics of the Cranfield-1400 document collection.

The Financial Times (FT) collection
For the second set of experiments, the Financial Times (FT) collection from TREC-5 was used. The FT collection contains 210158 articles, amounting to roughly 600 Mbytes of text. TREC queries (also referred to as TREC topics) are longer than the Cranfield-1400 queries. Three levels of topic detail are defined. The title level can be viewed as the query entered manually into a system by the user. The description level is an expansion of the title part in one sentence. In the narrative part, the properties that the relevant documents must have are explained.
Table 2 shows the statistics for the FT collection. For the experiments performed in this work, only the combination of the title and the description parts of the TREC topics was used. Some initial experiments (see [10]) showed that narrative queries frequently contain query terms which easily retrieve non-relevant documents.

Model evaluation
We compare our model with the vector space model [5]. We used two variants of the vector space model. The first variant uses the following weighting function:
$$w_i(p_j) = \frac{\log_2(FREQ_i(p_j) + 1)}{\log_2 TOTFREQ_i} \cdot IDF(N, p_j)$$
It was reported in [1] that such a formula "can be safely used" for retrieval. This variant is labelled "VSM".
The second variant of the vector space model uses Formula (1) with the IDF variant (2). We label this variant "Baseline". This model can be considered as a special case of the proposed model where only single terms are used as indexing elements. In this case, $Bel_i(p_j) = m_i(p_j)$ holds for every proposition $p_j$ of any document $D_i$.
It must be noted that both VSM and Baseline use single terms for the representation of documents and queries. The comparisons here are done using the 'best' variant (in terms of retrieval effectiveness) of Formula (1) for our model. This variant combines Formula (1) and the IDF factor (5).
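The VSM weighting function can be written as a short helper. The name `vsm_weight` and its parameters are hypothetical; the IDF value is passed in already computed, and the total frequency is assumed to be greater than one so that the denominator is non-zero.

```python
import math

def vsm_weight(freq, total_freq, idf_value):
    """log2(FREQ + 1) / log2(TOTFREQ), scaled by the IDF of the indexing element."""
    return math.log2(freq + 1) / math.log2(total_freq) * idf_value

weight = vsm_weight(freq=3, total_freq=16, idf_value=0.5)
```

Compared with the plain frequency ratio of Formula (1), the logarithms dampen the effect of high occurrence counts.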

Single term queries
The first comparison examines the results obtained with our model using the single term queries. The comparison can be seen in Table 3 for the Cranfield-1400 and the FT collections.
It can be seen that our model did not achieve higher precision results than the vector space model. The difference in average precision between our model and the Baseline is 2% for the Cranfield-1400 collection and 1% for the FT.

Term group queries
Here a comparison of the results obtained with the model using the term group queries is presented. The comparison can be seen in Table 4 for the Cranfield-1400 and the FT collections.

EXAMPLE 3.2
Consider a document collection $C = \{D_1\}$ where the document $D_1$ consists of the single sentence quoted after Definition 3.2.

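EXAMPLE 3.3
Let $S = \{$Long, Wine, Red$\}$, with $s_1 =$ Long, $s_2 =$ Wine and $s_3 =$ Red. We obtain $2^3 = 8$ elementary propositions forming the frame of discernment $\Theta$:
$e_0 = \neg\text{Red} \wedge \neg\text{Wine} \wedge \neg\text{Long}$
$e_1 = \neg\text{Red} \wedge \neg\text{Wine} \wedge \text{Long}$
$e_2 = \neg\text{Red} \wedge \text{Wine} \wedge \neg\text{Long}$
$e_3 = \neg\text{Red} \wedge \text{Wine} \wedge \text{Long}$
$e_4 = \text{Red} \wedge \neg\text{Wine} \wedge \neg\text{Long}$
$e_5 = \text{Red} \wedge \neg\text{Wine} \wedge \text{Long}$
$e_6 = \text{Red} \wedge \text{Wine} \wedge \neg\text{Long}$
$e_7 = \text{Red} \wedge \text{Wine} \wedge \text{Long}$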

EXAMPLE 3.4
Let $D_1$ be the document with $S_1 = \{$Long, Wine, Red$\}$ and $G_1 = \{\{$Red, Wine$\}\}$. The following propositions are the focal elements modelling the indexing elements of the document:
$p_1 = \text{Long}$
$p_2 = \text{Wine}$
$p_3 = \text{Red}$
$p_4 = \text{Red} \wedge \text{Wine}$
The propositions modelling indexing elements must be defined in terms of the elementary propositions defining the frame of discernment.
DEFINITION 3.5 A proposition is represented as the disjunction of elementary propositions as follows:
$$\forall p_j \in \Phi_i: \quad p_j = \bigvee_{e_k \in \Theta,\; e_k \to p_j} e_k$$
EXAMPLE 3.5 The proposition $p_1$ in Example 3.4 is defined in terms of the elementary propositions of Example 3.3 as $e_1 \vee e_3 \vee e_5 \vee e_7$. This comes from the following implications:
$(e_1 = \neg\text{Red} \wedge \neg\text{Wine} \wedge \text{Long}) \to \text{Long}$
$(e_3 = \neg\text{Red} \wedge \text{Wine} \wedge \text{Long}) \to \text{Long}$
$(e_5 = \text{Red} \wedge \neg\text{Wine} \wedge \text{Long}) \to \text{Long}$
$(e_7 = \text{Red} \wedge \text{Wine} \wedge \text{Long}) \to \text{Long}$
The proposition $p_4 = e_6 \vee e_7$ is defined as such because of the following implications:
$(e_6 = \text{Red} \wedge \text{Wine} \wedge \neg\text{Long}) \to \text{Red} \wedge \text{Wine}$
$(e_7 = \text{Red} \wedge \text{Wine} \wedge \text{Long}) \to \text{Red} \wedge \text{Wine}$
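The expansion of Definition 3.5 can be checked mechanically for the small frame of Example 3.3. In this sketch, whose names are hypothetical, an elementary proposition implies a conjunction of terms exactly when it makes every one of those terms true; the enumeration order matches the subscripts of Example 3.3.

```python
from itertools import product

TERMS = ("Red", "Wine", "Long")
# FRAME[k] is the truth assignment of the elementary proposition e_k.
FRAME = [dict(zip(TERMS, values))
         for values in product([False, True], repeat=len(TERMS))]

def as_disjunction(required_terms):
    """Indices k of the elementary propositions e_k that imply the given conjunction."""
    return [k for k, e in enumerate(FRAME)
            if all(e[term] for term in required_terms)]

p1 = as_disjunction({"Long"})         # proposition for the single term Long
p4 = as_disjunction({"Red", "Wine"})  # proposition for the term group {Red, Wine}
top = as_disjunction(set())           # the empty conjunction is the truth proposition
```

The empty conjunction is implied by every elementary proposition, which agrees with viewing the truth proposition as the disjunction of the whole frame.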

Figure 1: An example of a document in a frame of discernment
The remaining belief is the uncommitted belief and is assigned as the bpa value of the proposition $\top$. If non-null, $\top$ constitutes a focal element. The various formulations of IDF in Formula (1) were motivated by the work of [3] on syntactic and statistical phrases.

Table 1: Statistical characteristics of the Cranfield-1400 collection
It can be seen that the Cranfield-1400 collection is rich in term groups.

Table 2: Statistical characteristics of the FT collection and TREC topics
Compared with the Cranfield-1400 collection, FT has longer documents on average. Also, the TREC topics are longer than the Cranfield-1400 queries but fewer (50 compared to 225).