Where the Linked Dependence Assumption Fails and How to Move Beyond It.

The linked dependence assumption, a widely used simplification in probabilistic retrieval, is briefly reviewed and its validity is investigated. We show that the linked dependence assumption is violated particularly if the query features are good discriminators between relevant and non-relevant documents. We then propose the manual concept generation method, which takes into account the dependence between different query concepts. For a limited set of routing queries, experiments show a significant improvement when compared with retrieval methods which do not take feature dependencies into account. We consider this method to be the first successful step which progresses beyond the linked dependence assumption.


Introduction
For the past twenty years, the application of probability theory in information retrieval has resulted in powerful and quite effective retrieval models and methods. However, in order to derive practicable models, researchers have had to tread a careful path between theoretical purity and pragmatism. Simplifying the assumptions underlying a given model is typical of such an approach. In this way, a number of linear ranking functions have been derived, including the well-known feature frequency times inverse document frequency weighting schemes (Fuhr 1992). Common assumptions relate to the independence of the occurrence of indexing features, the independence of features given relevance, and the independence of features given non-relevance. These simplifying assumptions are not only crude and poor reflections of the data (Gey 1993), but if assumed simultaneously they even lead to logical inconsistencies (Robertson 1974). Fortunately, Cooper proved that the linked dependence assumption (LDA), a weakening of the simultaneous assumption of independence given relevance and independence given non-relevance, suffices for most probabilistic retrieval functions (Cooper 1995). During the last few years, especially while observing results from interactive retrieval, a growing number of research groups have developed a "feeling" that there is something wrong with the ranking algorithm, in particular for short queries (e.g. Hearst 1996 and Rose & Stevens 1997). The co-occurrence of query features in particular seems to receive a disproportionate emphasis. Refiltering a list of probabilistically ranked documents with a Boolean filter has proved to be successful (Hearst 1996). This idea has also been carried further for routing retrieval (Gey et al. 1997).
The probability ranking principle has been proved optimal from a theoretical point of view, but it is obviously not optimal in practice. Thus there must be something wrong with the theoretical assumptions. We know that the probability ranking principle is not optimal if the relevance of documents is not assessed independently (Gordon & Lenk 1991, Robertson 1977). This accounts for some of the ranking weaknesses in interactive retrieval when a user wants to address different aspects of a query (e.g. "name different alternative sources of energy", "which countries have weak anti-pollution laws", etc.). However, the dependence of relevance assessments cannot be the reason for the probabilistic ranking weakness in the routing task in the TREC environment (Harman 1994), where relevance assessments are purely independent. If we rely on the probability estimates, then the question arises: how well does the linked dependence assumption fit the data? To our knowledge, this has never been investigated. We estimated values on TREC data to determine how well the data fits the linked dependence assumption, and we have concluded that the linked dependence assumption does not fit the data very well at all. Moreover, the LDA is more significantly violated for "good" query features, i.e. features possessing great discriminating power between relevance and non-relevance.
There have been several attempts to include feature dependencies in retrieval models and weighting formulas, e.g. van Rijsbergen 1977, Harper & van Rijsbergen 1978, Eftimiadis 1993. So far these attempts have had only limited success. We believe that different reasons can be given for this. One reason is the fact that these approaches only tried to move beyond independence and not beyond linked dependence. A second reason is that the approaches did not have enough training data to estimate different dependencies. The approach of using non-linear neural networks for the routing task (Hull et al. 1996a, Hull et al. 1996b) is one case which does not rely on the LDA, but it has not yielded great success yet. The difficult estimation of too many parameters, which makes it prone to overfitting, may be responsible for this.
Inspired by our observations about the violation of the LDA, and learning from the related work, we aim to find weighting formulas that move beyond the linked dependence assumption, i.e. beyond a simple sum of feature weights. Such a function must account for the interaction between groups of features (concepts). To be able to estimate the interaction we must restrict our experiments to the routing task, where training data is available. Our feature dependence retrieval model (FDRM) for the routing task requires a grouping of the features into different sets, which we call the concepts of the query. The concepts are required to be independent from each other in a colloquial sense. An example: for a query about "joint ventures involving Japanese companies" a user identified the two independent concepts "joint ventures" and "Japanese companies", and grouped the features "joint" and "venture" into the first concept and features such as "Japan", "Tokyo", "Kobe" into the second concept. The grouping into concepts is not trivial and is best performed by a searcher with domain knowledge. The FDRM shows a substantial improvement in terms of recall and precision compared with other successful retrieval functions. As far as we know, this is the first approach that goes beyond the linked dependence assumption.
Our paper is structured as follows: In Section 2 we give an overview of independence assumptions in probabilistic retrieval and critically discuss the linked dependence assumption. Section 3 explains how we check the validity of the LDA and discusses the results on TREC data. Section 4 describes the feature dependence retrieval model for the routing task. Sections 5 and 6 discuss the application of this model and the experimental results.

Independence and Dependence Assumptions
In this section we briefly review the different independence assumptions that are used in probabilistic retrieval models and connect them to the linked dependence assumption (Cooper 1995). We then introduce some scenarios where the linked dependence assumption is violated.
The indexing features are denoted by $\varphi_i$. In our experiments, for example, features are Porter-reduced non-stopwords (Porter 1980). We abbreviate the event of pairs $(q, d_j)$ where $d_j$ contains the feature $\varphi_i$ simply by $\varphi_i$. The set of all query-document pairs $(q, d_j)$ with $d_j$ relevant to query $q$ is denoted by $R$, and analogously $\bar{R}$ denotes the set of pairs $(q, d_j)$ such that $d_j$ is not relevant to $q$. See the symbol table in Appendix A for an overview of the notation. The notation is chosen according to Schäuble 1997.
There are three independence assumptions about the occurrence of indexing features which have been used in probabilistic retrieval to get tractable formulas. The independence assumptions are:
The independence of features $\varphi_{i_1}, \ldots, \varphi_{i_n}$,
$$P(\varphi_{i_1}, \ldots, \varphi_{i_n}) = P(\varphi_{i_1}) \cdots P(\varphi_{i_n}). \qquad (1)$$
The conditional independence given relevance,
$$P(\varphi_{i_1}, \ldots, \varphi_{i_n} \mid R) = P(\varphi_{i_1} \mid R) \cdots P(\varphi_{i_n} \mid R). \qquad (2)$$
The conditional independence given non-relevance,
$$P(\varphi_{i_1}, \ldots, \varphi_{i_n} \mid \bar{R}) = P(\varphi_{i_1} \mid \bar{R}) \cdots P(\varphi_{i_n} \mid \bar{R}). \qquad (3)$$
These assumptions are problematic. If the features are distributed in the collection and among the relevant documents in such a manner that
$$P(\varphi_{i_1}, R) \cdots P(\varphi_{i_n}, R) > P(\varphi_{i_1}) \cdots P(\varphi_{i_n}) \, P(R)^{n-1},$$
then, no matter how the co-occurrence of features is distributed, Equations 1 and 2 cannot hold simultaneously. This can be proved, for example, by contradiction in a straightforward manner: under Equation 2 the left-hand side divided by $P(R)^{n-1}$ equals $P(\varphi_{i_1}, \ldots, \varphi_{i_n}, R)$, which would then exceed $P(\varphi_{i_1}, \ldots, \varphi_{i_n})$ as given by Equation 1. Data distributed in this manner exists, so the example is realistic. That data with these logical inconsistencies might exist was first mentioned in Robertson 1974. The simultaneous assumption of 1 and 3 leads to inconsistencies in an analogous way. However, the performance of newer probabilistic models does not suffer from these kinds of inconsistencies, since the assumption of 1 is superfluous.
Not all problems are solved by relying only on the assumptions of conditional independence 2 and 3. In several papers it was claimed, more or less intuitively, that the two conditional independence assumptions must be incorrect. The first verification of this intuition was given by F. Gey in his PhD thesis (Gey 1993). He tested the hypothesis of independence for all three assumptions 1, 2, and 3 on CACM data. The result is that all three hypotheses can be rejected for a majority of randomly chosen query feature pairs $(\varphi_{i_1}, \varphi_{i_2})$.
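The inconsistency between Equations 1 and 2 can be checked numerically. The following is a minimal sketch, not part of the original experiments: the function names and all probability values are hypothetical and chosen only for illustration. It computes the joint probability of all features implied by Equation 1 and the joint probability of all features together with relevance implied by Equation 2, and reports a conflict when the second exceeds the first, which is impossible for a sub-event.

```python
from math import prod


def implied_joint_probabilities(p_feature, p_feature_and_rel, p_rel):
    """Return the joint probabilities implied by Equations 1 and 2.

    p_feature[i]         -- P(phi_i)
    p_feature_and_rel[i] -- P(phi_i, R)
    p_rel                -- P(R)
    All inputs are hypothetical illustration values.
    """
    # Equation 1: P(phi_1, ..., phi_n) = prod_i P(phi_i)
    joint_eq1 = prod(p_feature)
    # Equation 2: P(phi_1, ..., phi_n | R) = prod_i P(phi_i | R),
    # multiplied by P(R) to obtain P(phi_1, ..., phi_n, R)
    joint_eq2 = p_rel * prod(p / p_rel for p in p_feature_and_rel)
    return joint_eq1, joint_eq2


def equations_1_and_2_conflict(p_feature, p_feature_and_rel, p_rel):
    """True if assuming both equations would give a sub-event the larger probability."""
    joint_eq1, joint_eq2 = implied_joint_probabilities(
        p_feature, p_feature_and_rel, p_rel
    )
    return joint_eq2 > joint_eq1


# Two rare features that occur in most relevant documents (made-up numbers).
conflict = equations_1_and_2_conflict([0.1, 0.1], [0.04, 0.04], 0.05)
# Two frequent features that occur in half of the relevant documents.
no_conflict = equations_1_and_2_conflict([0.5, 0.5], [0.025, 0.025], 0.05)
```

In the first setting the features are strongly concentrated in the relevant documents, which is exactly the situation in which both independence assumptions cannot be maintained together.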
Fortunately, the simultaneous assumption of 2 and 3 can be replaced by the weaker assumption of linked dependence:
$$P(\varphi_{i_1}, \ldots, \varphi_{i_n} \mid R) = C_r \, P(\varphi_{i_1} \mid R) \cdots P(\varphi_{i_n} \mid R),$$
$$P(\varphi_{i_1}, \ldots, \varphi_{i_n} \mid \bar{R}) = C_{nr} \, P(\varphi_{i_1} \mid \bar{R}) \cdots P(\varphi_{i_n} \mid \bar{R}), \quad \text{and} \quad C_r = C_{nr}. \qquad (4)$$
In words: if the linked dependence assumption holds for the features $\varphi_{i_1}, \ldots, \varphi_{i_n}$, then the degree of dependence of the features is the same in both the set of relevant documents and the set of non-relevant documents. It is obvious that the linked dependence assumption is weaker than the conditional independence assumptions, because 2 and 3 imply 4 with $C_r = C_{nr} = 1$. Thus, it is clear that the LDA fits any data collection at least as well as the assumption of 2 and 3.
Although the LDA is a weaker, and thus a better, assumption than the conditional independence assumptions, this still does not mean that it is a realistic one. To our knowledge the LDA has never been tested. What is the reason for this lack of research results? One reason might be that there are no standard test procedures for testing linked dependence as there are for testing independence. Another reason is that one needs this simplification to get tractable formulas. Without the LDA, retrieval formulas cannot be as simple as a sum over weights of query features, which is what retrieval functions nowadays almost always are. The LDA definitely helped to develop powerful retrieval models. However, this is no reason to stick with this assumption forever, especially if we know it leads to non-optimal rankings.
As we stated in the introduction, there are experiments which indicate that the LDA is not always a good approximation of reality. In addition, a closer look at Gey's experiments (Gey 1993) reveals that the rejection of 2 is not as significant as the rejection of 3. If this observation is not an artifact of the widely differing sizes of $R$ and $\bar{R}$, it casts further doubt on the validity of the LDA, because the LDA presumes an equivalent degree of dependence for both sets $R$ and $\bar{R}$.
Figures 1, 2, and 3 demonstrate different dependence scenarios. The enclosing rectangle symbolizes the set of all documents. The set of all relevant documents is situated to the left of the vertical line, the non-relevant documents to the right. For simplicity's sake, the query is limited to two features $\varphi_1$ and $\varphi_2$. The vertical grey rectangles represent the sets of documents containing $\varphi_1$, and the horizontal grey rectangles represent the sets of documents containing $\varphi_2$, within the subset of relevant or the subset of non-relevant documents.
Figure 1: A scenario where the linked dependence assumption holds.
Figure 2: The linked dependence assumption is violated for a feature $\varphi_2$ that discriminates well between relevant and non-relevant documents ("good" feature).
The intersection is represented in black. Figure 1 illustrates an example where the LDA holds (in this case even the two independence assumptions 2 and 3 hold). We want to illustrate how the LDA might be violated. Figure 2 and Figure 3 indicate two different examples where the LDA does not hold. These can be compared with the example in Figure 1. We choose examples with $C_r = 1$ since in our experiments $C_r$ is often estimated to be close to 1. We show that there is a connection between the ability of the features to discriminate between relevant and non-relevant documents and the violation of the LDA. Therefore Figures 2 and 3 show examples of different kinds of features.
If we compare Figure 2 with Figure 1, the probability $P(\varphi_1, \varphi_2 \mid \bar{R})$ remains constant and $P(\varphi_2 \mid \bar{R})$ decreases. This means that the feature $\varphi_2$ in the case of Figure 2 is a better discriminator than in the case of Figure 1. In Figure 2 the LDA is violated because $C_{nr} > 1$.
If we compare Figure 3 with Figure 1, the probability $P(\varphi_2 \mid \bar{R})$ remains constant but $P(\varphi_1, \varphi_2 \mid$ [...]
Figure 3: The linked dependence assumption is violated for a feature $\varphi_2$ that does not discriminate well between relevant and non-relevant documents ("bad" feature).
[...] for which the estimations of $C_r$ and $C_{nr}$ are performed. In Section 3.3 we present our idea for examining the LDA according to the different cases of features and documents. Experiments are then described in Section 3.4.

"Good" and "Standard" Features
In this section we describe how we select the query features for which the LDA is tested. We briefly describe a feature selection method (U-features) which has been statistically derived in Mateev 1996 and Ballerini et al. 1996.
The feature set of a query can be obtained in many different ways: e.g. from the original topic (Lewis et al. 1996), from Rocchio expansion (Buckley et al. 1995), using the $\chi^2$-test (Hull et al. 1996a), etc. The quality of the features is always relative to a given weighting scheme, but normally features which perform well for one weighting scheme are also suitable for others. We examine the LDA in two extreme cases. The first is where the feature set consists of features belonging to the original query topic (standard features). The second case is where the feature set consists of features selected through a very good feature selection algorithm (good features).
Intuitively, a good feature is one which discriminates well between relevant and non-relevant documents, i.e. the feature occurs more often in relevant documents than in non-relevant ones. We are interested in how this influences the coefficients $C_r$ and $C_{nr}$. We decided to work with U-features. The method consists of building a contingency table (Table [...]). [...] The $N$ features with the highest value are selected. These features form the feature set of the query.
We already compared the U-features with other feature selection methods in Mateev 1996 and Ballerini et al. 1996. The U-features outperformed all other strategies tested. Our experiments show that the LDA is much more significantly violated for good features than for the standard ones.

"Difficult" and "Random" Non-relevant Documents
For a given query, the methods for routing retrieval usually do not use the whole set of non-relevant documents but only a small part of it, because the set is too large. This raises the question of the impact of the chosen subset on the LDA. We distinguish between subsets containing random non-relevant documents and subsets containing difficult non-relevant documents. A random non-relevant document is a document which is randomly selected from the set of all documents that are non-relevant with respect to query $q$. A difficult non-relevant document is a non-relevant document which is hard to distinguish from relevant ones. Typically, a difficult non-relevant document contains many of the features used in the query. Since the features from our feature set occur more often in such difficult non-relevant documents than in random ones, the coefficient $C_{nr}$ might be affected. The set of relevant documents remains constant in the experiments.
Our experiments use documents with relevance assessments from the TREC collection. Harman 1995 describes how such documents with relevance assessments are obtained. Our experiments show that the LDA is much more significantly violated in collections where the non-relevant documents are difficult ones than in collections with random non-relevant documents.

How to Examine the Linked Dependence Assumption
We are not aware of any methods to statistically test the LDA. We have therefore developed statistics to indicate how well the LDA is met. The first step is to estimate the coefficients $C_r$ and $C_{nr}$:
$$C_r = \frac{|\{d_j \in R \mid \varphi_{i_1}, \varphi_{i_2}, \ldots, \varphi_{i_n} \in d_j\}| \,/\, |R|}{\prod_{k=1}^{n} |\{d_j \in R \mid \varphi_{i_k} \in d_j\}| \,/\, |R|}$$
$$C_{nr} = \frac{|\{d_j \in \bar{R} \mid \varphi_{i_1}, \varphi_{i_2}, \ldots, \varphi_{i_n} \in d_j\}| \,/\, |\bar{R}|}{\prod_{k=1}^{n} |\{d_j \in \bar{R} \mid \varphi_{i_k} \in d_j\}| \,/\, |\bar{R}|}$$
where $|R|$ is the number of relevant documents, $|\{d_j \in R \mid \varphi_{i_1} \in d_j\}|$ is the number of relevant documents containing $\varphi_{i_1}$, and $|\{d_j \in R \mid \varphi_{i_1}, \varphi_{i_2}, \ldots, \varphi_{i_n} \in d_j\}|$ is the number of relevant documents where the features occur together. The notations for $\bar{R}$ are analogous.
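The estimation of the coefficients can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in our experiments: documents are represented as sets of features, the function name and the toy documents are hypothetical, and returning `None` whenever a count is zero is our assumption about how non-estimable cases are handled.

```python
def dependence_coefficient(documents, features):
    """Estimate the dependence coefficient for `features` on one document set.

    documents -- list of sets of features (all relevant or all non-relevant)
    features  -- the n features whose co-occurrence is examined
    Returns the relative frequency of the joint occurrence divided by the
    product of the individual relative frequencies, or None if any count
    is zero (assumption: such cases are treated as not estimable).
    """
    n_docs = len(documents)
    if n_docs == 0:
        return None
    joint_count = sum(1 for d in documents if all(f in d for f in features))
    if joint_count == 0:
        return None
    marginal_product = 1.0
    for f in features:
        count = sum(1 for d in documents if f in d)
        if count == 0:
            return None
        marginal_product *= count / n_docs
    return (joint_count / n_docs) / marginal_product


# Hypothetical toy data: four relevant and four non-relevant documents.
relevant = [{"joint", "ventur", "japan"}, {"joint", "ventur"}, {"japan"}, {"joint"}]
non_relevant = [{"joint"}, {"ventur"}, {"joint", "ventur"}, set()]

c_r = dependence_coefficient(relevant, ("joint", "ventur"))
c_nr = dependence_coefficient(non_relevant, ("joint", "ventur"))
```

Applying the same function to the relevant set gives an estimate of $C_r$ and applying it to the non-relevant set gives an estimate of $C_{nr}$, so the two coefficients are always computed in exactly the same way.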
Our idea is to examine the LDA for different numbers of features $n$, and for features obtained from different feature selection algorithms. We experiment with a set of $N$ features for a query $q$ and a fixed value for $n$, where $n \le N$. It makes sense to experiment only with small values for $n$, otherwise we cannot estimate $P(\varphi_{i_1}, \ldots, \varphi_{i_n} \mid R)$ and $P(\varphi_{i_1}, \ldots, \varphi_{i_n} \mid \bar{R})$.
Suppose that the case where $C_r > C_{nr}$ occurs on average as often as the case where $C_r < C_{nr}$. This would mean that the LDA is a priori not a bad assumption, and that there is no reason to reject it, since we can hope that the generated errors compensate for each other. If one of the two inequalities $C_r > C_{nr}$ or $C_r < C_{nr}$ receives a significantly higher count of events, then we have good reason to assume that the LDA is not a reliable assumption for retrieval. Note that this is not a powerful statistic like, for example, a statistical test; rather, it is a conservative decision rule that is very careful in rejecting the LDA.
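The counting rule described above can be sketched in a few lines. The function name and the example values are hypothetical; coefficient pairs that could not be estimated are represented by `None` and skipped, which is an assumption of this sketch.

```python
def count_inequalities(coefficient_pairs):
    """Count how often C_r > C_nr, C_r < C_nr and C_r == C_nr occur.

    coefficient_pairs -- iterable of (c_r, c_nr) estimates, one per feature
                         pair or triplet; pairs containing None are skipped.
    """
    greater = less = equal = 0
    for c_r, c_nr in coefficient_pairs:
        if c_r is None or c_nr is None:
            continue  # coefficient could not be estimated
        if c_r > c_nr:
            greater += 1
        elif c_r < c_nr:
            less += 1
        else:
            equal += 1
    return greater, less, equal


# Made-up estimates for five feature pairs.
counts = count_inequalities(
    [(1.2, 1.0), (0.8, 1.5), (0.9, 2.0), (None, 1.0), (1.0, 1.0)]
)
```

A strong imbalance between the first two counts is what the decision rule treats as evidence against the LDA; roughly equal counts give no reason to reject it.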
According to our considerations in Section 3.1 and Section 3.2, we distinguish between a feature set containing good features with respect to $q$ and a set containing standard features. Moreover, the set of non-relevant documents needed to estimate $C_{nr}$ can consist of randomly selected documents or of difficult ones. Four cases are checked:
1. good features and difficult non-relevant documents
2. standard features and difficult non-relevant documents
3. good features and random non-relevant documents
4. standard features and random non-relevant documents.
Our results show that the LDA behaves differently in each of these four cases.

Results
Experiments are performed with a subset of the TREC-4 routing queries as a first setup. This subset consists of all TREC-4 routing queries with TREC topic numbers smaller than 50. There are 23 such queries. These experiments use all available relevance information on disk1, disk2 and disk3 (Harman 1996). Another experiment setup involved TREC queries 202-250 and the relevance information from disk2 and disk3. Since the results from the two setups are similar, the results from the second experiment setup are not listed here. The LDA is examined for two and three features. The coefficients $C_r$ and $C_{nr}$ are computed for all pairs (resp. triplets) from the feature set, where the feature set consists of either all features from the original queries, or the set of the best 20 U-features.
The set of documents on which $C_r$ and $C_{nr}$ are estimated is either the TREC training set (this set contains many difficult non-relevant documents), or the randomized TREC training set: all non-relevant documents from the original set which are not relevant for any other query are assigned at random to the different queries (we assume that for each query the non-relevant documents obtained in this way are random). Combining the different possibilities for features and sets of documents gives us four cases to investigate: good features with difficult documents; standard features with difficult documents; good features with random documents; and standard features with random documents.
In order to examine the LDA we observe:
1. the coefficients $C_r$ and $C_{nr}$
2. the median of $C_r / C_{nr}$
3. the average values of the estimated probabilities $P(\varphi_1, \varphi_2 \mid R)$, $P(\varphi_1 \mid R)$, $P(\varphi_2 \mid R)$, $P(\varphi_1, \varphi_2 \mid \bar{R})$, $P(\varphi_1 \mid \bar{R})$, $P(\varphi_2 \mid \bar{R})$ over all pairs of features.
Table 2 shows some statistics about the coefficients $C_r$ and $C_{nr}$ over all pairs (resp. triplets) which were built. The size for good features and random non-relevant documents in Table 2 is smaller than that for good features and difficult non-relevant documents. Fewer pairs occur together in these documents, i.e. there are more cases where the coefficient $C_{nr}$ cannot be estimated.
The results for the good features show that there are many more cases where $C_r < C_{nr}$ than where $C_r > C_{nr}$. For standard features there is less asymmetry between both cases. This means that the LDA is more significantly violated for the good features than for the standard ones. For the same reason, the LDA is more significantly violated for document sets containing difficult non-relevant documents than for document sets containing random non-relevant documents. The LDA is also more significantly violated for triplets than for pairs.
Table 3 shows information about the median, the 0.25-quantile and the 0.75-quantile of $C_r / C_{nr}$ over all pairs (resp. triplets) which were built. Results are shown only for difficult non-relevant documents. One can see that for the standard features the median is close to one, which means that the LDA holds on average. Moreover, the 0.25- and the 0.75-quantiles are placed more or less symmetrically with respect to the median. For the good features, the median is much lower than one, and again both quantiles are placed more or less symmetrically with respect to the median. These results confirm our observations in Table 2.
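The quantile statistics on the ratio $C_r / C_{nr}$ can be sketched with the Python standard library. The function name and the example values are hypothetical, and the "inclusive" quantile method is an assumption, since the interpolation rule used for Table 3 is not specified here.

```python
from statistics import quantiles


def ratio_quartiles(coefficient_pairs):
    """Return (0.25-quantile, median, 0.75-quantile) of C_r / C_nr.

    coefficient_pairs -- iterable of (c_r, c_nr); pairs with a missing
                         value or a zero C_nr are skipped (assumption).
    """
    ratios = sorted(
        c_r / c_nr
        for c_r, c_nr in coefficient_pairs
        if c_r is not None and c_nr  # skips None and 0.0 for c_nr
    )
    if len(ratios) < 2:
        raise ValueError("at least two estimable ratios are required")
    q25, median, q75 = quantiles(ratios, n=4, method="inclusive")
    return q25, median, q75


# Made-up coefficient estimates for five feature pairs.
quartiles = ratio_quartiles(
    [(1.0, 2.0), (0.8, 1.0), (1.0, 1.0), (1.25, 1.0), (2.0, 1.0)]
)
```

A median close to one indicates that the two coefficients agree on average, while a median well below one corresponds to the asymmetry discussed above.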
We also estimate the probabilities $P(\varphi_1, \varphi_2 \mid R)$, $P(\varphi_1 \mid R)$, $P(\varphi_2 \mid R)$ and the analogous probabilities for $\bar{R}$ over all pairs of features for the four cases described above. We then compute the averages of these estimations. Figure 4 shows the results in percentages.
One can see that $C_r$ remains more or less constant. The coefficient $C_{nr}$ is larger both for good features than for standard ones, and for random non-relevant documents than for difficult ones. Since $C_r$ is constant, larger $C_{nr}$ values lead to considerable infringements of the LDA. This once again confirms the results from the previous two experiments.
All three experiments show that the LDA is more significantly violated for good features than for standard ones, and for random non-relevant documents than for difficult ones.

A Feature Dependence Retrieval Model
Inspired by the investigations described in the previous section, we tried to find weighting formulas that move beyond the LDA. Considering all possible dependencies, however, yields intractable formulas. In addition, we would like to deal with only a few free parameters in order to avoid overfitting, which is a severe problem for routing (Hull et al. 1996b). Thus we try to move beyond the LDA only in cases where we can expect the LDA to be violated heavily. Encouraged by the work of Hearst 1996, we decided to group features together into so-called concepts. The dependence between the concepts, which is captured in Hearst's approach by the Boolean AND operator, must be modelled in the probabilistic retrieval model.
A query is represented by a set of features, e.g. the U-features described in Section 3.1. The partitioning of the features is called the concept generation. It can be performed manually (manual concept generation) or automatically. We focus here on the manual concept generation because we do not yet know a good automatic partitioning algorithm which serves our goals.
The aim of probabilistic retrieval is to estimate the probability of relevance as exactly as possible, or equivalently its logit
$$\mathrm{logit}\, P(R \mid q, d_j) = \log \frac{P(R, q, d_j)}{P(\bar{R}, q, d_j)} = \log \frac{P(d_j \mid R, q)}{P(d_j \mid \bar{R}, q)} + K(q, R), \qquad (6)$$
where $K(q, R)$ is a constant not depending on $d_j$. At this point the LDA is usually applied and the probability is split into a product over features (Robertson 1977, Robertson & Walker 1994). We intervene here, however, to consider the query $q = C_1 \cap C_2$ as the intersection of two dependent, but not linked dependent, concepts:
$$\mathrm{logit}\, P(R \mid q, d_j) = \log \frac{P(R, C_1 \cap C_2, d_j)}{P(\bar{R}, C_1 \cap C_2, d_j)} = \log \frac{C_r(d_j) \, P(R, C_1, d_j) \, P(R, C_2, d_j)}{C_{nr}(d_j) \, P(\bar{R}, C_1, d_j) \, P(\bar{R}, C_2, d_j)}$$
$$= \log \frac{P(d_j \mid R, C_1)}{P(d_j \mid \bar{R}, C_1)} + \log \frac{P(d_j \mid R, C_2)}{P(d_j \mid \bar{R}, C_2)} + \log \frac{C_r(d_j)}{C_{nr}(d_j)} + K(C_1, R) + K(C_2, R). \qquad (7)$$
The last step in this derivation uses Equation 6. Note that constants $C_r(d_j)$ and $C_{nr}(d_j)$ which fulfil the preceding equation always exist as long as no zero probabilities occur. These constants capture the dependencies of the concepts $C_1$ and $C_2$. Note also that these constants are not the same as $C_r$ and $C_{nr}$. In contrast to $C_r$ and $C_{nr}$, which have been estimated in Section 3.3, the coefficients $C_r(d_j)$ and $C_{nr}(d_j)$ are NOT independent of the document $d_j$, and thus the document $d_j$ must influence the estimation of $\log \frac{C_r(d_j)}{C_{nr}(d_j)}$.

Towards the Application of the Feature Dependence Retrieval Model
A reliable estimation of $\log \frac{C_r(d_j)}{C_{nr}(d_j)}$ relative to a document is a difficult and crucial step in going beyond the linked dependence assumption. Assuming the LDA for the features within $C_1$ and within $C_2$, the first and second terms in Equation 7 can be treated as usual in probabilistic retrieval, and can be approximated by a well-known retrieval function. A very successful RSV function is the so-called Lnu.ltn weighting scheme (Singhal et al. 1996). Using this weighting scheme we approximate Equation 7 by
$$\mathrm{logit}\, P(R \mid q, d_j) = \alpha \, \mathrm{RSV}(C_1, d_j) + \beta \, \mathrm{RSV}(C_2, d_j) + \gamma \, \mathrm{RSV}(C_1, d_j) \, \mathrm{RSV}(C_2, d_j) + \delta,$$
where $\alpha, \beta, \gamma, \delta \in \mathbb{R}$. In this equation, $\log (C_r(d_j) / C_{nr}(d_j))$ is estimated as $\gamma \, \mathrm{RSV}(C_1, d_j) \, \mathrm{RSV}(C_2, d_j)$. We need $\alpha$ and $\beta$ in the formula since we do not assume that RSV is normalized over the queries or concepts. The unknown parameters $\alpha$, $\beta$, $\gamma$, and $\delta$ are estimated with logistic regression on the training data for the routing task.
Unfortunately, experiments with this formula have not been very promising. We observe that some features from $C_1$ and $C_2$ seem to belong to both concepts, and that other features can be assigned neither to $C_1$ nor to $C_2$. A "tuning" of the manual concept generation method is necessary. We add two "pseudo-concepts" $C_{12}$ and $C_{garb}$ to our algorithm. The pseudo-concept $C_{12}$ contains all features which belong to both $C_1$ and $C_2$ in the opinion of the person who generated the concepts. The features from $C_{12}$ are only used to improve the estimation of the dependence term $\log (C_r(d_j) / C_{nr}(d_j))$. The pseudo-concept $C_{garb}$ ("garbage" concept) contains features which can be assigned neither to $C_1$ nor to $C_2$. The concept $C_{garb}$ can also contain features which the person who generates the concepts does not understand. The features from $C_{garb}$ are not used in the retrieval function. Altogether we have
$$\mathrm{logit}\, P(R \mid q, d_j) = \alpha \, \mathrm{RSV}(C_1, d_j) + \beta \, \mathrm{RSV}(C_2, d_j) + \gamma \, \mathrm{RSV}(C_1, d_j) \, \mathrm{RSV}(C_2, d_j) \, \mathrm{RSV}(C_{12}, d_j) + \delta. \qquad (8)$$
The design strategy for this formula is to keep the model as simple as possible (fewer parameters to estimate), and at the same time to model the dependency between $C_1$ and $C_2$ in the way described above. We do not include a linear term for $\mathrm{RSV}(C_{12}, d_j)$ in the function above because $C_{12}$ has meaning only with respect to $C_1$ and $C_2$. One could include the terms $\mathrm{RSV}(C_1, d_j) \, \mathrm{RSV}(C_{12}, d_j)$ and $\mathrm{RSV}(C_2, d_j) \, \mathrm{RSV}(C_{12}, d_j)$. This would yield more degrees of freedom to model, for example, a stronger dependence between $C_{12}$ and $C_1$ than between $C_{12}$ and $C_2$. We did not experiment with such terms because of lack of time, and also because more parameters in the logit function require more training documents.
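Equation 8 and the estimation of its coefficients can be sketched as follows. This is a minimal illustration and not the implementation used in our experiments: the coefficient names alpha, beta, gamma and delta, the function names, the plain gradient-ascent fit (standing in for a proper logistic regression package) and all RSV values are assumptions made for this sketch.

```python
import math


def fdrm_logit(coeffs, rsv_c1, rsv_c2, rsv_c12):
    """Equation 8 with coeffs = (alpha, beta, gamma, delta)."""
    alpha, beta, gamma, delta = coeffs
    return (alpha * rsv_c1 + beta * rsv_c2
            + gamma * rsv_c1 * rsv_c2 * rsv_c12 + delta)


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_fdrm(examples, labels, steps=2000, rate=0.5):
    """Fit the four coefficients by batch gradient ascent on the log-likelihood.

    examples -- list of (rsv_c1, rsv_c2, rsv_c12) per training document
    labels   -- 1 for relevant, 0 for non-relevant
    """
    coeffs = [0.0, 0.0, 0.0, 0.0]
    n = len(examples)
    for _ in range(steps):
        grad = [0.0, 0.0, 0.0, 0.0]
        for (r1, r2, r12), y in zip(examples, labels):
            regressors = (r1, r2, r1 * r2 * r12, 1.0)
            error = y - sigmoid(fdrm_logit(coeffs, r1, r2, r12))
            for k, x in enumerate(regressors):
                grad[k] += error * x
        coeffs = [c + rate * g / n for c, g in zip(coeffs, grad)]
    return coeffs


# Hypothetical RSV values for six training documents.
toy_examples = [(0.9, 0.8, 0.5), (0.8, 0.9, 0.4), (0.7, 0.7, 0.6),
                (0.9, 0.1, 0.1), (0.1, 0.9, 0.2), (0.1, 0.1, 0.0)]
toy_labels = [1, 1, 1, 0, 0, 0]
fitted = fit_fdrm(toy_examples, toy_labels)
```

The product term is the only place where the model departs from a plain sum of concept scores, so setting gamma to zero recovers an ordinary linear combination of the two RSV values.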
Here is a small example which illustrates the ideas described above. Consider the TREC-4 query Topic 3, which concerns the announcement of new joint ventures involving Japanese companies and also the activities of these joint ventures. One of the authors plays the role of the questioner. The subjective opinion of the questioner is that one can define two more or less independent concepts, "joint ventures" and "Japan", by putting automatically selected features into the sets $C_1$ and $C_2$ respectively. An arbitrary feature selection algorithm can be used to obtain features. The U-features described in Equation 5 are used for this experiment. For Topic 3 the top 20 features are: {ventur, japan, nippon, prm, joint, tokyo, corp, rim, japanes, sumitomo, cyc, kobe, laser, osaka, kawasaki, yen, mitsubishi, pacif, nihon, steel}. The features "joint" and "venture" can, in the questioner's subjective opinion, be put in $C_1$, and features like "japan", "tokyo" or "japanes" in $C_2$. Features like "nippon" (another name for Japan, but also the first word in some Japanese company and joint-venture names) or "kawasaki" (a Japanese company, but also a partner in joint-venture companies) are common to both concepts mentioned above, probably depending on the linguistic context. For this reason, one can put such features in a pseudo-concept $C_{12}$, which is justified only given $C_1$ and $C_2$. There are features, like "prm" or "rim", which could not be identified by the questioner and which were put in another concept $C_{garb}$. Again we emphasize that this concept generation is subjective and other questioners might perform it in a quite different manner.
Now we formalize the way manual concepts are generated and used. It is assumed that the query topic can be described by two more or less independent concepts:
1. Let $S$ be the set of the top 20 features selected for a query topic using a feature selection algorithm (for example the U-features described in Section 3.1).
2. Find two concepts $C_1$ and $C_2$ in the query topic which are as independent as possible in a colloquial sense. Describe these concepts through features from $S$ and mark these features as used.
3. Group all unused features from $S$ which have something in common with both $C_1$ and $C_2$ in a pseudo-concept $C_{12}$.
4. Put the remaining features from $S$ into a pseudo-concept $C_{garb}$. These features are either not familiar to us or they have nothing to do with $C_1$ or $C_2$ in the questioner's subjective estimation.
5. Compute the coefficients for the model 8 using logistic regression over the training data.
6. Perform ranking on the test data using the computed model.
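Steps 1 to 4 of the procedure above can be sketched as a small bookkeeping routine. The function name is hypothetical, and the example uses only the subset of Topic 3 features whose grouping is stated explicitly in the text; it is not the questioner's complete assignment of all 20 features.

```python
def generate_concepts(selected, c1, c2, c12):
    """Partition the selected features into C1, C2, C12 and Cgarb.

    selected -- the feature set S (step 1)
    c1, c2   -- features chosen to describe the two concepts (step 2)
    c12      -- features related to both concepts (step 3)
    Everything that is left over goes into the garbage concept (step 4).
    """
    selected = set(selected)
    c1, c2, c12 = set(c1), set(c2), set(c12)
    for name, group in (("C1", c1), ("C2", c2), ("C12", c12)):
        unknown = group - selected
        if unknown:
            raise ValueError(f"{name} contains features outside S: {sorted(unknown)}")
    if c1 & c2 or c1 & c12 or c2 & c12:
        raise ValueError("a feature may be placed in only one group")
    garbage = selected - c1 - c2 - c12
    return {"C1": c1, "C2": c2, "C12": c12, "Cgarb": garbage}


# Subset of the Topic 3 features for which the text states a grouping.
subset = ["joint", "ventur", "japan", "tokyo", "japanes",
          "nippon", "kawasaki", "prm", "rim"]
concepts = generate_concepts(
    subset,
    c1=["joint", "ventur"],
    c2=["japan", "tokyo", "japanes"],
    c12=["nippon", "kawasaki"],
)
```

The resulting groups determine which features enter the RSV computation for each concept; the garbage group is simply left out of the retrieval function.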
The advantage of manual concept generation is that human knowledge about the topic is included in the query. Disadvantages are that the manual generation of concepts is time-consuming and that concept identification is very subjective. There are problems in determining the two most important and independent concepts for a query topic and in defining these concepts through the given features. These problems are illustrated in the following experiment: different people were asked to build concepts for query Topic 3 described above. They had to describe the concepts $C_1$ and $C_2$ in a few words, and to put the given 20 features into the four different groups. Most people described $C_1$ and $C_2$ as "joint venture" and "Japan". Some interpreted these concepts in a very broad sense and others very strictly. Although the described concepts appear similar, there are great differences in the way the features are put into these concepts. Such experiments also show us that very often we need the knowledge of an expert on the given topic. Another disadvantage is that the process of stemming words to features is not bijective, and sometimes it is difficult to determine the meaning of the original word from the selected feature.

Preliminary Experiments
Our experiments are performed using the SPIDER information retrieval system described in Knaus et al. 1996. We experiment with the query topics 3, 5, 14 and 20 from the TREC-4 routing task. The query set is very limited for two reasons. First, the method cannot be applied to every query, because some queries might consist of only one concept or of more than two different concepts. Second, the manual concept generation is a very time-consuming process and we did not have more time to generate queries. Note, however, that the four queries were selected arbitrarily, not because they yield the best results.
The training data used to estimate the regression parameters and to select optimal features consists of all available TREC-4 routing training documents for these queries on disk1, disk2 and disk3 (about 1800 relevant and non-relevant documents per query). Feature selection chose the 20 best U-features (Equation 5). For each topic we generate concepts from the selected features. For each query the coefficients $\alpha$, $\beta$, $\gamma$ and $\delta$ in Equation 8 are computed using logistic regression over the training data, with $\mathrm{RSV}_{\mathrm{Lnu.ltn}}$ as the basic weighting function. The test data is the TREC-4 routing test data.
For efficiency reasons, we perform re-ranking on an existing list of ranked documents from the TREC-4 routing task instead of retrieving documents from the whole test set. The retrieval method used to build this list is described in Buckley et al. 1996; the list is obtained by dynamic feedback optimization (DFO). DFO was one of the best TREC-4 routing methods. We refer to this list as the DFO-list. The average precision per query is shown in Figure 5. It is compared with the average precision per query of the original DFO-list, and with the best average precision per query obtained by re-ranking the DFO-list using Lnu.ltn on the top 3, 20, or 50 features from the U-features list.

Figure 5: Manual concept generation for TREC-4 query topics.

Manual concept generation, together with Lnu.ltn as the weighting function for the different RSVs, shows very good results. It always performs better than the poorly performing Lnu.ltn, and in three of four cases better than the DFO method, which is much more sophisticated.
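The comparison above rests on average precision computed for a re-ranked list. The following is a minimal sketch of that measurement, assuming the usual non-interpolated definition of average precision; the function names, document identifiers, and scores are hypothetical and are not taken from the SPIDER system or from the TREC evaluation software.

```python
def rerank(doc_ids, scores):
    """Re-order an existing ranked list by new scores, highest score first.

    Ties keep their original relative order because sorted() is stable.
    """
    pairs = sorted(zip(doc_ids, scores), key=lambda pair: -pair[1])
    return [doc_id for doc_id, _ in pairs]


def average_precision(ranked_doc_ids, relevant_ids):
    """Non-interpolated average precision of one ranked list.

    The sum of precision values at the ranks of relevant documents is
    divided by the number of all known relevant documents, so relevant
    documents missing from the list contribute zero.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


# Invented toy data: an initial list and hypothetical new scores.
initial_list = ["d1", "d2", "d3", "d4"]
relevant_docs = {"d1", "d3"}
new_scores = [0.1, 0.9, 0.8, 0.2]

ap_before = average_precision(initial_list, relevant_docs)
ap_after = average_precision(rerank(initial_list, new_scores), relevant_docs)
```

Because only the order of a fixed list changes, any difference between `ap_before` and `ap_after` is attributable to the re-ranking scores alone, which is the property the re-ranking setup relies on.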
It is still an open question how this algorithm can be extended to N such concepts. When there are more than two concepts, the two most important can be used and the features of the others can be put into C_12. If there is only one concept, the algorithm reduces to the normal weighting function applied over all features.
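The fallback just described can be sketched as a small grouping step. This is only an illustration: it assumes the concepts have already been ordered by importance by a human judge, and the function name, argument names, and example features are invented.

```python
def collapse_to_two_concepts(ranked_concepts, shared_features=(), garbage_features=()):
    """Reduce N manually built concepts to the four groups used by the method.

    ranked_concepts: list of feature sets, most important concept first
    (the ordering is assumed to be a manual judgement).
    Returns None when fewer than two concepts exist, signalling that the
    normal weighting function over all features should be used instead.
    """
    if len(ranked_concepts) < 2:
        return None
    # Features of every concept beyond the first two are moved into C_12.
    c_12 = set(shared_features)
    for extra_concept in ranked_concepts[2:]:
        c_12.update(extra_concept)
    return {
        "C_1": set(ranked_concepts[0]),
        "C_2": set(ranked_concepts[1]),
        "C_12": c_12,
        "C_garb": set(garbage_features),
    }


# Invented example features, loosely in the spirit of Topic 3.
groups = collapse_to_two_concepts(
    ranked_concepts=[{"joint", "ventur"}, {"japan", "tokyo"}, {"invest"}],
    shared_features={"compani"},
    garbage_features={"said"},
)
```

The sketch leaves open how importance is decided, which is exactly the part the text identifies as unresolved.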

Summary and Future Work
We question the linked dependence assumption and present evidence indicating that the dependence of features in the set of relevant documents and the dependence in the set of non-relevant documents are not as tightly linked as the linked dependence assumption supposes. The dependencies have been analysed on feature pairs and on feature triplets. The violation of the LDA is usually produced by a stronger dependence in the set of non-relevant documents. So although one may hope that different violations of the linked dependence assumption compensate for each other, in fact they do not. It is alarming that the violation is more significant for "good" query features, i.e. features that have a strong capability to discriminate between relevant and non-relevant documents.
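To make the kind of check summarized here concrete, the following sketch estimates a dependence coefficient for one feature pair separately in the relevant and the non-relevant documents. It assumes, as an illustration only, that each coefficient is the ratio of the joint occurrence probability to the product of the marginal probabilities within the respective document set, so that the LDA for a pair corresponds to C_r = C_nr; the exact form of Equation 4 is not reproduced here, and the function name and toy documents are invented.

```python
def dependence_coefficient(docs, feat_a, feat_b):
    """Estimate P(a, b) / (P(a) * P(b)) within one set of documents.

    docs: list of feature sets, one set per document.
    Returns None when the ratio is undefined (empty set of documents or
    a feature that never occurs in it).
    """
    n = len(docs)
    n_a = sum(1 for d in docs if feat_a in d)
    n_b = sum(1 for d in docs if feat_b in d)
    n_ab = sum(1 for d in docs if feat_a in d and feat_b in d)
    if n == 0 or n_a == 0 or n_b == 0:
        return None
    return (n_ab / n) / ((n_a / n) * (n_b / n))


# Invented toy data: each document is the set of features it contains.
relevant_docs = [{"a", "b"}, {"a", "b"}, {"a"}, {"b"}]
non_relevant_docs = [{"a", "b"}, set(), set(), set()]

c_r = dependence_coefficient(relevant_docs, "a", "b")
c_nr = dependence_coefficient(non_relevant_docs, "a", "b")

# Under the assumed definition, the LDA would require c_r == c_nr.
lda_ratio = c_r / c_nr
```

In this toy example the two features are common in the relevant documents, so their co-occurrence there is unremarkable, while in the non-relevant documents they are rare but occur together, which gives a much larger coefficient. Real estimates would need far more documents and some smoothing of small counts.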
We have also introduced a first approach to move beyond the linked dependence assumption for the routing task: the feature dependence model with manual concept generation. Manual concept generation is a very careful manual grouping of query features into two concepts, a set of features that interact with both concepts, and a set of garbage features. Combined with the feature dependence model, a retrieval method that pays attention to the dependency between the two concepts, manual concept generation yields, on a small query set, a substantial increase in average precision over other very good weighting functions which do not pay attention to these dependencies.
In our future work we must get beyond the limited usability of the feature dependence method based on manual concept generation. Methods for more than two concepts must be developed. Automatic methods to split features into concepts are needed so that users without very deep domain knowledge can use the concept-based method. Finally, one should get beyond the linked dependence assumption for ad-hoc queries, especially for short queries. For the ad-hoc task we would like to exploit the result that C_r is usually close to one and that C_nr is considerably greater than one for "good" features and for "random" documents.

A Symbols used in the paper
The symbols used in this report are shown in Table 4.

Symbol        Definition
DT            set of features in the training collections
q             set of features for query q on which the LDA is checked
C_r, C_nr     coefficients in the LDA (Equation 4)
C_1, C_2      main concepts in the manual query generation
C_12          additional concept in the manual query generation
C_garb        "garbage" concept in the manual query generation
ℝ             the set of the real numbers

1 for each feature φ_i ∈ DT, where DT is the set of all features from the training documents D_T.

Table 2: Examining the LDA through statistics on C_r and C_nr.

Table 3: Examining the LDA by comparing the median of C_r / C_nr for difficult documents.
Figure 4: Examining the LDA through estimations of the average probabilities. Results shown in percentages.

rather to estimate a monotonous function of the probability of relevance. A commonly used function is:

Table 4: Symbols used.