Query Performance Prediction Based on Ranking List Dispersion

In this paper we introduce a novel approach to query performance prediction based on the dispersion of ranking list scores. Starting from the premise that good and poor performing queries produce different score distributions, we introduce a set of measures that capture the differences between both types of distributions. The proposed measures employ the ranking list output by a search system as an information source to predict query performance in terms of MAP. The obtained results reveal a significant degree of correlation with MAP and are very similar to those achieved with more complex methods. Finally, some generic open questions that could guide further research on query performance prediction methods are introduced.


INTRODUCTION
In recent years, growing attention has been devoted to the problem of query performance prediction, which has turned into an important challenge for the IR community. Query performance prediction deals with the problem of detecting those queries for which a search system will be able to return a document set useful for a user. In other words, the objective of query prediction is the development of a search system able to estimate the quality of its answer before the relevant document set is supplied to the user.
A wide range of possible applications appears for a system like the one described above. For example, a system could ask the user for extra information in order to improve result quality before an answer is supplied; a federated search system could select the best answer from any of its sources; a specialised search system on one specific subject could decide by itself whether it needs to make use of a broader topic index in order to supply a better answer.
In this paper a novel approach to query performance prediction is introduced. The proposed method falls into the family of post-retrieval prediction methods. This type of predictor makes use of the information supplied by the search system once the search has been carried out, as opposed to pre-retrieval prediction, where the estimation is computed before the search has been completed. The proposed method is based on the study of the distribution of document scores along the ranking list. This proposal rests on the hypothesis that differences between the score distributions of good and poor performing queries can be observed. Measures that try to capture these differences among the document scores of a ranking list will be proposed in order to predict query performance.

Related work
In recent years several works dealing with query performance prediction have been proposed. In general, the different prediction methods can be classified into two main groups: those approaches that use information from the results obtained after a query is executed (post-retrieval predictors), and those that try to estimate query difficulty before the ranking list is obtained from the search engine (pre-retrieval predictors). It is generally accepted that the latter have as their main advantage a low computational cost, even at the expense of providing less accurate estimations.
Pre-retrieval predictors use statistics such as collection frequency (CF), inverse document frequency (IDF) or query length. These methods try to detect the ambiguity of the query based on these statistics. He and Ounis [9] propose different measures based on IDF and CF, such as the mean inverse collection term frequency or the IDF standard deviation. Recently, Zhao et al. [17] have proposed measures based on the standard deviation of the query terms, which are weighted using a TF-IDF schema.
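As an illustration of this kind of pre-retrieval statistic, the sketch below computes the standard deviation of the query terms' IDF values over toy collection statistics. The collection figures and the plain (unweighted) formulation are illustrative assumptions; they do not reproduce the exact measures of [9] or [17].

```python
import math

def idf(term, doc_freqs, num_docs):
    """Standard inverse document frequency; doc_freqs maps each term to
    the number of documents in the collection containing it."""
    return math.log(num_docs / doc_freqs[term])

def idf_std_dev(query_terms, doc_freqs, num_docs):
    """Pre-retrieval predictor: standard deviation of the query terms' IDF.
    A high spread suggests a mix of discriminative and common terms."""
    idfs = [idf(t, doc_freqs, num_docs) for t in query_terms]
    mean = sum(idfs) / len(idfs)
    return math.sqrt(sum((x - mean) ** 2 for x in idfs) / len(idfs))

# Hypothetical document frequencies over a 100,000-document collection.
doc_freqs = {"jaguar": 50, "car": 20000, "price": 15000}
print(idf_std_dev(["jaguar", "car", "price"], doc_freqs, 100000))  # ≈ 2.76
```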
Post-retrieval methods are more closely related to the approach introduced in this paper. First attempts were made in [5], where Clarity Score was proposed. This estimator tries to measure the ambiguity of a query with respect to the document collection. The ambiguity of a topic is calculated with the Kullback-Leibler divergence (KLD) between the language models of the collection and of the top ranked documents. A good performance query will show a high divergence value, which can be explained by the fact that the top ranked documents are about a single topic, and this implies low ambiguity. Further works based on Clarity Score appear in [6]: Ranked List Clarity Score, which replaces ranking scores by ranking positions, and Weighted Clarity Score, which assigns different weights to query terms in order to calculate the KLD.
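The core of Clarity Score is a KL divergence between a language model built from the top-ranked documents and the collection language model. The sketch below shows only that final computation, over hypothetical unigram models; the real Clarity Score estimates the query model via relevance-model feedback and uses log base 2, simplified here to the natural log.

```python
import math

def kl_divergence(p, q):
    """KL divergence D(P || Q) over a shared vocabulary; assumes q[w] > 0
    for every w with p[w] > 0."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p if p[w] > 0)

# Hypothetical unigram models: top-ranked documents vs. whole collection.
top_docs_lm   = {"engine": 0.4, "diesel": 0.35, "fuel": 0.2, "the": 0.05}
collection_lm = {"engine": 0.02, "diesel": 0.01, "fuel": 0.02, "the": 0.95}

# A focused (unambiguous) result set diverges strongly from the collection.
clarity = kl_divergence(top_docs_lm, collection_lm)
```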
More specific methods for the web environment, Weighted Information Gain and Query Feedback, can be found in [18]; both measures show a good performance for 'ad-hoc' and Named Pages tasks. The first tries to measure the information gain between a state where an average document is retrieved and the state where the real search has been accomplished. Query Feedback, on the other hand, models the retrieval system as a noisy channel whose input is the query and whose output is the obtained ranking list. From this premise the authors try to measure the degree of noise introduced by the retrieval system and, from it, to estimate the query performance.
Carmel et al. [4] try to model the relation among the three components that take part in the retrieval process: the topic, the set of relevant documents and the collection. The relation between them is measured by means of the Jensen-Shannon divergence. Once these measures have been calculated, they propose the application of a machine learning method to combine them.
In the works developed by Yom-Tov et al. [16] the proposed model was based on the agreement between the full query and its sub-queries, where each sub-query includes only one term from the original query. 'Agreement' is measured as the overlap between the ranking list results of the full query and the sub-queries. They conclude that, in general, hard queries will not show agreement between the obtained ranking lists, while a high level of overlap will indicate a good performance query. Aslam and Pavlu [3] propose a technique based on the Jensen-Shannon divergence between the ranking lists obtained from multiple retrieval functions. This approach is based on the idea that for 'easy' queries the ranking functions must agree and therefore a lower divergence will be measured.
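The agreement idea can be sketched as a top-k overlap between ranking lists; the function names, the averaging scheme and the document ids below are our own illustrative choices, not the exact formulation of [16].

```python
def overlap_at_k(list_a, list_b, k=10):
    """Fraction of shared documents among the top-k of two ranking lists."""
    return len(set(list_a[:k]) & set(list_b[:k])) / k

def subquery_agreement(full_ranking, subquery_rankings, k=10):
    """Mean top-k overlap between the full query's ranking and each
    single-term sub-query ranking; low agreement hints at a hard query."""
    overlaps = [overlap_at_k(full_ranking, r, k) for r in subquery_rankings]
    return sum(overlaps) / len(overlaps)

# Hypothetical document ids.
full = ["d1", "d2", "d3", "d4", "d5"]
subs = [["d1", "d2", "d9", "d4", "d8"],    # agrees on 3 of 5
        ["d7", "d8", "d9", "d10", "d11"]]  # agrees on 0 of 5
print(subquery_agreement(full, subs, k=5))  # → 0.3
```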

A different approach based on KLD was proposed by Amati et al. [1]. Here the term frequency divergence between the top retrieved documents and the whole collection is measured. They claim that a well-defined query (a good performance topic) will show a significant divergence.
Recently, an improved version of Clarity Score has been presented by Hauff et al. [8]. The authors propose two main contributions. First, the set of feedback documents used to compute the query language model is restricted to the documents that contain all query terms. Second, they propose to select a subset of terms from the retrieved documents in order to remove the noise generated by those terms with a high document frequency. In what follows, a set of new post-retrieval predictors based on score dispersion along a ranking list will be introduced. These measures employ the ranking list obtained from a retrieval system in order to predict query performance at a low computational cost.

RANKING LIST SCORES DISPERSION AS A PREDICTOR
The approach proposed in this paper is based on the study of the ranking list obtained after a retrieval process is executed. As is well known, a ranking function tries to order documents based on their relevance to a topic. For this purpose a ranking function assigns a weight (or score) to each document in the collection. In a naive sense this score can be interpreted as a 'quantitative measure' of the document's relevance. The distribution of scores along the ranking list can thus be indicative of the quality of the results for a specific topic. Based on this premise, some differences between the document score distributions of good and poor performing topics should be observable.
For example, a high dispersion among the document scores of a ranking list could be a sign that the ranking function has been able to discriminate between relevant and non-relevant documents. On the other hand, a low level of dispersion, where the ranking function has assigned similar weights, can be interpreted as a sign that it was not able to distinguish between relevant and non-relevant documents.
The differences in terms of score dispersion can be observed in figure 1, where some of the best and worst performing topics for Robust 2004 are represented. As can be seen, the best topics show a greater distance between maximum and minimum score and a sharp slope. In contrast, topics with poor performance show a smaller distance between maximum and minimum and a gentler slope.
A feature of this approach is that it can be applied independently of the ranking model employed for retrieval. The reason is that retrieval models try to maximise, in terms of score, the differences between relevant and non-relevant documents¹. Based on this property, the proposed approach can be considered a generic method of performance prediction, not dependent on the model applied for document weighting.
The measures introduced in the next section focus on capturing the differences between good and poor performing topics in terms of dispersion. In order to evaluate the quality of the proposed measures, the correlation between them and AP (average precision) will be computed.

Proposed Measures
A reliable method to capture and measure dispersion along the obtained ranking list is a key part of this work. Prior studies have tried to model how document weights are distributed along a ranking list. In general, and as a simplification, it can be assumed that an adequate model is a mixture of an exponential and a normal probability distribution: exponential for non-relevant documents and normal for relevant documents [12, 13]. Generally a great number of the retrieved documents are not relevant (exponential distribution), thus it is likely that a great number of documents will be assigned a low score. As a consequence, ranking lists tend to exhibit a long tail where the majority of non-relevant documents are placed, see figure 1. It is important to understand how this score distribution will affect some of the typical dispersion measures. Some notation is needed to define the proposed measures: (i) a ranking list RL is a document list sorted in decreasing order of document score; (ii) score(d_i) is the score assigned by the ranking function to a document d placed at position i in the ranking list.
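The exponential-plus-normal assumption can be illustrated by generating a synthetic ranking list; the mixture weights and distribution parameters below are arbitrary choices for illustration, not values fitted as in [12, 13].

```python
import random

random.seed(42)

# Many non-relevant documents with exponentially distributed low scores,
# plus a few relevant ones with normally distributed higher scores.
non_relevant = [random.expovariate(10.0) for _ in range(950)]
relevant = [random.gauss(0.7, 0.1) for _ in range(50)]

# Sorting all scores in decreasing order yields a ranking list with the
# long low-score tail described above.
ranking_scores = sorted(non_relevant + relevant, reverse=True)
```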
1. Minimum normalised score: A first approach to measuring dispersion is based on the lowest document score found in the ranking list. The minimum score must be normalised in order to make it comparable with the minimum scores obtained by the rest of the topics. Thus, for a ranking list of size N:

score_min(RL) = score(d_N) / score(d_1)

2. Standard deviation: The standard deviation σ is a simple measure of the variability or dispersion of a data set. A low standard deviation indicates that the data points tend to be very close to the mean µ, while a high standard deviation indicates that the data are spread out over a large range of values. Given the mean of the ranking list scores µ(RL), the standard deviation is computed as:

σ(RL) = sqrt( (1/N) Σ_{i=1}^{N} (score(d_i) − µ(RL))² )

A drawback in the use of the standard deviation is caused by the great number of low scores assigned by the ranking function. As described previously, a high percentage of document scores have a low value, which causes the mean to be displaced towards the region of densest distribution, that is, the tail of the ranking list, as can be seen in figure 2 (left). As a consequence, the deviation of the top documents is not captured properly when the standard deviation is computed along the full ranking list.
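A minimal sketch of the two measures just described, over a hypothetical five-document score list (function names are our own):

```python
import math

def min_normalised_score(scores):
    """Minimum score divided by the maximum; `scores` is the ranking list
    score vector in decreasing order."""
    return scores[-1] / scores[0]

def std_dev(scores):
    """Plain (population) standard deviation of the ranking list scores."""
    mu = sum(scores) / len(scores)
    return math.sqrt(sum((s - mu) ** 2 for s in scores) / len(scores))

scores = [0.9, 0.8, 0.3, 0.2, 0.1]           # hypothetical score list
print(min_normalised_score(scores))           # → 0.1 / 0.9 ≈ 0.111
print(std_dev(scores))                        # ≈ 0.326
```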

Maximum Standard Deviation:
A different approach to minimising the effect of the high frequency of low scores is to compute the maximum standard deviation. This estimator is based on the idea of computing the standard deviation at each point of the ranking list and selecting the maximum value found. As can be seen in figure 2 (right), which shows how the standard deviation evolves along the ranking list, this measure tends to decrease once the maximum has been reached, coinciding with the start of the ranking list tail. σ_max is defined as:

σ_max(RL) = max_{k ∈ [1, N]} σ(RL_k)

where RL_k denotes the sub-list formed by the top k documents of RL. In the next section, the experimental setup designed to test the validity of the proposed measures is described.
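A sketch of σ_max as described above: the standard deviation is recomputed at every cut-off k and the maximum is kept (an incremental O(N) version is possible; this quadratic form is for clarity). The score list, with a long flat tail, is hypothetical.

```python
import math

def sigma_max(scores):
    """Maximum standard deviation over all prefixes of the ranking list:
    sigma is computed at each cut-off k and the largest value is kept."""
    best = 0.0
    for k in range(1, len(scores) + 1):
        prefix = scores[:k]
        mu = sum(prefix) / k
        sigma = math.sqrt(sum((s - mu) ** 2 for s in prefix) / k)
        best = max(best, sigma)
    return best

# With a long flat tail, the full-list sigma is dragged down, while the
# prefix maximum is reached near the end of the informative head.
scores = [0.9, 0.7, 0.5] + [0.05] * 97
print(sigma_max(scores))  # ≈ 0.345, versus ≈ 0.114 over the full list
```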

EXPERIMENTAL SETUP
The validity of a query performance predictor is tested by means of the correlation coefficient between the values of the proposed predictor and the actual performance of the search system in terms of AP.
Besides the correlation coefficients, a standard test collection and a set of state-of-the-art retrieval models have been used.
Correlation: In the related literature three different correlation coefficients can be found: Pearson, Spearman (ρ) and Kendall (τ). Pearson indicates the strength and direction of a linear relationship between two data series. Kendall and Spearman are based on rank correlation between both data series: in both cases the values are ranked and the correlation coefficient depends on the observable differences between both rankings. More specifically, Spearman applies a basic linear correlation between both rankings, while Kendall computes the correlation by counting the pairwise swaps needed to transform one ranking into the other. The three correlation coefficients produce a real number in the range [−1, 1], where 1 means perfect correlation, −1 means perfect inverse correlation and 0 means no correlation at all.
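The three coefficients can be sketched in a few lines of pure Python (tie handling is omitted for brevity; in practice a statistics library would be used). The predictor and AP values below are hypothetical.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Linear correlation between two equal-length data series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(v):
    """Rank positions (1 = smallest); assumes no ties for simplicity."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0] * len(v)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    """Pearson correlation applied to the rank vectors."""
    return pearson(ranks(x), ranks(y))

def kendall(x, y):
    """tau-a: concordant minus discordant pairs over all pairs (no ties)."""
    pairs = list(combinations(range(len(x)), 2))
    c = sum(1 if (x[i] - x[j]) * (y[i] - y[j]) > 0 else -1 for i, j in pairs)
    return c / len(pairs)

# Hypothetical predictor values vs. per-topic AP: same ordering, so the
# rank coefficients are 1.0 while Pearson stays slightly below 1.
pred = [0.2, 0.5, 0.3, 0.9, 0.7]
ap   = [0.1, 0.4, 0.2, 0.8, 0.5]
```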

Test data:
The different measures proposed in this paper have been tested on the set of documents from TREC Disks 4 & 5, minus the Congressional Record. This data set was employed for the Robust Track 2004 and contains around 528,000 documents with a total size of almost 2 GB [14]. The topics are those available from the same track, that is, topics 301-450 and 601-700², making a total of 249 topics with their relevance judgements. Only the title field of the topics has been employed in the experiments.
Ranking models: Since the proposed method can be applied to any retrieval model, we have selected a set of retrieval models representative enough to test its validity and to compare the obtained prediction values among them. For each retrieval model³ the parameters have been fixed to the values recommended by the Terrier documentation: • Okapi BM25 [11] with parameters b = 0.34, k_1 = 1.2 and k_3 = 8.
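For reference, a sketch of the BM25 term weight with the parameter values quoted above. This is the common textbook formulation and may differ in small details from Terrier's implementation; k_3 affects only query-term weighting for long queries and is omitted here. The statistics in the example are hypothetical.

```python
import math

def bm25_term_weight(tf, df, doc_len, avg_doc_len, num_docs, k1=1.2, b=0.34):
    """Okapi BM25 weight contributed by one query term occurring tf times
    in a document of length doc_len; df is the term's document frequency."""
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    tf_norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * tf_norm

# Hypothetical statistics: tf=3 in a 120-word document, df=100, |C|=10,000.
w = bm25_term_weight(tf=3, df=100, doc_len=120, avg_doc_len=100, num_docs=10000)
```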

RESULTS
The results⁴ obtained after the execution of the experiments are shown in table 1. These measures were computed with a default ranking list size of 1000, the standard size in TREC experiments for the MAP calculation. As can be seen, the correlation coefficients obtained with the different retrieval models are similar. This similarity can be interpreted as evidence of a common behaviour in terms of scoring, as was suggested before.
The minimum score measure shows a significant degree of correlation, although its performance could drop when a whole ranking list is used. This is a consequence of the tendency of ranking list scores to approach zero at the lower positions once they have been normalised by the maximum score found.
As was expected, and due to the ranking list tail effect described before, the standard deviation shows a poor performance in measuring score dispersion.
On the other hand, the results obtained with the maximum standard deviation outperform those achieved with the standard deviation. Therefore σ_max avoids, at least in part, the lack of precision, in terms of dispersion measurement, suffered by the classic standard deviation.

CONCLUSIONS
In this paper some novel query performance predictors have been introduced. The obtained results show that measures based on the standard deviation of the scores of a ranking list can be used to predict the quality of a search system's reply.
The application of the standard deviation as a dispersion measure for a ranking list has shown itself to be a weak approach, due to the noise introduced by the set of non-relevant documents retrieved. In order to avoid this effect, σ_max has been applied, improving the results by acting as a noise reduction method. The correlation degree has been calculated with the most widely used correlation coefficients, obtaining similar results for all of them. The obtained results outperform pre-retrieval approaches at a similar computational cost, and are similar to those of post-retrieval approaches with the advantage of a much lower computational cost.
The consistency of the results across the different retrieval models suggests that the proposed method is valid independently of the retrieval model. This generality is achieved because the method relies on a behaviour expected of any ranking retrieval model: the ability to distinguish between relevant and non-relevant documents for a specific topic.

FUTURE WORK
As a consequence of the study of the related work and the development of a new family of predictors, some open questions have been identified which could guide further research on this topic. Firstly, it has not been clearly established which correlation coefficient is more adequate. Evaluations based on any of the three traditional correlation coefficients can be found, applied interchangeably and without a clear justification of their use. This issue has recently been discussed by Hauff et al. in [7].
Moreover, an evaluation based only on correlation coefficients can be hard to interpret. As can be found in the related literature, the results obtained in terms of correlation degree are almost equivalent for many prediction methods. Some other measures have been proposed, such as the AP-based rank correlation coefficient by Yilmaz et al. [15] and the Root Mean Square Error in [7].
In our opinion, a new family of evaluation measures should be proposed, focused on the qualitative aspects of prediction methods; such measures should guide us towards a better understanding of the possible applications of prediction methods.

Figure 2: Scores histogram and scores standard deviation, respectively, for topic 313. Scores have been normalised in the range [0, 1]. The maximum number of retrieved documents has been fixed to 1000.