Word Embedding based Approaches for Information Retrieval



INTRODUCTION
Interest in using word embeddings has expanded into many areas of text processing in recent years, following the introduction of the word2vec model by Mikolov et al. (2013) and the GloVe model by Pennington et al. (2014). Word embedding models use large amounts of text to learn low-dimensional representations of words, capturing relationships between words without any external supervision. The resulting representations have been shown to replicate many linguistic regularities, such as semantic similarity between terms, conceptual composition of terms, and analogy relations between terms. These features can be exploited in Information Retrieval (IR), where retrieval functions primarily depend on statistical co-occurrences. Keyword-based retrieval systems often suffer from the vocabulary mismatch problem. For example, given the information need 'vehicle industry in Germany', relevant documents might not be retrieved because they use 'automobile' in place of 'vehicle', and documents mentioning 'volkswagen' might not receive suitable importance. Research has been ongoing to overcome this vocabulary mismatch problem using word embeddings. In this paper, some of the approaches that use word embeddings for better retrieval are presented. Empirical experiments have shown that embedding information contributes positively to text retrieval. The rest of the paper is organized as follows. In Section 2, a baseline retrieval model that uses word embeddings is presented. Following that, in Section 3, word embedding based query expansion methods are elaborated. Empirical evidence of the superiority of both types of models over state-of-the-art retrieval models is shown after the presentation of the corresponding models. The paper is concluded with some future directions of study in Section 4.

BASELINE RETRIEVALS
In this section, a novel baseline retrieval method is presented: the Generalized Language Model (GLM), a retrieval model that uses word embeddings for better retrieval performance (Ganguly et al. 2015), together with experimental results, in Section 2.1. This model is intended to address the vocabulary mismatch problem in retrieval. Given a query Q = {q_1, q_2, ..., q_n}, documents are returned as a list ordered by the posterior probability P(d|q), which is estimated using the probabilities P(q|d) and P(q|C), that is, the maximum likelihood estimates of generating a query term q from the document d and from the collection C respectively, computed from frequency statistics (see Equation 1). What if a noise channel changes a word q to another word q' in the vocabulary while preserving the semantic meaning? In that case, Equation 1 will not account for the presence of those q' in the documents. To overcome this problem, a term transformation based model is proposed (see Equation 2). In this model, the query likelihood is estimated using three term transformation events (see Figure 1 for a graphical representation).

Generalized Language Model
1. Direct term sampling: standard language model term sampling.
2. Transformation via document sampling: sampling a term q' from d, which is then transformed to q by a noisy channel.
3. Transformation via collection sampling: sampling a term q' from the collection C, which is then transformed to q by the noisy channel.

Figure 1: Schematics of generating a query term q in GLM.

The transformation probability via collection sampling is estimated as

P(q, q' \mid C) = P(q \mid q', C)\, P(q' \mid C) = \frac{\mathrm{sim}(q, q')}{\sum_{t \in N_q} \mathrm{sim}(q, t)} \cdot \frac{cf(q')}{|C|}

Here, sim(a, b) denotes the cosine similarity between the embeddings of terms a and b, N_q is the neighbourhood of q in the embedding space, cf(q') is the collection frequency of q', and |C| is the collection size. The prior probability of the transformed term, P(q'), is kept uniform over all terms. Considering these three term transformation events together, the retrieval function is stated in Equation 2. Experiments on TREC collections show that this method yields significantly better performance than standard LM based baseline models as well as an LDA smoothed language model (see Table 1).
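The mixture of these events can be sketched in code. This is a minimal illustration, not the paper's exact parameterisation: the 2-d embeddings are hypothetical toy values, and `lam`/`alpha` are illustrative mixture weights standing in for GLM's tuned parameters.

```python
import math

# Hypothetical 2-d toy embeddings; the real GLM uses word2vec vectors.
EMB = {
    "vehicle":    [0.9, 0.1],
    "automobile": [0.85, 0.2],
    "germany":    [0.1, 0.9],
}

def cos(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def p_transform(q, q_prime, vocab):
    # Noisy-channel probability P(q | q'): cosine similarity of the embeddings,
    # normalised over the candidate neighbourhood (here: the whole toy vocab).
    sims = {t: cos(vocab[q], vocab[t]) for t in vocab if t != q}
    return sims[q_prime] / sum(sims.values())

def glm_term_score(q, doc_tf, coll_tf, coll_size, lam=0.4, alpha=0.3):
    # Simplified GLM-style mixture: direct sampling + document-side
    # transformation + collection smoothing.
    dl = sum(doc_tf.values())
    direct = doc_tf.get(q, 0) / dl
    transformed = sum(
        p_transform(q, t, EMB) * (doc_tf[t] / dl)
        for t in doc_tf if t != q and t in EMB
    )
    collection = coll_tf.get(q, 0) / coll_size
    return lam * direct + alpha * transformed + (1 - lam - alpha) * collection
```

Note that a document containing only 'automobile' and 'germany' still receives a non-zero score for the query term 'vehicle', which is exactly the vocabulary-mismatch behaviour GLM targets.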

QUERY EXPANSION
Semantic similarity information can also be used for query expansion. In this section, two query expansion methods that use word embeddings are demonstrated.
The first work, in Section 3.1, is a nearest neighbour based approach that expands a given query with terms that are similar to the query terms in the embedding space. In Section 3.2, semantic similarity is applied in conjunction with a relevance model, specifically RM3 (Jaleel et al. 2004), a mixture of the relevance model and the original query model. This section describes two QE methods that use the embeddings of individual query terms. The first is a kNN based QE method that makes use of semantically similar terms to expand a query; unlike pseudo relevance feedback (PRF) based QE methods, this approach does not require an initial round of retrieval. The second is a straightforward variation of the first that uses word embeddings in conjunction with a set of pseudo-relevant documents in an ad hoc way.

Pre-retrieval kNN based approach
Let the given query Q be {q_1, ..., q_m}. In this approach, the set C of candidate expansion terms is defined by finding terms that are semantically similar to the q_i's, following Equation 3. For a term t to be chosen as a nearest neighbour of Q, it should have a high mean inner product similarity with all the terms of Q, following Equation 4.
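The pre-retrieval expansion can be sketched as follows. The toy vocabulary, its 2-d embeddings, and the `expand_query` name are all illustrative, and cosine similarity is used here as a stand-in for the mean inner product of Equation 4.

```python
import math

# Hypothetical toy embeddings; in practice these come from word2vec or GloVe.
EMB = {
    "vehicle":    [0.9, 0.1],
    "automobile": [0.85, 0.2],
    "car":        [0.8, 0.15],
    "germany":    [0.1, 0.9],
    "berlin":     [0.15, 0.85],
    "banana":     [-0.5, 0.2],
}

def cos(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def expand_query(query_terms, k=2):
    # Rank every non-query vocabulary term by its mean similarity to all
    # query terms (the spirit of Equations 3 and 4) and keep the top k.
    candidates = [t for t in EMB if t not in query_terms]
    scored = [
        (t, sum(cos(EMB[t], EMB[q]) for q in query_terms) / len(query_terms))
        for t in candidates
    ]
    scored.sort(key=lambda x: x[1], reverse=True)
    return [t for t, _ in scored[:k]]
```

For the query {'vehicle', 'germany'}, terms like 'automobile' rank high while unrelated terms like 'banana' are excluded, with no initial retrieval round required.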
Post-retrieval kNN based approach

In the next approach, a set of pseudo-relevant documents (PRD) is used to restrict the search domain for the candidate expansion terms. Instead of searching for nearest neighbours within the entire vocabulary of the document collection, only those terms that occur within the PRD are considered. The rest of the procedure for obtaining the expanded query is the same as in its pre-retrieval counterpart. (In Table 2, a * in the Pre-ret and Post-ret columns denotes a significant improvement over the baseline; significance testing has been performed using a paired t-test with 95% confidence.)
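The post-retrieval variant differs only in its candidate pool, as the following sketch shows; the toy embeddings and the `expand_post_retrieval` name are again illustrative, not taken from the paper.

```python
import math

# Hypothetical 2-d toy embeddings.
EMB = {"vehicle": (0.9, 0.1), "automobile": (0.85, 0.2),
       "car": (0.8, 0.15), "berlin": (0.15, 0.85)}

def cos(a, b):
    return (a[0] * b[0] + a[1] * b[1]) / (math.hypot(*a) * math.hypot(*b))

def expand_post_retrieval(query_terms, prd_vocab, k=1):
    # Identical scoring to the pre-retrieval method, but candidates are
    # intersected with the vocabulary of the pseudo-relevant documents.
    candidates = [t for t in EMB if t in prd_vocab and t not in query_terms]
    scored = sorted(
        candidates,
        key=lambda t: sum(cos(EMB[t], EMB[q]) for q in query_terms) / len(query_terms),
        reverse=True)
    return scored[:k]
```

Here 'car' is never proposed, despite its high similarity to 'vehicle', if it does not occur in the pseudo-relevant documents.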
In Table 2, the empirical results of the methods are presented. As a baseline, a language model with linear smoothing is used. It can be seen that semantic similarity is useful even when applied in an ad hoc way. Note that the performance of the two methods (pre- and post-retrieval) is similar. The reason for this behaviour is that the embedding model is learned over the entire vocabulary; even when the search space is restricted, most of the terms having a context similar to that of the query terms still occur within the retrieved feedback documents.

Word Embedding based Relevance Model (Roy et al. 2016a)
Section 3.1 showed that simple approaches using word embeddings for retrieval can result in significantly better performance. Applying embeddings in a more systematic manner, this section presents a novel word embedding based relevance model. The relevance model hypothesises that both a query and its relevant documents are sampled from a hidden relevance model R, which needs to be estimated. The query Q = {q_1, ..., q_k} serves as the only evidence about R. In the absence of relevance judgements, the top-ranked documents after an initial retrieval can be treated as a set of (pseudo-)relevant documents. Thus, the probability of sampling a term w from this model, denoted P(w|R), is approximated by P(w|Q). This conditional probability can be expressed as the joint probability of observing the term w along with the query words. The probability P(w|Q) is then estimated as follows: the term w is sampled conditionally, together with q_1, ..., q_k, from the same distribution underlying a top-ranked document. Thus the estimate of P(w|R) follows Equation 5, where M is the set of (pseudo-)relevant documents. Finally, the terms with high P(w|R) are chosen as expansion terms and re-retrieval is performed.
Jaleel et al. (2004) showed that the RLM feedback model works significantly better when applied in conjunction with the query likelihood model, following Equation 6.
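A compact sketch of the estimation in the spirit of Equations 5 and 6, with feedback documents represented as term-frequency dictionaries; the `eps` floor is a crude stand-in for proper smoothing and is not part of the original model.

```python
def p_w_given_r(w, query, docs, eps=1e-3):
    # Equation 5 in spirit: sum over pseudo-relevant documents of P(w|D)
    # times the likelihood of the whole query under D (uniform prior over D).
    def p(t, tf):
        n = sum(tf.values())
        return max(tf.get(t, 0) / n, eps)  # crude floor instead of smoothing
    score = 0.0
    for tf in docs:
        q_lik = 1.0
        for q in query:
            q_lik *= p(q, tf)
        score += p(w, tf) * q_lik
    return score / len(docs)

def rm3_mix(w, query, docs, lam=0.5):
    # Equation 6 in spirit: RM3-style interpolation of the maximum-likelihood
    # query model with the relevance model estimate.
    return lam * query.count(w) / len(query) + (1 - lam) * p_w_given_r(w, query, docs)
```

A term scores high under P(w|R) when it is frequent in feedback documents that themselves assign high likelihood to the query, which is how co-occurrence with the query terms is rewarded.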
Kernel Density Estimation based Relevance Model

Given a set of observed points, kernel density estimation (KDE) estimates the distribution function that generated the data. Mathematically, given n observed data points and a kernel function K(·), Equation 7 represents the estimated probability density function, where h is the bandwidth parameter and α_i is the weight of the local kernel function around the i-th data point, with Σ_i α_i = 1. For an unknown data point x, f(x, α) will be high if x is, on average, similar to all the observed data points x_i (i = 1, ..., n).
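A minimal one-dimensional sketch of Equation 7, assuming a Gaussian kernel (the paper's own kernel choice is not restated here) and uniform weights α_i by default:

```python
import math

def kde(x, points, weights=None, h=0.2):
    # Equation 7 in spirit: f(x) = sum_i (alpha_i / h) * K((x - x_i) / h),
    # with a Gaussian kernel K and sum_i alpha_i = 1.
    if weights is None:
        weights = [1.0 / len(points)] * len(points)
    gauss = lambda u: math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
    return sum(a / h * gauss((x - xi) / h) for a, xi in zip(weights, points))
```

As expected, the estimated density is high near the observed points and decays with distance from them.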
In order to go beyond binary term-independence based co-occurrence statistics, a word embedding based scheme is used. Following this scheme, if a word w and a query term q_i are semantically related, they will occur in similar contexts; consequently, the vector embeddings of w and q_i will be in close proximity. With the query terms embedded as vectors, the probability density function P(w|R) can be estimated with KDE. To see how this happens, imagine the existence of a continuous probability density function f(w) from which the discrete probabilities P(w|R) of the RLM (see Equation 5) are sampled. The shape of this relevance density function is controlled by a set of pivot points consisting of the query term embeddings. The key idea is illustrated in Figure 2, which shows a sample query with three terms. Note that embedding the query terms as vectors makes it possible to define distances between them. This enables the relevance model probability distribution function to be visualized as a function pivoted around the query vectors projected onto a one-dimensional line, as shown in Figure 2a.
For a query consisting of k terms, Q = {q_1, ..., q_k}, instead of k pivot points as in the one-dimensional KDE, we now consider kM points, where M is the number of feedback documents. Each pivot point represents a query term q_i, i = 1, ..., k, occurring in a document D_j, j = 1, ..., M, the points being shown in a grid layout in Figure 2c. The x-axis corresponds to the query term vectors, as in the one-dimensional KDE (Figures 2a and 2b). The y-axis, in this case, represents the normalized term frequency of the query terms in the respective feedback documents. The shape of the density function, which is now a mixture of two-dimensional Gaussians, is shown as contour lines. It can be seen from Figure 2c that the value of the density function is maximal at the pivot points themselves and gradually decreases with increasing neighbourhood size. This is shown with different shades of grey in Figure 2c, where a darker shade denotes a higher value of the density function and a lighter one a smaller value. Formally, the kernel functions around the data points are bivariate normal distributions.
A pivot point x_ij = (q_i, D_j) in this two-dimensional space encapsulates the word vector of q_i and its normalized term frequency in feedback document D_j, i.e. P(q_i|D_j).
For simplicity, we define the covariance matrix Σ as a diagonal matrix with equal covariance in both dimensions (notice in Figure 2c that the contours are circles instead of ellipses).
The density function estimate for a point x = (w, D_j) is shown in Equation 8. Substituting the values of x and the pivot points x_ij into Equation 8, we get Equation 9. In Equation 9, the weights of the kernels, α_ij, are set to P(w|D_j) P(q_i|D_j). From Equation 9, we can observe that a term w in document D_j, denoted by the two-dimensional vector x, will get a high value of the density function if: (i) the embedded vector of w is close to the query terms q_i, i = 1, ..., k, or in other words, w is semantically related to the query terms; and (ii) w frequently co-occurs with the query terms in each top-ranked document D_j, j = 1, ..., M.
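These two conditions can be made concrete with a small sketch. For readability the toy embeddings are one-dimensional scalars and the kernel is applied only along the semantic axis, with the frequency dimension folded into the weights α_ij = P(w|D_j)P(q_i|D_j); the full model of Equations 8-9 uses bivariate Gaussians over word vectors.

```python
import math

# Hypothetical scalar "embeddings" standing in for word vectors.
EMB = {"vehicle": 0.9, "automobile": 0.85, "banana": -0.5}

def kde_score(w, query, docs, h=0.3):
    # A candidate w scores high iff it is embedded near the query terms AND
    # co-occurs with them in the feedback documents (conditions (i) and (ii)).
    score = 0.0
    for tf in docs:
        n = sum(tf.values())
        p_w = tf.get(w, 0) / n
        for q in query:
            p_q = tf.get(q, 0) / n                          # alpha via P(qi|Dj)
            sem = math.exp(-0.5 * ((EMB[w] - EMB[q]) / h) ** 2)  # Gaussian kernel
            score += p_w * p_q * sem
    return score
```

In a feedback document containing 'vehicle', 'automobile' and 'banana', the semantically close 'automobile' dominates 'banana' even though both co-occur with the query term.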
The empirical performance of the method is shown in Table 3. It can be seen that the performance of the proposed method is always significantly better than the RM3 baseline model.

CONCLUSION AND FUTURE WORK
There is considerable scope for improving retrieval results using the features of word embeddings. A possible direction of future work is to explore applications of word embeddings in other IR tasks, such as query categorization and novelty and diversity detection of documents.

Figure 2: Relevance model density estimation with 1- and 2-dimensional KDE.

Table 1: Comparative performance of LM, LDA and GLM on the TREC query sets.

Table 2: MAP for baseline retrieval and various QE strategies.

Table 3: Results of KDE feedback methods with QE. Parameters M (#fdbk docs) and N (#expansion terms) are tuned on the development set (TREC 6). † denotes significance with respect to RLM.