
Remedies against the Vocabulary Gap in Information Retrieval

Preprint


Abstract

Search engines rely heavily on term-based approaches that represent queries and documents as bags of words. Text (a document or a query) is represented by the bag of its words, which ignores grammar and word order but retains word frequency counts. When presented with a search query, the engine ranks documents according to relevance scores computed, among other things, from the degree to which query and document terms match. While term-based approaches are intuitive and effective in practice, they rest on the hypothesis that documents containing exactly the query terms are highly relevant regardless of query semantics. Conversely, they treat documents that do not contain the query terms as irrelevant. However, it is known that a high matching degree at the term level does not necessarily mean high relevance and, vice versa, that documents matching none of the query terms may still be relevant. Consequently, there exists a vocabulary gap between queries and documents that arises when the two use different words to describe the same concepts. Alleviating the effects of this vocabulary gap is the topic of this dissertation. More specifically, we propose (1) methods to formulate an effective query from complex textual structures and (2) latent vector space models that circumvent the vocabulary gap in information retrieval.
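To make the term-based setup concrete, here is a minimal sketch, not taken from the dissertation: the texts, function names, and raw term-overlap scoring are illustrative only. Production engines weight terms with schemes such as TF-IDF or BM25 rather than raw counts, but the failure mode is the same.

    from collections import Counter

    def bag_of_words(text):
        # Ignore grammar and word order; keep only term frequency counts.
        return Counter(text.lower().split())

    def term_match_score(query, document):
        # Sum, over the query's terms, how often each occurs in the document.
        # A document sharing no exact terms with the query scores zero, even
        # if it describes the same concept in other words.
        doc_counts = bag_of_words(document)
        return sum(doc_counts[term] for term in bag_of_words(query))

    query = "car repair"
    docs = [
        "the automobile industry builds and fixes vehicles",  # relevant, scores 0
        "my car broke down on the highway",                   # shares "car", scores 1
    ]
    for doc in docs:
        print(term_match_score(query, doc), "-", doc)

The first document is plainly about the query's topic yet scores zero because it shares no exact term with the query; that mismatch is precisely the vocabulary gap the dissertation targets.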


Most cited references (60)


Unskilled and unaware of it: how difficulties in recognizing one's own incompetence lead to inflated self-assessments.

J. Kruger, D. Dunning (1999)

People tend to hold overly favorable views of their abilities in many social and intellectual domains. The authors suggest that this overestimation occurs, in part, because people who are unskilled in these domains suffer a dual burden: Not only do these people reach erroneous conclusions and make unfortunate choices, but their incompetence robs them of the metacognitive ability to realize it. Across 4 studies, the authors found that participants scoring in the bottom quartile on tests of humor, grammar, and logic grossly overestimated their test performance and ability. Although their test scores put them in the 12th percentile, they estimated themselves to be in the 62nd. Several analyses linked this miscalibration to deficits in metacognitive skill, or the capacity to distinguish accuracy from error. Paradoxically, improving the skills of participants, and thus increasing their metacognitive competence, helped them recognize the limitations of their abilities.

The Automatic Creation of Literature Abstracts

H. P. Luhn (1958)

Scalable Nearest Neighbor Algorithms for High Dimensional Data

M. Muja, D. G. Lowe (2014)

For many computer vision and machine learning problems, large training sets are key for good performance. However, the most computationally expensive part of many computer vision and machine learning algorithms consists of finding nearest neighbor matches to high dimensional vectors that represent the training data. We propose new algorithms for approximate nearest neighbor matching and evaluate and compare them with previous algorithms. For matching high dimensional features, we find two algorithms to be the most efficient: the randomized k-d forest and a new algorithm proposed in this paper, the priority search k-means tree. We also propose a new algorithm for matching binary features by searching multiple hierarchical clustering trees and show it outperforms methods typically used in the literature. We show that the optimal nearest neighbor algorithm and its parameters depend on the data set characteristics and describe an automated configuration procedure for finding the best algorithm to search a particular data set. In order to scale to very large data sets that would otherwise not fit in the memory of a single machine, we propose a distributed nearest neighbor matching framework that can be used with any of the algorithms described in the paper. All this research has been released as an open source library called the Fast Library for Approximate Nearest Neighbors (FLANN), which has been incorporated into OpenCV and is now one of the most popular libraries for nearest neighbor matching.
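As a usage note, FLANN's index types are exposed through OpenCV's Python bindings. A minimal sketch of an approximate k-nearest-neighbor search over the randomized k-d forest from the abstract might look as follows; the random vectors stand in for real descriptor data, and the parameter values (5 trees, 50 checks) are illustrative, not recommendations from the paper.

    import numpy as np
    import cv2

    # FLANN_INDEX_KDTREE selects the randomized k-d forest; "trees" is the
    # number of parallel k-d trees searched together.
    FLANN_INDEX_KDTREE = 1
    index_params = dict(algorithm=FLANN_INDEX_KDTREE, trees=5)
    # "checks" bounds how many leaves are visited per query: more checks
    # means higher precision but slower search.
    search_params = dict(checks=50)

    matcher = cv2.FlannBasedMatcher(index_params, search_params)

    rng = np.random.default_rng(0)
    train = rng.random((10_000, 128), dtype=np.float32)  # stand-in descriptors
    queries = rng.random((5, 128), dtype=np.float32)

    # Approximate 2-nearest-neighbor search for each query vector.
    for i, (best, second) in enumerate(matcher.knnMatch(queries, train, k=2)):
        print(i, best.trainIdx, round(best.distance, 3),
              second.trainIdx, round(second.distance, 3))

The approximate/exact trade-off is controlled entirely by the index and search parameters, which is what the paper's automated configuration procedure tunes per data set.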

Author and article information

Published: 16 November 2017
arXiv ID: 1711.06004
Record ID: 195555a8-3bba-4ece-bdb8-9e9a48ee5885
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Custom metadata: PhD thesis
Subject classes: cs.IR, cs.AI, cs.CL
