2
views
0
recommends
+1 Recommend
0 collections
    0
    shares
      • Record: found
      • Abstract: found
      • Article: not found

      High-dimensional similarity searches using query driven dynamic quantization and distributed indexing

      research-article

      Read this article at

      ScienceOpenPublisherPMC
      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          The concept of similarity is used as the basis for many data exploration and data mining tasks. Nearest neighbor (NN) queries identify the most similar items, or in terms of distance the closest points to a query point. Similarity is traditionally characterized using a distance function between multi-dimensional feature vectors. However, when the data is high-dimensional, traditional distance functions fail to significantly distinguish between the closest and furthest points, as few dissimilar dimensions dominate the distance function. Localized similarity functions, i.e. functions that only consider dimensions close to the query, quantize each dimension independently and only compute similarity for the dimensions where the query and the points fall into the same bin. These quantizations are query-agnostic and there is potential to improve accuracy when a query-dependent quantization is used. In this work we propose a query dependent equi-depth (QED) on-the-fly quantization method to improve high-dimensional similarity searches. The quantization is done for each dimension at query time and localized scores are generated for the closest p fraction of the points while a constant penalty is applied for the rest of the points. QED not only improves the quality of the distancemetric,but also improves query time performance by filtering out non relevant data. We propose a distributed indexing and query algorithm to efficiently compute QED. Our experimental results show improvements in classification accuracy as well as query performance up to one order of magnitude faster than Manhattan-based sequential scan NN queries over datasets with hundreds of dimensions. Furthermore, similarity searches with QED show linear or better scalability in relation to the number of dimensions, and the number of compute nodes.

          Related collections

          Author and article information

          Journal
          101733213
          47954
          Distrib Parallel Databases
          Distrib Parallel Databases
          Distributed and parallel databases
          0926-8782
          1573-7578
          28 June 2019
          11 April 2019
          2020
          28 August 2020
          : 38
          : 255-286
          Affiliations
          [1. ]Department of Computer Engineering, San Jose State University, San Jose, CA, USA
          [2. ]Department of Electrical and Computer Engineering, The University of Iowa, Iowa, IA, USA
          Author notes
          Article
          PMC7453591 PMC7453591 7453591 nihpa1038726
          10.1007/s10619-019-07266-x
          7453591
          32863590
          4927d394-7286-4f3f-b6fe-0a1e6d794c23
          History
          Categories
          Article

          QED,Distributed and parallel algorithms,Query optimization,Query aware quantization,Bit-vector,Indexing,High-dimensional data,Similarity searches

          Comments

          Comment on this article