Task-agnostic Indexes for Deep Learning-based Queries over Unstructured
  Data

There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

Abstract

Unstructured data is now commonly queried by using target deep neural networks (DNNs) to produce structured information, e.g., object types and positions in video. As these target DNNs can be computationally expensive, recent work uses proxy models to produce query-specific proxy scores. These proxy scores are then used in downstream query processing algorithms for improved query execution speeds. Unfortunately, proxy models are often trained per-query, require large amounts of training data from the target DNN, and new training methods per query type. In this work, we develop an index construction method (task-agnostic semantic trainable index, TASTI) that produces reusable embeddings that can be used to generate proxy scores for a wide range of queries, removing the need for query-specific proxies. We observe that many queries over the same dataset only require access to the schema induced by the target DNN. For example, an aggregation query counting the number of cars and a selection query selecting frames of cars require only the object types per frame of video. To leverage this opportunity, TASTI produces embeddings per record that have the key property that close embeddings have similar extracted attributes under the induced schema. Given this property, we show that clustering by embeddings can be used to answer downstream queries efficiently. We theoretically analyze TASTI and show that low training error guarantees downstream query accuracy for a natural class of queries. We evaluate TASTI on four video and text datasets, and three query types. We show that TASTI can be 10x less expensive to construct than proxy models and can outperform them by up to 24x at query time.

Related collections

Author and article information

Journal

Publication date Created: 09 September 2020

Article

ArXiV ID: 2009.04540

SO-VID: 03338374-b1c3-4507-a8dc-f8a382ca6063

License:

http://arxiv.org/licenses/nonexclusive-distrib/1.0/

History

Custom metadata

Categories cs.DB

ScienceOpen disciplines: Databases

Data availability:

ScienceOpen disciplines: Databases

Task-agnostic Indexes for Deep Learning-based Queries over Unstructured Data

Read this article at

Abstract

Related collections

Big Data Privacy

Author and article information

Journal

Article

History

Custom metadata

Comments

Comment on this article

Similar content 132