conLSH: Context based Locality Sensitive Hashing for Mapping of noisy SMRT Reads

Abstract

Single Molecule Real-Time (SMRT) sequencing is a recent advance in next-generation sequencing technology developed by Pacific Biosciences (PacBio). It produces large volumes of long, noisy reads, which call for new methods to make full use of the data. To deal with the high error probability of SMRT data, this article proposes a novel contextual Locality Sensitive Hashing (conLSH) based algorithm that can effectively align noisy SMRT reads to a reference genome. Here, sequences are hashed together based not only on their closeness but also on the similarity of their context. The algorithm has \(\mathcal{O}(n^{\rho+1})\) space requirement, where \(n\) is the number of sequences in the corpus and \(\rho\) is a constant. The indexing time and querying time are bounded by \(\mathcal{O}( \frac{n^{\rho+1} \cdot \ln n}{\ln \frac{1}{P_2}})\) and \(\mathcal{O}(n^\rho)\) respectively, where \(P_2 > 0\) is a probability value. The algorithm is particularly useful for retrieving similar sequences, a common task in biology. The proposed conLSH-based aligner is compared with rHAT, a popular aligner for SMRT reads, and outperforms it in both speed and memory requirements. In particular, it takes approximately \(24.2\%\) less processing time while saving about \(70.3\%\) in peak memory for the H. sapiens PacBio dataset.
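The idea of hashing sequences by both closeness and context can be illustrated with a minimal Python sketch. This is not the authors' implementation or the conLSH aligner itself: the function names (`make_context_hash`, `build_index`, `query`), the parameters (`k`, `context`, `tables`), and the restriction to equal-length sequences are all assumptions made for illustration. Each hash function samples a few positions and builds its key from a small window of symbols around each one, so two sequences collide only if they agree at the sampled positions and in the surrounding context.

```python
import random
from collections import defaultdict


def make_context_hash(seq_len, k, context, seed):
    """Return a hash function (hypothetical helper, for illustration only).

    It samples k positions and keys a sequence by the window of
    2 * context + 1 symbols centred on each sampled position.
    """
    rng = random.Random(seed)
    positions = sorted(
        rng.randrange(context, seq_len - context) for _ in range(k)
    )

    def h(seq):
        return "|".join(seq[p - context:p + context + 1] for p in positions)

    return h


def build_index(sequences, k=2, context=1, tables=8, seed=0):
    """Index equal-length sequences in several independent hash tables."""
    seq_len = len(sequences[0])
    hashes = [
        make_context_hash(seq_len, k, context, seed + t) for t in range(tables)
    ]
    index = [defaultdict(list) for _ in range(tables)]
    for i, s in enumerate(sequences):
        for h, table in zip(hashes, index):
            table[h(s)].append(i)
    return hashes, index


def query(read, hashes, index):
    """Return indices of sequences sharing a bucket with the read in any table."""
    candidates = set()
    for h, table in zip(hashes, index):
        candidates.update(table.get(h(read), []))
    return candidates


# Usage: index a toy corpus, then look up candidates for a read.
corpus = ["ACGTACGTAC", "TTTTGGGGCC", "ACGTTCGTAC"]
hashes, index = build_index(corpus)
candidates = query("ACGTACGTAC", hashes, index)
```

In this sketch, a larger `context` makes collisions stricter (fewer false candidates, more missed noisy matches), while more `tables` raises the chance that a noisy read still shares at least one bucket with its true match. The candidates returned would then need verification by an actual alignment step, which is outside the scope of this sketch.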

Author and article information

Publication date (created): 11 March 2019

arXiv ID: 1903.04925

SO-VID: d7332a6a-7d4d-4e29-a9c1-6579ef75e250

License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/

Comments: arXiv admin note: text overlap with arXiv:1705.03933

Categories: q-bio.GN cs.DS cs.LG stat.ML

ScienceOpen disciplines: Data structures & Algorithms, Machine learning, Artificial intelligence, Genetics