Locality-sensitive hashing for the edit distance

Abstract

Motivation

Sequence alignment is a central operation in bioinformatics pipelines and, despite many improvements, remains a computationally challenging problem. Locality-sensitive hashing (LSH) is one method used to estimate the likelihood that two sequences have a proper alignment. Using an LSH, it is possible to separate, with high probability and relatively low computation, the pairs of sequences that do not have a high-quality alignment from those that may. Therefore, an LSH reduces the overall computational requirement while not introducing many false negatives (i.e. failing to report a valid alignment). However, current LSH methods treat sequences as a bag of k-mers and do not take into account the relative ordering of k-mers in sequences. In addition, because no practical LSH method exists for the edit distance, LSH methods for Jaccard similarity or Hamming similarity are used as a proxy in practice.
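
As an illustration of the bag-of-k-mers limitation, the following minimal Python sketch compares two sequences that contain exactly the same 4-mers arranged in a different order. The example sequences, the choice k = 4, and the helper names `kmer_set`, `jaccard` and `edit_distance` are assumptions made for this illustration and are not taken from the article.

```python
def kmer_set(seq, k):
    """Return the set of k-mers of seq (order and multiplicity are lost)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b, k):
    """Jaccard similarity of the k-mer sets of a and b."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 1.0


def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # match or substitution
        prev = curr
    return prev[-1]


# Two hypothetical sequences: the blocks between the repeated "ACG" are swapped.
s1 = "ACG" + "TTTTT" + "ACG" + "CCCCC" + "ACG"
s2 = "ACG" + "CCCCC" + "ACG" + "TTTTT" + "ACG"

print("Jaccard similarity (k=4):", jaccard(s1, s2, 4))
print("Edit distance:", edit_distance(s1, s2))
```

Both sequences have the same set of 4-mers, so their Jaccard similarity is 1 even though the sequences differ. A method based only on k-mer content cannot tell them apart.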

Results

We present an LSH method, called Order Min Hash (OMH), for the edit distance. This method is a refinement of the minHash LSH used to approximate the Jaccard similarity, in that OMH is sensitive not only to the k-mer contents of the sequences but also to the relative order of the k-mers in the sequences. We present theoretical guarantees of the OMH as a gapped LSH.
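
The following Python sketch shows one way an order-aware variant of minHash could be written. It is a simplified illustration under stated assumptions, not the authors' implementation: each k-mer occurrence is labelled with its occurrence number, the `ell` labelled k-mers with the smallest hash values are selected, and they are reported in the order in which they appear in the sequence. The function names `omh_sketch` and `omh_similarity`, the parameter names, and the use of BLAKE2b as a stand-in for a random ordering are all assumptions.

```python
import hashlib
from collections import defaultdict


def omh_sketch(seq, k, ell, seed):
    """Return ell k-mers of seq, chosen by smallest hash, in sequence order."""
    counts = defaultdict(int)
    items = []
    for pos in range(len(seq) - k + 1):
        kmer = seq[pos:pos + k]
        occ = counts[kmer]          # occurrence number makes repeated k-mers distinct
        counts[kmer] += 1
        # Assumption: a seeded hash stands in for a random ordering of labelled k-mers.
        h = hashlib.blake2b(f"{seed}:{kmer}:{occ}".encode(), digest_size=8).digest()
        items.append((h, pos, kmer))
    smallest = sorted(items)[:ell]          # ell smallest by hash value
    smallest.sort(key=lambda t: t[1])       # restore the order of appearance
    return tuple(kmer for _, _, kmer in smallest)


def omh_similarity(a, b, k, ell, num_hashes):
    """Fraction of seeds for which the two sketches are identical."""
    matches = sum(
        omh_sketch(a, k, ell, seed) == omh_sketch(b, k, ell, seed)
        for seed in range(num_hashes)
    )
    return matches / num_hashes


# Hypothetical sequences with identical 4-mer content but swapped blocks.
s1 = "ACG" + "TTTTT" + "ACG" + "CCCCC" + "ACG"
s2 = "ACG" + "CCCCC" + "ACG" + "TTTTT" + "ACG"

print("ell=1:", omh_similarity(s1, s2, k=4, ell=1, num_hashes=50))
print("ell=2:", omh_similarity(s1, s2, k=4, ell=2, num_hashes=50))
```

With `ell = 1` the sketch depends only on which labelled k-mers are present, so the two example sequences always agree. With `ell >= 2` the order of the selected k-mers also has to match, which is what lets such a sketch distinguish sequences with the same k-mer content in a different order.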

Availability and implementation

The code to generate the results is available at http://github.com/Kingsford-Group/omhismb2019.

Supplementary information

Supplementary data are available at Bioinformatics online.

Author and article information

Journal

Journal ID (nlm-ta): Bioinformatics

Journal ID (iso-abbrev): Bioinformatics

Journal ID (publisher-id): bioinformatics

Title: Bioinformatics

Publisher: Oxford University Press

ISSN (Print): 1367-4803

ISSN (Electronic): 1367-4811

Publication date (Print): July 2019

Publication date (Electronic): 05 July 2019

Publication date PMC-release: 05 July 2019

Volume: 35

Issue: 14

Pages: i127-i135

Affiliations

Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA

Author notes

To whom correspondence should be addressed. gmarcais@cs.cmu.edu or carlk@cs.cmu.edu

Article

Publisher ID: btz354

DOI: 10.1093/bioinformatics/btz354

PMC ID: 6612865

PubMed ID: 31510667

SO-VID: 73649ce4-49bb-4aae-9234-e3598a0e0549

License:

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com.

Page count

Pages: 9

Funding

Funded by: Gordon and Betty Moore Foundation 10.13039/100000936

Funded by: Data-Driven Discovery Initiative

Award ID: GBMF4554

Funded by: US National Institutes of Health

Award ID: R01GM122935

Funded by: The Shurl and Kay Curci Foundation
