3
views
0
recommends
+1 Recommend
0 collections
    0
    shares
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      Locality-sensitive hashing for the edit distance

      research-article
      , , ,
      Bioinformatics
      Oxford University Press

      Read this article at

      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          Motivation

          Sequence alignment is a central operation in bioinformatics pipeline and, despite many improvements, remains a computationally challenging problem. Locality-sensitive hashing (LSH) is one method used to estimate the likelihood of two sequences to have a proper alignment. Using an LSH, it is possible to separate, with high probability and relatively low computation, the pairs of sequences that do not have high-quality alignment from those that may. Therefore, an LSH reduces the overall computational requirement while not introducing many false negatives (i.e. omitting to report a valid alignment). However, current LSH methods treat sequences as a bag of k-mers and do not take into account the relative ordering of k-mers in sequences. In addition, due to the lack of a practical LSH method for edit distance, in practice, LSH methods for Jaccard similarity or Hamming similarity are used as a proxy.

          Results

          We present an LSH method, called Order Min Hash (OMH), for the edit distance. This method is a refinement of the minHash LSH used to approximate the Jaccard similarity, in that OMH is sensitive not only to the k-mer contents of the sequences but also to the relative order of the k-mers in the sequences. We present theoretical guarantees of the OMH as a gapped LSH.

          Availability and implementation

          The code to generate the results is available at http://github.com/Kingsford-Group/omhismb2019.

          Supplementary information

          Supplementary data are available at Bioinformatics online.

          Related collections

          Most cited references12

          • Record: found
          • Abstract: found
          • Article: not found

          A whole-genome assembly of Drosophila.

          We report on the quality of a whole-genome assembly of Drosophila melanogaster and the nature of the computer algorithms that accomplished it. Three independent external data sources essentially agree with and support the assembly's sequence and ordering of contigs across the euchromatic portion of the genome. In addition, there are isolated contigs that we believe represent nonrepetitive pockets within the heterochromatin of the centromeres. Comparison with a previously sequenced 2.9- megabase region indicates that sequencing accuracy within nonrepetitive segments is greater than 99. 99% without manual curation. As such, this initial reconstruction of the Drosophila sequence should be of substantial value to the scientific community.
            Bookmark
            • Record: found
            • Abstract: not found
            • Article: not found

            Binary codes capable of correcting deletions, insertions, and reversals

              Bookmark
              • Record: found
              • Abstract: found
              • Article: not found

              Whole-genome sequence assembly for mammalian genomes: Arachne 2.

              We previously described the whole-genome assembly program Arachne, presenting assemblies of simulated data for small to mid-sized genomes. Here we describe algorithmic adaptations to the program, allowing for assembly of mammalian-size genomes, and also improving the assembly of smaller genomes. Three principal changes were simultaneously made and applied to the assembly of the mouse genome, during a six-month period of development: (1) Supercontigs (scaffolds) were iteratively broken and rejoined using several criteria, yielding a 64-fold increase in length (N50), and apparent elimination of all global misjoins; (2) gaps between contigs in supercontigs were filled (partially or completely) by insertion of reads, as suggested by pairing within the supercontig, increasing the N50 contig length by 50%; (3) memory usage was reduced fourfold. The outcome of this mouse assembly and its analysis are described in (Mouse Genome Sequencing Consortium 2002).
                Bookmark

                Author and article information

                Journal
                Bioinformatics
                Bioinformatics
                bioinformatics
                Bioinformatics
                Oxford University Press
                1367-4803
                1367-4811
                July 2019
                05 July 2019
                05 July 2019
                : 35
                : 14
                : i127-i135
                Affiliations
                Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA
                Author notes
                To whom correspondence should be addressed. gmarcais@ 123456cs.cmu.edu or carlk@ 123456cs.cmu.edu
                Article
                btz354
                10.1093/bioinformatics/btz354
                6612865
                31510667
                73649ce4-49bb-4aae-9234-e3598a0e0549
                © The Author(s) 2019. Published by Oxford University Press.

                This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License ( http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com

                History
                Page count
                Pages: 9
                Funding
                Funded by: Gordon and Betty Moore Foundation 10.13039/100000936
                Funded by: Data-Driven Discovery Initiative
                Award ID: GBMF4554
                Funded by: US National Institutes of Health
                Award ID: R01GM122935
                Funded by: The Shurl and Kay Curci Foundation
                Categories
                Ismb/Eccb 2019 Conference Proceedings
                Comparative and Functional Genomics

                Bioinformatics & Computational biology
                Bioinformatics & Computational biology

                Comments

                Comment on this article