25
views
0
recommends
+1 Recommend
0 collections
    0
    shares
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      Improving the performance of minimizers and winnowing schemes

      research-article

      Read this article at

      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          Motivation: The minimizers scheme is a method for selecting k-mers from sequences. It is used in many bioinformatics software tools to bin comparable sequences or to sample a sequence in a deterministic fashion at approximately regular intervals, in order to reduce memory consumption and processing time. Although very useful, the minimizers selection procedure has undesirable behaviors (e.g. too many k-mers are selected when processing certain sequences). Some of these problems were already known to the authors of the minimizers technique, and the natural lexicographic ordering of k-mers used by minimizers was recognized as their origin. Many software tools using minimizers employ ad hoc variations of the lexicographic order to alleviate those issues.

          Results: We provide an in-depth analysis of the effect of k-mer ordering on the performance of the minimizers technique. By using small universal hitting sets (a recently defined concept), we show how to significantly improve the performance of minimizers and avoid some of its worse behaviors. Based on these results, we encourage bioinformatics software developers to use an ordering based on a universal hitting set or, if not possible, a randomized ordering, rather than the lexicographic order. This analysis also settles negatively a conjecture (by Schleimer et al.) on the expected density of minimizers in a random sequence.

          Availability and Implementation: The software used for this analysis is available on GitHub: https://github.com/gmarcais/minimizers.git.

          Contact: gmarcais@ 123456cs.cmu.edu or carlk@ 123456cs.cmu.edu

          Related collections

          Most cited references4

          • Record: found
          • Abstract: found
          • Article: not found

          Reducing storage requirements for biological sequence comparison.

          Comparison of nucleic acid and protein sequences is a fundamental tool of modern bioinformatics. A dominant method of such string matching is the 'seed-and-extend' approach, in which occurrences of short subsequences called 'seeds' are used to search for potentially longer matches in a large database of sequences. Each such potential match is then checked to see if it extends beyond the seed. To be effective, the seed-and-extend approach needs to catalogue seeds from virtually every substring in the database of search strings. Projects such as mammalian genome assemblies and large-scale protein matching, however, have such large sequence databases that the resulting list of seeds cannot be stored in RAM on a single computer. This significantly slows the matching process. We present a simple and elegant method in which only a small fraction of seeds, called 'minimizers', needs to be stored. Using minimizers can speed up string-matching computations by a large factor while missing only a small fraction of the matches found using all seeds.
            Bookmark
            • Record: found
            • Abstract: found
            • Article: not found

            KMC 2: fast and resource-frugal k-mer counting.

            Building the histogram of occurrences of every k-symbol long substring of nucleotide data is a standard step in many bioinformatics applications, known under the name of k-mer counting. Its applications include developing de Bruijn graph genome assemblers, fast multiple sequence alignment and repeat detection. The tremendous amounts of NGS data require fast algorithms for k-mer counting, preferably using moderate amounts of memory.
              Bookmark
              • Record: found
              • Abstract: not found
              • Article: not found

              A combinatorial problem

                Bookmark

                Author and article information

                Journal
                Bioinformatics
                Bioinformatics
                bioinformatics
                Bioinformatics
                Oxford University Press
                1367-4803
                1367-4811
                15 July 2017
                12 July 2017
                12 July 2017
                : 33
                : 14
                : i110-i117
                Affiliations
                [1 ]Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA
                [2 ]Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel
                [3 ]Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, MA, USA
                Author notes
                [* ]To whom correspondence should be addressed.
                Article
                btx235
                10.1093/bioinformatics/btx235
                5870760
                28881970
                fd72a357-fb3f-4ebc-8f88-94e3684ec358
                © The Author 2017. Published by Oxford University Press.

                This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License ( http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com

                History
                Page count
                Pages: 8
                Funding
                Funded by: National Institutes of Health 10.13039/100000002
                Award ID: R01HG007104
                Funded by: Israel Science Foundation 10.13039/501100003977
                Categories
                Ismb/Eccb 2017: The 25th Annual Conference Intelligent Systems for Molecular Biology Held Jointly with the 16th Annual European Conference on Computational Biology, Prague, Czech Republic, July 21–25, 2017
                Hitseq

                Bioinformatics & Computational biology
                Bioinformatics & Computational biology

                Comments

                Comment on this article