Improving the performance of minimizers and winnowing schemes

There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

Abstract

Motivation: The minimizers scheme is a method for selecting k-mers from sequences. It is used in many bioinformatics software tools to bin comparable sequences or to sample a sequence in a deterministic fashion at approximately regular intervals, in order to reduce memory consumption and processing time. Although very useful, the minimizers selection procedure has undesirable behaviors (e.g. too many k-mers are selected when processing certain sequences). Some of these problems were already known to the authors of the minimizers technique, and the natural lexicographic ordering of k-mers used by minimizers was recognized as their origin. Many software tools using minimizers employ ad hoc variations of the lexicographic order to alleviate those issues.

Results: We provide an in-depth analysis of the effect of k-mer ordering on the performance of the minimizers technique. By using small universal hitting sets (a recently defined concept), we show how to significantly improve the performance of minimizers and avoid some of its worse behaviors. Based on these results, we encourage bioinformatics software developers to use an ordering based on a universal hitting set or, if not possible, a randomized ordering, rather than the lexicographic order. This analysis also settles negatively a conjecture (by Schleimer et al.) on the expected density of minimizers in a random sequence.

Availability and Implementation: The software used for this analysis is available on GitHub: https://github.com/gmarcais/minimizers.git.

Contact: gmarcais@ 123456cs.cmu.edu or carlk@ 123456cs.cmu.edu

Related collections

Most cited references 4

Record: found
Abstract: found
Article: not found

Reducing storage requirements for biological sequence comparison.

B Hunt, Stephen Mount, James A. Yorke … (2004)

Comparison of nucleic acid and protein sequences is a fundamental tool of modern bioinformatics. A dominant method of such string matching is the 'seed-and-extend' approach, in which occurrences of short subsequences called 'seeds' are used to search for potentially longer matches in a large database of sequences. Each such potential match is then checked to see if it extends beyond the seed. To be effective, the seed-and-extend approach needs to catalogue seeds from virtually every substring in the database of search strings. Projects such as mammalian genome assemblies and large-scale protein matching, however, have such large sequence databases that the resulting list of seeds cannot be stored in RAM on a single computer. This significantly slows the matching process. We present a simple and elegant method in which only a small fraction of seeds, called 'minimizers', needs to be stored. Using minimizers can speed up string-matching computations by a large factor while missing only a small fraction of the matches found using all seeds.

0 comments Cited 154 times – based on 0 reviews      Review now

Bookmark

Record: found
Abstract: found
Article: not found

KMC 2: fast and resource-frugal k-mer counting.

Sebastian Deorowicz, Marek Kokot, Szymon Grabowski … (2015)

Building the histogram of occurrences of every k-symbol long substring of nucleotide data is a standard step in many bioinformatics applications, known under the name of k-mer counting. Its applications include developing de Bruijn graph genome assemblers, fast multiple sequence alignment and repeat detection. The tremendous amounts of NGS data require fast algorithms for k-mer counting, preferably using moderate amounts of memory.

0 comments Cited 70 times – based on 0 reviews      Review now

Bookmark

Record: found
Abstract: not found
Article: not found

A combinatorial problem

G. J. de Bruijn, Nicolass de Bruijn, NG De Bruijn … (1946)

0 comments Cited 28 times – based on 0 reviews      Review now

Bookmark

All references

Author and article information

Journal

Journal ID (nlm-ta): Bioinformatics

Journal ID (iso-abbrev): Bioinformatics

Journal ID (publisher-id): bioinformatics

Title: Bioinformatics

Publisher: Oxford University Press

ISSN (Print): 1367-4803

ISSN (Electronic): 1367-4811

Publication date (Print): 15 July 2017

Publication date (Electronic): 12 July 2017

Publication date PMC-release: 12 July 2017

Volume: 33

Issue: 14

Pages: i110-i117

Affiliations

[1 ]Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA

[2 ]Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel

[3 ]Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, MA, USA

Author notes

[* ]To whom correspondence should be addressed.

Article

Publisher ID: btx235

DOI: 10.1093/bioinformatics/btx235

PMC ID: 5870760

PubMed ID: 28881970

SO-VID: fd72a357-fb3f-4ebc-8f88-94e3684ec358

License:

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License ( http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com

History

Page count

Pages: 8

Funding

Funded by: National Institutes of Health 10.13039/100000002

Award ID: R01HG007104

Funded by: Israel Science Foundation 10.13039/501100003977

Comments

Comment on this article

scite_

Cited by 25

See all cited by

Most referenced authors 92

See all reference authors

Improving the performance of minimizers and winnowing schemes

Read this article at

Abstract

Related collections

Genetoberfest

Most cited references 4

Reducing storage requirements for biological sequence comparison.

KMC 2: fast and resource-frugal k-mer counting.

A combinatorial problem

Author and article information

Journal

Affiliations

Author notes

Article

History

Page count

Funding

Categories

Comments

Comment on this article

Similar content 58

Cited by 25

Most referenced authors 92