The khmer software package: enabling efficient nucleotide sequence analysis

Abstract

The khmer package is a freely available software library for working efficiently with fixed length DNA words, or k-mers. khmer provides implementations of a probabilistic k-mer counting data structure, a compressible De Bruijn graph representation, De Bruijn graph partitioning, and digital normalization. khmer is implemented in C++ and Python, and is freely available under the BSD license at https://github.com/dib-lab/khmer/.
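As a rough illustration of what a probabilistic k-mer counter can look like, the Python sketch below splits a sequence into overlapping k-mers and counts them with a Count-Min Sketch, one well-known structure of this kind. It does not use khmer's API: the names `kmers` and `CountMinSketch`, the table sizes, and the use of salted BLAKE2b as the hash function are all assumptions made for the example.

```python
import hashlib


def kmers(seq, k):
    """Yield every overlapping k-mer (fixed-length DNA word) in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


class CountMinSketch:
    """Approximate counter (hypothetical, not khmer's implementation).

    Counts may be overestimated when items collide, but are never
    underestimated.
    """

    def __init__(self, width=1009, depth=4):
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _slots(self, item):
        # One hash per table, obtained by salting BLAKE2b with the row index.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=bytes([row])
            ).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def add(self, item):
        for row, col in self._slots(item):
            self.tables[row][col] += 1

    def get(self, item):
        # The smallest of the per-table counters is the least inflated one.
        return min(self.tables[row][col] for row, col in self._slots(item))


sketch = CountMinSketch()
for kmer in kmers("ACGTACGTAC", 4):
    sketch.add(kmer)
print(sketch.get("ACGT"))  # at least 2, the true count of ACGT
```

Memory use is fixed by `width` and `depth` rather than by the number of distinct k-mers, which is the trade-off such structures make: smaller tables save memory at the cost of more frequent overcounts.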

Most cited references 9

Tackling soil diversity with the assembly of large, complex metagenomes.

Adina Chuang Howe, Janet Jansson, Stephanie A. Malfatti … (2014)

The large volumes of sequencing data required to sample deeply the microbial communities of complex environments pose new challenges to sequence analysis. De novo metagenomic assembly effectively reduces the total amount of data to be analyzed but requires substantial computational resources. We combine two preassembly filtering approaches--digital normalization and partitioning--to generate previously intractable large metagenome assemblies. Using a human-gut mock community dataset, we demonstrate that these methods result in assemblies nearly identical to assemblies from unprocessed data. We then assemble two large soil metagenomes totaling 398 billion bp (equivalent to 88,000 Escherichia coli genomes) from matched Iowa corn and native prairie soils. The resulting assembled contigs could be used to identify molecular interactions and reaction networks of known metabolic pathways using the Kyoto Encyclopedia of Genes and Genomes Orthology database. Nonetheless, more than 60% of predicted proteins in assemblies could not be annotated against known databases. Many of these unknown proteins were abundant in both corn and prairie soils, highlighting the benefits of assembly for the discovery and characterization of novelty in soil biodiversity. Moreover, 80% of the sequencing data could not be assembled because of low coverage, suggesting that considerably more sequencing data are needed to characterize the functional content of soil.

SeqAn An efficient, generic C++ library for sequence analysis

Andreas Gogol-Döring, David Weese, Tobias Rausch … (2008)

Background: The use of novel algorithmic techniques is pivotal to many important problems in life science. For example the sequencing of the human genome [1] would not have been possible without advanced assembly algorithms. However, owing to the high speed of technological progress and the urgent need for bioinformatics tools, there is a widening gap between state-of-the-art algorithmic techniques and the actual algorithmic components of tools that are in widespread use. Results: To remedy this trend we propose the use of SeqAn, a library of efficient data types and algorithms for sequence analysis in computational biology. SeqAn comprises implementations of existing, practical state-of-the-art algorithmic components to provide a sound basis for algorithm testing and development. In this paper we describe the design and content of SeqAn and demonstrate its use by giving two examples. In the first example we show an application of SeqAn as an experimental platform by comparing different exact string matching algorithms. The second example is a simple version of the well-known MUMmer tool rewritten in SeqAn. Results indicate that our implementation is very efficient and versatile to use. Conclusion: We anticipate that SeqAn greatly simplifies the rapid development of new bioinformatics tools by providing a collection of readily usable, well-designed algorithmic components which are fundamental for the field of sequence analysis. This leverages not only the implementation of new algorithms, but also enables a sound analysis and comparison of existing algorithms.

Scaling metagenome sequence assembly with probabilistic de Bruijn graphs

Jason Pell, Adina Howe (2012)

Deep sequencing has enabled the investigation of a wide range of environmental microbial ecosystems, but the high memory requirements for de novo assembly of short-read shotgun sequencing data from these complex populations are an increasingly large practical barrier. Here we introduce a memory-efficient graph representation with which we can analyze the k-mer connectivity of metagenomic samples. The graph representation is based on a probabilistic data structure, a Bloom filter, that allows us to efficiently store assembly graphs in as little as 4 bits per k-mer, albeit inexactly. We show that this data structure accurately represents DNA assembly graphs in low memory. We apply this data structure to the problem of partitioning assembly graphs into components as a prelude to assembly, and show that this reduces the overall memory requirements for de novo assembly of metagenomes. On one soil metagenome assembly, this approach achieves a nearly 40-fold decrease in the maximum memory requirements for assembly. This probabilistic graph representation is a significant theoretical advance in storing assembly graphs and also yields immediate leverage on metagenomic assembly.

Author and article information

Journal

Journal ID (nlm-ta): F1000Res

Journal ID (iso-abbrev): F1000Res

Journal ID (pmc): F1000Research

Title: F1000Research

Publisher: F1000Research (London, UK)

ISSN (Electronic): 2046-1402

Publication date (Electronic): 25 September 2015

Publication date (Collection): 2015

Volume: 4

Electronic Location Identifier: 900

Affiliations

[1] Microbiology and Molecular Genetics, Michigan State University, East Lansing, MI, USA

[2] Department of Plant, Soil and Microbial Sciences, Michigan State University, East Lansing, MI, USA

[3] Population Health and Reproduction, University of California, Davis, Davis, CA, USA

[4] Department of Biomedical Engineering, Oregon Health and Science University, Portland, OR, USA

[5] Biology Department, San Jose State University, San Jose, CA, USA

[6] School of Life Sciences and The Biodesign Institute, Arizona State University, Tempe, AZ, USA

[7] Genetics, Michigan State University, East Lansing, MI, USA

[8] Computational and Evolutionary Biology, Faculty of Life Sciences, University of Manchester, Manchester, UK

[9] Micron Technology, Seattle, WA, USA

[10] Invitae, San Francisco, CA, USA

[11] Computer Science and Engineering, Michigan State University, East Lansing, MI, USA

[12] Independent Researcher, Munich, Germany

[13] Mathematics Institute, University of Warwick, Warwick, UK

[14] Eastlake Data, Seattle, WA, USA

[15] Graduate Program, University of Maryland, College Park, MD, USA

[16] Athinoula A. Martinos Center for Biomedical Imaging, Department of Radiology, Massachusetts General Hospital, Charlestown, MA, USA

[17] Independent Researcher, Seattle, WA, USA

[18] Center for Microbial Ecology, Michigan State University, East Lansing, MI, USA

[19] Department of Agricultural and Biosystems Engineering, Iowa State University, Ames, IA, USA

[20] Department of Biology, University of Utah, Salt Lake City, UT, USA

[21] ConSol* Software GmbH, Munchen, Germany

[22] Independent Researcher, Sydney, Australia

[23] Verdematics, Fremont, CA, USA

[24] Independent Researcher, San Francisco, CA, USA

[25] Clinical Pathology, Mansoura University, Mansoura, Egypt

[26] Addgene, Cambridge, MA, USA

[27] Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI, USA

[28] ARC Centre of Excellence in Plant Energy Biology, The Australian National University, Canberra, ACT, Australia

[29] BEACON Center, Michigan State University, East Lansing, MI, USA

[30] Independent Researcher, New Orleans, LA, USA

[31] Centre for Ecological and Evolutionary Synthesis, Dept. of Biosciences, University of Oslo, Oslo, Norway

[32] Department of Computer Science, Rio Piedras Campus, University of Puerto Rico, San Juan, Puerto Rico

[33] Biochemistry, St. Louis College of Pharmacy, St. Louis, MO, USA

[34] Crop and Soil Sciences, Cornell University, Ithaca, NY, USA

[35] Department of Bioengineering, UC Berkeley, Berkeley, CA, USA

[36] Department of Molecular Biology and Genetics, Cornell University, Ithaca, NY, USA

[37] Data Visualization, Newline Technical Innovations, Windsor, CO, USA

[38] Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN, USA

[39] Ontario Institute for Cancer Research, Toronto, ON, Canada

[40] Computer Science, University of Toronto, Toronto, ON, Canada

[41] Division of Geological and Planetary Sciences, California Institute of Technology, Pasadena, CA, USA

[42] Dept of Physics and Dept of Materials, Imperial College London, London, UK

[43] Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA

[44] Department of Biology, Indiana University, Bloomington, IN, USA

[45] Bioinformatics and Computational Biology Graduate Program, Iowa State University, Ames, IA, USA

[46] Chemical Engineering & Materials Science, Michigan State University, East Lansing, MI, USA

[47] The New York Eye and Ear Infirmary of Mount Sinai, New York, NY, USA

[48] Independent Researcher, Providence, RI, USA

[49] Mathematics and Computer Science Division, Argonne National Laboratory, Lemont, IL, USA

[50] Department of Genetics, Smurfit Institute, Trinity College Dublin, Dublin, Ireland

[51] Independent Researcher, Boston, MA, USA

[1] Computer Science Department, Stony Brook University, Stony Brook, NY, USA

[1] Computation Institute, University of Chicago, Chicago, IL, USA

University of Chicago, USA

[1] European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK

Author notes

[a] titus@idyll.org

CTB is the primary investigator for the khmer software package. MRC is the lead software developer from July 2013 onwards. Many significant components of khmer have their own paper describing them (see “Use Cases”, above). The remaining authors each have one or more Git commits in their name.

Competing interests: No competing interests were disclosed.

Article

DOI: 10.12688/f1000research.6924.1

PMC ID: 4608353

PubMed ID: 26535114

SO-VID: 3b958d96-9d06-45e0-b4b9-d4ff11807249

License:

This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

History

Date accepted: 25 September 2015

Funding

Funded by: USDA NIFA

Award ID: 2010-65205-20361

Funded by: National Institutes of Health

Award ID: R01HG007513

Funded by: Gordon and Betty Moore Foundation

Award ID: GBMF4551

khmer development has largely been supported by AFRI Competitive Grant no. 2010-65205-20361 from the USDA NIFA, and is now funded by the National Human Genome Research Institute of the National Institutes of Health under Award Number R01HG007513, as well as by the Gordon and Betty Moore Foundation under Award Number GBMF4551, all to CTB.

I confirm that the funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
