      The khmer software package: enabling efficient nucleotide sequence analysis

      research-article
      F1000Research
      bioinformatics, dna sequencing analysis, k-mer, kmer, khmer, online, low-memory, streaming

          Abstract

          The khmer package is a freely available software library for working efficiently with fixed-length DNA words, or k-mers. khmer provides implementations of a probabilistic k-mer counting data structure, a compressible De Bruijn graph representation, De Bruijn graph partitioning, and digital normalization. khmer is implemented in C++ and Python, and is freely available under the BSD license at https://github.com/dib-lab/khmer/.
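
          To make "probabilistic k-mer counting" concrete, here is a minimal sketch in plain Python. It assumes a Count-Min Sketch as the counting structure, which is one common choice for this kind of problem. The names `kmers` and `CountMinSketch` and the default table sizes are hypothetical, chosen for illustration; this is not khmer's API or implementation.

```python
import hashlib


def kmers(seq, k):
    """Yield every overlapping k-mer (substring of length k) in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


class CountMinSketch:
    """Approximate counter: a count may be overestimated, never underestimated.

    Illustrative only; class name and default sizes are assumptions.
    """

    def __init__(self, width=1009, depth=4):
        self.width = width
        self.depth = depth
        # One fixed-size table of counters per hash function.
        self.tables = [[0] * width for _ in range(depth)]

    def _slots(self, item):
        # Derive one independent-looking hash per row by salting with the row index.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=bytes([row])
            ).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def add(self, item):
        for row, col in self._slots(item):
            self.tables[row][col] += 1

    def count(self, item):
        # Collisions can only inflate a counter, so the minimum is the best estimate.
        return min(self.tables[row][col] for row, col in self._slots(item))
```

          Memory use is fixed by `width` and `depth` regardless of how many distinct k-mers are seen, which is the trade the probabilistic approach makes: bounded memory in exchange for occasional overcounts.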

          Most cited references (9)

          Tackling soil diversity with the assembly of large, complex metagenomes.

          The large volumes of sequencing data required to sample deeply the microbial communities of complex environments pose new challenges to sequence analysis. De novo metagenomic assembly effectively reduces the total amount of data to be analyzed but requires substantial computational resources. We combine two preassembly filtering approaches--digital normalization and partitioning--to generate previously intractable large metagenome assemblies. Using a human-gut mock community dataset, we demonstrate that these methods result in assemblies nearly identical to assemblies from unprocessed data. We then assemble two large soil metagenomes totaling 398 billion bp (equivalent to 88,000 Escherichia coli genomes) from matched Iowa corn and native prairie soils. The resulting assembled contigs could be used to identify molecular interactions and reaction networks of known metabolic pathways using the Kyoto Encyclopedia of Genes and Genomes Orthology database. Nonetheless, more than 60% of predicted proteins in assemblies could not be annotated against known databases. Many of these unknown proteins were abundant in both corn and prairie soils, highlighting the benefits of assembly for the discovery and characterization of novelty in soil biodiversity. Moreover, 80% of the sequencing data could not be assembled because of low coverage, suggesting that considerably more sequencing data are needed to characterize the functional content of soil.
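
          The digital normalization step mentioned above can be sketched as follows. The sketch assumes the usual description of the technique: a read is kept only while the median abundance of its k-mers, among reads kept so far, is still below a coverage cutoff. The function name, the default `k` and `cutoff`, and the use of an exact `Counter` in place of a probabilistic counter are all assumptions for illustration, not the authors' implementation.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Yield every overlapping k-mer (substring of length k) in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def digital_normalization(reads, k=4, cutoff=2):
    """Return the reads kept after a single streaming pass.

    An exact Counter stands in for a memory-efficient probabilistic counter.
    """
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = list(kmers(read, k))
        if not read_kmers:
            continue  # read is shorter than k, nothing to estimate from
        # Estimated coverage of this read = median count of its k-mers so far.
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)  # only kept reads contribute to counts
    return kept
```

          Because each read is examined once and only the counts are retained, the pass is streaming: highly redundant reads are dropped early while reads from low-coverage regions are retained.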

            SeqAn An efficient, generic C++ library for sequence analysis

            Background: The use of novel algorithmic techniques is pivotal to many important problems in life science. For example the sequencing of the human genome [1] would not have been possible without advanced assembly algorithms. However, owing to the high speed of technological progress and the urgent need for bioinformatics tools, there is a widening gap between state-of-the-art algorithmic techniques and the actual algorithmic components of tools that are in widespread use.

            Results: To remedy this trend we propose the use of SeqAn, a library of efficient data types and algorithms for sequence analysis in computational biology. SeqAn comprises implementations of existing, practical state-of-the-art algorithmic components to provide a sound basis for algorithm testing and development. In this paper we describe the design and content of SeqAn and demonstrate its use by giving two examples. In the first example we show an application of SeqAn as an experimental platform by comparing different exact string matching algorithms. The second example is a simple version of the well-known MUMmer tool rewritten in SeqAn. Results indicate that our implementation is very efficient and versatile to use.

            Conclusion: We anticipate that SeqAn greatly simplifies the rapid development of new bioinformatics tools by providing a collection of readily usable, well-designed algorithmic components which are fundamental for the field of sequence analysis. This leverages not only the implementation of new algorithms, but also enables a sound analysis and comparison of existing algorithms.

              Scaling metagenome sequence assembly with probabilistic de Bruijn graphs

              Deep sequencing has enabled the investigation of a wide range of environmental microbial ecosystems, but the high memory requirements for de novo assembly of short-read shotgun sequencing data from these complex populations are an increasingly large practical barrier. Here we introduce a memory-efficient graph representation with which we can analyze the k-mer connectivity of metagenomic samples. The graph representation is based on a probabilistic data structure, a Bloom filter, that allows us to efficiently store assembly graphs in as little as 4 bits per k-mer, albeit inexactly. We show that this data structure accurately represents DNA assembly graphs in low memory. We apply this data structure to the problem of partitioning assembly graphs into components as a prelude to assembly, and show that this reduces the overall memory requirements for de novo assembly of metagenomes. On one soil metagenome assembly, this approach achieves a nearly 40-fold decrease in the maximum memory requirements for assembly. This probabilistic graph representation is a significant theoretical advance in storing assembly graphs and also yields immediate leverage on metagenomic assembly.
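
              As a rough illustration of the idea in this abstract, the sketch below stores k-mers in a Bloom filter and recovers graph edges implicitly by testing which one-base extensions of a k-mer are present. The class and function names, the filter size, and the hash construction are assumptions made for the example; reverse complements are ignored for brevity, and this is not the paper's implementation.

```python
import hashlib


class BloomFilter:
    """Set membership with possible false positives and no false negatives."""

    def __init__(self, num_bits=8192, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # One salted hash per hash function, mapped onto a bit position.
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=bytes([i])
            ).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def load_kmers(bloom, seq, k):
    """Insert every overlapping k-mer of seq into the filter (graph nodes)."""
    for i in range(len(seq) - k + 1):
        bloom.add(seq[i:i + k])


def successors(bloom, kmer):
    """Return stored k-mers that overlap kmer by k-1 bases on the right.

    Edges are never stored: they are inferred by querying the four
    possible one-base extensions, so a false positive adds a spurious edge.
    """
    return [(kmer[1:] + base) for base in "ACGT" if (kmer[1:] + base) in bloom]
```

              Only the bit array is kept in memory, so the cost per k-mer is set by the chosen filter size rather than by the length of the k-mer itself; the price is that some reported neighbors may not exist in the data.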

                Author and article information

                Journal
                F1000Research (F1000Res)
                F1000Research (London, UK)
                ISSN: 2046-1402
                25 September 2015
                2015; 4: 900
                Affiliations
                [1 ]Microbiology and Molecular Genetics, Michigan State University, East Lansing, MI, USA
                [2 ]Department of Plant, Soil and Microbial Sciences, Michigan State University, East Lansing, MI, USA
                [3 ]Population Health and Reproduction, University of California, Davis, Davis, CA, USA
                [4 ]Department of Biomedical Engineering, Oregon Health and Science University, Portland, OR, USA
                [5 ]Biology Department, San Jose State University, San Jose, CA, USA
                [6 ]School of Life Sciences and The Biodesign Institute, Arizona State University, Tempe, AZ, USA
                [7 ]Genetics, Michigan State University, East Lansing, MI, USA
                [8 ]Computational and Evolutionary Biology, Faculty of Life Sciences, University of Manchester, Manchester, UK
                [9 ]Micron Technology, Seattle, WA, USA
                [10 ]Invitae, San Francisco, CA, USA
                [11 ]Computer Science and Engineering, Michigan State University, East Lansing, MI, USA
                [12 ]Independent Researcher, Munich, Germany
                [13 ]Mathematics Institute, University of Warwick, Warwick, UK
                [14 ]Eastlake Data, Seattle, WA, USA
                [15 ]Graduate Program, University of Maryland, College Park, MD, USA
                [16 ]Athinoula A. Martinos Center for Biomedical Imaging, Department of Radiology, Massachusetts General Hospital, Charlestown, MA, USA
                [17 ]Independent Researcher, Seattle, WA, USA
                [18 ]Center for Microbial Ecology, Michigan State University, East Lansing, MI, USA
                [19 ]Department of Agricultural and Biosystems Engineering, Iowa State University, Ames, IA, USA
                [20 ]Department of Biology, University of Utah, Salt Lake City, UT, USA
                [21 ]ConSol* Software GmbH, Munchen, Germany
                [22 ]Independent Researcher, Sydney, Australia
                [23 ]Verdematics, Fremont, CA, USA
                [24 ]Independent Researcher, San Francisco, CA, USA
                [25 ]Clinical Pathology, Mansoura University, Mansoura, Egypt
                [26 ]Addgene, Cambridge, MA, USA
                [27 ]Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI, USA
                [28 ]ARC Centre of Excellence in Plant Energy Biology, The Australian National University, Canberra, ACT, Australia
                [29 ]BEACON Center, Michigan State University, East Lansing, MI, USA
                [30 ]Independent Researcher, New Orleans, LA, USA
                [31 ]Centre for Ecological and Evolutionary Synthesis, Dept. of Biosciences, University of Oslo, Oslo, Norway
                [32 ]Department of Computer Science, Rio Piedras Campus, University of Puerto Rico, San Juan, Puerto Rico
                [33 ]Biochemistry, St. Louis College of Pharmacy, St. Louis, MO, USA
                [34 ]Crop and Soil Sciences, Cornell University, Ithaca, NY, USA
                [35 ]Department of Bioengineering, UC Berkeley, Berkeley, CA, USA
                [36 ]Department of Molecular Biology and Genetics, Cornell University, Ithaca, NY, USA
                [37 ]Data Visualization, Newline Technical Innovations, Windsor, CO, USA
                [38 ]Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN, USA
                [39 ]Ontario Institute for Cancer Research, Toronto, ON, Canada
                [40 ]Computer Science, University of Toronto, Toronto, ON, Canada
                [41 ]Division of Geological and Planetary Sciences, California Institute of Technology, Pasadena, CA, USA
                [42 ]Dept of Physics and Dept of Materials, Imperial College London, London, UK
                [43 ]Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
                [44 ]Department of Biology, Indiana University, Bloomington, IN, USA
                [45 ]Bioinformatics and Computational Biology Graduate Program, Iowa State University, Ames, IA, USA
                [46 ]Chemical Engineering & Materials Science, Michigan State University, East Lansing, MI, USA
                [47 ]The New York Eye and Ear Infirmary of Mount Sinai, New York, NY, USA
                [48 ]Independent Researcher, Providence, RI, USA
                [49 ]Mathematics and Computer Science Division, Argonne National Laboratory, Lemont, IL, USA
                [50 ]Department of Genetics, Smurfit Institute, Trinity College Dublin, Dublin, Ireland
                [51 ]Independent Researcher, Boston, MA, USA
                [1 ]Computer Science Department, Stony Brook University, Stony Brook, NY, USA
                [1 ]Computation Institute, University of Chicago, Chicago, IL, USA
                University of Chicago, USA
                [1 ]European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK
                Author notes

                CTB is the primary investigator for the khmer software package. MRC is the lead software developer from July 2013 onwards. Many significant components of khmer have their own paper describing them (see “Use Cases”, above). The remaining authors each have one or more Git commits in their name.

                Competing interests: No competing interests were disclosed.

                Article
                DOI: 10.12688/f1000research.6924.1
                PMCID: PMC4608353
                PMID: 26535114
                3b958d96-9d06-45e0-b4b9-d4ff11807249
                Copyright: © 2015 Crusoe MR et al.

                This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

                History
                : 25 September 2015
                Funding
                Funded by: USDA NIFA
                Award ID: 2010-65205-20361
                Funded by: National Institutes of Health
                Award ID: R01HG007513
                Funded by: Gordon and Betty Moore Foundation
                Award ID: GBMF4551
                khmer development has largely been supported by AFRI Competitive Grant no. 2010-65205-20361 from the USDA NIFA, and is now funded by the National Human Genome Research Institute of the National Institutes of Health under Award Number R01HG007513, as well as by the Gordon and Betty Moore Foundation under Award Number GBMF4551, all to CTB.
                I confirm that the funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
                Categories
                Software Tool Article
                Articles
                Bioinformatics
