RefSeq database growth influences the accuracy of k-mer-based lowest common ancestor species identification

Abstract

To determine the role of the database in taxonomic sequence classification, we examine how changes to the database over time influence k-mer-based lowest common ancestor taxonomic classification. We present three major findings: (1) the number of new species added to the NCBI RefSeq database greatly outpaces the number of new genera; (2) as a result, more reads are classified with newer database versions, but fewer are classified at the species level; and (3) Bayesian-based re-estimation mitigates this effect but struggles with novel genomes. These results suggest a need for new classification approaches specially adapted for large databases.
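
The effect described in the abstract can be illustrated with a toy example. The sketch below is not the authors' pipeline or any specific tool's implementation. It is a minimal k-mer lowest common ancestor (LCA) classifier, loosely modeled on how such classifiers are commonly described, and every taxon name, sequence, and function name in it is hypothetical. It shows how adding a second species from the same genus to the database can move a read's assignment from the species level up to the genus level.

```python
from collections import Counter
from functools import reduce
from typing import Dict, List, Optional, Set

# Hypothetical toy taxonomy (child -> parent); not taken from the article.
PARENT: Dict[str, Optional[str]] = {
    "species_A": "genus_X",
    "species_B": "genus_X",
    "genus_X": "root",
    "root": None,
}


def lineage(taxon: str) -> List[str]:
    """Return the path from a taxon up to the root, inclusive."""
    path = []
    node: Optional[str] = taxon
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def lca(a: str, b: str) -> str:
    """Return the lowest common ancestor of two taxa."""
    ancestors_of_a = set(lineage(a))
    for node in lineage(b):
        if node in ancestors_of_a:
            return node
    raise ValueError("taxa share no ancestor")


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_database(genomes: Dict[str, str], k: int) -> Dict[str, str]:
    """Map each k-mer to the LCA of all taxa whose genome contains it."""
    db: Dict[str, str] = {}
    for taxon, seq in genomes.items():
        for kmer in kmers(seq, k):
            db[kmer] = lca(db[kmer], taxon) if kmer in db else taxon
    return db


def classify(read: str, db: Dict[str, str], k: int) -> Optional[str]:
    """Assign a read to the taxon with the best root-to-node hit count.

    Ties are resolved by taking the LCA of the tied taxa. Returns None
    when no k-mer of the read is in the database.
    """
    hits = Counter(db[km] for km in kmers(read, k) if km in db)
    if not hits:
        return None
    # Score each hit taxon by the hits on it and on all of its ancestors.
    scores = {t: sum(hits[a] for a in lineage(t)) for t in hits}
    best = max(scores.values())
    tied = [t for t, s in scores.items() if s == best]
    return reduce(lca, tied)


K = 4
READ = "ACGTACG"  # hypothetical read drawn from species_A

# Older database: only species_A is present.
db_old = build_database({"species_A": "ACGTACGGTT"}, K)
# Newer database: species_B shares the region the read comes from.
db_new = build_database(
    {"species_A": "ACGTACGGTT", "species_B": "ACGTACGCCA"}, K
)

print(classify(READ, db_old, K))  # species_A
print(classify(READ, db_new, K))  # genus_X
```

In this toy setting the read is still classified with the newer database, but only at the genus level, because every k-mer it contains is now shared by two species.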

Most cited references 30

Bracken: estimating species abundance in metagenomics data

Jennifer Lu, Florian P. Breitwieser, Peter Thielen … (2017)

Metagenomic experiments attempt to characterize microbial communities using high-throughput DNA sequencing. Identification of the microorganisms in a sample provides information about the genetic profile, population structure, and role of microorganisms within an environment. Until recently, most metagenomics studies focused on high-level characterization at the level of phyla, or alternatively sequenced the 16S ribosomal RNA gene that is present in bacterial species. As the cost of sequencing has fallen, though, metagenomics experiments have increasingly used unbiased shotgun sequencing to capture all the organisms in a sample. This approach requires a method for estimating abundance directly from the raw read data. Here we describe a fast, accurate new method that computes the abundance at the species level using the reads collected in a metagenomics experiment. Bracken (Bayesian Reestimation of Abundance after Classification with KrakEN) uses the taxonomic assignments made by Kraken, a very fast read-level classifier, along with information about the genomes themselves to estimate abundance at the species level, the genus level, or above. We demonstrate that Bracken can produce accurate species- and genus-level abundance estimates even when a sample contains multiple near-identical species.

Fast Identification and Removal of Sequence Contamination from Genomic and Metagenomic Datasets

Robert Schmieder, Robert Edwards (2011)

High-throughput sequencing technologies have strongly impacted microbiology, providing a rapid and cost-effective way of generating draft genomes and exploring microbial diversity. However, sequences obtained from impure nucleic acid preparations may contain DNA from sources other than the sample. Those sequence contaminations are a serious concern to the quality of the data used for downstream analysis, causing misassembly of sequence contigs and erroneous conclusions. Therefore, the removal of sequence contaminants is a necessary and required step for all sequencing projects. We developed DeconSeq, a robust framework for the rapid, automated identification and removal of sequence contamination in longer-read datasets (150 bp mean read length). DeconSeq is publicly available as standalone and web-based versions. The results can be exported for subsequent analysis, and the databases used for the web-based version are automatically updated on a regular basis. DeconSeq categorizes possible contamination sequences, eliminates redundant hits with higher similarity to non-contaminant genomes, and provides graphical visualizations of the alignment results and classifications. Using DeconSeq, we conducted an analysis of possible human DNA contamination in 202 previously published microbial and viral metagenomes and found possible contamination in 145 (72%) metagenomes with as high as 64% contaminating sequences. This new framework allows scientists to automatically detect and efficiently remove unwanted sequence contamination from their datasets while eliminating critical limitations of current methods. DeconSeq's web interface is simple and user-friendly. The standalone version allows offline analysis and integration into existing data processing pipelines. DeconSeq's results reveal whether the sequencing experiment has succeeded, whether the correct sample was sequenced, and whether the sample contains any sequence contamination from DNA preparation or host. In addition, the analysis of 202 metagenomes demonstrated significant contamination of the non-human associated metagenomes, suggesting that this method is appropriate for screening all metagenomes. DeconSeq is available at http://deconseq.sourceforge.net/.

Bacillus anthracis, Bacillus cereus, and Bacillus thuringiensis--one species on the basis of genetic evidence.

E Helgason, O A Okstad, D A Caugant … (2000)

Bacillus anthracis, Bacillus cereus, and Bacillus thuringiensis are members of the Bacillus cereus group of bacteria, demonstrating widely different phenotypes and pathological effects. B. anthracis causes the acute fatal disease anthrax and is a potential biological weapon due to its high toxicity. B. thuringiensis produces intracellular protein crystals toxic to a wide number of insect larvae and is the most commonly used biological pesticide worldwide. B. cereus is a probably ubiquitous soil bacterium and an opportunistic pathogen that is a common cause of food poisoning. In contrast to the differences in phenotypes, we show by multilocus enzyme electrophoresis and by sequence analysis of nine chromosomal genes that B. anthracis should be considered a lineage of B. cereus. This determination is not only a formal matter of taxonomy but may also have consequences with respect to virulence and the potential of horizontal gene transfer within the B. cereus group.

Author and article information

Contributors

Todd J. Treangen:

ORCID: http://orcid.org/0000-0002-3760-564X

treangen@rice.edu

Journal

Journal ID (nlm-ta): Genome Biol

Journal ID (iso-abbrev): Genome Biol

Title: Genome Biology

Publisher: BioMed Central (London)

ISSN (Print): 1474-7596

ISSN (Electronic): 1474-760X

Publication date (Electronic): 30 October 2018

Publication date PMC-release: 30 October 2018

Publication date Collection: 2018

Volume: 19

Electronic Location Identifier: 165

Affiliations

[1] Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, USA (ISNI 0000 0001 0941 7177; GRID grid.164295.d)

[2] Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, Bethesda, MD, USA (ISNI 0000 0001 2233 9230; GRID grid.280128.1)

[3] Department of Computer Science, Rice University, Houston, TX, USA (ISNI 0000 0004 1936 8278; GRID grid.21940.3e)

Article

Publisher ID: 1554

DOI: 10.1186/s13059-018-1554-6

PMC ID: 6206640

PubMed ID: 30373669

SO-VID: d3bf3094-6a31-4856-ab5e-9d0987e8747d

License:

Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

History

Date received: 2 May 2018

Date accepted: 1 October 2018

Funding

Funded by: Army Research Office (FundRef http://dx.doi.org/10.13039/100000183)

Award ID: W911NF-17-2-0089

Award Recipient: Todd J. Treangen

Custom metadata

ScienceOpen disciplines: Genetics

Keywords: taxonomic classification, reference database, metagenomics, microbiome, comparative analysis, k-mer, lca
