
Waste Not, Want Not: Why Rarefying Microbiome Data Is Inadmissible


PLoS Computational Biology

Public Library of Science


      Abstract

      Current practice in the normalization of microbiome count data is inefficient in the statistical sense. For apparently historical reasons, the common approach is either to use simple proportions (which does not address heteroscedasticity) or to use rarefying of counts, even though both of these approaches are inappropriate for detection of differentially abundant species. Well-established statistical theory is available that simultaneously accounts for library size differences and biological variability using an appropriate mixture model. Moreover, specific implementations for DNA sequencing read count data (based on a Negative Binomial model for instance) are already available in RNA-Seq focused R packages such as edgeR and DESeq. Here we summarize the supporting statistical theory and use simulations and empirical data to demonstrate substantial improvements provided by a relevant mixture model framework over simple proportions or rarefying. We show how both proportions and rarefied counts result in a high rate of false positives in tests for species that are differentially abundant across sample classes. Regarding microbiome sample-wise clustering, we also show that the rarefying procedure often discards samples that can be accurately clustered by alternative methods. We further compare different Negative Binomial methods with a recently-described zero-inflated Gaussian mixture, implemented in a package called metagenomeSeq. We find that metagenomeSeq performs well when there is an adequate number of biological replicates, but it nevertheless tends toward a higher false positive rate. Based on these results and well-established statistical theory, we advocate that investigators avoid rarefying altogether. We have provided microbiome-specific extensions to these tools in the R package, phyloseq.
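The mixture model the abstract refers to can be written out as a short worked derivation. This is a sketch of the standard Gamma-Poisson construction of the Negative Binomial, not a transcription of the paper's own equations. The symbols are notation assumed here for illustration: K_ij is the count of species i in sample j, s_j is a library size factor, mu_i is the mean proportion-scale abundance, and phi_i is the dispersion.

```latex
% Assumed notation (not taken from this page): K_{ij}, s_j, \mu_i, \phi_i.
\begin{align}
  K_{ij} \mid \lambda_{ij} &\sim \mathrm{Poisson}(s_j \lambda_{ij}),
  \qquad
  \lambda_{ij} \sim \mathrm{Gamma}\!\left(\text{shape} = \tfrac{1}{\phi_i},\ \text{scale} = \mu_i \phi_i\right) \\
  \mathbb{E}[\lambda_{ij}] &= \mu_i,
  \qquad
  \mathrm{Var}[\lambda_{ij}] = \phi_i \mu_i^2 \\
  \mathbb{E}[K_{ij}] &= s_j \mu_i \\
  \mathrm{Var}[K_{ij}]
    &= \mathbb{E}\big[\mathrm{Var}(K_{ij} \mid \lambda_{ij})\big]
     + \mathrm{Var}\big(\mathbb{E}[K_{ij} \mid \lambda_{ij}]\big)
     = s_j \mu_i + \phi_i\, s_j^2 \mu_i^2
\end{align}
```

The first variance term is Poisson sampling noise, which depends on library size, and the second is biological variability between replicates. Under these assumptions the library size enters the model through s_j, so no reads need to be discarded to make samples comparable.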

      Author Summary

      The term microbiome refers to the ecosystem of microbes that live in a defined environment. The decreasing cost and increasing speed of DNA sequencing technology have recently provided scientists with affordable and timely access to the genes and genomes of the microbiomes that inhabit our planet and even our own bodies. In these investigations many microbiome samples are sequenced at the same time on the same DNA sequencing machine, but the total number of sequences per sample often differs vastly. The common procedure for addressing this difference in sequencing effort across samples (different library sizes) is to either (1) base analyses on the proportional abundance of each species in a library, or (2) rarefy, that is, throw away sequences from the larger libraries so that all libraries have the same, smallest size. We show that both of these normalization methods can work when comparing obviously different whole microbiomes, but that neither method works well when comparing the relative proportions of each bacterial species across microbiome samples. We show that alternative methods based on a statistical mixture model perform much better and can be easily adapted from a separate biological sub-discipline, RNA-Seq analysis.
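The two normalization procedures described in the author summary can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the implementation in phyloseq or any other package. The function names (`to_proportions`, `rarefy`), the toy counts, and the choice to subsample without replacement are all assumptions made for this example.

```python
import random
from collections import Counter


def to_proportions(counts):
    """Divide each taxon count by the sample's library size (total reads)."""
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}


def rarefy(counts, depth, rng):
    """Randomly keep `depth` reads from one sample, without replacement."""
    # Expand the count table into one list entry per sequenced read.
    reads = [taxon for taxon, n in counts.items() for _ in range(n)]
    if depth > len(reads):
        raise ValueError("depth exceeds this sample's library size")
    kept = Counter(rng.sample(reads, depth))
    # Report every original taxon, including those that dropped to zero.
    return {taxon: kept.get(taxon, 0) for taxon in counts}


# Hypothetical toy data: same composition, 100-fold difference in library size.
samples = {
    "sample_A": {"taxon_1": 60, "taxon_2": 30, "taxon_3": 10},
    "sample_B": {"taxon_1": 6000, "taxon_2": 3000, "taxon_3": 1000},
}

rng = random.Random(0)  # fixed seed so the subsampling is repeatable

# Rarefy every library down to the smallest library size.
min_depth = min(sum(c.values()) for c in samples.values())
rarefied = {name: rarefy(c, min_depth, rng) for name, c in samples.items()}

# Simple proportions keep all reads but hide how many reads back each value.
proportions = {name: to_proportions(c) for name, c in samples.items()}

# Number of reads thrown away by rarefying across the whole data set.
total_reads = sum(sum(c.values()) for c in samples.values())
discarded = total_reads - min_depth * len(samples)

print("rarefied:", rarefied)
print("proportions:", proportions)
print("reads discarded by rarefying:", discarded)
```

In this toy data set rarefying keeps 200 of 10,100 reads, and the two samples have identical proportions even though one estimate rests on 100 times as many reads as the other. Neither output carries the information a count-based mixture model would use to weigh the samples differently.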


            Author and article information

            Affiliations
            Statistics Department, Stanford University, Stanford, California, United States of America
            Heinrich Heine University, Germany
            Author notes

            The authors have declared that no competing interests exist.

            Conceived and designed the experiments: PJM SH. Performed the experiments: PJM. Analyzed the data: PJM SH. Contributed reagents/materials/analysis tools: PJM SH. Wrote the paper: PJM SH.

            Journal
            Title: PLoS Computational Biology
            Abbreviated title: PLoS Comput Biol
            Publisher: Public Library of Science (San Francisco, USA)
            ISSN: 1553-734X, 1553-7358
            Issue date: April 2014
            Published: 3 April 2014
            Volume: 10
            Issue: 4
            PMID: 24699258
            PMCID: PMC3974642
            Manuscript ID: PCOMPBIOL-D-13-01815
            DOI: 10.1371/journal.pcbi.1003531

            This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

            Counts
            Pages: 12
            Funding
            This work was supported by the NIH ( http://www.nih.gov) under grant number NIH R01-GM086884. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
            Categories
            Research Article
            Biology and Life Sciences
            Ecology
            Microbial Ecology
            Microbiology
            Medical Microbiology
            Physical Sciences
            Mathematics
            Statistics (Mathematics)
            Biostatistics
            Contingency Tables
            Statistical Methods

            Quantitative & Systems biology
