      Waste Not, Want Not: Why Rarefying Microbiome Data Is Inadmissible

      PLoS Computational Biology

      Public Library of Science


          Abstract

          Current practice in the normalization of microbiome count data is inefficient in the statistical sense. For apparently historical reasons, the common approach is either to use simple proportions (which does not address heteroscedasticity) or to use rarefying of counts, even though both of these approaches are inappropriate for detection of differentially abundant species. Well-established statistical theory is available that simultaneously accounts for library size differences and biological variability using an appropriate mixture model. Moreover, specific implementations for DNA sequencing read count data (based on a Negative Binomial model for instance) are already available in RNA-Seq focused R packages such as edgeR and DESeq. Here we summarize the supporting statistical theory and use simulations and empirical data to demonstrate substantial improvements provided by a relevant mixture model framework over simple proportions or rarefying. We show how both proportions and rarefied counts result in a high rate of false positives in tests for species that are differentially abundant across sample classes. Regarding microbiome sample-wise clustering, we also show that the rarefying procedure often discards samples that can be accurately clustered by alternative methods. We further compare different Negative Binomial methods with a recently-described zero-inflated Gaussian mixture, implemented in a package called metagenomeSeq. We find that metagenomeSeq performs well when there is an adequate number of biological replicates, but it nevertheless tends toward a higher false positive rate. Based on these results and well-established statistical theory, we advocate that investigators avoid rarefying altogether. We have provided microbiome-specific extensions to these tools in the R package, phyloseq.
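The heteroscedasticity point in the abstract can be illustrated with a small simulation. The following is a minimal sketch, not the authors' code: it assumes the standard Gamma-Poisson formulation of the Negative Binomial, in which a count with mean `m` and dispersion `d` has variance `m + d * m**2`. The function name `gamma_poisson_counts` and all parameter values are hypothetical, chosen only for illustration.

```python
import numpy as np


def gamma_poisson_counts(proportion, library_size, dispersion, n, rng):
    """Draw n counts from a Gamma-Poisson (Negative Binomial) mixture.

    The expected count is library_size * proportion. The Gamma layer models
    biological variability between replicates; the Poisson layer models
    sequencing (sampling) noise. All names here are illustrative assumptions.
    """
    mean = library_size * proportion
    shape = 1.0 / dispersion
    # Gamma(shape, scale) has mean shape * scale = mean
    # and variance shape * scale**2 = dispersion * mean**2.
    rates = rng.gamma(shape, scale=mean * dispersion, size=n)
    return rng.poisson(rates)


rng = np.random.default_rng(0)
proportion, dispersion, n = 0.01, 0.05, 100_000
small_size, large_size = 1_000, 100_000

small = gamma_poisson_counts(proportion, small_size, dispersion, n, rng)
large = gamma_poisson_counts(proportion, large_size, dispersion, n, rng)

# Same true proportion, but the variance of the observed proportion
# depends on library size: small libraries are noisier.
var_prop_small = (small / small_size).var()
var_prop_large = (large / large_size).var()

print("count mean / variance (small library):", small.mean(), small.var())
print("variance of proportion, small library:", var_prop_small)
print("variance of proportion, large library:", var_prop_large)
```

Under these assumptions the variance of the observed proportion is roughly `proportion / library_size + dispersion * proportion**2`, so dividing by library size equalizes the means but not the variances across samples.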

          Author Summary

          The term microbiome refers to the ecosystem of microbes that live in a defined environment. The decreasing cost and increasing speed of DNA sequencing technology have recently provided scientists with affordable and timely access to the genes and genomes of microbiomes that inhabit our planet and even our own bodies. In these investigations many microbiome samples are sequenced at the same time on the same DNA sequencing machine, but often result in total numbers of sequences per sample that are vastly different. The common procedure for addressing this difference in sequencing effort across samples – different library sizes – is to either (1) base analyses on the proportional abundance of each species in a library, or (2) rarefy, that is, throw away sequences from the larger libraries so that all libraries have the same, smallest size. We show that both of these normalization methods can work when comparing obviously-different whole microbiomes, but that neither method works well when comparing the relative proportions of each bacterial species across microbiome samples. We show that alternative methods based on a statistical mixture model perform much better and can be easily adapted from a separate biological sub-discipline, called RNA-Seq analysis.
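The two normalization procedures described above can be sketched as follows. This is an illustrative sketch only, not the phyloseq implementation: the helper names `to_proportions` and `rarefy` and the toy count table `otu` are assumptions made for this example, and rarefying is modeled as random subsampling of reads without replacement down to the smallest library size.

```python
import numpy as np


def to_proportions(counts):
    """Procedure (1): divide each sample (column) by its library size."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=0, keepdims=True)


def rarefy(counts, depth=None, seed=0):
    """Procedure (2): subsample every sample to a common depth.

    Reads are drawn without replacement. If depth is None, the smallest
    library size is used. A sample with fewer than `depth` reads would
    raise an error here; real pipelines typically discard such samples.
    """
    counts = np.asarray(counts, dtype=np.int64)
    rng = np.random.default_rng(seed)
    if depth is None:
        depth = int(counts.sum(axis=0).min())
    n_taxa, n_samples = counts.shape
    out = np.zeros_like(counts)
    for j in range(n_samples):
        # Expand the column into one taxon label per read, then subsample.
        reads = np.repeat(np.arange(n_taxa), counts[:, j])
        kept = rng.choice(reads, size=depth, replace=False)
        out[:, j] = np.bincount(kept, minlength=n_taxa)
    return out


# Hypothetical table: 3 taxa (rows) x 3 samples (columns),
# with library sizes of 100, 1,000 and 10,000 reads.
otu = np.array([
    [50, 900, 4000],
    [30,  80, 5000],
    [20,  20, 1000],
])

props = to_proportions(otu)
rarefied = rarefy(otu)
discarded = 1.0 - rarefied.sum() / otu.sum()

print(props.round(3))
print(rarefied)
print(f"fraction of reads discarded by rarefying: {discarded:.3f}")
```

In this toy table every library is cut down to 100 reads, so most of the sequenced reads are thrown away. That loss is the inefficiency the summary refers to.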


                Author and article information

                Contributors
                Role: Editor
                Journal
                PLoS Computational Biology (PLoS Comput Biol)
                Publisher: Public Library of Science (San Francisco, USA)
                ISSN: 1553-734X, 1553-7358
                Published: 3 April 2014
                Volume 10, Issue 4
                Affiliations
                Statistics Department, Stanford University, Stanford, California, United States of America
                Heinrich Heine University, Germany
                Author notes

                The authors have declared that no competing interests exist.

                Conceived and designed the experiments: PJM SH. Performed the experiments: PJM. Analyzed the data: PJM SH. Contributed reagents/materials/analysis tools: PJM SH. Wrote the paper: PJM SH.

                Article
                Manuscript ID: PCOMPBIOL-D-13-01815
                DOI: 10.1371/journal.pcbi.1003531
                PMC ID: 3974642
                PubMed ID: 24699258

                This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

                Page count
                Pages: 12
                Funding
                This work was supported by the NIH (http://www.nih.gov) under grant number NIH R01-GM086884. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
                Categories
                Research Article
                Biology and Life Sciences
                Ecology
                Microbial Ecology
                Microbiology
                Medical Microbiology
                Physical Sciences
                Mathematics
                Statistics (Mathematics)
                Biostatistics
                Contingency Tables
                Statistical Methods

                Quantitative & Systems biology
