+1 Recommend
0 collections
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      Insights from Characterizing Extinct Human Gut Microbiomes

      Read this article at

          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.


          In an effort to better understand the ancestral state of the human distal gut microbiome, we examine feces retrieved from archaeological contexts (coprolites). To accomplish this, we pyrosequenced the 16S rDNA V3 region from duplicate coprolite samples recovered from three archaeological sites, each representing a different depositional environment: Hinds Cave (∼8000 years B.P.) in the southern United States, Caserones (1600 years B.P.) in northern Chile, and Rio Zape in northern Mexico (1400 years B.P.). Clustering algorithms grouped samples from the same site. Phyletic representation was more similar within sites than between them. A Bayesian approach to source-tracking was used to compare the coprolite data to published data from known sources that include, soil, compost, human gut from rural African children, human gut, oral and skin from US cosmopolitan adults and non-human primate gut. The data from the Hinds Cave samples largely represented unknown sources. The Caserones samples, retrieved directly from natural mummies, matched compost in high proportion. A substantial and robust proportion of Rio Zape data was predicted to match the gut microbiome found in traditional rural communities, with more minor matches to other sources. One of the Rio Zape samples had taxonomic representation consistent with a child. To provide an idealized scenario for sample preservation, we also applied source tracking to previously published data for Ötzi the Iceman and a soldier frozen for 93 years on a glacier. Overall these studies reveal that human microbiome data has been preserved in some coprolites, and these preserved human microbiomes match more closely to those from the rural communities than to those from cosmopolitan communities. These results suggest that the modern cosmopolitan lifestyle resulted in a dramatic change to the human gut microbiome.

          Related collections

          Most cited references 15

          • Record: found
          • Abstract: found
          • Article: found
          Is Open Access

          Accuracy and quality of massively parallel DNA pyrosequencing

          Background Direct interrogation of microbial genomes based upon comparisons of orthologous gene sequences or metagenomic surveys provides a means to assess the diversity of microbial communities without requiring the cultivation of microbes in the laboratory. Since the cost of cloning DNA templates and capillary-based DNA sequencing constrains the number of sequences included in most of these investigations, the detection of low abundance taxa demands surveys that are many orders of magnitude larger than those reported in the literature. Massively parallel pyrosequencing on the Roche GS20 system developed by 454 Life Sciences offers a means to more extensively sample molecular diversity in microbial populations. It is now possible to generate hundreds of thousands of short (100-200 nucleotide) DNA sequence reads in a few hours without requiring the preparation of sequence templates by conventional cloning. In the near future, technical advances will likely increase the number and length of sequence reads. Pyrosequencing technology relies upon enzyme cascades and CCD luminescence detection capabilities to measure the release of inorganic pyrophosphate with every nucleotide incorporation [1]. The GS20 system takes advantage of DNA capture beads that contain, on average, one single-stranded template, which is amplified to millions of copies in an oil emulsion PCR (emPCR). The beads are then distributed on a solid-phase sequencing substrate (a PicoTiterPlate™) with 1.6 million wells that can each hold a bead and additional reagents, including polymerase, luciferase, and ATP sulfurylase. Microfluidics cycle each of the four nucleotide triphosphates over the PicoTiterPlate™, and incorporation of a nucleotide releases pyrophosphate, the substrate for a luminescence reaction, which is recorded with a cooled CCD camera. The record of intensity of each flow of a nucleotide is a flowgram, analogous to a chromatogram that reports the order of A, C, G and T residues from a DNA sequencing template. Flowgram values correspond to the homopolymer length for that base. The average number of wells with detectable sequencing templates is about 450,000, which produces about 200,000 usable reads. This new methodology brings with it different sources of error to traditional dideoxy capillary sequencing. Since the nucleotide triphosphates are flowed one at a time, substitutions are less likely than with traditional methods. However, it is sometimes difficult to resolve the intensity of luminescence produced when a homopolymer is encountered. The result can be ambiguity of homopolymer length, particularly for longer homopolymers. In addition, insufficient flushing between flows can cause single base insertions (carry forward events) usually near but not adjacent to homopolymers. Insufficient nucleotides within a flow can cause incomplete extension within homopolymers. Generally, an excess of intermediate flowgram values indicates a poor quality read [2]. The GS20 software makes corrections for carry forward and incomplete extensions (CAFIE); it shortens reads from the 3' end until fewer than 3% of the remaining flowgram values are of intermediate value, and it removes reads if the trimming falls below a threshold length. The software identifies as ambiguous flow cycles in which no flowgram value was greater than 0.5. If 5% or more of the flow cycles for a read are ambiguous, the read is removed. The assembly of many overlapping pyrosequencing reads can produce highly accurate consensus sequences [3,4]. Wicker et al. [5] compared assemblies of the barley genome produced by reads from GS20 pyrosequencing and from ABI dideoxy sequencing. Both methods produced consensus sequences with error rates of approximately 0.07% at each consensus position. Gharizadeh et al. [6] compared pyrosequences with Sanger dideoxy methods for 4,747 templates. Comparisons of the traditional capillary sequences with the 25-30 nucleotide pyrosequence reads demonstrated similar levels of read accuracy. Assemblies of massively parallel pyrosequencing reads of plastid genomes from Nandina and Platanus exhibited overall error rates of 0.043% and 0.031%, respectively, in the consensus sequence [4]. The generation of consensus sequences to improve accuracy, however, is generally not appropriate for studies that seek information about natural variation from every read. For example, in metagenomic [7] or PCR amplicon [8] libraries from environmental DNA samples, each sequence read can theoretically represent DNA from a distinct gene from a complex mixture of microbial genes. A viable but imperfect alternative to building consensus sequences for metagenomic and diversity investigations is to identify and remove pyrosequencing reads that are likely to be incorrect. For example, Gilbert et al. [9], in a study of ancient woolly mammoth mitochondrial DNA, removed pyrosequencing reads that were not 98% identical to previously sequenced mammoth mitochondrial DNA sequences, assuming that they must be poor quality. Dostie et al. [10] sequenced an amplicon library and discarded reads in which the PCR primer was not recognized by BLAST. These studies removed 15% and 7% of their reads, respectively, but it is not clear that these statistics improved the quality of the remaining data. To explore error modalities, we used the GS20 system to generate more than 340,000 reads from a PCR amplicon library that was prepared from a collection of 43 reference templates of known sequence. Each reference template contains a distinct ribosomal RNA gene (rDNA), including the V6 hypervariable region from a collection of 43 divergent bacteria [11]. Differences between pyrosequences and their cognate reference sequences identified signatures of low quality data. Results Read accuracy We obtained 340,150 reads that passed the GS20 quality filters, that is, flowgrams for each read: contained the correct base key at the start (a portion of the 454 primer used to differentiate reads from internal quality control sequences); included at least 84 flows; had fewer than 5% of flow cycles resulting in an ambiguous base call (N); and had fewer than 3% of flowgram values between 0.5 and 0.7 [12]. We aligned each read to its reference sequence using an optimized Needleman-Wunsch algorithm. Our data included 159,981 total errors over 32,801,429 bases. The error rate, defined as the number of errors (miscalled bases plus inserted and deleted bases) divided by the total number of expected bases, was 0.49%. As shown in Table 1, 39% of these errors correspond to homopolymer effects, including extension (insertions), incomplete extensions (deletions) and carry forward errors (insertions and substitutions). Carry forward occurs when an incomplete flush of base flow results in a premature incorporation of a base. The presence of homopolymers tends to increase the likelihood of both carry forward and incomplete extension with the GS20 sequencer [12]. Insertions were the most common type of error (36% of errors) followed by deletions (27%), ambiguous bases, Ns (21%), and substitutions (16%). It should be noted that the V6 region does not contain long or frequent homopolymers. The errors did not correlate significantly with distance along the sequence (R2 60%, also contain Ns. Conclusion Our analysis of the GS20 sequencing error rate of the V6 region of bacterial rRNA genes shows a marked improvement over the original error rates published by Margulies et al [2]. The largest source of errors may be due to multi-templated beads, and enhancements to both the chemistry protocol for the GS20 and the built-in bioinformatics software may account for the change in error rates. Our results highlight that a small proportion of low quality reads, presumably from multi-templated beads, are responsible for the majority of sequencing errors. The ability to identify and remove these reads is the best way to improve the accuracy of the entire dataset. It is not a replacement for assigning quality scores to detect the position of miscalled bases. The interpretation of chromatograms by programs such as PHRED [13] employs quality scores that reflect the probability of any type of base call error. Although it uses the same scale, the GS20 software generates quality values based on the probability of homopolymer extension rather than probability of a correct base call. Regardless of the ultimate cause of poor reads, the presence of even a single ambiguous base (N) was an effective indicator of low-quality sequence. The removal of all reads containing one or more Ns can drastically improve the overall quality of the remaining dataset, reducing the error rate from about 0.5% to about 0.25% (Table 2). For our data, this strategy eliminated only 6% of the total reads. By excluding approximately 1% of all reads whose lengths lie outside of the main distribution, as well as those with inexact matches to the primer, the error rate for the V6-tag data dropped to less than 0.2%. The pyrosequencing technology provides such a large number of reads that the elimination of even 10% or more of the reads in a data set should be more than offset by the increase in quality of the remaining reads. Table 2 Identifying low-quality reads and their contribution to the error rate Data selection Percent of reads Error rate All reads 100.0% 0.49% Reads with no Ns 94.4% 0.24% Reads with one or more Ns 5.6% 4.7% Reads with length ≥81 and ≤108 98.8% 0.33% Reads with length 108 1.2% 18.9% Reads with no Ns and length ≥81 and ≤108 93.3% 0.20% Reads with no proximal errors 97.0% 0.45% Reads with fewer than three proximal errors >99.99% 0.48% Reads with more than three proximal errors <0.01% 12.2% Reads with no Ns and length ≥81 and ≤108 and no proximal errors 90.6% 0.16% Removing reads with Ns is the most effective means we found of removing low-quality data and improving the error rates. Read lengths that are either longer or shorter than expected, and are outside the peak of common reads, also correlate strongly with incorrect reads. Our strategy for detecting low quality reads circumvents the need to generate consensus sequences for improving data quality in massively parallel pyrosequencing experiments of environmental DNA. Our criteria for detecting reads with errors allows for the acquisition of pyrosequencing data in the context of molecular ecology that can surpass the accuracy of traditional capillary methods. Materials and methods Generation of 1 kb clone library and selection of clones for pyrosequencing DNA was extracted according to Huber et al. [14] from diffuse flow hydrothermal vent samples as described in Sogin et al. [8]. PCR primers were designed using ARB software [15] to target the bacterial 16S rDNA. The primers used were 337 F (5' CAN CCT ACG GGN GGC NGC) and 1391R (5' GAC GGG CGG TGW GTN CA). The amplification mix contained 5 units Pfu Turbo polymerase (Stratagene, La Jolla, CA, USA), 1× Pfu reaction buffer, 200 μM dNTPs (Pierce Nucleic Acid Technologies, Milwaukee, WI, USA), and 0.2 μM each primer in a volume of 100 μl. Environmental DNA (3-10 ng) was added to 3 separate 30 μl amplification mixes. A positive control (Marinobacter aquaeolei genomic DNA) and two negative controls (no DNA and genomic DNA from the archaeon Methanococcus jannaschii) were also run. An initial denaturation step of 3 min at 94°C was followed by 30 cycles of 94°C for 30 s, 55°C for 45 s, and 72°C for 2 minutes. The final extension step was 72°C for 2 minutes. Following PCR, three reactions for each sample were combined, purified, and concentrated using the MinElute PCR Purification Kit (Qiagen, Valencia, CA, USA) according to the manufacturer's instructions. PCR product quality was assessed on a 0.8% agarose gel stained with ethidium bromide and ligated with pCR4-TOPO vector for 20 minutes at room temperature and transformed with TOP10 electrocompetent cells according to the manufacturer's instructions (Invitrogen, Carlsbad, CA, USA). Colonies for each library were randomly selected and grown in SuperBroth with 50 mg/ml kanamycin in 96 deep-well blocks overnight. Alkaline lysis template preparation was carried out on cell pellets using the RevPrep Orbit II (Genomic Solutions, Ann Arbor, MI, USA) or the Biotech RoboPrep2500 (MWG Biotech, Ebersberg, Germany). The 1,000 base-pair amplicons were sequenced bidirectionly using primers T3 (5'-ATT AAC CCT CAC TAA AGG GA) and T7 (5'-TAA TAC GAC TCA CTA TAG GG), and on an ABI 3730 × l genetic analyzer. Sequences were aligned with MUSCLE (with parameters -diags and -maxiters 10) [16] and manually manipulated in the BioEdit 7.0.1 program [17]. Distance matrixes were calculated using quickdist [8], and taxonomic identities determined using RDP-II Classifier [18]. Sequences were trimmed to include only the V6 region of the gene, the distance matrix re-calculated, and from this analysis, 43 divergent sequences were chosen for further experimentation. The average length of the V6 region for these clones was 101 bases, ranging from 95 to 109, with one longer reference of 183 bases. The 16S rDNA sequences are deposited at GenBank under accession numbers DQ909092, DQ909128 DQ909132, DQ909133, DQ909142, DQ909144, DQ909158, DQ909184, DQ909202, DQ909204, DQ909218, DQ909223, DQ909224, DQ909248, DQ909251, DQ909253, DQ909266, DQ909274, DQ909337, DQ909368. DQ909392, DQ909396, DQ909400, DQ909414, DQ909423, DQ909438, DQ909440, DQ909465, DQ909474, DQ909498, DQ909513, DQ909519, DQ909538, DQ909603, DQ909618, DQ909631, DQ909662, DQ909688, DQ909702, DQ909706, DQ909719. DQ909727, DQ909753. Generation of known V6 amplicon library We treated each plasmid with plasmid-safe DNAase (Epicentre, Madison, WI, USA) to remove Escherichia coli genomic DNA and confirmed that each plasmid produced an amplification product of the expected size with primers targeting the V6 region of the bacterial rDNA according to Sogin et al. [8]. We then pooled the individual plasmids and amplified with the primers that flank the V6 region of rRNA genes according to Sogin et al. [8]. We assessed the product quality using a BioAnalyzer Agilent DNA 1000 LabChip following the manufacturer's instructions. Three reactions were combined, purified, and concentrated using the MinElute PCR Purification Kit (Qiagen). The final amplicon library was sequenced independently by 454 Life Sciences and our own lab. Both labs used the Roche Genome Sequencer 20 (GS20) according to the manufacturer's specifications [2]. The original GS20 output files as text are available in Additional data files 3-5. Error rate calculations We combined the data from both sequencing runs for a total of 340,150 reads (226,150 and 114,000), with an average read length of 94.5 nucleotides and a total of 32,816,656 bases. These sequences are available in fasta format in Additional data file 2. To determine the reference sequence source of each pyrosequencing read, we ran a separate multiple sequence alignment of each individual read against the 43 reference sequences using MUSCLE [16] (default options plus maxiters 2, diags). We calculated the number of sequence differences between each read and the reference sequences to determine the reference sequence to which each read mapped most closely. All subsequent error calculations are based on comparing reads to their assigned reference sequence. The overall error rate is the number of errors in a read divided by the length of sequence. Specifically, we calculated errors in several ways. In all methods, each base mismatch or N in the test sequence counts as an error, and a terminal gap caused by a GS20 read terminating before the end of the reference does not count as an error. In the first and second methods each base of an insertion or deletion counts as one error. In the third method, insertions or deletions are counted by homopolymer runs. If a TAAA is inserted, it is counted as two insertions, one single-T and one multi-A insertion. The denominator for the first method was the read length. The denominator for the second and third methods was the length of reference sequence minus any discounted terminal gaps. All error rate calculations produced essentially the same results. We report error rates using all base errors (not by homopolymer run) divided by the expected length (reference sequence length minus terminal gaps). The error rates were calculated for each sequence in a pairwise comparison of the pyrosequencing read and the reference sequence to which it was assigned. We used the Needleman-Wunsch algorithm [19] for these pair-wise alignments because it selects for the best possible alignment given its run parameters. Using a set of 100 sequences and a matrix of Needleman-Wunsch run parameter combinations, we found that a gap opening penalty of 5.75 and a gap extension penalty of 2.75 minimized the calculated error rate. Error rates, reference sequences and read sequences were imported into a MySQL database for storage and analysis. Additional data files The following additional data are available with the online version of this paper. Additional data file 1 is a fasta file of the 43 known sequences used. Additional data file 2 is a gzip-compressed fasta file of the sequences output by the GS20. These sequences correspond to those included in Additional data files 3, 4, 5 but include only the final sequence information. Additional data files 3, 4, 5 are three compressed text files representing the text translations of the original GS20 binary output (sff) files for all of the sequencing used in the analysis, including sequence, flowgram and other run information. GS20 data are reported by region of the PicoTiterPlate™; we sequenced three plate regions. Supplementary Material Additional data file 1 The 43 known sequences used Click here for file Additional data file 2 These sequences correspond to those included in Additional data files 3-5 but include only the final sequence information in fasta format. Click here for file Additional data file 3 Text translation of the original GS20 binary output (sff) file for the first of three PicoTiterPlate™ regions used in the analysis. Click here for file Additional data file 4 Text translation of the original GS20 binary output (sff) file for the second of three PicoTiterPlate™ regions used in the analysis. Click here for file Additional data file 5 Text translation of the original GS20 binary output (sff) file for the third of three PicoTiterPlate™ regions used in the analysis. Click here for file
            • Record: found
            • Abstract: found
            • Article: not found

            Comparative Analysis of Human Gut Microbiota by Barcoded Pyrosequencing

            Humans host complex microbial communities believed to contribute to health maintenance and, when in imbalance, to the development of diseases. Determining the microbial composition in patients and healthy controls may thus provide novel therapeutic targets. For this purpose, high-throughput, cost-effective methods for microbiota characterization are needed. We have employed 454-pyrosequencing of a hyper-variable region of the 16S rRNA gene in combination with sample-specific barcode sequences which enables parallel in-depth analysis of hundreds of samples with limited sample processing. In silico modeling demonstrated that the method correctly describes microbial communities down to phylotypes below the genus level. Here we applied the technique to analyze microbial communities in throat, stomach and fecal samples. Our results demonstrate the applicability of barcoded pyrosequencing as a high-throughput method for comparative microbial ecology.
              • Record: found
              • Abstract: found
              • Article: found
              Is Open Access

              Evolutionary Relationships of Wild Hominids Recapitulated by Gut Microbial Communities

              Introduction The mammalian digestive tract is sterile at birth but is soon colonized by bacteria that typically derive from the mother [1]–[3]. In the absence of any subsequent alterations or additional colonizations, strict parental inheritance would result in a pattern in which the constituents and composition of the microbial flora would co-diversify with and ultimately mirror the evolutionary relationships of their hosts. Such a situation has been observed in some bacteria within the digestive tract, such as Helicobacter pylori, which is present in the stomachs of about half of the human population and whose patterns of divergence closely follow those of their human hosts [4]. Numerous internal and external factors, including diet, geography, host physiology, disease state, and features of the gut itself, contribute to the community composition of the gut microbiota [5]–[9] and can result in discordance with the host phylogeny. Despite the wide variation among individuals, the gut microbiotae of members of the same species are often more similar to one another than to those of other species. But above this level of organization, the composition of these microbial communities is thought to assort according to the broad dietary habits of their hosts [6],[10]. Based on very limited samplings of nonhuman primates, mostly from captive individuals, conspecifics sometimes retain very similar microbial communities (e.g., Hymadryas baboons), but sometimes do not (e.g., western lowland gorillas). And in a previous phylogenetic analysis of mammals based on their gut microbiotae, the great apes were interspersed in multiple clades along with distantly related species [6],[10]. For example, humans, bonobos, and two of the three gorilla species landed in a large “omnivore” clade along with lemurs, an elephant, and an armadillo, whereas the chimpanzees and orangutans grouped with a flying fox in a divergent clade [10]. From such isolated cases of displaced or zoo-raised hosts, it is difficult to extract the degree to which host and environmental factors shape the primate gut microbiota: both factors are certainly important, but their relative contributions cannot be established based on previous sampling. To address questions pertaining to the stability and variation in the great ape gut microbiota over evolutionary timescales, we performed high-coverage sequencing of the small subunit ribosomal RNA genes [11]–[13] present in the feces of apes collected in their native ranges. The samples from these wild-living hominids included eastern and western lowland gorillas, bonobos, and three subspecies of chimpanzees [14]–[17], as well as two human hosts from different continents. This sampling provided a more comprehensive and less biased view of bacterial species diversity and abundance within the primate distal gut, and revealed that the relationships among microbial communities parallel the host-species phylogeny. Our results indicate that evolutionary changes in host physiology that occurred during the divergence of great apes have been the dominant factor in shaping the distal gut microbial community present in each host species. Results To investigate factors affecting the diversification of distal gut microbial communities, we assayed the microbiota of humans and multiple members of four other great ape species (eastern lowland gorilla, Gorilla beringei; western lowland gorilla, G. gorilla; bonobo, Pan paniscus; and the three subspecies of chimpanzees; P. troglodytes troglodytes, P. t. schweinfurthii, and P. t. ellioti) sampled in their native ranges (Figure 1). For the 26 great ape samples, we generated a total of 1.5 million reads, of which nearly 1.3 million had a recognizable primer and identifying sequence tag. After quality filtering and length trimming, reads shorter than 150 nucleotides in length were removed, leaving a total of 1,107,714 reads (termed “pyrotags”) that could be assigned taxonomically to a class with greater than 70% bootstrap support. The number of sequencing reads per sample ranged from 14,762 to 178,473, with a median value of 27,945 reads per sample. In addition to characterizing 16S rRNA gene diversity, fecal DNA was assayed for multiple variable regions of ape mitochondrial DNA (mtDNA) (HV1, HV2, HV3, and COIII), and for a Y-chromosome marker, to confirm source species, to determine host gender, and to establish the phylogenetic relationships among hosts. Variation in these host sequences confirmed that each fecal sample was derived from a different individual. 10.1371/journal.pbio.1000546.g001 Figure 1 Sample key, locations, collection dates, gender, and taxonomic classification of great apes whose gut microbiotae were analyzed. Note that P. t. ellioti is the current nomenclature for P. t. vellerosus. Based on mtDNA analyses, each fecal sample represented a different individual. Phylum-Level Diversity in Gut Microbial Communities The gut microbiotae of these hosts encompass one archaeal and 18 bacterial phyla, of which five (Actinobacteria, Bacteroidetes, Firmicutes, Proteobacteria, and Verrucomicrobia) were present in all samples. Several phyla not typically observed in gut microbiota of primates, including Euryarchaeota, Acidobacteria, Fibrobacteres, Lentisphaerae, Planctomycetes, and candidate phylum TM7, were recovered at very low relative frequencies (<10−3) from at least nine hosts. In addition, five bacterial phyla (Chlamydiae, Chloroflexi, Deferribacteres, OP10, and Gemmatimonadetes) were detected in only one or few hosts. Contributing most to these rare variants was chimpanzee BB089, which had the highest phylum-level diversity of any sample and harbored four of these five uncommon phyla (Table S1). The three most dominant phyla were Firmicutes, Proteobacteria, and Bacteroidetes, which together constituted over 80% of the reads identified in every sample. Although divergent mammals can harbor broadly similar gut microbiotae at the level of bacterial phylum [7],[10], the two species of gorilla differed from those of other great apes in the relative frequencies of the dominant phyla. Firmicutes was numerically dominant in all great apes but was less common than Proteobacteria and Bacteroidetes in both species of gorilla. Among other phyla represented in all hosts, Actinobacteria was common but only occurred at frequencies of greater than 10% in two of the five bonobos and in a single chimpanzee (CP470), and Verrucomicrobia, usually at frequencies of only 1%–10%, constituted approximately 20% of the microbiota of one human (KS477). To determine whether great ape species can be distinguished based on the diversity of microbes in their fecal samples, we first performed a phylogenetic analysis of the phylum-level diversity within their gut microbiota. For this analysis, we contructed phylogenetic trees based on the abundance in each sample of pyrotags that were classified to phylum (see Materials and Methods and below for results based on species-level microbial diversity). Despite variation in the distribution and abundance of numerous microbial phyla, these phylum-level phylogenies did not resolve any of the ape species as discrete groups (Figure S1). There were 160 most parsimonious trees, and no clade recovered had greater than 70% bootstrap support. This result is due to both the sporadic occurrence of certain phyla among individual members of the same great ape species and the phylum-level diversity present in the microbiota of chimpanzees, which broadly overlaps that of the other great apes. Resolution of Phylotypes and Microbial Species The majority of the variation in the microbiotae of the great ape hosts is represented as unique pyrotags recovered from a single sample, indicating that differentiation in the gut microbiota among hosts may occur at lower taxonomic levels. Although these 16S rDNA sequences are indicative of a broad range of species-level bacterial diversity in these samples, the source, relevance, and reproducibility of this “rare biosphere” has recently been questioned [18]–[20]. To assess how experimental factors might contribute to the contents of the rare biosphere, we performed a high-coverage technical replicate (WE464R) on an independent preparation of the fecal sample from chimpanzee WE464. Even with sequencing to 3.5 times the depth of the initial sample (51,648 versus 14,762 reads), there were no identical matches for approximately 30% of the reads in the original sample, but when allowing for up to 0.5% sequence divergence between reads (i.e., no more than a single one-nucleotide mismatch or indel), this proportion shrank to less than 10%. Therefore, to assemble the most biologically robust segment of our entire 1,107,714-pyrotag dataset, we grouped sequencing reads by applying a 99.5% identity threshold to correct for most potential sequencing artifacts. We have noted previously that thresholds higher than 97% will inflate richness estimates using amplicon pyrosequencing [20]; however, the approach we take here will be minimally impacted by this artifact. More importantly, application of this high threshold improves the likelihood that clustered pyrotags belong to the same bacterial species, whereas the conventional criterion of 97% sequence identity often unites bacteria typed to different taxonomic groups [21],[22]. To determine the degree to which the gut microbial communities present in these great apes are similar in the frequencies of their constituent microbial species, we retained only those 99.5% operational taxonomic units (OTUs) detected in two or more host samples (since unique OTUs are not phylogenetically informative and provide no information about evolutionary relatedness). The set of reproducible 99.5% OTUs contained a total of 1,017,478 reads that formed a total of 8,914 microbial phylotypes (hereafter referred to as “species”), which were used to examine the fine-scale taxonomic structure and similarities of the great ape gut microbiotae (Figure 2). 10.1371/journal.pbio.1000546.g002 Figure 2 Relative abundance of microbial species across great ape hosts. Rows represent samples (color-coded by host species as in Figure 1); columns represent microbial species (reproducible 99.5% OTUs; n = 8,914). Microbial phyla represented in the gut microbiota of these hosts are shown horizontally across the top: OTUs classified to one archaeal (Euryarchaeota) and 14 bacterial phyla, as indicated. Based both on the number of OTUs and on read counts, species classified as Bacteroidetes, Firmicutes, and Proteobacteria are the most dominant phyla in these samples. Samples from four chimpanzees (BB089, BB095, WE457, and WE458) have more proteobacterial reads than do other chimpanzee samples, and samples from both gorilla species contained relatively fewer Firmicutes species than did samples from other ape species. Individual cells are color-coded by Z-scores to show the normalized abundance of a particular OTU in one sample relative to the mean abundance across all samples. Intensity of the colors indicates how many standard deviations the observed OTU abundance is above or below the mean. Whereas our analyses of phylum-level microbial diversity were not sufficiently grained to differentiate great ape species based on their microbiota (a similar limitation encountered by Ley et al. [10] when only 100–200 bacterial sequences were sampled from each host), the assemblage of microbial species (that manifest as reproducible 99.5% OTUs) discriminated the individual primate hosts and assorted them into taxonomic groupings. For example, the two humans shared relatively few phylotypes with other great apes, and both species of gorillas shared high frequencies of proteobacterial phylotypes and very low frequencies of Firmicutes species that were present in the majority of chimpanzee samples. This result indicates that deep sampling of the microbiota is necessary to fully recover the evolutionary signal in gut microbial community data. The number of reproducible microbial species recovered from individual hosts ranged from 265 (in African human KS477) to 3,247 (in chimpanzee WE458, the host for which we obtained the highest number of reads). The highest frequency attained by an individual bacterial species was 27.7% for a phylotype classified as Megasphaera (Firmicutes: Clostridium) in the sample from the African human. Relatedness of Great Ape Hosts Based on Gut Microbiota To conduct a phylogenetic analysis of the occurrence and frequencies of species present in the fecal microbial communities in the primate hosts, we treated each microbial species as an individual standard data character assigned to one of six possible states that correspond to order-of-magnitude differences in the normalized frequency of each microbial species in each ape host sample. This approach is similar to that typically used with morphometric data, where, for example, femur length might be treated as a standard character coded with a handful of possible states ranging from very small to very large (e.g., the multi-log-difference size range from mouse to elephant). The aggregate character matrix of several morphometric characters (or, in our case, the frequencies of the various members of a microbial community) can then be analyzed using traditional phylogenetic techniques. The 8,914-species, six-state data matrix was subjected to a heuristic maximum parsimony tree search, as performed previously for the phylum-level tree phylogeny, with 1,000 pseudo-replicates used to assess bootstrap support. Because microbial species are defined as reproducible OTUs clustered at 99.5% sequence identity, all 8,914 characters are parsimony-informative, and the phylogenetic analysis recovered a single maximum parsimony tree (p-score = 49,475). The unrooted maximum parsimony tree exhibited a species-level topology that was completely congruent with the unrooted mtDNA topology of the hosts (Figure 3). Moreover, these groupings were supported by high bootstrap values (98%–100% for each ape species, with the exception of the chimpanzee clade, which reached only 68% bootstrap support). There are more than 2,000,000 possible unrooted topologies for a ten-taxon phylogenetic tree. Therefore, if one constructs a tree with two humans, two chimpanzees, two bonobos, two eastern lowland gorillas, and two western lowland gorillas, the chance of randomly generating a tree that is entirely congruent with the species tree in placing each species with its conspecific, and also placing the two gorilla species as sister groups and the chimpanzees and bonobos as sister groups (as in Figure 3) is less than 1/2,000,000. 10.1371/journal.pbio.1000546.g003 Figure 3 Phylogeny of great apes based on mtDNA sequence variation and composition of host gut microbiota. (A) mtDNA phylogeny. (B) Gut microbiota phylogeny. Color-coding of terminal branches leading to great ape hosts corresponds to those in Figure 1. Internal branches of these trees (thick grey lines), although of different relative lengths, show identical branching orders. In (A), the internal nodes were all supported with bootstrap values greater than 98%; in (B), each internal node showed bootstrap support greater than 98%, except that leading to chimpanzees, which had 68% support. Discussion Based on the compositions of the distal gut microbial communities from hosts living in their natural environments, we were able to discriminate species of great apes. The topological concordance between the species-level branching orders obtained for hosts and their microbiotae shows that over evolutionary timescales, host phylogeny is the overriding factor determining the microbial composition of the great ape gut microbiota. This recapitulation of the species relationships in the frequencies of the microbial constituents of their distal gut communities contrasts with previous notions that diet is the most important factor governing the grouping of gut microbiotae within primates [6],[10]. This new view of great ape microbiota evolution emerged as a consequence of the sampling depth, which allowed the recovery of large sets of evolutionarily informative phylotypes. This allowed the application of standard parsimony-based phylogenetic approaches that were based on the frequency of each microbial species shared among hosts. Previous studies of gut microbiotae that surveyed only on the order of 100 sequences per sample [10],[23],[24] could not accurately gauge either the diversity present in complex microbial communities or the relative abundance of the constituent species. Given the species complexity within the distal gut microbiota, it is necessary to obtain more than 104 reads per host to accurately access the relationships among divergent microbial communities. However, recent advances in sequencing methodologies render this number of reads both technically and economically feasible. The fact that the gut microbial community phylogeny matches the great ape species phylogeny is not readily attributable to factors other than the evolutionary diversification of hosts. For example, the broad geographic range of chimpanzees, as well as the intercontinental distance separating our sampled humans, establishes that geographic proximity is not a major factor in the clustering of microbial communities by host species. Likewise, chimpanzees and gorillas within the same locale exhibited phylogenetically distinct gut microbial communities. That the composition of gut microbiotae assorts to species despite their geographic locations suggests that similarities in local factors, such as those that relate to diet, do not explain the close correspondence between host phylogeny and microbial community composition. To further evaluate whether host species differentiate according to diet, we examined the populations of chloroplast sequences within each fecal sample. Although the diversity of chloroplasts serves as an indicator only of plant diet at the time of sampling, there was no clear indication that the great ape species (except for G. beringei) have widely different diets or that the diets of great apes structure according to host phylogeny (Figure S2). As evident from the differences in relative branch lengths between the mtDNA (Figure 3A) and microbial community (Figure 3B) trees, it is clear that the degree of genetic differentiation between hosts does not fully account for the variation in great ape gut microbiota. The host phylogeny signal that we uncovered can be masked by factors occurring on more proximate timescales (such as diet, geography, or health status). Only by conducting a phylogenetic analysis of communities that have been more deeply sampled is it possible to detect this signal. To assess the degree to which differences in gut microbiota reflect the genetic distance between hosts, we compared the amount of variation assigned to the terminal branches of the tree (i.e., those leading to individual hosts) relative to that encompassed in the seven internal branches that differentiate the five great ape species (grey branches in Figure 3). The species-discriminating branches together represent 73% of the total genetic distance present in the mtDNA phylogeny, but only 7% of the total distance in the tree based on microbial communities. This contrasts with the situation for individual hosts, whose branch lengths together constitute 70% of the distance in the microbial tree but encompass only 11% of the total genetic distance. This disparity reflects the broad variation in microbial communities among members of the same species, as has already been observed in humans [5]–[7],[25]–[28]. Next, to discount the effects of individual variation, we calculated the correlation coefficient between the relative branch lengths of the seven internal branches in the microbial community tree and the corresponding distances in the mtDNA tree. Despite the congruence in branching orders, the branch lengths in the mtDNA tree explain only about 25% of the variation in the microbial community tree. This indicates that gut microbiotae, although diverging in a manner consistent with vertical inheritance, are not changing in a strict time-dependent fashion that reflects the degree of genetic divergence among hosts. The difference in branch length indicates that individual-level variation in microbial community structure is extensive relative to between-species variation. Our analysis indicates that host phylogeny has a major role in the diversification of distal gut microbial communities in great apes, a conclusion that can become apparent only when sampling is adequate for robust phylogenetic and evolutionary analyses of microbial species compositions. Numerous studies have applied UniFrac and related approaches to establish the relationships among microbial communities derived from a wide range of hosts and environmental sources [29]–[33]. Despite the highly supported tree that we obtained by parsimony analysis, subjecting our dataset to UniFrac did not recover a tree that matches the host-species phylogeny (Figure S3). Unlike parsimony, UniFrac relies on an input tree to specify the evolutionary relationship among bacterial taxa to infer the similarity among microbial communities. However, for a large dataset with nearly 9,000 characters, ensuring the correct inference of tree topology and branch lengths is difficult. The task of inferring an input tree is all the more problematic because of the relatively short and highly variable sequencing reads that are generated for most metagenomic studies. The quality of multiple sequence alignment, which is critical for inferring the guide tree, is greatly impacted by the limited read length, the level of sequence variation, and the propensity towards indel sequencing errors. This problem was almost entirely eliminated from our parsimony analysis (of species abundance data) by performing multiple sequence alignments on sets of reads assigned to a particular taxonomic class, not the entire dataset. Furthermore, when calculating pair-wise sequence identities among reads typed to the same class, indel sequencing errors present in taxonomically different reads are ignored. Since the V6 region has previously been shown to have low phylogenetic congruency with full-length small subunit ribosomal RNA topologies [34], the described methods based on species abundances and community compositions serve as an alternative and complementary approach for analyzing pyrotag data. With the availability of methods that allow the scrutiny of microbial diversity and community structure at finer levels, the challenge now is to determine how best to characterize each specific environment in order to extract the relevant biological information about its constituents. In the present study, we found that sampling at levels of greater than 10,000 reads per sample, the application of stringent cutoffs for species identity, and the focus on parsimony-informative characters helped resolve host phylogeny as the major determinant of distal gut microbial communities in great apes. Materials and Methods Sample Collection and Processing Ape fecal samples used in this study were selected from an existing bank of previously collected specimens [14]–[17]. All samples except one (GM173) were collected from wild-living, non-habituated apes at remote forest sites in Cameroon, the Central African Republic, and the Democratic Republic of the Congo (DRC). Sample GM173 was obtained from a habituated male chimpanzee (Ch-045) in Gombe National Park, Tanzania (Figure 1). In the field, fecal samples were identified to be of likely chimpanzee, gorilla, or bonobo origin by experienced trackers; however, species and subspecies origins were subsequently confirmed in the laboratory by mtDNA analysis. This genetic analysis revealed a limited number of initially misidentified specimens from other mammal species, including a handful of samples that were of human origin. One such sample (KS477) from an unknown individual in the DRC was included in this study. In addition, a fecal sample was supplied by a human male residing in Tucson, Arizona (United States). All fecal samples were collected, stored, and shipped in RNAlater (Ambion). Time, date, and collection site were recorded for each sample. Samples were shipped at ambient temperatures but subsequently stored at −80°C. DNA was extracted from 200-µl aliquots of thawed fecal samples by spin-column filtration using the QIAamp DNA Stool Kit (Qiagen), following the manufacturer's protocol for isolating DNA for pathogen detection. DNA was quantified on a Qubit fluorometer (Invitrogen) and subjected to PCR amplification of the 16S rDNA region spanned by primers 926F (5′-aaactYaaaKgaattgacgg-3′) and 1492R (5′-tacggYtaccttgttacgactt-3′). Amplicons encompassed the V6 region, which was selected because of prior use and high level of variability [13],[20],[34]–[36]. To multiplex amplicons for inclusion on a single sequencing run (454 Life Sciences/Roche), the appropriate 454 Life Sciences adaptor sequence and a unique three- or four-nucleotide sequence tag (barcode) were added to the 5′ end of the forward and reverse 16S amplification primers. For each primer pair, PCR was performed in triplicate and pooled to minimize PCR biases that might occur in individual reactions. Each 50-µl reaction consisted of 1.25 units of Taq (GE Healthcare), 5 µl of supplied 10× buffer, 0.25 µl of 10 mM dNTP mix (MBI Fermentas), 1.5 µl of 10 mg/ml BSA (New England Biolabs), 0.5 µl of each 10 µM primer, and 40 ng of template DNA, and proceeded at 95°C for 3 min; followed by 25 cycles of 95°C for 30 s, 55°C for 45 s, and 72°C for 90 s; followed by a final extension at 72°C for 10 min. Amplification products were purified on MinElute PCR columns (Qiagen) and quantified. To obtain a similar number of reads from each sample, amplicons were mixed in equal concentrations prior to pyroseqencing. Emulsion PCR and sequencing were performed using a GS-FLX emPCR amplicon kit (454 Life Sciences/Roche), following the manufacturer's protocols. Pyrosequencing proceeded from the barcode at the 5′ end of the 926F primer. To confirm species and differentiate individual hosts, DNA samples were also tested with primers designed to amplify three hypervariable regions of the D-loop of the great ape mitochondrial genome: HV1 (nucleotides 15997 to 16498), and HV2 and HV3 (nucleotides 16517 to 607). Amplified PCR products were treated with exonuclease I and calf intestinal phosphatase, and directly sequenced from both ends using the amplification primers on an ABI 3700 sequencer (Applied Biosystems). Sequences were assembled in Sequencher (Gene Codes Corporation) and compared to the published mtDNA sequences of great apes. In addition, we used unambiguous polymorphisms to confirm that each sample came from a different individual. Informatic Analyses, Quality Filtering, and Taxonomic Assignment Pyrosequencing flowgrams were converted to sequence reads using software provided by 454 Life Sciences/Roche. Reads were end-trimmed with LUCY [37], using an accuracy threshold of 0.5% per base error probability. Reads lacking exact matches to a recognizable barcode and primer sequence were removed from the dataset, leaving a total of 1,292,542 reads (elsewhere referred to as “pyrotags”) out of the original total of 1,501,806 reads. Reads were assigned to individual samples based on identifying barcode sequences. Barcode and primer sequences were removed from the 5′ end of each read, and the taxonomic origin of each read was established using RDP Classifier. For quality filtering, we excluded the reads that (i) were shorter than 150 nucleotides in length, (ii) received bootstrap support for class assignment lower than 70% (based on RDP Classifier), (iii) mapped to the incorrect region of 16S gene, or (iv) were of chloroplast origin (based on RDP Classifier). This final filter removed more than 99% of the Cyanobacteria reads, leaving only one read that was assigned to genus GpVII. For sequences classified as Archaea, we required reads to start at position 844–850 (relative to the reference sequence from RDP Aligner); for sequences classified as Bacteria, we required the reads to start at position 851–857. For the 1,107,714 pyrotags that could be assigned to a taxonomic class, we performed all pair-wise comparisons to identify unique sequence types. If two otherwise identical reads differed in length, they were trimmed to the same length. We used RDP Aligner to perform multiple sequence alignments of all unique sequence types. Based on these alignments, we calculated the percent identity of all pairs typed to the same class. Terminal gaps in the 5′ or 3′ end of the alignment were excluded when calculating percent identities. The clustering step was done using the MCL (Markov Clustering) algorithm with the inflation value set to 1.5 [38]. 99.5% OTUs were partitioned into two sets: unique (present in one sample) and shared (present in at least two samples). Frequencies of shared OTUs were visualized via heatmaps generated in R and used for subsequent analyses. Community Relatedness and Phylogenetic Analyses To establish the degree to which the gut microbiotae of samples were similar with respect to the compositions of their constituent microbes, we constructed phylum-level and species-level phylogenies of hosts based on the frequencies of taxonomically assigned OTUs in their gut microbial communities. Character matrices based on all reads were converted to phylogenetic trees using a parsimony-based approach. Each character corresponds to a taxonomically assigned OTU whose frequency in each sample has been normalized by coding with one of six ordered states reflecting log-unit differences in its occurrence, with a OTU absent from a sample coded as state 0. Given the range in the occurrence of each OTU across samples (from 0 to 83,840 at the phylum level), this resulted in six-state data matrix, which was then subjected to a heuristic maximum parsimony tree search using PAUP version 4.0b10, using default settings. Characters were considered to be ordered, such that transitions between distant states (i.e., samples having very divergent frequencies of a particular phylotype) were more costly than between similar states. To produce the tree for input to UniFrac, we took the longest read within each OTU as the representative for multiple sequence alignment and generated alignments using RDP Aligner (applying the Bacteria model since only ten of the 8,914 OTUs represented Archaea). The resulting alignment contained 554 aligned nucleotide sites, and the tree relating the 8,914 OTUs was inferred in FastTree version 2.1.1 [39],[40]. Tree topology and sample information were uploaded to the Fast UniFrac Web server [41] for the clustering analysis, using the weighted and normalized options to account for differences in OTU abundance and read depth among samples. Supporting Information Figure S1 Phylogeny of great apes based on phylum-level classification of gut microbiota. Color-coding and sample names of individual great ape hosts follow those presented in Figure 1. This phylogeny is based on a character matrix in which each character is a microbial phylum obtained from all classifiable sequencing reads. Each character is coded with one of six ordered states reflecting log-unit differences in the normalized frequency of that particular phylum in each sample. From a total of 160 most parsimonious trees, no clade showed bootstrap support greater than 70%. (0.26 MB EPS) Click here for additional data file. Figure S2 Great ape phylogeny based on diet composition. Color-coding and sample names of individual great ape hosts correspond to those in Figure 1. Presented is one of 59 most parsimonious trees recovered from analysis of a character matrix encompassing the frequency of 99.5% OTUs of chloroplast reads from each fecal sample (i.e., applying the same approach in the microbial community structure analysis, but instead considering a “snapshot” of the plant component of the recent diet of each animal.) Nodes present in greater than 75% of the 59 most parsimonious trees are indicated with a star. Aside from one species of gorilla, there is no clear indication (at least from this admittedly crude proxy of diet) that the great ape species are distinct, let alone that the deeper structure of the host phylogeny is matched. In other words, the diets of these animals do not appear to be structured simply according to host phylogeny. Accordingly, diet on its own seems an unlikely explanation for the observation of congruence between the microbial community structure tree and the species-level host (mtDNA) phylogeny. (0.27 MB EPS) Click here for additional data file. Figure S3 UniFrac analysis of the microbial communities within the distal gut of great apes. Color-coding and sample names of great ape hosts correspond to those presented in Figure 1. (0.14 MB PDF) Click here for additional data file. Table S1 Phylum-level prokaryotic diversity in great ape fecal samples. (0.05 MB PDF) Click here for additional data file.

                Author and article information

                Role: Editor
                PLoS One
                PLoS ONE
                PLoS ONE
                Public Library of Science (San Francisco, USA )
                12 December 2012
                : 7
                : 12
                [1 ]Department of Anthropology, University of Oklahoma, Norman, Oklahoma, United States of America
                [2 ]Department of Computer Science, University of Colorado, Boulder, Colorado, United States of America
                [3 ]Department of Chemistry and Biochemistry, University of Colorado, Boulder, Colorado, United States of America
                [4 ]Department of Chemistry and Biochemistry, Advanced Center for Genome Technology, University of Oklahoma, Norman, Oklahoma, United States of America
                [5 ]School of Natural Resources, University of Nebraska, Lincoln, Nebraska, United States of America
                [6 ]Climate Change Institute and Department of Anthropology, University of Maine, Orono, Maine, United States of America
                University of York, United Kingdom
                Author notes

                Competing Interests: The authors have declared that no competing interests exist.

                Provided contributions to drafts of the manuscript including edits, comments, and/or figures and tables: RYT DK JM AJOT LC FN BR KR KS SB MF PS RK CML. Conceived and designed the experiments: RYT DK RK CML. Performed the experiments: RYT DK AJOT LC JM. Analyzed the data: RYT DK RK CML. Contributed reagents/materials/analysis tools: FN BR KR KS SB RK MF PS CML. Wrote the paper: RYT CML.


                This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

                Page count
                Pages: 8
                Support for this research is from the National Institutes of Health (grants R01 HG005172-01 and R01 GM089886-01A1) and the National Science Foundation (NSF#0845314). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
                Research Article
                Evolutionary Biology
                Microbial Ecology
                Population Biology
                Population Ecology
                Social and Behavioral Sciences



                Comment on this article