      Is Open Access

      VizBin - an application for reference-independent visualization and human-augmented binning of metagenomic data





          Metagenomics is limited in its ability to link distinct microbial populations to genetic potential due to a current lack of representative isolate genome sequences. Reference-independent approaches, which exploit for example inherent genomic signatures for the clustering of metagenomic fragments (binning), offer the prospect to resolve and reconstruct population-level genomic complements without the need for prior knowledge.


          We present VizBin, a Java™-based application which offers efficient and intuitive reference-independent visualization of metagenomic datasets from single samples for subsequent human-in-the-loop inspection and binning. The method is based on nonlinear dimension reduction of genomic signatures and exploits the superior pattern recognition capabilities of the human eye-brain system for cluster identification and delineation. We demonstrate the general applicability of VizBin for the analysis of metagenomic sequence data by presenting results from two cellulolytic microbial communities and one human-borne microbial consortium. The superior performance of our application compared to other analogous metagenomic visualization and binning methods is also presented.


          VizBin can be applied de novo for the visualization and subsequent binning of metagenomic datasets from single samples, and it can be used for the post hoc inspection and refinement of automatically generated bins. Due to its computational efficiency, it can be run on common desktop machines and enables the analysis of complex metagenomic datasets in a matter of minutes. The software implementation is available at https://claczny.github.io/VizBin under the BSD License (four-clause) and runs under Microsoft Windows™, Apple Mac OS X™ (10.7 to 10.10), and Linux.

          Electronic supplementary material

          The online version of this article (doi:10.1186/s40168-014-0066-1) contains supplementary material, which is available to authorized users.
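The abstract describes human-in-the-loop delineation of clusters but does not spell out the mechanics. As a minimal sketch, assuming each sequence fragment has already been placed at a 2D coordinate by the dimension reduction step and that the user outlines a cluster as a polygon, binning reduces to a point-in-polygon test. The names `point_in_polygon` and `select_bin`, the contig labels, and the coordinates are all hypothetical; this is an illustration in Python, not the Java implementation of VizBin.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]


def point_in_polygon(point: Point, polygon: Sequence[Point]) -> bool:
    """Ray-casting test; points exactly on the outline may fall either way."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def select_bin(embedding: Dict[str, Point], polygon: Sequence[Point]) -> List[str]:
    """Return the names of all fragments whose 2D position lies inside the outline."""
    return sorted(name for name, xy in embedding.items()
                  if point_in_polygon(xy, polygon))


# Hypothetical 2D positions of three fragments and a user-drawn square outline.
embedding = {
    "contig_1": (1.0, 1.0),
    "contig_2": (1.5, 0.5),
    "contig_3": (5.0, 5.0),
}
outline = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
selected = select_bin(embedding, outline)  # -> ['contig_1', 'contig_2']
```

Fragments inside the outline would then be exported together as one bin for downstream inspection, while fragments outside it remain unassigned.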


          Most cited references (12)


          Community structure and metabolism through reconstruction of microbial genomes from the environment.

          Microbial communities are vital in the functioning of all ecosystems; however, most microorganisms are uncultivated, and their roles in natural systems are unclear. Here, using random shotgun sequencing of DNA from a natural acidophilic biofilm, we report reconstruction of near-complete genomes of Leptospirillum group II and Ferroplasma type II, and partial recovery of three other genomes. This was possible because the biofilm was dominated by a small number of species populations and the frequency of genomic rearrangements and gene insertions or deletions was relatively low. Because each sequence read came from a different individual, we could determine that single-nucleotide polymorphisms are the predominant form of heterogeneity at the strain level. The Leptospirillum group II genome had remarkably few nucleotide polymorphisms, despite the existence of low-abundance variants. The Ferroplasma type II genome seems to be a composite from three ancestral strains that have undergone homologous recombination to form a large population of mosaic genomes. Analysis of the gene complement for each organism revealed the pathways for carbon and nitrogen fixation and energy generation, and provided insights into survival strategies in an extreme environment.

            Fermentation, hydrogen, and sulfur metabolism in multiple uncultivated bacterial phyla.

            BD1-5, OP11, and OD1 bacteria have been widely detected in anaerobic environments, but their metabolisms remain unclear owing to lack of cultivated representatives and minimal genomic sampling. We uncovered metabolic characteristics for members of these phyla, and a new lineage, PER, via cultivation-independent recovery of 49 partial to near-complete genomes from an acetate-amended aquifer. All organisms were nonrespiring anaerobes predicted to ferment. Three augment fermentation with archaeal-like hybrid type II/III ribulose-1,5-bisphosphate carboxylase-oxygenase (RuBisCO) that couples adenosine monophosphate salvage with CO2 fixation, a pathway not previously described in Bacteria. Members of OD1 reduce sulfur and may pump protons using archaeal-type hydrogenases. For six organisms, the UGA stop codon is translated as tryptophan. All bacteria studied here may play previously unrecognized roles in hydrogen production, sulfur cycling, and fermentation of refractory sedimentary carbon.
              Is Open Access

              Community-wide analysis of microbial genome sequence signatures

              Background The age of genomics has opened up new perspectives on the natural microbial world, offering insights into organisms that drive geochemical cycles and are critical to human and environmental health. The prevalence of horizontal gene transfer, recombination, and population-level genomic diversity underscores the dynamic nature of bacterial and archaeal genomes and demands reconsideration of fundamental issues such as microbial taxonomy [1,2] and the concept of microbial species [3,4]. Application of genomics to uncultivated assemblages of microorganisms in natural environments ('metagenomics' or 'community genomics') has provided a new window into in situ microbial diversity and function [5-7]. To date, community genomics has revealed the form and extent of recombination and heterogeneity in gene content [8-11], elucidated virus-host interactions [12], redefined the extent of genetic and biochemical diversity in the oceans [13-15], uncovered new metabolic capabilities [16-19] and taxonomic groups [20], and shown how functions are distributed across environmental gradients [21]. An important approach to study evolutionary and ecological processes, pioneered by Karlin and others [22], is the analysis of nucleotide compositional characteristics of genomes. The simplest and most widely used measure of nucleotide composition, the abundance of guanine plus cytosine (%GC), is shaped by multiple factors encompassing both neutral and selective processes. Neutral factors include intrinsic properties of the replication, repair, and recombination machinery that result in mutational biases [23,24]. Selective processes encompass both internal (for example, translation machinery) and external influences such as physical (temperature, pressure), chemical (salinity, pH) and ecological factors (competition for metabolic resources [25] and niche complexity [26]). 
Although the relative importance of these factors remains uncertain [27], it is clear that %GC varies widely between species but is relatively constant within species. Thus, %GC has been used to trace origins of DNA fragments within genomes [28] and to assign fragmentary metagenomic sequences to candidate organisms [16]. Such inferences must be made with caution: %GC simplifies nucleotide composition down to a single parameter with known limitations for investigating genome dynamics [29]. Oligonucleotide frequencies capture species-specific characteristics of nucleotide composition more effectively than %GC [30]. Analyses of genome sequences from cultivated organisms have shown that the frequency at which oligonucleotides occur is unique between species while being conserved genome-wide within species [22,30-34]. Taken together, the frequency of all oligonucleotides of a given length defines the 'genome signature' (for example, the frequency of all possible 256 tetranucleotides). Sequence signatures are evident in oligonucleotides ranging from di- (two-mers) to octanucleotides (eight-mers). While the specificity of genome signatures increases with oligonucleotide length [35], the number of possible oligomers increases exponentially with oligomer length, so signatures based on longer oligomers require calculations over larger genomic regions to achieve sufficient sampling. Genome signatures have been used to detect horizontally transferred DNA [36-39], reconstruct phylogenetic relationships [22,32,40] and infer lifestyles of bacteriophage [41,42]. Genome signatures also offer a compelling means of assigning metagenomic sequence fragments to microbial taxa, a procedure termed 'binning' [43]. 
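To make the two composition measures above concrete, the following sketch computes %GC and a tetranucleotide frequency vector over the 256 possible tetramers for a single sequence fragment. The function names are illustrative and not taken from the study. Only one strand is counted and windows containing ambiguous bases are skipped, so this is a simplification; published analyses may normalize or combine strands differently.

```python
from collections import Counter
from itertools import product
from typing import List

# All 4**4 = 256 tetranucleotides in a fixed lexicographic order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def gc_percent(seq: str) -> float:
    """Percentage of G and C among the unambiguous bases of `seq`."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def tetra_signature(seq: str) -> List[float]:
    """Relative frequency of each tetranucleotide, in TETRAMERS order."""
    seq = seq.upper()
    # Count every overlapping window of length 4.
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    # Windows containing characters other than A, C, G, T are ignored.
    total = sum(counts[t] for t in TETRAMERS)
    if total == 0:
        return [0.0] * len(TETRAMERS)
    return [counts[t] / total for t in TETRAMERS]


example_gc = gc_percent("GGCCAATT")       # 4 of 8 bases are G or C
example_sig = tetra_signature("ACGTACGT")  # 5 windows, 2 of them "ACGT"
```

Each fragment thus becomes a point in a 256-dimensional space, and binning amounts to grouping fragments whose vectors lie close together. As the text notes, longer oligomers give more specific signatures but need longer fragments to be sampled adequately.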
This is a prerequisite for realizing some of the most valuable opportunities random shotgun metagenomics offers, including assignment of ecological and biogeochemical functions to particular community members and assessment of population-level genomic diversity and community structure. However, binning is a formidable challenge because: the inherent diversity of microbial communities typically limits genomic assembly, resulting in highly fragmentary data [13]; there are few universally conserved phylogenetically informative markers, leaving the vast majority of metagenomic sequence fragments 'anonymous' with regard to their organism of origin; and current sequence databases grossly under-represent the microbial diversity in the natural world, limiting the utility of fragment recruitment or BLAST-based methods [13,44,45]. Consequently, it is important to develop methods that classify all genome sequence fragments independently of reference databases. Genome signatures are a promising approach for sequence classification. However, it is important to understand the source of the signal and how environmental effects and evolutionary distance will compromise it. To date, sequence signatures have been explored using genomes from cultivated microbes [22,30-34], and prospects for binning have been evaluated based largely on simulated datasets consisting of mixtures of isolate genomes [44,46-48]. Although these studies are indispensable in that they allow theoretical evaluation of binning capability, they do not represent the diversity (community-wide and within population) and dynamics (for example, horizontal gene transfer, recombination, viruses) of real microbial communities. Further, they employ genomes derived from disparate environments and so do not address the extent to which environmental factors shape genome signatures. It has been reported that environment shapes nucleotide composition [26,49-51]. 
If so, then genome signatures may not discriminate coexisting, coevolving organisms, especially where environmental pressures are extreme. On the other hand, binning results of real microbial communities [46,48,52] are inherently difficult to evaluate because the true identity of most sequence fragments is unknown. Thus, there remain fundamental questions regarding the forces and processes that give rise to and maintain genome signatures, and the extent to which these signatures are obscured by shared environmental pressures and community interactions such as horizontal gene transfer and broad host range viruses. Here we present a comprehensive analysis of genome signatures in sequences derived from natural biofilms inhabiting a subsurface chemolithoautotrophic acid mine drainage (AMD) ecosystem in the Richmond Mine at Iron Mountain, CA [53]. The biofilms are dominated by just a handful of organisms that are sustained primarily by the oxidation of Fe(II) derived from pyrite (FeS2) dissolution [54]. Due to this relatively low diversity, modest levels of shotgun sequencing (approximately 100 Mb per sample) have yielded deep genomic sampling (10 to 20× sequence coverage) of the dominant populations, enabling reconstruction of 12 near-complete genomes from three samples [16,55,56] (BJ Baker et al., submitted). These assembled composite genomes provide the organism affiliation of sequences with which binning accuracy can be evaluated. Therefore, the dataset allows assessment of binning performance while capturing sequence heterogeneity that is an intrinsic feature of natural microbial populations. We find that AMD biofilm microorganisms are indeed distinguished by population-specific genome signatures and show that sequence signatures can be used to identify and cluster sequences from low-abundance community members de novo, without reference genomes or reliance on databases. 
Our results have implications for metagenomic binning and provide new insights into the sources of genome signatures that distinguish coexisting populations. Results Description of samples, community genomic sequencing and assembly An overview of our methodology is shown in Figure 1. Community genomic sequence was obtained from two previously described biofilm samples from the UBA location of the Richmond Mine at Iron Mountain: a pink subaerial biofilm collected in June 2005 ('UBA') [55] and a thicker floating biofilm collected in November 2005 ('UBA BS') [12]. These two biofilms contained overlapping subsets of organisms in different proportions. The UBA biofilm was dominated by bacterial Leptospirillum spp. group II and group III (Nitrospirae) populations, for which near-complete genomes have been reconstructed [55,56]. The most abundant microorganisms represented in the UBA BS genomic data were from archaeal populations, including an uncultivated representative of a novel euryarchaeal lineage, ARMAN-2 [20], and A-plasma, E-plasma, and I-plasma, members of the order Thermoplasmatales. To facilitate reconstruction of genomes from these and other lower-abundance organisms, a combined assembly included unassigned sequences from UBA and all sequences from UBA BS. Random shotgun sequences derived from both ends of approximately 3-kb DNA fragments, and each fragment was likely sampled from a different individual cell with a potentially distinct genome sequence. Therefore, genome reconstructions represent composite sequences. However, single nucleotide polymorphism density was typically very low ( 2 kb were considered). Note that the map is continuous from top to bottom and side to side. (a) Each point represents a sequence fragment; sequences whose origin is known (from assembly information) are colored as indicated below. Unassigned sequences are shown in green. Regions are numbered as follows: (1) ARMAN-2, brown; (2) Ferroplasma (F. 
acidarmanus fer1, dark orange; fer1(env), orange; fer2(env), light orange); (3) I-plasma, purple; (4) Leptospirillum group II, light blue; (5) Leptospirillum group III, pink; (6) A-plasma, navy blue; (7) E-plasma, light purple; (8) G-plasma, turquoise; (9) ARMAN-4, black; (10) ARMAN-5, red. Regions 11 to 17 are novel genomic regions identified in this study: (11) putative Leptospirillum plasmid; (12) A-plasma variant and C-plasma; (13) D-plasma; (14) Leptospirillum group III variant; (15) an actinobacterium; (16) mixed Actinobacteria; (17) mixed low-abundance bacteria, including Sulfobacillus spp., other Firmicutes, and a gammaproteobacterium. (b) Topography (U-Matrix) representing the structure of the underlying tetranucleotide frequency data from (a). 'Elevation' represents the difference in tetranucleotide frequency profile between nodes of the ESOM matrix (see legend); high 'elevations' (brown, white) indicate large differences in tetranucleotide frequency and thus represent natural divisions between taxonomic groups. Inspection of the clustering results in light of assembly information provided a broad measure of the ability of tetranucleotide frequency-based ESOM (tetra-ESOM) to resolve sequences from coexisting populations of the community. To quantify the degree of segregation of fragments from genomes at various evolutionary distances, we adapted a method using fixed point kernel densities (Figure 4; Additional data file 1). We found that sequence fragments from closely related strains or species could not be distinguished. For example, two strains of F. acidarmanus sharing 97% average nucleotide identity (fer1 and fer1(env) [8]) mapped directly on top of each other, as did two types of Leptospirillum group II, which share 95% average nucleotide identity [55] (only one type of Leptospirillum group II is shown in Figure 3 for this reason; Figures 3 and 4). 
Sequences from Ferroplasma types I and II, which share 83% average nucleotide identity and are known to participate in homologous recombination [10], were segregated to some extent by tetra-ESOM, but type II was split and there was no well-defined boundary between the two types. Good separation of Leptospirillum groups II and III was achieved, except for certain genomic regions containing mobile elements, as described further below. Among members of the Thermoplasmatales, populations were distinguished by genome signatures but borders were variably well-defined (Figure 3). In particular, G- and E-plasma were not well resolved. I-plasma, which is quite divergent from the other Thermoplasmatales (Figure 2), was the only member of the Thermoplasmatales for which a distance-based border was clearly delineated. Although genomes with similar %GC were generally more difficult to separate, several genomes with near-identical %GC were easily separated (for example, G-plasma versus Ferroplasma) (Figures 3 and 4). Figure 4 Ability of tetra-ESOM to resolve AMD populations as a function of evolutionary distance (average amino acid identity) and %GC. Black points represent comparisons between genomes with different %GC (> 2% different), red points are genome pairs with 90% were achieved (Additional data file 2). Sensitivity was somewhat lower for Leptospirillum groups II and III due to poor resolution of certain genomic regions between these two populations. When Leptospirillum was considered as a single group, binning sensitivity was comparable to the other reference genomes. Sensitivity decreased notably only when shorter ( 1,000 bp) (data not shown). The noncoding complement of each Thermoplasmatales genome formed a distinct cluster adjacent to noncoding regions of the other Thermoplasmatales. The only outlier to this trend was A-plasma, which has the highest %GC among these organisms. 
Based on U-Matrix background, the distance between noncoding sequences of different genomes is comparable to the distance between noncoding and coding sequences of the same genome. To determine if the presence of noncoding sequence influences binning accuracy in the initial experiments, we calculated the percentage of coding sequence on incorrectly binned fragments from the five reference genomes (5 kb and 1 kb window sizes). For many genomes, the incorrectly binned fragments do indeed have a smaller average percentage of coding sequence. However, this percentage varied widely on incorrectly binned fragments. Only a small fraction of such fragments had a percentage of coding sequence smaller than one standard deviation below the genome-wide average (Additional data file 8). For sequence signatures to differentiate populations in a genome-wide manner, it is necessary that within-genome differences resulting from atypical regions of amino acid and/or synonymous codon usage are smaller than between-genome differences. This issue is especially relevant in AMD, where proteins are under diverse constraints depending on whether they function in the extracellular (around pH 1) or intracellular (around pH 5) environment [65]. Indeed, proteins from the AMD populations in these two fractions have disparate isoelectric points owing to the unique amino acid composition of acid-stable proteins [66]. We identified 106 Leptospirillum group II-UBA proteins that are consistently enriched in the extracellular fraction according to environmental shotgun proteomics data [55,66] and compared sequence signatures of their genes with the other 2,522 Leptospirillum group II genes. No systematic differences were detected via tetra-ESOM, suggesting that genome signatures persist even when gene sequences are influenced by considerable protein-coding constraints (Additional data file 9). Selection for codons that optimize translation rate may also influence codon usage. 
We analyzed genome signatures for the 50 Leptospirillum group II proteins most abundantly detected via environmental shotgun proteomics [55,66]. With the exception of one subset of genes encoding mainly ribosomal proteins (which mapped into the mixed region between Leptospirillum groups II and III), highly expressed genes clustered with the rest of the genome (Additional data file 9). Discussion Through analysis of a deeply sampled and extensively curated community genomic dataset, we have demonstrated that genome signatures can be used to differentiate coexisting microbial populations despite functional and environmental constraints, processes such as lateral gene transfer, and pressures imposed by viral predation that might have diminished them to the point that they are no longer diagnostic. The genome-wide nature of the signatures makes them potentially useful for classification of sequence fragments. Results from our AMD dataset show that the signal can be detected on fragments as small as 500 bp, genome clusters can be defined using fragments as short as 1,400 bp (Additional data file 2) and a small fraction of the genome (Additional data file 3). These findings suggest broad applicability of the tetra-ESOM approach for metagenomic studies. However, in order to understand and predict its utility for binning, it is important to identify sources of genome signatures as well as processes that are likely to diminish the signal. Insights into the sources of distinctive genome signatures It has been suggested that environmental constraints strongly shape nucleotide composition [26,49-51]. If this were the case, two effects should be apparent in genome signatures of AMD populations. First, shared pressures deriving from the extreme AMD environment would drive genome signatures together, potentially obscuring differences between populations. 
Second, since each genome encodes proteins destined for diverse environments (that is, intracellular and extracellular), there should be prominent intra-genome variation of genome signature and scattering of fragments from the same genome into disparate regions of the SOM. Neither of these expectations is met in the AMD dataset. There are vast differences in nucleotide composition between populations, with genomic %GC ranging from 35% (ARMAN-4 and ARMAN-5) to 69% (low-abundance Actinobacteria) and genome signatures forming discrete clusters. Amino acid compositional constraints required for stability of proteins exposed to acidic solutions do not result in sequence signatures that are markedly distinct from the rest of the genome. In other words, within-population differences in genome signature are small relative to differences between populations. Although we do not rule out some environmental influence on genome signatures, we conclude that, in AMD, this influence is not strong enough to obscure differences between populations. Similar community-wide analyses need to be conducted in other systems to determine whether our findings extend to other extremophilic microbial communities. Our results show that genome signatures are related to several traits, including %GC, amino acid composition, synonymous codon usage, and palindrome avoidance. These characteristics are interrelated and further connected to a host of biochemical, ecological, and evolutionary processes (Additional data file 10). Large differences in %GC and/or amino acid composition guarantee distinctive genome signatures but are not required to differentiate genomes. At finer evolutionary scales, where %GC and amino acid composition are not informative, populations can be readily distinguished through subtle differences in tetranucleotide frequency, which correlate with genome-specific synonymous codon usage. 
Tetra-ESOM analyses based on codon usage and tetranucleotide frequency displayed similar clustering resolution, indicating that little signal derives from longer-range characteristics such as codon pair bias. It should be noted, however, that using tetranucleotide frequency rather than codon composition has practical advantages for binning because it is independent of coding strand and reading frame and thus insensitive to errors in gene-calling or frame shifts due to poor quality sequence. These issues are particularly important for short, low-coverage sequence fragments. Although genome signatures are largely manifested through codon composition, the observation that population-specific signatures also occur in non-coding regions (Additional data file 7) suggests a mechanism of generation that is independent of protein coding. We hypothesize this underlying process is mutational bias associated with DNA replication and repair, which exerts directional pressure on nucleotide composition [24]. In fact, between-genome codon biases can be predicted solely by %GC and context-dependent nucleotide biases (that is, mutation rates at each site are dependent on the identity of neighboring nucleotides) calculated from non-coding regions [67,68]. It is interesting to note that non-coding regions mapped into discrete clusters, distinct from coding regions of the same genome or non-coding regions of different genomes, including those with identical %GC. Differences in genome signature of coding and non-coding sequences from the same genome are to be expected based on differing functional constraints on these regions (for example, coding amino acids versus small RNAs or regulatory elements such as promoters). The distinction of non-coding regions from different genomes is consistent with genome-specific mutational biases. An alternative to the mutation bias hypothesis, at least for coding sequences, is that genome signatures are shaped by factors related to translation. 
Changes in codon usage can be driven by changes in the tRNA gene complement [69,70] that may occur, for example, through interaction with plasmids and viruses [71]. However, we found AMD genomes with distinct genome signatures, such as G-plasma, E-plasma, and Ferroplasma, that have only minor differences in tRNA gene content, and these differences do not correspond to observed differences in codon usage. In addition to tRNA gene complement, there may be changes in tRNA gene regulation, which can significantly impact cellular tRNA concentrations and have been correlated with changes in codon usage [72]. Thus, although we cannot rule out a tRNA regulatory influence on genome signatures, our findings suggest that coevolution of tRNA gene content and codon usage is not a primary mechanism underlying the divergence of genome signatures in related AMD populations. Codon bias can also arise as the result of selection for certain codons that are optimal for fast and/or accurate translation [73]. This form of codon bias primarily influences the subset of genes encoding highly expressed proteins, is prevalent for fast-growing organisms [69,74], and correlates with ecological strategy [75]. In fact, a Leptospirillum group II genome fragment encoding nine ribosomal proteins and two translation elongation factors had distinctive tetranucleotide composition, indicating that this mode of codon bias occurs in AMD organisms. However, as commonly construed, translational selection would influence within-genome codon bias, not the genome-wide codon biases that differentiate populations as observed in our study. It is tempting to speculate that differences in ecological strategy (for example, response rate to resource availability [76]) could have genome-wide influence on codon usage, but there is currently no evidence in our dataset to suggest that this is the case. 
Finally, restriction avoidance places another selective genome-wide constraint on DNA composition that may contribute to genome signatures. Under-representation of palindromic tetranucleotides (Figure 6) has been attributed to avoidance of enzymes designed to recognize and degrade foreign DNA [22,32,46]. Our data show that palindrome avoidance contributes to the genome signature but is not the sole or even primary determinant. Most archaeal viruses and bacteriophage have sequence signatures that resemble their hosts, including avoidance of specific subsets of palindromes. However, mismatches between the tetranucleotide signatures of AMDV2 and AMDV5 and their respective hosts point to the lesser importance of palindrome avoidance in these organisms. In the case of AMDV5, other evidence suggests a recent alteration in host range [12]. It is interesting to note that the genomes of archaeal AMD viruses encode several restriction modification (RM) system genes. These may have significance for virus host-interactions [77] and for influencing genome signatures. Broad host range viruses or viruses that jump to new hosts can potentially drive changes in the host sequence signatures if they replace or supplement the restriction systems of the host. Alternatively, the degree of similarity in tetranucleotide signatures of viruses and their hosts may be a function of the extent to which the virus relies upon its host's replication and translation machinery (for example, associated with a lysogenic versus lytic lifestyle) [41,42,63]. Implications for metagenomic, ecological, and evolutionary studies Due to the high levels of diversity in most natural systems, random sequencing approaches yield fragmentary data, often comprising genomic sequences no more than a few kilobases in length. 
While more comprehensive coverage of individual organisms can be achieved by single cell genomics [78-80] or targeted, large-insert approaches [81,82], random shotgun approaches retain two important advantages: the random nature provides insights that are unbiased by preconceived notions of community composition; and population-level variation is captured because each sequencing read derives from a different individual cell. A key challenge for virtually all shotgun metagenomics investigations is the assignment of genome fragments to the organism they derive from. This step links organism to metabolism and function and is essential if we are to understand microbial community dynamics and predict ecosystem level impacts of changes in community membership and structure. Binning is particularly challenging for lower-abundance organisms, which may play keystone roles that are critical to ecosystem function. Thus, our finding that tetra-ESOM can resolve the phylogenetic affiliation of genome fragments on the scale of two mate-paired reads is of great significance. This approach has clear applicability to low-complexity datasets such as those derived from our AMD biofilms, bioreactors [83], and enrichment cultures [84]. In fact, even for the relatively extensively analyzed AMD dataset, it revealed multiple new genomic clusters, including a near complete genome of a novel actinobacterium (GJ Dick et al., in preparation), a putative plasmid, and many discrete but less well-sampled populations. Tetra-ESOM may also provide a powerful method for analysis of unassembled data from complex samples such as soil, seawater, and the human microbiome if representative isolate genomes are available. The feasibility of binning metagenomic sequences from complex samples using reference genomes will increase with current initiatives to fill in the phylogenetic tree with genome sequences from cultivated microorganisms. 
An important advantage of unsupervised, compositional-based approaches such as tetra-ESOM is that gene sequences need not be represented in databases to be identified; only representation of the genome signature is required. This is in contrast to fragment recruitment [13] and BLAST-based binning approaches that only work for homologous sequences. We found that clusters of a few hundred kilobases of sequence (as little as 20% of the genome) were resolved, suggesting that a few fosmids or bacterial artificial chromosomes linked to 16S rRNA genes can be sufficient to serve as a reference to define a bin. Thus, recent progress in using large-insert metagenomic libraries to link 16S rRNA genes to genomic sequence from diverse uncultivated microorganisms is very valuable in this regard [85]. Because the reach of composition-based approaches to binning extends beyond gene content of reference genomes, they hold great promise for identifying and classifying genes from the variable fraction of the pan-genome (present in only a subset of strains or species), an important determinant of pathogenicity and niche differentiation [86-88]. In AMD populations, genome reconstruction has shown that this strain-variable fraction often involves inserted plasmid and virus sequences [8,9]. In the current study, these integrated elements clustered either with the host genome or in regions shared between different species or genera. Since horizontally transferred DNA is rapidly converted to the genome signature of its new host [22,28,89], the extent to which such genomic regions reflect the genome-wide signature of nucleotide composition is likely a function of the donor of the genetic material and how recently they were acquired. Recently acquired sequences with distinctive tetranucleotide patterns may bin incorrectly, and unexpected binning outcomes can be used to identify laterally transferred regions [62,90]. 
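The idea that compositionally atypical regions can point to recently acquired DNA can be illustrated with a simple scan. The sketch below is a stand-in and not the tetra-ESOM method: it compares the tetranucleotide frequencies of fixed-size windows against the profile of the whole sequence using plain Euclidean distance. The function names, the window size, the step, and the threshold are arbitrary assumptions chosen for the toy input.

```python
from collections import Counter
from itertools import product
from math import sqrt
from typing import List, Tuple

TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def signature(seq: str) -> List[float]:
    """Single-strand tetranucleotide frequencies in TETRAMERS order."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[t] for t in TETRAMERS)
    if total == 0:
        return [0.0] * len(TETRAMERS)
    return [counts[t] / total for t in TETRAMERS]


def atypical_windows(seq: str, window: int, step: int,
                     threshold: float) -> List[Tuple[int, float]]:
    """Return (start, distance) for windows far from the sequence-wide profile."""
    reference = signature(seq)
    flagged = []
    for start in range(0, len(seq) - window + 1, step):
        sig = signature(seq[start:start + window])
        # Euclidean distance between window profile and overall profile
        dist = sqrt(sum((a - b) ** 2 for a, b in zip(sig, reference)))
        if dist > threshold:
            flagged.append((start, dist))
    return flagged


# Toy input: a repetitive "host" stretch followed by a compositionally odd insert.
toy = "ACGT" * 60 + "A" * 40
hits = atypical_windows(toy, window=40, step=40, threshold=0.5)
flagged_starts = [start for start, _ in hits]  # -> [240]
```

On real data the threshold would have to be calibrated against within-genome variation, and, consistent with the text, regions that have already converged on the host signature would not be flagged by such a scan.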
Although the tetra-ESOM method works well to separate sequence fragments from organisms distinct at the genus or higher level, it has some limitations. Tetra-ESOM is generally unable to distinguish closely related species or strains. An important question, especially for more diverse samples, is whether limitations in genome sequence signature space will impose an inherent constraint on the number of populations that can be resolved. There are a staggering 6 × 10^222 ways to code for a typical protein in our samples (based on an average protein size of 467 amino acids and assuming an average of 3 possible ways to code for any amino acid). This richness of protein coding space suggests ample capacity for numerous genome signatures. To date, SOMs have shown promising results in resolving up to 81 complete genomes, in successfully classifying fragments of 1,502 genomes into phylogenetic groups, and in visualizing phylogenetic clustering of sequences in complex environmental samples [46]. However, it remains difficult to assess the accuracy and phylogenetic resolution of oligonucleotide-based SOMs on metagenomic datasets from diverse natural microbial communities. Another concern is computational demand. Continued increases in processor speeds will likely need to be supplemented with more efficient and/or accurate algorithms such as the recently introduced hyperbolic SOM [91] and growing SOM [59].

Conclusions

Bacterial, archaeal, and viral populations in the AMD biofilm community have genome-wide signatures of nucleotide composition that are effectively captured and visualized through self-organizing maps of tetranucleotide frequency. We conclude that even under extremely acidic conditions, shared environmental pressure does not obscure genome signatures of nucleotide composition.
Our data point to pervasive mechanisms of generating and maintaining genome signatures; although a variety of factors and processes contribute, we propose that mutational bias is the primary underlying mechanism driving the divergence of genome signature between closely related organisms. The resulting signal, evident through synonymous codon usage, is genome-wide and sufficiently diagnostic to classify fragmentary metagenomic data from coexisting populations of a natural microbial community at approximately the genus level. However, distinguishing features of genome signatures may be subtle, being masked by within-genome heterogeneity and the multidimensional nature of tetranucleotide frequency patterns. Tetra-ESOM is a key method for visualizing and exposing these potentially weak signals. Being unsupervised, it requires no database representation of the organisms present. Visualization of the data structure highlights differences between populations and reveals atypical regions corresponding to biologically meaningful genomic features such as mobile elements or previously unrecognized genotypes present at low abundance in the community. When employed in conjunction with complementary methods such as genomic assembly and analysis of phylogenetic marker genes, genome signatures offer powerful perspectives on metagenomic data.

Materials and methods

Sample collection, construction of genomic libraries, sequencing, and community genomic assembly

An overview of the samples and methodology used in this study is provided in Figure 1. Sample collection, DNA extraction, random fragmentation and cloning of approximately 3-kb fragments, Sanger sequencing, assembly, and curation of community genomics data were performed using the phred/phrap/consed package as detailed previously [12,55]. The combined UBAs nonLeptos dataset was constructed by assembling sequencing reads derived from both the UBA BS and UBA biofilm samples (with UBA reads previously assigned to Leptospirillum spp. removed). This included 229,082 reads and approximately 210 Mb of total sequence, which assembled into 15,929 contigs and 36.6 Mb of composite sequence.

Phylogenetic analysis

The phylogenetic tree of 16S rRNA genes was constructed by neighbor joining (default parameters) with the ARB software package [92] and 'SILVA SSU ref' database [93].

Calculation of tetranucleotide frequencies and clustering by ESOM

Tetranucleotide frequencies were determined for each assembled contig using a custom Perl script. Frequencies were calculated with a 1-bp sliding window and pairs of reverse complementary tetranucleotides were summed in order to avoid strand bias. Longer contigs and assembled genomes were split into 5-kb windows and only contigs longer than 2 kb were considered unless noted otherwise. To assess binning accuracy, data points (representing contigs/windows) are colored according to their genome of origin (when known), but this information is not available to the clustering process. Contigs were clustered by tetranucleotide frequency utilizing Databionics ESOM Tools [94]. The input for tetra-ESOM was a 136-dimensional vector (representing the frequencies of the 136 unique reverse complement tetranucleotide pairs, normalized for contig length) for each contig/window. These raw frequencies were transformed with the 'Robust ZT' option built into Databionics ESOM Tools, which normalizes the data using robust estimates of mean and variance. Data were permuted before each run to avoid errors due to sampling order. Maps were toroidal (borderless) with Euclidean grid distance and dimensions scaled from the default map size (50 × 82) as a function of the number of data points, to a ratio of approximately 5.5 map nodes per data point. For example, a typical clustering with approximately 7,500 data points was run on a map with dimensions 155 × 255. Training was conducted with the K-Batch algorithm (k = 0.15%) for 20 training epochs.
The standard best match search method was used with local best match search radius of 8. Other training parameters were as follows: Gaussian weight initialization method; Euclidean data space function; starting value for training radius of 50 with linear cooling to 1; starting value for learning rate of 0.5 with linear cooling to 0.1; Gaussian kernel function.

Clustering resolution versus evolutionary distance

To quantify the degree of clustering between closely related genomes, we analyzed SOM maps using fixed point kernel densities [95]. Spatial data from the SOM was imported into ArcGIS (ESRI Software) and clusters were defined using Hawth's Analysis Tools for ArcGIS [96]. Cluster boundaries were determined using density estimators that captured 90% of data points from each genome (Additional data file 1). We then calculated separation between genomes as a percentage (Non-overlapping points/Total number of points) for two bins being compared. Average amino acid identity was calculated as described previously [1].

Predicted tetranucleotide frequency

The predicted frequency of each unique pair of reverse complementary tetranucleotides was calculated based on genome-wide frequencies of potentially contributing codons. As shown in Figure 5, for any given tetranucleotide there are 12 potentially associated codons depending on coding strand and reading frame. Four codons (numbers 3, 4, 9, and 10 in Figure 5) are fully captured by the tetranucleotide, four are partially captured at two of three positions (numbers 2, 5, 8, and 11), and four are partially captured at one of three positions (codons 1, 6, 7, and 12). Each of these three classes is weighted according to its contribution: 1, 2/3, and 1/3, respectively. For partially captured codons, contributions of all possibilities were taken into account; for example, for codon number 5 (TGX) in Figure 5 there are four possible codons: TGA, TGT, TGC, and TGG.
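The tetranucleotide frequency vectors used as ESOM input (1-bp sliding window, reverse complementary tetranucleotides summed, 136 dimensions) can be illustrated with a minimal Python sketch. This is not the custom Perl script used in the study. The function names are hypothetical, "normalized for contig length" is assumed here to mean division by the number of valid windows, windows containing characters other than A, C, G, or T are assumed to be skipped, and the 'Robust ZT' transformation is not reproduced.

```python
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Return the reverse complement of an uppercase DNA string."""
    return kmer.translate(COMPLEMENT)[::-1]


def canonical(kmer):
    """Represent a k-mer and its reverse complement by the lexicographically smaller one."""
    return min(kmer, reverse_complement(kmer))


# 256 tetramers collapse to 136 classes once reverse complements are merged
# (16 tetramers are their own reverse complement: (256 - 16) / 2 + 16 = 136).
CANONICAL_TETRAMERS = sorted({canonical("".join(p)) for p in product("ACGT", repeat=4)})


def tetranucleotide_frequencies(sequence):
    """Return the 136 reverse-complement-merged tetranucleotide frequencies.

    Assumptions (not taken from the source): counts are divided by the number
    of valid windows, and windows containing anything other than A, C, G, or T
    (for example 'N' spacers) are skipped.
    """
    seq = sequence.upper()
    counts = Counter()
    for i in range(len(seq) - 3):  # 1-bp sliding window
        kmer = seq[i:i + 4]
        if set(kmer) <= set("ACGT"):
            counts[canonical(kmer)] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(CANONICAL_TETRAMERS)
    return [counts[k] / total for k in CANONICAL_TETRAMERS]
```

Because each tetranucleotide is counted together with its reverse complement, a fragment and its reverse complement yield the same vector, which is the strand-bias property the methods describe.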
Binning performance on variable length sequence fragments and subsampled genomes

Sensitivity (percentage of fragments from each genome correctly identified) and precision (percentage of fragments in each bin belonging to the correct genome) of binning were calculated for a subset of assembled genomes that are deeply sampled and manually curated (Table 1; Additional data file 2). Fragment size was varied in two ways: all contigs were broken into a given size (2, 4, 6, or 10 kb); or 10% of each genome was randomly selected and fragmented (0.5, 1.0, 1.5, or 2.0 kb) while the remaining fraction of the genome was fragmented into 5-kb windows (Additional data file 2). Bin territories were defined manually, using boundaries apparent via distance-based background topology (U-Matrix) as guidelines. It is important to note this method allows data points between bins or near borders to remain unclassified. Analysis of subsampled genomes was conducted with assembled genomes only - unassigned fragments were excluded to prevent them from contributing to definition of bins. Genomes were fragmented into 5-kb sequences, which were then randomly selected to obtain the indicated percentage of the genome.

Sequence signatures in coding versus non-coding regions

Intergenic regions were extracted and concatenated, with 'N's inserted between regions to avoid generation of erroneous tetranucleotides. Intergenic regions were grouped by size (in 20-bp bins) to monitor variance in sequence signatures from intergenic regions of differing lengths. All coding sequences were similarly concatenated with interleaving 'N's. Concatenated coding and non-coding regions were then broken into 5-kb windows and run against the same background dataset of assembled genomes and unassigned sequences as usual.
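The sensitivity and precision definitions given under "Binning performance" above can be made concrete with a short Python sketch. The function name and data layout are hypothetical. It is assumed that each bin is named after the genome it is meant to capture, and that fragments left unclassified count against sensitivity but not against precision; the source does not state how unclassified fragments were treated.

```python
from collections import Counter


def binning_metrics(true_genome, assigned_bin):
    """Return {genome: (sensitivity %, precision %)}.

    true_genome:  genome label of each fragment.
    assigned_bin: bin label of each fragment; None marks a fragment left
                  unclassified (between bins or near a border).
    Assumption: a bin carries the name of the genome it is meant to capture.
    """
    per_genome_total = Counter(true_genome)
    per_bin_total = Counter(b for b in assigned_bin if b is not None)
    correct = Counter(g for g, b in zip(true_genome, assigned_bin) if g == b)

    metrics = {}
    for genome, total in per_genome_total.items():
        # Sensitivity: share of the genome's fragments placed in its own bin.
        sensitivity = 100.0 * correct[genome] / total
        # Precision: share of the bin's fragments that come from that genome.
        in_bin = per_bin_total[genome]
        precision = 100.0 * correct[genome] / in_bin if in_bin else float("nan")
        metrics[genome] = (sensitivity, precision)
    return metrics


# Toy example: four fragments from genome A, two from genome B.
truth = ["A", "A", "A", "A", "B", "B"]
bins = ["A", "A", "B", None, "B", "B"]
result = binning_metrics(truth, bins)
```

In this toy example, genome A has two of four fragments in its own bin (sensitivity 50%) and bin A contains only A fragments (precision 100%), while bin B picks up one A fragment and so has lower precision despite full sensitivity.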
Sequence signatures in extracellular and highly expressed protein-coding genes

Shotgun proteomics data were obtained for Leptospirillum group II extracellular and whole cell fractions from the ABend, ABfront, and UBA locations of the Richmond mine [55,66]. Proteins were defined as enriched in the extracellular fraction if, in at least two of the three samples, they were only detected in the extracellular fraction, or the ratio of spectral counts from extracellular to intracellular fraction was > 2. The 50 most abundantly expressed proteins were identified on the basis of tandem mass spectrometry (MS/MS) spectral counts. ESOM analyses of genes encoding extracellular and highly expressed proteins were both conducted as described above; open reading frames were concatenated, interleaved with 'N's, then split into 5-kb windows and analyzed along with the full dataset.

Nucleotide sequence accession numbers

This Whole Genome Shotgun project has been deposited at DDBJ/EMBL/GenBank under the project accessions ACXJ00000000 (unassigned contigs), ACXK00000000 (A-plasma), ACXL00000000 (E-plasma), ACXM00000000 (I-plasma), and ACVJ00000000 (ARMAN-2, described in detail in BJ Baker et al., in preparation). The versions described in this paper are the first versions, ACXJ01000000, ACXK01000000, ACXL01000000, ACXM01000000, and ACVJ01000000.

Abbreviations

AMD: acid mine drainage; ESOM: emergent self-organizing map; %GC: percentage content of guanine plus cytosine; SOM: self-organizing map.

Authors' contributions

GJD, AFA, SLS, and JFB conceived and designed the experiments. GJD, BJB, SLS, APY, and BCT performed the experiments. GJD, AFA, SLS, BCT, BJB, APY, and JFB analyzed the data. GJD and JFB wrote the paper.
Additional data files

The following additional data are available with the online version of this paper: a figure showing automated clustering of tetra-ESOM data using fixed point kernel densities (Additional data file 1); an evaluation of binning accuracy based on deeply sampled metagenomes for which contigs are assigned to genomes with a high degree of confidence (Additional data file 2); binning accuracy calculated for genomes that were sampled to varying extents of completeness (10 to 100%) (Additional data file 3); a heat map of average genome-wide frequency of each tetranucleotide for each genome, including bacteria, archaea, viruses, and a putative plasmid (Additional data file 4); comparison of tetra-ESOMs of assembled genomes based on amino acid composition, codon composition, and tetranucleotide frequency (Additional data file 5); a figure showing that the observed difference in frequency of each tetranucleotide between pairs of genomes correlates with the difference predicted based on codon composition (Additional data file 6); a figure showing tetra-ESOM of deeply sampled genomes for which coding and noncoding regions were separated (Additional data file 7); a figure showing for incorrectly binned fragments the percentage of sequence coding for genes in comparison with the genome-wide coding percentage (Additional data file 8); a figure showing tetra-ESOM of Leptospirillum group II genes coding for highly expressed proteins or proteins enriched in the extracellular fraction analyzed as separate fractions from the rest of the genome (Additional data file 9); a schematic of processes and factors influencing genome signature (Additional data file 10).

                Author and article information

                BioMed Central (London)
                20 January 2015
                : 3
                : 1
                : 1
                Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette, 4362 Luxembourg
                Institute of Computing Science, Poznan University of Technology, Poznan, 60-965 Poland
                Computer Science and Communications Research Unit, University of Luxembourg, Luxembourg, 1359 Luxembourg
                Pattern Recognition and Bioinformatics Group, Delft University of Technology, CD Delft, 2628 Netherlands
                Adobe Research, Adobe, San Jose, 95110 USA
                © Laczny et al.; licensee BioMed Central. 2015

                This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

                : 15 September 2014
                : 18 December 2014

                metagenomics, machine learning, visualization, binning

