      Composite Metagenome-Assembled Genomes Reduce the Quality of Public Genome Repositories

      letter
      Alon Shaiber, A. Murat Eren
      mBio
      American Society for Microbiology

          Abstract

          In their recent study, Espinoza et al. employ genome-resolved metagenomics to investigate supragingival plaque metagenomes of 88 individuals (1). The 34 metagenome-assembled genomes (MAGs) that the authors report include those that resolve to clades that have largely evaded cultivation efforts, such as Gracilibacteria (formerly GN02) and Saccharibacteria (formerly TM7) of the recently described Candidate Phyla Radiation (2). Generating new genomic insights into the understudied members of the human oral cavity is of critical importance for a comprehensive understanding of the microbial ecology and functioning of this biome, and we acknowledge the contribution of the authors on this front.

          However, the redundant occurrence of bacterial single-copy core genes suggests that more than half of the MAGs that Espinoza et al. report are composite genomes that do not meet the recent quality guidelines suggested by the community (3). Composite genomes that aggregate sequences originating from multiple distinct populations can yield misleading insights when treated and reported as single genomes (4). To briefly demonstrate their composite nature, we refined some of the key Espinoza et al. MAGs through a previously described approach (5) and the data that the authors kindly provided (1). We found that MAG IV.A, MAG IV.B, and MAG III.A described multiple discrete populations with distinct distribution patterns across individuals (Fig. 1). A phylogenomic analysis of refined MAG IV.A genomes resolved to the candidate phylum Absconditabacteria (formerly SR1) and not to Gracilibacteria as reported by Espinoza et al. (Fig. 1D). A pangenomic analysis of the original and refined MAG III.A genomes with other publicly available Saccharibacteria genomes showed a 7-fold increase in the number of single-copy core genes (Fig. 1E).
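
          To illustrate why a composite bin can depress the number of single-copy core gene clusters in a pangenome, the following is a minimal sketch on made-up data. The genome names, gene cluster names, and counts are hypothetical and are not taken from the letter or from anvi'o output; the function only encodes the definition of a single-copy core gene cluster as one to which every genome contributes exactly one gene.

```python
from typing import Dict, List

# Gene cluster name -> {genome name -> number of genes contributed}.
GeneClusterCounts = Dict[str, Dict[str, int]]


def single_copy_core_clusters(counts: GeneClusterCounts, genomes: List[str]) -> List[str]:
    """Return the gene clusters to which every genome contributes exactly one gene."""
    return sorted(
        cluster
        for cluster, per_genome in counts.items()
        if all(per_genome.get(genome, 0) == 1 for genome in genomes)
    )


# Hypothetical pangenome with two reference genomes and one composite bin.
# The composite bin carries extra copies because it mixes several populations.
with_composite: GeneClusterCounts = {
    "GC_001": {"ref_1": 1, "ref_2": 1, "bin": 2},
    "GC_002": {"ref_1": 1, "ref_2": 1, "bin": 1},
    "GC_003": {"ref_1": 1, "ref_2": 1, "bin": 3},
    "GC_004": {"ref_1": 1, "ref_2": 0, "bin": 1},
}

# The same hypothetical pangenome after the bin has been refined to a single
# population, so that it contributes one gene per cluster.
with_refined: GeneClusterCounts = {
    "GC_001": {"ref_1": 1, "ref_2": 1, "bin": 1},
    "GC_002": {"ref_1": 1, "ref_2": 1, "bin": 1},
    "GC_003": {"ref_1": 1, "ref_2": 1, "bin": 1},
    "GC_004": {"ref_1": 1, "ref_2": 0, "bin": 1},
}

genomes = ["ref_1", "ref_2", "bin"]
before = single_copy_core_clusters(with_composite, genomes)
after = single_copy_core_clusters(with_refined, genomes)
print(len(before), len(after))
```

          In this toy case the composite bin disqualifies two clusters that become single-copy core clusters once the bin is refined, which is the same direction of effect as the increase reported for MAG III.A.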
          These findings demonstrate the potential implications of composite MAGs in comparative genomics studies where single-copy core genes are commonly used to infer diversity, phylogeny, and taxonomy (6). Composite MAGs can also lead to inaccurate ecological insights through inflated abundance and prevalence estimates. For instance, the original MAG III.A recruited a total of 1,849,593 reads from Espinoza et al. metagenomes; however, the most abundant refined III.A genome (MAG III.A.2, Fig. 1C) recruited only 629,291 reads.

          FIG 1 Refinement of three composite genome bins. (A to C) The top left corners of these panels display the original name of a given Espinoza et al. MAG (see Table 1 in the original study) and its estimated completion and redundancy (C/R) based on a bacterial single-copy core gene collection (10). Each concentric circle represents one of the 88 metagenomes in the original study, dendrograms show hierarchical clustering of contigs based on sequence composition and differential mean coverage across metagenomes (using Euclidean distance and Ward’s method), and each data point represents the read recruitment statistic of a given contig in a given metagenome. Arcs at the outermost layers mark contigs that belong to a refined bin along with their new completion and redundancy estimates (C/R). (D) The phylogenomic tree organizes genomes based on 37 concatenated ribosomal proteins. Coloring of genome names matches their taxonomy in NCBI, and branch colors match the consensus taxonomy of genomes they represent. Espinoza et al. reported MAG IV.A as Gracilibacteria (hence the red color); however, this phylogenomic analysis places refined MAGs under Absconditabacteria. (E) Pangenomic analysis of Espinoza et al. Saccharibacteria MAG III.A before (left) and after (right) refinement together with the Saccharibacteria genomes from panel D. Pangenomes describe 575 and 497 gene clusters, respectively, where each concentric circle represents a genome and bars correspond to the number of genes that a given genome is contributing to a given gene cluster (the maximum value is set to 2 for readability). Outermost layers mark single-copy core gene clusters to which every genome contributes precisely a single gene. We used Bowtie2 (11) to recruit reads from metagenomes, and anvi’o (12) to visualize and refine Espinoza et al. MAGs. FAMSA (13) aligned anvi’o-reported ribosomal protein amino acid sequences, trimAl (14) curated them, and IQ-TREE (15) computed the tree for the phylogenomic analysis. Anvi’o used DIAMOND (16) and MCL (17) algorithms to determine pangenomes. A reproducible bioinformatics workflow and FASTA files for refined MAGs are available at http://merenlab.org/data/refining-espinoza-mags.

          Co-assembly of a large number of metagenomes that contain very closely related populations often hinders confident assignments of shared contigs into individual bins. Nevertheless, even when proper refinement is not possible, reporting composite MAGs as single genomes should be avoided. As of today, highly composite Espinoza et al. MAGs (Fig. 1 in this letter and Table 1 in the work of Espinoza et al.) are available as single genomes in public databases of the National Center for Biotechnology Information (NCBI).

          The rapidly increasing number of MAGs in public databases already competes with the total number of microbial isolate genomes (3), and increasingly frequent studies that report large collections of MAGs offer a glimpse of the future (7-9). Despite their growing availability, metagenomes are inherently complex and demand researchers to orchestrate an intricate combination of rapidly evolving computational tools and approaches with many alternatives to reconstruct, characterize, and finalize MAGs.

          We must continue to champion studies such as the one by Espinoza et al. for their contribution to our collective effort to shed light on the darker branches of the ever-growing Tree of Life. At the same time, editors and reviewers of genome-resolved metagenomics studies should properly scrutinize the quality and accuracy of MAGs prior to their publication. A systematic failure at this will reduce the quality of public genome repositories while yielding adverse effects such as misleading insights into novel microbial groups and reduced trust among scientists in findings that emerge from genome-resolved metagenomics.
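
          As a rough illustration of the kind of check the letter asks editors and reviewers to make, the sketch below estimates completion and redundancy (C/R) for a bin from its single-copy core gene hits. The formulas are an assumed, commonly used convention and may differ from what a given tool computes; the gene names and the 10% redundancy cutoff are hypothetical placeholders, not values stated in the letter.

```python
from collections import Counter
from typing import Iterable, Tuple


def completion_and_redundancy(
    scg_hits: Iterable[str], scg_collection: Iterable[str]
) -> Tuple[float, float]:
    """Estimate percent completion and redundancy from single-copy core gene hits.

    Assumed definitions (one common convention, not a specific tool's formula):
      completion = expected genes seen at least once / expected genes
      redundancy = copies beyond the first, summed over genes / expected genes
    """
    expected = set(scg_collection)
    if not expected:
        raise ValueError("the single-copy core gene collection is empty")
    # Count only hits that belong to the expected collection.
    hits = Counter(gene for gene in scg_hits if gene in expected)
    completion = 100.0 * len(hits) / len(expected)
    redundancy = 100.0 * sum(n - 1 for n in hits.values()) / len(expected)
    return completion, redundancy


def looks_composite(redundancy: float, max_redundancy: float = 10.0) -> bool:
    """Flag a bin whose redundancy exceeds a cutoff (hypothetical default)."""
    return redundancy > max_redundancy


# Hypothetical four-gene collection and the hits found in one bin.
collection = ["gene_A", "gene_B", "gene_C", "gene_D"]
hits_in_bin = ["gene_A", "gene_A", "gene_B", "gene_B", "gene_C"]

completion, redundancy = completion_and_redundancy(hits_in_bin, collection)
print(completion, redundancy, looks_composite(redundancy))
```

          A high redundancy value alone does not say which contigs belong to which population; it only signals that a bin likely needs the kind of manual refinement described above before it is reported as a single genome.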

          Most cited references

          No evidence for extensive horizontal gene transfer in the genome of the tardigrade Hypsibius dujardini.

          Tardigrades are meiofaunal ecdysozoans that are key to understanding the origins of Arthropoda. Many species of Tardigrada can survive extreme conditions through cryptobiosis. In a recent paper [Boothby TC, et al. (2015) Proc Natl Acad Sci USA 112(52):15976-15981], the authors concluded that the tardigrade Hypsibius dujardini had an unprecedented proportion (17%) of genes originating through functional horizontal gene transfer (fHGT) and speculated that fHGT was likely formative in the evolution of cryptobiosis. We independently sequenced the genome of H. dujardini. As expected from whole-organism DNA sampling, our raw data contained reads from nontarget genomes. Filtering using metagenomics approaches generated a draft H. dujardini genome assembly of 135 Mb with assembly metrics superior to those of the previously published assembly. Additional microbial contamination likely remains. We found no support for extensive fHGT. Among 23,021 gene predictions we identified 0.2% strong candidates for fHGT from bacteria and 0.2% strong candidates for fHGT from nonmetazoan eukaryotes. Cross-comparison of assemblies showed that the overwhelming majority of HGT candidates in the Boothby et al. genome derived from contaminants. We conclude that fHGT into H. dujardini accounts for at most 1-2% of genes and that the proposal that one-sixth of tardigrade genes originate from functional HGT events is an artifact of undetected contamination.

            Identifying contamination with advanced visualization and analysis practices: metagenomic approaches for eukaryotic genome assemblies

            High-throughput sequencing provides a fast and cost-effective means to recover genomes of organisms from all domains of life. However, adequate curation of the assembly results against potential contamination of non-target organisms requires advanced bioinformatics approaches and practices. Here, we re-analyzed the sequencing data generated for the tardigrade Hypsibius dujardini, and created a holistic display of the eukaryotic genome assembly using DNA data originating from two groups and eleven sequencing libraries. By using bacterial single-copy genes, k-mer frequencies, and coverage values of scaffolds we could identify and characterize multiple near-complete bacterial genomes from the raw assembly, and curate a 182 Mbp draft genome for H. dujardini supported by RNA-Seq data. Our results indicate that most contaminant scaffolds were assembled from Moleculo long-read libraries, and most of these contaminants differed between library preparations. Our re-analysis shows that visualization and curation of eukaryotic genome assemblies can benefit from tools designed to address the needs of today’s microbiologists, who are constantly challenged by the difficulties associated with the identification of distinct microbial genomes in complex environmental metagenomes.

              FAMSA: Fast and accurate multiple sequence alignment of huge protein families

              Rapid development of modern sequencing platforms has contributed to the unprecedented growth of protein family databases. The abundance of sets containing hundreds of thousands of sequences is a formidable challenge for multiple sequence alignment algorithms. The article introduces FAMSA, a new progressive algorithm designed for fast and accurate alignment of thousands of protein sequences. Its features include the utilization of the longest common subsequence measure for determining pairwise similarities, a novel method of evaluating gap costs, and a new iterative refinement scheme. Importantly, its implementation is highly optimized and parallelized to make the most of modern computer platforms. As a result, quality indicators, i.e. sum-of-pairs and total-column scores, show FAMSA to be superior to competing algorithms, such as Clustal Omega or MAFFT, for datasets exceeding a few thousand sequences. This quality does not come at the expense of time or memory requirements, which are an order of magnitude lower than those of the existing solutions. For example, a family of 415,519 sequences was analyzed in less than two hours and required no more than 8 GB of RAM. FAMSA is available for free at http://sun.aei.polsl.pl/REFRESH/famsa.

                Author and article information

                Contributors
                Role: Editor
                Journal
                mBio
                American Society for Microbiology (1752 N St., N.W., Washington, DC )
                2150-7511
                4 June 2019
                May-Jun 2019
                Volume 10, Issue 3, e00725-19
                Affiliations
                [a ]Graduate Program in Biophysical Sciences, University of Chicago, Chicago, Illinois, USA
                [b ]Department of Medicine, University of Chicago, Chicago, Illinois, USA
                [c ]Josephine Bay Paul Center, Marine Biological Laboratory, Woods Hole, Massachusetts, USA
                Stanford University
                Author notes
                Address correspondence to Alon Shaiber, ashaiber@uchicago.edu, or A. Murat Eren, meren@uchicago.edu.
                Author information
                https://orcid.org/0000-0001-9013-4827
                Article
                mBio00725-19
                10.1128/mBio.00725-19
                6550520
                31164461
                Copyright © 2019 Shaiber and Eren.

                This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International license.

                History
                Page count
                Figures: 1, Tables: 0, Equations: 0, References: 17, Pages: 3, Words: 1782
                Categories
                Letter to the Editor
                Host-Microbe Biology
                Custom metadata
                May/June 2019

                Life sciences
