
      VirFinder: a novel k-mer based tool for identifying viral sequences from assembled metagenomic data

      research-article


          Abstract

          Background

          Identifying viral sequences in mixed metagenomes containing both viral and host contigs is a critical first step in analyzing the viral component of samples. Current tools for distinguishing prokaryotic virus and host contigs primarily use gene-based similarity approaches. Such approaches can significantly limit results, especially for short contigs that have few predicted proteins or that lack proteins with similarity to previously known viruses.

          Methods

          We have developed VirFinder, the first k-mer frequency based, machine learning method for virus contig identification that entirely avoids gene-based similarity searches. VirFinder instead identifies viral sequences based on our empirical observation that viruses and hosts have discernibly different k-mer signatures. VirFinder’s performance in correctly identifying viral sequences was tested by training its machine learning model on sequences from host and viral genomes sequenced before 1 January 2014 and evaluating on sequences obtained after 1 January 2014.
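
The two ideas in this paragraph, a k-mer frequency signature per contig and a train/test split by sequencing date, can be illustrated with a minimal Python sketch. The abstract does not state the value of k, how k-mers are counted, or which classifier is used, so everything below is an assumption made for illustration: the function names `kmer_frequencies` and `split_by_date`, the default `k=4`, the record layout, and the choice to leave out records dated exactly on the cutoff. It is not VirFinder's implementation.

```python
from collections import Counter
from datetime import date
from itertools import product


def kmer_frequencies(seq, k=4):
    """Return normalized k-mer frequencies of seq over the ACGT alphabet.

    Windows containing any other character (e.g. N) are skipped.
    k=4 is an illustrative default, not a value taken from the abstract.
    """
    seq = seq.upper()
    alphabet = set("ACGT")
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= alphabet:
            counts[word] += 1
    total = sum(counts.values())
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    return {w: (counts[w] / total if total else 0.0) for w in all_kmers}


def split_by_date(records, cutoff=date(2014, 1, 1)):
    """Split (name, sequencing_date, sequence) records around a cutoff date.

    Records dated before the cutoff go to training and records dated after it
    go to evaluation; records on the cutoff day itself are left out
    (an assumption, since the abstract only says "before" and "after").
    """
    train = [r for r in records if r[1] < cutoff]
    test = [r for r in records if r[1] > cutoff]
    return train, test


# Toy records with made-up names, dates, and sequences.
records = [
    ("genome_a", date(2012, 5, 1), "ACGTACGTAC"),
    ("genome_b", date(2015, 3, 9), "GGGTTTAACC"),
]
train, test = split_by_date(records)
features = [kmer_frequencies(seq, k=2) for _, _, seq in train]
```

Each feature dictionary has one entry per possible k-mer (4^k in total), so contigs of different lengths map to vectors of the same size that a classifier could consume.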

          Results

          VirFinder had significantly better rates of identifying true viral contigs (true positive rates, TPRs) than VirSorter, the current state-of-the-art gene-based virus classification tool, when evaluated with either contigs subsampled from complete genomes or contigs assembled from a simulated human gut metagenome. For example, for contigs subsampled from complete genomes, VirFinder had 78-, 2.4-, and 1.8-fold higher TPRs than VirSorter for 1, 3, and 5 kb contigs, respectively, at the same false positive rates as VirSorter (0, 0.003, and 0.006, respectively). VirFinder thus works considerably better than VirSorter for small contigs. Furthermore, VirFinder identified several recently sequenced virus genomes (after 1 January 2014) that VirSorter did not and that have no nucleotide similarity to previously sequenced viruses, demonstrating VirFinder’s potential advantage in identifying novel viral sequences. Application of VirFinder to a set of human gut metagenomes from healthy individuals and liver cirrhosis patients revealed higher viral diversity in healthy individuals than in cirrhosis patients. We also identified contig bins containing crAssphage-like contigs with higher abundance in healthy individuals and a putative Veillonella genus prophage associated with cirrhosis patients.
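
Comparing two tools "at the same false positive rate" means fixing the FPR and reading off the TPR that a scoring method reaches at that operating point. The following Python sketch shows one way to compute this. The function name `tpr_at_fpr`, the score convention (higher means more virus-like), and the toy scores and labels are all assumptions for illustration; they are not data or code from the study.

```python
def tpr_at_fpr(scores, labels, target_fpr):
    """Return the highest TPR reachable while keeping FPR <= target_fpr.

    scores: one number per contig, higher meaning more virus-like (assumed).
    labels: 1 for a true viral contig, 0 for a host contig.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_tpr = 0.0
    # Lower the threshold step by step; FPR can only grow as it drops.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        if n_neg and fp / n_neg > target_fpr:
            break
        best_tpr = tp / n_pos if n_pos else 0.0
    return best_tpr


# Made-up toy scores for six contigs (three viral, three host).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]

strict = tpr_at_fpr(scores, labels, target_fpr=0.0)
relaxed = tpr_at_fpr(scores, labels, target_fpr=0.34)
```

A fold difference such as those quoted above would then be the ratio of two such TPR values, one per tool, taken at the same `target_fpr`.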

          Conclusions

          This innovative k-mer based tool complements gene-based approaches and will significantly improve prokaryotic viral sequence identification, especially for metagenomic-based studies of viral ecology.

          Electronic supplementary material

          The online version of this article (doi:10.1186/s40168-017-0283-5) contains supplementary material, which is available to authorized users.


                Author and article information

                Contributors
                nahlgren@clarku.edu
                fsun@usc.edu
                Journal
                Microbiome
                BioMed Central (London)
                ISSN 2049-2618
                6 July 2017
                Volume 5, Article 69
                Affiliations
                [1] ISNI 0000 0001 2156 6853, GRID grid.42505.36, Molecular and Computational Biology Program, University of Southern California, 1050 Childs Way, Los Angeles, CA 90089, USA
                [2] ISNI 0000 0001 2156 6853, GRID grid.42505.36, Department of Biological Sciences, University of Southern California, 3616 Trousdale Pkwy, Los Angeles, CA 90089, USA
                [3] ISNI 0000 0001 0125 2443, GRID grid.8547.e, Center for Computational Systems Biology, Fudan University, 200433 Shanghai, China
                [4] ISNI 0000 0004 0486 8069, GRID grid.254277.1, Present address: Biology Department, Clark University, 950 Main St, Worcester, MA 01610, USA
                Article
                283
                10.1186/s40168-017-0283-5
                5501583
                28683828
                993f1108-33dd-469f-97eb-1092fc4e3632
                © The Author(s). 2017

                Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

                History
                : 31 January 2017
                : 5 June 2017
                Funding
                Funded by: FundRef http://dx.doi.org/10.13039/100000057, National Institute of General Medical Sciences
                Award ID: R01GM120624
                Funded by: FundRef http://dx.doi.org/10.13039/100000121, Division of Mathematical Sciences
                Award ID: DMS 1518001
                Funded by: FundRef http://dx.doi.org/10.13039/100000141, Division of Ocean Sciences
                Award ID: 1136818
                Funded by: Gordon and Betty Moore Foundation (US)
                Award ID: GBMF3779
                Categories
                Methodology
                metagenome, virus, k-mer, human gut, liver cirrhosis
