VirFinder: a novel k-mer based tool for identifying viral sequences from assembled metagenomic data

Abstract

Background

Identifying viral sequences in mixed metagenomes containing both viral and host contigs is a critical first step in analyzing the viral component of samples. Current tools for distinguishing prokaryotic virus and host contigs primarily use gene-based similarity approaches. Such approaches can significantly limit results, especially for short contigs that have few predicted proteins or that lack proteins with similarity to previously known viruses.

Methods

We have developed VirFinder, the first k-mer frequency based, machine learning method for virus contig identification that entirely avoids gene-based similarity searches. VirFinder instead identifies viral sequences based on our empirical observation that viruses and hosts have discernibly different k-mer signatures. VirFinder’s performance in correctly identifying viral sequences was tested by training its machine learning model on sequences from host and viral genomes sequenced before 1 January 2014 and evaluating on sequences obtained after 1 January 2014.
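The idea of classifying contigs from k-mer signatures alone can be illustrated with a minimal Python sketch. This is not VirFinder's implementation: the function names, the choice of k = 2, the toy sequences, and the learning settings are all assumptions made for illustration, and the classifier is a plain logistic regression trained by stochastic gradient descent, chosen only because it is short. The abstract states only that a machine learning model is trained on k-mer frequencies. Real viral and host sequences do not separate as cleanly as the AT-rich and GC-rich toy strings used here.

```python
from itertools import product
from math import exp


def kmer_frequencies(seq, k=2):
    """Return the normalized k-mer frequency vector of seq (A/C/G/T words only)."""
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(words, 0)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip windows containing N or other ambiguous bases
            counts[word] += 1
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in words]


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def train_logistic(features, labels, epochs=500, lr=0.5):
    """Fit logistic regression weights by stochastic gradient descent (illustrative)."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            err = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            weights = [w - lr * err * v for w, v in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def viral_score(seq, weights, bias, k=2):
    """Score in (0, 1); higher means more 'viral-like' under the toy model."""
    x = kmer_frequencies(seq, k)
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)


# Hypothetical toy training set: label 1 = "viral", label 0 = "host".
train = [
    ("ATATATATATAT", 1),
    ("GCGCGCGCGCGC", 0),
    ("AATTAATTAATT", 1),
    ("GGCCGGCCGGCC", 0),
]
X = [kmer_frequencies(s) for s, _ in train]
y = [label for _, label in train]
weights, bias = train_logistic(X, y)

# Score two held-out toy contigs that were not in the training set.
score_at_rich = viral_score("TATATATATATA", weights, bias)
score_gc_rich = viral_score("CGCGCGCGCGCG", weights, bias)
```

The temporal split described above (training on genomes sequenced before a cutoff date and evaluating on later ones) would correspond to building `train` only from the earlier sequences and calling `viral_score` on the later ones.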

Results

VirFinder had significantly higher true positive rates (TPRs), that is, rates of correctly identifying true viral contigs, than VirSorter, the current state-of-the-art gene-based virus classification tool, when evaluated with contigs either subsampled from complete genomes or assembled from a simulated human gut metagenome. For example, for contigs subsampled from complete genomes, VirFinder had 78-, 2.4-, and 1.8-fold higher TPRs than VirSorter for 1, 3, and 5 kb contigs, respectively, at the same false positive rates as VirSorter (0, 0.003, and 0.006, respectively). VirFinder therefore works considerably better than VirSorter for small contigs. Furthermore, VirFinder identified several recently sequenced virus genomes (sequenced after 1 January 2014) that VirSorter did not and that have no nucleotide similarity to previously sequenced viruses, demonstrating VirFinder’s potential advantage in identifying novel viral sequences. Application of VirFinder to a set of human gut metagenomes from healthy individuals and liver cirrhosis patients revealed higher viral diversity in healthy individuals than in cirrhosis patients. We also identified contig bins containing crAssphage-like contigs with higher abundance in healthy individuals and a putative Veillonella genus prophage associated with cirrhosis patients.
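Comparing two tools "at the same false positive rate" means choosing a score threshold for one tool so that its false positive rate (FPR) does not exceed the other tool's, and then reading off the TPR at that threshold. A minimal Python sketch of that calculation follows; the function names and the toy scores and labels are hypothetical and are not data from the study.

```python
def tpr_at_fpr(scores, labels, max_fpr):
    """Highest TPR reachable by a threshold whose FPR does not exceed max_fpr.

    scores: classifier scores, higher means more likely viral.
    labels: 1 for true viral contigs, 0 for host contigs.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one viral and one host contig")
    best_tpr = 0.0
    for threshold in sorted(set(scores)):
        # A contig is called viral when its score is at or above the threshold.
        tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 0)
        if fp / n_neg <= max_fpr:
            best_tpr = max(best_tpr, tp / n_pos)
    return best_tpr


def fold_change(tpr_a, tpr_b):
    """Ratio of two TPRs measured at the same FPR."""
    return tpr_a / tpr_b


# Hypothetical toy scores for eight contigs (four viral, four host).
toy_scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
toy_labels = [1, 1, 0, 1, 1, 0, 0, 0]

tpr_strict = tpr_at_fpr(toy_scores, toy_labels, max_fpr=0.0)   # 0.5
tpr_loose = tpr_at_fpr(toy_scores, toy_labels, max_fpr=0.25)   # 1.0
```

In the toy data, allowing no false positives limits the threshold to 0.90 or above, which recovers two of the four viral contigs. Allowing one false positive in four host contigs lowers the threshold to 0.60 and recovers all four.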

Conclusions

This innovative k-mer based tool complements gene-based approaches and will significantly improve prokaryotic viral sequence identification, especially for metagenomic-based studies of viral ecology.

Electronic supplementary material

The online version of this article (doi:10.1186/s40168-017-0283-5) contains supplementary material, which is available to authorized users.

Author and article information

Contributors

Nathan A. Ahlgren: nahlgren@clarku.edu

Fengzhu Sun: fsun@usc.edu

Journal

Journal ID (nlm-ta): Microbiome

Journal ID (iso-abbrev): Microbiome

Title: Microbiome

Publisher: BioMed Central (London)

ISSN (Electronic): 2049-2618

Publication date (Electronic): 6 July 2017

Publication date PMC-release: 6 July 2017

Publication date Collection: 2017

Volume: 5

Electronic Location Identifier: 69

Affiliations

[1] ISNI 0000 0001 2156 6853, GRID grid.42505.36, Molecular and Computational Biology Program, University of Southern California, 1050 Childs Way, Los Angeles, CA 90089 USA

[2] ISNI 0000 0001 2156 6853, GRID grid.42505.36, Department of Biological Sciences, University of Southern California, 3616 Trousdale Pkwy, Los Angeles, CA 90089 USA

[3] ISNI 0000 0001 0125 2443, GRID grid.8547.e, Center for Computational Systems Biology, Fudan University, 200433 Shanghai, China

[4] ISNI 0000 0004 0486 8069, GRID grid.254277.1, Present address: Biology Department, Clark University, 950 Main St, Worcester, MA 01610 USA

Article

Publisher ID: 283

DOI: 10.1186/s40168-017-0283-5

PMC ID: 5501583

PubMed ID: 28683828

SO-VID: 993f1108-33dd-469f-97eb-1092fc4e3632

License:

Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

History

Date received: 31 January 2017

Date accepted: 5 June 2017

Funding

Funded by: FundRef http://dx.doi.org/10.13039/100000057, National Institute of General Medical Sciences

Award ID: R01GM120624

Award Recipient: Jed A. Fuhrman, Fengzhu Sun

Funded by: FundRef http://dx.doi.org/10.13039/100000121, Division of Mathematical Sciences

Award ID: DMS 1518001

Award Recipient: Jed A. Fuhrman, Fengzhu Sun

Funded by: FundRef http://dx.doi.org/10.13039/100000141, Division of Ocean Sciences

Award ID: 1136818

Award Recipient: Jed A. Fuhrman, Fengzhu Sun

Funded by: Gordon and Betty Moore Foundation (US)

Award ID: GBMF3779

Award Recipient: Jed A. Fuhrman

Custom metadata

Keywords: metagenome,virus,k-mer,human gut,liver cirrhosis
