
      ViraMiner: Deep learning on raw DNA sequences for identifying viral genomes in human samples

      PLoS ONE
      Public Library of Science


          Abstract

          Despite its clinical importance, detection of highly divergent or yet unknown viruses is a major challenge. When human samples are sequenced, conventional alignments classify many assembled contigs as “unknown” since many of the sequences are not similar to known genomes. In this work, we developed ViraMiner, a deep learning-based method to identify viruses in various human biospecimens. ViraMiner contains two branches of Convolutional Neural Networks designed to detect both patterns and pattern-frequencies on raw metagenomics contigs. The training dataset included sequences obtained from 19 metagenomic experiments which were analyzed and labeled by BLAST. The model achieves significantly improved accuracy compared to other machine learning methods for viral genome classification. Using 300 bp contigs ViraMiner achieves 0.923 area under the ROC curve. To our knowledge, this is the first machine learning methodology that can detect the presence of viral sequences among raw metagenomic contigs from diverse human samples. We suggest that the proposed model captures different types of information of genome composition, and can be used as a recommendation system to further investigate sequences labeled as “unknown” by conventional alignment methods. Exploring these highly-divergent viruses, in turn, can enhance our knowledge of infectious causes of diseases.
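The distinction between detecting a pattern and detecting a pattern's frequency can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the abstract does not say how the two branches differ internally, so the use of max pooling for the "pattern" feature and average pooling for the "pattern-frequency" feature is an assumption made here for illustration. The filter is hand-built (hypothetical) rather than learned, and the names `one_hot` and `scan_filter` are invented for this example.

```python
from typing import List

BASES = "ACGT"


def one_hot(seq: str) -> List[List[float]]:
    """Encode DNA as one 4-element row per base (order A, C, G, T).

    Symbols outside ACGT (for example N) become all-zero rows.
    """
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def scan_filter(encoded: List[List[float]],
                kernel: List[List[float]],
                bias: float) -> List[float]:
    """Slide one convolution filter along the sequence and apply ReLU."""
    width = len(kernel)
    activations = []
    for start in range(len(encoded) - width + 1):
        score = bias
        for offset in range(width):
            for channel in range(4):
                score += encoded[start + offset][channel] * kernel[offset][channel]
        activations.append(max(0.0, score))
    return activations


# Hypothetical hand-built filter that outputs 1.0 only on an exact "ACG" match.
# A trained network would learn its filter weights from data instead.
motif = "ACG"
kernel = one_hot(motif)
bias = -(len(motif) - 1)

contig = "ACGTACGTTT"
activations = scan_filter(one_hot(contig), kernel, bias)

# Assumed pooling choices (not stated in the abstract):
pattern_feature = max(activations)                        # is the motif present anywhere?
frequency_feature = sum(activations) / len(activations)   # how often does it occur?
```

In this toy contig the motif occurs in two of the eight windows, so the max-pooled feature is 1.0 and the average-pooled feature is 0.25. The first value would stay the same if the motif occurred once or five times, while the second scales with the number of occurrences, which is the kind of complementary information the two branches are described as capturing.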

          Most cited references (28)


          ART: a next-generation sequencing read simulator.

          ART is a set of simulation tools that generate synthetic next-generation sequencing reads. This functionality is essential for testing and benchmarking tools for next-generation sequencing data analysis including read alignment, de novo assembly and genetic variation discovery. ART generates simulated sequencing reads by emulating the sequencing process with built-in, technology-specific read error models and base quality value profiles parameterized empirically in large sequencing datasets. We currently support all three major commercial next-generation sequencing platforms: Roche's 454, Illumina's Solexa and Applied Biosystems' SOLiD. ART also allows the flexibility to use customized read error model parameters and quality profiles. Both source and binary software packages are available at http://www.niehs.nih.gov/research/resources/software/art.

            Metagenomics - a guide from sampling to data analysis

            Metagenomics applies a suite of genomic technologies and bioinformatics tools to directly access the genetic content of entire communities of organisms. The field of metagenomics has been responsible for substantial advances in microbial ecology, evolution, and diversity over the past 5 to 10 years, and many research laboratories are actively engaged in it now. With the growing numbers of activities also comes a plethora of methodological knowledge and expertise that should guide future developments in the field. This review summarizes the current opinions in metagenomics, and provides practical guidance and advice on sample processing, sequencing technology, assembly, binning, annotation, experimental design, statistical analysis, data storage, and data sharing. As more metagenomic datasets are generated, the availability of standardized procedures and shared data storage and analysis becomes increasingly important to ensure that output of individual projects can be assessed and compared.

              VirFinder: a novel k-mer based tool for identifying viral sequences from assembled metagenomic data

              Background: Identifying viral sequences in mixed metagenomes containing both viral and host contigs is a critical first step in analyzing the viral component of samples. Current tools for distinguishing prokaryotic virus and host contigs primarily use gene-based similarity approaches. Such approaches can significantly limit results especially for short contigs that have few predicted proteins or lack proteins with similarity to previously known viruses.

              Methods: We have developed VirFinder, the first k-mer frequency based, machine learning method for virus contig identification that entirely avoids gene-based similarity searches. VirFinder instead identifies viral sequences based on our empirical observation that viruses and hosts have discernibly different k-mer signatures. VirFinder’s performance in correctly identifying viral sequences was tested by training its machine learning model on sequences from host and viral genomes sequenced before 1 January 2014 and evaluating on sequences obtained after 1 January 2014.

              Results: VirFinder had significantly better rates of identifying true viral contigs (true positive rates (TPRs)) than VirSorter, the current state-of-the-art gene-based virus classification tool, when evaluated with either contigs subsampled from complete genomes or assembled from a simulated human gut metagenome. For example, for contigs subsampled from complete genomes, VirFinder had 78-, 2.4-, and 1.8-fold higher TPRs than VirSorter for 1, 3, and 5 kb contigs, respectively, at the same false positive rates as VirSorter (0, 0.003, and 0.006, respectively), thus VirFinder works considerably better for small contigs than VirSorter. VirFinder furthermore identified several recently sequenced virus genomes (after 1 January 2014) that VirSorter did not and that have no nucleotide similarity to previously sequenced viruses, demonstrating VirFinder’s potential advantage in identifying novel viral sequences. Application of VirFinder to a set of human gut metagenomes from healthy and liver cirrhosis patients reveals higher viral diversity in healthy individuals than cirrhosis patients. We also identified contig bins containing crAssphage-like contigs with higher abundance in healthy patients and a putative Veillonella genus prophage associated with cirrhosis patients.

              Conclusions: This innovative k-mer based tool complements gene-based approaches and will significantly improve prokaryotic viral sequence identification, especially for metagenomic-based studies of viral ecology.

              Electronic supplementary material: The online version of this article (doi:10.1186/s40168-017-0283-5) contains supplementary material, which is available to authorized users.
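A k-mer signature of the kind this abstract describes can be sketched as a normalized count vector over all possible k-mers. The code below is a minimal illustration, not VirFinder's code: the function name `kmer_frequencies`, the default `k=3`, and the decision to skip windows containing non-ACGT symbols are assumptions made for this example, and no classifier is trained on the resulting features.

```python
from collections import Counter
from itertools import product
from typing import Dict


def kmer_frequencies(seq: str, k: int = 3) -> Dict[str, float]:
    """Return the relative frequency of every possible k-mer in seq.

    Windows containing symbols other than A, C, G, T are ignored
    (an assumption of this sketch).
    """
    seq = seq.upper()
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[kmer] for kmer in vocabulary)
    if total == 0:
        # Sequence shorter than k, or no valid windows at all.
        return {kmer: 0.0 for kmer in vocabulary}
    return {kmer: counts[kmer] / total for kmer in vocabulary}


# Toy example: the 7 overlapping 2-mers of an 8-base sequence.
freqs = kmer_frequencies("ACGTACGT", k=2)
```

The result always has 4^k entries regardless of contig length, so contigs of different sizes map to feature vectors of the same dimension, which is what allows a standard classifier to be trained on them.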

                Author and article information

                Contributors
                Roles: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing
                Roles: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing
                Roles: Funding acquisition, Project administration, Resources, Supervision, Validation, Writing – review & editing
                Roles: Conceptualization, Funding acquisition, Project administration, Resources, Supervision, Validation, Writing – original draft
                Role: Editor
                Journal
                Title: PLoS ONE
                Publisher: Public Library of Science (San Francisco, CA, USA)
                ISSN: 1932-6203
                Published: 11 September 2019
                Volume: 14
                Issue: 9
                Article number: e0222271
                Affiliations
                [1] Computational Neuroscience Lab, Institute of Computer Science, University of Tartu, Tartu, Estonia
                [2] Department of Laboratory Medicine, Karolinska Institutet, Stockholm, Sweden
                [3] Karolinska University Laboratory, Karolinska University Hospital, Stockholm, Sweden
                Editor affiliation: Oklahoma State University, United States
                Author notes

                Competing Interests: The authors have declared that no competing interests exist.

                Author information
                http://orcid.org/0000-0002-2120-1712
                Article
                Manuscript ID: PONE-D-19-16058
                DOI: 10.1371/journal.pone.0222271
                PMCID: PMC6738585
                PMID: 31509583
                © 2019 Tampuu et al

                This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

                History
                Received: 7 June 2019
                Accepted: 22 August 2019
                Page count
                Figures: 6, Tables: 1, Pages: 17
                Funding
                Funded by: Eesti Teadusagentuur (funder ID: http://dx.doi.org/10.13039/501100002301); Award ID: PUT 1476
                Funded by: Estonian Center of Excellence in IT; Award ID: TK148
                Funded by: Stiftelsen för Strategisk Forskning (funder ID: http://dx.doi.org/10.13039/501100001729); Award ID: RB13-0011
                Funded by: NordForsk (funder ID: http://dx.doi.org/10.13039/501100004785); Award ID: 62721
                Funded by: Vetenskapsrådet (funder ID: http://dx.doi.org/10.13039/501100004359); Award ID: 2017-01841_3
                ZB and JD were supported by three sources: 1) awarded to JD: Swedish Foundation for Strategic Research, Proj. no RB13-0011, https://strategiska.se/en/; 2) awarded to JD: NordForsk, Proj no 62721, https://www.nordforsk.org/en?set_language=en; 3) awarded to JD: Swedish Research Council, Proj no 2017-01841_3, https://www.vr.se/english.html. AT and RV were supported by Estonian Research Council, project number PUT 1476 ( https://www.etis.ee/Portal/Projects/Display/52ed4301-f2ef-4364-9770-397e31936f93?lang=ENG) and Estonian Centre of Excellence in IT (EXCITE) project number TK148 ( https://www.etis.ee/Portal/Projects/Display/fd0aeffa-a7d3-4191-b468-0f44aa2847af?lang=ENG). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
                Categories
                Research Article
                Biology and Life Sciences > Genetics > Genomics > Metagenomics
                Research and Analysis Methods > Database and Informatics Methods > Bioinformatics > Sequence Analysis > Sequence Alignment
                Biology and Life Sciences > Genetics > Genomics > Microbial Genomics > Viral Genomics > Viral Genome
                Biology and Life Sciences > Microbiology > Microbial Genomics > Viral Genomics > Viral Genome
                Biology and Life Sciences > Microbiology > Virology > Viral Genomics > Viral Genome
                Research and analysis methods > Database and informatics methods > Bioinformatics > Sequence analysis > BLAST algorithm
                Biology and life sciences > Molecular biology > Molecular biology techniques > Molecular biology assays and analysis techniques > DNA filter assay
                Research and analysis methods > Molecular biology techniques > Molecular biology assays and analysis techniques > DNA filter assay
                Research and analysis methods > Database and informatics methods > Bioinformatics > Sequence analysis > DNA sequence analysis
                Computer and Information Sciences > Neural Networks
                Biology and Life Sciences > Neuroscience > Neural Networks
                Computer and Information Sciences > Artificial Intelligence > Machine Learning
                Custom metadata
                All necessary data files and all code are available from the GitHub repository associated with this work: https://github.com/NeuroCSUT/ViraMiner.
