      Machine Learning for detection of viral sequences in human metagenomic datasets


          Abstract

          Background

          Detection of highly divergent or yet unknown viruses from metagenomics sequencing datasets is a major bioinformatics challenge. When human samples are sequenced, a large proportion of assembled contigs are classified as “unknown”, as conventional methods find no similarity to known sequences. We wished to explore whether machine learning algorithms using Relative Synonymous Codon Usage frequency (RSCU) could improve the detection of viral sequences in metagenomic sequencing data.
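To make the RSCU feature concrete, the following is a minimal sketch using the textbook definition: a codon's observed count divided by the count expected if all synonymous codons for its amino acid were used equally. It assumes the input is already an in-frame coding sequence and assigns 0.0 to codons of amino acids that do not occur; both are illustrative choices, and the abstract does not describe the authors' actual preprocessing. The function name `rscu` is hypothetical.

```python
from collections import Counter, defaultdict

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def rscu(sequence):
    """Return {codon: RSCU} for the 61 sense codons of an in-frame sequence.

    Assumption (not from the article): the sequence starts in frame, and
    codons of an amino acid that never occurs are given the value 0.0.
    """
    seq = sequence.upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    # Ignore codons containing ambiguous bases such as N.
    counts = Counter(c for c in codons if c in CODON_TABLE)

    # Group synonymous codons by amino acid, skipping stop codons.
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)

    values = {}
    for family in families.values():
        total = sum(counts[c] for c in family)
        for codon in family:
            # observed count / (family total / number of synonymous codons)
            values[codon] = counts[codon] * len(family) / total if total else 0.0
    return values


# Toy input: four serine codons (TCG, TCG, TCT, AGC).
features = rscu("TCGTCGTCTAGC")
print(features["TCG"], features["TCT"], features["TCC"])
```

An RSCU of 1.0 means a codon is used exactly as often as expected under uniform synonymous usage; values above 1.0 indicate preference. The resulting 61-value vector is the kind of per-sequence feature a classifier could consume.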

          Results

          We trained Random Forest and Artificial Neural Network classifiers on metagenomic sequences taxonomically classified into virus and non-virus classes. The algorithms achieved accuracies well beyond chance level, with an area under the ROC curve of 0.79. Two codons (TCG and CGC) were found to have particularly strong discriminative capacity.
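As a reading aid for the reported metric, the sketch below computes the area under the ROC curve through its pairwise interpretation: the probability that a randomly chosen viral sequence receives a higher classifier score than a randomly chosen non-viral one, with ties counted as half. The function name `roc_auc`, the labels, and the scores are invented for illustration and are not data from the study.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for virus, 0 for non-virus (hypothetical encoding).
    scores: classifier scores, higher meaning "more likely virus".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy example: 3 of the 4 positive/negative pairs are ordered correctly.
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1]))  # 0.75
```

Under this reading, a value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, so 0.79 means roughly four out of five virus/non-virus pairs are ordered correctly.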

          Conclusion

          RSCU-based machine learning techniques applied to metagenomic sequencing data can help identify a large number of putative viral sequences and provide an addition to conventional methods for taxonomic classification.

          Electronic supplementary material

          The online version of this article (10.1186/s12859-018-2340-x) contains supplementary material, which is available to authorized users.


                Author and article information

                Contributors
                zurab.bzhalava@ki.se
                ardi.tampuu@ut.ee
                bala@icm.edu.pl
                raulvicente@gmail.com
                joakim.dillner@ki.se
                Journal
                BMC Bioinformatics
                BioMed Central (London)
                ISSN: 1471-2105
                Published: 24 September 2018
                Volume: 19
                Article number: 336
                Affiliations
                [1] ISNI 0000 0004 1937 0626, GRID grid.4714.6, Dept. of Laboratory Medicine, Karolinska Institutet, F46, Karolinska University Hospital Huddinge, Stockholm, Sweden
                [2] ISNI 0000 0001 0943 7661, GRID grid.10939.32, Institute of Computer Science, University of Tartu, Tartu, Estonia
                [3] ISNI 0000 0004 1937 1290, GRID grid.12847.38, Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw, Warsaw, Poland
                Article
                Publisher ID: 2340
                DOI: 10.1186/s12859-018-2340-x
                PMC ID: 6154907
                © The Author(s) 2018

                Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

                History
                Received: 26 September 2017
                Accepted: 28 August 2018
                Funding
                Funded by: Estonian Research Competency Council (EE)
                Award ID: PUT1476
                Funded by: NordForsk (FundRef http://dx.doi.org/10.13039/501100004785)
                Award ID: 62721
                Funded by: Stiftelsen för Strategisk Forskning (FundRef http://dx.doi.org/10.13039/501100001729)
                Award ID: RB13-0011
                Categories
                Research Article

                Bioinformatics & Computational biology
                Keywords: machine learning, metagenomic sequencing, human samples, viral genomes
