      Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study


          Abstract

          The 2019 novel coronavirus (renamed SARS-CoV-2, and generally referred to as the COVID-19 virus) has spread to 184 countries with over 1.5 million confirmed cases. Such major viral outbreaks demand early elucidation of taxonomic classification and origin of the virus genomic sequence, for strategic planning, containment, and treatment. This paper identifies an intrinsic COVID-19 virus genomic signature and uses it together with a machine learning-based alignment-free approach for an ultra-fast, scalable, and highly accurate classification of whole COVID-19 virus genomes. The proposed method combines supervised machine learning with digital signal processing (MLDSP) for genome analyses, augmented by a decision tree approach to the machine learning component, and a Spearman’s rank correlation coefficient analysis for result validation. These tools are used to analyze a large dataset of over 5000 unique viral genomic sequences, totalling 61.8 million bp, including the 29 COVID-19 virus sequences available on January 27, 2020. Our results support a hypothesis of a bat origin and classify the COVID-19 virus as Sarbecovirus, within Betacoronavirus. Our method achieves 100% accurate classification of the COVID-19 virus sequences, and discovers the most relevant relationships among over 5000 viral genomes within a few minutes, ab initio, using raw DNA sequence data alone, and without any specialized biological knowledge, training, gene or genome annotations. This suggests that, for novel viral and pathogen genome sequences, this alignment-free whole-genome machine-learning approach can provide a reliable real-time option for taxonomic classification.
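          The alignment-free idea summarized in the abstract can be illustrated with a minimal sketch. The abstract does not spell out the individual steps, so everything specific here is an assumption made for illustration: the purine/pyrimidine numeric mapping (`PP_MAP`), the zero-padding to a common length, the naive discrete Fourier transform, and the distance `(1 - r) / 2` derived from a Pearson correlation of magnitude spectra. All function names are hypothetical, and the `spearman` helper only shows what a Spearman's rank correlation computes; it is not the authors' validation procedure.

```python
import cmath
import math

# Hypothetical purine/pyrimidine mapping (an assumption, not taken from the abstract).
PP_MAP = {"A": -1.0, "G": -1.0, "C": 1.0, "T": 1.0}


def to_signal(seq, length):
    """Map nucleotides to numbers, then truncate or zero-pad to a common length."""
    values = [PP_MAP.get(base, 0.0) for base in seq.upper()[:length]]
    return values + [0.0] * (length - len(values))


def magnitude_spectrum(signal):
    """Magnitudes of the discrete Fourier transform (naive O(n^2) version)."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal)))
        for k in range(n)
    ]


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


def signature_distance(seq_a, seq_b, length):
    """Distance in [0, 1] between two sequences, computed without alignment."""
    spec_a = magnitude_spectrum(to_signal(seq_a, length))
    spec_b = magnitude_spectrum(to_signal(seq_b, length))
    return (1 - pearson(spec_a, spec_b)) / 2


if __name__ == "__main__":
    print(signature_distance("ACGTTGCAAC", "AAAAACCCCC", 10))
    print(spearman([1, 2, 3, 4], [10, 20, 30, 40]))
```

          In a sketch like this, pairwise distances between all genomes would form a distance matrix that a supervised classifier could take as features. A practical implementation would use a fast Fourier transform rather than the quadratic loop shown here.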

          Most cited references (60)


          A Novel Coronavirus from Patients with Pneumonia in China, 2019

          Summary: In December 2019, a cluster of patients with pneumonia of unknown cause was linked to a seafood wholesale market in Wuhan, China. A previously unknown betacoronavirus was discovered through the use of unbiased sequencing in samples from patients with pneumonia. Human airway epithelial cells were used to isolate a novel coronavirus, named 2019-nCoV, which formed a clade within the subgenus sarbecovirus, Orthocoronavirinae subfamily. Different from both MERS-CoV and SARS-CoV, 2019-nCoV is the seventh member of the family of coronaviruses that infect humans. Enhanced surveillance and further investigation are ongoing. (Funded by the National Key Research and Development Program of China and the National Major Project for Control and Prevention of Infectious Disease in China.)

            A pneumonia outbreak associated with a new coronavirus of probable bat origin

            Since the outbreak of severe acute respiratory syndrome (SARS) 18 years ago, a large number of SARS-related coronaviruses (SARSr-CoVs) have been discovered in their natural reservoir host, bats [1–4]. Previous studies have shown that some bat SARSr-CoVs have the potential to infect humans [5–7]. Here we report the identification and characterization of a new coronavirus (2019-nCoV), which caused an epidemic of acute respiratory syndrome in humans in Wuhan, China. The epidemic, which started on 12 December 2019, had caused 2,794 laboratory-confirmed infections including 80 deaths by 26 January 2020. Full-length genome sequences were obtained from five patients at an early stage of the outbreak. The sequences are almost identical and share 79.6% sequence identity to SARS-CoV. Furthermore, we show that 2019-nCoV is 96% identical at the whole-genome level to a bat coronavirus. Pairwise protein sequence analysis of seven conserved non-structural protein domains shows that this virus belongs to the species SARSr-CoV. In addition, the 2019-nCoV virus isolated from the bronchoalveolar lavage fluid of a critically ill patient could be neutralized by sera from several patients. Notably, we confirmed that 2019-nCoV uses the same cell entry receptor, angiotensin converting enzyme II (ACE2), as SARS-CoV.

              Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding

              Summary

              Background: In late December, 2019, patients presenting with viral pneumonia due to an unidentified microbial agent were reported in Wuhan, China. A novel coronavirus was subsequently identified as the causative pathogen, provisionally named 2019 novel coronavirus (2019-nCoV). As of Jan 26, 2020, more than 2000 cases of 2019-nCoV infection have been confirmed, most of which involved people living in or visiting Wuhan, and human-to-human transmission has been confirmed.

              Methods: We did next-generation sequencing of samples from bronchoalveolar lavage fluid and cultured isolates from nine inpatients, eight of whom had visited the Huanan seafood market in Wuhan. Complete and partial 2019-nCoV genome sequences were obtained from these individuals. Viral contigs were connected using Sanger sequencing to obtain the full-length genomes, with the terminal regions determined by rapid amplification of cDNA ends. Phylogenetic analysis of these 2019-nCoV genomes and those of other coronaviruses was used to determine the evolutionary history of the virus and help infer its likely origin. Homology modelling was done to explore the likely receptor-binding properties of the virus.

              Findings: The ten genome sequences of 2019-nCoV obtained from the nine patients were extremely similar, exhibiting more than 99·98% sequence identity. Notably, 2019-nCoV was closely related (with 88% identity) to two bat-derived severe acute respiratory syndrome (SARS)-like coronaviruses, bat-SL-CoVZC45 and bat-SL-CoVZXC21, collected in 2018 in Zhoushan, eastern China, but was more distant from SARS-CoV (about 79%) and MERS-CoV (about 50%). Phylogenetic analysis revealed that 2019-nCoV fell within the subgenus Sarbecovirus of the genus Betacoronavirus, with a relatively long branch length to its closest relatives bat-SL-CoVZC45 and bat-SL-CoVZXC21, and was genetically distinct from SARS-CoV. Notably, homology modelling revealed that 2019-nCoV had a similar receptor-binding domain structure to that of SARS-CoV, despite amino acid variation at some key residues.

              Interpretation: 2019-nCoV is sufficiently divergent from SARS-CoV to be considered a new human-infecting betacoronavirus. Although our phylogenetic analysis suggests that bats might be the original host of this virus, an animal sold at the seafood market in Wuhan might represent an intermediate host facilitating the emergence of the virus in humans. Importantly, structural analysis suggests that 2019-nCoV might be able to bind to the angiotensin-converting enzyme 2 receptor in humans. The future evolution, adaptation, and spread of this virus warrant urgent investigation.

              Funding: National Key Research and Development Program of China, National Major Project for Control and Prevention of Infectious Disease in China, Chinese Academy of Sciences, Shandong First Medical University.

                Author and article information

                Contributors
                Roles: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Resources, Software, Validation, Visualization, Writing – original draft, Writing – review & editing
                Roles: Formal analysis, Investigation, Writing – original draft, Writing – review & editing
                Roles: Formal analysis, Writing – review & editing
                Roles: Formal analysis, Writing – review & editing
                Roles: Formal analysis, Funding acquisition, Investigation, Project administration, Supervision, Writing – review & editing
                Roles: Funding acquisition, Methodology, Project administration, Supervision, Writing – review & editing
                Role: Editor
                Journal
                PLoS ONE
                Publisher: Public Library of Science (San Francisco, CA, USA)
                ISSN: 1932-6203
                Published: 24 April 2020
                Volume: 15
                Issue: 4
                Article number: e0232391
                Affiliations
                [1] Department of Computer Science, The University of Western Ontario, London, ON, Canada
                [2] Department of Biology, The University of Western Ontario, London, ON, Canada
                [3] Department of Statistical and Actuarial Sciences, The University of Western Ontario, London, ON, Canada
                [4] School of Computer Science, University of Waterloo, Waterloo, ON, Canada
                Kliniken der Stadt Köln gGmbH, GERMANY
                Author notes

                Competing Interests: The authors have declared that no competing interests exist.

                Author information
                http://orcid.org/0000-0003-1054-125X
                http://orcid.org/0000-0001-7495-5203
                http://orcid.org/0000-0002-4020-701X
                Article
                Manuscript number: PONE-D-20-04991
                DOI: 10.1371/journal.pone.0232391
                PMCID: PMC7182198
                PMID: 32330208
                4f2851ae-6aa1-4b1e-bd59-50ad92c09b5a
                © 2020 Randhawa et al

                This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

                History
                Received: 20 February 2020
                Accepted: 14 April 2020
                Page count
                Figures: 8, Tables: 4, Pages: 24
                Funding
                Funded by: Natural Sciences and Engineering Research Council of Canada (NSERC), https://www.nserc-crsng.gc.ca/ (funder ID: http://dx.doi.org/10.13039/501100000038)
                Award ID: R2824A01; award recipient: LK
                Award ID: R3511A12; award recipient: KAH
                The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
                Categories
                Research Article
                Research and Analysis Methods > Database and Informatics Methods > Bioinformatics > Sequence Analysis > Sequence Alignment
                Biology and Life Sciences > Organisms > Viruses > RNA viruses > Coronaviruses
                Biology and Life Sciences > Microbiology > Medical Microbiology > Microbial Pathogens > Viral Pathogens > Coronaviruses
                Medicine and Health Sciences > Pathology and Laboratory Medicine > Pathogens > Microbial Pathogens > Viral Pathogens > Coronaviruses
                Biology and Life Sciences > Organisms > Viruses > Viral Pathogens > Coronaviruses
                Biology and Life Sciences > Genetics > Genomics > Microbial Genomics > Viral Genomics
                Biology and Life Sciences > Microbiology > Microbial Genomics > Viral Genomics
                Biology and Life Sciences > Microbiology > Virology > Viral Genomics
                Computer and Information Sciences > Artificial Intelligence > Machine Learning
                Biology and Life Sciences > Taxonomy > Microbial Taxonomy > Viral Taxonomy
                Computer and Information Sciences > Data Management > Taxonomy > Microbial Taxonomy > Viral Taxonomy
                Biology and Life Sciences > Microbiology > Virology > Viral Taxonomy
                Biology and Life Sciences > Taxonomy
                Computer and Information Sciences > Data Management > Taxonomy
                Biology and Life Sciences > Organisms > Eukaryota > Animals > Vertebrates > Amniotes > Mammals > Bats
                Biology and Life Sciences > Computational Biology > Comparative Genomics
                Biology and Life Sciences > Genetics > Genomics > Comparative Genomics
                Custom metadata
                All sequence data used in this paper are from NCBI, Virus-Host-DB, or GISAID. The sequences from NCBI and Virus-Host-DB in FASTA format, and the accession numbers of all sequences from GISAID, are available at https://sourceforge.net/projects/mldsp-gui/files/COVID19Dataset/. In addition, the accession numbers of all the sequences used in this study are listed in the Supplementary Material, Tables S2 and S3.
