0
views
0
recommends
+1 Recommend
1 collections
    0
    shares
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      Maximum likelihood pandemic-scale phylogenetics

      Preprint
      research-article

      Read this article at

      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Summary

          Phylogenetics plays a crucial role in the interpretation of genomic data 1 . Phylogenetic analyses of SARS-CoV-2 genomes have allowed the detailed study of the virus’s origins 2 , of its international 3, 4 and local 49 spread, and of the emergence 10 and reproductive success 11 of new variants, among many applications. These analyses have been enabled by the unparalleled volumes of genome sequence data generated and employed to study and help contain the pandemic 12 . However, preferred model-based phylogenetic approaches including maximum likelihood and Bayesian methods, mostly based on Felsenstein’s ‘pruning’ algorithm 13, 14 , cannot scale to the size of the datasets from the current pandemic 4, 15 , hampering our understanding of the virus’s evolution and transmission 16 . We present new approaches, based on reworking Felsenstein’s algorithm, for likelihood-based phylogenetic analysis of epidemiological genomic datasets at unprecedented scales. We exploit near-certainty regarding ancestral genomes, and the similarities between closely related and densely sampled genomes, to greatly reduce computational demands for memory and time. Combined with new methods for searching amongst candidate evolutionary trees, this results in our MAPLE (‘MAximum Parsimonious Likelihood Estimation’) software giving better results than popular approaches such as FastTree 2 17 , IQ-TREE 2 18 , RAxML-NG 19 and UShER 15 . Our approach therefore allows complex and accurate probabilistic phylogenetic analyses of millions of microbial genomes, extending the reach of genomic epidemiology. Future epidemiological datasets are likely to be even larger than those currently associated with COVID-19, and other disciplines such as metagenomics and biodiversity science are also generating huge numbers of genome sequences 2022 . Our methods will permit continued use of preferred likelihood-based phylogenetic analyses.

          Related collections

          Most cited references58

          • Record: found
          • Abstract: found
          • Article: found
          Is Open Access

          RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies

          Motivation: Phylogenies are increasingly used in all fields of medical and biological research. Moreover, because of the next-generation sequencing revolution, datasets used for conducting phylogenetic analyses grow at an unprecedented pace. RAxML (Randomized Axelerated Maximum Likelihood) is a popular program for phylogenetic analyses of large datasets under maximum likelihood. Since the last RAxML paper in 2006, it has been continuously maintained and extended to accommodate the increasingly growing input datasets and to serve the needs of the user community. Results: I present some of the most notable new features and extensions of RAxML, such as a substantial extension of substitution models and supported data types, the introduction of SSE3, AVX and AVX2 vector intrinsics, techniques for reducing the memory requirements of the code and a plethora of operations for conducting post-analyses on sets of trees. In addition, an up-to-date 50-page user manual covering all new RAxML options is available. Availability and implementation: The code is available under GNU GPL at https://github.com/stamatak/standard-RAxML. Contact: alexandros.stamatakis@h-its.org Supplementary information: Supplementary data are available at Bioinformatics online.
            Bookmark
            • Record: found
            • Abstract: found
            • Article: found
            Is Open Access

            A new coronavirus associated with human respiratory disease in China

            Emerging infectious diseases, such as severe acute respiratory syndrome (SARS) and Zika virus disease, present a major threat to public health 1–3 . Despite intense research efforts, how, when and where new diseases appear are still a source of considerable uncertainty. A severe respiratory disease was recently reported in Wuhan, Hubei province, China. As of 25 January 2020, at least 1,975 cases had been reported since the first patient was hospitalized on 12 December 2019. Epidemiological investigations have suggested that the outbreak was associated with a seafood market in Wuhan. Here we study a single patient who was a worker at the market and who was admitted to the Central Hospital of Wuhan on 26 December 2019 while experiencing a severe respiratory syndrome that included fever, dizziness and a cough. Metagenomic RNA sequencing 4 of a sample of bronchoalveolar lavage fluid from the patient identified a new RNA virus strain from the family Coronaviridae, which is designated here ‘WH-Human 1’ coronavirus (and has also been referred to as ‘2019-nCoV’). Phylogenetic analysis of the complete viral genome (29,903 nucleotides) revealed that the virus was most closely related (89.1% nucleotide similarity) to a group of SARS-like coronaviruses (genus Betacoronavirus, subgenus Sarbecovirus) that had previously been found in bats in China 5 . This outbreak highlights the ongoing ability of viral spill-over from animals to cause severe disease in humans.
              Bookmark
              • Record: found
              • Abstract: found
              • Article: found
              Is Open Access

              FastTree 2 – Approximately Maximum-Likelihood Trees for Large Alignments

              Background We recently described FastTree, a tool for inferring phylogenies for alignments with up to hundreds of thousands of sequences. Here, we describe improvements to FastTree that improve its accuracy without sacrificing scalability. Methodology/Principal Findings Where FastTree 1 used nearest-neighbor interchanges (NNIs) and the minimum-evolution criterion to improve the tree, FastTree 2 adds minimum-evolution subtree-pruning-regrafting (SPRs) and maximum-likelihood NNIs. FastTree 2 uses heuristics to restrict the search for better trees and estimates a rate of evolution for each site (the “CAT” approximation). Nevertheless, for both simulated and genuine alignments, FastTree 2 is slightly more accurate than a standard implementation of maximum-likelihood NNIs (PhyML 3 with default settings). Although FastTree 2 is not quite as accurate as methods that use maximum-likelihood SPRs, most of the splits that disagree are poorly supported, and for large alignments, FastTree 2 is 100–1,000 times faster. FastTree 2 inferred a topology and likelihood-based local support values for 237,882 distinct 16S ribosomal RNAs on a desktop computer in 22 hours and 5.8 gigabytes of memory. Conclusions/Significance FastTree 2 allows the inference of maximum-likelihood phylogenies for huge alignments. FastTree 2 is freely available at http://www.microbesonline.org/fasttree.
                Bookmark

                Author and article information

                Journal
                bioRxiv
                BIORXIV
                bioRxiv
                Cold Spring Harbor Laboratory
                18 July 2022
                : 2022.03.22.485312
                Affiliations
                [1 ]European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
                [2 ]Max Planck Institute for Molecular Genetics, Ihnestraße 63-73 14195 Berlin, Germany
                [3 ]Department of Electrical and Computer Engineering, University of California San Diego, San Diego, CA 92093, USA
                [4 ]Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA
                [5 ]Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95064, USA
                [6 ]School of Computing, College of Engineering and Computer Science, Australian National University, Canberra, ACT 2600, Australia
                Author notes

                Author contributions

                N.D.M. conceived and implemented the methods, performed the simulations and real data analyses, and wrote the first draft of the manuscript.

                N.G. supervised the work and finalized the manuscript.

                B.Q.M., R.C.-D., Y.T. and P.K. provided support during the analyses, method implementation and the drafting of the manuscript.

                Correspondence and requests for materials should be addressed to Nicola De Maio.
                Article
                10.1101/2022.03.22.485312
                8963701
                35350209
                961c4f77-6310-482e-aa39-0c44f4d05621

                This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which allows reusers to copy and distribute the material in any medium or format in unadapted form only, for noncommercial purposes only, and only so long as attribution is given to the creator.

                History
                Categories
                Article

                Comments

                Comment on this article