12
views
0
recommends
+1 Recommend
0 collections
    0
    shares
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      Algorithms, data structures, and numerics for likelihood-based phylogenetic inference of huge trees

      research-article
      1 , , 1 , 2 , 1 ,
      BMC Bioinformatics
      BioMed Central

      Read this article at

      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          Background

          The rapid accumulation of molecular sequence data, driven by novel wet-lab sequencing technologies, poses new challenges for large-scale maximum likelihood-based phylogenetic analyses on trees with more than 30,000 taxa and several genes. The three main computational challenges are: numerical stability, the scalability of search algorithms, and the high memory requirements for computing the likelihood.

          Results

          We introduce methods for solving these three key problems and provide respective proof-of-concept implementations in RAxML. The mechanisms presented here are not RAxML-specific and can thus be applied to any likelihood-based (Bayesian or maximum likelihood) tree inference program. We develop a new search strategy that can reduce the time required for tree inferences by more than 50% while yielding equally good trees (in the statistical sense) for well-chosen starting trees. We present an adaptation of the Subtree Equality Vector technique for phylogenomic datasets with missing data (already available in RAxML v728) that can reduce execution times and memory requirements by up to 50%. Finally, we discuss issues pertaining to the numerical stability of the Γ model of rate heterogeneity on very large trees and argue in favor of rate heterogeneity models that use a single rate or rate category for each site to resolve these problems.

          Conclusions

          We address three major issues pertaining to large scale tree reconstruction under maximum likelihood and propose respective solutions. Respective proof-of-concept/production-level implementations of our ideas are made available as open-source code.

          Related collections

          Most cited references9

          • Record: found
          • Abstract: found
          • Article: not found

          Phyutility: a phyloinformatics tool for trees, alignments and molecular data.

          Phyutility provides a set of phyloinformatics tools for summarizing and manipulating phylogenetic trees, manipulating molecular data and retrieving data from NCBI. Its simple command-line interface allows for easy integration into scripted analyses, and is able to handle large datasets with an integrated database. Phyutility, including source code, documentation, examples, and executables, is available at http://code.google.com/p/phyutility.
            Bookmark
            • Record: found
            • Abstract: found
            • Article: not found

            Pyrosequencing sheds light on DNA sequencing.

            M Ronaghi (2000)
            DNA sequencing is one of the most important platforms for the study of biological systems today. Sequence determination is most commonly performed using dideoxy chain termination technology. Recently, pyrosequencing has emerged as a new sequencing methodology. This technique is a widely applicable, alternative technology for the detailed characterization of nucleic acids. Pyrosequencing has the potential advantages of accuracy, flexibility, parallel processing, and can be easily automated. Furthermore, the technique dispenses with the need for labeled primers, labeled nucleotides, and gel-electrophoresis. This article considers key features regarding different aspects of pyrosequencing technology, including the general principles, enzyme properties, sequencing modes, instrumentation, and potential applications.
              Bookmark
              • Record: found
              • Abstract: found
              • Article: found
              Is Open Access

              Time and memory efficient likelihood-based tree searches on phylogenomic alignments with missing data

              Motivation: The current molecular data explosion poses new challenges for large-scale phylogenomic analyses that can comprise hundreds or even thousands of genes. A property that characterizes phylogenomic datasets is that they tend to be gappy, i.e. can contain taxa with (many and disparate) missing genes. In current phylogenomic analyses, this type of alignment gappyness that is induced by missing data frequently exceeds 90%. We present and implement a generally applicable mechanism that allows for reducing memory footprints of likelihood-based [maximum likelihood (ML) or Bayesian] phylogenomic analyses proportional to the amount of missing data in the alignment. We also introduce a set of algorithmic rules to efficiently conduct tree searches via subtree pruning and re-grafting moves using this mechanism. Results: On a large phylogenomic DNA dataset with 2177 taxa, 68 genes and a gappyness of 90%, we achieve a memory footprint reduction from 9 GB down to 1 GB, a speedup for optimizing ML model parameters of 11, and accelerate the Subtree Pruning Regrafting tree search phase by factor 16. Thus, our approach can be deployed to improve efficiency for the two most important resources, CPU time and memory, by up to one order of magnitude. Availability: Current open-source version of RAxML v7.2.6 available at http://wwwkramer.in.tum.de/exelixis/software.html. Contact: stamatak@cs.tum.edu
                Bookmark

                Author and article information

                Journal
                BMC Bioinformatics
                BMC Bioinformatics
                BioMed Central
                1471-2105
                2011
                13 December 2011
                : 12
                : 470
                Affiliations
                [1 ]The Exelixis Lab, Scientific Computing Group, Heidelberg Institute for Theoretical Studies, Schloss-Wolfsbrunnenweg 35, D-69118 Heidelberg, Germany
                [2 ]2 Smith Lab, Dept. Ecology and Evolutionary Biology, University of Michigan, 2005 Kraus Natural Science Building, Ann Arbor, MI 48109-1048 USA
                Article
                1471-2105-12-470
                10.1186/1471-2105-12-470
                3267785
                22165866
                cf177371-d4c9-4ab7-a452-397bdf4dc24e
                Copyright ©2011 Izquierdo-Carrasco et al; licensee BioMed Central Ltd.

                This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

                History
                : 23 April 2011
                : 13 December 2011
                Categories
                Research Article

                Bioinformatics & Computational biology
                Bioinformatics & Computational biology

                Comments

                Comment on this article