Algorithms, data structures, and numerics for likelihood-based phylogenetic inference of huge trees

Abstract

Background

The rapid accumulation of molecular sequence data, driven by novel wet-lab sequencing technologies, poses new challenges for large-scale maximum likelihood-based phylogenetic analyses on trees with more than 30,000 taxa and several genes. The three main computational challenges are numerical stability, the scalability of search algorithms, and the high memory requirements of likelihood computations.

Results

We introduce methods for solving these three key problems and provide corresponding proof-of-concept implementations in RAxML. The mechanisms presented here are not RAxML-specific and can thus be applied to any likelihood-based (Bayesian or maximum likelihood) tree inference program. We develop a new search strategy that can reduce the time required for tree inferences by more than 50% while yielding equally good trees (in the statistical sense) for well-chosen starting trees. We present an adaptation of the Subtree Equality Vector technique for phylogenomic datasets with missing data (already available in RAxML v728) that can reduce execution times and memory requirements by up to 50%. Finally, we discuss the numerical stability of the Γ model of rate heterogeneity on very large trees and argue that rate heterogeneity models that use a single rate or rate category for each site resolve these problems.

Conclusions

We address three major issues pertaining to large-scale tree reconstruction under maximum likelihood and propose a solution for each. Proof-of-concept or production-level implementations of our ideas are available as open-source code.
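As background to the numerical stability issue named in the abstract, the following minimal Python sketch shows why likelihood values on very large trees leave the range of double-precision floating point, and how a generic rescaling scheme keeps them representable. It is an illustration under stated assumptions, not the RAxML implementation: the function names, the scaling factor `SCALE`, and the constant per-node shrinkage of 0.01 are all hypothetical, and the sketch covers only the general underflow problem, not the Γ-specific difficulties the article discusses.

```python
import math

# Hypothetical scaling constant chosen for illustration; any large power of
# two works, because multiplying by a power of two is exact in binary
# floating point.
SCALE = 2.0 ** 256
THRESHOLD = 1.0 / SCALE  # rescale once the running value drops below this


def naive_product(factors):
    """Multiply all factors directly. Long chains underflow to 0.0."""
    result = 1.0
    for f in factors:
        result *= f
    return result


def scaled_log_product(factors):
    """Return the natural log of the product of the factors.

    The running value is multiplied by SCALE whenever it becomes tiny,
    and the number of rescaling events is undone in log space at the end.
    """
    value = 1.0
    scalings = 0
    for f in factors:
        value *= f
        if value < THRESHOLD:
            value *= SCALE
            scalings += 1
    return math.log(value) - scalings * math.log(SCALE)


# Stand-in for a likelihood that shrinks by a factor of 0.01 at each of
# 30,000 nodes; the true product is 1e-60000, far below the double range.
factors = [0.01] * 30000

print(naive_product(factors))       # underflows to 0.0
print(scaled_log_product(factors))  # finite log-likelihood-like value
```

The design choice worth noting is that the rescaling count is tracked separately and only applied in log space, so the running value itself never has to represent the true, extremely small product.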

Most cited references 9

Phyutility: a phyloinformatics tool for trees, alignments and molecular data.

Stephen A. Smith, Casey W. Dunn (2008)

Phyutility provides a set of phyloinformatics tools for summarizing and manipulating phylogenetic trees, manipulating molecular data and retrieving data from NCBI. Its simple command-line interface allows for easy integration into scripted analyses, and is able to handle large datasets with an integrated database. Phyutility, including source code, documentation, examples, and executables, is available at http://code.google.com/p/phyutility.

Pyrosequencing sheds light on DNA sequencing.

M Ronaghi (2000)

DNA sequencing is one of the most important platforms for the study of biological systems today. Sequence determination is most commonly performed using dideoxy chain termination technology. Recently, pyrosequencing has emerged as a new sequencing methodology. This technique is a widely applicable, alternative technology for the detailed characterization of nucleic acids. Pyrosequencing has the potential advantages of accuracy, flexibility, parallel processing, and can be easily automated. Furthermore, the technique dispenses with the need for labeled primers, labeled nucleotides, and gel-electrophoresis. This article considers key features regarding different aspects of pyrosequencing technology, including the general principles, enzyme properties, sequencing modes, instrumentation, and potential applications.

Time and memory efficient likelihood-based tree searches on phylogenomic alignments with missing data

Alexandros Stamatakis, Nikolaos Alachiotis (2010)

Motivation: The current molecular data explosion poses new challenges for large-scale phylogenomic analyses that can comprise hundreds or even thousands of genes. A property that characterizes phylogenomic datasets is that they tend to be gappy, i.e. can contain taxa with (many and disparate) missing genes. In current phylogenomic analyses, this type of alignment gappyness that is induced by missing data frequently exceeds 90%. We present and implement a generally applicable mechanism that allows for reducing memory footprints of likelihood-based [maximum likelihood (ML) or Bayesian] phylogenomic analyses proportional to the amount of missing data in the alignment. We also introduce a set of algorithmic rules to efficiently conduct tree searches via subtree pruning and re-grafting moves using this mechanism. Results: On a large phylogenomic DNA dataset with 2177 taxa, 68 genes and a gappyness of 90%, we achieve a memory footprint reduction from 9 GB down to 1 GB, a speedup for optimizing ML model parameters of 11, and accelerate the Subtree Pruning Regrafting tree search phase by factor 16. Thus, our approach can be deployed to improve efficiency for the two most important resources, CPU time and memory, by up to one order of magnitude. Availability: Current open-source version of RAxML v7.2.6 available at http://wwwkramer.in.tum.de/exelixis/software.html. Contact: stamatak@cs.tum.edu

Author and article information

Journal

Journal ID (nlm-ta): BMC Bioinformatics

Title: BMC Bioinformatics

Publisher: BioMed Central

ISSN (Electronic): 1471-2105

Publication date (Collection): 2011

Publication date (Electronic): 13 December 2011

Volume: 12

Page: 470

Affiliations

[1] The Exelixis Lab, Scientific Computing Group, Heidelberg Institute for Theoretical Studies, Schloss-Wolfsbrunnenweg 35, D-69118 Heidelberg, Germany

[2] Smith Lab, Dept. Ecology and Evolutionary Biology, University of Michigan, 2005 Kraus Natural Science Building, Ann Arbor, MI 48109-1048 USA

Article

Publisher ID: 1471-2105-12-470

DOI: 10.1186/1471-2105-12-470

PMC ID: 3267785

PubMed ID: 22165866

SO-VID: cf177371-d4c9-4ab7-a452-397bdf4dc24e

License:

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
