A scalable method for identifying frequent subtrees in sets of large phylogenetic trees

Abstract

Background

We consider the problem of finding the maximum frequent agreement subtrees (MFASTs) in a collection of phylogenetic trees. Existing methods for this problem often do not scale beyond datasets with around 100 taxa. Our goal is to address this problem for datasets with over a thousand taxa and hundreds of trees.

Results

We develop a heuristic that aims to find MFASTs in collections of many large phylogenetic trees. The method works in multiple phases. In the first phase, it identifies small candidate subtrees in the input trees; these serve as seeds for larger subtrees. In the second phase, it combines these seeds to build larger candidate MFASTs. In the final phase, a post-processing step ensures that the reported frequent agreement subtree is not contained in any larger frequent agreement subtree. We demonstrate that this heuristic can easily handle datasets with 1000 taxa, greatly extending the estimation of MFASTs beyond current methods.
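To make the idea of growing a seed into a larger frequent agreement subtree concrete, the following is a minimal sketch and not the authors' implementation. It assumes rooted trees written as nested Python tuples with taxon names as leaf strings, and it treats a taxon set as "frequent" when at least a fraction `min_freq` of the input trees show the same branching pattern on that set as a chosen reference tree. The function names (`clusters`, `frequency`, `extend_seed`) and the greedy one-taxon-at-a-time extension are illustrative assumptions standing in for the seed-combination and post-processing phases described above.

```python
def clusters(tree, keep):
    """Clades (frozensets of taxa, size > 1) of `tree` restricted to the taxa in `keep`."""
    found = set()

    def visit(node):
        if isinstance(node, str):  # leaf: keep it only if it is in the taxon set
            leaves = frozenset([node]) if node in keep else frozenset()
        else:                      # internal node: union of the children's kept leaves
            leaves = frozenset().union(*(visit(child) for child in node))
        if len(leaves) > 1:
            found.add(leaves)
        return leaves

    visit(tree)
    return found


def frequency(taxa, trees, reference):
    """Fraction of `trees` whose topology on `taxa` matches that of `reference`."""
    target = clusters(reference, taxa)
    return sum(clusters(t, taxa) == target for t in trees) / len(trees)


def extend_seed(seed, trees, reference, all_taxa, min_freq):
    """Greedily add taxa to `seed` while the subtree stays frequent.

    One pass is enough: a taxon rejected for a smaller set can never become
    acceptable for a larger one, because restricting an agreeing subtree to
    fewer taxa still agrees. The result is therefore not contained in any
    larger frequent subtree of the reference topology.
    """
    current = set(seed)
    for taxon in sorted(all_taxa - current):
        if frequency(current | {taxon}, trees, reference) >= min_freq:
            current.add(taxon)
    return current


# Hypothetical toy input: four rooted trees on five taxa.
t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ((("A", "B"), "C"), ("E", "D"))
t3 = ((("A", "C"), "B"), ("D", "E"))
t4 = (((("A", "B"), "D"), "C"), "E")
trees = [t1, t2, t3, t4]

result = extend_seed({"D", "E"}, trees, t1, set("ABCDE"), min_freq=0.75)
print(sorted(result))  # ['A', 'B', 'D', 'E']
```

In this toy input the full five-taxon topology of `t1` appears in only half of the trees, so taxon C is left out, while the subtree on A, B, D and E is shared by three of the four trees. The result depends on the choice of reference tree and seed, which is one reason a heuristic of this kind cannot guarantee finding the largest MFAST.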

Conclusions

Although this heuristic is not guaranteed to find all MFASTs or the largest MFAST, it found the MFAST in every synthetic dataset for which we could verify the correctness of the result. It also performed well on large empirical datasets. Its performance is robust to the number and size of the input trees. Overall, this method provides a simple and fast way to identify strongly supported subtrees within large phylogenetic hypotheses.

Author and article information

Journal

Journal ID (nlm-ta): BMC Bioinformatics

Journal ID (iso-abbrev): BMC Bioinformatics

Title: BMC Bioinformatics

Publisher: BioMed Central

ISSN (Electronic): 1471-2105

Publication date Collection: 2012

Publication date (Electronic): 3 October 2012

Volume: 13

Page: 256

Affiliations

[1] Electrical and Computer Engineering, University of Florida, Gainesville, FL, USA

[2] Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA

[3] Department of Biology, University of Florida, Gainesville, FL, USA

Article

Publisher ID: 1471-2105-13-256

DOI: 10.1186/1471-2105-13-256

PMC ID: 3543182

PubMed ID: 23033843

SO-VID: 040277ef-1462-4c5c-b310-91ca7d6f4264

License:

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
