Suppression of long-branch attraction artefacts in the animal phylogeny using a site-heterogeneous model


Abstract

Background

Thanks to the large amount of signal contained in genome-wide sequence alignments, phylogenomic analyses are converging towards highly supported trees. However, high statistical support does not imply that the tree is accurate. Systematic errors, such as the long-branch attraction (LBA) artefact, can be misleading, in particular when taxon sampling is poor or the outgroup is distant. In an otherwise consistent probabilistic framework, systematic errors in genome-wide analyses can be traced back to model misspecification. This suggests that better models of sequence evolution should be devised, models that remain robust to tree reconstruction artefacts even under the most challenging conditions.

Methods

We focus on a well-characterized LBA artefact analyzed in a previous phylogenomic study of the metazoan tree, in which two fast-evolving animal phyla, nematodes and platyhelminths, emerge either at the base of all other Bilateria or within protostomes, depending on the outgroup. We use this artefactual result as a case study for comparing the robustness of two alternative models: a standard site-homogeneous model based on an empirical matrix of amino-acid replacement (WAG), and a site-heterogeneous mixture model (CAT). In parallel, we propose a posterior predictive test that measures how well a model accounts for sequence saturation.
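The general shape of a posterior predictive test can be sketched as follows. This is a minimal illustration, not the test proposed in the article: the statistic used here (mean number of distinct amino acids per alignment column) and the function names `mean_site_diversity` and `posterior_predictive_pvalue` are assumptions chosen for illustration, and the replicate alignments are assumed to have been simulated elsewhere from the model's posterior distribution.

```python
def mean_site_diversity(alignment):
    """Mean number of distinct residues per column (gaps ignored).

    `alignment` is a list of equal-length strings, one per taxon.
    This statistic is an illustrative choice, not the article's.
    """
    n_sites = len(alignment[0])
    total = 0
    for j in range(n_sites):
        column = {seq[j] for seq in alignment}
        column.discard("-")  # do not count gaps as a residue
        total += len(column)
    return total / n_sites


def posterior_predictive_pvalue(observed, replicates, statistic):
    """Fraction of replicates whose statistic is >= the observed one.

    `replicates` are alignments simulated under the model (assumed
    to be supplied by the caller). A value close to 0 or 1 means the
    model fails to reproduce this feature of the data.
    """
    obs = statistic(observed)
    sims = [statistic(rep) for rep in replicates]
    return sum(s >= obs for s in sims) / len(sims)


if __name__ == "__main__":
    observed = ["AAC", "AAG", "ATC"]
    replicates = [["AAA", "AAA", "AAA"], ["ACG", "CGA", "GAC"]]
    print(posterior_predictive_pvalue(observed, replicates, mean_site_diversity))
```

In practice many more replicates would be needed, and the statistic would be chosen to reflect the feature of interest, here sequence saturation.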

Results

Adopting a Bayesian framework, we show that the LBA artefact observed under WAG disappears when the site-heterogeneous model CAT is used. Using cross-validation, we further demonstrate that CAT has a better statistical fit than WAG on this data set. Finally, using our statistical goodness-of-fit test, we show that CAT, but not WAG, correctly accounts for the overall level of saturation, and that this is due to a better estimation of site-specific amino-acid preferences.

Conclusion

The CAT model appears to be more robust than WAG against LBA artefacts, essentially because it correctly anticipates the high probability of convergences and reversions implied by the small effective size of the amino-acid alphabet at each site of the alignment. More generally, our results provide strong evidence that site-specificities in the substitution process need to be accounted for in order to obtain more reliable phylogenetic trees.
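The link between a small effective alphabet and frequent convergences can be illustrated with a short calculation. This sketch rests on a simplifying assumption that is not taken from the article: the state reached by each substitution at a site is drawn independently from that site's amino-acid profile. Under that assumption, two independent substitutions land on the same amino acid with probability equal to the sum of squared profile frequencies, and the inverse of that sum can serve as an effective alphabet size. The function names are hypothetical.

```python
def convergence_probability(profile):
    """Probability that two independent draws from `profile` coincide.

    `profile` is a sequence of amino-acid frequencies summing to 1.
    Assumes new states are drawn independently from the profile.
    """
    if abs(sum(profile) - 1.0) > 1e-9:
        raise ValueError("profile frequencies must sum to 1")
    return sum(p * p for p in profile)


def effective_alphabet_size(profile):
    """Inverse of the coincidence probability (illustrative definition)."""
    return 1.0 / convergence_probability(profile)


if __name__ == "__main__":
    # Site-homogeneous view: all 20 amino acids equally acceptable.
    uniform = [1 / 20] * 20
    # Site-specific view: only two amino acids tolerated at this site.
    restricted = [0.5, 0.5] + [0.0] * 18
    for name, prof in [("uniform", uniform), ("restricted", restricted)]:
        print(name, convergence_probability(prof), effective_alphabet_size(prof))
```

A site that tolerates only two amino acids gives a coincidence probability ten times higher than a uniform profile over twenty, so a model that ignores site-specific preferences would be expected to underestimate convergences at such sites.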


Author and article information

Conference

Journal ID (nlm-ta): BMC Evol Biol

Title: BMC Evolutionary Biology

Publisher: BioMed Central (London)

ISSN (Electronic): 1471-2148

Publication date (Collection): 2007

Publication date (Electronic): 8 February 2007

Volume: 7

Issue: Suppl 1

Page: S4

Affiliations

[1] Laboratoire d'Informatique, de Robotique et de Microélectronique de Montpellier, UMR 5506, CNRS-Université de Montpellier 2, 161, rue Ada, 34392 Montpellier Cedex 5, France

[2] Canadian Institute for Advanced Research, Département de Biochimie, Université de Montréal, Montréal, Québec, Canada

Article

Publisher ID: 1471-2148-7-S1-S4

DOI: 10.1186/1471-2148-7-S1-S4

PMC ID: 1796613

PubMed ID: 17288577

SO-VID: 62313ec8-43ce-47a2-8a8d-e800eb4bec00

License:

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Conference name: First International Conference on Phylogenomics

Conference location: Sainte-Adèle, Québec, Canada

Conference date: 15–19 March 2006

