Missing Data and Influential Sites: Choice of Sites for Phylogenetic Analysis Can Be As Important As Taxon Sampling and Model Choice

Abstract

Phylogenetic studies based on molecular sequence alignments are expected to become more accurate as the number of sites in the alignments increases. With the advent of genomic-scale data, where alignments have very large numbers of sites, bootstrap values close to 100% and posterior probabilities close to 1 are the norm, suggesting that the number of sites is now seldom a limiting factor on phylogenetic accuracy. This provokes the question, should we be fussy about the sites we choose to include in a genomic-scale phylogenetic analysis? If some sites contain missing data, ambiguous character states, or gaps, then why not just throw them away before conducting the phylogenetic analysis? Indeed, this is exactly the approach taken in many phylogenetic studies. Here, we present an example where the decision on how to treat sites with missing data is of equal importance to decisions on taxon sampling and model choice, and we introduce a graphical method for illustrating this.
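The site-removal step that the abstract questions can be illustrated with a minimal sketch. This is not the authors' method or any particular program's implementation. The function name `filter_sites`, the `max_missing_fraction` parameter, the toy alignment, and the choice of `-`, `?`, and `N` as missing-data symbols (suitable for nucleotide data only) are all assumptions made for illustration.

```python
def filter_sites(alignment, max_missing_fraction=0.0, missing_chars="-?N"):
    """Return (filtered_alignment, kept_column_indices).

    A column is kept when the fraction of taxa showing a missing-data
    symbol at that site does not exceed max_missing_fraction.
    The default of 0.0 corresponds to discarding every site that has
    any gap or ambiguous state (hypothetical helper, not a library API).
    """
    names = list(alignment)
    seqs = [alignment[name] for name in names]
    n_sites = len(seqs[0])
    if any(len(seq) != n_sites for seq in seqs):
        raise ValueError("all sequences must have the same length")

    kept = [
        i
        for i in range(n_sites)
        if sum(seq[i] in missing_chars for seq in seqs) / len(seqs)
        <= max_missing_fraction
    ]
    filtered = {name: "".join(alignment[name][i] for i in kept) for name in names}
    return filtered, kept


# Toy nucleotide alignment (invented data, for illustration only).
toy = {
    "taxonA": "ACGT-ACG",
    "taxonB": "ACGTTAC?",
    "taxonC": "AC-TTACG",
}

# Strict filtering: drop every column with at least one missing symbol.
strict, strict_cols = filter_sites(toy)

# Relaxed filtering: tolerate columns where up to about a third of taxa are missing.
relaxed, relaxed_cols = filter_sites(toy, max_missing_fraction=0.34)

print(len(strict_cols), "sites kept under strict filtering")
print(len(relaxed_cols), "sites kept under relaxed filtering")
```

Running the same tree inference on both filtered alignments is one simple way to check whether a conclusion depends on how sites with missing data were treated.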

Author and article information

Journal

Journal ID (nlm-ta): Genome Biol Evol

Journal ID (iso-abbrev): Genome Biol Evol

Journal ID (publisher-id): gbe

Journal ID (hwp): gbe

Title: Genome Biology and Evolution

Publisher: Oxford University Press

ISSN (Electronic): 1759-6653

Publication date (Print): 2013

Publication date (Electronic): 6 March 2013

Publication date Collection: April 2013

Publication date PMC-release: 6 March 2013

Volume: 5

Issue: 4

Pages: 681-687

Affiliations

¹The Edmond and Lily Safra Center for Brain Sciences, The Hebrew University of Jerusalem, Israel

²Institute of Molecular BioSciences, Massey University, Palmerston North, New Zealand

³School of Mathematics and Physics, University of Tasmania, Hobart, Australia

Author notes

*Corresponding author: E-mail: liat.sg@mail.huji.ac.il; liats80@hotmail.com.

Associate editor: David Bryant

Article

Publisher ID: evt032

DOI: 10.1093/gbe/evt032

PMC ID: 3641631

PubMed ID: 23471508

SO-VID: 2b0edb75-21cf-424a-a0c0-406dcf3ea21b

License:

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

History

Date accepted: 20 February 2013

Page count

Pages: 7
