
      The Effect of Ambiguous Data on Phylogenetic Estimates Obtained by Maximum Likelihood and Bayesian Inference

      research-article


          Abstract

          Although an increasing number of phylogenetic data sets are incomplete, the effect of ambiguous data on phylogenetic accuracy is not well understood. We use 4-taxon simulations to study the effects of ambiguous data (i.e., missing characters or gaps) in maximum likelihood (ML) and Bayesian frameworks. By introducing ambiguous data in a way that removes confounding factors, we provide the first clear understanding of one mechanism by which ambiguous data can mislead phylogenetic analyses. We find that in both ML and Bayesian frameworks, among-site rate variation can interact with ambiguous data to produce misleading estimates of topology and branch lengths. Furthermore, within a Bayesian framework, priors on branch lengths and rate heterogeneity parameters can exacerbate the effects of ambiguous data, resulting in strongly misleading bipartition posterior probabilities. The magnitude and direction of the ambiguous data bias are a function of the number and taxonomic distribution of ambiguous characters, the strength of topological support, and whether or not the model is correctly specified. The results of this study have major implications for all analyses that rely on accurate estimates of topology or branch lengths, including divergence time estimation, ancestral state reconstruction, tree-dependent comparative methods, rate variation analysis, phylogenetic hypothesis testing, and phylogeographic analysis.
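
Likelihood calculations conventionally treat a missing character or gap as compatible with every nucleotide. The following is a minimal Python sketch of that convention, not the authors' simulation code. It assumes a Jukes-Cantor (JC69) model, a fixed unrooted 4-taxon tree ((A,B),(C,D)), and hypothetical function names (`jc69_prob`, `tip_vector`, `site_likelihood`) chosen only for illustration.

```python
import math

STATES = "ACGT"

def jc69_prob(i, j, t):
    """JC69 probability of going from state index i to state index j
    along a branch of length t (expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

def tip_vector(char):
    """Partial likelihoods at a tip. Ambiguous characters ('?', '-', 'N')
    are compatible with every state, so every entry is 1 (assumed convention)."""
    if char in "?-N":
        return [1.0] * 4
    return [1.0 if s == char else 0.0 for s in STATES]

def branch_sum(parent_state, tip, t):
    """Sum over tip states of P(parent -> state) times the tip partial likelihood."""
    return sum(jc69_prob(parent_state, k, t) * tip[k] for k in range(4))

def site_likelihood(chars, branch_lengths):
    """Likelihood of one site on the unrooted tree ((A,B),(C,D)).

    chars: four characters for taxa A, B, C, D.
    branch_lengths: (tA, tB, tC, tD, t_internal).
    """
    tA, tB, tC, tD, tI = branch_lengths
    a, b, c, d = (tip_vector(ch) for ch in chars)
    total = 0.0
    for x in range(4):  # state at the internal node joining A and B
        left = branch_sum(x, a, tA) * branch_sum(x, b, tB)
        for y in range(4):  # state at the internal node joining C and D
            right = branch_sum(y, c, tC) * branch_sum(y, d, tD)
            # 0.25 is the JC69 stationary frequency of state x
            total += 0.25 * left * jc69_prob(x, y, tI) * right
    return total
```

Under these assumptions, the likelihood of a site with one missing character equals the sum of the likelihoods of the four fully observed sites that agree with it. For example, `site_likelihood("AC?T", bl)` matches the sum over `"ACAT"`, `"ACCT"`, `"ACGT"`, and `"ACTT"` for any tuple of branch lengths `bl`.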

          Most cited references (52)

          Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees.

          Seq-Gen is a program that will simulate the evolution of nucleotide sequences along a phylogeny, using common models of the substitution process. A range of models of molecular evolution are implemented, including the general reversible model. Nucleotide frequencies and other parameters of the model may be given and site-specific rate heterogeneity can also be incorporated in a number of ways. Any number of trees may be read in and the program will produce any number of data sets for each tree. Thus, large sets of replicate simulations can be easily created. This can be used to test phylogenetic hypotheses using the parametric bootstrap. Seq-Gen can be obtained by WWW from http://evolve.zoo.ox.ac.uk/Seq-Gen/seq-gen.html or by FTP from ftp://evolve.zoo.ox.ac.uk/packages/Seq-Gen/. The package includes the source code, manual and example files. An Apple Macintosh version is available from the same sites.
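
The idea of simulating sequences along a tree can be illustrated with a small Python sketch. This is not Seq-Gen and does not use its interface. It assumes a JC69 model, a fixed 4-taxon tree ((A,B),(C,D)), and hypothetical names (`evolve`, `simulate_four_taxon`) introduced only for this example.

```python
import math
import random

STATES = "ACGT"

def evolve(state, t, rng):
    """Evolve one nucleotide along a branch of length t under JC69.
    The end state differs from the start with probability
    0.75 * (1 - exp(-4t/3)); each alternative is equally likely."""
    p_change = 0.75 * (1.0 - math.exp(-4.0 * t / 3.0))
    if rng.random() < p_change:
        return rng.choice([s for s in STATES if s != state])
    return state

def simulate_four_taxon(n_sites, branch_lengths, seed=0):
    """Simulate an alignment on the tree ((A,B),(C,D)).

    branch_lengths: (tA, tB, tC, tD, t_internal).
    Returns a dict mapping taxon name to its simulated sequence.
    """
    rng = random.Random(seed)
    tA, tB, tC, tD, tI = branch_lengths
    seqs = {name: [] for name in "ABCD"}
    for _ in range(n_sites):
        x = rng.choice(STATES)   # state at the node joining A and B
        y = evolve(x, tI, rng)   # state at the node joining C and D
        seqs["A"].append(evolve(x, tA, rng))
        seqs["B"].append(evolve(x, tB, rng))
        seqs["C"].append(evolve(y, tC, rng))
        seqs["D"].append(evolve(y, tD, rng))
    return {name: "".join(chars) for name, chars in seqs.items()}
```

Calling `simulate_four_taxon(1000, (0.1, 0.1, 0.1, 0.1, 0.05), seed=1)` yields one replicate data set. Looping over seeds gives the kind of replicate sets that a parametric bootstrap relies on.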

            Performance of maximum parsimony and likelihood phylogenetics when evolution is heterogeneous.

            All inferences in comparative biology depend on accurate estimates of evolutionary relationships. Recent phylogenetic analyses have turned away from maximum parsimony towards the probabilistic techniques of maximum likelihood and Bayesian Markov chain Monte Carlo (BMCMC). These probabilistic techniques represent a parametric approach to statistical phylogenetics, because their criterion for evaluating a topology--the probability of the data, given the tree--is calculated with reference to an explicit evolutionary model from which the data are assumed to be identically distributed. Maximum parsimony can be considered nonparametric, because trees are evaluated on the basis of a general metric--the minimum number of character state changes required to generate the data on a given tree--without assuming a specific distribution. The shift to parametric methods was spurred, in large part, by studies showing that although both approaches perform well most of the time, maximum parsimony is strongly biased towards recovering an incorrect tree under certain combinations of branch lengths, whereas maximum likelihood is not. All these evaluations simulated sequences by a largely homogeneous evolutionary process in which data are identically distributed. There is ample evidence, however, that real-world gene sequences evolve heterogeneously and are not identically distributed. Here we show that maximum likelihood and BMCMC can become strongly biased and statistically inconsistent when the rates at which sequence sites evolve change non-identically over time. Maximum parsimony performs substantially better than current parametric methods over a wide range of conditions tested, including moderate heterogeneity and phylogenetic problems not normally considered difficult.

              Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites.

              Q. Z. Yang (1993)
              Felsenstein's maximum-likelihood approach for inferring phylogeny from DNA sequences assumes that the rate of nucleotide substitution is constant over different nucleotide sites. This assumption is sometimes unrealistic, as has been revealed by analysis of real sequence data. In the present paper Felsenstein's method is extended to the case where substitution rates over sites are described by the gamma distribution. A numerical example is presented to show that the method fits the data better than do previous models.
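
The effect of gamma-distributed rates on a substitution model can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the method of the paper above: it assumes the JC69 model (chosen because it gives a closed form) and a gamma distribution of site rates with mean 1 and shape `alpha`. The function names are hypothetical. Averaging `exp(-4rt/3)` over such a rate `r` gives `(1 + 4t/(3*alpha)) ** (-alpha)`.

```python
import math

def p_same(t):
    """JC69 probability that a site shows the same nucleotide at both ends
    of a branch of length t, with one rate shared by all sites."""
    return 0.25 + 0.75 * math.exp(-4.0 * t / 3.0)

def p_same_gamma(t, alpha):
    """The same probability averaged over site rates drawn from a gamma
    distribution with shape alpha and mean 1 (assumed parameterization)."""
    return 0.25 + 0.75 * (1.0 + 4.0 * t / (3.0 * alpha)) ** (-alpha)
```

Small values of `alpha` mean strong rate variation: many sites barely change while a few change quickly, so `p_same_gamma(t, alpha)` stays above `p_same(t)` for the same `t`. As `alpha` grows, the two functions converge.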

                Author and article information

                Journal
                Syst Biol
                sysbio
                Systematic Biology
                Oxford University Press
                1063-5157
                1076-836X
                February 2009
                21 May 2009
                Volume: 58
                Issue: 1
                Pages: 130-145
                Affiliations
                [1] Section of Integrative Biology, University of Texas at Austin, 1 University Station C0930, Austin, TX 78712, USA
                [2] Present address: Department of Scientific Computing, Florida State University, Dirac Science Library, Tallahassee, FL 32306-4120, USA
                [3] Present address: Department of Biological Science, Florida State University, Tallahassee, FL 32306, USA
                [4] Plant Biology Department, University of Georgia, 403 Biosciences Building, Athens, GA 30602, USA
                Author notes
                [*] Correspondence to be sent to: Department of Scientific Computing, Florida State University, Dirac Science Library, Tallahassee, FL 32306-4120, USA; E-mail: alemmon@evotutor.org.

                Associate Editor: Lars Jermiin

                Article
                DOI: 10.1093/sysbio/syp017
                7539334
                20525573
                15ca77fa-6bb1-40f6-aa23-781cdc873f4f
                © Society of Systematic Biologists

                This article is made available via the PMC Open Access Subset for unrestricted re-use and analyses in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the COVID-19 pandemic or until permissions are revoked in writing. Upon expiration of these permissions, PMC is granted a perpetual license to make this article available via PMC and Europe PMC, consistent with existing copyright protections.

                History
                : 8 October 2007
                : 10 January 2008
                : 30 December 2008
                Categories
                Regular Articles

                Animal science & Zoology
                ambiguous characters, ambiguous data, Bayesian, bias, maximum likelihood, missing data, model misspecification, phylogenetics, posterior probabilities, prior
