      Bushes in the Tree of Life

      Rokas A*, Carroll SB
      PLoS Biology
      Public Library of Science


          Abstract

          A discussion of how homoplasy (the frequency of independently evolved characters) and the spacing of cladogenetic events limit our ability to reconstruct the tree of life using existing phylogenetic methods.

          Most cited references (58)

          Phylogenetic classification and the universal tree.

          From comparative analyses of the nucleotide sequences of genes encoding ribosomal RNAs and several proteins, molecular phylogeneticists have constructed a "universal tree of life," taking it as the basis for a "natural" hierarchical classification of all living things. Although confidence in some of the tree's early branches has recently been shaken, new approaches could still resolve many methodological uncertainties. More challenging is evidence that most archaeal and bacterial genomes (and the inferred ancestral eukaryotic nuclear genome) contain genes from multiple sources. If "chimerism" or "lateral gene transfer" cannot be dismissed as trivial in extent or limited to special categories of genes, then no hierarchical universal classification can be taken as natural. Molecular phylogeneticists will have failed to find the "true tree," not because their methods are inadequate or because they have chosen the wrong genes, but because the history of life cannot properly be represented as a tree. However, taxonomies based on molecular sequences will remain indispensable, and understanding of the evolutionary process will ultimately be enriched, not impoverished.
            Population size does not influence mitochondrial genetic diversity in animals.

            Within-species genetic diversity is thought to reflect population size, history, ecology, and ability to adapt. Using a comprehensive collection of polymorphism data sets covering approximately 3000 animal species, we show that the widely used mitochondrial DNA (mtDNA) marker does not reflect species abundance or ecology: mtDNA diversity is not higher in invertebrates than in vertebrates, in marine than in terrestrial species, or in small than in large organisms. Nuclear loci, in contrast, fit these intuitive expectations. The unexpected mitochondrial diversity distribution is explained by recurrent adaptive evolution, challenging the neutral theory of molecular evolution and questioning the relevance of mtDNA in biodiversity and conservation studies.
              Discordance of Species Trees with Their Most Likely Gene Trees

Introduction

In typical phylogenetic studies of individual genes, the estimated gene tree topology is used as the estimate of the species tree topology. When many loci are studied, the species tree topology is often estimated using the most frequently inferred gene tree topology [1–5]. Although it is well known that the sorting of gene lineages at speciation can cause gene trees to differ in topology from species trees [6–9], one assumption has remained unquestioned: that the most probable gene tree topology produced by this sorting is the same as the species tree topology. This is the implicit premise that makes it sensible to estimate a species tree using a single gene tree or the most common among several gene trees. Here, under a population-genetic model for the evolution of gene lineages, we show that discordance can occur between the species tree and the most likely gene tree. Consequently, use of the most commonly observed gene tree topology to estimate the species tree topology (the “democratic vote” procedure among gene trees [10]) can be “positively misleading” [11], that is, convergent upon an erroneous estimate as the number of genes increases.

Results

We refer to gene trees that are more likely than the tree that matches the species tree as anomalous gene trees (AGTs). To characterize the conditions under which AGTs exist, consider a rooted binary species tree σ with topology ψ and with a vector of positive branch lengths λ, where λ_i denotes the length of branch i. Following previous studies of gene trees and species trees [6,7,12–15], we use the coalescent process from population genetics [16,17] to model gene evolution in genetically variable populations along branches of a species tree. We consider gene trees that are known exactly, assuming that mutations have not obscured the underlying relationships among gene lineages.
For n species, and one gene lineage sampled per species, there are n − 2 internal branches of the species tree that affect gene tree probabilities under the coalescent. Branch lengths are measured in coalescent time units, which can be converted to units of generations under any of several choices for models of evolution within species [16–18]. In the simplest model for diploids, each species has constant population size N/2 individuals, and λ_i coalescent units equal λ_i N generations. We can view gene lineages as moving backward in time, eventually coalescing down to one lineage. In each interval, lineages entering the interval from a more recent time period have the opportunity to coalesce, with coalescence equiprobable for each pair of lineages (as specified by the Yule model [19–22]) and the coalescence rate following the coalescent process [16,17].

For the fixed species tree σ, the gene tree topology G is viewed as a random variable whose distribution depends on σ. Under the model, this distribution is known for arbitrary rooted binary species trees [15]. Using P_σ(G = g) to denote the probability that a random gene tree has topology g when the species tree is σ, we define anomalous gene trees as follows.

Definition 1
(i) A gene tree topology g is anomalous for a species tree σ = (ψ, λ) if P_σ(G = g) > P_σ(G = ψ).
(ii) A topology ψ produces anomalies if there exists a vector of branch lengths λ such that the species tree σ = (ψ, λ) has at least one anomalous gene tree.
(iii) The anomaly zone for a topology ψ is the set of vectors of branch lengths λ for which σ = (ψ, λ) has at least one anomalous gene tree.

In other words, a gene tree topology g is anomalous for a species tree σ if a gene evolving along the branches of σ is more likely to have the topology g than it is to have the same topology as the species tree. AGTs do not exist for species trees with three taxa, the smallest number in a nontrivial, rooted, binary phylogeny.
Denoting the length of the one internal branch in a three-taxon tree by λ, the probability is 1 − (2/3)e^{−λ} that a gene tree has the same topology as the species tree [6,12,13]. This value always exceeds the probability that the gene tree topology matches a given one of the other two topologies, which is (1/3)e^{−λ}.

What about four taxa? If the species tree has sufficiently short branches, all coalescences of gene lineages may happen more anciently than its root. When coalescences are “deep,” the fact that random joining of lineages has a higher probability of producing some topologies than others [19,20,22] makes it likely that a gene tree has one of the high-probability topologies, regardless of the shape of the species tree. For four taxa, symmetric topologies each have probability 1/9, whereas asymmetric topologies each have probability 1/18 [6,19,20]. Thus, if the species tree is asymmetric with short branch lengths, symmetric gene tree topologies are more likely to be produced than are asymmetric topologies (Figure 1).

Figure 1. Anomalous Gene Trees for Four Taxa
Colored lines represent gene lineages that trace back to a common ancestor along the branches of a species tree with topology (((AB)C)D). The figure illustrates how a gene tree can have a higher probability of having a symmetric topology, in this case ((AD)(BC)), than of having the topology that matches the species tree. If the internal branches of the species tree, x and y, are short so that coalescences occur deep in the tree, the two sequences of coalescences that produce a given symmetric gene tree topology together have higher probability than the single sequence that produces the topology that matches the species tree.
(a) and (b) Two coalescence sequences leading to gene tree topology ((AD)(BC)). In (a), the lineages from B and C coalesce more recently than those from A and D, and in (b), the reverse is true.
(c) The single sequence of coalescences leading to gene tree topology (((AB)C)D).
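The three-taxon formula and the four-taxon values 1/9 and 1/18 quoted above can be checked with a short script. The following is a minimal sketch, not the authors' code: the function names are hypothetical, and the enumeration assumes that all coalescences happen above the species tree root, so that every pair of lineages is equally likely to join at each step.

```python
import math
from collections import Counter
from fractions import Fraction
from itertools import combinations


def three_taxon_probs(lam):
    """Return (P(gene tree matches species tree), P(one given other topology)).

    lam is the single internal branch length in coalescent units.
    """
    p_each_other = math.exp(-lam) / 3
    p_match = 1 - 2 * p_each_other
    return p_match, p_each_other


def random_joining_topologies(lineages):
    """Yield the final topology for every sequence of pairwise coalescences.

    Topologies are nested frozensets, so they are unordered and hashable.
    Every sequence has the same number of steps and every pair is equally
    likely at each step, so all yielded sequences are equiprobable.
    """
    if len(lineages) == 1:
        yield lineages[0]
        return
    for i, j in combinations(range(len(lineages)), 2):
        rest = [lin for k, lin in enumerate(lineages) if k not in (i, j)]
        merged = frozenset([lineages[i], lineages[j]])
        yield from random_joining_topologies(rest + [merged])


def deep_coalescence_probs(taxa):
    """Map each labeled topology to its probability under random joining."""
    counts = Counter(random_joining_topologies(list(taxa)))
    total = sum(counts.values())
    return {topo: Fraction(c, total) for topo, c in counts.items()}


if __name__ == "__main__":
    print(three_taxon_probs(0.5))
    probs = deep_coalescence_probs("ABCD")
    print(sorted(set(probs.values())))
```

For four taxa the enumeration visits 18 equiprobable coalescence sequences: each symmetric topology is reached by two of them and each asymmetric topology by one, which is the counting argument of Figure 1.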
The set of branch lengths that lie in the four-taxon anomaly zone can be computed from the complete enumeration of probabilities for combinations of four-taxon gene trees and species trees [14,15]. For AGTs to occur with four taxa, the species tree must be asymmetric and the gene tree must be symmetric. To see that AGTs cannot occur with a symmetric four-taxon species tree, note that in Table 4 of Rosenberg [14], when the species tree has topology ((AB)(CD)), the terms for the probability that a gene tree has any four-taxon topology are subsumed among the terms for the probability of the topology ((AB)(CD)).

Suppose now that the species tree for the four taxa has the asymmetric topology (((AB)C)D). Let x be the length of the deeper internal branch and let y be the length of the shallower internal branch. Let f(x,y), g(x,y), and h(x,y) denote the probabilities for a gene tree evolving along this species tree to have topologies (((AB)C)D), ((AC)(BD)), and ((AB)(CD)), respectively. These functions can be obtained from Table 5 of Rosenberg [14], and they equal:

$$f(x,y) = 1 - \tfrac{2}{3}e^{-x} - \tfrac{2}{3}e^{-y} + \tfrac{1}{3}e^{-(x+y)} + \tfrac{1}{18}e^{-(3x+y)}$$
$$g(x,y) = \tfrac{1}{6}e^{-(x+y)} - \tfrac{1}{18}e^{-(3x+y)}$$
$$h(x,y) = \tfrac{1}{3}e^{-x} - \tfrac{1}{6}e^{-(x+y)} - \tfrac{1}{18}e^{-(3x+y)}$$

It is straightforward to show that for any positive values of x and y, h(x,y) > g(x,y). From this relationship, and from the fact that ((AC)(BD)) and ((AD)(BC)) are equiprobable gene tree topologies for a species tree with topology (((AB)C)D), it follows that the species tree gives rise to:

0 AGTs if f(x,y) ≥ h(x,y)
1 AGT if g(x,y) ≤ f(x,y) < h(x,y)
3 AGTs if f(x,y) < g(x,y)

[…]

[…] > ɛ + q_1. Thus, for species tree σ, gene tree topology ψ* is more probable than ψ, and ψ therefore produces anomalies.

Proof of Lemma 4

Consider a species tree that has n species and a labeled topology L that is not n-maximally probable. The probability that no coalescences of gene lineages in a gene tree on the species tree occur more recently than the species tree root can be bounded below as follows. The species tree has n − 2 internal branches, where the length of branch i is λ_i coalescent time units.
If $n_i$ is the number of lineages “entering” branch i (that is, the number available for coalescence on branch i), the probability that the $n_i$ lineages coalesce to j lineages during coalescent time $\lambda_i$ is a known function $p_{n_i,j}(\lambda_i)$ [17,23,24], among whose properties are $\lim_{\lambda_i \to \infty} p_{n_i,1}(\lambda_i) = 1$ and $\lim_{\lambda_i \to 0} p_{n_i,n_i}(\lambda_i) = 1$. Because $p_{n_i,n_i}(\lambda_i) = e^{-n_i(n_i-1)\lambda_i/2}$, $p_{n_i,n_i}(\lambda_i)$ decreases as $n_i$ and $\lambda_i$ increase. Therefore, denoting $\lambda = \max_i \lambda_i$, the probability of no coalescences on any internal branch is

$$\prod_{i=1}^{n-2} p_{n_i,n_i}(\lambda_i) \ge [p_{n,n}(\lambda)]^{n-2}.$$

Let $q_1$ be the probability under the Yule model that a gene tree has labeled topology L, and let $q_2$ be the probability that a gene tree has the n-maximally probable labeled topology M. Because L is not n-maximally probable, $q_2 > q_1$. For ɛ > 0, because $\lim_{\lambda \to 0} p_{n,n}(\lambda) = 1$, λ can be chosen small enough that $p_{n,n}(\lambda) > (1 - \varepsilon)^{1/(n-2)}$, so that the probability that no coalescences occur on any internal branch (and all coalescences occur more anciently than the root) is greater than 1 − ɛ. Let ɛ < […] > ɛ + $q_1$, from which it follows that topology L produces anomalies.

Proof of Lemma 5

To identify the n-maximally probable labeled topologies for n ∈ {5,6,7,8}, the probability of each labeled topology L can be calculated as $\frac{2^{n-1}}{n!}\prod_{r=3}^{n}(r-1)^{-d_r(L)}$, where $d_r(L)$ is the number of internal nodes in the topology that have exactly r descendants (Table 1) [20,22]. It now must be shown that each of these n-maximally probable topologies produces anomalies.

Table 1. n-Maximally Probable Topologies for n = 5, 6, 7

Consider the species trees in Figure 3. Let x and y denote lengths of internal branches, as shown in the figure. For each tree, let λ be the total time between the root and the MRCA of A and B. (For n = 6,7,8, we can assume without loss of generality that the MRCA of C and D is at least as ancient as the MRCA of A and B.)
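The Yule-model probability of a labeled topology used in the proof of Lemma 5 can be computed directly. The sketch below assumes the standard closed form, 2^(n−1)/n! multiplied by 1/(r − 1) for every internal node with r descendant leaves, and it reads "descendants" as descendant leaves. Trees are written as nested 2-tuples of string labels, which is a representation chosen for this example only.

```python
from fractions import Fraction
from math import factorial


def clade_sizes(tree):
    """Return (number of leaves, list of leaf counts of all internal nodes).

    A tree is either a leaf label (str) or a 2-tuple of subtrees.
    """
    if isinstance(tree, str):
        return 1, []
    left, right = tree
    n_left, sizes_left = clade_sizes(left)
    n_right, sizes_right = clade_sizes(right)
    n = n_left + n_right
    return n, sizes_left + sizes_right + [n]


def yule_labeled_topology_probability(tree):
    """Probability of a labeled rooted binary topology under random joining."""
    n, sizes = clade_sizes(tree)
    prob = Fraction(2 ** (n - 1), factorial(n))
    for r in sizes:
        if r >= 3:  # nodes with two descendant leaves contribute a factor of 1
            prob /= r - 1
    return prob


if __name__ == "__main__":
    symmetric = (("A", "B"), ("C", "D"))
    asymmetric = ((("A", "B"), "C"), "D")
    print(yule_labeled_topology_probability(symmetric))
    print(yule_labeled_topology_probability(asymmetric))
```

Scanning all labeled topologies for a given n with this function and keeping those with the largest value is one way to identify the n-maximally probable topologies.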
For n = 5 and ɛ > 0, λ can be made short enough and x + y large enough so that when the species tree root is reached, the probability is at least 1 − ɛ that the gene lineages from species D and E have coalesced and that no other coalescences have occurred. The probability that the gene tree matches the species tree is at most ɛ + (1 − ɛ)(1/18), and the probability that its topology is ((AB)(C(DE))) is at least (1 − ɛ)(1/9). For ɛ < 1/19, the species tree topology produces an anomaly.

For n = 6 and ɛ > 0, λ can be made small enough and x + y large enough that when the species tree root is reached, the probability is at least 1 − ɛ that the gene lineages from species E and F have coalesced and that no other coalescences have occurred. The probability that the gene tree matches the species tree is at most ɛ + (1 − ɛ)(1/90), and the probability that its topology is (((AB)C)(D(EF))) is at least (1 − ɛ)(1/60). For ɛ < 1/181, the species tree topology produces an anomaly.

For n = 7 and n = 8, the proof follows the same argument as for n = 6 but with x and y both large, and with AGTs of (((AB)C)(D(E(FG)))) and (((AB)C)(D((EF)(GH)))), respectively.

Discussion

We have shown that all species tree topologies with five or more taxa, as well as asymmetric topologies with four taxa, have anomaly zones, regions in branch length space in which the most frequently produced gene tree differs from the species tree topology. In this region, assuming that gene trees are known exactly, the “democratic vote” procedure of using the most common gene tree as the estimate of the species tree is statistically inconsistent for phylogenetic inference. This inconsistency has a noticeable parallel with the inconsistency of maximum parsimony methods for inferring gene trees [11], as both settings experience a transition when the number of taxa n reaches five.
Under the assumption of equal evolutionary rates throughout a tree, only if n ≥ 5 can parsimony be inconsistent [25], and under the model we have studied for gene tree evolution along the branches of species trees, AGTs—although they can occur for n = 4 with asymmetric species tree topologies—occur for all species tree topologies only if n ≥ 5. Species trees with at least one short branch, especially if it is deep in the tree, are particularly susceptible to producing AGTs. For an asymmetric species tree with four taxa, by solving a(x) ≤ x, it can be seen that the anomaly zone includes the region in which both internal branch lengths are below ≈0.156 coalescent time units, or 0.156N generations if the species along these branches were constant-sized diploid populations with effective size N/2 individuals. However, if the deeper internal branch is shorter than 0.156 coalescent units, the shallower internal branch can become much longer without exiting the anomaly zone. Anomalous gene trees might not exist for typical four-taxon species trees, as branch lengths of 0.1–0.2 coalescent units are probably small compared to the time scale of most speciations. For example, for the human-chimp-gorilla-orangutan tree, Rannala and Yang [26] obtained an estimate of 1.2 million y for the shorter of the two internal branches in the tree, namely the branch separating the divergence time of humans and chimpanzees and the more ancient divergence of gorillas from the human-chimp lineage. Using their estimate of 24,600 for the effective size N/2 and 20 y for the generation time, this value translates into 1.2 coalescent units. Although there is considerable uncertainty in each aspect of the calculation, it seems unlikely that AGTs arise for the human-chimp-gorilla-orangutan tree. If the number of taxa considered is large, however, the AGT problem may be quite severe, as species trees with many taxa typically contain some deep short branches. 
This is especially true as taxonomic sampling increases, because the addition of taxa within a monophyletic group necessarily shortens some internal branches. Thus, AGTs are more likely to complicate inference for such speciose and rapidly diverging groups as Drosophila, in which large effective population sizes may have caused intervals between speciations to be relatively short in coalescent time. AGTs may also result from adaptive radiations, during which many divergences may have occurred in rapid succession, and from population divergences in population genetics and phylogeography, as population trees generally involve very closely related groups.

Although AGTs are easiest to find when the gene tree has more symmetry than the species tree, a consequence of Proposition 2 is that an AGT can have less symmetry than its underlying species tree. Additionally, a set (or forest) W of species trees can exhibit a surprising form of mutual anomalousness (Figure 4). We refer to a set W of at least two trees as a wicked forest if σ_i, σ_j ∈ W and i ≠ j imply that the topology of σ_i is anomalous for σ_j. By choosing one of the trees in a set to be n-maximally probable, it is not difficult to find examples of wicked forests, and although the example in Figure 4 has two trees, wicked forests can also be found that contain three or more trees (not shown). The counterintuitive result is that if two trees from the same wicked forest were considered as hypotheses for a phylogeny, observing a higher proportion of gene trees that match one species tree would be evidence in favor of the other species tree, and vice versa.

Figure 4. A Wicked Forest
(a) The two long internal branches have length 2, and the two short internal branches have length 0.1. For this species tree the probabilities that a random gene tree has topology ψ_i are 0.085 and 0.103 for i = 1 and i = 2, respectively. Hence ψ_2 is anomalous for σ_1.
(b) The one long internal branch has length 4, the shortest internal branch has length 0.1, and the other two internal branches have length 0.3. For this species tree, the gene tree probabilities are 0.066 and 0.060 for topologies ψ_1 and ψ_2, respectively.
Note that the two topologies disagree only on the placement of taxon D and that neither is 6-maximally probable.

It is noteworthy that our theoretical results apply to known (rather than estimated) gene trees, and do not consider the effect of mutations on inference of gene trees. This issue is important, as mutational history is a key factor in determining when an empirical study might actually be misled by AGTs. As an illustration, in one human-chimp-gorilla study, a substantial fraction of loci (six of 45 considered) had no informative substitutions that could provide support to any particular phylogenetic grouping [3]. That this many loci would not have any phylogenetic information in the human-chimp-gorilla clade suggests that for the smaller branch lengths typical of the anomaly zone, the fraction of uninformative loci could be much greater. Thus, situations that give rise to AGTs may coincide largely with situations for which the history of mutation does not produce enough informative sites to allow multifurcations in estimated gene trees to be resolved into sequences of bifurcations. However, the occurrence of informative sites depends on other factors besides those that lead to AGTs, such as external branch lengths of the species tree and rates of mutation and substitution; consequently, high substitution rates for species trees in the anomaly zone may very well lead to production of detectable AGTs. Just as an average over species trees generated from a speciation model can be used to assess how often maximum parsimony is inconsistent [27], such an analysis could be used to evaluate the frequency with which realistic species trees give rise to AGTs.
Of particular interest will be the extent to which AGTs occur at branch lengths and substitution rates for which the effects of mutation do not render gene trees unrecoverable; for species trees with these parameter values, empirical phylogenetic studies could be misled specifically by AGTs rather than by other difficulties in estimation.

What implications do AGTs have for the design of phylogenetic studies? First, their existence demonstrates that adding more genes to a phylogenetic analysis will not necessarily improve the inference, unless this approach is combined with algorithms that avoid the problem of AGTs. The commonly used concatenation procedure [28,29], in which the species tree is inferred by concatenating a set of loci and then employing the resulting sequence alignment to estimate a single gene tree, is not immune to the AGT problem (L. S. Kubatko and J. H. Degnan, unpublished data). Other types of data, such as inversions or genomic rearrangements, also would not necessarily help, as our results apply to any traits that evolve genealogically.

One strategy that may circumvent the occurrence of AGTs is the use of a sample with multiple individuals per species. Because many lineages from each species may persist reasonably far into the past, the chance of coalescences on a short branch is higher if many lineages are present [7,14,30]. Thus, increasing the sample size has a similar effect to lengthening short branches near the tips. As multiple sampled lineages from a species will coalesce on recent branches of the species tree, however, increased sample sizes will not assist the inference if recent branches are long but deep branches in the species tree are short. Additionally, because AGTs are absent for sets of three species, a sensible approach may be to use many genes to decisively infer all $\binom{n}{3}$ species trees for sets of three species, and to then use the uniqueness of species trees given their three-taxon clades [31,32] for species tree inference.
Different algorithms for combining data on multiple loci will have different degrees of susceptibility to the occurrence of AGTs, and a challenge for phylogenetics is to identify those procedures that are best able to overcome this new obstacle to accurate inference of species trees.

Materials and Methods

The methods used are included in the Results section.
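The four-taxon anomaly zone and the unit conversion used in the Discussion can be explored numerically. The sketch below is illustrative only. The closed forms for f, g and h were rederived for this example from the coalescent model described in the Results (species tree (((AB)C)D), deeper branch x, shallower branch y) and should be checked against Table 5 of Rosenberg [14] before being relied on. All function names are hypothetical.

```python
import math


def four_taxon_probs(x, y):
    """Return (f, g, h): probabilities of gene trees (((AB)C)D), ((AC)(BD))
    and ((AB)(CD)) for species tree (((AB)C)D), branch lengths in coalescent units."""
    ex, ey = math.exp(-x), math.exp(-y)
    exy = ex * ey                    # e^{-(x+y)}
    e3xy = math.exp(-(3 * x + y))    # e^{-(3x+y)}
    f = 1 - (2 / 3) * ex - (2 / 3) * ey + exy / 3 + e3xy / 18
    g = exy / 6 - e3xy / 18
    h = ex / 3 - exy / 6 - e3xy / 18
    return f, g, h


def count_agts(x, y):
    """Number of anomalous gene trees: 0, 1, or 3 (since h > g for y > 0)."""
    f, g, h = four_taxon_probs(x, y)
    if f >= h:
        return 0
    if f >= g:
        return 1
    return 3


def equal_branch_threshold(lo=1e-6, hi=1.0, iterations=60):
    """Bisection for the branch length x at which h(x, x) = f(x, x)."""
    for _ in range(iterations):
        mid = (lo + hi) / 2
        f, _g, h = four_taxon_probs(mid, mid)
        if h > f:
            lo = mid   # still inside the anomaly zone
        else:
            hi = mid
    return (lo + hi) / 2


def years_to_coalescent_units(years, generation_time, half_n):
    """Convert years to coalescent units, where half_n is the N/2 of the text
    and one coalescent unit equals N generations."""
    generations = years / generation_time
    return generations / (2 * half_n)


if __name__ == "__main__":
    print(count_agts(0.01, 0.01), count_agts(0.1, 0.1), count_agts(1.0, 1.0))
    print(equal_branch_threshold())
    print(years_to_coalescent_units(1.2e6, 20, 24600))
```

With equal internal branches, the bisection lands close to the ≈0.156 figure quoted in the Discussion, and the human-chimp-gorilla values (1.2 million y, 20 y per generation, N/2 = 24,600) convert to roughly 1.2 coalescent units, as stated there.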
                Author and article information

                Journal: PLoS Biology (PLoS Biol)
                Journal ID: pbio
                Publisher: Public Library of Science (San Francisco, USA)
                ISSN: 1544-9173 (print), 1545-7885 (electronic)
                Published: 14 November 2006 (November 2006 issue)
                Volume 4, issue 11, e352
                Author notes
                * To whom correspondence should be addressed. E-mail: arokas@ 123456mit.edu
                Article
                Manuscript number: 06-PLBI-E-1250R1
                DOI: 10.1371/journal.pbio.0040352
                PMC ID: 1637082
                PubMed ID: 17105342
                Record ID: 64f35f84-554d-4d23-bc49-9c789e3ad944
                Copyright: © 2006 Rokas and Carroll. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
                History
                Page count
                Pages: 6
                Categories
                Essay
                Bioinformatics/Computational Biology
                Evolution
                Paleontology
                Zoology
                Custom metadata
                Rokas A, Carroll SB (2006) Bushes in the tree of life. PLoS Biol 4(11): e352. DOI: 10.1371/journal.pbio.0040352

                Life sciences