
      Selecting optimal partitioning schemes for phylogenomic datasets


          Abstract

          Background

          Partitioning involves estimating independent models of molecular evolution for different subsets of sites in a sequence alignment, and has been shown to improve phylogenetic inference. Current methods for estimating best-fit partitioning schemes, however, are only computationally feasible with datasets of fewer than 100 loci. This is a problem because datasets with thousands of loci are increasingly common in phylogenetics.

          Methods

          We develop two novel methods for estimating best-fit partitioning schemes on large phylogenomic datasets: strict and relaxed hierarchical clustering. These methods use information from the underlying data to cluster together similar subsets of sites in an alignment, and build on clustering approaches that have been proposed elsewhere.
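The general idea of clustering similar subsets can be illustrated with a toy agglomerative loop. This is a minimal sketch under stated assumptions and not the PartitionFinder implementation: the function name `greedy_cluster`, the use of Euclidean distance between per-subset parameter vectors, the site-weighted averaging of parameters after a merge (a stand-in for re-estimating the model on the merged subset), and the user-supplied `score` callable are all hypothetical choices made for illustration.

```python
import math
from itertools import combinations


def greedy_cluster(params, sizes, score):
    """Toy agglomerative clustering of data subsets (illustrative only).

    params: dict mapping subset name -> tuple of parameter estimates (assumed given)
    sizes:  dict mapping subset name -> number of sites in that subset
    score:  hypothetical callable taking a scheme (list of tuples of names)
            and returning a number where lower is better (e.g. AICc or BIC)
    """
    # Each cluster is keyed by the sorted tuple of subset names it contains.
    clusters = {(name,): (tuple(vec), sizes[name]) for name, vec in params.items()}
    best_scheme = sorted(clusters)
    best_score = score(best_scheme)

    while len(clusters) > 1:
        # Find the two clusters whose parameter vectors are closest.
        a, b = min(
            combinations(sorted(clusters), 2),
            key=lambda pair: math.dist(clusters[pair[0]][0], clusters[pair[1]][0]),
        )
        (vec_a, n_a), (vec_b, n_b) = clusters.pop(a), clusters.pop(b)
        # Assumption: approximate the merged subset's parameters by a
        # site-weighted average instead of re-fitting a model.
        merged = tuple((x * n_a + y * n_b) / (n_a + n_b) for x, y in zip(vec_a, vec_b))
        clusters[tuple(sorted(a + b))] = (merged, n_a + n_b)

        scheme = sorted(clusters)
        current = score(scheme)
        if current < best_score:
            best_scheme, best_score = scheme, current

    return best_scheme, best_score
```

Each pass merges the single most similar pair and then scores the resulting scheme, so only one candidate scheme is evaluated per step. In a real analysis `score` would require fitting models to the alignment; in this sketch it is left to the caller.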

          Results

          We compare the performance of our methods to each other, and to existing methods for selecting partitioning schemes. We demonstrate that while strict hierarchical clustering has the best computational efficiency on very large datasets, relaxed hierarchical clustering provides scalable efficiency and returns dramatically better partitioning schemes as assessed by common criteria such as AICc and BIC scores.
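For readers unfamiliar with the criteria named above, the standard definitions of AICc and BIC can be written as a short sketch. The function names are illustrative, and treating the number of alignment sites as the sample size `n` is an assumption (a common convention, but not one stated in this abstract). Lower values indicate a better trade-off between fit and parameter count.

```python
import math


def aic(log_likelihood, k):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * k - 2 * log_likelihood


def aicc(log_likelihood, k, n):
    """AIC with the small-sample correction term 2k(k + 1) / (n - k - 1)."""
    if n - k - 1 <= 0:
        raise ValueError("AICc is undefined when n <= k + 1")
    return aic(log_likelihood, k) + (2 * k * (k + 1)) / (n - k - 1)


def bic(log_likelihood, k, n):
    """Bayesian information criterion: k ln(n) - 2 ln L."""
    return k * math.log(n) - 2 * log_likelihood


# Hypothetical numbers: two partitioning schemes fitted to the same alignment.
# n is taken to be the number of sites (an assumption of this sketch).
scheme_a = {"lnL": -1000.0, "k": 10}
scheme_b = {"lnL": -990.0, "k": 40}
n_sites = 500
better_by_bic = min(
    (scheme_a, scheme_b), key=lambda s: bic(s["lnL"], s["k"], n_sites)
)
```

Both criteria penalise extra parameters, which is why merging similar subsets can improve the score even though it slightly lowers the likelihood.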

          Conclusions

          These two methods provide the best current approaches to inferring partitioning schemes for very large datasets. We provide free open-source implementations of the methods in the PartitionFinder software. We hope that the use of these methods will help to improve the inferences made from large phylogenomic datasets.


                Author and article information

                Journal
                BMC Evolutionary Biology (BMC Evol Biol)
                Publisher: BioMed Central
                ISSN: 1471-2148
                Published: 17 April 2014
                Volume: 14
                Article number: 82
                Affiliations
                [1] Ecology Evolution and Genetics, Research School of Biology, Australian National University, Canberra, ACT, Australia
                [2] National Evolutionary Synthesis Center, Durham, NC, USA
                [3] Philosophy Program, Research School of Social Sciences, Australian National University, Canberra, ACT, Australia
                [4] Zoologisches Forschungsmuseum Alexander Koenig, Bonn, Germany
                [5] The Exelixis Lab, Scientific Computing Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
                [6] Karlsruhe Institute of Technology, Institute for Theoretical Informatics, Postfach 6980, 76128 Karlsruhe, Germany
                Article
                Publisher ID: 1471-2148-14-82
                DOI: 10.1186/1471-2148-14-82
                PMCID: PMC4012149
                PMID: 24742000
                Copyright © 2014 Lanfear et al.; licensee BioMed Central Ltd.

                This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

                Categories
                Methodology Article
