      ReproPhylo: An Environment for Reproducible Phylogenomics


          Abstract

          The reproducibility of experiments is key to the scientific process, and particularly necessary for accurate reporting of analyses in data-rich fields such as phylogenomics. We present ReproPhylo, a phylogenomic analysis environment developed to ensure experimental reproducibility, to facilitate the handling of large-scale data, and to assist methodological experimentation. Reproducibility and instantaneous repeatability are built into the ReproPhylo system and require no user intervention or configuration, because the system stores the experimental workflow as a single, serialized Python object containing explicit provenance and environment information. This ‘single file’ approach ensures the persistence of provenance across iterations of the analysis, with changes automatically managed by the version control program Git. This file and the accompanying Git repository are the primary reproducibility outputs of the program. In addition, ReproPhylo produces an extensive human-readable report and generates a comprehensive experimental archive file, both of which are suitable for submission with publications. The system facilitates thorough experimental exploration of both parameters and data. ReproPhylo is a platform-independent CC0 Python module and is easily installed as a Docker image or a WinPython self-sufficient package, with a Jupyter Notebook GUI, or as a slimmer version in a Galaxy distribution.
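
The ‘single file’ idea described in the abstract can be illustrated with a minimal sketch. This is not ReproPhylo's actual API: the function names (`new_workflow`, `record_step`, `save_workflow`, `load_workflow`), the dictionary layout, and the example program name are all hypothetical, chosen only to show how a workflow, its provenance, and its environment information might be kept together in one serialized Python object. Committing the resulting file to Git is assumed to happen separately and is not shown.

```python
import pickle
import platform
import sys
from datetime import datetime, timezone


def new_workflow():
    """Create a workflow record (hypothetical layout, not ReproPhylo's)."""
    return {
        # Environment information captured once, when the workflow is created.
        "environment": {
            "python": sys.version,
            "platform": platform.platform(),
        },
        # Provenance entries, appended in the order the steps are run.
        "steps": [],
    }


def record_step(workflow, program, parameters):
    """Append one provenance entry describing an analysis step."""
    workflow["steps"].append({
        "program": program,
        "parameters": dict(parameters),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })


def save_workflow(workflow, path):
    """Serialize the whole workflow to a single file."""
    with open(path, "wb") as handle:
        pickle.dump(workflow, handle)


def load_workflow(path):
    """Restore a workflow, including its provenance, from that file."""
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

Because everything lives in one object, reloading the file restores the provenance of every earlier step along with the environment in which the analysis was run, which is the property the abstract attributes to the single-file approach.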


                Author and article information

                Contributors
                Role: Editor
                Journal
                PLoS Comput Biol
                PLoS Computational Biology
                Public Library of Science (San Francisco, CA, USA)
                1553-734X
                1553-7358
                3 September 2015
                September 2015
                Volume: 11
                Issue: 9
                Article: e1004447
                Affiliations
                [1 ]Evolutionary Biology Group, School of Biological, Biomedical & Environmental Sciences, The University of Hull, Hull, United Kingdom
                [2 ]Institute of Evolutionary Biology, The University of Edinburgh, Edinburgh, United Kingdom
                University of Canterbury, NEW ZEALAND
                Author notes

                The authors have declared that no competing interests exist.

                Conceived and designed the experiments: DHL AS MLB. Performed the experiments: AS DHL MJ. Analyzed the data: AS MJ. Contributed reagents/materials/analysis tools: DHL MLB. Wrote the paper: AS DHL MLB MJ.

                Article
                PCOMPBIOL-D-15-00858
                10.1371/journal.pcbi.1004447
                4559436
                26335558
                be956af2-203d-4615-a887-bf5b16ca4334
                Copyright © 2015

                This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

                History
                Received: 27 May 2015
                Accepted: 13 July 2015
                Page count
                Figures: 3, Tables: 1, Pages: 13
                Funding
                The Science of the Environment Council grant (http://www.nerc.ac.uk/) NE/J011355/1 was awarded to DHL and MLB. The Science of the Environment Council grant (http://www.nerc.ac.uk/) R8/H10/56 was awarded to GenPool, University of Edinburgh. The Medical Research Council grant (http://www.mrc.ac.uk/) G0900740 was awarded to GenPool, University of Edinburgh. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
                Categories
                Research Article
                Custom metadata
                ReproPhylo is distributed under the CC0 license and uses open access dependencies. It is under active development within a publicly accessible GitHub repository (http://goo.gl/s6EdVM). Documentation is provided as a version-tracked, publicly editable Google Docs manual (http://goo.gl/yW6J1J). A frozen version of the programme (Version 1.0), which uses Jupyter Notebook as its interface, is available as a self-contained environment in a Docker image (http://goo.gl/JcHMGN). Use cases discussed in this manuscript are also available as Git repositories on GitHub (use case 1: https://goo.gl/BsOxfL, nbviewer: http://goo.gl/KzFAvj; use case 2: https://goo.gl/26IaiF, nbviewer: http://goo.gl/g3XP5B), and in FigShare (http://dx.doi.org/10.6084/m9.figshare.1409426).

                Quantitative & Systems biology
