ReproPhylo: An Environment for Reproducible Phylogenomics

Abstract

The reproducibility of experiments is key to the scientific process, and particularly necessary for accurate reporting of analyses in data-rich fields such as phylogenomics. We present ReproPhylo, a phylogenomic analysis environment developed to ensure experimental reproducibility, to facilitate the handling of large-scale data, and to assist methodological experimentation. Reproducibility, and instantaneous repeatability, is built into the ReproPhylo system and does not require user intervention or configuration, because the system stores the experimental workflow as a single, serialized Python object containing explicit provenance and environment information. This ‘single file’ approach ensures the persistence of provenance across iterations of the analysis, with changes automatically managed by the version control program Git. This file and the accompanying Git repository are the primary reproducibility outputs of the program. In addition, ReproPhylo produces an extensive human-readable report and generates a comprehensive experimental archive file, both of which are suitable for submission with publications. The system facilitates thorough experimental exploration of both parameters and data. ReproPhylo is a platform-independent Python module released under CC0 and is easily installed as a Docker image or a WinPython self-sufficient package, with a Jupyter Notebook GUI, or as a slimmer version in a Galaxy distribution.
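
The ‘single file’ idea described in the abstract can be illustrated with a minimal sketch. This is not ReproPhylo's actual API: the function names (`new_workflow`, `record_step`, `save_workflow`, `load_workflow`), the dictionary layout, and the step names and parameters are all hypothetical, chosen only to show how one serialized Python object might carry both provenance and environment information. The Git step is left out; the returned digest merely stands in for a way to tell whether the file changed between iterations.

```python
import hashlib
import pickle
import platform
import sys
import tempfile
from pathlib import Path


def new_workflow():
    """Start a workflow record that carries its own environment information."""
    return {
        "environment": {
            "python": sys.version.split()[0],
            "platform": platform.platform(),
        },
        "steps": [],  # provenance: one entry per analysis step, in order
    }


def record_step(workflow, program, parameters):
    """Append a provenance entry for one analysis step (names are illustrative)."""
    workflow["steps"].append({"program": program, "parameters": dict(parameters)})


def save_workflow(workflow, path):
    """Serialize the whole record to one file and return its SHA-256 digest."""
    data = pickle.dumps(workflow)
    Path(path).write_bytes(data)
    return hashlib.sha256(data).hexdigest()


def load_workflow(path):
    """Restore the record, including provenance and environment, from the file."""
    return pickle.loads(Path(path).read_bytes())


def demo():
    """Round-trip a two-step workflow through a temporary file."""
    workflow = new_workflow()
    record_step(workflow, "aligner", {"gap_open": 1.5})
    record_step(workflow, "tree_builder", {"model": "GTR", "bootstraps": 100})
    with tempfile.TemporaryDirectory() as folder:
        path = Path(folder) / "workflow.pkl"
        save_workflow(workflow, path)
        return load_workflow(path)
```

In a setup like this, committing the saved file to a Git repository after each change would give a history of the analysis; how ReproPhylo itself drives Git is not specified in the abstract, so that part is not sketched here.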

Related collections

Research Paper of the Future and the Reproducible Research Compendium

Most cited references 19

Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences

Jeremy Goecks, Anton Nekrutenko, James E. Taylor (2010)

Increased reliance on computational approaches in the life sciences has revealed grave concerns about how accessible and reproducible computation-reliant results truly are. Galaxy (http://usegalaxy.org), an open web-based platform for genomic research, addresses these problems. Galaxy automatically tracks and manages data provenance and provides support for capturing the context and intent of computational methods. Galaxy Pages are interactive, web-based documents that provide users with a medium to communicate a complete computational analysis.

PyEvolve: a toolkit for statistical modelling of molecular evolution

Andrew Butterfield, Vivek Vedagiri, Edward Lang … (2004)

Background: Examining the distribution of variation has proven an extremely profitable technique in the effort to identify sequences of biological significance. Most approaches in the field, however, evaluate only the conserved portions of sequences – ignoring the biological significance of sequence differences. A suite of sophisticated likelihood based statistical models from the field of molecular evolution provides the basis for extracting the information from the full distribution of sequence variation. The number of different problems to which phylogeny-based maximum likelihood calculations can be applied is extensive. Available software packages that can perform likelihood calculations suffer from a lack of flexibility and scalability, or employ error-prone approaches to model parameterisation.

Results: Here we describe the implementation of PyEvolve, a toolkit for the application of existing, and development of new, statistical methods for molecular evolution. We present the object architecture and design schema of PyEvolve, which includes an adaptable multi-level parallelisation schema. The approach for defining new methods is illustrated by implementing a novel dinucleotide model of substitution that includes a parameter for mutation of methylated CpG's, which required 8 lines of standard Python code to define. Benchmarking was performed using either a dinucleotide or codon substitution model applied to an alignment of BRCA1 sequences from 20 mammals, or a 10 species subset. Up to five-fold parallel performance gains over serial were recorded. Compared to leading alternative software, PyEvolve exhibited significantly better real world performance for parameter rich models with a large data set, reducing the time required for optimisation from ~10 days to ~6 hours.

Conclusion: PyEvolve provides flexible functionality that can be used either for statistical modelling of molecular evolution, or the development of new methods in the field. The toolkit can be used interactively or by writing and executing scripts. The toolkit uses efficient processes for specifying the parameterisation of statistical models, and implements numerous optimisations that make highly parameter rich likelihood functions solvable within hours on multi-cpu hardware. PyEvolve can be readily adapted in response to changing computational demands and hardware configurations to maximise performance. PyEvolve is released under the GPL and can be downloaded from .

A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates.

M. Kuhner, J. Felsenstein (1994)

Using simulated data, we compared five methods of phylogenetic tree estimation: parsimony, compatibility, maximum likelihood, Fitch-Margoliash, and neighbor joining. For each combination of substitution rates and sequence length, 100 data sets were generated for each of 50 trees, for a total of 5,000 replications per condition. Accuracy was measured by two measures of the distance between the true tree and the estimate of the tree, one measure sensitive to accuracy of branch lengths and the other not. The distance-matrix methods (Fitch-Margoliash and neighbor joining) performed best when they were constrained from estimating negative branch lengths; all comparisons with other methods used this constraint. Parsimony and compatibility had similar results, with compatibility generally inferior; Fitch-Margoliash and neighbor joining had similar results, with neighbor joining generally slightly inferior. Maximum likelihood was the most successful method overall, although for short sequences Fitch-Margoliash and neighbor joining were sometimes better. Bias of the estimates was inferred by measuring whether the independent estimates of a tree for different data sets were closer to the true tree than to each other. Parsimony and compatibility had particular difficulty with inaccuracy and bias when substitution rates varied among different branches. When rates of evolution varied among different sites, all methods showed signs of inaccuracy and bias.

Author and article information

Contributors

Paul P Gardner: Role: Editor

Journal

Journal ID (nlm-ta): PLoS Comput Biol

Journal ID (iso-abbrev): PLoS Comput. Biol

Journal ID (publisher-id): plos

Journal ID (pmc): ploscomp

Title: PLoS Computational Biology

Publisher: Public Library of Science (San Francisco, CA, USA)

ISSN (Print): 1553-734X

ISSN (Electronic): 1553-7358

Publication date (Electronic): 3 September 2015

Publication date Collection: September 2015

Volume: 11

Issue: 9

Electronic Location Identifier: e1004447

Affiliations

[1] Evolutionary Biology Group, School of Biological, Biomedical & Environmental Sciences, The University of Hull, Hull, United Kingdom

[2] Institute of Evolutionary Biology, The University of Edinburgh, Edinburgh, United Kingdom

University of Canterbury, NEW ZEALAND

Author notes

The authors have declared that no competing interests exist.

Conceived and designed the experiments: DHL AS MLB. Performed the experiments: AS DHL MJ. Analyzed the data: AS MJ. Contributed reagents/materials/analysis tools: DHL MLB. Wrote the paper: AS DHL MLB MJ.

* E-mail: A.Szitenberg@hull.ac.uk

Article

Publisher ID: PCOMPBIOL-D-15-00858

DOI: 10.1371/journal.pcbi.1004447

PMC ID: 4559436

PubMed ID: 26335558

SO-VID: be956af2-203d-4615-a887-bf5b16ca4334

License:

This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

History

Date received: 27 May 2015

Date accepted: 13 July 2015

Page count

Figures: 3, Tables: 1, Pages: 13

Funding

The Science of the Environment Council grant (http://www.nerc.ac.uk/) NE/J011355/1 was awarded to DHL and MLB. The Science of the Environment Council grant (http://www.nerc.ac.uk/) R8/H10/56 was awarded to GenPool, University of Edinburgh. The Medical Research Council grant (http://www.mrc.ac.uk/) G0900740 was awarded to GenPool, University of Edinburgh. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Custom metadata

Data Availability: ReproPhylo is distributed under the CC0 license and uses open access dependencies. It is under active development within a publicly accessible GitHub repository (http://goo.gl/s6EdVM). Documentation is provided as a version-tracked, publicly editable Google Docs manual (http://goo.gl/yW6J1J). A frozen version of the programme (Version 1.0), utilizing Jupyter Notebook as its interface, is available as a self-contained environment in a Docker image (http://goo.gl/JcHMGN). Use cases discussed in this manuscript are also available as Git repositories on GitHub (use case 1: https://goo.gl/BsOxfL, nbviewer: http://goo.gl/KzFAvj; use case 2: https://goo.gl/26IaiF, nbviewer: http://goo.gl/g3XP5B) and in FigShare (http://dx.doi.org/10.6084/m9.figshare.1409426).
