ModelTest-NG: A New and Scalable Tool for the Selection of DNA and Protein Evolutionary Models

Abstract

ModelTest-NG is a reimplementation from scratch of jModelTest and ProtTest, two popular tools for selecting the best-fit nucleotide and amino acid substitution models, respectively. ModelTest-NG is one to two orders of magnitude faster than jModelTest and ProtTest but equally accurate, and it introduces several new features, such as ascertainment bias correction, mixture and free-rate models, and the automatic processing of single partitions. ModelTest-NG is available under a GNU GPL3 license at https://github.com/ddarriba/modeltest (last accessed September 2, 2019).

Most cited references (11)

A space-time process model for the evolution of DNA sequences.

Z. Yang (1995)

We describe a model for the evolution of DNA sequences by nucleotide substitution, whereby nucleotide sites in the sequence evolve over time, whereas the rates of substitution are variable and correlated over sites. The temporal process used to describe substitutions between nucleotides is a continuous-time Markov process, with the four nucleotides as the states. The spatial process used to describe variation and dependence of substitution rates over sites is based on a serially correlated gamma distribution, i.e., an auto-gamma model assuming Markov-dependence of rates at adjacent sites. To achieve computational efficiency, we use several equal-probability categories to approximate the gamma distribution, and the result is an auto-discrete-gamma model for rates over sites. Correlation of rates at sites then is modeled by the Markov chain transition of rates at adjacent sites from one rate category to another, the states of the chain being the rate categories. Two versions of nonparametric models, which place no restrictions on the distributional forms of rates for sites, also are considered, assuming either independence or Markov dependence. The models are applied to data of a segment of mitochondrial genome from nine primate species. Model parameters are estimated by the maximum likelihood method, and models are compared by the likelihood ratio test. Tremendous variation of rates among sites in the sequence is revealed by the analyses, and when rate differences for different codon positions are appropriately accounted for in the models, substitution rates at adjacent sites are found to be strongly (positively) correlated. Robustness of the results to uncertainty of the phylogenetic tree linking the species is examined.
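The discretization step mentioned in this abstract (several equal-probability categories approximating a gamma distribution of rates) can be sketched as follows. This is a minimal illustration, not the author's code: it assumes a gamma distribution with mean 1 (shape and rate both equal to alpha) and represents each category by its mean rate, which is one common convention that the abstract does not specify. It covers only the discretization, not the Markov dependence between adjacent sites. All function names are hypothetical.

```python
import math

def reg_lower_gamma(a, x):
    """Regularized lower incomplete gamma function P(a, x), by power series."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-16 * total and n < 100000:
        term *= x / (a + n)
        total += term
        n += 1
    return total * math.exp(a * math.log(x) - x - math.lgamma(a))

def gamma_quantile(p, alpha):
    """Rate r with P(rate <= r) = p under Gamma(shape=alpha, rate=alpha)."""
    lo, hi = 0.0, 1.0
    while reg_lower_gamma(alpha, alpha * hi) < p:
        hi *= 2.0  # expand the bracket until it contains the quantile
    for _ in range(200):  # plain bisection
        mid = 0.5 * (lo + hi)
        if reg_lower_gamma(alpha, alpha * mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def discrete_gamma_rates(alpha, k=4):
    """Mean rate of each of k equal-probability categories (assumed convention)."""
    cuts = [gamma_quantile(i / k, alpha) for i in range(1, k)]
    # For a mean-1 gamma, the integral of r*f(r) from 0 to c equals
    # P(alpha + 1, alpha * c); differences give each category's share.
    cum = [0.0] + [reg_lower_gamma(alpha + 1.0, alpha * c) for c in cuts] + [1.0]
    return [k * (cum[i + 1] - cum[i]) for i in range(k)]

rates = discrete_gamma_rates(alpha=0.5, k=4)
print(rates)
```

Because every category has probability 1/k, the category rates average to 1 by construction, so the discretization leaves the overall mean substitution rate unchanged.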

Performance-based selection of likelihood models for phylogeny estimation.

Vladimir N. Minin, Zaid Abdo, Paul Joyce … (2003)

Phylogenetic estimation has largely come to rely on explicitly model-based methods. This approach requires that a model be chosen and that that choice be justified. To date, justification has largely been accomplished through use of likelihood-ratio tests (LRTs) to assess the relative fit of a nested series of reversible models. While this approach certainly represents an important advance over arbitrary model selection, the best fit of a series of models may not always provide the most reliable phylogenetic estimates for finite real data sets, where all available models are surely incorrect. Here, we develop a novel approach to model selection, which is based on the Bayesian information criterion, but incorporates relative branch-length error as a performance measure in a decision theory (DT) framework. This DT method includes a penalty for overfitting, is applicable prior to running extensive analyses, and simultaneously compares all models being considered and thus does not rely on a series of pairwise comparisons of models to traverse model space. We evaluate this method by examining four real data sets and by using those data sets to define simulation conditions. In the real data sets, the DT method selects the same or simpler models than conventional LRTs. In order to lend generality to the simulations, codon-based models (with parameters estimated from the real data sets) were used to generate simulated data sets, which are therefore more complex than any of the models we evaluate. On average, the DT method selects models that are simpler than those chosen by conventional LRTs. Nevertheless, these simpler models provide estimates of branch lengths that are more accurate both in terms of relative error and absolute error than those derived using the more complex (yet still wrong) models chosen by conventional LRTs. This method is available in a program called DT-ModSel.
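The abstract says the DT method builds on the Bayesian information criterion and compares all candidate models at once. The sketch below shows only that plain BIC ranking step; it does not implement the branch-length error term of the DT method. The function names are hypothetical, and the log-likelihoods are placeholder numbers chosen for illustration, not results from any data set. Parameter counts here cover only substitution-model parameters, which is a simplifying assumption.

```python
import math

def bic(log_likelihood, num_params, num_sites):
    """Bayesian information criterion: -2 lnL + k ln(n)."""
    return -2.0 * log_likelihood + num_params * math.log(num_sites)

def rank_by_bic(candidates, num_sites):
    """Score every candidate at once and sort by BIC (lower is better)."""
    scored = [
        (name, bic(lnl, k, num_sites))
        for name, (lnl, k) in candidates.items()
    ]
    return sorted(scored, key=lambda item: item[1])

# Placeholder values: model name -> (log-likelihood, free parameters).
candidates = {
    "JC": (-5230.0, 0),
    "HKY": (-5105.0, 4),
    "GTR+G": (-5098.0, 9),
}

ranking = rank_by_bic(candidates, num_sites=1000)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

With these placeholder numbers, the small likelihood gain of the richest model does not offset its penalty for extra parameters, which is the kind of overfitting penalty the abstract refers to.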

Modeling protein evolution with several amino acid replacement matrices depending on site rates.

Si Le, Cuong Dang, Olivier Gascuel (2012)

Most protein substitution models use a single amino acid replacement matrix summarizing the biochemical properties of amino acids. However, site evolution is highly heterogeneous and depends on many factors that influence the substitution patterns. In this paper, we investigate the use of different substitution matrices for different site evolutionary rates. Indeed, the variability of evolutionary rates corresponds to one of the most apparent heterogeneity factors among sites, and there is no reason to assume that the substitution patterns remain identical regardless of the evolutionary rate. We first introduce LG4M, which is composed of four matrices, each corresponding to one discrete gamma rate category (of four). These matrices differ in their amino acid equilibrium distributions and in their exchangeabilities, contrary to the standard gamma model where only the global rate differs from one category to another. Next, we present LG4X, which also uses four different matrices, but leaves aside the gamma distribution and follows a distribution-free scheme for the site rates. All these matrices are estimated from a very large alignment database, and our two models are tested using a large sample of independent alignments. Detailed analysis of resulting matrices and models shows the complexity of amino acid substitutions and the advantage of flexible models such as LG4M and LG4X. Both significantly outperform single-matrix models, providing gains of dozens to hundreds of log-likelihood units for most data sets. LG4X obtains substantial gains compared with LG4M, thanks to its distribution-free scheme for site rates. Since LG4M and LG4X display such advantages but require the same memory space and have comparable running times to standard models, we believe that LG4M and LG4X are relevant alternatives to single replacement matrices. Our models, data, and software are available from http://www.atgc-montpellier.fr/models/lg4x.

Author and article information

Contributors

Keith Crandall: Role: Associate Editor

Journal

Journal ID (nlm-ta): Mol Biol Evol

Journal ID (iso-abbrev): Mol. Biol. Evol

Journal ID (publisher-id): molbev

Title: Molecular Biology and Evolution

Publisher: Oxford University Press

ISSN (Print): 0737-4038

ISSN (Electronic): 1537-1719

Publication date (Print): January 2020

Publication date (Electronic): 21 August 2019

Publication date PMC-release: 21 August 2019

Volume: 37

Issue: 1

Pages: 291-294

Affiliations

[1] Computer Architecture Group, Centro de investigación CITIC, Universidade da Coruña, Elviña, A Coruña, Spain

[2] Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany

[3] Department of Biochemistry, Genetics, and Immunology, University of Vigo, Vigo, Spain

[4] Biomedical Research Center (CINBIO), University of Vigo, Vigo, Spain

[5] Galicia Sur Health Research Institute, Vigo, Spain

[6] Institute of Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe, Germany

[7] Department of Genetics, Evolution and Environment, University College London, London, United Kingdom

Author notes

Corresponding author: E-mail: diego.darriba@udc.es.

Author information

David Posada http://orcid.org/0000-0003-1407-3406

Alexandros Stamatakis http://orcid.org/0000-0003-0353-0691

Article

Publisher ID: msz189

DOI: 10.1093/molbev/msz189

PMC ID: 6984357

PubMed ID: 31432070

SO-VID: 10c20035-110c-4ae0-a687-58bc68728d96

License:

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Page count

Pages: 4

Funding

Funded by: Ministry of Economy and Competitiveness of Spain

Award ID: TIN2016-75845-P

Award ID: ED431C 2017/04

Funded by: Klaus Tschira Foundation and DFG

Award ID: STA-860/6

