Application of the MAFFT sequence alignment program to large data—reexamination of the usefulness of chained guide trees

There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

Abstract

Motivation: Large multiple sequence alignments (MSAs), consisting of thousands of sequences, are becoming more and more common, due to advances in sequencing technologies. The MAFFT MSA program has several options for building large MSAs, but their performances have not been sufficiently assessed yet, because realistic benchmarking of large MSAs has been difficult. Recently, such assessments have been made possible through the HomFam and ContTest benchmark protein datasets. Along with the development of these datasets, an interesting theory was proposed: chained guide trees increase the accuracy of MSAs of structurally conserved regions. This theory challenges the basis of progressive alignment methods and needs to be examined by being compared with other known methods including computationally intensive ones.

Results: We used HomFam, ContTest and OXFam (an extended version of OXBench) to evaluate several methods enabled in MAFFT: (1) a progressive method with approximate guide trees, (2) a progressive method with chained guide trees, (3) a combination of an iterative refinement method and a progressive method and (4) a less approximate progressive method that uses a rigorous guide tree and consistency score. Other programs, Clustal Omega and UPP, available for large MSAs, were also included into the comparison. The effect of method 2 (chained guide trees) was positive in ContTest but negative in HomFam and OXFam. Methods 3 and 4 increased the benchmark scores more consistently than method 2 for the three datasets, suggesting that they are safer to use.

Availability and Implementation: http://mafft.cbrc.jp/alignment/software/

Contact: katoh@ 123456ifrec.osaka-u.ac.jp

Supplementary information: Supplementary data are available at Bioinformatics online.

Related collections

Most cited references 21

Record: found
Abstract: found
Article: not found

Assessing the utility of coevolution-based residue-residue contact predictions in a sequence- and structure-rich era.

Hetunandan Kamisetty, Sergey Ovchinnikov, David. Baker (2013)

Recently developed methods have shown considerable promise in predicting residue-residue contacts in protein 3D structures using evolutionary covariance information. However, these methods require large numbers of evolutionarily related sequences to robustly assess the extent of residue covariation, and the larger the protein family, the more likely that contact information is unnecessary because a reasonable model can be built based on the structure of a homolog. Here we describe a method that integrates sequence coevolution and structural context information using a pseudolikelihood approach, allowing more accurate contact predictions from fewer homologous sequences. We rigorously assess the utility of predicted contacts for protein structure prediction using large and representative sequence and structure databases from recent structure prediction experiments. We find that contact predictions are likely to be accurate when the number of aligned sequences (with sequence redundancy reduced to 90%) is greater than five times the length of the protein, and that accurate predictions are likely to be useful for structure modeling if the aligned sequences are more similar to the protein of interest than to the closest homolog of known structure. These conditions are currently met by 422 of the protein families collected in the Pfam database.

0 comments Cited 294 times – based on 0 reviews      Review now

Bookmark

Record: found
Abstract: found
Article: not found

A comprehensive comparison of multiple sequence alignment programs.

J. D. Thompson, F Plewniak, O Poch (1999)

In recent years improvements to existing programs and the introduction of new iterative algorithms have changed the state-of-the-art in protein sequence alignment. This paper presents the first systematic study of the most commonly used alignment programs using BAliBASE benchmark alignments as test cases. Even below the 'twilight zone' at 10-20% residue identity, the best programs were capable of correctly aligning on average 47% of the residues. We show that iterative algorithms often offer improved alignment accuracy though at the expense of computation time. A notable exception was the effect of introducing a single divergent sequence into a set of closely related sequences, causing the iteration to diverge away from the best alignment. Global alignment programs generally performed better than local methods, except in the presence of large N/C-terminal extensions and internal insertions. In these cases, a local algorithm was more successful in identifying the most conserved motifs. This study enables us to propose appropriate alignment strategies, depending on the nature of a particular set of sequences. The employment of more than one program based on different alignment techniques should significantly improve the quality of automatic protein sequence alignment methods. The results also indicate guidelines for improvement of alignment algorithms.

0 comments Cited 166 times – based on 0 reviews      Review now

Bookmark

Record: found
Abstract: found
Article: not found

An alignment confidence score capturing robustness to guide tree uncertainty.

Giddy Landan, Eyal Privman, Osnat Penn … (2010)

Multiple sequence alignment (MSA) is the basis for a wide range of comparative sequence analyses from molecular phylogenetics to 3D structure prediction. Sophisticated algorithms have been developed for sequence alignment, but in practice, many errors can be expected and extensive portions of the MSA are unreliable. Hence, it is imperative to understand and characterize the various sources of errors in MSAs and to quantify site-specific alignment confidence. In this paper, we show that uncertainties in the guide tree used by progressive alignment methods are a major source of alignment uncertainty. We use this insight to develop a novel method for quantifying the robustness of each alignment column to guide tree uncertainty. We build on the widely used bootstrap method for perturbing the phylogenetic tree. Specifically, we generate a collection of trees and use each as a guide tree in the alignment algorithm, thus producing a set of MSAs. We next test the consistency of every column of the MSA obtained from the unperturbed guide tree with respect to the set of MSAs. We name this measure the "GUIDe tree based AligNment ConfidencE" (GUIDANCE) score. Using the Benchmark Alignment data BASE benchmark as well as simulation studies, we show that GUIDANCE scores accurately identify errors in MSAs. Additionally, we compare our results with the previously published Heads-or-Tails score and show that the GUIDANCE score is a better predictor of unreliably aligned regions.

0 comments Cited 144 times – based on 0 reviews      Review now

Bookmark

All references

Author and article information

Journal

Journal ID (nlm-ta): Bioinformatics

Journal ID (iso-abbrev): Bioinformatics

Journal ID (publisher-id): bioinformatics

Journal ID (hwp): bioinfo

Title: Bioinformatics

Publisher: Oxford University Press

ISSN (Print): 1367-4803

ISSN (Electronic): 1367-4811

Publication date (Print): 01 November 2016

Publication date (Electronic): 04 July 2016

Publication date PMC-release: 04 July 2016

Volume: 32

Issue: 21

Pages: 3246-3251

Affiliations

¹Graduate School of Information Sciences, Tohoku University, Sendai 980-8579, Japan

²Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 135-0064, Japan

³Biotechnology Research Institute for Drug Discovery, National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 135-0064, Japan

⁴Immunology Frontier Research Center, Osaka University, Suita 565-0871, Japan

Author notes

*To whom correspondence should be addressed

Associate Editor: Alfonso Valencia

Article

Publisher ID: btw412

DOI: 10.1093/bioinformatics/btw412

PMC ID: 5079479

PubMed ID: 27378296

SO-VID: a61e2fc5-f4cc-4e13-97a2-adf054b8692d

License:

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License ( http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com

History

Date received : 28 March 2016

Date revision received : 27 May 2016

Date accepted : 20 June 2016

Page count

Pages: 6

Comments

Comment on this article

scite_

Cited by 126

See all cited by

Most referenced authors 570

See all reference authors

Application of the MAFFT sequence alignment program to large data—reexamination of the usefulness of chained guide trees

Read this article at

Abstract

Related collections

Genetoberfest

Most cited references 21

Assessing the utility of coevolution-based residue-residue contact predictions in a sequence- and structure-rich era.

A comprehensive comparison of multiple sequence alignment programs.

An alignment confidence score capturing robustness to guide tree uncertainty.

Author and article information

Journal

Affiliations

Author notes

Article

History

Page count

Categories

Comments

Comment on this article

Similar content 87

Cited by 126

Most referenced authors 570