progressiveMauve: Multiple Genome Alignment with Gene Gain, Loss and Rearrangement


Abstract

Background

Multiple genome alignment remains a challenging problem. Effects of recombination including rearrangement, segmental duplication, gain, and loss can create a mosaic pattern of homology even among closely related organisms.

Methodology/Principal Findings

We describe a new method to align two or more genomes that have undergone rearrangements due to recombination and substantial amounts of segmental gain and loss (flux). We demonstrate that the new method can accurately align regions conserved in some, but not all, of the genomes, an important case not handled by our previous work. The method uses a novel alignment objective score called a sum-of-pairs breakpoint score, which facilitates accurate detection of rearrangement breakpoints when genomes have unequal gene content. We also apply a probabilistic alignment filtering method to remove erroneous alignments of unrelated sequences, which are commonly observed in other genome alignment methods. We describe new metrics for quantifying genome alignment accuracy which measure the quality of rearrangement breakpoint predictions and indel predictions. The new genome alignment algorithm demonstrates high accuracy in situations where genomes have undergone biologically feasible amounts of genome rearrangement, segmental gain and loss. We apply the new algorithm to a set of 23 genomes from the genera Escherichia, Shigella, and Salmonella. Analysis of whole-genome multiple alignments allows us to extend the previously defined concepts of core- and pan-genomes to include not only annotated genes, but also non-coding regions with potential regulatory roles. The 23 enterobacteria have an estimated core-genome of 2.46 Mbp conserved among all taxa and a pan-genome of 15.2 Mbp. We document substantial population-level variability among these organisms driven by segmental gain and loss. Interestingly, much variability lies in intergenic regions, suggesting that the Enterobacteriaceae may exhibit regulatory divergence.
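The abstract names a sum-of-pairs breakpoint score but does not define it. As a rough illustration of the general idea only, and not the paper's objective function, the sketch below counts rearrangement breakpoints for every pair of genomes after restricting each pair to the blocks they share, which is one way a pairwise sum can tolerate unequal gene content. The function names, the signed-integer encoding of block order and orientation, the linear-chromosome simplification, and the toy genomes are all assumptions made for this example.

```python
from itertools import combinations


def pairwise_breakpoints(order_a, order_b):
    """Count adjacencies of order_a that are absent from order_b.

    Each genome is a list of signed block ids (hypothetical encoding):
    the sign gives the strand, the absolute value names the block.
    Blocks missing from either genome are ignored for this pair.
    Chromosomes are treated as linear, without end caps.
    """
    shared = {abs(x) for x in order_a} & {abs(x) for x in order_b}
    a = [x for x in order_a if abs(x) in shared]
    b = [x for x in order_b if abs(x) in shared]

    # An adjacency (x, y) read on the other strand is (-y, -x).
    adjacencies_b = set()
    for x, y in zip(b, b[1:]):
        adjacencies_b.add((x, y))
        adjacencies_b.add((-y, -x))

    return sum((x, y) not in adjacencies_b for x, y in zip(a, a[1:]))


def sum_of_pairs_breakpoints(orders):
    """Sum pairwise breakpoint counts over all unordered genome pairs."""
    return sum(
        pairwise_breakpoints(orders[g], orders[h])
        for g, h in combinations(sorted(orders), 2)
    )


# Toy data: genome B carries an inversion of blocks 2-3, genome C lacks block 3.
toy_orders = {
    "A": [1, 2, 3, 4],
    "B": [1, -3, -2, 4],
    "C": [1, 2, 4],
}
print(sum_of_pairs_breakpoints(toy_orders))  # prints 4
```

In the toy data, the pair A/C contributes no breakpoints even though C lacks block 3, because the comparison is made only over blocks 1, 2, and 4.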

Conclusions

The multiple genome alignments generated by our software provide a platform for comparative genomic and population genomic studies. Free, open-source software implementing the described genome alignment approach is available from http://gel.ahabs.wisc.edu/mauve.
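The core- and pan-genome sizes quoted in the abstract can be pictured as simple sums over alignment blocks. The sketch below is a minimal illustration under stated assumptions, not the authors' accounting: each block is represented as a hypothetical `(length_bp, genome_ids)` pair, the core-genome is taken as the total length of blocks present in every genome, and the pan-genome as the total length of all blocks. The function name and the toy numbers are invented for the example.

```python
def core_and_pan_genome(blocks, n_genomes):
    """Return (core_bp, pan_bp) for a collection of alignment blocks.

    blocks: iterable of (length_bp, genome_ids) pairs, where genome_ids
            is the set of genomes in which the block is present
            (an assumed representation, one length per block).
    n_genomes: total number of genomes in the alignment.
    """
    blocks = list(blocks)  # allow generators, since we iterate twice
    core_bp = sum(length for length, genomes in blocks
                  if len(genomes) == n_genomes)
    pan_bp = sum(length for length, _ in blocks)
    return core_bp, pan_bp


# Toy blocks for three genomes; the lengths are made up.
toy_blocks = [
    (1000, {"A", "B", "C"}),  # conserved in all genomes: counts toward core
    (400, {"A", "B"}),        # shared by a subset: pan only
    (250, {"C"}),             # lineage-specific: pan only
]
core_bp, pan_bp = core_and_pan_genome(toy_blocks, n_genomes=3)
print(core_bp, pan_bp)  # prints 1000 1650
```

Because blocks are counted whether or not they overlap annotated genes, this style of accounting includes non-coding regions, which matches the extension of the core- and pan-genome concepts described in the abstract.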


Author and article information

Contributors

: Role: Editor

Journal

Journal ID (nlm-ta): PLoS One

Journal ID (publisher-id): plos

Journal ID (pmc): plosone

Title: PLoS ONE

Publisher: Public Library of Science (San Francisco, USA)

ISSN (Electronic): 1932-6203

Publication date (Collection): 2010

Publication date (Electronic): 25 June 2010

Volume: 5

Issue: 6

Electronic Location Identifier: e11147

Affiliations

[1] Genome Center and Department of Computer Science, University of Wisconsin, Madison, Wisconsin, United States of America

[2] Biotechnology Center and Department of Oncology, University of Wisconsin, Madison, Wisconsin, United States of America

[3] Genome Center and Department of Genetics, University of Wisconsin, Madison, Wisconsin, United States of America

University of California Riverside, United States of America

Author notes

* E-mail: aarondarling@ucdavis.edu

Conceived and designed the experiments: AED BM NP. Performed the experiments: AED. Analyzed the data: AED BM. Contributed reagents/materials/analysis tools: AED. Wrote the paper: AED BM NP.

[¤] Current address: Genome Center, University of California Davis, Davis, California, United States of America

Article

Publisher ID: 10-PONE-RA-16920R1

DOI: 10.1371/journal.pone.0011147

PMC ID: 2892488

PubMed ID: 20593022

SO-VID: 7f881df0-a5b7-46c0-8a3e-c9a0ff48ff1d

Copyright © Darling et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

History

Date received : 10 March 2010

Date accepted : 24 May 2010

Page count

Pages: 17
