High-quality genome (re)assembly using chromosomal contact data

Abstract

Closing gaps in draft genome assemblies can be costly and time-consuming, and published genomes are therefore often left ‘unfinished.’ Here we show that genome-wide chromosome conformation capture (3C) data can be used to overcome these limitations, and present a computational approach rooted in polymer physics that determines the most likely genome structure using chromosomal contact data. This algorithm, named GRAAL, generates high-quality assemblies of genomes in which repeated and duplicated regions are accurately represented and offers a direct probabilistic interpretation of the computed structures. We first validated GRAAL on the reference genome of Saccharomyces cerevisiae, as well as other yeast isolates, where GRAAL recovered both known and unknown complex chromosomal structural variations. We then applied GRAAL to the finishing of the assembly of Trichoderma reesei and obtained a number of contigs congruent with the known karyotype of this species. Finally, we showed that GRAAL can accurately reconstruct human chromosomes from either fragments generated in silico or contigs obtained from de novo assembly. In all these applications, GRAAL compared favourably to recently published programmes implementing related approaches.

Abstract

The correct assembly of genomes from sequencing data remains a challenge due to difficulties in correctly assigning the location of repeated DNA elements. Here the authors describe GRAAL, an algorithm that utilizes genome-wide chromosome contact data within a probabilistic framework to produce accurate genome assemblies.

Most cited references 28

Genome sequencing and analysis of the biomass-degrading fungus Trichoderma reesei (syn. Hypocrea jecorina).

Diego Felipe Martinez, Randy M Berka, Bernard Henrissat … (2008)

Trichoderma reesei is the main industrial source of cellulases and hemicellulases used to depolymerize biomass to simple sugars that are converted to chemical intermediates and biofuels, such as ethanol. We assembled 89 scaffolds (sets of ordered and oriented contigs) to generate 34 Mbp of nearly contiguous T. reesei genome sequence comprising 9,129 predicted gene models. Unexpectedly, considering the industrial utility and effectiveness of the carbohydrate-active enzymes of T. reesei, its genome encodes fewer cellulases and hemicellulases than any other sequenced fungus able to hydrolyze plant cell wall polysaccharides. Many T. reesei genes encoding carbohydrate-active enzymes are distributed nonrandomly in clusters that lie between regions of synteny with other Sordariomycetes. Numerous genes encoding biosynthetic pathways for secondary metabolites may promote survival of T. reesei in its competitive soil habitat, but genome analysis provided little mechanistic insight into its extraordinary capacity for protein secretion. Our analysis, coupled with the genome sequence data, provides a roadmap for constructing enhanced T. reesei strains for industrial applications such as biofuel production.

GAGE: A critical evaluation of genome assemblies and assembly algorithms.

S. L. Salzberg, A. M. Phillippy, A. Zimin … (2012)

New sequencing technology has dramatically altered the landscape of whole-genome sequencing, allowing scientists to initiate numerous projects to decode the genomes of previously unsequenced organisms. The lowest-cost technology can generate deep coverage of most species, including mammals, in just a few days. The sequence data generated by one of these projects consist of millions or billions of short DNA sequences (reads) that range from 50 to 150 nt in length. These sequences must then be assembled de novo before most genome analyses can begin. Unfortunately, genome assembly remains a very difficult problem, made more difficult by shorter reads and unreliable long-range linking information. In this study, we evaluated several of the leading de novo assembly algorithms on four different short-read data sets, all generated by Illumina sequencers. Our results describe the relative performance of the different assemblers as well as other significant differences in assembly difficulty that appear to be inherent in the genomes themselves. Three overarching conclusions are apparent: first, that data quality, rather than the assembler itself, has a dramatic effect on the quality of an assembled genome; second, that the degree of contiguity of an assembly varies enormously among different assemblers and different genomes; and third, that the correctness of an assembly also varies widely and is not well correlated with statistics on contiguity. To enable others to replicate our results, all of our data and methods are freely available, as are all assemblers used in this study.

Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species

… (2013)

Background - The process of generating raw genome sequence data continues to become cheaper, faster, and more accurate. However, assembly of such data into high-quality, finished genome sequences remains challenging. Many genome assembly tools are available, but they differ greatly in terms of their performance (speed, scalability, hardware requirements, acceptance of newer read technologies) and in their final output (composition of assembled sequence). More importantly, it remains largely unclear how to best assess the quality of assembled genome sequences. The Assemblathon competitions are intended to assess current state-of-the-art methods in genome assembly. Results - In Assemblathon 2, we provided a variety of sequence data to be assembled for three vertebrate species (a bird, a fish, and a snake). This resulted in a total of 43 submitted assemblies from 21 participating teams. We evaluated these assemblies using a combination of optical map data, Fosmid sequences, and several statistical methods. From over 100 different metrics, we chose ten key measures by which to assess the overall quality of the assemblies. Conclusions - Many current genome assemblers produced useful assemblies, containing a significant representation of their genes, regulatory sequences, and overall genome structure. However, the high degree of variability between the entries suggests that there is still much room for improvement in the field of genome assembly and that approaches which work well in assembling the genome of one species may not necessarily work well for another.

Author and article information

Journal

Journal ID (nlm-ta): Nat Commun

Journal ID (iso-abbrev): Nat Commun

Title: Nature Communications

Publisher: Nature Pub. Group

ISSN (Electronic): 2041-1723

Publication date (Electronic): 17 December 2014

Volume: 5

Electronic Location Identifier: 5695

Affiliations

[1] Institut Pasteur, Department of Genomes and Genetics, Groupe Régulation Spatiale des Génomes, 75015 Paris, France

[2] CNRS, UMR 3525, 75015 Paris, France

[3] Institut Pasteur, Unité Imagerie et Modélisation, 75015 Paris, France

[4] CNRS, URA 2582, 75015 Paris, France

[5] Sorbonne Universités, UPMC Univ Paris06, IFD, 4 place Jussieu, 75252 Paris, France

[6] Max Planck Institute for Dynamics and Self-Organization, Group Biological Physics and Evolutionary Dynamics, Bunsenstr. 10, 37073 Göttingen, Germany

[7] Institute for Research on Cancer and Ageing of Nice (IRCAN), CNRS UMR 7284 - INSERM U108, Université de Nice Sophia Antipolis, 06107 Nice, France

[8] IFP Energies Nouvelles, 1 et 4 avenue de Bois-Préau, 92852 Rueil-Malmaison, France

[9] Institut Pasteur, Unité Cell Biology of Parasitism, 75015 Paris, France

Author notes

[a] hervemn@berkeley.edu

[b] czimmer@pasteur.fr

[c] romain.koszul@pasteur.fr

[*] These authors contributed equally to this work

Article

Publisher Item ID: ncomms6695

DOI: 10.1038/ncomms6695

PMC ID: 4284522

PubMed ID: 25517223

SO-VID: 59a2c332-df18-43fa-8067-0497fc887ddc

License:

This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/
