Contiguous and accurate de novo assembly of metazoan genomes with modest long read coverage

Abstract

Genome assemblies that are accurate, complete and contiguous are essential for identifying important structural and functional elements of genomes and for identifying genetic variation. Nevertheless, most recent genome assemblies remain incomplete and fragmented. While long molecule sequencing promises to deliver more complete genome assemblies with fewer gaps, concerns about error rates, low yields, stringent DNA requirements and uncertainty about best practices may discourage many investigators from adopting this technology. Here, in conjunction with the platinum standard Drosophila melanogaster reference genome, we analyze recently published long molecule sequencing data to identify what governs completeness and contiguity of genome assemblies. We also present a hybrid meta-assembly approach that achieves remarkable assembly contiguity for both Drosophila and human assemblies with only modest long molecule sequencing coverage. Our results motivate a set of preliminary best practices for obtaining accurate and contiguous assemblies, a ‘missing manual’ that guides key decisions in building high quality de novo genome assemblies, from DNA isolation to polishing the assembly.

Related collections: Genome Integrity

Most cited references 25

Assembly algorithms for next-generation sequencing data.

Jason R. Miller, Sergey Koren, Granger Sutton (2010)

The emergence of next-generation sequencing platforms led to resurgence of research in whole-genome shotgun assembly algorithms and software. DNA sequencing data from the Roche 454, Illumina/Solexa, and ABI SOLiD platforms typically present shorter read lengths, higher coverage, and different error profiles compared with Sanger sequencing data. Since 2005, several assembly software packages have been created or revised specifically for de novo assembly of next-generation sequencing data. This review summarizes and compares the published descriptions of packages named SSAKE, SHARCGS, VCAKE, Newbler, Celera Assembler, Euler, Velvet, ABySS, AllPaths, and SOAPdenovo. More generally, it compares the two standard methods known as the de Bruijn graph approach and the overlap/layout/consensus approach to assembly.

A whole-genome assembly of Drosophila.

E W Myers, G Sutton, A Delcher … (2000)

We report on the quality of a whole-genome assembly of Drosophila melanogaster and the nature of the computer algorithms that accomplished it. Three independent external data sources essentially agree with and support the assembly's sequence and ordering of contigs across the euchromatic portion of the genome. In addition, there are isolated contigs that we believe represent nonrepetitive pockets within the heterochromatin of the centromeres. Comparison with a previously sequenced 2.9-megabase region indicates that sequencing accuracy within nonrepetitive segments is greater than 99.99% without manual curation. As such, this initial reconstruction of the Drosophila sequence should be of substantial value to the scientific community.

GAGE: A critical evaluation of genome assemblies and assembly algorithms.

S. L. Salzberg, A. M. Phillippy, A. Zimin … (2012)

New sequencing technology has dramatically altered the landscape of whole-genome sequencing, allowing scientists to initiate numerous projects to decode the genomes of previously unsequenced organisms. The lowest-cost technology can generate deep coverage of most species, including mammals, in just a few days. The sequence data generated by one of these projects consist of millions or billions of short DNA sequences (reads) that range from 50 to 150 nt in length. These sequences must then be assembled de novo before most genome analyses can begin. Unfortunately, genome assembly remains a very difficult problem, made more difficult by shorter reads and unreliable long-range linking information. In this study, we evaluated several of the leading de novo assembly algorithms on four different short-read data sets, all generated by Illumina sequencers. Our results describe the relative performance of the different assemblers as well as other significant differences in assembly difficulty that appear to be inherent in the genomes themselves. Three overarching conclusions are apparent: first, that data quality, rather than the assembler itself, has a dramatic effect on the quality of an assembled genome; second, that the degree of contiguity of an assembly varies enormously among different assemblers and different genomes; and third, that the correctness of an assembly also varies widely and is not well correlated with statistics on contiguity. To enable others to replicate our results, all of our data and methods are freely available, as are all assemblers used in this study.

Author and article information

Journal

Journal abbreviation (NLM, ISO): Nucleic Acids Res

Title: Nucleic Acids Research

Publisher: Oxford University Press

ISSN (Print): 0305-1048

ISSN (Electronic): 1362-4962

Publication date (Print): 02 November 2016

Publication date (Electronic): 25 July 2016

Publication date PMC-release: 25 July 2016

Volume: 44

Issue: 19

Page: e147

Affiliations

[1] Department of Ecology and Evolutionary Biology, University of California Irvine, Irvine, CA 92697, USA

[2] Center for Complex Biological Systems, University of California Irvine, Irvine, CA 92697, USA

Author notes

[*] To whom correspondence should be addressed. Tel: +1 949 824 9527; Fax: +1 949 824 2181; Email: jje@uci.edu

[†] These authors contributed equally to the paper as first authors.

Article

DOI: 10.1093/nar/gkw654

PMC ID: 5100563

PubMed ID: 27458204

License:

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

History

Date accepted: 09 July 2016

Date revision received: 10 June 2016

Date received: 11 December 2015

Page count

Pages: 12

Custom metadata

cover-date 02 November 2016

ScienceOpen disciplines: Genetics
