Repetitive DNA and next-generation sequencing: computational challenges and solutions

Abstract

Repetitive DNA sequences are abundant in a broad range of species, from bacteria to mammals, and they cover nearly half of the human genome. Repeats have always presented technical challenges for sequence alignment and assembly programs. Next-generation sequencing projects, with their short read lengths and high data volumes, have made these challenges more difficult. From a computational perspective, repeats create ambiguities in alignment and assembly, which, in turn, can produce biases and errors when interpreting results. Simply ignoring repeats is not an option, as this creates problems of its own and may mean that important biological phenomena are missed. We discuss the computational problems surrounding repeats and describe strategies used by current bioinformatics systems to solve them.
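
The alignment ambiguity described in the abstract can be illustrated with a toy example. The sketch below is not taken from the article: the reference sequence, the 12-bp repeat, and the helper names `exact_match_positions` and `classify_read` are all hypothetical, and exact string matching stands in for a real aligner. It shows only that a short read lying entirely inside a repeat matches more than one location, while a read that reaches into unique flanking sequence matches one.

```python
def exact_match_positions(reference: str, read: str) -> list[int]:
    """Return every 0-based start position where `read` matches `reference` exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        # Resume the search one base further on so overlapping hits are found too.
        start = reference.find(read, start + 1)
    return positions


def classify_read(reference: str, read: str) -> tuple[str, list[int]]:
    """Label a read as 'unmapped', 'unique', or 'multi' by its number of exact hits."""
    hits = exact_match_positions(reference, read)
    if not hits:
        return "unmapped", hits
    if len(hits) == 1:
        return "unique", hits
    return "multi", hits


# Hypothetical toy data: two copies of one repeat separated by unique sequence.
REPEAT = "ACGTTGCAAGCT"
REFERENCE = "TTGACC" + REPEAT + "GGATCCTA" + REPEAT + "CATGAA"

if __name__ == "__main__":
    reads = {
        "inside repeat": "GTTGCAAG",         # lies entirely within the repeat
        "spans unique flank": "GACCACGTTG",  # starts in unique sequence, ends in the repeat
        "absent": "AAAAAAAA",                # does not occur in the reference
    }
    for label, read in reads.items():
        status, hits = classify_read(REFERENCE, read)
        print(f"{label}: {status} at {hits}")
```

Under these assumptions, the read inside the repeat is reported at both copies, which is the situation in which a mapper must either choose one location, report all of them, or discard the read. Real mappers also tolerate mismatches and indels, which this sketch deliberately leaves out.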

Most cited references 39

The transcriptional landscape of the yeast genome defined by RNA sequencing.

U. Nagalakshmi, Z. Wang, K. Waern … (2008)

The identification of untranslated regions, introns, and coding regions within an organism remains challenging. We developed a quantitative sequencing-based method called RNA-Seq for mapping transcribed regions, in which complementary DNA fragments are subjected to high-throughput sequencing and mapped to the genome. We applied RNA-Seq to generate a high-resolution transcriptome map of the yeast genome and demonstrated that most (74.5%) of the nonrepetitive sequence of the yeast genome is transcribed. We confirmed many known and predicted introns and demonstrated that others are not actively used. Alternative initiation codons and upstream open reading frames also were identified for many yeast genes. We also found unexpected 3'-end heterogeneity and the presence of many overlapping genes. These results indicate that the yeast transcriptome is more complex than previously appreciated.

Cited 970 times

Stampy: a statistical algorithm for sensitive and fast mapping of Illumina sequence reads.

Gerton Lunter, Martin Goodson (2011)

High-volume sequencing of DNA and RNA is now within reach of any research laboratory and is quickly becoming established as a key research tool. In many workflows, each of the short sequences ("reads") resulting from a sequencing run is first "mapped" (aligned) to a reference sequence to infer the genomic location from which the read derived, a challenging task because of the high data volumes and often large genomes. Existing read mapping software excels in either speed (e.g., BWA, Bowtie, ELAND) or sensitivity (e.g., Novoalign), but not in both. In addition, performance often deteriorates in the presence of sequence variation, particularly so for short insertions and deletions (indels). Here, we present a read mapper, Stampy, which uses a hybrid mapping algorithm and a detailed statistical model to achieve both speed and sensitivity, particularly when reads include sequence variation. This results in a higher useable sequence yield and improved accuracy compared to that of existing software.

Cited 528 times

Identification of novel transcripts in annotated genomes using RNA-Seq.

Adam Roberts, Harold Pimentel, Cole Trapnell … (2011)

We describe a new 'reference annotation based transcript assembly' problem for RNA-Seq data that involves assembling novel transcripts in the context of an existing annotation. This problem arises in the analysis of expression in model organisms, where it is desirable to leverage existing annotations for discovering novel transcripts. We present an algorithm for reference annotation-based transcript assembly and show how it can be used to rapidly investigate novel transcripts revealed by RNA-Seq in comparison with a reference annotation. The methods described in this article are implemented in the Cufflinks suite of software for RNA-Seq, freely available from http://bio.math.berkeley.edu/cufflinks. The software is released under the BOOST license. Contact: cole@broadinstitute.org; lpachter@math.berkeley.edu. Supplementary data are available at Bioinformatics online.

Cited 474 times

Author and article information

Journal

Title: Nature Reviews Genetics

Abbreviated Title: Nat Rev Genet

Publisher: Springer Science and Business Media LLC

ISSN (Print): 1471-0056

ISSN (Electronic): 1471-0064

Publication date Created: January 2012

Publication date (Electronic): November 29 2011

Publication date (Print): January 2012

Volume: 13

Issue: 1

Pages: 36-46

Article

DOI: 10.1038/nrg3117

PMC ID: 3324860

PubMed ID: 22124482

SO-VID: 02cca835-3bf5-4b37-88b9-435767424f36

License:

http://www.springer.com/tdm

Cited by 587

Related collections

Computational epistasis
