TopHat: discovering splice junctions with RNA-Seq

Abstract

Motivation: A new protocol for sequencing the messenger RNA in a cell, known as RNA-Seq, generates millions of short sequence fragments in a single run. These fragments, or ‘reads’, can be used to measure levels of gene expression and to identify novel splice variants of genes. However, current software for aligning RNA-Seq data to a genome relies on known splice junctions and cannot identify novel ones. TopHat is an efficient read-mapping algorithm designed to align reads from an RNA-Seq experiment to a reference genome without relying on known splice sites.

Results: We mapped the RNA-Seq reads from a recent mammalian RNA-Seq experiment and recovered more than 72% of the splice junctions reported by the annotation-based software from that study, along with nearly 20 000 previously unreported junctions. The TopHat pipeline is much faster than previous systems, mapping nearly 2.2 million reads per CPU hour, which is sufficient to process an entire RNA-Seq experiment in less than a day on a standard desktop computer. We describe several challenges unique to ab initio splice site discovery from RNA-Seq reads that will require further algorithm development.
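As a rough check on the throughput figure above, the following sketch converts the reported rate of about 2.2 million reads per CPU hour into running time. It assumes a single CPU and a constant mapping rate. The function names and the 40 million read example are hypothetical and chosen only for illustration; they are not taken from the study.

```python
# Throughput reported in the abstract: about 2.2 million reads per CPU hour.
READS_PER_CPU_HOUR = 2.2e6


def cpu_hours(num_reads: float, rate: float = READS_PER_CPU_HOUR) -> float:
    """CPU hours needed to map num_reads at a constant rate (reads per CPU hour)."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return num_reads / rate


def max_reads_within(hours: float, rate: float = READS_PER_CPU_HOUR) -> float:
    """Number of reads that can be mapped in the given number of CPU hours."""
    return hours * rate


# Hypothetical experiment size (an assumption, not a figure from the paper).
example_reads = 40e6
print(f"{cpu_hours(example_reads):.1f} CPU hours for {example_reads:.0f} reads")
print(f"{max_reads_within(24):.0f} reads fit in 24 CPU hours")
```

Under these assumptions, any experiment of up to roughly 52.8 million reads would finish within 24 CPU hours, which is consistent with the "less than a day" statement.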

Availability: TopHat is free, open-source software available from http://tophat.cbcb.umd.edu

Contact: cole@cs.umd.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Author and article information

Journal

Journal ID (nlm-ta): Bioinformatics

Journal ID (publisher-id): bioinformatics

Journal ID (hwp): bioinfo

Title: Bioinformatics

Publisher: Oxford University Press

ISSN (Print): 1367-4803

ISSN (Electronic): 1460-2059

Publication date (Print): 1 May 2009

Publication date (Electronic): 16 March 2009

Publication date PMC-release: 16 March 2009

Volume: 25

Issue: 9

Pages: 1105-1111

Affiliations

¹Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742 and ²Department of Mathematics, University of California, Berkeley, CA 94720, USA

Author notes

*To whom correspondence should be addressed.

Associate Editor: Ivo Hofacker

Article

Publisher ID: btp120

DOI: 10.1093/bioinformatics/btp120

PMC ID: 2672628

PubMed ID: 19289445

SO-VID: 01882866-8513-4e09-93e6-c337d6ece027

License:

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

History

Date received: 23 October 2008

Date revision received: 24 February 2009

Date accepted: 26 February 2009
