Accurate inference of isoforms from multiple sample RNA-Seq data


Abstract

Background

RNA-Seq based transcriptome assembly has become a fundamental technique for studying expressed mRNAs (i.e., transcripts or isoforms) in a cell using high-throughput sequencing technologies, and it serves as a basis for analyzing structural and quantitative differences in expressed isoforms between samples. However, current transcriptome assembly algorithms are not specifically designed to handle the large number of errors inherent in real RNA-Seq datasets, especially those involving multiple samples, which makes downstream differential analysis difficult. At the same time, multiple-sample RNA-Seq datasets may carry more information than single-sample datasets, information that could be used to improve transcriptome assembly and abundance estimation but that existing assembly tools overlook.

Results

We formulate a computational framework for transcriptome assembly that handles noisy RNA-Seq reads and multiple-sample RNA-Seq datasets efficiently. We show that finding an optimal solution under this framework is NP-hard. We therefore develop an efficient heuristic algorithm, called Iterative Shortest Path (ISP), based on linear programming (LP) and integer linear programming (ILP). Our preliminary experimental results on both simulated and real datasets, together with a comparison against existing assembly tools, demonstrate that (i) the ISP algorithm assembles transcriptomes with greatly increased precision while keeping the same level of sensitivity, especially when many samples are involved, and (ii) its assembly results help improve downstream differential analysis. The source code of ISP is freely available at http://alumni.cs.ucr.edu/~liw/isp.html.
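The abstract does not spell out the steps of ISP, so the following is only a rough illustration of the general idea of extracting isoforms one path at a time from a splice graph. It is a plain greedy path-peeling heuristic, not the authors' LP/ILP formulation. The graph layout, the cost `1 / coverage`, the `min_cov` noise threshold, and the function names `shortest_path` and `peel_paths` are all assumptions made for this sketch.

```python
from collections import defaultdict


def shortest_path(costs, order, source, sink):
    """Cheapest source-to-sink path in a DAG.

    costs maps (u, v) -> edge cost; order lists the nodes topologically.
    Returns the path as a list of nodes, or None if the sink is unreachable.
    """
    out = defaultdict(list)
    for (u, v), cost in costs.items():
        out[u].append((v, cost))
    best = {source: (0.0, None)}  # node -> (distance, predecessor)
    for u in order:
        if u not in best:
            continue
        for v, cost in out[u]:
            cand = best[u][0] + cost
            if v not in best or cand < best[v][0]:
                best[v] = (cand, u)
    if sink not in best:
        return None
    path, node = [], sink
    while node is not None:
        path.append(node)
        node = best[node][1]
    return path[::-1]


def peel_paths(coverage, order, source="s", sink="t", min_cov=1.0):
    """Greedy illustration (hypothetical, not the published ISP algorithm).

    Repeatedly take the cheapest path under cost = 1 / residual coverage,
    report it with abundance equal to its bottleneck coverage, and subtract
    that abundance. Edges whose residual drops below min_cov (which must be
    positive) are treated as noise and ignored.
    """
    residual = dict(coverage)  # work on a copy of the input
    isoforms = []
    while True:
        costs = {e: 1.0 / c for e, c in residual.items() if c >= min_cov}
        path = shortest_path(costs, order, source, sink)
        if path is None:
            break
        path_edges = list(zip(path, path[1:]))
        abundance = min(residual[e] for e in path_edges)
        for e in path_edges:
            residual[e] -= abundance
        isoforms.append((path, abundance))
    return isoforms


if __name__ == "__main__":
    # Toy splice graph: exon e2 is skipped by a minor isoform.
    coverage = {
        ("s", "e1"): 30, ("e1", "e2"): 20, ("e1", "e3"): 5,
        ("e2", "e3"): 20, ("e3", "t"): 30,
    }
    for path, abundance in peel_paths(coverage, ["s", "e1", "e2", "e3", "t"]):
        print(path, abundance)
```

Each round removes at least one bottleneck edge, so the loop terminates. Coverage left over once no source-to-sink path remains is simply left unexplained in this toy version, whereas the abstract indicates that the actual framework models noisy reads and multiple samples explicitly.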


Author and article information

Contributors

Tao Jiang

Wei Li

Conference

Journal ID (nlm-ta): BMC Genomics

Journal ID (iso-abbrev): BMC Genomics

Title: BMC Genomics

Publisher: BioMed Central

ISSN (Electronic): 1471-2164

Publication date (Collection): 2015

Publication date (Electronic): 21 January 2015

Volume: 16

Issue: Suppl 2

Page: S15

Affiliations

[1] Department of Computer Science and Engineering, University of California, Riverside, Riverside, CA, 92507, USA

[2] MOE Key Lab of Bioinformatics and Bioinformatics Division, TNLIST / Department of Automation, Tsinghua University, Beijing, 100084, China

[3] MOE Key Lab of Bioinformatics and Bioinformatics Division, TNLIST / Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, China

[4] Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute and Harvard School of Public Health, Boston, MA, 02215, USA

Article

Publisher ID: 1471-2164-16-S2-S15

DOI: 10.1186/1471-2164-16-S2-S15

PMC ID: 4331715

PubMed ID: 25708199

SO-VID: c76f657a-82cd-458f-b531-1c8225070c56

License:

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Conference name: The Thirteenth Asia Pacific Bioinformatics Conference (APBC 2015)
