A multi-sample approach increases the accuracy of transcript assembly

Abstract

Transcript assembly from RNA-seq reads is a critical step in gene expression and subsequent functional analyses. Here we present PsiCLASS, an accurate and efficient transcript assembler based on an approach that simultaneously analyzes multiple RNA-seq samples. PsiCLASS combines mixture statistical models for exonic feature selection across multiple samples with splice graph based dynamic programming algorithms and a weighted voting scheme for transcript selection. PsiCLASS achieves significantly better sensitivity-precision tradeoff, and renders precision up to 2-3 fold higher than the StringTie system and Scallop plus TACO, the two best current approaches. PsiCLASS is efficient and scalable, assembling 667 GEUVADIS samples in 9 h, and has robust accuracy with large numbers of samples.
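The abstract names a weighted voting scheme for transcript selection but does not give its details. As a rough illustration of the general idea only, the sketch below lets every sample vote for each candidate transcript it supports, weights the vote by the transcript's relative abundance in that sample, and keeps candidates whose summed weight reaches a threshold. The function name `vote_transcripts`, the `min_score` parameter, and the weighting rule are all assumptions made for this example and are not taken from PsiCLASS.

```python
from collections import defaultdict


def vote_transcripts(samples, min_score=1.0):
    """Select transcripts by abundance-weighted voting across samples.

    samples: list of dicts, one per RNA-seq sample, each mapping an
             intron chain (tuple of (start, end) pairs) to an abundance.
    min_score: hypothetical cutoff on the summed vote weight.
    """
    scores = defaultdict(float)
    for sample in samples:
        total = sum(sample.values())
        if total <= 0:
            continue  # a sample with no expressed candidates casts no votes
        for chain, abundance in sample.items():
            # Each sample contributes at most weight 1 overall (assumption).
            scores[chain] += abundance / total
    return sorted(chain for chain, score in scores.items() if score >= min_score)


# Toy input: three samples, three candidate intron chains.
chain_a = ((100, 200), (300, 400))
chain_b = ((100, 250), (300, 400))
chain_c = ((100, 200),)
toy_samples = [
    {chain_a: 8.0, chain_b: 2.0},
    {chain_a: 5.0, chain_c: 5.0},
    {chain_a: 9.0, chain_b: 1.0},
]
selected = vote_transcripts(toy_samples, min_score=1.0)
```

In this toy input only `chain_a` is supported strongly and consistently enough to pass the cutoff, which mirrors the intuition that evidence pooled across samples can filter out candidates that look plausible in one sample alone.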

Abstract

Transcript assembly is an important step in analysis of RNA-seq data whose accuracy influences downstream quantification, detection and characterization of alternative splice variants. Here, the authors develop PsiCLASS, a transcript assembler leveraging simultaneous analysis of multiple RNA-seq samples.

Most cited references 27

STAR: ultrafast universal RNA-seq aligner.

Alexander Dobin, Carrie A. Davis, Felix Schlesinger … (2013)

Accurate alignment of high-throughput RNA-seq data is a challenging and yet unsolved problem because of the non-contiguous transcript structure, relatively short read lengths and constantly increasing throughput of the sequencing technologies. Currently available RNA-seq aligners suffer from high mapping error rates, low mapping speed, read length limitation and mapping biases. To align our large (>80 billion reads) ENCODE Transcriptome RNA-seq dataset, we developed the Spliced Transcripts Alignment to a Reference (STAR) software based on a previously undescribed RNA-seq alignment algorithm that uses sequential maximum mappable seed search in uncompressed suffix arrays followed by seed clustering and stitching procedure. STAR outperforms other aligners by a factor of >50 in mapping speed, aligning to the human genome 550 million 2 × 76 bp paired-end reads per hour on a modest 12-core server, while at the same time improving alignment sensitivity and precision. In addition to unbiased de novo detection of canonical junctions, STAR can discover non-canonical splices and chimeric (fusion) transcripts, and is also capable of mapping full-length RNA sequences. Using Roche 454 sequencing of reverse transcription polymerase chain reaction amplicons, we experimentally validated 1960 novel intergenic splice junctions with an 80-90% success rate, corroborating the high precision of the STAR mapping strategy. STAR is implemented as standalone C++ code. STAR is free open source software distributed under GPLv3 license and can be downloaded from http://code.google.com/p/rna-star/.
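The seed search described in this abstract can be illustrated on a toy scale. The sketch below builds a naive suffix array, finds the longest prefix of a read that occurs anywhere in the reference, and then repeats the search from the first unmapped base. It is a minimal teaching example, not STAR's implementation: the function names are hypothetical, the suffix array construction is quadratic, mismatches are not tolerated, and the clustering and stitching steps are omitted.

```python
def build_suffix_array(text):
    """Naive suffix array: start positions of all suffixes in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def _char_at(text, sa, k, depth):
    """Character at offset `depth` of the k-th sorted suffix ('' past the end)."""
    pos = sa[k] + depth
    return text[pos] if pos < len(text) else ""


def longest_mappable_prefix(read, text, sa):
    """Return (length, positions) of the longest exactly matching read prefix."""
    lo, hi = 0, len(sa)  # suffixes in sa[lo:hi] all start with read[:depth]
    depth = 0
    while depth < len(read):
        c = read[depth]
        a, b = lo, hi  # lower bound: first suffix whose next char is >= c
        while a < b:
            m = (a + b) // 2
            if _char_at(text, sa, m, depth) < c:
                a = m + 1
            else:
                b = m
        new_lo, b = a, hi  # upper bound: first suffix whose next char is > c
        while a < b:
            m = (a + b) // 2
            if _char_at(text, sa, m, depth) <= c:
                a = m + 1
            else:
                b = m
        if new_lo == a:
            break  # no suffix extends the match by c
        lo, hi = new_lo, a
        depth += 1
    positions = sorted(sa[k] for k in range(lo, hi)) if depth else []
    return depth, positions


def sequential_seeds(read, text, sa):
    """Split a read into consecutive maximal exact seeds."""
    seeds, start = [], 0
    while start < len(read):
        length, positions = longest_mappable_prefix(read[start:], text, sa)
        if length == 0:
            start += 1  # base absent from the reference: skip it (assumption)
            continue
        seeds.append((start, length, positions))
        start += length
    return seeds


reference = "ACGTACGGACGT"
suffix_array = build_suffix_array(reference)
seeds = sequential_seeds("ACGGTT", reference, suffix_array)
```

Each seed is a tuple of read offset, seed length, and the reference positions where it matches. A real aligner would next group seeds that lie close together on the reference and join them across gaps such as introns.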

HISAT: a fast spliced aligner with low memory requirements.

Daehwan Kim, Ben Langmead, Steven L Salzberg (2015)

HISAT (hierarchical indexing for spliced alignment of transcripts) is a highly efficient system for aligning reads from RNA sequencing experiments. HISAT uses an indexing scheme based on the Burrows-Wheeler transform and the Ferragina-Manzini (FM) index, employing two types of indexes for alignment: a whole-genome FM index to anchor each alignment and numerous local FM indexes for very rapid extensions of these alignments. HISAT's hierarchical index for the human genome contains 48,000 local FM indexes, each representing a genomic region of ∼64,000 bp. Tests on real and simulated data sets showed that HISAT is the fastest system currently available, with equal or better accuracy than any other method. Despite its large number of indexes, HISAT requires only 4.3 gigabytes of memory. HISAT supports genomes of any size, including those larger than 4 billion bases.
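To make the indexing scheme named in this abstract more concrete, the sketch below builds a Burrows-Wheeler transform and a single small FM index for a toy reference, then uses backward search to locate exact occurrences of a pattern. It illustrates only the basic FM index lookup. HISAT's hierarchy of one global and many local indexes, its compressed data structures, and its handling of spliced reads are not modelled, and all function names here are hypothetical.

```python
def bwt_from_text(text):
    """Return the BWT of text + '$' and the matching suffix array."""
    text += "$"  # sentinel that sorts before A, C, G, T
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT character is the one preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)
    return bwt, sa


def build_fm_index(bwt):
    """Build the two FM index tables from a BWT string.

    C[c]      = number of characters in bwt strictly smaller than c
    occ[c][i] = number of occurrences of c in bwt[:i]
    """
    alphabet = sorted(set(bwt))
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += bwt.count(c)
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return C, occ


def locate(pattern, C, occ, sa):
    """Backward search: sorted start positions of exact pattern matches."""
    lo, hi = 0, len(sa)  # current suffix array interval [lo, hi)
    for ch in reversed(pattern):
        if ch not in C:
            return []
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return []  # interval became empty: no occurrence
    return sorted(sa[k] for k in range(lo, hi))


reference = "ACGTACGGACGT"
bwt, sa = bwt_from_text(reference)
C, occ = build_fm_index(bwt)
hits = locate("ACG", C, occ, sa)
```

The search cost depends on the pattern length rather than the reference length, which is one reason FM indexes suit read alignment. This toy version stores the full occurrence table and suffix array, whereas practical indexes sample both to save memory.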

StringTie enables improved reconstruction of a transcriptome from RNA-seq reads.

Mihaela Pertea, Geo M Pertea, Corina M Antonescu … (2015)

Methods used to sequence the transcriptome often produce more than 200 million short sequences. We introduce StringTie, a computational method that applies a network flow algorithm originally developed in optimization theory, together with optional de novo assembly, to assemble these complex data sets into transcripts. When used to analyze both simulated and real data sets, StringTie produces more complete and accurate reconstructions of genes and better estimates of expression levels, compared with other leading transcript assembly programs including Cufflinks, IsoLasso, Scripture and Traph. For example, on 90 million reads from human blood, StringTie correctly assembled 10,990 transcripts, whereas the next best assembly was of 7,187 transcripts by Cufflinks, which is a 53% increase in transcripts assembled. On a simulated data set, StringTie correctly assembled 7,559 transcripts, which is 20% more than the 6,310 assembled by Cufflinks. As well as producing a more complete transcriptome assembly, StringTie runs faster on all data sets tested to date compared with other assembly software, including Cufflinks.

Author and article information

Contributors

Liliana Florea:

ORCID: http://orcid.org/0000-0001-8124-2324

florea@jhu.edu

Journal

Journal ID (nlm-ta): Nat Commun

Journal ID (iso-abbrev): Nat Commun

Title: Nature Communications

Publisher: Nature Publishing Group UK (London)

ISSN (Electronic): 2041-1723

Publication date (Electronic): 1 November 2019

Publication date PMC-release: 1 November 2019

Publication date Collection: 2019

Volume: 10

Electronic Location Identifier: 5000

Affiliations

[1] ISNI 0000 0001 2171 9311, GRID grid.21107.35, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, MD, USA

[2] ISNI 0000 0001 2171 9311, GRID grid.21107.35, Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA

[3] ISNI 0000 0001 2171 9311, GRID grid.21107.35, Department of Pediatrics, Johns Hopkins School of Medicine, Baltimore, MD, USA

[4] ISNI 0000 0001 2171 9311, GRID grid.21107.35, Department of Medicine, Johns Hopkins School of Medicine, Baltimore, MD, USA

[5] ISNI 0000 0001 2106 9910, GRID grid.65499.37, Present Address: Department of Data Sciences, Dana Farber Cancer Institute, Boston, MA, USA

Article

Publisher ID: 12990

DOI: 10.1038/s41467-019-12990-0

PMC ID: 6825223

PubMed ID: 31676772

SO-VID: 0a5e39e5-91ba-4e2f-bcc8-43b9e4bed42d

License:

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

History

Date received : 16 February 2019

Date accepted : 11 October 2019

Funding

Funded by: FundRef https://doi.org/10.13039/100000001, National Science Foundation (NSF);

Award ID: 1356078

Award ID: 1339134

Award Recipient : Liliana Florea

Funded by: FundRef https://doi.org/10.13039/100006955, U.S. Department of Health & Human Services | NIH | Office of Extramural Research, National Institutes of Health (OER);

Award ID: R01GM124531

Award Recipient : Liliana Florea

Funded by: FundRef https://doi.org/10.13039/100007123, Stanley Medical Research Institute (SMRI);

Award ID: n/a

Award Recipient : Sarven Sabunciyan

Custom metadata

ScienceOpen disciplines: Uncategorized

Keywords: genome informatics, software
