Improved transcriptome assembly using a hybrid of long and short reads with StringTie

Abstract

Short-read RNA sequencing and long-read RNA sequencing each have their strengths and weaknesses for transcriptome assembly. While short reads are highly accurate, they are rarely able to span multiple exons. Long-read technology can capture full-length transcripts, but its relatively high error rate often leads to mis-identified splice sites. Here we present a new release of StringTie that performs hybrid-read assembly. By taking advantage of the strengths of both long and short reads, hybrid-read assembly with StringTie is more accurate than long-read-only or short-read-only assembly, and on some datasets it can more than double the number of correctly assembled transcripts, while obtaining substantially higher precision than long-read assembly alone. We demonstrate the improved accuracy on simulated data and on real data from Arabidopsis thaliana, Mus musculus, and human. We also show that hybrid-read assembly is more accurate than correcting long reads prior to assembly, while also being substantially faster. StringTie is freely available as open-source software at https://github.com/gpertea/stringtie.
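
The abstract compares assemblies by the number of correctly assembled transcripts and by precision. The Python sketch below shows one way such numbers could be computed, under the assumption that a transcript counts as correct when its full intron chain exactly matches a reference transcript on the same chromosome and strand. The function name and the data layout are hypothetical and chosen for illustration. This is not the evaluation code used in the paper.

```python
from typing import Iterable, Tuple

Intron = Tuple[int, int]                          # (intron start, intron end)
IntronChain = Tuple[str, str, Tuple[Intron, ...]]  # (chromosome, strand, introns)


def assembly_accuracy(
    assembled: Iterable[IntronChain],
    reference: Iterable[IntronChain],
) -> Tuple[int, float, float]:
    """Return (correct, sensitivity, precision) for a set of assembled transcripts.

    Assumption: a transcript is "correct" when its whole intron chain is
    identical to that of some reference transcript.
    """
    asm = set(assembled)
    ref = set(reference)
    correct = len(asm & ref)
    # Sensitivity: fraction of reference transcripts that were recovered.
    sensitivity = correct / len(ref) if ref else 0.0
    # Precision: fraction of assembled transcripts that match the reference.
    precision = correct / len(asm) if asm else 0.0
    return correct, sensitivity, precision


if __name__ == "__main__":
    # Toy data, invented for this example only.
    reference = [
        ("chr1", "+", ((100, 200), (300, 400))),
        ("chr2", "-", ((10, 20),)),
    ]
    assembled = [
        ("chr1", "+", ((100, 200), (300, 400))),  # exact match
        ("chr1", "+", ((100, 200), (300, 450))),  # wrong splice site
    ]
    print(assembly_accuracy(assembled, reference))
```

Under this definition a single shifted splice site makes a transcript incorrect, which is why mis-identified splice sites in long reads lower precision.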

Author summary

Identifying the genes that are active in a cell is a critical step in studying cell development, disease, the response to infection, the effects of mutations, and much more. During the last decade, high-throughput RNA-sequencing data have proven essential in characterizing the set of genes expressed in different cell types and conditions, which has driven a strong need for highly efficient, scalable and accurate computational methods to process these data. As sequencing costs have dropped, ever-larger experiments have been designed, often capturing hundreds of millions or even billions of reads in a single study. These enormous data sets heighten that need, and they also present opportunities for discovery. Recently developed long-read technology now allows researchers to capture entire transcripts in a single long read, enabling more accurate reconstruction of the full exon-intron structure of genes, although these reads have higher error rates and higher costs. In this study we use the high accuracy of short reads to correct the alignments of long RNA reads, with the goal of improving the identification of novel gene isoforms, and ultimately our understanding of transcriptome complexity.
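
The idea of using accurate short reads to correct long-read alignments can be illustrated with a minimal Python sketch. It assumes that intron boundaries supported by short-read alignments are already available as sorted coordinate lists, and it moves each long-read intron boundary to the nearest supported boundary within a small window. The function names, the window size of 10 bases, and the snapping rule are all assumptions made for illustration. This is not StringTie's actual algorithm.

```python
from bisect import bisect_left
from typing import List, Sequence, Tuple

Intron = Tuple[int, int]  # (intron start, intron end)


def snap_to_supported(site: int, supported: Sequence[int], window: int = 10) -> int:
    """Return the nearest short-read-supported coordinate within `window`.

    `supported` must be sorted in ascending order. If no supported
    coordinate is close enough, the original site is returned unchanged.
    """
    i = bisect_left(supported, site)
    # Only the neighbours just below and just above `site` can be nearest.
    candidates = supported[max(i - 1, 0): i + 1]
    best = min(candidates, key=lambda s: abs(s - site), default=None)
    if best is not None and abs(best - site) <= window:
        return best
    return site


def correct_introns(
    long_read_introns: Sequence[Intron],
    supported_starts: Sequence[int],
    supported_ends: Sequence[int],
    window: int = 10,
) -> List[Intron]:
    """Snap both boundaries of every long-read intron to supported sites."""
    return [
        (
            snap_to_supported(start, supported_starts, window),
            snap_to_supported(end, supported_ends, window),
        )
        for start, end in long_read_introns
    ]


if __name__ == "__main__":
    # Toy coordinates, invented for this example only.
    starts = [100, 500]
    ends = [200, 700]
    noisy = [(103, 198), (520, 701)]
    print(correct_introns(noisy, starts, ends))
```

In this toy run the boundary at 520 stays where it is because the closest supported start (500) lies outside the window, so a long-read junction with no short-read support nearby is kept rather than forced onto a distant site.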

Author and article information

Contributors

Alaina Shumate:

ORCID: https://orcid.org/0000-0002-4450-1857

Roles: Formal analysis, Validation, Visualization, Writing – original draft, Writing – review & editing

Brandon Wong:

ORCID: https://orcid.org/0000-0002-7644-1916

Roles: Data curation, Formal analysis, Visualization

Geo Pertea:

ORCID: https://orcid.org/0000-0002-3210-7182

Roles: Data curation, Software, Validation, Writing – review & editing

Mihaela Pertea:

ORCID: https://orcid.org/0000-0003-0762-8637

Roles: Conceptualization, Formal analysis, Funding acquisition, Methodology, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

Jinyan Li:

Role: Editor

Journal

Journal ID (nlm-ta): PLoS Comput Biol

Journal ID (iso-abbrev): PLoS Comput Biol

Journal ID (publisher-id): plos

Title: PLoS Computational Biology

Publisher: Public Library of Science (San Francisco, CA, USA)

ISSN (Print): 1553-734X

ISSN (Electronic): 1553-7358

Publication date (Electronic): 1 June 2022

Publication date Collection: June 2022

Volume: 18

Issue: 6

Electronic Location Identifier: e1009730

Affiliations

[1] Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland, United States of America

[2] Center for Computational Biology, Johns Hopkins University, Baltimore, Maryland, United States of America

[3] Department of Computer Science, Johns Hopkins University, Baltimore, Maryland, United States of America

[4] Department of Applied Math and Statistics, Johns Hopkins University, Baltimore, Maryland, United States of America

[5] The Lieber Institute for Brain Development, Baltimore, Maryland, United States of America

University of Technology Sydney, AUSTRALIA

Author notes

The authors have declared that no competing interests exist.

* E-mail: mpertea@jhu.edu

Article

Publisher ID: PCOMPBIOL-D-21-02222

DOI: 10.1371/journal.pcbi.1009730

PMC ID: 9191730

PubMed ID: 35648784

SO-VID: b0e060d4-ae50-4e5b-8a9e-02d788563315

License:

This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

History

Date received: 8 December 2021

Date accepted: 11 May 2022

Page count

Figures: 5, Tables: 1, Pages: 18

Funding

Funded by: National Science Foundation (funder ID: http://dx.doi.org/10.13039/501100008982)

Award ID: DBI-1759518

Award recipient: Mihaela Pertea (ORCID: https://orcid.org/0000-0003-0762-8637)

This study was funded by the National Science Foundation (grant DBI-1759518) awarded to MP. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Custom metadata

PLOS Publication Stage: vor-update-to-uncorrected-proof

Publication Update: 2022-06-13

Data Availability: StringTie is freely available as open source software at https://github.com/gpertea/stringtie.
