      Improved transcriptome assembly using a hybrid of long and short reads with StringTie


          Abstract

          Short-read RNA sequencing and long-read RNA sequencing each have their strengths and weaknesses for transcriptome assembly. While short reads are highly accurate, they are rarely able to span multiple exons. Long-read technology can capture full-length transcripts, but its relatively high error rate often leads to mis-identified splice sites. Here we present a new release of StringTie that performs hybrid-read assembly. By taking advantage of the strengths of both long and short reads, hybrid-read assembly with StringTie is more accurate than long-read only or short-read only assembly, and on some datasets it can more than double the number of correctly assembled transcripts, while obtaining substantially higher precision than the long-read data assembly alone. Here we demonstrate the improved accuracy on simulated data and real data from Arabidopsis thaliana, Mus musculus, and human. We also show that hybrid-read assembly is more accurate than correcting long reads prior to assembly while also being substantially faster. StringTie is freely available as open source software at https://github.com/gpertea/stringtie.
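The abstract compares assemblies by the number of correctly assembled transcripts and by precision. The sketch below shows one way such figures could be computed. It assumes (this is not stated in the abstract) that a transcript counts as correct when its full intron chain exactly matches a reference transcript; the function name `assembly_accuracy` and the toy coordinates are hypothetical and are not results from the paper.

```python
from typing import Iterable, Tuple

# An intron chain is an ordered tuple of (start, end) intron coordinates.
IntronChain = Tuple[Tuple[int, int], ...]


def assembly_accuracy(
    assembled: Iterable[IntronChain],
    reference: Iterable[IntronChain],
) -> Tuple[int, float, float]:
    """Return (correct, precision, sensitivity) for an assembly.

    Assumption: "correct" means an exact intron-chain match to the reference.
    """
    assembled_set = set(assembled)
    reference_set = set(reference)
    correct = len(assembled_set & reference_set)
    # Precision: fraction of assembled transcripts that are correct.
    precision = correct / len(assembled_set) if assembled_set else 0.0
    # Sensitivity: fraction of reference transcripts that were recovered.
    sensitivity = correct / len(reference_set) if reference_set else 0.0
    return correct, precision, sensitivity


# Toy data only: four reference transcripts and two hypothetical assemblies.
ref = [
    ((100, 200), (300, 400)),
    ((100, 200), (300, 450)),
    ((500, 600),),
    ((700, 800), (900, 1000)),
]
long_only = [ref[0], ((100, 203), (300, 400)), ((500, 598),), ((700, 801), (900, 1000))]
hybrid = [ref[0], ref[1], ref[2], ((700, 801), (900, 1000))]

print(assembly_accuracy(long_only, ref))
print(assembly_accuracy(hybrid, ref))
```

In this toy case the long-read-only assembly has several splice sites that are off by a few bases, so only one chain matches the reference, while the hybrid assembly recovers three. The real evaluation in the paper may use different matching criteria.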

          Author summary

          Identifying the genes that are active in a cell is a critical step in studying cell development, disease, the response to infection, the effects of mutations, and much more. During the last decade, high-throughput RNA-sequencing data have proven essential in characterizing the set of genes expressed in different cell types and conditions, which has driven a strong need for highly efficient, scalable and accurate computational methods to process these data. As sequencing costs have dropped, ever-larger experiments have been designed, often capturing hundreds of millions or even billions of reads in a single study. These enormous data sets require highly efficient and accurate computational methods for analysis, and they also present opportunities for discovery. Recently developed long-read technology now allows researchers to capture entire transcripts in a single long read, enabling more accurate reconstruction of the full exon-intron structure of genes, although these reads have higher error rates and higher costs. In this study we use the high accuracy of short reads to correct the alignments of long RNA reads, with the goal of improving the identification of novel gene isoforms, and ultimately our understanding of transcriptome complexity.
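The idea of using accurate short reads to correct long-read alignments can be illustrated with a small sketch. It is not StringTie's actual algorithm: it simply snaps each long-read splice site to the nearest splice site supported by short reads when one lies within a fixed window. The function name `snap_splice_sites`, the `window` default of 10 bases, and the coordinates are all assumptions made for illustration.

```python
from bisect import bisect_left
from typing import List, Sequence


def snap_splice_sites(
    long_read_sites: Sequence[int],
    trusted_sites: Sequence[int],
    window: int = 10,
) -> List[int]:
    """Move each long-read splice site to the nearest trusted site within `window` bases.

    `trusted_sites` stands in for splice sites supported by short-read alignments.
    Sites with no trusted neighbour inside the window are left unchanged.
    """
    trusted = sorted(set(trusted_sites))
    corrected = []
    for pos in long_read_sites:
        i = bisect_left(trusted, pos)
        # Only the trusted sites immediately left and right of pos can be nearest.
        candidates = trusted[max(i - 1, 0): i + 1]
        best = min(candidates, key=lambda t: abs(t - pos), default=None)
        if best is not None and abs(best - pos) <= window:
            corrected.append(best)
        else:
            corrected.append(pos)
    return corrected


# Toy example: two noisy sites are corrected, one unsupported site is kept as-is.
short_read_supported = [100, 200, 350]
noisy_long_read = [103, 198, 260, 350]
print(snap_splice_sites(noisy_long_read, short_read_supported))
```

Leaving unsupported sites unchanged is a deliberate choice in this sketch, since a long read may contain a genuine splice site that no short read happened to cover.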


                Author and article information

                Contributors
                Roles: Formal analysis; Validation; Visualization; Writing – original draft; Writing – review & editing
                Roles: Data curation; Formal analysis; Visualization
                Roles: Data curation; Software; Validation; Writing – review & editing
                Roles: Conceptualization; Formal analysis; Funding acquisition; Methodology; Software; Supervision; Validation; Visualization; Writing – original draft; Writing – review & editing
                Role: Editor
                Journal
                Title: PLoS Computational Biology (PLoS Comput Biol)
                Publisher: Public Library of Science (San Francisco, CA, USA)
                ISSN: 1553-734X, 1553-7358
                Published: 1 June 2022
                Volume: 18
                Issue: 6
                Article number: e1009730
                Affiliations
                [1 ] Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland, United States of America
                [2 ] Center for Computational Biology, Johns Hopkins University, Baltimore, Maryland, United States of America
                [3 ] Department of Computer Science, Johns Hopkins University, Baltimore, Maryland, United States of America
                [4 ] Department of Applied Math and Statistics, Johns Hopkins University, Baltimore, Maryland, United States of America
                [5 ] The Lieber Institute for Brain Development, Baltimore, Maryland, United States of America
                University of Technology Sydney, AUSTRALIA
                Author notes

                The authors have declared that no competing interests exist.

                Author information
                https://orcid.org/0000-0002-4450-1857
                https://orcid.org/0000-0002-7644-1916
                https://orcid.org/0000-0002-3210-7182
                https://orcid.org/0000-0003-0762-8637
                Article
                Manuscript number: PCOMPBIOL-D-21-02222
                DOI: 10.1371/journal.pcbi.1009730
                PMCID: PMC9191730
                PMID: 35648784
                Record ID: b0e060d4-ae50-4e5b-8a9e-02d788563315
                © 2022 Shumate et al

                This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

                History
                Received: 8 December 2021
                Accepted: 11 May 2022
                Page count
                Figures: 5, Tables: 1, Pages: 18
                Funding
                Funded by: National Science Foundation (funder ID: http://dx.doi.org/10.13039/501100008982)
                Award ID: DBI-1759518
                This study was funded by the National Science Foundation (grant DBI-1759518) awarded to MP. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
                Categories
                Research Article
                Biology and Life Sciences > Computational Biology > Genome Analysis > Transcriptome Analysis
                Biology and Life Sciences > Genetics > Genomics > Genome Analysis > Transcriptome Analysis
                Research and Analysis Methods > Animal Studies > Experimental Organism Systems > Model Organisms > Arabidopsis Thaliana
                Research and Analysis Methods > Model Organisms > Arabidopsis Thaliana
                Biology and Life Sciences > Organisms > Eukaryota > Plants > Brassica > Arabidopsis Thaliana
                Research and Analysis Methods > Animal Studies > Experimental Organism Systems > Plant and Algal Models > Arabidopsis Thaliana
                Research and Analysis Methods > Simulation and Modeling
                Biology and Life Sciences > Computational Biology > Genome Analysis > Genome Annotation
                Biology and Life Sciences > Genetics > Genomics > Genome Analysis > Genome Annotation
                Earth Sciences > Mineralogy > Minerals > Talc
                Biology and Life Sciences > Molecular Biology > Molecular Biology Techniques > Sequencing Techniques > RNA Sequencing
                Research and Analysis Methods > Molecular Biology Techniques > Sequencing Techniques > RNA Sequencing
                Research and Analysis Methods > Database and Informatics Methods > Bioinformatics > Sequence Analysis > Sequence Alignment
                Biology and Life Sciences > Molecular Biology > Molecular Biology Techniques > Sequencing Techniques > Nucleotide Sequencing
                Research and Analysis Methods > Molecular Biology Techniques > Sequencing Techniques > Nucleotide Sequencing

                Quantitative & Systems biology