      A multi-sample approach increases the accuracy of transcript assembly

      research-article

          Abstract

          Transcript assembly from RNA-seq reads is a critical step in gene expression and subsequent functional analyses. Here we present PsiCLASS, an accurate and efficient transcript assembler based on an approach that simultaneously analyzes multiple RNA-seq samples. PsiCLASS combines mixture statistical models for exonic feature selection across multiple samples with splice-graph-based dynamic programming algorithms and a weighted voting scheme for transcript selection. PsiCLASS achieves a significantly better sensitivity-precision tradeoff, with precision up to 2-3 fold higher than that of the StringTie system and of Scallop plus TACO, the two best current approaches. PsiCLASS is efficient and scalable, assembling 667 GEUVADIS samples in 9 h, and has robust accuracy with large numbers of samples.

          Abstract

          Transcript assembly is an important step in analysis of RNA-seq data whose accuracy influences downstream quantification, detection and characterization of alternative splice variants. Here, the authors develop PsiCLASS, a transcript assembler leveraging simultaneous analysis of multiple RNA-seq samples.

          Most cited references (27)

          STAR: ultrafast universal RNA-seq aligner.

          Accurate alignment of high-throughput RNA-seq data is a challenging and yet unsolved problem because of the non-contiguous transcript structure, relatively short read lengths and constantly increasing throughput of the sequencing technologies. Currently available RNA-seq aligners suffer from high mapping error rates, low mapping speed, read length limitation and mapping biases. To align our large (>80 billion reads) ENCODE Transcriptome RNA-seq dataset, we developed the Spliced Transcripts Alignment to a Reference (STAR) software based on a previously undescribed RNA-seq alignment algorithm that uses sequential maximum mappable seed search in uncompressed suffix arrays followed by a seed clustering and stitching procedure. STAR outperforms other aligners by a factor of >50 in mapping speed, aligning to the human genome 550 million 2 × 76 bp paired-end reads per hour on a modest 12-core server, while at the same time improving alignment sensitivity and precision. In addition to unbiased de novo detection of canonical junctions, STAR can discover non-canonical splices and chimeric (fusion) transcripts, and is also capable of mapping full-length RNA sequences. Using Roche 454 sequencing of reverse transcription polymerase chain reaction amplicons, we experimentally validated 1960 novel intergenic splice junctions with an 80-90% success rate, corroborating the high precision of the STAR mapping strategy. STAR is implemented as standalone C++ code. STAR is free open source software distributed under the GPLv3 license and can be downloaded from http://code.google.com/p/rna-star/.

            HISAT: a fast spliced aligner with low memory requirements.

            HISAT (hierarchical indexing for spliced alignment of transcripts) is a highly efficient system for aligning reads from RNA sequencing experiments. HISAT uses an indexing scheme based on the Burrows-Wheeler transform and the Ferragina-Manzini (FM) index, employing two types of indexes for alignment: a whole-genome FM index to anchor each alignment and numerous local FM indexes for very rapid extensions of these alignments. HISAT's hierarchical index for the human genome contains 48,000 local FM indexes, each representing a genomic region of ∼64,000 bp. Tests on real and simulated data sets showed that HISAT is the fastest system currently available, with equal or better accuracy than any other method. Despite its large number of indexes, HISAT requires only 4.3 gigabytes of memory. HISAT supports genomes of any size, including those larger than 4 billion bases.

              StringTie enables improved reconstruction of a transcriptome from RNA-seq reads.

              Methods used to sequence the transcriptome often produce more than 200 million short sequences. We introduce StringTie, a computational method that applies a network flow algorithm originally developed in optimization theory, together with optional de novo assembly, to assemble these complex data sets into transcripts. When used to analyze both simulated and real data sets, StringTie produces more complete and accurate reconstructions of genes and better estimates of expression levels, compared with other leading transcript assembly programs including Cufflinks, IsoLasso, Scripture and Traph. For example, on 90 million reads from human blood, StringTie correctly assembled 10,990 transcripts, whereas the next best assembly was of 7,187 transcripts by Cufflinks, which is a 53% increase in transcripts assembled. On a simulated data set, StringTie correctly assembled 7,559 transcripts, which is 20% more than the 6,310 assembled by Cufflinks. As well as producing a more complete transcriptome assembly, StringTie runs faster on all data sets tested to date compared with other assembly software, including Cufflinks.

                Author and article information

                Contributors
                florea@jhu.edu
                Journal
                Nature Communications (Nat Commun)
                Nature Publishing Group UK (London)
                ISSN: 2041-1723
                Published: 1 November 2019
                Volume: 10
                Article number: 5000
                Affiliations
                [1] ISNI 0000 0001 2171 9311, GRID grid.21107.35, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, MD, USA
                [2] ISNI 0000 0001 2171 9311, GRID grid.21107.35, Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
                [3] ISNI 0000 0001 2171 9311, GRID grid.21107.35, Department of Pediatrics, Johns Hopkins School of Medicine, Baltimore, MD, USA
                [4] ISNI 0000 0001 2171 9311, GRID grid.21107.35, Department of Medicine, Johns Hopkins School of Medicine, Baltimore, MD, USA
                [5] ISNI 0000 0001 2106 9910, GRID grid.65499.37, Present Address: Department of Data Sciences, Dana Farber Cancer Institute, Boston, MA, USA
                Author information
                http://orcid.org/0000-0001-8124-2324
                Article
                12990
                DOI: 10.1038/s41467-019-12990-0
                PMCID: PMC6825223
                PMID: 31676772
                0a5e39e5-91ba-4e2f-bcc8-43b9e4bed42d
                © The Author(s) 2019

                Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

                History
                Received: 16 February 2019
                Accepted: 11 October 2019
                Funding
                Funded by: National Science Foundation (NSF), FundRef https://doi.org/10.13039/100000001
                Award ID: 1356078
                Award ID: 1339134
                Funded by: U.S. Department of Health & Human Services | NIH | Office of Extramural Research, National Institutes of Health (OER), FundRef https://doi.org/10.13039/100006955
                Award ID: R01GM124531
                Funded by: Stanley Medical Research Institute (SMRI), FundRef https://doi.org/10.13039/100007123
                Award ID: n/a
                Categories
                Article
                Uncategorized
                genome informatics, software
