
      SOAPsplice: Genome-Wide ab initio Detection of Splice Junctions from RNA-Seq Data



          RNA-Seq, a method that uses next-generation sequencing technologies to sequence the transcriptome, facilitates genome-wide analysis of splice junction sites. In this paper, we introduce SOAPsplice, a robust tool that detects splice junctions from RNA-Seq data without using any information about known splice junctions. SOAPsplice uses a novel two-step approach: it first identifies as many reasonable splice junction candidates as possible and then removes false positives with two effective filtering strategies. On both simulated and real datasets, SOAPsplice detects many reliable splice junctions with a low false-positive rate. The improvement of SOAPsplice over existing tools becomes more pronounced when the sequencing depth is low. SOAPsplice is freely available at
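The two-step idea in the abstract (collect candidates generously, then filter) can be illustrated with a minimal Python sketch. The abstract does not say which two filters SOAPsplice applies, so the two used here (a minimum number of supporting reads and a minimum anchor length on both sides of the junction) are generic stand-ins, and every name, threshold, and data value below is hypothetical rather than taken from the tool.

```python
from collections import defaultdict
from typing import NamedTuple


class SplicedHit(NamedTuple):
    """One hypothetical spliced read alignment (illustrative fields only)."""
    chrom: str
    donor: int         # last exonic position before the intron
    acceptor: int      # first exonic position after the intron
    left_anchor: int   # read bases aligned upstream of the junction
    right_anchor: int  # read bases aligned downstream of the junction


def collect_candidates(hits):
    """Step 1: keep every junction implied by any spliced alignment."""
    candidates = defaultdict(list)
    for hit in hits:
        candidates[(hit.chrom, hit.donor, hit.acceptor)].append(hit)
    return candidates


def filter_candidates(candidates, min_reads=2, min_anchor=8):
    """Step 2: drop weakly supported candidates.

    Assumed filters (NOT the ones described in the paper):
      * at least `min_reads` supporting reads, and
      * at least one read whose shorter anchor is >= `min_anchor` bases.
    """
    kept = []
    for junction, support in candidates.items():
        best_anchor = max(min(h.left_anchor, h.right_anchor) for h in support)
        if len(support) >= min_reads and best_anchor >= min_anchor:
            kept.append(junction)
    return sorted(kept)


# Toy alignments, invented for illustration.
toy_hits = [
    SplicedHit("chr1", 100, 500, 30, 20),
    SplicedHit("chr1", 100, 500, 10, 40),
    SplicedHit("chr1", 100, 900, 47, 3),
    SplicedHit("chr2", 50, 300, 25, 25),
]

candidates = collect_candidates(toy_hits)
reliable = filter_candidates(candidates)
print(len(candidates), "candidates;", len(reliable), "kept:", reliable)
```

Keeping the two steps as separate functions makes the trade-off explicit: the first step favors sensitivity, and the thresholds in the second step control how many false positives survive.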


          Most cited references 14


          Function of alternative splicing.

          Alternative splicing is one of the most important mechanisms to generate a large number of mRNA and protein isoforms from the surprisingly low number of human genes. Unlike promoter activity, which primarily regulates the amount of transcripts, alternative splicing changes the structure of transcripts and their encoded proteins. Together with nonsense-mediated decay (NMD), at least 25% of all alternative exons are predicted to regulate transcript abundance. Molecular analyses during the last decade demonstrate that alternative splicing determines the binding properties, intracellular localization, enzymatic activity, protein stability and posttranslational modifications of a large number of proteins. The magnitude of the effects ranges from a complete loss of function or acquisition of a new function to very subtle modulations, which are observed in the majority of cases reported. Alternative splicing factors regulate multiple pre-mRNAs and recent identification of physiological targets shows that a specific splicing factor regulates pre-mRNAs with coherent biological functions. Therefore, evidence is now accumulating that alternative splicing coordinates physiologically meaningful changes in protein isoform expression and is a key mechanism to generate the complex proteome of multicellular organisms.

            Whole-genome sequencing and variant discovery in C. elegans.

            Massively parallel sequencing instruments enable rapid and inexpensive DNA sequence data production. Because these instruments are new, their data require characterization with respect to accuracy and utility. To address this, we sequenced a Caenorhabditis elegans N2 Bristol strain isolate using the Solexa Sequence Analyzer, and compared the reads to the reference genome to characterize the data and to evaluate coverage and representation. Massively parallel sequencing facilitates strain-to-reference comparison for genome-wide sequence variant discovery. Owing to the short-read-length sequences produced, we developed a revised approach to determine the regions of the genome to which short reads could be uniquely mapped. We then aligned Solexa reads from C. elegans strain CB4858 to the reference, and screened for single-nucleotide polymorphisms (SNPs) and small indels. This study demonstrates the utility of massively parallel short read sequencing for whole genome resequencing and for accurate discovery of genome-wide polymorphisms.

              Analysis of canonical and non-canonical splice sites in mammalian genomes.

              A set of 43 337 splice junction pairs was extracted from mammalian GenBank annotated genes. Expressed sequence tag (EST) sequences support 22 489 of them. Of these, 98.71% contain canonical dinucleotides GT and AG for donor and acceptor sites, respectively; 0.56% hold non-canonical GC-AG splice site pairs; and the remaining 0.73% occur in many small groups (with a maximum size of 0.05%). Studying these groups we observe that many of them contain splicing dinucleotides shifted from the annotated splice junction by one position. After close examination of such cases we present a new classification consisting of only eight observed types of splice site pairs (out of 256 a priori possible combinations). EST alignments allow us to verify the exonic part of the splice sites, but many non-canonical cases may be due to intron sequencing errors. This idea is given substantial support when we compare the sequences of human genes having non-canonical splice sites deposited in GenBank by high throughput genome sequencing projects (HTG). A high proportion (156 out of 171) of the human non-canonical and EST-supported splice site sequences had a clear match in the human HTG. They can be classified after corrections as: 79 GC-AG pairs (of which one was an error that corrected to GC-AG), 61 errors that were corrected to GT-AG canonical pairs, six AT-AC pairs (of which two were errors that corrected to AT-AC), one case produced from a non-existent intron, seven cases found in HTG that were deposited to GenBank and, finally, only two remaining cases of supported non-canonical splice sites. If we assume that approximately the same situation is true for the whole set of annotated mammalian non-canonical splice sites, then 99.24% of splice site pairs should be GT-AG, 0.69% GC-AG, 0.05% AT-AC and finally only 0.02% could consist of other types of non-canonical splice sites.
              We analyze several characteristics of EST-verified splice sites and build weight matrices for the major groups, which can be incorporated into gene prediction programs. We also present a set of EST-verified canonical splice sites larger by two orders of magnitude than the current one (22 199 entries versus approximately 600) and finally, a set of 290 EST-supported non-canonical splice sites. Both sets should be significant for future investigations of the splicing mechanism.
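The donor/acceptor classes named in this abstract (GT-AG, GC-AG, AT-AC, and everything else) can be illustrated with a short Python sketch. It labels an intron by its first two and last two bases and counts the 256 a priori possible dinucleotide pairs mentioned above. The function name and the toy intron sequences are hypothetical and are not data from the cited study; the percentages are copied from the abstract only to check that they sum to 100%.

```python
from collections import Counter
from itertools import product

# Splice-site pair classes named in the abstract above.
NAMED_PAIRS = {
    ("GT", "AG"): "GT-AG",  # canonical
    ("GC", "AG"): "GC-AG",
    ("AT", "AC"): "AT-AC",
}


def classify_intron(intron: str) -> str:
    """Label an intron by its donor (first two) and acceptor (last two) bases."""
    seq = intron.upper()
    if len(seq) < 4:
        raise ValueError("intron must be at least 4 bases long")
    return NAMED_PAIRS.get((seq[:2], seq[-2:]), "other")


# 16 donor dinucleotides x 16 acceptor dinucleotides = 256 possible pairs.
dinucleotides = ["".join(p) for p in product("ACGT", repeat=2)]
n_possible_pairs = len(dinucleotides) ** 2

# Estimated shares quoted in the abstract (percent of splice site pairs).
estimated_share = {"GT-AG": 99.24, "GC-AG": 0.69, "AT-AC": 0.05, "other": 0.02}
total_share = round(sum(estimated_share.values()), 2)

# Toy introns, invented for illustration.
toy_introns = ["GTAAGTCTTTCAG", "GCAAGTTTTGCAG", "ATATCCTTTTAAC", "GGAAGTTTTTCAG"]
counts = Counter(classify_intron(s) for s in toy_introns)

print(n_possible_pairs, total_share, dict(counts))
```

A lookup table keyed on the (donor, acceptor) pair keeps the classification easy to extend if further classes from the eight observed types need to be distinguished.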

                Author and article information

                Journal: Frontiers in Genetics (Front Genet; Front. Gene.)
                Publisher: Frontiers Research Foundation
                Published: 07 July 2011
                Volume: 2
                1 Bioinformatics Center, Beijing Genomics Institute at Shenzhen, Shenzhen, China
                2 Department of Computer Science, The University of Hong Kong, Hong Kong, China
                Author notes

                Edited by: Paul T. Spellman, Oregon Health and Sciences University, USA

                Reviewed by: Bertrand Tan, Chang Gung University, Taiwan; Xiyin Wang, Hebei United University, China; Obi Lee Griffith, Lawrence Berkeley National Laboratory, USA

                *Correspondence: Zhiyu Peng, Beijing Genomics Institute at Shenzhen, Shenzhen 518083, China. e-mail: pengbgi@; Siu-Ming Yiu, Department of Computer Science, The University of Hong Kong, Hong Kong, China. e-mail: smyiu@

                Songbo Huang and Jinbo Zhang have contributed equally to this work.

                This article was submitted to Frontiers in Genomic Assay Technology, a specialty of Frontiers in Genetics.

                Copyright © 2011 Huang, Zhang, Li, Zhang, He, Lam, Peng and Yiu.

                This is an open-access article subject to a non-exclusive license between the authors and Frontiers Media SA, which permits use, distribution and reproduction in other forums, provided the original authors and source are credited and other Frontiers conditions are complied with.

                Page count
                Figures: 5, Tables: 6, Equations: 0, References: 28, Pages: 12, Words: 6461
                Methods Article


                Keywords: spliced alignment, RNA-Seq, splice junction

