      TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions

      research-article

          Abstract

          TopHat is a popular spliced aligner for RNA-sequence (RNA-seq) experiments. In this paper, we describe TopHat2, which incorporates many significant enhancements to TopHat. TopHat2 can align reads of various lengths produced by the latest sequencing technologies, while allowing for variable-length indels with respect to the reference genome. In addition to de novo spliced alignment, TopHat2 can align reads across fusion breaks, which can occur after genomic translocations. TopHat2 combines the ability to identify novel splice sites with direct mapping to known transcripts, producing sensitive and accurate alignments, even for highly repetitive genomes or in the presence of pseudogenes. TopHat2 is available at http://ccb.jhu.edu/software/tophat.

          Most cited references (5)

          Improving RNA-Seq expression estimates by correcting for fragment bias

          The biochemistry of RNA-Seq library preparation results in cDNA fragments that are not uniformly distributed within the transcripts they represent. This non-uniformity must be accounted for when estimating expression levels, and we show how to perform the needed corrections using a likelihood based approach. We find improvements in expression estimates as measured by correlation with independently performed qRT-PCR and show that correction of bias leads to improved replicability of results across libraries and sequencing technologies.

            The GENCODE pseudogene resource

            Background: Pseudogenes have long been considered as nonfunctional genomic sequences. However, recent evidence suggests that many of them might have some form of biological activity, and the possibility of functionality has increased interest in their accurate annotation and integration with functional genomics data. Results: As part of the GENCODE annotation of the human genome, we present the first genome-wide pseudogene assignment for protein-coding genes, based on both large-scale manual annotation and in silico pipelines. A key aspect of this coupled approach is that it allows us to identify pseudogenes in an unbiased fashion as well as untangle complex events through manual evaluation. We integrate the pseudogene annotations with the extensive ENCODE functional genomics information. In particular, we determine the expression level, transcription-factor and RNA polymerase II binding, and chromatin marks associated with each pseudogene. Based on their distribution, we develop simple statistical models for each type of activity, which we validate with large-scale RT-PCR-Seq experiments. Finally, we compare our pseudogenes with conservation and variation data from primate alignments and the 1000 Genomes project, producing lists of pseudogenes potentially under selection. Conclusions: At one extreme, some pseudogenes possess conventional characteristics of functionality; these may represent genes that have recently died. On the other hand, we find interesting patterns of partial activity, which may suggest that dead genes are being resurrected as functioning non-coding RNAs. The activity data of each pseudogene are stored in an associated resource, psiDR, which will be useful for the initial identification of potentially functional pseudogenes.

              Millions of years of evolution preserved: a comprehensive catalog of the processed pseudogenes in the human genome.

              Processed pseudogenes were created by reverse-transcription of mRNAs; they provide snapshots of ancient genes existing millions of years ago in the genome. To find them in the present-day human, we developed a pipeline using features such as intron-absence, frame-disruption, polyadenylation, and truncation. This has enabled us to identify in recent genome drafts approximately 8000 processed pseudogenes (distributed from http://pseudogene.org). Overall, processed pseudogenes are very similar to their closest corresponding human gene, being 94% complete in coding regions, with sequence similarity of 75% for amino acids and 86% for nucleotides. Their chromosomal distribution appears random and dispersed, with the numbers on chromosomes proportional to length, suggesting sustained "bombardment" over evolution. However, it does vary with GC-content: Processed pseudogenes occur mostly in intermediate GC-content regions. This is similar to Alus but contrasts with functional genes and L1-repeats. Pseudogenes, moreover, have age profiles similar to Alus. The number of pseudogenes associated with a given gene follows a power-law relationship, with a few genes giving rise to many pseudogenes and most giving rise to few. The prevalence of processed pseudogenes agrees well with germ-line gene expression. Highly expressed ribosomal proteins account for approximately 20% of the total. Other notables include cyclophilin-A, keratin, GAPDH, and cytochrome c.

                Author and article information

                Contributors
                Journal
                Genome Biology (Genome Biol)
                BioMed Central
                ISSN: 1465-6906, 1465-6914
                25 April 2013
                Volume 14, Issue 4, R36
                Affiliations
                [1] Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, 20742, USA
                [2] Department of Computer Science, University of Maryland, College Park, MD, 20742, USA
                [3] Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, 733 N. Broadway, Baltimore, MD, 21205, USA
                [4] Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, 615 N. Wolfe Street, Baltimore, MD, 21205, USA
                [5] Broad Institute of MIT and Harvard, 7 Cambridge Center, Cambridge, MA, 02142, USA
                [6] Department of Stem Cell and Regenerative Biology, Harvard University, 7 Divinity Ave., Cambridge, MA, 02142, USA
                [7] Department of Electrical Engineering and Computer Science, University of California, 101 Sproul Hall, Berkeley, CA, 94720, USA
                [8] Illumina Inc., 5200 Illumina Way, San Diego, CA, 92122, USA
                Article
                Publisher ID: gb-2013-14-4-r36
                DOI: 10.1186/gb-2013-14-4-r36
                PMCID: PMC4053844
                PMID: 23618408
                30d798fc-cc6e-4086-8dc2-97bb87e2edbb
                Copyright © 2013 Kim et al.; licensee BioMed Central Ltd.

                This is an open access article distributed under the terms of the Creative Commons Attribution License

                History
                : 15 November 2012
                : 5 April 2013
                : 25 April 2013
                Categories
                Method

                Genetics