      High-throughput annotation of full-length long noncoding RNAs with Capture Long-Read Sequencing

      research-article


          Abstract

          Accurate annotation of genes and their transcripts is a foundation of genomics, but no annotation technique presently combines throughput and accuracy. As a result, reference gene collections remain incomplete: many gene models are fragmentary, while thousands more remain uncatalogued, particularly for long noncoding RNAs (lncRNAs). To accelerate lncRNA annotation, the GENCODE consortium has developed RNA Capture Long Seq (CLS), combining targeted RNA capture with third-generation long-read sequencing. We present an experimental re-annotation of the GENCODE intergenic lncRNA population in matched human and mouse tissues, resulting in novel transcript models for 3574 and 561 gene loci, respectively. CLS approximately doubles the annotated complexity of targeted loci, outperforming existing short-read techniques. Full-length transcript models produced by CLS enable us to definitively characterize the genomic features of lncRNAs, including promoter and gene structure, and protein-coding potential. Thus, CLS removes a longstanding bottleneck of transcriptome annotation, generating manual-quality full-length transcript models at high-throughput scales.


          Most cited references (23)


          The transcriptional landscape of the mammalian genome.

          This study describes comprehensive polling of transcription start and termination sites and analysis of previously unidentified full-length complementary DNAs derived from the mouse genome. We identify the 5' and 3' boundaries of 181,047 transcripts with extensive variation in transcripts arising from alternative promoter usage, splicing, and polyadenylation. There are 16,247 new mouse protein-coding transcripts, including 5154 encoding previously unidentified proteins. Genomic mapping of the transcriptome reveals transcriptional forests, with overlapping transcription on both strands, separated by deserts in which few transcripts are observed. The data provide a comprehensive platform for the comparative analysis of mammalian transcriptional regulation in differentiation and development.
            Is Open Access

            SMART: recent updates, new developments and status in 2015

            SMART (Simple Modular Architecture Research Tool) is a web resource (http://smart.embl.de/) providing simple identification and extensive annotation of protein domains and the exploration of protein domain architectures. In the current version, SMART contains manually curated models for more than 1200 protein domains, with ∼200 new models since our last update article. The underlying protein databases were synchronized with UniProt, Ensembl and STRING, bringing the total number of annotated domains and other protein features above 100 million. SMART's ‘Genomic’ mode, which annotates proteins from completely sequenced genomes was greatly expanded and now includes 2031 species, compared to 1133 in the previous release. SMART analysis results pages have been completely redesigned and include links to several new information sources. A new, vector-based display engine has been developed for protein schematics in SMART, which can also be exported as high-resolution bitmap images for easy inclusion into other documents. Taxonomic tree displays in SMART have been significantly improved, and can be easily navigated using the integrated search engine.
              Is Open Access

              PhyloCSF: a comparative genomics method to distinguish protein coding and non-coding regions

              Motivation: As high-throughput transcriptome sequencing provides evidence for novel transcripts in many species, there is a renewed need for accurate methods to classify small genomic regions as protein coding or non-coding. We present PhyloCSF, a novel comparative genomics method that analyzes a multispecies nucleotide sequence alignment to determine whether it is likely to represent a conserved protein-coding region, based on a formal statistical comparison of phylogenetic codon models. Results: We show that PhyloCSF's classification performance in 12-species Drosophila genome alignments exceeds all other methods we compared in a previous study. We anticipate that this method will be widely applicable as the transcriptomes of many additional species, tissues and subcellular compartments are sequenced, particularly in the context of ENCODE and modENCODE, and as interest grows in long non-coding RNAs, often initially recognized by their lack of protein coding potential rather than conserved RNA secondary structures. Availability and Implementation: The Objective Caml source code and executables for GNU/Linux and Mac OS X are freely available at http://compbio.mit.edu/PhyloCSF Contact: mlin@mit.edu; manoli@mit.edu

                Author and article information

                Journal
                9216904
                2419
                Nature Genetics (Nat. Genet.)
                ISSN: 1061-4036 (print), 1546-1718 (electronic)
                30 October 2017
                06 November 2017
                December 2017
                06 May 2018
                Volume: 49
                Issue: 12
                Pages: 1731-1740
                Affiliations
                [1] Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Dr. Aiguader 88, 08003 Barcelona, Spain
                [2] Universitat Pompeu Fabra (UPF), Barcelona, Spain
                [3] R&D Department, Quantitative Genomic Medicine Laboratories (qGenomics), Barcelona, Spain
                [4] Functional Genomics Group, Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, New York 11724, USA
                [5] Wellcome Trust Sanger Institute, Hinxton, Cambridgeshire CB10 1HH, UK
                Author notes
                [*] Equal contribution
                [6] Present address: Centre of New Technologies, S. Banacha 2C, 02-097 Warsaw, Poland.
                [7] Present address: Illumina, Cambridge, UK.
                [8] Present address: Department of Clinical Research, University of Bern, Murtenstrasse 35, 3010 Bern, Switzerland.

                Article
                Manuscript ID: NIHMS912465
                DOI: 10.1038/ng.3988
                PMCID: PMC5709232
                PMID: 29106417
                e3550d5b-5cbd-4ecc-b3be-3b407552e47b

                Users may view, print, copy, and download text and data-mine the content in such documents, for the purposes of academic research, subject always to the full Conditions of use: http://www.nature.com/authors/editorial_policies/license.html#terms

                History
                Categories
                Article

                Genetics
                Keywords: long noncoding rna, lncrna, lincrna, rna sequencing, transcriptomics, gencode, annotation, captureseq, third generation sequencing, long read sequencing, pacbio, kantr
