Rare Splice Variants in Long Non-Coding RNAs

There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

Abstract

Long non-coding RNAs (lncRNAs) form a substantial component of the transcriptome and are involved in a wide variety of regulatory mechanisms. Compared to protein-coding genes, they are often expressed at low levels and are restricted to a narrow range of cell types or developmental stages. As a consequence, the diversity of their isoforms is still far from being recorded and catalogued in its entirety, and the debate is ongoing about what fraction of non-coding RNAs truly conveys biological function rather than being “junk”. Here, using a collection of more than 100 transcriptomes from related B cell lymphoma, we show that lncRNA loci produce a very defined set of splice variants. While some of them are so rare that they become recognizable only in the superposition of dozens or hundreds of transcriptome datasets and not infrequently include introns or exons that have not been included in available genome annotation data, there is still a very limited number of processing products for any given locus. The combined depth of our sequencing data is large enough to effectively exhaust the isoform diversity: the overwhelming majority of splice junctions that are observed at all are represented by multiple junction-spanning reads. We conclude that the human transcriptome produces virtually no background of RNAs that are processed at effectively random positions, but is—under normal circumstances—confined to a well defined set of splice variants.

Related collections

Most cited references 21

Record: found
Abstract: found
Article: not found

Ensembl 2012

Paul Flicek, M. Amode, Daniel Barrell … (2011)

The Ensembl project (http://www.ensembl.org) provides genome resources for chordate genomes with a particular focus on human genome data as well as data for key model organisms such as mouse, rat and zebrafish. Five additional species were added in the last year including gibbon (Nomascus leucogenys) and Tasmanian devil (Sarcophilus harrisii) bringing the total number of supported species to 61 as of Ensembl release 64 (September 2011). Of these, 55 species appear on the main Ensembl website and six species are provided on the Ensembl preview site (Pre!Ensembl; http://pre.ensembl.org) with preliminary support. The past year has also seen improvements across the project.

0 comments Cited 465 times – based on 0 reviews      Review now

Bookmark

Record: found
Abstract: found
Article: found

Is Open Access

Fast Mapping of Short Sequences with Mismatches, Insertions and Deletions Using Index Structures

Steve Hoffmann, Christian Otto, Stefan Kurtz … (2009)

Introduction Since the 454 pyrosequencing technology [3] has been introduced to the market, the need for algorithms that efficiently map huge amounts of reads to reference genomes has rapidly increased. Later, high throughput sequencing (HTS) methods such as Illumina [4] and SOLiD (Applied Biosystems) have intensified the demand. The development of read mapping methods decisively depends on specifications and error models of the respective technologies. Unfortunately, little is known about specific error models, and models are likely to change as manufactures are constantly modifying chemistry and machinery. Increasing the read length is a key aim of all vendors — tolerating a trade-off with read accuracy. In a recent investigation on error models of 454 and Illumina technologies, it has been shown that 454 reads are more likely to include insertions and deletions while Illumina reads typically contain mismatches [5],[6]. Currently available read mapping programs are specifically designed to allow for mismatches when aligning the reads to the reference genome. Most of the programs, e.g. MAQ [7], SOAP [8], SHRiMP [9] or ELAND (proprietary), use seeding techniques that gain their speed from pre-computed hash look-up tables. Some of these programs, in particular SOAP and MAQ, are specifically designed to map short Illumina or SOLiD reads. Longer sequences cannot be mapped by these tools. The matching models of MAQ, ZOOM [10], SOAP, SHRiMP, Bowtie [11], and ELAND focus on mismatches and largely neglect insertions and deletions. Indels are only considered during subsequent alignment steps but not while searching for seeds. With indels accounting for more than two thirds of all 454 sequencing errors, this is a major shortcoming for these kinds of reads [5]. Only PatMaN [12] and BWA [13] are able to handle a limited number of indels. Mapping is aggravated by the manufacturers' overestimation of their read accuracies. While an overall error rate of 0.5% has been observed for 454, the error rate increases drastically for reads shorter than 80 bp and longer than 100 bp [5], leading to considerably larger error frequencies in real-life datasets. This implies that, sequencing projects aiming to find short transcripts such as miRNAs lose a substantial fraction of their data, unless a matching strategy is used that takes indels into account. In Illumina reads, error rates of up to 4% have been observed [6]. This differs significantly from Illumina's specification. Compared to 454, the frequency of indels is significantly lower. Moreover, differences between reads and reference genome might also occur due to genomic variations such as SNPs. We present a matching method that uses enhanced suffix arrays to compute exact and inexact seeds. Sufficiently good seeds subsequently trigger a full dynamic programming alignment. Our method is insensitive to errors and contaminations at the ends of a read including 3′ and 5′ primers and tags. The results section describes the basic ideas and an evaluation of our segemehl software implementing our method. The technical details of the matching model are described in the Methods section at the end of this contribution. Results Outline of the Algorithmic Approach A read aligner should deliver the original position of the read in the reference genome. Such a position will be called the true position in the following. Optimally scoring local alignments of the read and the reference genome can be used to obtain a possible true position, but because an alignment of the read with the reference genome at the true position does not always have an optimal score according to the chosen scoring scheme, this method does not always work. Nevertheless, there are no better approaches available unless further information about the read is at hand. We present a new read mapping approach that aims at finding optimally scoring local alignments of a read and the reference genome. It is based on computing inexact seeds of variable length and allows to handle insertions, deletions (indels; gaps), and mismatches. Throughout the document the notion of differences refers to mismatches, insertions and deletions in some local alignment of the read and the reference genome, irrespective of whether they arise from technical artifacts or sequence variation. A single difference is either a single mismatch, a single character insertion or a single character deletion. Although not limited to a specific scoring scheme, we have implemented our seed search model in the program segemehl assigning a score of 1 to each match and a score of −1 to each mismatch, insertion or deletion. Our matching strategy derives from a simple and commonly used idea. Assume an optimally scoring local alignment of a read with the reference genome with exactly two differences. If the positions of the differences in the alignment are sufficiently far apart, we can efficiently locate exact seeds which in turn may deliver the position of the optimal local alignment in the reference genome. Likewise, if the distance between the two differences is small, two continuous exact matches at the ends of the read possibly allow to map the read to this position. To exploit this observation, the presented method employs a heuristic based on searches starting at all positions of the read. That is, for each suffix of the read the longest prefix match, i.e. the longest exact match beginning at the first position of the suffix with all substrings of the reference genome is computed. If the longest prefix match is long enough that it only occurs in a few positions of the reference genome, it may be feasible to check all these positions to verify if the longest prefix match is part of a sufficiently good alignment. While this approach works already well for many cases, we need to increase the sensitivity for cases where the computation of the longest prefix match fails to deliver a match at the position of the optimally scoring local alignment. This is the case when a longer prefix match can be obtained at another position of the reference genome by exactly matching characters that would result in a mismatch, insertion or deletion in the optimal local alignment (cf. Fig. 1). Therefore, during the computation of each longest prefix match we check a limited number of differences by enumerating at certain positions all possible mismatches and indels (cf. Fig. 2). 10.1371/journal.pcbi.1000502.g001 Figure 1 Longest prefix matches may fail to deliver the position of the optimally scoring local alignment. Assume a simple scoring scheme that assigns a score of +1 to a single character match and a score of 0 to a single character mismatch, a single insertions or deletion. Using longest prefix matches bears the risk of ignoring differences in the best, i.e. optimally scoring, local alignment. Its retrieval fails if a longer match can be obtained at another position of the reference sequence by matching a character, that is inserted, deleted, or mismatched in the best local alignment. Depending on the length of the reference genome and its nucleotide composition the probability is determined by the length of the substring that can be matched to the position of the best local alignment before the first difference occurs. (A) The optimally scoring alignment of the read P: = cttcttcggc begins at position 3 of the reference genome S: = atacttcttcggcaga. Let Pi denote the ith suffix of the read P. For each Pi , the starting positions of the longest match in S comprise the position of Pi in the best local alignment (solid green lines). That is, the longest match of P 0 begins at position 3, the longest match of P 1 begins at position 4, the longest match of P 2 begins at position 5 and so forth. (B) For the read P: = cttcgtcggc, the retrieval of the best local alignment fails for all Pi , i j, S[i‥j] denotes the empty string. occS (w) denotes the set of occurrences of some string in S, i.e. the set of positions i, 0≤i≤|S|−|w| satisfying w = S[i‥i+|w|−1]. A substring of S beginning at the first position of S is a prefix of S and a substring ending at the last position of S is a suffix of S. To prevent that suffixes have a second occurrence in S, we add a sentinel character $ (not occurring in S) to the end of S. For each i, 0≤i≤n, Si = S[i‥n−1]$ denotes the i-th non-empty suffix of S$, i.e. the suffix beginning at position i in S$. We identify a suffix of S$ by its start position. That is, by suffix i we mean Si . The concept of suffix arrays is based on lexicographically sorting the suffixes of S$. Suppose that the characters are ordered such that A 0. First note that ℓ i −1 ≤ℓ i +1. Moreover, for each q, 1≤q≤ℓ i −1 we have where = {x+y | x∈M} denotes the elementwise addition for any set M. That is, any suffix in can be found in with offset one. To allow differences in our matching heuristic, we introduce the concept of matching branches which branch off from sets of the matching stem. We describe the branching in terms of a transformation of some suffix interval . Let i, 0≤i≤m−1 be arbitrary but fixed. Let q be such that i+q−1

0 comments Cited 243 times – based on 0 reviews      Review now

Bookmark

Record: found
Abstract: found
Article: found

Is Open Access

Evolutionary dynamics and tissue specificity of human long noncoding RNAs in six mammals

Stefan Washietl, Manolis Kellis, Manuel Garber (2014)

Long intergenic noncoding RNAs (lincRNAs) play diverse regulatory roles in human development and disease, but little is known about their evolutionary history and constraint. Here, we characterize human lincRNA expression patterns in nine tissues across six mammalian species and multiple individuals. Of the 1898 human lincRNAs expressed in these tissues, we find orthologous transcripts for 80% in chimpanzee, 63% in rhesus, 39% in cow, 38% in mouse, and 35% in rat. Mammalian-expressed lincRNAs show remarkably strong conservation of tissue specificity, suggesting that it is selectively maintained. In contrast, abundant splice-site turnover suggests that exact splice sites are not critical. Relative to evolutionarily young lincRNAs, mammalian-expressed lincRNAs show higher primary sequence conservation in their promoters and exons, increased proximity to protein-coding genes enriched for tissue-specific functions, fewer repeat elements, and more frequent single-exon transcripts. Remarkably, we find that ∼20% of human lincRNAs are not expressed beyond chimpanzee and are undetectable even in rhesus. These hominid-specific lincRNAs are more tissue specific, enriched for testis, and faster evolving within the human lineage.

0 comments Cited 178 times – based on 0 reviews      Review now

Bookmark

All references

Author and article information

Contributors

George A. Calin: Role: Academic Editor

Journal

Journal ID (nlm-ta): Noncoding RNA

Journal ID (iso-abbrev): Noncoding RNA

Journal ID (publisher-id): ncrna

Title: Non-Coding RNA

Publisher: MDPI

ISSN (Electronic): 2311-553X

Publication date (Electronic): 05 July 2017

Publication date Collection: September 2017

Volume: 3

Issue: 3

Electronic Location Identifier: 23

Affiliations

[1 ]Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, University Leipzig, Härtelstrasse 16-18, D-04107 Leipzig, Germany; rituparno@ 123456bioinf.uni-leipzig.de

[2 ]ecSeq Bioinformatics, Brandvorwerkstraße 43, D-04275 Leipzig, Germany; gero.doose@ 123456ecseq.com

[3 ]German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Competence Center for Scalable Data Services and Solutions, Leipzig Research Center for Civilization Diseases, and Leipzig Research Center for Civilization Diseases (LIFE), University Leipzig, Härtelstrasse 16-18, D-04107 Leipzig, Germany

[4 ]Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, D-04103 Leipzig, Germany

[5 ]Fraunhofer Institute for Cell Therapy and Immunology, Perlickstrasse 1, D-04103 Leipzig, Germany

[6 ]Center for RNA in Technology and Health, University Copenhagen, Grønnegårdsvej 3, 1870 Frederiksberg C, Denmark

[7 ]Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, NM 87501, USA

Author notes

[* ]Correspondence: studla@ 123456bioinf.uni-leipzig.de ; Tel.: +49-341-971-6690

Author information

Rituparno Sen https://orcid.org/0000-0001-5980-5565

Peter F. Stadler https://orcid.org/0000-0002-5016-5191

Article

Publisher ID: ncrna-03-00023

DOI: 10.3390/ncrna3030023

PMC ID: 5831916

SO-VID: c4892aab-2a79-43aa-8022-9f1a51986edf

License:

Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license ( http://creativecommons.org/licenses/by/4.0/).

History

Date received : 18 May 2017

Date accepted : 29 June 2017

Comments

Comment on this article

scite_

Cited by 5

See all cited by

Most referenced authors 2,210

See all reference authors

- Version 1

Rare Splice Variants in Long Non-Coding RNAs

Read this article at

Abstract

Related collections

RNA drug delivery

Most cited references 21

Ensembl 2012

Fast Mapping of Short Sequences with Mismatches, Insertions and Deletions Using Index Structures

Evolutionary dynamics and tissue specificity of human long noncoding RNAs in six mammals

Author and article information

Contributors

Journal

Affiliations

Author notes

Author information

Article

History

Categories

Comments

Comment on this article

Similar content 191

Cited by 5

Most referenced authors 2,210