Transposable elements in human genetic disease

There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

Related collections

Most cited references 104

Record: found
Abstract: found
Article: not found

Reverse transcription of R2Bm RNA is primed by a nick at the chromosomal target site: a mechanism for non-LTR retrotransposition.

D. D. Luan, M Korman, J L Jakubczak … (1993)

R2 is a non-LTR retrotransposable element that inserts at a specific site in the 28S rRNA genes of most insects. We have expressed the open reading frame of the R2 element from Bombyx mori, R2Bm, in E. coli and shown that it encodes both sequence-specific endonuclease and reverse transcriptase activities. The R2 protein makes a specific nick in one of the DNA strands at the insertion site and uses the 3' hydroxyl group exposed by this nick to prime reverse transcription of its RNA transcript. After reverse transcription, cleavage of the second DNA strand occurs. A similar mechanism of insertion may be used by other non-LTR retrotransposable elements as well as short interspersed nucleotide elements.

0 comments Cited 331 times – based on 0 reviews      Review now

Bookmark

Record: found
Abstract: found
Article: not found

Transposable Elements Are Major Contributors to the Origin, Diversification, and Regulation of Vertebrate Long Noncoding RNAs

Aurélie Kapusta, Zev Kronenberg, Vincent J Lynch … (2013)

Introduction There is a growing appreciation that the functional repertoire of metazoan genomes includes much more than protein-coding sequences [1]–[3]. Recent functional genomic studies have revealed, in particular, the widespread occurrence, bewildering diversity, and functional significance of noncoding RNA [4]. In addition to small regulatory RNAs, such as tRNAs or microRNAs, the genome encodes a myriad of long noncoding RNAs (lncRNAs) that are greater than 200 nt in length [for review: 5]–[7]. The most recent, though still conservative, catalogues predict between 5,000 and 10,000 discrete lncRNA loci in the human genome [8]–[10]. The majority of lncRNAs in these manually curated reference sets are intergenic units often referred to as large intergenic noncoding RNAs (lincRNAs) because they do not overlap with known protein-coding genes. Comparable numbers of lncRNA loci are expected to occur in the mouse and other vertebrate genomes [9], [11]–[16] and hundreds of loci with similar properties have also been identified in model invertebrates such as Drosophila melanogaster [17] and Caenorhabidtis elegans [18], as well as in the model plant Arabidopsis thaliana [19]. Although once dismissed as transcriptional ‘noise’, there is mounting evidence that many lncRNAs are important functional molecules engaged in diverse regulatory activities. First, the majority of functionally characterized lncRNAs exhibit precise spatiotemporal patterns of expression and, often, discrete cellular localization [9], [11]–[13], [20]–[25]. Second, the structure, biogenesis and processing of lncRNAs are very similar to that of protein-coding genes and indicate that most lncRNAs are produced from independent transcription units. For example, lncRNAs are typically transcribed by RNA polymerase II, under the control of diverse combinations of transcription factors that actively bind to promoters and enhancers, with canonical chromatin modifications [10]–[12], [26], [27]. LncRNA transcripts are also alternatively spliced, polyadenylated, and subject to other post-transcriptional modifications [10], [12], [28]. Third, lncRNA exons generally display a clear signal of purifying selection, implying structural and/or functional sequence constraint, albeit less stringent than on protein-coding exons [10], [12], [29]–[33]. Moreover, some lncRNA genes are evolutionarily ancient. A small but increasing number of loci orthologous to human lncRNAs have been identified in the mouse, and the origins of some human lncRNAs can be traced to the common ancestor of mammals, amniotes, or even vertebrates [9], [10], [14], [16], [34], [35]. Finally, a growing body of genetic and biochemical work on individual lncRNAs, as well as more systematic approaches to explore lncRNA function and their association with disease, point to crucial regulatory activities, notably in cell differentiation and embryonic development [for review: 7], [23], [36]–[44]. While the precise molecular functions of lncRNAs are still poorly understood, even less is known about their origin and evolution. Four non-mutually exclusive hypotheses have been proposed for the emergence of lncRNAs [6], [14]: (i) transformation of a protein-coding genes; (ii) duplication of another lncRNA; (iii) de novo origin from sequences previously untranscribed or devoid of exonic sequences; (iv) emergence from transposable element (TE) sequences. Individual examples illustrating each of these mechanisms have been described. For example, Xist, a lncRNA controlling mammalian X inactivation, originated in the eutherian ancestor from a mixture of exons derived from a decayed protein-coding gene [45] together with a variety of transposable elements (TEs) progressively accumulated and ‘exonized’ at this locus [46]. However, with the exception of a few emblematic and intensively studied lncRNAs such as Xist, the origins of most lncRNAs remain elusive. In one of the most systematic efforts to trace the origins of lncRNAs, Ulitsky et al. [14] found that a minority (∼15%) of zebrafish lncRNAs showed significant sequence similarity to another lncRNAs or protein-coding genes in the zebrafish genome. Likewise, Derrien et al. [10] reported that human lncRNAs rarely have extensive sequence similarity to each other outside of shared repetitive elements. Collectively these observations suggest that, in contrast to protein-coding genes, novel lncRNA genes do not commonly arise by duplication, but rather may emerge de novo from previously non-exonic sequences and/or from TEs. TEs occupy a substantial fraction of vertebrate genomes (e.g. at least half of the human genome [47], [48]) and are increasingly recognized as important players in the origin of functional novelties [for review: 49]–[52]. Several instances of TEs co-opted for cellular function on a genome-wide scale have been documented, notably as a source of cis-elements regulating adjacent host genes, such as promoters [53], [54], transcription factor binding sites [55]–[57], enhancers [58], [59] or insulators [60], [61]. TEs can also be ‘exonized’ into novel coding and non-coding exons [for review: 49], [62], [63]. As a source of non-coding exons, TEs have been shown to contribute substantially to untranslated regions [64]–[67] and to alternatively spliced exons of protein-coding genes [66]–[70], as well as to microRNA genes [71], [72]. In this study we provide evidence for the widespread involvement of TEs in the assembly, diversification, regulation, and potential function of lncRNAs. Results Datasets We focus on three vertebrate species -human, mouse and zebrafish- for which extensive lncRNA datasets are available (Table 1). Each set has been ‘manually’ curated based on a combination of bioinformatics and high-throughput genomics experiments, such as deep sequencing of polyadenylated RNAs (RNA-seq), chromatin state maps and cap-analysis of gene expression (CAGE) or paired-end ditags to determine transcript termini. For human, we primarily analyzed the most recent Gencode catalog of lncRNAs (v13) produced from 15 cell lines as part of the ENCODE project [10], [73], [74]. We replicated most analyses on another large set of lncRNAs assembled by Cabili et al. [9] from 24 human tissues and cell types. Importantly, the Gencode and “Cabili” sets differ slightly in the way they were curated and they are only partially overlapping [10]. Indeed we found that 64.9% of the Gencode v13 genes have no overlap with genes in the Cabili set, and conversely 47.3% of the Cabili genes have no overlap with the Gencode v13 set. While the Cabili set only contains “intergenic” (l i ncRNA) units (no overlap with known protein-coding genes), the Gencode catalog includes also “genic” lncRNAs, i.e. those overlapping or nested within protein-coding genes [10], Figure S1. Thus, these two sets may be viewed as complementary rather than redundant, acting as “biological replicates” for our study. For mouse, we primarily studied lincRNAs from Ensembl (release 70) and replicated some analyses on lincRNAs from adult liver tissue compiled by Kutter et al [16]. For zebrafish, we merged the sets of developmentally expressed lncRNAs from Pauli et al. [24] and lincRNAs from Ulitsky et al. [14] (see Methods for more details). 10.1371/journal.pgen.1003470.t001 Table 1 Number of genes and transcripts in studied datasets. Datasets Genes# Transcripts Human, Gencode v7 [10] 9,277 14,880 Human, Gencode v13 12,393 19,835 Human, lincRNAs, Cabili et al. (2011) [9] 8,263 14,353 Mouse, lincRNAs, Ensembl release 70 1,671 2,167 Mouse, lincRNAs, Kutter et al. (2012) [16] 293 293 Zebrafish [14], [24] 1,402 1,780 # For zebrafish gene annotation, see Methods. Other numbers are from the datasets themselves. A substantial fraction of vertebrate lncRNAs contain exonized TE sequences We inferred the TE content of lncRNAs by calculating the fraction of lncRNA transcripts with exons overlapping at least 10-bp of DNA annotated as TE by RepeatMasker (see Methods). We found that 75% of human (Gencode v13) lncRNA transcripts contain an exon of at least partial TE origin, which is considerably much higher than any other type of RNAs such as small ncRNAs (tRNAs, sno/miRNAs), pseudogenes, coding exons (less than 1%), as well as UTRs, the non-coding parts of mRNAs (Figure 1A). The median length of TE-derived fragments in human lncRNAs is 112 nucleotides and the average is 150 nucleotides. While the majority of human lncRNA transcripts are comprised of a relatively small percentage of TE-derived sequences, 3,789 out of 19,835 transcripts examined (∼19%) are composed of ≥50% of TE-derived sequences (Figure 1B). Similarly, 68.23% and 66.5% of mouse and zebrafish lncRNA transcripts, respectively, contain exonic sequences of at least partial TE origin (Figure 1A). 10.1371/journal.pgen.1003470.g001 Figure 1 TE occurrence in lncRNAs. See text, Methods and Table 1 for more details about lncRNA datasets. A. Percentage of transcripts with at least one exon overlapping with a TEs fragment (at least 10 bp). In red, lncRNAs (human = Gencode v13; mouse = both sets). Rest corresponds to human Refseq 57: in green, small non-coding RNAs (tRNAs and sno/miRNAs); in blue, protein-coding genes (pc genes) separated in exon types (coding and non-coding = UTRs); in black, pseudo = pseudogenes. B. Distribution of percentage of human lncRNA transcripts (Gencode v13) derived from TEs (more than 0% to more than 95%). The number of transcripts with more than 80% and more than 50% TE-derived DNA exons are indicated. Distribution is also shown for the subset of 36 studied lncRNAs presented in Table 2 and Table S2. To measure the total coverage of TE-derived sequences in lncRNA exons in each species, we intersected TE annotations from RepeatMasker (with a minimum overlap of 10 bp, see Methods) with the genomic coordinates of all lncRNA exons, and for comparison, with UTRs and coding exons of RefSeq protein-coding genes. The results show that, in all three species TE coverage is considerably higher for lncRNA exons than for protein-coding exons, but still lower than in the whole genome (Figure 2). The fraction of lncRNA exon sequence covered by TEs is also at least twice higher than in their UTRs. 10.1371/journal.pgen.1003470.g002 Figure 2 Coverage of different TE classes in genome, lncRNA, and protein-coding exons in human, mouse, and zebrafish. For genomes, total length (100%) corresponds to total length of assembly without gaps (human: 2,897 Mb. Mouse: 2,620 Mb. Zebrafish: 1,401 Mb). For lncRNAs, total length of genomic projection of all of exons are considered (human, Genc. = Gencode v13: 14.2 Mb. Human, Cabili set: 8.5 Mb. Mouse, Ens70 = Ensembl 70: 2.8 Mb. Mouse, Kutter: 0.15 Mb. Zebrafish: 2.3 Mb). For protein coding genes (pc genes), total length of CDS exons, 5′ and 3′UTR respectively are as follow: human, 30.9 Mb, 5.2 Mb, 24.6 Mb. Mouse: 30.5 Mb, 4.0 Mb, 21.6 Mb. Zebrafish: 19.1 Mb, 33.6 Mb, 12.5 Mb. Only pc genes from Refseq annotations with CDS and UTR features are considered (see Methods). Percentage of coverage of all TEs is indicated above bars. We noticed that the Cabili set [9], which consists exclusively of intergenic units (lincRNAs) shows greater TE coverage (35.1%; Figure 2) than the Gencode v13 set (28.9%; Figure 2), suggesting that intergenic lncRNAs may have a higher TE content than ‘genic’ lncRNAs (i.e. those overlapping protein-coding genes). Consistent with this idea, the TE coverage of intergenic lncRNAs in the Gencode v13 set is 31.8%, while genic lncRNAs are comprised of 25.9% of TE-derived sequences (Table S1). Thus, human lincRNA transcripts tend to be richer in TE sequences than genic lncRNAs. We wondered whether this trend could merely reflect a higher TE density in intergenic regions in general. It does not appear to be the case because the TE coverage of introns and surrounding sequences of Gencode intergenic lncRNAs is similar to that of protein-coding genes or genic lncRNAs (Table S1). These observations suggest that TEs are more prevalent in intergenic lncRNAs than in genic lncRNAs. A survey of individual human lncRNAs previously characterized in the literature recapitulates the omnipresence and high prevalence of TE sequences we detect in the ab initio lncRNA catalogs (Figure 1B, Table 2 and for an expanded version, Table S2). The presence of exonized TEs has been reported for some of these lncRNAs, such as XIST [46], UCA1 [75], HULC [76], PCAT-14 [77] and SLC7A2-IT1A/B [78]. But for the majority, there has been no previous mention of embedded TEs, even though some of these mature lncRNA transcripts are almost entirely composed of TE sequences. For example, the first three exons (out of four, i.e. ∼86% of the sequence) of the mature transcript of BANCR, which is involved in melanoma cell migration [79], are derived from a MER41 long terminal repeat/endogenous retrovirus (LTR/ERV) element (Figure 3A). The mature transcript of lincRNA-RoR, which has been shown to modulate the reprogramming of human induced pluripotent stem cells [41], is made from an assemblage of 6 different TEs together accounting for 2,057 nt (79.7%) of its length (Figure 3B) [see also ref. 7]. Importantly, the structures of BANCR and lincRNA-RoR transcripts have been validated by a combination of RACE and RT-PCR experiments and their function investigated by siRNA knockdowns and rescue experiments [41], [79]. These transcripts were independently retrieved and their structure accurately predicted in the Cabili and Gencode v13 sets, respectively. In mouse, Fendrr lincRNA, which has a very restricted pattern of expression in lateral mesoderm [80], initiates within a MTEa (ERVL-MaLR) and 4 different TEs account for 808 nt (33.7%) of its length (data not shown). In summary, our analyses point to an extensive contribution of TEs to the content of mature lncRNA transcripts, including many of those with established regulatory functions. 10.1371/journal.pgen.1003470.g003 Figure 3 Examples of lncRNAs with embedded TEs. Genomic DNA is represented as a grey line, transcripts are represented by a black line, with arrows showing sense of transcription and in grey boxes the exons of the mature transcript. TEs as colored boxes (orange-red: DNA TEs. Yellow: SINEs. Pink-purple: LTR/ERVs. Green: LINEs). Only TEs overlapping with lncRNA exons are represented. See also Table S2 for details of TEs in these lncRNAs. A. BANCR [79]. B. lnc-RoR [41]. Apes = gibbon, gorilla, orangutan, bonobo, chimpanzee, human. C. lnc-ES3 [43]. 10.1371/journal.pgen.1003470.t002 Table 2 TE content of known lncRNAs in human. Gene/ID# Range of % of TE based DNA of mature transcripts when applies TSS in TE: class (number of transcripts) polyA in TE: class (number of transcripts) number of transcripts: with TE/total PCAT14 99.6–99.96 LTR (2) LTR (2) 2/2 BANCR 86.4 LTR (1) - 1/1 Lnc-RoR 79.7 LTR (1) LINE (1) 1/1 PTCSC3 60.6 - LTR (1) 1/1 BACE1-AS 56.4 SINE (1) - 1/1 UCA1 51.8 LTR (1) - 1/1 HULC 45.7 LTR (1) - 1/1 LINC00458 (LncRNA-ES3) 45.3–57 LTR (3) - 3/3 ncRNA-7 (LINC00651) 41.1 lincRNA-p21 40.9 - SINE (1) 1/1 NEAT1 38.1 - - 1/2 PCAT1 37.1–81.7 - DNA (1) 2/2 AK023948 29.7 - - 1/1 KCNQ1OT1 29.5 - - 1/1 Tie-1AS 24.4–96.3 LINE (2) - 3/3 PTENP1 21 - LINE (1) 1/1 linc-CCDC90A-1 (LncRNA-ES1) 13.8 - LTR (1) 1/1 OIP5-AS1 (Cyrano) 12.5–64.7 - LTR (1); SINE (3); LINE (1) 10/10 SLC7A2-IT1 11.6–18.6 - LINE (2) 2/2 HOTAIRM1 9 - - 1/5 BIRC6 (megamind) 8.9 LINE (1) - 1/1 HAR1 7.2 - DNA (1) 1/1 BDNFOS (BDNF-AS) 6.4–71.1 - - 9/11 MEG3 5.6–47.1 SINE (2); LINE (1) SINE (6) 23/28 GAS5 5.3–40.9 - SINE (3) 23/29 Xist 5.3–17.4 - - 2/8 MALAT1/NEAT2 4.1–7.7 - - 2/3 ANRIL (CDKN2B-AS1, p15AS1, Mycn) 4–41.1 SINE (1) SINE (7); DNA (1) 16/17 TUG1 1.9–40.6 - LTR (1) 6/7 # Known lncRNAs with no detectable exonized TE are not shown, but include: HOTAIR, Zeb2AS1, TERC, PANDA, H19, LncRNA-ES2 and UNCA-RC. See Table S2 for an expanded version of this table and references. TE sequences in lncRNAs evolve under modest yet greater functional constraint than their non–TE sequences We next sought to evaluate the functional potential of TEs embedded in lncRNA transcripts. Several studies have reported that lncRNA exons show a signature of evolutionary constraint based on interspecies conservation [10], [12], [29], [30] as well as reduced nucleotide diversity in the human population [31]–[33] compared to randomly sampled regions of the genome or surrounding non-exonic sequences. Nonetheless, the level of constraint acting on lncRNA exons assessed through these analyses was much weaker than on protein-coding exons, presumably reflecting greater malleability of lncRNAs. To compare the level of selective constraint acting on TE-derived sequences to non-TE derived sequences in human lncRNAs (Gencode v13) and to various other types of genomic regions, we aggregated conservation scores per nucleotide calculated by phyloP across an alignment of 10 primate genomes (see Methods). As expected, we found that both TE and non-TE sequences in lncRNA exons were much less conserved than coding exons or UTRs of protein-coding transcripts (Figure 4A). Strikingly though, we found that TE sequences within lncRNA exons were significantly more conserved than either a size-matched random set of genomic regions or a neutral set of TE sequences residing in lncRNA introns (permutation test, p 0.5.). TEs promote the lineage-specific diversification of lncRNAs Because transposition represents a major source of lineage-specific DNA, we wanted to evaluate its contribution to the evolution of the vertebrate lncRNA repertoire. Our examination of TE-derived sequences in studied human lncRNAs reveals that many of these elements are restricted to primates (36.3% for Gencode v13, Figure S4), suggesting that TEs play an important role in the diversification and possibly the birth of primate-specific lncRNAs. Few of the human lncRNAs functionally characterized have identifiable orthologs in non-primate species, but Xist and cyrano provide solid examples of functional lncRNAs with ancient evolutionary origins. Xist is involved in X-chromosome inactivation and originated in the common ancestor of eutherian mammals [45], [46]. Previous in silico reconstruction of the Xist locus in the eutherian ancestor suggested that several TEs were already present at the dawn of the Xist gene and likely contributed to the assembly of the first functional Xist transcript [46]. Other TEs embedded in Xist exons are lineage-specific and therefore must have contributed to the diversification of the transcript during eutherian evolution. For example, a primate-specific FLAM_C element makes up nearly half (114 nt) of the first Xist exon in human (Table S2). cyrano is one of a small subset of zebrafish lncRNAs sharing significant sequence similarity and synteny with apparent orthologs in mouse and human [14]. Most of the sequence similarity between species is limited to a central region of the last exon (see PhastCons plot in Figure 9). In zebrafish embryos, cyrano is expressed in the nervous system and notochord and morpholino-mediated knockdowns followed by rescue experiments indicate that this lncRNA plays a role in neurodevelopment, a function possibly conserved in mammals [14]. We find that the conserved exonic region of cyrano is flanked by lineage-specific TEs embedded in this orthologous exon in each of the three species examined (Table 2, Figure 9). These examples illustrate how TEs can be incorporated long after the birth of lncRNAs to diversify their sequence in a lineage-specific fashion. 10.1371/journal.pgen.1003470.g009 Figure 9 Lineage-specific TE insertions in cyrano. Symbols and graphics are as in Figure 3. The structure of cyrano (lnc-oip5) [14] is based on coordinates of Gencode v13 transcript OIP5-AS1-001. Vertebrate PhastCons: peaks of sequence conservation across 46 vertebrate genomes displayed in the UCSC genome browser. Among functionally characterized human lncRNAs, we uncovered numerous instances where the TSS resides in primate-specific TEs (Table S2). In most of those cases, the TE provides the only identified TSS for that lncRNA locus, suggesting a pivotal role for these TEs in the biogenesis and most likely the birth of these lncRNAs during primate evolution. These include six of the eight known lncRNAs containing the largest TE amounts listed in Table 2, which all have their TSS located within the LTR of an ERV element. Intriguingly, these instances include two different lncRNAs that are highly expressed in human embryonic stem cells (ESCs) and have been experimentally shown to be implicated in the maintenance of ESC pluripotency: lncRNA-RoR [41] and lncRNA-ES3 [43]. The transcripts cloned for lncRNA-RoR and lncRNA-ES3 both initiate within LTR7/HERVH elements (Figure 3B and 3C). Furthermore, we found that these same LTR7 elements have donated the DNA binding sites for the ‘master’ transcriptional regulators of pluripotency NANOG, OCT4, and SOX2 mapped previously to the proximal promoter of lncRNA-RoR [41] (Figure 3B). Ng et al. [39] mapped two binding sites for NANOG in the promoter region of lncRNA-ES3 that we find to reside within the LTR7 driving this locus (Figure 3C). The contribution of LTR7 to the regulation of these lncRNAs in ESCs is consistent with two recent studies showing that TEs, and LTR/ERV elements in particular, play an extensive role in the primate-specific wiring of the core transcriptional network of human ESCs [57], [84]. In fact, Kunarso et al. [57] identified LTR7/HERVH as one of the most over-represented TE families seeding OCT4 and NANOG binding sites throughout the human genome. Our results indicate that this ERV family also contributed to the recruitment of primate-specific lncRNAs into the pluripotency network of human ESCs [see also ref. 7]. Some TEs confer lncRNAs the potential to form secondary structures Since lncRNAs act at the RNA level, we hypothesized that TEs may participate in the folding of lncRNAs into secondary structures, which could be important for their function. One prediction of this hypothesis is that lncRNAs with high TE content may fold into more stable structure than those with low TE content. To test this, we selected from the Gencode v13 set the top 100 lncRNAs with highest TE content and the top 100 lncRNAs with lowest TE content (see Methods) and compared the minimum free energy (MFE) of predicted secondary structures computed by the program randfold [85] for each of these individual lncRNAs. For each input sequence, randfold attributes a p-value to a predicted MFE by comparing it with a MFE obtained for the same sequence randomly reshuffled 99 times (See ref. [85] and Methods). The average p-value for high TE content lncRNAs was significantly lower than the one of low TE content lncRNAs (p = 0.0022, Wilcox rank sum test) (Figure 10A). The average length of the lncRNAs in the two datasets was also substantially different (913 nt and 1,913 nt for high and low TE content respectively), but there was no correlation between RNA length and p-value for the 200 lncRNAs examined (data not shown), ruling out a possible bias introduced by lncRNA length. Together these results indicate that TEs generally stabilize lncRNA structure in human, which supports the hypothesis that some of the TEs embedded in lncRNA exons contribute to the folding of lncRNAs into secondary structures. 10.1371/journal.pgen.1003470.g010 Figure 10 TE contribution to predicted lncRNA secondary structures. A. High and low TE content groups of 100 lncRNA were extracted from Gencode v13 set (TE content from 96.74% to 100% and from 0.49% to 2.27% respectively; see Table S7). P-values were calculated by Randfold and provide an indication of predicted secondary structure stability. The boxplot depicts the maximum, upper quantile, median, lower quantile and minimum value in a standard way. The mean of these 2 groups are significantly different by Wilcox rank sum test (p = 0.0022). B. Predicted secondary structures (RNAfold [115]) and compensatory mutations for two zebrafish lncRNAs containing ANGEL (DNA TEs) elements. In structures, TE derived regions are marked by solid line and base pairing probability by color spectrum (from 0 in violet to 1 in red). Zoom-in windows show part of stem with compensatory mutations: nucleotide substitution are boxed and the corresponding nucleotide found in ANGEL consensus are shown under/above actual RNA sequence. Sites of compensatory mutations are marked by asterisks and written p-values are adjusted by Bonferroni methods. To explore further this hypothesis, we studied a family of DNA transposons in zebrafish, called Angel, which occur in high copy numbers and are known to have the potential to form a stable stem-loop structure at the RNA level due to their long inverted repeats [86]. We reasoned that the incorporation of Angel elements in lncRNAs might in some case have conferred a functional benefit by increasing RNA stability. We identified 71 zebrafish lncRNAs containing exonized Angel elements. As expected, RNA folding programs predict that these lncRNAs have the potential to form a long stem-loop structure by intramolecular pairing of the Angel inverted repeats (see examples in Figure 10B). Furthermore, by comparing the sequence of these elements to that of their ancestral (consensus) progenitor, we identified two instances of Angel elements in lncRNAs where base substitutions in one of the arms of the predicted stem-loop structure were accompanied by compensatory substitutions on the other arm allowing the maintenance of base-pairing within the stem-loop structure (Figure 10B). To rule out the possibility that these substitutions occurred not at these loci, but prior to transposon insertion in a progenitor element that would have amplified or duplicated, we used BLAT [87] to search the zebrafish Zv9 genome assembly for paralogous Angel elements that might be sharing the same substitutions. In each case, we found that the compensatory substitutions we identified were unique to the Angel copies residing within the examined lncRNAs (data not shown), suggesting that these mutations occurred after transposon insertion. The probability that these compensatory substitutions would have occurred by chance alone in these two Angel elements is 0.001 and 0.036 after Bonferroni correction, respectively (see Methods). Furthermore, 12 of the 16 concerted mutations were from A/T to C/G base pairs, which is consistent with the idea that they increased the stability of the stem-loop structure. These data suggest that these Angel elements indeed fold into the predicted secondary structures in vivo and have been maintained over time by natural selection, plausibly for the proper function of the lncRNAs. To seek another, independent line of evidence for the involvement of TEs in forming secondary structures potentially important for lncRNA function, we next looked for sites of adenosine-to-inosine (A-to-I) editing in lncRNAs. This form of RNA editing is catalyzed by the ADAR family of adenosine deaminases that act on double-stranded RNA templates [88]. In humans, it has been reported that A-to-I RNA editing occurs predominantly within Alu elements embedded in the 3′ UTR of protein-coding transcripts [89]–[91]. This bias has been explained by the relatively high frequency of Alu elements in transcribed regions of the human genome, which often occur in inverted pairs and thereby can form long RNA duplexes providing templates for ADARs [92]. We used DARNED, a database of RNA editing sites in humans [93], to identify 2,941 A-to-I editing sites in mature lncRNA transcripts. As observed previously for mRNAs, most (82%) of the edited sites in lncRNAs occur within Alu elements, although we also found evidence of A-to-I editing within a wide range of TE types embedded in lncRNAs (Table 3 and Table S6). This may be explained by the fact that non-Alu TE sequences are much more frequent in lncRNAs than in mRNAs, even when UTRs are considered separately (Figure 2 and Figure S1). Indeed, we found that the density of edited sites within Alu, non-Alu TE, or non-TE sequences fall within the same order of magnitude in lncRNAs and UTRs (Table 3). In several cases individually examined, we found that editing sites in TE sequences map preferentially within regions of the lncRNA predicted to form stem-loop structures by virtue of the inclusion of two inverted copies of the same TE family in the lncRNA (see two examples in Figure 11). The finding that TE sequences, and in particular Alu elements, embedded in lncRNAs are frequent templates for A-to-I editing confirms that TEs are commonly engaged in intra- or inter-molecular base pairing interactions to form stable dsRNA structures. 10.1371/journal.pgen.1003470.g011 Figure 11 Long stem in two human lincRNAs (Cabili set) formed by inverted TEs. Two examples of heavily edited human lincRNA transcripts with editing sites located in TEs. RNA structures are predicted by RNAfold [115]. Nucleotide color in structures indicates base pairing probability (from 0 to 1). Inverted TE pairs are marked by solid lines, the arrow illustrating TE strand. The stem pair of TCONS_00017795 is composed by inverted Alu elements, while the structure of TCONS_0001109 is formed by 2 LTRs (MLT2B3). 10.1371/journal.pgen.1003470.t003 Table 3 Editing sites in exons. Total editing sites lincRNA 5′UTR 3′UTR Alu 1521 29 5297 other TE 162 0 121 nonTE 187 87 688 Editing sites per 100 kbp# lincRNA 5′UTR 3′UTR Alu 298 59 472 other TE 6.19 0 6.84 nonTE 3.23 1.75 3.17 # Editing sites density in Alu and non-Alu sequences are similar among non-coding transcript. The density of Alu in different sequences are about 10-3, while in other TE elements and non TE sequences the density is about 10-5. Discussion Recent high-throughput efforts to characterize the transcriptome of multicellular eukaryotes have uncovered thousands of lncRNA genes [5]–[7], [17]–[19]. While current lncRNA catalogs, such as those we used, are still far from exhaustive and almost certainly contain false positives, they indicate that the abundance and complexity of lncRNAs in mammalian genomes may rival or exceed that of protein-coding genes [9], [10], [13], [16]. The precise functions of the vast majority of lncRNAs remain to be determined, but evidence from genetic, genomic, and biochemical experiments, as well as analyses of sequence constraint, suggest that many lncRNAs perform important functions, most notably in the control of protein-coding gene expression during development and differentiation [for review: 5], [6], [7]. Despite the functional importance of some known lncRNAs, the basic mechanisms of lncRNA evolution have been largely unexplored. The few studies that have examined the evolutionary dynamics of lncRNAs paint a picture of evolutionary volatility, where large cohorts of lncRNAs seem to appear, disappear, or rapidly diversify, pointing to a potentially important role of lncRNAs in lineage-specific regulatory innovation [6], [10], [16]. Here we present a systematic assessment of TE contribution to the makeup, evolutionary origins, and regulation of vertebrate lncRNAs. While this paper was in its final stage of preparation, a study by Kelley and Rinn [94] analyzed in some detail the contribution of TEs to human and mouse multi-exonic intergenic lncRNAs. The two studies complement each other in that we analyzed different lncRNA sets, but arrive at the same general conclusion that TEs are important players in the composition and diversification of lncRNAs, highlighting a new way mobile elements have influenced genome evolution and shaped a likely crucial layer of genomic regulation. TEs are ubiquitous in vertebrate lncRNAs, but the type and amount of TE vary among species While TEs are seldom found in protein-coding transcripts (even in their UTRs, see Figure 1 and Figure 2), they are ubiquitous in lncRNAs of all three vertebrates examined (Figure 1A), accounting for a large fraction of total lncRNA sequence (Figure 2). Thus, high TE prevalence is probably a common characteristic of vertebrate lncRNA repertoires that distinguish them from mRNAs and smaller ncRNAs, such as tRNAs or microRNAs, which are typically TE-depleted (Figure 1A). We found that all major TE classes are found in lncRNAs in each of the three vertebrate species surveyed, and their relative abundance mirrors that of the entire genome (Figure 2). Nonetheless, in each species we identified TE families that were statistically enriched (up to 32 fold) in lncRNA exons relative to their coverage or density in the whole genome (Figure 7 and Figure S3). Interestingly, these over-represented families belong to different TE classes in the species examined, for example, LTR/ERV in human and mouse and DNA transposons in zebrafish (compare colors per species in Figure 7 and Figure S3). The predominance of DNA transposons in zebrafish is expected based on the prevalence of DNA transposons in this genome (see Figure 2, Figure 7 and Figure S1). However our results show that LTR/ERVs contribute disproportionally to lncRNAs in human and mouse, which is in agreement with the recent results reported by Kelley and Rinn [94]. Interestingly, human lncRNAs are mostly enriched for the ERV I subclass (alpharetroviruses), compared to mouse where ERV 2, ERV 3 or ERV K TEs are enriched (Figure 7 and Figure S3). ERV 1 subclass of elements is less abundant in the mouse genome [95] and strongly repressed in mouse ESCs [96], [97]. Therefore, it is not surprising that this type of retroviral elements do not contribute more to mouse lincRNAs. While LTR/ERV elements are also generally silenced in most human tissues, a subset of families is known to escape silencing and to become transcriptionally active in some tissues, cell types, or at certain developmental stages [54], [98]–[100]. These properties may derive from the intrinsic capacity of retroviruses to hijack host transcriptional activators in order to promote their own expression in a cell-type or developmentally restricted fashion [51], [52], [55]. For example, hundreds of ERV I elements recruit the pluripotency factors OCT4 and NANOG in human ESCs, but rarely do so in mouse ESCs [57]. This mechanism can readily explain why lncRNA-RoR and lncRNA-ES3 and hundreds of other lncRNAs associated with ERV I elements (such as LTR7/HERVH) are highly transcribed in human ESCs (Figure 3, Table 2) [see also ref. 94]. This trend is also globally apparent through the enrichment of LTR elements (including LTR7/HERVH) in 286 human lncRNAs upregulated in ES cells [annotations from Table S1, 10] (data not shown). In sum, the interspecific variations we observe in the coverage and type of TEs in lncRNAs likely reflect a variety of factors; both methodological, such as the breadth of cell types and tissues examined, and biological such as the abundance and intrinsic properties of certain TEs residing in the genome. “lncRNA first” versus “TE first”: Divergence or emergence? Two scenarios can explain the prevalence of TEs in lncRNAs. The first is that TE insertion in pre-existing lncRNAs has relatively little deleterious effect on lncRNA function allowing TEs to accumulate over time as waves of transposition break in the genome. We call this scenario the ‘lncRNA first’ model because it implies that the origin of the lncRNA predates the incorporation of TE(s) in their exons. In the second and not mutually exclusive scenario, the ‘TE first’ model, lncRNAs are assembled from TEs that inserted before the birth of the lncRNAs. Several observations and examples outlined below indicate that both models contribute to the pervasive occurence of TEs in lncRNAs. The “lncRNA first” scenario is supported by a comparison of the few lncRNAs known to be of relatively ancient origin, exemplified by Xist or cyrano, which have assimilated lineage-specific TE insertions sequentially during evolution (see Figure 9 and Table S2). Typically, these exonized TEs correspond to the most variable regions of the transcript sequence flanking more deeply conserved core sequences (see [14] and Figure 9). On a broader scale, we observe that TEs predominantly contribute to the last exon of lncRNAs (56.5% of TE amount, see Figure S5). The biased incorporation of TEs to the 3′ region of transcripts is also apparent for mRNAs, where exonized TEs are more abundant in 3′ UTRs (Figure 1, Figure 2 and Figure S5), as reported previously [65]–[67]. These data suggest that TE-derived sequences are preferentially acquired at the 3′ end of pre-existing transcripts, either because this region is more permissive to TE insertion and/or because TEs are somehow predisposed for this type of exonization events, for example owing to the presence of cryptic acceptor splice sites facilitating their capture [70], [101], [102]. In any case, this 3′ bias is consistent with the ‘lncRNA first’ model whereby TEs are secondarily acquired by existing lncRNAs. On the other hand, several observations support the ‘TE first’ model. First, we identified thousands of lncRNA transcripts that are mostly or entirely composed of TEs (Figure 1B). It is difficult to conceive that these lncRNAs would have emerged from ancestral non-TE regions later replaced or obliterated by secondary TE insertions. More likely, these lncRNAs were born from material providing by pre-existing TE insertions. In support to this idea, we identified 4,404 human Gencode v13 lncRNA transcripts with TE-derived TSS, with 1,777 of these (40.4%) derived from primate-specific TE families (Table S4). In addition, we found 2,213 human lncRNA transcripts whose first exons are entirely derived from TEs, and 965 of these (43.6%) are derived from primate-specific TE families (Table S4). These values are very similar when only genes with a unique TSS are considered and we retrieved comparable numbers in the Cabili set (Table S4). Since these TEs provide the only TSS assigned for these transcripts, we propose that these lncRNAs were born from the transcriptional activity brought upon TE insertion in the genome. Interestingly, 36.8% (857/2,331) of the TE-derived unique TSS map within LTR/ERV elements, while this type of elements account for only 8% of all TEs in the human genome (see Figure 2). Thus, it appears that the tissue-specific transcriptional activity of LTR/ERV elements [52], [54], [100] represents a major force driving the birth of lncRNAs. These data also imply that a substantial fraction of human lncRNAs are of recent origin, because ∼40% of TEs driving human lncRNAs are primate-specific and some are even restricted to hominoids (e.g. Figure 3A and 3B, Table S2). In summary, our data suggest that, in some instances, TE insertion events have been a source of diversification of ancestral lncRNAs, while in others TE insertions have triggered the emergence of brand new lncRNAs during evolution. In order to better quantify the relative importance of either process to lncRNA evolution, it will be necessary to infer systematically the age of lncRNAs using a comparative RNA-seq approach [16]. TE–mediated regulation of lncRNA genes It has been extensively documented that mammalian TEs represent an abundant source of cis-regulatory sequences driving or modulating the expression of adjacent protein-coding genes [reviewed in 49], [56]. Our study provides evidence that TEs located in the vicinity of lncRNAs may also frequently contribute to the transcriptional regulation of these genes. As discussed above, LTR/ERV elements appear to make a disproportionate contribution to lncRNA regulation relative to other TE types and in some cases they may be solely responsible for the cell-type specificity of lncRNA expression. This is exemplified by lncRNA-RoR whose transcription in hESCs is driven by a LTR7/HERVH element occupied by the pluripotency factors OCT4, NANOG and SOX2 (Figure 3C and [41], [94]). Thus, much like LTR/ERV elements have been implicated in the wiring of protein-coding genes into specific regulatory networks [55], [57], [59], [84], they have also recruited lncRNAs serving important developmental function, notably in the pluripotency network [41], [43], [94]. Possible functions of TEs embedded in lncRNAs Perhaps the most pressing question to address in the future is to what extent TEs may contribute to the function of lncRNA and how? Our analysis shows that TEs embedded in lncRNAs frequently supply sequences and signals essential for the transcription (e.g. TSS) and processing (e.g. splice, polyA sites) of the lncRNAs (Figure 5). However it does not prove that TE sequences per se are indispensable for lncRNA function, if such function even exists. Many studies have used various approaches and statistics to show that lncRNA exons, as a whole, display weak but significant signals of purifying selection suggesting that at least a fraction of lncRNA sequences is subject to functional constraint detectable at the primary DNA level [10], [12], [29]–[33]. Our analysis confirms the existence of a signal of purifying selection acting on human lncRNA exons, but more importantly we observe that this signal is higher in TE-derived than in non-TE derived lncRNA sequences (Figure 4A) suggesting that a subset of TE sequences in lncRNAs are structurally or functionally constrained. TE-derived sequences could serve as the functional elements of lncRNAs in numerous ways. For example, TE sequences might provide interaction interfaces with proteins involved in post-transcriptional or transcriptional regulation, such as the chromatin modifiers often found in complex with lncRNAs [37], [103]. Their inclusion may also provide opportunities for base-pairing interaction with single-stranded DNA or RNA containing similar repeats in inverted orientation. Such duplexes might act as a platform to recruit protein effector complexes to genomic or RNA targets. For example, Alu elements embedded within several human lncRNAs form a group called 1/2sbs-RNAs that base-pair with complementary Alu elements located in the 3′-UTR of several protein-coding transcripts to form duplexes creating a binding site for the Staufen1-mediated RNA decay machinery, which in turn promote post-transcriptional repression of the targeted mRNAs [104]. Given the abundance of Alu and other high copy number TEs in lncRNAs, such trans-regulatory effects may be widespread and affecting a large number of mRNAs containing complementary TEs in their UTRs. It was also shown recently that a B2 SINE embedded in a mouse lncRNA antisense to Uchl1 is required for post-transcriptional up-regulation of UCHL1 protein synthesis, an activity that can be transferred to an artificial antisense green fluorescent protein transcript containing the B2 SINE element [105]. We identified 361 mouse lncRNAs containing B2 SINEs (16.7%; see Tables S5 and S9), raising the possibility that these elements confer similar post-transcriptional regulatory activity to other lncRNAs. Finally, another recent study identified a point mutation associated with a lethal form of infantile encephalopathy within a primate-specific LINE-1 retrotransposon transcribed as part of a lncRNA in the human brain [78]. The precise function of the LINE-1 element in this lncRNA is unknown, but knockdown of the lncRNA resulted in increased neuronal apoptosis, an effect consistent with the encephalopathy phenotype. Interestingly, the point mutation detected in affected individuals was predicted to destabilize a secondary structure in the corresponding lncRNA, suggesting that the LINE-1 element may contribute to lncRNA folding that is important for its function in the brain. Similarly, we identified several instances in zebrafish and in human where TEs embedded in lncRNAs are predicted to be involved in the formation of stem-loop structures that have been maintained in evolution through compensatory mutations and therefore are likely to be functionally significant. We also found that these structures often lead to RNA editing of lncRNAs, which to our knowledge is a novel observation that may be relevant to the function of some lncRNAs [88], [92]. We also show that human lncRNAs fold into more stable structure than those with low TE content, suggesting that these individual examples of TEs apparently co-opted for the cellular function of lncRNAs likely represent only the tip of a large iceberg. Future work is bound to unravel a variety of mechanisms through which TEs embedded in lncRNAs have become involved in regulating the expression of vertebrate genomes. Conclusions There is growing evidence that vertebrate genomes contain a large number of long non-coding RNA genes (lncRNAs) that play important gene regulation roles, however, remarkably little is known about the origins of these genes. Our study reveals that TEs, through their capacity to move and spread in genomes in a lineage-specific fashion, as well as their ability to introduce regulatory sequences upon chromosomal insertion, represent a considerable force shaping the lncRNA repertoire of human, mouse and zebrafish. These results suggest that the apparent paucity of ancient lncRNA genes may be explained in part by rapid turnover mediated by lineage-specific TEs and imply that the regulatory networks in which lncRNA genes act may be rapidly diverging between species. Methods lncRNA datasets The datasets used in this study are as follow: human, Gencode release 13 (from ftp://ftp.sanger.ac.uk/pub/gencode) and Cabili et al. (2011) [9]. Mouse, Ensembl release 70 (ftp://ftp.ensembl.org/pub/release-70/gtf/mus_musculus/) and Kutter et al (2012) [16], both sets filtered to keep only intergenic lncRNAs. Coordinates from Kutter et al. were converted from mm9 to mm10 using the liftover tool from UCSC (http://genome.ucsc.edu/cgi-bin/hgLiftOver?hgsid=325693955). Zebrafish sets are from Pauli et al. (2012) [24] and Ulitsky et al. (2011) [14]. To limit redundancy, in case of overlap of exons between transcripts of the two sets, only transcripts from Pauli et al. (2012) were kept. Additional descriptors of the datasets are provided in Table 1. TE annotation TE annotations used in this study are obtained from the outputs of the RepeatMasker (RM) software [106] produced for the following genome assemblies: human, hg19 assembly, RM v.330, repbase libraries 20120124, from RM website (http://www.repeatmasker.org/species/homSap.html). Mouse, mm10 assembly, RM v.330, repbase libraries 20110920, from UCSC website (http://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/). Zebrafish, danRer7: RM v.329, repbase libraries RB20090604, from UCSC website (http://hgdownload.cse.ucsc.edu/goldenPath/danRer7/bigZips/). These RM outputs were filtered to remove non-TE elements (Low Complexity, Satellites, Simple Repeats and ncRNA). For mouse, MutSatRep1, CYRA11_Mm and YREP_Mm are also removed. To minimize multiple counting of single TE copies artificially fragmented in the RepeatMasker outputs we merged consecutive pieces of the same TE separated by less than 10 nt. Counts of TEs in transcripts The TE content of lncRNA transcripts (datasets described above) and human RefSeq 57 ncRNAs (22,486 in total), pseudogenes (13,430), CDS and UTRs (20,848 protein coding genes) was determined by intersecting these sets with each species' TE annotations (described above) using the ‘Table Browser’ at the UCSC Genome Bioinformatics Site (http://genome.ucsc.edu/index.html) [107]. Only overlaps of minimum 10 bp were kept. Coverage of TEs in exons and surrounding sequences Protein-coding gene (pc genes) models were filtered to retain only those with 5′ and 3′ UTRs, from following releases: human, Refseq 49 (hg19, gtf file from UCSC genome browser); mouse, Refseq 57 (mm10, gtf file from UCSC genome browser); zebrafish, Ensembl 68 (danRer7). All nucleotide amounts correspond to genomic amount. Introns and upstream or downstream intergenic regions were processed through Galaxy [108]–[110] to remove all RefSeq genes exons (CDS and UTRS: Refseq 51 for human and zebrafish, Refseq 57 for mouse) as well as lncRNA exons of the datasets considered. Intergenic sequences (upstream or downstream, up to 10 or 1 kb) correspond to the longest fragment between TSS or polyA and another feature (RefSeq entries as well as lncRNA exons of the dataset considered). These sets (exons, introns, intergenic sequences) were then joined in Galaxy with filtered RepeatMasker outputs described above keeping only fragments with at least 10 nt of overlap, to calculate TE coverage of exons. See Tables S7, S8, S9, S10, S11 for transcripts TE content data. Conservation PhyloP By comparing their PhyloP scores across an alignment of 10 primate genomes, the conservation of human (Gencode v13) TE-derived lncRNA exonic segments was compared to non-TE derived lncRNA segments, RefSeq 57 5′- and 3′-UTRs, protein-coding exons and a set of random genomic fragments size-matched to the TE-derived lncRNA segments. We also generated a set of TE-derived lncRNA intronic segments, non overlapping with splicing sites and corresponding to inactive chromatin to obtain a most neutral set to compare with exonic TE-derived lncRNA segments [32] (all annotated chromatin marks from 9 cell lines were subtracted: GM12878, H1-hESC, HMEC, HSMM, HUVEC, HepG2, K562, NHEK, NHLF; ENCODE version Jan. 2011). Precompiled PhyloP scores were obtained from the ‘phyloP46wayPrimates’, ‘phyloP46wayPlacental’, and ‘phyloP46wayAll’ tracks available from the UCSC Genome Bioinformatics Site (http://genome.ucsc.edu/index.html) [107] and intersected with gene annotations using bedtools (http://code.google.com/p/bedtools/) [111]. Boxplots were made in R (http://www.r-project.org). Statistical test used: permutation test with 1000 permutations were performed in R. TE contribution to functional features: Exon counts TEs were not assigned a strand allowing them to overlap genomic features on either strand. TEs found in a genomic feature were classified based on their position in the feature, as schematized in Figure 5A. Both lncRNA and protein-coding genes were filtered before both the random and non-random analyses. In case of multiple splice forms a random mRNA was kept. Additionally protein coding genes that did not have both a 5′ UTR and 3′ UTR were excluded from the analysis. Sets of protein-coding genes are as follows: human, Refseq release 49, mouse, Refseq release 46, zebrafish, Ensembl release 68. For the random sets, all TEs were shuffled within chromosomes (excluding gaps) while preventing TE overlap. This process was repeated 5,000 times for each set, using a custom perl script (see http://www.yandell-lab.org/publications/index.html). The standard error for the random sets across all categories (Figure 5A) always plateaued before 1,000 replicates (data not shown). The probability of observing the non-random counts was calculated using the random sets. The p-value represents number of times a lower category count was observed in the random set out of 5,000 replicates. With the exception of ‘exonized’, ‘TSS’ and ‘polyA’ categories in mouse (p-values = 1, 0.001 and 0.298 respectively) and ‘exonized’ category in zebrafish (p-value = 0.001), the p-values were systematically 1 or the 25 most over represented are shown. A. Human, set from Gencode v13. B. Human, set of Cabili et al. (lincRNAs). C. Mouse, lincRNAs from Ensembl. D. Zebrafish. (TIF) Click here for additional data file. Figure S4 Amount of lineage specific and ancient TEs in human lncRNAs and protein-coding genes genomic environment. G: genome. See Methods “Coverage of TEs in exons and surrounding sequences” for details on sets. Ancient TEs correspond to TEs shared between placental mammals (Eutherians). Gencode v13 set for lncRNA. (TIF) Click here for additional data file. Figure S5 Relative amount of TEs depending on exon type. TE contribution means that 100% is the total coverage of a given class of TEs. For example, ∼20% (19.5) of TE amount is in first exon for lncRNAs, whereas for pc genes it is 6.4% (“All” TEs). lncRNAs are Gencode v13 set. A. Coverage. B. Counts. (TIF) Click here for additional data file. Table S1 TE coverage and TE counts in genic and intergenic human lncRNAs and surrounding regions. (XLSX) Click here for additional data file. Table S2 Detailed TE content of known lncRNAs presented in Table 2. (XLSX) Click here for additional data file. Table S3 Numbers of exons overlapping with TEs (data corresponding to Figure 5). (XLSX) Click here for additional data file. Table S4 Counts of functional features provided by ancient or primate TEs in human. (XLSX) Click here for additional data file. Table S5 Editing sites in Cabili lincRNA set. (XLSX) Click here for additional data file. Table S6 361 mouse lincRNAs with B2 SINEs elements (ENSEMBL release70). (XLSX) Click here for additional data file. Table S7 Details of TE content of human Gencode v13 lncRNAs. (XLSX) Click here for additional data file. Table S8 Details of TE content of human lincRNAs from Cabili et al. (2011). (XLSX) Click here for additional data file. Table S9 Details of TE content of mouse lincRNAs from ENSEMBL release 70. (XLSX) Click here for additional data file. Table S10 Details of TE content of mouse lincRNAs from Kutter et al. (2012). (XLSX) Click here for additional data file. Table S11 Details of TE content of zebrafish lincRNAs. (XLSX) Click here for additional data file.

0 comments Cited 301 times – based on 0 reviews      Review now

Bookmark

Record: found
Abstract: found
Article: not found

Widespread A-to-I RNA Editing of Alu-Containing mRNAs in the Human Transcriptome

Alekos Athanasiadis, Alexander Rich, Stefan Maas (2004)

Introduction On the molecular level, the complexity of higher organisms is based on the number of different gene products available for structural, enzymatic, and regulatory functions. Posttranscriptional and/or posttranslational mechanisms have an important role in generating RNA and protein diversity (Baltimore 2001). One posttranscriptional processing pathway present in higher eukaryotes is RNA editing by adenosine deamination involving modification of individual adenosine bases to inosine in RNA by adenosine deaminase acting on RNA (ADARs; reviewed in Bass 2002; Schaub and Keller 2002; Maas et al. 2003). Since inosine acts as guanosine during translation, A-to-I conversion in coding sequences leads to amino acid changes and often entails changes in protein function (Seeburg et al. 1998; Bass 2002; Schmauss and Howe 2002). The power of RNA editing in generating protein diversity lies in the fact that usually both the edited and unedited versions of the RNA and/or protein coexist in the same cell, and the ratio between the unedited and multiple edited variants can be regulated in a cell type-specific or time-dependent manner. Crucial functional properties of neurotransmitter receptors are regulated by A-to-I editing in the central nervous system (Seeburg et al. 1998; Schmauss and Howe 2002), and inactivation of editing enzymes in mice (Higuchi et al. 2000) and in the fruit fly (Palladino et al. 2000) have resulted in profound neurological phenotypes. In addition to amino acid changes, A-to-I RNA editing can theoretically lead to the alteration of transcriptional start and stop codons, as well as that of RNA splice sites. In only one case though has the creation of a splice acceptor site through intronic RNA editing been described (Rueter et al. 1999). Currently it is not known if the recoding of mRNAs at single codon positions is the main function of A-to-I RNA editing or if other types of editing events with as yet unknown roles in the regulation of gene expression are more widespread. The recently reported embryonic lethality in mice with ADAR1 deficiency indicates that additional substrates for this enzyme exist that function during early embryonic development (Wang et al. 2000, 2004; Hartner et al. 2004). Furthermore, a role for ADAR1 in the immune system is widely accepted, as one of its isoforms is interferon induced (Patterson and Samuel 1995) and upregulated in immune cells during chronic inflammation (Yang et al. 2003). The ablation of editing enzymes in Caenorhabditis elegans resulted in transgene silencing, suggesting that the RNA editing and RNA interference (RNAi) pathways intersect (Knight and Bass 2002). This notion was recently confirmed by findings that the behavioral phenotype of ADAR-deficient worms could be rescued by inactivation of the RNAi pathway (Tonkin and Bass 2003). Since both RNAi and RNA editing target double-stranded RNA (dsRNA) molecules, RNA editing could suppress gene silencing by preventing the formation of small interfering RNAs (siRNAs). A recurring theme of edited sequences is the involvement of an imperfectly dsRNA foldback structure (Higuchi et al. 1993). The importance of base-paired RNA elements for site-selective editing to occur is also mirrored in the presence of dsRNA binding domains in ADAR enzymes (Bass 2002). At present, though, it is not possible to predict if and to what extent a given RNA molecule is a substrate for A-to-I RNA editing in vivo. Despite recent progress in identifying additional genes that undergo RNA editing (Morse and Bass 1997; Morse et al. 2002; Hoopengardner et al. 2003), the total number of currently known A-to-I edited genes in mammals is still small (Bass 2002). However, the activity of the mammalian editing machinery, as measured by inosine content in mRNA fractions (Paul and Bass 1998), is much higher than expected based on the current number of known substrates. Furthermore, ADARs are ubiquitously expressed in mammalian tissues, but almost all ADAR targets identified to date reside in the brain (Bass 2002; Maas et al. 2003). This discrepancy between signs that A-to-I editing is omnipresent and the scarcity of identified targets has puzzled researchers in the field for some time, wondering where all the edited transcripts are. In this study we identify a minimum of 1,445 edited human mRNAs present in existing databases. Clusters of adenosine-to-guanosine (AtoG) discrepancies in these cDNAs are the result of RNA editing involving intramolecular pairs of inverted Alu repeat sequences, repetitive elements that represent approximately 10% of the human genome and are concentrated in and around genes (Batzer and Deininger 2002). We also characterize functional consequences of the observed editing events and the factors that determine editing levels in Alu repeats and their modification patterns. The prevalence of Alu elements in primate genes, together with our experimental and computational analysis, suggests that the vast majority of primary human gene transcripts (greater than 85% of RNAs with average structure) are subject to A-to-I RNA editing. We show how editing might influence the alternative splicing of exonized Alu elements and discuss the implications of this extensive modification of mRNAs bearing repetitive elements for the regulation of gene expression. Results/Discussion Clusters of AtoG Discrepancies between Genomic and cDNA Sequences Are Due to A-to-I RNA Editing and They Are Located in Alu Repeat Elements A hallmark of an A-to-I RNA editing event is an AtoG transition when comparing genomic and cDNA sequences of the affected gene since inosine base-pairs with cytosine and therefore is replaced by guanosine during reverse transcription and PCR amplification. However, AtoG discrepancies between genomic and cDNA sequences can also be due to single-nucleotide polymorphisms (SNPs) or errors in databases. Therefore the search for edited sequences on a genome-wide basis is not feasible solely based on this single feature. However, in some cases of editing, not a single, but a cluster of AtoG discrepancies between genomic and cDNA sequences is evident within a stretch of a few hundred nucleotides (Patton et al. 1997; Morse et al. 2002; Rosenthal and Bezanilla 2002). Therefore, we decided to inquire whether clusters of AtoG transitions seen in cDNA/genomic DNA (gDNA) sequence comparisons might represent bona fide editing events, since multiple base changes, all being of the AtoG type, are not likely accounted for by cosegregating SNPs or sequencing errors. In an initial screen for candidate genes, we used the Human Unidentified Gene-Encoded (HUGE) database of ca. 3,000 human cDNAs derived from the Kazusa cDNA sequencing project (Kikuno et al. 2002). Several examples of cDNA sequences were found that within a window of 200–300 nt differ at several positions from the genomic sequence, such that the cDNA harbors a G where the genomic counterpart specifies an A. AtoG differences that coincide with an annotated A/G SNP were filtered out. Table 1 shows a list of all 26 genes from the HUGE database showing greater than two AtoG transitions in the exonic regions. Remarkably, we found that in all cases except one (KIAA0001) the location of the AtoG cluster coincides with the position of an Alu repeat element in the cDNA. As with Alu elements, most AtoG transition clusters are localized in 5′-UTR and 3′-UTR sequences and few in coding regions. Alu elements are short interspersed elements found in all primates, which are approximately 300 nt in length (Batzer and Deininger 2002). There are about 1.4 million copies of Alus from several closely related subfamilies present in the human genome, comprising approximately 10% of its mass (Lander et al. 2001). The enrichment of Alu repeats in gene-rich regions of the genome (Chen et al. 2002) makes their prevalence in transcribed sequences even more pronounced. Their high CpG dinucleotide content renders Alu sequences targets for methylation and implicates them in the regulation of gene expression (Rubin et al. 1994). Clusters of A/G discrepancies that mapped to Alu repeats had been noted before in the HUGE database (Kikuno et al. 2002). Furthermore, of ten newly identified editing targets in C. elegans (Morse and Bass 1999; Morse et al. 2002) and 19 in human brain (Morse et al. 2002), most were located in repeat elements. These findings suggested that repetitive elements, such as Alus, might be frequent targets for A-to-I RNA editing. In order to better understand the connection of Alu's with the observed AtoG clusters, we analyzed experimentally the cDNAs from all 25 candidate genes for RNA editing in human brain. Total RNA and gDNA were isolated from the same human brain specimen to eliminate false positives from unmapped A/G SNPs. For all 25 genes in vivo RNA editing was detected by single-run sequencing of gene-specific RT-PCR products, and for five of them the editing efficiency was quantitatively evaluated through repeated experiments. Extents of editing ranged from less than 2% to 90% at individual sites (Table 1; Figures 1–3). Intramolecular Pairs of Oppositely Oriented Alus Are Responsible for Alu Element Editing Since a prerequisite for A-to-I RNA editing is the presence of a partially base-paired RNA foldback structure (Higuchi et al. 1993; Bass 2002), the observed modifications in Alu repeats might be the result of two oppositely oriented, base-pairing repeat elements located within the same RNA molecule. For each of the 25 genes with edited, exonic Alu elements we find such oppositely oriented Alu repeats in the same pre-mRNA, many of which are located in intronic sequences. To determine if the predicted Alu pairs and the calculated foldback structures (Figures 1A, 3, and 4) actually form in vivo, we analyzed experimentally the predicted Alu partners from the pre-mRNA for four of the identified editing targets (LUSTR, KIAA0500, Bruton's tyrosine kinase [BTKI], and KIAA1497). In each case we found that the closest, oppositely oriented Alu repeat undergoes A-to-I RNA editing as well. Because of the abundance of Alu elements in human pre-mRNAs, most primary transcripts contain one or more pairs of oppositely oriented Alus. If a majority of them is indeed subject to A-to-I RNA editing in vivo, it should be possible to predict RNA edited genes by identifying inverted pairs of Alu repeats in pre-mRNA transcripts. As a proof of principle, the analysis was extended to four arbitrary chosen genes (p53, SIRT2, NFκB, and paraplegin (SPG7) containing pairs of Alu repeats as seen schematically in Figure 4B–4E. In all four cases, editing in the Alu elements that are predicted to form a dsRNA foldback structure is readily detectable. Many primary gene transcripts allow several energetically favorable foldback structures to be predicted for a given Alu that involve different combinations of Alu pairs. Do all these alternative Alu-pair foldback structures exist in vivo and are therefore subject to RNA editing? To address this question we examined the editing pattern of the G-protein-coupled receptor 81 (GPR81; identified through a computational search as described below). GPR81 contains four Alu elements, one sense and three antisense oriented, in the 3.6-kb pre-mRNA and was selected based on Alu repeat configuration and transcript length. If the alternative foldback structures depicted in Figure 5 coexist in vivo, all four Alu elements should show signs of editing with the level of editing indicating how prominent each of the alternative structures is. According to the analysis of GPR81 pre-mRNA, all three configurations form in vivo with variant II possibly being the dominant one since AluSp and AluJo show the highest levels of editing (Figure 5). These results suggest that Alu elements in human mRNAs are subject to RNA editing by ADARs because of foldback structures formed between two oppositely oriented Alus present within the same primary transcript. Editing of Alus Is Tissue Dependent and It Alters Codons and Pre-mRNA Splice Sites of Alternatively Spliced Alu Exons Exonic Alu repeat elements are predominantly located in the 5′- and 3′-UTRs of mRNAs, and as a result, most cases of Alu editing occur in noncoding regions. However, some editing events predict amino acid changes (Table 1). Among the identified genes for which we performed a detailed, quantitative editing analysis several unique and recurring features emerge regarding the locations and functional implications of the editing events. The LUSTR1 cDNA codes for a G-protein-coupled seven-transmembrane receptor (also termed GPR107 or KIAA1624), with three AtoG discrepancies located within an alternatively spliced AluJo-derived exon that leads to the in-frame insertion of 29 amino acids between transmembrane regions V and VI of the protein (see Figure 1A). The experimental analysis revealed a total of ten editing sites within this Alu element (see Figure 1B), including two major sites that lead to amino acid changes (H/R and Q/R sites). Interestingly, editing levels at all positions were significantly different in human brain (19%–58%) compared to lung (less than 5%), suggesting a tissue-specific regulation of editing (see Figure 1B). Analyzing the RNA editing pattern of LUSTR1 pre-mRNA revealed additional intronic editing sites, one of which represents the splice acceptor adenosine (AG to IG) in intron 15 (22% edited in brain; see Figure 1B). Editing at this position is predicted to lead to the exclusion of the Alu exon, indicating that the alternative splicing of exon 15a might be coregulated by RNA editing of its splice junction. This is to our knowledge the first documented example where A-to-I RNA editing acts to destroy a pre-mRNA splice signal. A picture similar to LUSTR1 emerges from analysis of the gene for human inhibitor of BTKI (also termed KIAA1417; Liu et al. 2001; Strausberg et al. 2002). Again, an alternatively spliced Alu exon (located between constitutive exons 22 and 23) is affected. This time two AluSx elements are positioned in opposite orientation at the start and end of the exon (see Figure 3). Inclusion of the exon using the splice acceptor site provided by AluSx− leads to the premature termination of translation within this exon with all editing sites located in the 3′-UTR. Editing levels at 20 sites throughout the Alu element range from less than 5% to 31% in human brain, whereas cDNAs isolated from human lung again displayed few editing sites with low editing levels of less than 5%. A splice site is also subject to editing in BTKI, this time affecting an additional alternative splice acceptor site within AluSx−. On the pre-mRNA level this position is edited to 15%. However, in transcripts that use the weak upstream splice acceptor site (underlined with a dashed line in Figure 3B; as in the HUGE database clone hh15303), the additional alternative splice site (underlined with a solid line in Figure 3B) is highly edited, raising the possibility that edited BTKI pre-mRNA preferentially follows the alternative splicing pathway (data not shown). The analysis of GPR81 revealed another case of Alu exon alternative splicing and, surprisingly, a new mechanism showing how RNA editing might affect RNA processing. Within the AluSp+ element located in the 3′-UTR of GPR81 transcripts a splice donor site (AT to IT) is generated in 57% of primary transcripts by RNA editing. This is predicted to give rise to alternatively spliced mRNA products represented by GenBank entry AF385431 (see Figure 5B). This is, to our knowledge the first reported case of potential splice donor site creation by RNA editing. It is possible that here the Alu element was inserted into the 3′-UTR exon of the GPR81 gene and has evolved into a state where it is a single mutation away from initiating the birth of a novel intron. Posttranscriptionally RNA editing provides the final base change to create the new splice site. This scenario is supported by the fact that in mice the GPR81 gene lacks introns. It is intriguing that we find cases where editing in alternatively spliced Alu exons, or within adjacent splice sites, interferes with or counteracts exon formation of an Alu repeat. It suggests that RNA editing might be more generally involved in the regulation of Alu exonization. Recently, it has been shown that more than 5% of the alternatively spliced exons in the human genome are Alu derived (Sorek et al. 2002). Exonization of Alu repeats occurs via activating mutations in mostly antisense-oriented, intronic Alus generating a novel splice acceptor site (Mitchell et al. 1991; Lev-Maor et al. 2003), and it has been speculated that exonization of transposable elements in general is a major mechanism for the generation of novel exons (Kreahling and Graveley 2004). A large number of intronic Alu elements are just a single mutation away from being exonized (Lev-Maor et al. 2003; Kreahling and Graveley 2004), and in some cases the constitutive splicing of an intronic Alu has been shown to cause a genetic disorder (Mitchell et al. 1991; Knebelmann et al. 1995; Vervoort et al. 1998). In this context RNA editing may partially counteract genomic mutations that lead to the incorporation of deleterious novel exons while maintaining their potential to form exons with beneficial functions through further mutation. Furthermore, RNA editing in Alus might be involved also in the generation of novel introns as seems to be the case in GPR81. Statistically, however, the exonization of intronic Alus would be much more frequent than the intronization of exonic Alus because of the abundance of Alus in introns. A Transcriptome Wide Screen for Edited Alu Repeats The results presented above show that clusters of AtoG mismatches in cDNA/gDNA sequence comparisons represent an effective way to identify authentic editing cases with a low rate of false positives. Since all clusters of AtoG discrepancies mapped to repeat elements, we wondered how prevalent the editing of Alu or other repeat elements is in the human transcriptome. Therefore, we devised a database search procedure to identify pairs of inverted repetitive elements in human mRNAs exhibiting AtoG transitions. Initially, a limited search was carried out for closely spaced (less than 2 kb) inverted pairs of human Alu, MIR, and L1 repeat elements that overlap with exonic sequences and for which an mRNA sequence can be found in GenBank entries. This search, involving about one-third of all repeat elements in the human genome, identified 71 mRNAs with exonic repetitive-element pairs (51 Alu, six L1, six MER, and eight MIR). From those mRNAs, 27 displayed clusters of AtoG changes, all in Alu elements. Fourteen of these genes were chosen for experimental analysis, and all 14 proved to be subject to A-to-I RNA editing (Table 2). Since these initial results indicated a high prevalence of editing in Alu elements, we decided to carry out a comprehensive search involving all elements present in cDNA sequences. We analyzed the total of 103,723 human mRNA sequences (from the University of California, Santa Cruz [UCSC] Genome database [Kent et al. 2002], July 2003 assembly) for overlaps with repetitive elements of the L1, Alu, MaLR, and MIR families. For Alus, 17,406 mRNAs (16.8%) contained a total of 31,666 complete or partial repetitive-element sequences. Comparing the cDNA sequences with their genomic counterpart revealed that the number of AtoG discrepancies within Alu repeats is more than seven times higher than the average number of the other transitions (23,204 versus 3,271 [the average for GtoA, CtoT, and TtoC transitions]). In fact, the number of AtoG mismatches is higher than all other eleven types of nucleotide discrepancies combined (Figure 6A). While the finding that non-AtoG transitions (GtoA, CtoT, and TtoC) are approximately three times more frequent than transversions is in line with results from previous studies analyzing gDNA sequences (Lander et al. 2001; Venter et al. 2001), there is no explanation for the observed excess of AtoG mismatches relative to other base transitions. Alu sequences carry 22–23 CpG dinucleotides, which are known to show high mutation rates because of C-methylation, and as a consequence, these positions should display an elevated frequency of SNPs. Nevertheless, an elevated number of SNPs would lead to a rise in AtoG as well as GtoA mismatches when comparing a representative population of cDNAs with the corresponding genomic sequence. Thus, we concluded that the excess exclusively in the number of AtoG discrepancies in Alus over other base changes may reflect cases of bona fide A-to-I editing at the RNA level. We then devised a statistical approach to distinguish repetitive elements that show AtoG mismatches due to sequencing errors and SNPs from those that have undergone A-to-I RNA editing. The method was based on the observation above that Alu elements subject to RNA editing undergo multiple base modifications that result in a cluster of AtoG discrepancies (5–30) between cDNA and gDNA. The probability that a cluster of several AtoG discrepancies is due to sequencing errors or SNPs (in the absence of an increased number of other nucleotide discrepancies indicating low-quality sequence data) is negligible. Thus the number of clustered AtoG changes can be used to distinguish genuinely edited elements from elements with aberrant or non-editing-related base changes. For each Alu element with AtoG discrepancies, we computed the χ2 test comparing the observed number of AtoG discrepancies with the expected number, based on the number of non-AtoG mismatches present in the same sequence. Elements with a χ2 higher than the critical value for α = 0.00001 (corresponding to a probability of one in 100,000 that the observed AtoG transitions are due to SNPs or sequencing errors) were selected as “edited” and will be called so throughout the rest of the manuscript. Using this approach we found that out of those 17,406 mRNAs with one or more exonic Alu elements, 1,445 (8.0%) mRNAs are edited within one or more of the Alu sequences (for a full list of edited mRNAs see Table S1). When looking at all the 31,666 Alu elements present within these 17,406 RNAs, we find that 1,925 (6.1%) Alu elements are “edited,” while another 4,574 Alu elements (14.4%) show AtoG discrepancies but fail to pass our probability cutoff (Figure 6B). Thus, the total of 6,499 elements (or 20.5%) represents the upper limit of potentially edited Alus in our sample (Figure 6B). The total number of Alus with GtoA discrepancies in the same sequence sample is 2,002, and we considered this value to reflect base changes that are due to SNPs and sequencing errors. Assuming a similar number for random AtoG and GtoA mismatches, we can subtract this number from the total count of potentially edited Alus, obtaining 4,497 cases (14.2%) as the approximate number of actually edited elements. In order to validate our screening approach, we performed an identical analysis for GtoA, CtoT, and TtoC mismatches. Compared to the 1,925 AtoG-edited Alu elements in mRNA, we found 12 GtoA, 11 CtoT, and 11 TtoC cases of “editing.” These cases may represent false positives and thus set the error level of our screen to less than 0.6%. These results suggest that out of the 103,723 human mRNAs at least 1.4% are A-to-I edited within an exonic Alu element. Apart from Alu repeats, many more low- and high-frequency repeats exist in the human genome (Venter et al. 2001) and might give rise to RNA foldback structures that result in exonic A-to-I editing. Therefore, the total number of mRNAs edited in exonic repeat sequences is probably higher than the value obtained from our analysis of Alu elements. Most Alu repeats are located in introns, and it is there where the bulk of RNA editing is expected to occur. The average number of Alu repeats per gene is 12.4 estimated for Chromosomes 21 and 22 (Grover et al. 2003). This value is comparable to the 19.3 Alus/gene estimated from our data (2,003,976/103,723: total number of Alus (nonunique) in mRNA boundaries/number of mRNAs) for the whole genome. Considering that based on our analysis 14.2% of exonic Alus are edited, and assuming similar editing rates for intronic Alus, we can estimate that the probability of an average pre-mRNA to be edited is approximately 1–0.85819.3 or 94.7% (85.0% with the 12.4 Alu/gene estimate). While this value is an approximation, assuming that all genes have similar structures, and does not take into account editing in other repeat elements, it does reflect the magnitude of repetitive-element editing. Distance, Conservation, and Tissue Localization Influence Which Pairs of Alu Elements Are Edited To gain insight into the factors that determine which Alus are subject to RNA editing, and under what circumstances, the identified set of 1,925 high-confidence cases of editing in Alu elements (contained in the 1,445 mRNAs listed in Table S1) was used for further computational analysis. It was assumed that the observed editing is the result of RNA foldback structures formed between intramolecular inverted Alu repeats, as we have demonstrated for all the experimentally analyzed cases. If this hypothesis is correct, then the distance between an Alu and its closest inverted pairing element should be a critical determinant for how likely it is that a given element will be targeted by the RNA editing machinery. To test this hypothesis the closest inverted Alu was identified within the same gene for all 31,666 Alu elements. A properly oriented element was found for 19,231 of those Alu elements, and a plot was made showing the percentage of edited Alus as a function of the distance between elements (Figure 7A). The highest level of editing (approximately 16%) was found for Alu pairs 300–400 nt apart, which corresponds to slightly more than the size of a full-length Alu repeat. Editing levels subside with increasing distance as the probability for base-pairing between the two Alu elements apparently decreases. Alu pairs with distances below 300 nt indicate partial Alu elements, and the observed decrease in editing levels is likely because of the smaller, less energetically stable foldback structures. These results suggest that the optimal configuration of an Alu-pair stem loop involves two full-length Alus forming the stem separated by a short (10–50 bp) intervening loop sequence. Interestingly, as the distance increases we ultimately arrive at a low-level plateau of approximately 1% editing without any further drop in editing levels. RNA editing in trans caused by base-pairing Alus located in separate RNA molecules might be responsible for this “background” editing. A-to-I editing in trans does occur on pre-annealed RNA duplexes in vitro (Bass and Weintraub 1987; Nishikura et al. 1991) and could occur also in vivo if such intermolecular RNA duplexes form. In Xenopus one case of potential trans editing has been described, involving RNA duplexes formed between sense and antisense transcripts of bFGF (Saccomanno and Bass 1999). The distance dependence of the extent of editing clearly suggests that the formation of Alu–Alu stem loop structures predominantly results from intramolecular Alu inverted repeats with an upper limit of approximately 1% editing that could be due to intermolecular Alu pairings. To our knowledge, these results describe for the first time the distance relationship of long-range RNA folding interactions in vivo and how their stability is influenced by distance. More important, considering the high frequency of Alu elements in primate RNA sequences and the low levels of potential intermolecular editing observed, we conclude that intermolecular duplexes between complementary RNA sequences do not form in the nucleus at a significant rate. This raises the question of how the regulation of thousands of human messages proposed by Yelin and colleagues involving antisense transcripts works (Yelin et al. 2003). It might also explain why cases of editing involving endogenous sense/antisense RNA duplexes have not been reported despite evidence for extensive antisense transcription. The editing of RNAs by ADARs has been shown to be dependent on the double-stranded character of the substrates, such that editing levels and promiscuity increase with the extent of the base-paired region (Bass 2002). The human Alu family is composed of several subfamilies of different genetic ages, and their consensus sequences contain diagnostic changes distinguishing one subfamily from another. The extent of base-pairing between two oppositely oriented Alu elements, and in turn the extent of A-to-I editing, depends on their sequence homology, and it is expected that highly diverged elements would form less stable foldback structures. The relationship between observed editing level in our set of 31,666 Alu repeats and the sequence divergence of each Alu repeat from the consensus of its respective subfamily is shown in Figure 7B. A decrease in editing levels is seen with an increase in diversity, suggesting that an Alu element with lower sequence homology to most other Alu repeats has a lower probability of forming a suitable editing substrate. Unexpectedly, we also observed a drop in editing levels for Alu elements with low divergence from their subfamily consensus sequence. This trend may be caused by the distribution of Alu divergence. Within the human genome the majority of Alu elements have diverged by 10%–15% from their subfamily consensus (Stenger et al. 2001). Therefore, Alu elements with lower than average divergence have a lower likelihood of encountering another element of similar divergence, resulting in low editing levels for this subset. We obtained similar results when we compared the editing levels of Alu elements with the sum of the divergence of the edited Alu and its closest inverted Alu element (data not shown). In agreement with these conclusions, we find that the most populated Alu subfamily (AluSx) and the subfamilies closely related to AluSx sequences (AluSq, Sc, and Sg) show the highest levels of editing (Figure 7C). The pool of mRNAs used in this study represents a heterogeneous collection of sequences from different tissues and cell types. In analyzing the editing of Alu elements as a function of tissue origin (Figure 7D), significant differences in editing levels were found. The highest editing activities were seen in brain tissues, in trachea and thymus. These results are in accordance with prior studies that have measured the overall activity of RNA editing enzymes in selected mammalian tissues as judged by the amount of inosine detectable in the poly(A)+ fraction of RNA (Paul and Bass 1998). The two human enzymes with A-to-I RNA editing activity (ADAR1 and ADAR2) display a different but overlapping activity profile on known substrates, and their expression is highest in brain (ADAR2) and in cells of the immune system (ADAR1; Bass 2002). Furthermore, ADAR1 was found to be induced during inflammation leading to high activity in blood cells and thymus (Yang et al. 2003). These findings are also in agreement with our experimental results, which show much higher editing levels in brain-derived RNAs than in the same mRNA isolated from lung tissue (see Figures 1–3). The pool of edited Alu elements was analyzed for other features that might influence editing levels, such as the position of the edited Alu within the mRNA (3′-UTR, 5′-UTR, and coding region) or its orientation in relation to the mRNA (sense, antisense). No significant correlation of Alu editing was detected with any of these features (data not shown). Editing of Alu Repeats Shows Sequence and Structure Preferences The availability of such a large collection of A-to-I edited sequences resulting from this analysis allowed us to examine the modification pattern of edited Alu elements for potential editing hot spots or base preferences. To this end we first aligned all 141 edited Alu sequences (greater than 260 bp) in RNAs originating from Chromosome 1 and mapped the edited sites on their consensus sequence (Figure 8). Interestingly, certain adenosines are targeted in greater than 30% of the edited RNAs while other adenosines do not show any evidence of editing. The four editing hot spots all map to TA dinucleotides that are located in conserved Alu regions (greater than 80% identity), suggesting that they are base-paired in the average foldback structure. This confirms the previously proposed T/A 5′-neighbor preference for both ADARs (Polson and Bass 1994). Surprisingly, most of the 22 CpGs of the consensus sequence coincide with the location of high-frequency editing events. CpGs are known to be targeted by cytosine DNA methylation, which results in a high mutation rate, turning CpGs into either CA or TG dinucleotides (Batzer and Deininger 2002). Since less than 50% of the transcripts carry a CA or TG (edited in reverse complement) at these CpG consensus sites, the editing efficiency (edited adenosines/total adenosines) at these positions is comparable to that at the hot spots (Figure 8A, arrows). As a result of the high CpG mutation rate, frequently the Alu foldback structure of the unedited RNA is predicted to carry A–C mismatches at these positions. Editing at these sites restores the CpG repeat (CA→CI) on the RNA level and converts the A–C mismatch to an I–C base pair. Energy calculations for several predicted Alu pairs show, surprisingly, that the stability of the foldback structure is not diminished by editing but often increased because of the high frequency of I/C pair formation (data not shown). It is therefore unlikely that in the case of Alu foldback structures, RNA editing serves to resolve RNA secondary structures that interfere with the processes of splicing or translation of these RNAs, as suggested previously (Morse et al. 2002). Two typical configurations of editing sites observed in Alu elements are depicted in the magnifications of Figure 8B where either A–U pairs are turned into I–U wobble pairs in conserved regions of the sequence (ii), or A–C mismatches are converted into I–C pairs within nonconserved Alu regions (i). While the above analysis shows the qualitative features of the editing sites in Alus, determination of cis preferences was carried out by extracting 14,774 pentanucleotide sequences with the edited adenosine as the middle base and estimating the frequency of each base at positions −2, −1, 1, and 2 relative to the editing site. To correct for Alu sequence bias we performed the same analysis for a randomly chosen adenosine for each edited adenosine in our sample. We then subtracted those frequencies to obtain unbiased editing preferences (Figure 9). The presence of large, unpaired poly(A)+ tails in Alus obscures our analysis for adenosines surrounding edited A's but is informative regarding other base preferences. Position −1 shows a strong preference for C and T and aversion for G in agreement with previous studies (Bass 2002). Interestingly, we observe preferences for G in position +1 and for C or G at positions −2 and +2, which have not been described before. This preference pattern appears not to be linked to any Alu-specific structural feature and therefore possibly reflects the editing enzyme cis preferences. We also identify a preference for an editing site to be preceded or followed by another editing site (Figure 9). This data-rich assessment of sequence preferences for edited sites might be useful in an ab initio identification of new editing sites. Taken together our results identify loose RNA duplexes carrying A–C mismatches or A/U-rich regions, as favored editing targets. The high incidence of “corrective” editing at mutated CpG consensus positions in Alus raises the possibility that posttranscriptional restoration of CpG repeats in Alu primary transcripts by RNA editing contributed to the surprising retention of CpGs in Alus during evolution (Batzer and Deininger 2002). This might constitute an important consequence of A-to-I editing in view of the role of CpG islands in the regulation of gene expression. Potential Functional Implications of RNA Editing in Repetitive Elements Considering the available data on in vitro editing activities of ADARs on dsRNA molecules of different sequences and structures, it is not surprising that highly base-paired RNA foldback structures such as the ones induced by Alu inverted repeats are substrates for the editing enzymes. However, it is remarkable and maybe surprising that these predicted structures are edited in vivo at significant levels. This indicates that many of these structures do form in vivo and are readily accessible for ADARs in the nucleus. Alu elements are ideal for the formation of editable RNA structures because of their large numbers, size, and degree of conservation. We find no evidence for a sequence or otherwise specific interaction of the editing machinery with Alu sequences. Thus, other repetitive elements able to form similar structures should also be targets of A-to-I editing. Our data suggest, however, that editing levels in all other major repeat-element families that dominate the human genome (LINE, LTR, and other short interspersed elements) are very low compared to editing levels seen in Alu repeats (see Figure 6A and unpublished data). The selectivity for Alus might be explained based on the distribution features of each repetitive-element family: For example full-length L1 repeats are approximately 6 kb in length, and as a consequence, most of the time they have low chance of having a base-pairing sequence in proximity. MIR repeats, although found in significant numbers, which potentially could form foldback structures, have a low average level of conservation (30%–40% divergence) and so may be inadequately double stranded to be a substrate. MaLR elements of the LTR superfamily are present in numbers such that the average distance between an inverted pair is very high (approximately 10 kb). However, our analysis suggests that all repetitive elements might become targets of RNA editing at different stages in evolution. Young repetitive elements in their expansionary phase of evolution display features that we identify as important for being editing targets. Based on these observations it will not be surprising if repeat elements that show low levels of editing in humans are major targets in other organisms. For mRNA fractions, we estimated the inosine content due to Alu editing as follows: In 103,724 mRNAs we found 23,204 AtoG mismatches, while the same sequence sample has an average for the other transitions of 3,271. Assuming an average mRNA size of 4 kb, the ratio of inosine in the sample is estimated to be one inosine every 20,814 nucleotides (103,724 × 4,000/[23,204–3271]) generated by editing in Alu sequences. This estimation for Alu editing is in the range of one inosine in 17,000 nt (brain), one in 33,000 nt (lung, heart), to one inosine in 150,000 nt (skeletal muscle) as was experimentally determined by Bass and colleagues in the polyA-fraction of rat RNAs (Paul and Bass 1998). Since the rat genome lacks Alus, the total amount of inosine generated in human mRNAs may be much higher than in rats, unless a class of edited sequences in rats exists with a similar prevalence to Alus in humans. In any case, our data imply that most of the inosine detected in mRNA transcripts can be explained by the widespread A-to-I editing of repetitive elements. Repeat-element editing might therefore point toward an important housekeeping function for RNA editing. In contrast, the well-studied examples of editing that lead to single nucleotide and codon changes in mRNA might be less frequent cases of editing events. While a significant amount of editing occurs in mRNAs that contain repetitive elements in their exons, our results predict that the bulk of A-to-I editing takes place in intronic sequences missing from cDNA databases. This is suggested by the experimental results regarding the LUSTR, GPR81, p53, SIRT2, NFκB, and paraplegin genes, for which intronic data was available (see Figures 1A, 4, and 5A). This extensive editing of repetitive elements in pre-mRNAs creates an enormous pool for the generation of gain-of-function mutations. The involvement of editing in creating or destroying splicing sites of alternatively spliced Alu exons, along with internal editing of those exons, suggests an intriguing new mechanism for accelerated evolution. We are now in a position to analyze the extent to which this process occurs within the human transcriptome. Such a role in “stimulating” evolution, however, is unlikely to be related to the “daily” function of A-to-I RNA editing. It has been shown that hyperedited, inosine-containing RNAs are retained in the nucleus by a protein complex containing the inosine binding protein p54 (Zhang and Carmichael 2001). In view of the widespread editing of Alus this offers an intriguing mechanism to preclude aberrantly spliced mRNAs and, more generally, repetitive-element-containing RNAs from exiting the nucleus. This model, though, suggests that intronic RNA editing occurs frequently in other organisms and in other repetitive-element types as well, something that remains to be shown. A connection between A-to-I RNA editing and RNAi has recently been suggested through studies in C. elegans where inactivation of the editing machinery leads to transgene silencing (Knight and Bass 2002), and subsequent inactivation of the RNAi pathway restored transgene expression (Tonkin and Bass 2003). Furthermore, retrotransposon LTR sequences were shown to induce natural RNAi due to RNA duplex formation (Sijen and Plasterk 2003). The RNAi machinery has been implicated in gene silencing in two independent modalities: at the RNA level through degradation of mRNAs and at the chromatin structure level through induction of methylation (Dykxhoorn et al. 2003; Ekwall 2004). Both silencing pathways might be affected by editing of repetitive-element foldback structures. Silencing of RNAs containing such inverted repeats might be prevented through their modification by RNA editing and their subsequent nuclear retention (Zhang and Carmichael 2001) or by rendering those RNAs inadequate substrates of the RNAi machinery. It is possible that the observed embryonic lethality and apoptosis in A-to-I editing-deficient mice (Wang et al. 2000, 2004; Hartner et al. 2004) is related to the breakdown of this control mechanism leading to the posttranscriptional silencing of essential genes. The work presented here has been based on the analysis of cellular mRNAs that contain Alu repeat elements. However, the underlying principles probably also apply to Alu RNAs generated from transcriptionally active Alu elements. Alu elements do not encode transcription termination signals (Deininger 1989), and thus read-through transcription from transposition-competent Alu repeats can result in intramolecular Alu pairs, leading to the editing of a sequence that subsequently becomes retrotranscribed. Editing of primary transcripts of repetitive elements may have an important role in the control of their proliferation and a dedicated analysis of such transcripts for editing events represents an important future direction. A recent study by Levanon et al. (2004) reported a computational approach for the identification of heavily edited genes in the human transcriptome and found that editing mostly occurs in Alu repeat elements (greater than 92% of the substrates identified), giving us the opportunity to compare the two approaches and datasets. The computational strategy used by Levanon et al. (2004) differs substantially from ours both in the sequence dataset employed and in the methodology applied. The use of expressed sequence tags (ESTs; in contrast to our use of mRNA sequences) offers a much larger primary dataset for analysis; however, single-pass sequences have a higher error rate (Liang et al. 2000), and EST databases are biased toward sequences near 3′-termini of mRNAs (Liang et al. 2000). Levanon et al. (2004) selected candidate sequences for editing by identifying inverted repeats followed by the evaluation of AtoG mismatch rates, whereas we directly evaluated the AtoG mismatch content in repetitive elements irrespective of the presence of a nearby pairing sequence. The approach of Levanon et al. (2004) allows the discovery of edited inverted repeats that do not belong to any of the repetitive-element families (although the previously known brain substrates were missed), but it does not identify cases where a base-pairing sequence is not evident because of truncated cDNA and EST sequences and incomplete knowledge of gene boundaries. We found that for approximately one-third of the edited Alu elements a pairing Alu cannot be located within the gene boundaries as determined by known mRNAs, although in most of the cases it can be identified at the genome level. A comparison of the edited gene/mRNA datasets of the two studies shows a 34.5% overlap when gene names and symbols are compared. It should be noted, though, that editing of the same gene might reflect editing at different sites or within different Alu elements of the same gene. The two approaches are overlapping as well as complementary. Taken together, they have probably uncovered the most significant part of the heavily edited exonic sequences for which sequence data are available. From our analysis we estimate an additional approximately 4,000 edited Alu elements besides the 1,925 Alus that we have selected as a very high confidence set. Thus, it is important to note that the heavily edited sequences represent the tip of an iceberg with many more mRNAs in the human transcriptome being edited at single or a small number of positions. Materials and Methods RNA editing analysis Human brain samples were provided by the Harvard Brain Tissue Resource Center, Belmont, Massachusetts, United States; human lung cDNA was from Clontech (Palo Alto, California, United States). Total RNA isolation and reverse transcription have been described previously (Ausubel et al. 1995; Maas et al. 2001). Gene-specific PCR was performed as described earlier (Maas et al. 2001), and a list of oligonucleotide primer sequences used in this study is available on request. RNA editing analysis was done by direct sequencing of gene-specific, gel-purified RT-PCR products as described (Maas et al. 2001), using an automated ABI310 (Applied Biosystems, Foster City, California, United States) capillary electrophoresis sequencer. Human gDNA used for gene-specific PCR was isolated from the same tissues according to standard protocols (Ausubel et al. 1995). Computational procedures For analysis of the pool of human cDNA sequences we developed a program named Procedures for Repetitive Element Foldback Analysis (PREFA). We used the set of cDNA sequences from the UCSC database (July 2003) comprising 103,723 sequences (after removal of duplicate entries). The set of repetitive elements (for Alus 1,163,041 unique elements) and related information of the human genome (created with RepeatMasker based on the Repbase [Jurka 2000] release of June 2002) was obtained from the same source. For each examined repetitive-element family we first selected the subset overlapping partially or fully with genes. For Alus the number is 2,003,976, including duplicates, or 572,107 unique sequences. From this subset we then selected those overlapping with exons (31,666). The RNA and genomic sequence for each element was extracted and compared base by base for mismatches. A small number of cases with very high non-AtoG mismatches (greater than 20/element) were discarded as misaligned or erroneous. From the repetitive elements showing at least a single AtoG change we selected those where mismatch distribution cannot be accounted for by SNPs and sequence errors using the following procedure: The overall expected ratio of AtoG discrepancies relative to the total number of mismatches was calculated from the whole sample, assuming the expected AtoG mismatches to be approximately equal to the average of the rest of the transitions: The expected probability for an AtoG mismatch at a single position in a given element was calculated from the total number of mismatches found in the element in cases where other mismatches were present (2) or from the whole sample where only AtoG mismatches were found (3): Here nAtoG and nOther is the total number of AtoG and non-AtoG mismatches found for this element: Given the probability p for an AtoG mismatch to occur at any given position, the expected values for the number of AtoG were calculated: A χ2 test was calculated for each element and those with a χ2 value exceeding the critical value (for α = 0.000001) were selected as edited, and these values correspond to approximately more than five AtoG changes in the absence of any other change in the approximately 300 bp of an Alu). For each element in the high-confidence set the closest inverted element was identified among the elements present in the same gene boundaries. The distance separating the pair was calculated from the location of the first base of each element, according to the genomic sequence numbering and irrespective of their orientation. The divergence of each element was derived from the corresponding entry in the UCSC annotation database (ChrN_rmsk) representing mismatches per hundred bases. Tissue of origin of the RNAs was also derived from the UCSC mRNA annotation. For RNAs described to originate from multiple tissues, the corresponding RNAs were included in the count for each of those tissues. RNAs originating from a specific subregion of a tissue, such as subareas of the brain, were counted within the subregion but not in the whole-tissue set of RNAs. Alignment of the Chromosome 1-derived Alu sequences was performed with the MegAlign program of the DNASTAR (Madison, Wisconsin, United States) package (Lasergene) using the CLUSTAL algorithm (Jeanmougin et al. 1998). Further manual adjustments were necessary owing to the presence of simple repeats in Alu sequences. Analysis of the alignment and base counts surrounding the editing sites were done with PREFA. Supporting Information Table S1 Database of Computationally Identified Editing Targets The database lists the GenBank accession numbers, gene names, gene product description, chromosome location, and type of Alu element and location within the mRNA sequence, the identity of the most likely pairing Alu elements within the same gene, and the distance in base pairs (bp) between the pairing Alus. The positions of all predicted editing sites within the individual sequences can be viewed by pasting the accession number into the USCS genome browser (Kent et al. 2002) at http://genome.ucsc.edu/cgi-bin/hgGateway and following the link to mRNA/Genomic alignment. We found that six cDNAs map on two chromosomes (AB095924, AK021666, AK055562, AK092837, AK094425, and BC039501); details are given for the most plausible assignment. We have observed that in the 43 cases that we experimentally analyzed, usually additional editing sites were identified when directly sequencing gene-specific PCR products. (276 KB XLS). Click here for additional data file. Accession Numbers The GenBank ((http://www.ncbi.nlm.nih.gov/Genbank) accession numbers for the genetic sequences discussed in this paper are LUSTR (AB046844), KIAA0500 (AB007969), BTKI (AB037838), KIAA1497 (AB040930), and GPR81 (BC0067484). The Entrez Gene (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene) ID numbers for ADAR1, p53, SIRT2, NFκB, and SPG7 are 103, 7157, 22933, 4790, and 6687, respectively.