      Is Open Access

      Adenosine-to-inosine RNA editing and human disease

      review-article
      Genome Medicine
      BioMed Central


          Abstract

          A-to-I RNA editing is a post-transcriptional modification that converts adenosines to inosines in both coding and noncoding RNA transcripts. It is catalyzed by ADAR (adenosine deaminase acting on RNA) enzymes, which exist throughout the body but are most prevalent in the central nervous system. Inosines exhibit properties that are most similar to those of guanosines. As a result, ADAR-mediated editing can post-transcriptionally alter codons, introduce or remove splice sites, or affect the base pairing of the RNA molecule with itself or with other RNAs. A-to-I editing is a mechanism that regulates and diversifies the transcriptome, but the full biological significance of ADARs is not understood. ADARs are highly conserved across vertebrates and are essential for normal development in mammals. Aberrant ADAR activity has been associated with a wide range of human diseases, including cancer, neurological disorders, metabolic diseases, viral infections and autoimmune disorders. ADARs have been shown to contribute to disease pathologies by editing of glutamate receptors, editing of serotonin receptors, mutations in ADAR genes, and by other mechanisms, including recently identified regulatory roles in microRNA processing. Advances in research into many of these diseases may depend on an improved understanding of the biological functions of ADARs. Here, we review recent studies investigating connections between ADAR-mediated RNA editing and human diseases.


          Most cited references (58)


          Mapping the Human miRNA Interactome by CLASH Reveals Frequent Noncanonical Binding

          Introduction MicroRNAs (miRNAs) play a key role in the posttranscriptional regulation of gene expression by guiding the association between the RNA-induced silencing complex (RISC) and target RNAs (reviewed in Fabian et al., 2010). Human cells express more than 1,000 miRNAs, each potentially binding to hundreds of messenger RNAs (mRNAs) (Lewis et al., 2005), but only a small fraction of these interactions has been validated experimentally. Experiments conducted throughout the last decade have established a set of canonical rules of miRNA-target interactions (reviewed in Bartel, 2009): (1) interactions are mediated by the “seed” region, a 6- to 8-nt-long fragment at the 5′ end of the miRNA that forms Watson-Crick pairs with the target; (2) nucleotides paired outside the seed region stabilize interactions but are reported not to influence miRNA efficacy (Garcia et al., 2011; Grimson et al., 2007); and (3) functional miRNA targets are localized close to the extremes of the 3′ UTRs of protein-coding genes in relatively unstructured regions (Grimson et al., 2007). Recently, RISC-binding sites on mRNAs have been mapped transcriptome wide by crosslinking, immunoprecipitation, and high-throughput sequencing (CLIP-seq), allowing prediction of many miRNA-mRNA interactions (Chi et al., 2009; Hafner et al., 2010a; Zhang and Darnell, 2011) and yielding data consistent with the canonical rules. However, there is substantial evidence for exceptions to these rules. As examples, in C. elegans, the well-studied lin-4::lin-14 interaction involves bulged nucleotides (Ha et al., 1996), whereas the let-7::lin-41 interaction involves wobble G·U pairing (Vella et al., 2004). Human miR-24 targets important cell-cycle genes using interaction sites that are spread over almost the whole miRNA. These interactions lack obvious seed pairing and contain multiple mismatches, bulges, and wobbles (Lal et al., 2009). 
Analysis of the miR-124 targets recovered by HITS-CLIP revealed a mode of miRNA-mRNA binding that involves a G bulge in the target, opposite miRNA nucleotides 5 and 6. It has been estimated that about 15% of miR-124 targets in mouse brain are recognized by this mode of binding (Chi et al., 2012). Another, apparently rare, base-pairing pattern called “centered site” (Shin et al., 2010) involves 11 consecutive Watson-Crick base pairs between the target and positions 4–14 or 5–15 of miRNA. There are also multiple exceptions regarding the requirement for miRNA-binding sites to be located in the 3′ UTR. Functional miRNA-binding sites have occasionally been reported in 5′ UTRs (Grey et al., 2010) and, more frequently, within mRNA coding sequences (Hafner et al., 2010a; Reczko et al., 2012). Moreover, recent reports show that miRNA targets are not limited to protein-coding transcripts and can be found in noncoding RNAs (ncRNAs) that arise from pseudogenes (Poliseno et al., 2010). Together, these data indicate that miRNAs can bind to a wide variety of targets, with both canonical and noncanonical base pairing, and indicate that miRNA targeting rules may be complex and flexible. To allow direct, high-throughput mapping of RNA-RNA interactions, we previously developed crosslinking, ligation, and sequencing of hybrids (CLASH) (Kudla et al., 2011). High-throughput methods have been developed to map protein-DNA interactions, protein-RNA interactions, and DNA-DNA interactions, so CLASH completes the toolkit necessary to study nucleic acid interactomes. Here, we adapted CLASH to allow direct observation of miRNA-target pairs as chimeric reads in deep-sequencing data. Our transcriptome-wide data set reveals the prevalence of seed and nonseed interactions and the diversity of in vivo targets for miRNAs. 
Results CLASH Directly Maps miRNA-Binding Sites To recover RNA species bound to the human RISC complex, we created an N-terminal fusion of hAGO1 with a protein A-TEV cleavage site-His6 tripartite tag (PTH-AGO1). N-terminally tagged AGO proteins were used previously in many studies and were shown to be functional (Chatterjee and Grosshans, 2009; Lian et al., 2009). Actively growing Flp-In T-REx 293 cells stably expressing PTH-AGO1 were UV irradiated (254 nm) to crosslink proteins to interacting RNAs. PTH-AGO1 was purified, and interacting RNA molecules were partially hydrolyzed, ligated, reverse transcribed, and subjected to Illumina sequencing. At the ligation step, RNA molecules present in AGO-associated miRNA-target duplexes can be joined together (Figure 1A). Following RT-PCR amplification, these generate “chimeric” complementary DNAs (cDNAs), which can be identified because they contain two regions that map to sites that are noncontiguous in the transcriptome sequence (Figure 1B). When AGO1-associated RNAs were analyzed, around 98% were “single reads” representing AGO1-binding sites on RNAs, similar to those obtained with HITS-CLIP and PAR-CLIP (Chi et al., 2009; Hafner et al., 2010a). However, ∼2% were chimeric reads reflecting intermolecular stem structures present in the AGO1-associated RNAs (Figures 1A and 1B). Supporting the significance of the chimeras, 94% of the sequences involved were also recovered as single reads in at least one experiment (Figure 1C). As a control experiment, the lysate obtained from UV-irradiated human cells was mixed with an equal quantity of yeast lysates prior to CLASH analysis (details in Extended Experimental Procedures and Table S2C). This revealed that the background arising from RNA-RNA interactions formed in vitro is low, and that the large majority (>98%) of the miRNA-target RNA interactions identified by CLASH had formed in vivo in human cells. 
Although seed-mediated interactions constitute the largest class in our data, only around 37% of seed interactions involve uninterrupted Watson-Crick base pairing. This figure seemed surprisingly low but is consistent with the many observations of endogenous noncanonical miRNA targets. High-throughput studies found fewer noncanonical (or nonseed) interactions (Chi et al., 2009; Hafner et al., 2010b), but this may reflect an inherent bias in that seed binding was used to computationally identify interactions. Notably, many high-confidence AGO-binding sites identified in previous CLIP-seq data could not be assigned bioinformatically to any specific miRNA. Computational searches for miRNA-mRNA interactions have also been biased toward the identification of binding sites in 3′ UTR regions. In contrast, we observed substantial numbers of miRNA interactions in all the regions of mRNAs, with the greatest number of hits in coding sequences. Notably, different miRNAs vary in the relative proportions of targets in 5′ UTRs, coding sequences, and 3′ UTRs. As examples, miR-100 returned 4% 5′ UTR: 23% CDS: 73% 3′ UTR, whereas miR-149 returned 8% 5′ UTR: 72% CDS: 19% 3′ UTR (data not shown). To provide an overview of the key features of miRNA-mRNA interactions, we analyzed miRNA base-pairing patterns by cluster analysis. As expected, the most frequent miRNA interaction site with a target is the seed, and base pairing in this region is detected for more than half of the interactions. However, seed interactions alone are found in only a relatively small fraction of identified targets (class I, 19%). Defined classes II–III agree with previously described 3′ supplementary and compensatory sites (Grimson et al., 2007; Lian et al., 2009). Unexpectedly, we identified a substantial class of interactions (class IV, 16% of all interactions) that does not involve contacts within the seed region and resembles reported “seedless” interactions (Lal et al., 2009). 
The identification of miRNAs that predominately interact with target mRNAs using their 3′ regions helps explain the pattern of evolutionary conservation of these miRNAs. However, target mRNAs that fall into this class seem to be relatively poorly conserved in evolution, and high-throughput data show that, on average, these targets respond only weakly to miRNA binding. Our experimental data on the regulation of miR-92a targets agree with this analysis, showing a statistically significant but moderate effect of class IV interactions on mRNA stability and possibly translation in reporter constructs. The results further suggested that the 3′ motif might act cooperatively with seed interactions. It is, of course, possible that the nonseed, motif interactions have additional functions, e.g., in attracting regulatory factors or switching effector pathways. Overall, we show that noncanonical miRNA-mRNA targeting is much more widespread than anticipated. Moreover, the analysis of base-pairing patterns and of miRNA-binding site motifs indicates that individual miRNAs systematically differ in their target binding modes. Indeed, even members of the same miRNA family can manifest distinct base-pairing patterns. This was previously predicted by RepTar (Elefant et al., 2011) and was observed on a small scale in the analysis of enriched 6-mers in mRNAs recovered in AGO-immunoprecipitates following miRNA transfection (Nelson et al., 2011; Wang et al., 2010). The recently published human AGO2 crystal structure (Elkayam et al., 2012) does not exclude the possibility of noncanonical seed interactions. The trajectory of the miRNA seen in the structure leaves most base edges accessible to be read by potential target molecules. Biochemical studies show that the structure of hAGO2 is flexible, and miRNA binding stabilizes and spatially orients AGO2 domains. 
Differences in patterns of miRNA-target RNA base pairing can induce allosteric changes in the RISC complex, potentially leading to different AGO activities. This suggests that the various interaction classes and/or the specific motifs identified might have distinct functional roles. The integration of CLASH data with RNA-sequencing and proteomics should give a clearer indication of the range of miRNA functions and their relationship to miRNA-mRNA interaction patterns. Many interactions between Argonaute proteins and abundant, stable rRNA and tRNA species can be found in our data and in published high-throughput AGO-CLIP experiments (Chi et al., 2009; Hafner et al., 2010a). Evidence for miRNA-rRNA interactions has been reported, including the association of miR-206 with both nuclear preribosomes and mature cytoplasmic ribosomes (Politz et al., 2006). miR-206 is, however, specific for skeletal muscles and is not expressed in HEK293 cells. In addition, the involvement of AGO2 in pre-rRNA processing has been reported, although it is unclear whether this is dependent on the RISC pathway (Liang and Crooke, 2011). Specific, short tRNA fragments can be bound by AGO proteins and possibly function analogously to miRNAs (Burroughs et al., 2011), but there are no previous reports of tRNAs being targeted by miRNAs. It was recently proposed that “competing endogenous RNA” (ceRNA), generated from transcribed pseudogenes and long noncoding RNAs, participates in mRNA regulation by competing for miRNA binding (Salmena et al., 2011). We speculate that regulation by competition for miRNAs involves not only ncRNAs and other modestly expressed species but also the abundant stable RNAs. In some cases, the highly abundant tRNAs and rRNAs may also “buffer” miRNAs. They might potentially bind miRNAs that are in (perhaps temporary) excess over cognate targets, preventing inappropriate target binding and/or protecting unbound miRNAs against premature degradation. 
This model is supported by the observation that miRNA interactions with mRNAs have a lower average free energy than those with stable RNA species (data not shown), so authentic target mRNAs might readily recruit cognate miRNAs from the buffered pool. Interactions between pairs of distinct miRNAs were not very frequent (∼3%), but some were highly reproducible and apparently isoform specific, for example miR-30::let-7. Two published reports of miRNA-miRNA interactions reveal different outcomes. Binding of miR-107 and let-7 mutually reduced miRNA stability and activity (Chen et al., 2011), whereas binding of miR-709 alters the biogenesis of miR-15a/16-1 (Tang et al., 2012). The application of the CLASH technique to miRNAs offers many possibilities for future research. As an example, analyses of miRNA association reveal comparable distributions of miRNAs associated with the four mammalian AGO homologs (Burroughs et al., 2011; Liu et al., 2004; Meister et al., 2004; Su et al., 2009), but it is less clear whether all miRNAs target the same mRNAs when bound to different AGOs. Similarly, closely related paralogs exist for many human miRNAs, but it has been difficult to determine their relative efficiencies in mRNA targeting. The distribution of nontemplated terminal U residues among miRNAs has also been determined (Kim et al., 2010), but not how this affects targeting in vivo. More generally, the spectrum of miRNA-mRNA interactions is expected to change rapidly during differentiation and viral infection, and following metabolic shifts or environmental insults. All of these can potentially be addressed using CLASH. Experimental Procedures CLASH Analyses The previously reported protocol (Kudla et al., 2011) was extensively modified to allow miRNA target identification in mammalian cells. The experimental protocol, variants tested, and bioinformatic analyses are described in detail in the Supplemental Information. 
Cell Lines A protein A-TEV protease cleavage site 6xHis (PTH) tag was fused to the N terminus of human AGO1 and stably transfected into Flp-In T-REx 293 cells. PTH-AGO1 expression was induced with Doxycycline and confirmed by western blotting. Experimental Validation of CLASH Targets Flp-In T-REx 293-hAGO1 cells were transfected with miR-92a inhibitor or universal negative control. 48 hr posttransfection RNA was isolated, and cDNA was quantified using primers listed in Table S6. Luciferase reporter vectors were prepared by cloning short oligonucleotides containing single miR-92a-binding sites or PCR-amplified long fragments of 3′ UTRs (sequences in Table S6) into the 3′ UTR of Renilla luciferase in the psiCHECK2 vector (Promega). HEK293 cells were transfected in 96-well plates with reporter vectors or nonmodified psiCHECK2 as control together with control or miR-92a inhibitors. Luminescence of Renilla and firefly (internal reference) luciferases was measured 48 hr posttransfection. Extended Experimental Procedures CLASH Protocol Cell Preparation and Crosslinking Flp-In T-REx 293–PTH-AGO1 and control Flp-In T-REx 293 cells (Life Technologies) were seeded onto 150 mm plates (Nunc, Thermo Scientific), 4 plates per sample. The next day cells were induced for hAGO1 production with 0.5 μg/ml Doxycycline (Sigma, D9891). 36 hr post induction growing cells were UV crosslinked on ice with λ = 254 nm in Stratalinker 1800 (Stratagene), power settings = 400 mJ/cm2. Cell Lysis Directly after crosslinking cells were lysed by addition of cooled Lysis Buffer (50 mM Tris pH 7.8, 300 mM NaCl, 1% NP-40, 5 mM EDTA, 10% glycerol, 5 mM β-mercaptoethanol, protease inhibitors (Roche, cOmplete, EDTA-free)). Lysates were centrifuged at 6,400 x g for 30 min at 4°C and supernatant was collected. Unless used directly lysates were stored at −80°C. 
PTH-AGO1-RNA Purification on IgG-Dynabeads Dynabeads (Life Technologies, M-270 Epoxy) were coated in advance with rabbit IgG (Sigma, I5006) according to a published protocol (Oeffinger et al., 2007). Cell lysates were incubated with IgG-Dynabeads for 40 min at 4°C using 20 mg beads/sample and then beads were washed with LS-IgG WB buffer (50 mM Tris-HCl pH 7.8, 0.3 M NaCl, 5 mM MgCl2, 0.5% NP-40, 2.5% glycerol, 5 mM β-mercaptoethanol), HS-IgG WB buffer (50 mM Tris-HCl pH 7.8, 0.8 M NaCl, 10 mM MgCl2, 0.5% NP-40, 2.5% glycerol, 5 mM β-mercaptoethanol) and PNK-Wash buffer (50 mM Tris-HCl pH 7.5, 10 mM MgCl2, 0.5% NP-40, 50 mM NaCl, 5 mM β-mercaptoethanol). RNase A+T1 Digestion and PTH-AGO1-RNA Elution RNP complexes bound to IgG-Dynabeads were trimmed with 0.5 unit RNaseA+T1 mix (RNace-IT, Stratagene) in 500 μl PNK buffer (50 mM Tris-HCl pH 7.5, 10 mM MgCl2, 0.5% NP-40, 50 mM NaCl, 10 mM β-mercaptoethanol) for 7 min at 20°C. Then RNase solution was removed and PTH-AGO1-RNA complexes were eluted with Ni-WB I buffer (50 mM Tris-HCl pH 7.8, 300 mM NaCl, 10 mM Imidazole pH 8.0, 6 M Guanidine-HCl, 0.1 M NP-40, 5 mM β-mercaptoethanol) for 10 min at 20°C. Purification of PTH-AGO1-RNA on Ni-NTA Agarose The eluate from IgG-Dynabeads was loaded on 40 μl of Ni-NTA Agarose (QIAGEN) equilibrated with Ni-WB I for 2 hr at 4°C and then Ni-NTA beads were washed with Ni-WB I buffer. Ni-NTA beads were transferred to the spin columns (Pierce, Thermo Scientific, 69725) and from that point on all the reactions and washes were performed on the columns. Beads were subsequently washed with Ni-WB II buffer (50 mM Tris-HCl pH 7.8, 300 mM NaCl, 10 mM Imidazole pH 8.0, 0.1 M NP-40, 5 mM β-mercaptoethanol) and extensively washed with PNK-Wash buffer. 
RNA 5′ End Phosphorylation and Intramolecular Ligation Ni-NTA beads with bound PTH-AGO1-RNA complexes were incubated with 40 units T4 Polynucleotide kinase (New England Biolabs, M0201), 1 mM ATP and RNase inhibitors (RNasin, Promega, N211B) in PNK buffer for 150 min at 20°C, and then washed with Ni-WB I, Ni-WB II and PNK-Wash buffer. PTH-AGO1 bound, interacting RNA molecules were ligated together overnight using 40 units of T4 RNA ligase 1 (New England Biolabs, M0204), 1 mM ATP and RNase inhibitors in PNK buffer at 16°C. On the next day, the ligation mixture was washed out with Ni-WB I, Ni-WB II and PNK-Wash buffer. RNA Dephosphorylation and 3′ miRCat-33 Linker Ligation Ni-NTA beads were resuspended in a dephosphorylation mixture containing 8 units Thermosensitive Alkaline Phosphatase (Promega, M9910) and RNase inhibitors in PNK buffer for 45 min at 20°C and subsequently washed with Ni-WB I, Ni-WB II and PNK-Wash buffer. 3′ miRCat-33 linker ligation was performed for 6 hr at 16°C in the reaction mixture containing 800 units T4 RNA ligase 2 truncated, K227Q (New England Biolabs, M0351), 1 μM miRCat-33 linker (IDT), 10% PEG 8000 and RNase inhibitors in PNK buffer. Beads were washed with Ni-WB I, Ni-WB II and PNK-Wash buffer. Radioactive Labeling of RNA and Elution of PTH-AGO1-RNA Complexes RNAs bound to PTH-AGO1 were radiolabelled with 32P-γ-ATP (Perkin Elmer, 6000 Ci/mmol) in a mixture containing 40 units T4 Polynucleotide kinase, RNase inhibitors in PNK buffer for 30 min at 37°C. Beads were washed with Ni-WB I, Ni-WB II and PNK-Wash buffer. AGO1-RNA complexes were eluted by incubation with Ni-EB for 5 min at room temperature. TCA Precipitation, SDS-PAGE, and Transfer to Nitrocellulose Protein-RNA complexes from the Ni-NTA eluate were precipitated with 2 μg BSA (Sigma) and 17% TCA. Pellets were washed twice with cold acetone, dried, resuspended in 10 μl water and NuPAGE LDS SB (Life Technologies, NP0007). 
Samples were incubated for 10 min at 65°C and then resolved on a 4%–12% Bis-Tris NuPAGE gel (Life Technologies, NP0335) in NuPAGE SDS MOPS running buffer (Life Technologies, NP0001). Protein-RNA complexes were transferred to nitrocellulose membrane (GE Healthcare, Amersham Hybond ECL) in the wet-transfer tank (Bio-Rad, Mini Trans Blot cell) with NuPage transfer buffer (Life Technologies, NP00061) and 10% methanol for 2 hr at constant voltage = 100V, on ice. Air-dried membrane was exposed on film (GE Healthcare, Amersham Hyperfilm MP) for about 1 hr at room temperature. Developed film was aligned with the membrane and the radioactive bands corresponding to the PTH-hAGO1 complexes were cut out. Proteinase K Treatment and RNA Isolation Cut out bands were incubated with 100 μg of Proteinase K (Roche) and proteinase K buffer (50 mM Tris-HCl pH 7.8, 50 mM NaCl, 10 mM imidazole pH 8.0, 0.1% NP-40, 1% SDS, 5 mM EDTA, 5 mM β-mercaptoethanol) for 2 hr at 55°C. The membrane was discarded. Released RNA was extracted with phenol-chloroform-isoamyl alcohol (PCI) mixture and ethanol precipitated overnight with 20 μg GlycoBlue (Ambion, Life Technologies). Pellets were washed twice with 70% ice-cold EtOH, then air-dried. 5′ Phosphorylation and 5′ Linker Ligation RNAs were phosphorylated with 10 units of T4 Polynucleotide kinase in RNA ligase 1 buffer (New England Biolabs) for 30 min at 37°C. Then 10 units of RNA ligase 1 and barcoded 5′ linker (final conc. 5 μM; Table S6) were added and the reaction mixture was incubated for 10 hr at 16°C. RNA was subsequently PCI extracted and ethanol precipitated. Reverse Transcription and Library Amplification by PCR Whole isolated RNA was reverse transcribed with Superscript III Reverse Transcriptase (Life Technologies) according to manufacturer instructions, using miRCat-33 primer (IDT) at 50°C. RNA was then degraded by addition of 2 μl RNase H (New England Biolabs) for 30 min at 37°C. 
cDNA was amplified using primers P5 and primer PE_miRCat_PCR (Table S6) and TaKaRa LA Taq polymerase (Takara Bio, RR002M). Library Size Selection PCR products were separated on a 3% MetaPhor agarose (Lonza)/TBE gel with SYBRSafe (Life Technologies). The gel was run in 1 x TBE on ice. Then the gel was scanned on a Fujifilm scanner (Fla-5100) and two gel slices with different DNA fragment sizes were cut out: LB: 90 - 100 nt and HB: 110 - 180 nt. After purification with Gel Extraction Kit with MinElute columns (QIAGEN), libraries were stored at −20°C. For high-throughput sequencing, LB:HB fractions were mixed at a 1:3 ratio. Cell Lines Tagged human AGO1 was constructed by ligating the Protein A - TEV protease cleavage site - 6xHis (PTH) tag (amplified from the pRS415 plasmid, a generous gift from Markus T. Bohnsack, Frankfurt, Germany) to the N-terminus of human AGO1 amplified from cDNA, which was then cloned into the pcDNA5/FRT/TO vector (Life Technologies). This vector was further used for stable transfection of Flp-In T-REx 293 cells (Life Technologies) according to the manufacturer’s protocol. Before the CLASH experiments PTH-AGO1 expression was induced with Doxycycline (final concentration = 0.5 μg/ml) for 36 hr, with the aim of achieving expression close to that of endogenous AGO1. Expression of tagged AGO1 was confirmed by western blotting, using peroxidase anti-peroxidase soluble complex (PAP; Sigma, P1291) recognizing the protein A tag. Determination of the Background Level of RNA-RNA Interactions To assess the frequency at which RNA-RNA interactions recovered by CLASH were formed in vitro following cell lysis, lysate obtained from the crosslinked HEK cells expressing PTH-AGO1 was mixed with an equal quantity (measured by RNA content) of yeast cell lysate, prior to CLASH analysis based on the optimized protocol (E4). 
Analysis of a sample in which no yeast lysate was present (E7) showed the background of human sequences that could be, incorrectly, matched to the yeast genome (0.37% of single reads and 0.11% of miRNA chimeras) (Table S4). Correcting for this, experiment E8, in which yeast lysate was included, reveals that 1.1% of single reads and 1.6% of miRNA chimeras arose from yeast RNAs. This low level of background probably reflects the very stringent purification conditions used during CLASH. The same analysis was performed for data sets E9 and E10, prepared using an alternative protocol (E6), in which ligation was performed prior to protein denaturation. This gave a higher level of yeast-human chimeras (∼10%) (Table S4). This indicates that variations in the CLASH protocol can have substantial effects on the signal to noise ratio in the resulting cDNA library, and also confirms the conclusion that protocol E4 is optimal. Experimental Validation of CLASH Targets by qPCR Flp-In T-REx 293-hAGO1 cells were transfected with 25nM miR-92a inhibitor or universal negative control (miRIDIAN Hairpin inhibitors, ThermoScientific, IH-300510-06, IN-001005-01) and Lipofectamine 2000 (Life Technologies). 48 hr post-transfection RNA was isolated using TRI Reagent (Sigma, T9424), DNase treated (TURBO DNase, Ambion) and further purified by ethanol precipitation. cDNA was prepared using SuperScript III (Life Technologies) and random 6-mer oligonucleotides (Promega) at 45°C. cDNA was quantified using Roche LightCycler 480 Real-Time PCR System, Universal Probe Library System (Roche) and primers listed in Table S6. All the primers were designed using Roche UPL Assay Design Center. The expression level of the genes of interest (GOI) was internally normalized to the expression level of GAPDH. 
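The normalization of a gene of interest to GAPDH can be illustrated with a short sketch. The text does not state which quantification model was applied, so the example below assumes the common comparative Ct approach with perfect amplification efficiency (a doubling per cycle); the function names and Ct values are hypothetical.

```python
def relative_expression(ct_target: float, ct_reference: float) -> float:
    """Expression of a target gene relative to a reference gene.

    Assumes 100% PCR efficiency, so one cycle corresponds to a 2-fold
    difference (2^-dCt). This is an assumption, not the documented method.
    """
    delta_ct = ct_target - ct_reference
    return 2.0 ** (-delta_ct)


def fold_change(ct_target_treated: float, ct_ref_treated: float,
                ct_target_control: float, ct_ref_control: float) -> float:
    """Fold change of the normalized target level, treated versus control."""
    treated = relative_expression(ct_target_treated, ct_ref_treated)
    control = relative_expression(ct_target_control, ct_ref_control)
    return treated / control


# Hypothetical Ct values: the target is detected one cycle earlier after
# inhibitor treatment while the GAPDH Ct is unchanged.
print(fold_change(21.0, 20.0, 22.0, 20.0))
```

With real data, the amplification efficiency of each primer pair would normally be measured rather than assumed.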
Luciferase Reporter Assays Luciferase reporter vectors were prepared by cloning short oligonucleotides containing single miR-92a-binding sites or PCR amplified long fragments of target 3′UTRs into the 3′ UTR of Renilla luciferase in the psiCHECK2 vector (Promega) using XhoI and NotI restriction sites. All the sequences of oligonucleotides used for creating reporters and reporter mutagenesis are listed in the Extended Experimental Procedures. HEK293 cells were transfected in 96-well plates with 50 or 10ng of reporter vectors or non-modified psiCHECK2 as control together with 25nM control or miR-92a inhibitors (miRIDIAN Hairpin inhibitors, Dharmacon, IN-001005-01 and IH-300510-06) or 6.25nM miR-92a controls and inhibitors (IDT, custom oligonucleotides, sequences listed in Extended Experimental Procedures) using Lipofectamine 2000 (Life Technologies). Luminescence of Renilla and firefly (internal reference) luciferases was measured 48 hr post-transfection using the Dual-Glo Luciferase Assay System (Promega, E2920) according to manufacturers’ instructions and SpectraMax M5 Multi-Mode Microplate Reader (Molecular Devices). Microarray Analysis Flp-In T-REx 293-hAGO1 cells were transfected with 6.25nM miR-92a inhibitor or negative control (IDT synthesized custom oligos, sequences in the list of oligonucleotides) and Lipofectamine 2000 (Life Technologies), each sample in 5 replicates. 48 hr post-transfection RNA was isolated using TRIZOL (Invitrogen) and concentrated by ethanol precipitation. Total RNA was processed and quantified on Affymetrix GeneChip Human Exon 1.0 ST by Source BioScience. 
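For the luciferase reporter assays described above, firefly luciferase serves as the internal reference for Renilla. The sketch below shows one plausible way to reduce such readings to a relative reporter activity; the per-well ratio followed by a comparison of means is an assumption, and the function names and numbers are hypothetical.

```python
from statistics import mean
from typing import List, Sequence


def renilla_firefly_ratios(renilla: Sequence[float],
                           firefly: Sequence[float]) -> List[float]:
    """Per-well Renilla signal divided by the firefly internal reference."""
    if len(renilla) != len(firefly):
        raise ValueError("each well needs one Renilla and one firefly reading")
    return [r / f for r, f in zip(renilla, firefly)]


def relative_activity(sample_ratios: Sequence[float],
                      control_ratios: Sequence[float]) -> float:
    """Mean normalized activity of a sample relative to a control condition."""
    return mean(sample_ratios) / mean(control_ratios)


# Hypothetical readings: reporter with miR-92a inhibitor versus control inhibitor.
inhibitor = renilla_firefly_ratios([200.0, 300.0], [100.0, 100.0])
control = renilla_firefly_ratios([100.0, 150.0], [100.0, 100.0])
print(relative_activity(inhibitor, control))
```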
Bioinformatic Analysis BLAST Transcriptome Database The components of our custom BLAST database come from the following sources: (1) BioMart (http://www.biomart.org), data set: GRCh37.p2: mRNA (cDNA sequence of protein coding genes limited to those with RefSeq protein ID), pseudogenes (miRNA_pseudogene, misc_RNA_pseudogene, Mt_tRNA_pseudogene, polymorphic_pseudogene, RNA_pseudogene, snoRNA_pseudogene, snRNA_pseudogene, tRNA_pseudogene), snoRNA, snRNA, processed transcripts, lincRNA, miscellaneous RNA, mitochondrial rRNA, mitochondrial tRNA, 5.8S rRNA, 5S rRNA; (2) genomic tRNA database (http://gtrnadb.ucsc.edu/): human tRNAs; (3) National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov): rRNA sequence (NR_003287.2, NR_003286.2); (4) miRBase release 15 (http://www.mirbase.org): mature human miRNAs. Redundant sequences were removed from the database, as were pseudogenes, lincRNAs and processed transcripts that matched full-length mature miRNA sequences. The mRNA sequences and coordinates in the database match those of ENST entries from ENSEMBL Release 60 (http://useast.ensembl.org/info/website/archives/index.html). For the analysis of experiments E7-E10 we supplemented our database with sequences of S. cerevisiae chromosomes obtained from the Saccharomyces Genome Database. Mapping Reads by BLAST Illumina reads were assigned to the experimental and control samples using their 5′ barcode sequences, and barcodes were clipped. Illumina reads were then mapped by nucleotide BLAST (Altschul et al., 1990) version 2.2.24 to the human transcriptome database described above. BLAST was run with the expectation value set to 0.1, and other parameters set to default. We discarded all antisense matches. Approximately 70% of all mapped reads were mapped unambiguously. The remaining sequences mapped to more than one transcript, typically originating from a single gene. 
To uniquely assign those reads that were mapped with the same e-value to more than one transcript, we ranked all transcripts according to their total number of BLAST hits and assigned the read to the transcript with most hits. Identification and Clustering of Chimeras We identified as chimeras those reads where: (1) the read yielded at least two BLAST hits with e-value ≤ 0.1 against the human transcriptome database described above; (2) the hits were either directly adjacent in the read, or with up to 4 nucleotides gap or overlap between hits; and (3) the hits were in sense orientation with respect to the BLAST database. We rejected reads with more than ten hits in the database. For reads that could be assigned to chimeras in more than one way (for example if one of the fragments was mapped to two different transcripts from the same gene), we used the following criteria to decide which mapping to retain, in the following order: (1) we retained the mapping with the most significant e-value of both fragments; (2) we preferentially retained miRNA-mRNA chimeras; (3) we retained the mapping to transcripts with the highest number of non-chimeric reads. Criterion (2) was introduced to avoid assigning chimeras to pseudogenes, if an mRNA transcript with the same mapping quality was found. After mapping the chimeric reads, we clustered those chimeras for which the coordinates of both mapped fragments overlapped, independently of the order of fragments in the chimera. We first performed the clustering for each of the six experiments independently, and then clustered the six experiments together. Finally we adjusted the transcript regions found in clustered chimeras by extending the miRNA fragment to the full length mature miRNA, and by adding 25 nucleotides of downstream sequence to each mRNA fragment. 
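The three chimera criteria and the ten-hit rejection rule can be sketched as a small filter over the BLAST hits of one read. This is a minimal illustration rather than the authors' pipeline: the `Hit` record, its field names, and the choice to count all hits of a read toward the ten-hit limit are assumptions, and the later tie-breaking and clustering steps are not covered.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Tuple

MAX_EVALUE = 0.1          # e-value cutoff stated in the text
MAX_GAP_OR_OVERLAP = 4    # nucleotides of gap or overlap allowed between hits
MAX_HITS_PER_READ = 10    # reads with more hits are rejected


@dataclass(frozen=True)
class Hit:
    """One BLAST hit of a read (hypothetical record layout)."""
    transcript: str
    read_start: int   # 0-based, inclusive position in the read
    read_end: int     # exclusive position in the read
    evalue: float
    sense: bool       # True if the hit is in sense orientation


def chimera_candidates(hits: List[Hit]) -> List[Tuple[Hit, Hit]]:
    """Return pairs of hits that satisfy the chimera criteria for one read."""
    if len(hits) > MAX_HITS_PER_READ:
        return []
    usable = sorted(
        (h for h in hits if h.sense and h.evalue <= MAX_EVALUE),
        key=lambda h: (h.read_start, h.read_end),
    )
    pairs = []
    for first, second in combinations(usable, 2):
        # Positive offset is a gap, negative offset is an overlap.
        offset = second.read_start - first.read_end
        if abs(offset) <= MAX_GAP_OR_OVERLAP:
            pairs.append((first, second))
    return pairs


read_hits = [
    Hit("miR-92a", 0, 22, 1e-5, True),
    Hit("ENST_A", 23, 60, 1e-8, True),    # 1-nt gap after the miRNA fragment
    Hit("ENST_B", 23, 60, 0.5, True),     # fails the e-value cutoff
    Hit("ENST_C", 23, 60, 1e-8, False),   # antisense, discarded
]
for first, second in chimera_candidates(read_hits):
    print(first.transcript, second.transcript)
```

A read that yields several candidate pairs would then need the ordered tie-breaking rules described above to select a single mapping.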
We refer to the mRNA fragments adjusted in this way as the “CLASH targets,” and to the miRNA-mRNA fragment pairs as “CLASH interactions.” Identification of AGO1-Binding Site Clusters from Nonchimeric Reads We identified clusters of non-chimeric reads using a method similar to that previously described (Chi et al., 2009). First, we randomly distributed all distinct reads mapped to each transcript along the same transcript using BEDTools (Quinlan and Hall, 2010), to calculate the maximum random cluster height for each transcript. We then repeated the procedure one hundred times, and retained the maximum cluster height across the repeats. This value was used as a transcript-specific cluster height threshold for the identification of actual clusters. High-confidence clusters were identified as regions of transcripts in which the coverage equaled or exceeded the threshold, the coverage was higher than twenty reads, and the region of high coverage was at least twenty nucleotides wide. We then calculated high-confidence cluster peaks as the area of twenty nucleotides on each side of the position of highest coverage within the cluster. Overlap between CLASH Targets and Experimentally Validated miRNA Targets from Published Databases To check whether our CLASH data set contains known, experimentally validated miRNA targets, we compared it with two databases: miRTarBase (Hsu et al., 2011) and TarBase (Vergoulis et al., 2012). Information in these databases is limited: target and miRNA names are often ambiguous (e.g., LIN-28B and Lin28 are both used to identify the same gene), and detailed information about the binding site is often missing. 
To make the comparison possible we downloaded both databases and modified them as follows: 1) we simplified all the miRNA names to “miR-number” (and let-7); 2) we filtered TarBase to only retain human genes with an ENSEMBL gene ID (ENSG) represented in our custom database; 3) we translated gene names in miRTarBase to ENSEMBL gene IDs using the “Gene ID Conversion” tool from Database for Annotation, Visualization and Integrated Discovery (DAVID) (Dennis et al., 2003) and only genes represented in our custom database were retained; 4) we removed redundant interactions within each database. As a consequence each miRNA-mRNA interaction was represented in a simple “miRNA – ENSG” fashion. Overlap between CLASH Targets, AGO-Binding Sites, and Bioinformatically Predicted miRNA Targets The number of CLASH targets that overlapped with AGO1-binding sites obtained in the present study was estimated using BEDTools (Quinlan and Hall, 2010) intersect, using either the positions of all non-chimeric reads mapped to RNA, or the positions of high-confidence cluster peaks calculated as described above. To calculate the overlap between CLASH targets and high-confidence AGO-binding sites obtained in the PAR-CLIP study we first mapped the sequences of the 17,319 clusters from Table S4 in (Hafner et al., 2010a) to our transcriptome database. 15823 clusters were successfully mapped. We then used BEDTools to intersect the positions of CLASH targets and AGO binding clusters. To calculate the degree of overlap between CLASH targets and AGO-binding clusters that would be expected by chance, we randomly placed AGO-binding clusters on: (1) the same transcript in which each cluster was found; (2) a set of random transcripts with a distribution of expression levels matching the transcripts with AGO clusters; or (3) a set of transcripts randomly selected from the human transcriptome. 
The expression levels of human transcripts in 293 cells were obtained from the microarray data in (Hafner et al., 2010a). We then calculated the number of CLASH targets that overlapped the randomly placed clusters using BEDTools. To calculate the overlap between CLASH targets and bioinformatically predicted miRNA targets by miRanda (John et al., 2004), TargetScan (Lewis et al., 2005), PITA (Kertesz et al., 2007), PicTar (Krek et al., 2005) and RNAhybrid (Rehmsmeier et al., 2004), we extracted the coordinates of predicted miRNA target sites in the hg18 genome from the Functional RNA Project website (www.ncrna.org). We converted CLASH targets from transcriptome to hg19 coordinates using a custom script based on the ENSEMBL Perl API (available on request), then to hg18 coordinates using liftOver (Kent et al., 2002), and retained the targets mapped to a single genomic position that corresponded to a 3′UTR of a RefSeq gene, only considering those 3′UTR fragments that did not overlap with a CDS of a RefSeq gene on either strand. To generate the control interaction data set, we randomly distributed CLASH targets in the 3′UTR of RefSeq genes (excluding those fragments overlapping with a CDS). We then used BEDTools to overlap CLASH or control interactions with bioinformatically predicted targets. The enrichment was calculated as N_CLASH/N_control, where N_CLASH is the number of CLASH interactions that matched bioinformatic predictions, and N_control is the number of control interactions that matched bioinformatic predictions. To call a match we required that the CLASH interaction and the prediction share both (1) the miRNA name and (2) the target position in the mRNA. To decide whether the CLASH targets and predictions refer to the same or different miRNA, we converted miRNA names to a simplified miR-number format, using the regular expression /[a-zA-Z]*-[0-9]*/. 
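The name simplification described above can be illustrated with a minimal Python sketch. It applies the quoted regular expression as written; the function name and the stripping of a leading "hsa-" species prefix are assumptions made for this illustration (without that step the pattern would match the prefix itself), and the original analysis may have handled names differently.

```python
import re

# Pattern quoted in the text: letters, a hyphen, then digits.
MIRNA_PATTERN = re.compile(r"[a-zA-Z]*-[0-9]*")


def simplify_mirna_name(name: str) -> str:
    """Reduce a miRNA name to a 'miR-number' (or 'let-7') form.

    Assumption for this sketch: a leading 'hsa-' species prefix is
    removed first, because the pattern would otherwise match 'hsa-'.
    """
    if name.startswith("hsa-"):
        name = name[len("hsa-"):]
    match = MIRNA_PATTERN.search(name)
    # Names without a hyphen are returned unchanged.
    return match.group(0) if match else name


def same_mirna(name_a: str, name_b: str) -> bool:
    """Two names refer to the same miRNA if their simplified forms agree."""
    return simplify_mirna_name(name_a) == simplify_mirna_name(name_b)
```

Under this scheme, family members such as miR-92a and miR-92b collapse to the same simplified name, which is the intended effect when comparing data sets that use inconsistent nomenclature.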
Analysis of Base Pairing in miRNA-Target Interactions To calculate free energies of binding and determine base pairing between miRNA and target fragments of chimeric reads we used programs from the UNAFold suite (Markham and Zuker, 2008). First, chimeric reads were divided into miRNA and target fragments based on the BLAST analysis. If the miRNA sequence was trimmed then it was replaced with the full-length miRNA sequence from miRBase (Griffiths-Jones et al., 2008). Target fragments were extended by 25 nucleotides at the 3′ end in case the length of the sequencing reads (which ranged from 50 to 100 nts) was too short to include the miRNA-binding site. Then we used the hybrid-min program with default parameters to calculate minimum hybridization energy of the two fragments of chimera that represented interacting RNA molecules. To convert the numeric representation into a Vienna-style dot-bracket format we used a modified version of M. Zuker’s Ct2b.pl script. Evolutionary Conservation of CLASH Targets To analyze evolutionary conservation of CLASH targets, we used per-nucleotide conservation scores among 46 vertebrate genomes calculated using the PhyloP algorithm (Pollard et al., 2010), and downloaded from the UCSC genome browser (Fujita et al., 2011). We selected CLASH targets that were located in 3′UTRs of RefSeq genes, with at least 100 nt of 3′UTR sequence on each side. The CLASH targets were centered at the 5′ end of the longest predicted miRNA-mRNA stem in each target. To calculate the average profile of evolutionary conservation around CLASH targets, the PhyloP score was calculated in the region from −100 to +100 nt from the center, averaged across all targets, and normalized to the number of targets for which the PhyloP score was available. 
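The averaging step for the conservation profile can be sketched as follows. This is an illustrative Python example only: it assumes each target contributes a list of per-position PhyloP scores over the same window (for example 201 positions for −100 to +100 nt), with None marking positions where no score is available. The function name and data layout are hypothetical.

```python
from typing import List, Optional


def average_profile(
    profiles: List[List[Optional[float]]],
) -> List[Optional[float]]:
    """Average per-position scores across targets.

    Each position is normalized to the number of targets for which a
    score is available at that position; positions with no scores at
    all yield None.
    """
    if not profiles:
        return []
    length = len(profiles[0])
    if any(len(p) != length for p in profiles):
        raise ValueError("all profiles must cover the same window")
    means: List[Optional[float]] = []
    for pos in range(length):
        values = [p[pos] for p in profiles if p[pos] is not None]
        means.append(sum(values) / len(values) if values else None)
    return means
```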
The conservation scores for individual CLASH targets were defined as (C_CLASH - C_control), where C_CLASH is the mean PhyloP score within the longest predicted miRNA-mRNA stem in each target, and C_control is the mean PhyloP score in regions spanning nt −50 through −10 and +10 through +50, relative to the ends of the stem. We used a paired t test to compare the mean conservation within and outside the predicted stems. Analysis of CLASH Target Downregulation after Inhibition of 25 miRNAs To analyze the effect of miRNA inhibition on mRNA transcript levels in specific sets of transcripts, we used the miRNA depletion data from (Hafner et al., 2010a). In this study, 25 miRNAs were depleted using an antisense 2′-O-methyl oligonucleotide cocktail, and mRNA levels were measured by the Human Genome U133 Plus 2.0 Array from Affymetrix. We averaged the log2 GC-RMA signal for all probes matching each transcript, discarding probes matching multiple genes. Probes were assigned to transcripts using data downloaded from ENSEMBL. We then calculated the log2-enrichment of each transcript upon miRNA depletion, by subtracting the average signal in mock-transfected cells from the average signal in 2′-O-methyl oligonucleotide-transfected cells. To calculate the background distribution of mRNA enrichment upon depletion of 25 miRNAs, we randomly selected one transcript for each gene, and then selected a subset of transcripts with expression levels matching those of transcripts in which CLASH targets were identified. We plotted the cumulative distribution of log2-enrichment for these transcripts. As a positive control we used the 331 experimentally validated targets of the 25 miRNAs for which depletion data was available, downloaded from the miRTarBase on August 1, 2011. 
We then plotted cumulative distributions of log2-enrichment scores after miRNA depletion, for the 1,995 transcripts in which CLASH targets for at least one of the 25 miRNAs were found, or for subsets filtered by: seed complementarity; location of the CLASH target in 5′ UTR, coding sequence, or 3′ UTR; predicted basepairing energy between miRNA and its target; presence of non-chimeric read clusters overlapping with the mRNA fragment of each chimera. Analysis of CLASH Target Downregulation after miR-92a Inhibition Data acquisition and normalization from Affymetrix GeneChip Human Exon 1.0 ST arrays were done by Source BioScience. We averaged the log2 RMA signal for all probes matching each transcript, discarding probes matching multiple genes. We only accepted probes for which expression was detected at p < 0.05 in at least 4 experiments. Probes were assigned to transcripts using data downloaded from ENSEMBL. We then calculated the log2-enrichment of each transcript upon miRNA depletion, by subtracting the average signal in mock-transfected cells from the average signal in miR-92a inhibitor-transfected cells. Wilcoxon tests were used to check whether the log2-enrichment differs between sets of genes. Identification of Base-Pairing Classes by K-Means Clustering We extracted the miRNA basepairing pattern for each CLASH interaction using a modified version of the Ct2B.pl script written by M. Zuker. We then converted the dot and bracket strings into a numeric format by replacing dots with zeroes, and brackets with a number F that represented the minimum energy of each interaction, rescaled from 0 to 5. This was calculated as F = 5 × (dG − dGlow)/(dGhigh − dGlow), where dG is the minimum energy of the interaction as calculated by hybrid-min, dGlow was set to −11 kcal/mol, and dGhigh to −16 kcal/mol. F was capped at 0 at the bottom and 5 at the top. 
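The rescaling and encoding just described can be written out as a short Python sketch. The formula and the two energy bounds are taken from the text; the function names are hypothetical, and treating both opening and closing brackets as paired positions is an assumption of this example.

```python
from typing import List

DG_LOW = -11.0   # kcal/mol, maps to F = 0
DG_HIGH = -16.0  # kcal/mol, maps to F = 5


def rescale_energy(dg: float) -> float:
    """Rescale a hybridization energy to F in [0, 5].

    F = 5 * (dG - dGlow) / (dGhigh - dGlow), capped at 0 and 5.
    """
    f = 5.0 * (dg - DG_LOW) / (DG_HIGH - DG_LOW)
    return min(5.0, max(0.0, f))


def encode_pattern(dot_bracket: str, dg: float) -> List[float]:
    """Replace dots with 0 and brackets with the rescaled energy F."""
    f = rescale_energy(dg)
    return [0.0 if symbol == "." else f for symbol in dot_bracket]
```

For instance, an interaction with dG = −13.5 kcal/mol lies halfway between the two bounds and receives F = 2.5, so every paired position in its row of the clustering matrix carries that value.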
We next performed K-means clustering on the matrix of 18,514 interactions and 22 miRNA positions using Gene Cluster 3.0 (downloaded from http://bonsai.hgc.jp/~mdehoon/software/cluster/software.htm), to group interactions with similar basepairing patterns. We performed 50 runs of K-means clustering with K set to 4, 5, or 6, using Euclidean distance, and we selected the grouping into 5 clusters for presentation, as it yielded a clear separation into well-defined classes of comparable size. We used two independent approaches to analyze the clustering patterns expected by chance. First, we randomly reassigned (shuffled) miRNA-mRNA pairs found in chimeras, before performing the folding prediction and clustering. This overestimated the number of interactions expected among random miRNA and mRNA fragments because many targets of abundant miRNAs were assigned by chance to the same miRNA. As a second approach, we randomly permuted (scrambled) the target sequence found in each interaction, while conserving the nucleotide content of the targets. Again, this approach is conservative because keeping the nucleotide content will act to increase apparent interaction strengths. Nevertheless, the basepairing patterns observed in the two randomized sets were very different from the one seen for CLASH interactions. The basepairing patterns were represented graphically as heatmaps using Java TreeView (Saldanha, 2004), using contrast = 5.0 and a linear color scale. To analyze basepairing patterns for subsets of CLASH interactions, the clustering was performed first for the entire data set and the relevant interactions were extracted next, to preserve the order of clusters. Clustering each subset independently yielded similar results. Identification of Enriched Sequence Motifs in CLASH Targets To find overrepresented motifs in CLASH targets, we first extracted all miRNAs with at least 10 targets found. 
For each miRNA separately we ran MEME (Bailey and Elkan, 1994), with the settings: -dna -mod zoops -maxw 7 -nmotifs 1, using the default 0-order Markov model based on nucleotide frequencies in the training set as the background model. Changing the maximum motif length to 8 or 9, the maximum number of motifs to 2, or the number of nucleotides by which CLASH targets were bioinformatically extended (between 0 and 50) did not change the conclusions. The motifs identified by MEME for targets of each miRNA were then aligned to the reverse-complemented miRNA sequence using FIMO (Grant et al., 2011), with the setting --output-pthresh 0.01. We then selected the motifs with FIMO q-value (FDR) < 0.05, and MEME Bonferroni-corrected p-value < 0.05, yielding a set of 108 high-confidence motifs. For motifs that could be mapped to the reverse-complemented miRNA sequence in more than one way, the mapping with the most significant q-value was selected. (Details of the MEME analysis are available upon request.)
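The final selection step can be sketched in Python as below. This is a hypothetical illustration: it assumes the MEME and FIMO results have already been parsed into dictionaries with the keys "mirna", "meme_p" (Bonferroni-corrected p-value) and "fimo_q" (q-value), which are names invented for this example and not output fields of either tool.

```python
from typing import Dict, List


def select_motifs(
    records: List[dict], q_max: float = 0.05, p_max: float = 0.05
) -> Dict[str, dict]:
    """Keep high-confidence motif mappings, one per miRNA.

    A record passes if its FIMO q-value and its MEME corrected p-value
    are both below the thresholds; when a motif maps to the miRNA in
    more than one way, the mapping with the smallest q-value is kept.
    """
    best: Dict[str, dict] = {}
    for rec in records:
        if rec["fimo_q"] >= q_max or rec["meme_p"] >= p_max:
            continue
        current = best.get(rec["mirna"])
        if current is None or rec["fimo_q"] < current["fimo_q"]:
            best[rec["mirna"]] = rec
    return best
```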

            Point mutation in an AMPA receptor gene rescues lethality in mice deficient in the RNA-editing enzyme ADAR2.

            RNA editing by site-selective deamination of adenosine to inosine alters codons and splicing in nuclear transcripts, and therefore protein function. ADAR2 (refs 7, 8) is a candidate mammalian editing enzyme that is widely expressed in brain and other tissues, but its RNA substrates are unknown. Here we have studied ADAR2-mediated RNA editing by generating mice that are homozygous for a targeted functional null allele. Editing in ADAR2-/- mice was substantially reduced at most of 25 positions in diverse transcripts; the mutant mice became prone to seizures and died young. The impaired phenotype appeared to result entirely from a single underedited position, as it reverted to normal when both alleles for the underedited transcript were substituted with alleles encoding the edited version exonically. The critical position specifies an ion channel determinant, the Q/R site, in AMPA (alpha-amino-3-hydroxy-5-methyl-4-isoxazole propionate) receptor GluR-B pre-messenger RNA. We conclude that this transcript is the physiologically most important substrate of ADAR2.

              Widespread A-to-I RNA Editing of Alu-Containing mRNAs in the Human Transcriptome

              Introduction On the molecular level, the complexity of higher organisms is based on the number of different gene products available for structural, enzymatic, and regulatory functions. Posttranscriptional and/or posttranslational mechanisms have an important role in generating RNA and protein diversity (Baltimore 2001). One posttranscriptional processing pathway present in higher eukaryotes is RNA editing by adenosine deamination involving modification of individual adenosine bases to inosine in RNA by adenosine deaminase acting on RNA (ADARs; reviewed in Bass 2002; Schaub and Keller 2002; Maas et al. 2003). Since inosine acts as guanosine during translation, A-to-I conversion in coding sequences leads to amino acid changes and often entails changes in protein function (Seeburg et al. 1998; Bass 2002; Schmauss and Howe 2002). The power of RNA editing in generating protein diversity lies in the fact that usually both the edited and unedited versions of the RNA and/or protein coexist in the same cell, and the ratio between the unedited and multiple edited variants can be regulated in a cell type-specific or time-dependent manner. Crucial functional properties of neurotransmitter receptors are regulated by A-to-I editing in the central nervous system (Seeburg et al. 1998; Schmauss and Howe 2002), and inactivation of editing enzymes in mice (Higuchi et al. 2000) and in the fruit fly (Palladino et al. 2000) have resulted in profound neurological phenotypes. In addition to amino acid changes, A-to-I RNA editing can theoretically lead to the alteration of transcriptional start and stop codons, as well as that of RNA splice sites. In only one case though has the creation of a splice acceptor site through intronic RNA editing been described (Rueter et al. 1999). 
Currently it is not known if the recoding of mRNAs at single codon positions is the main function of A-to-I RNA editing or if other types of editing events with as yet unknown roles in the regulation of gene expression are more widespread. The recently reported embryonic lethality in mice with ADAR1 deficiency indicates that additional substrates for this enzyme exist that function during early embryonic development (Wang et al. 2000, 2004; Hartner et al. 2004). Furthermore, a role for ADAR1 in the immune system is widely accepted, as one of its isoforms is interferon induced (Patterson and Samuel 1995) and upregulated in immune cells during chronic inflammation (Yang et al. 2003). The ablation of editing enzymes in Caenorhabditis elegans resulted in transgene silencing, suggesting that the RNA editing and RNA interference (RNAi) pathways intersect (Knight and Bass 2002). This notion was recently confirmed by findings that the behavioral phenotype of ADAR-deficient worms could be rescued by inactivation of the RNAi pathway (Tonkin and Bass 2003). Since both RNAi and RNA editing target double-stranded RNA (dsRNA) molecules, RNA editing could suppress gene silencing by preventing the formation of small interfering RNAs (siRNAs). A recurring theme of edited sequences is the involvement of an imperfectly dsRNA foldback structure (Higuchi et al. 1993). The importance of base-paired RNA elements for site-selective editing to occur is also mirrored in the presence of dsRNA binding domains in ADAR enzymes (Bass 2002). At present, though, it is not possible to predict if and to what extent a given RNA molecule is a substrate for A-to-I RNA editing in vivo. Despite recent progress in identifying additional genes that undergo RNA editing (Morse and Bass 1997; Morse et al. 2002; Hoopengardner et al. 2003), the total number of currently known A-to-I edited genes in mammals is still small (Bass 2002). 
However, the activity of the mammalian editing machinery, as measured by inosine content in mRNA fractions (Paul and Bass 1998), is much higher than expected based on the current number of known substrates. Furthermore, ADARs are ubiquitously expressed in mammalian tissues, but almost all ADAR targets identified to date reside in the brain (Bass 2002; Maas et al. 2003). This discrepancy between signs that A-to-I editing is omnipresent and the scarcity of identified targets has puzzled researchers in the field for some time, wondering where all the edited transcripts are. In this study we identify a minimum of 1,445 edited human mRNAs present in existing databases. Clusters of adenosine-to-guanosine (AtoG) discrepancies in these cDNAs are the result of RNA editing involving intramolecular pairs of inverted Alu repeat sequences, repetitive elements that represent approximately 10% of the human genome and are concentrated in and around genes (Batzer and Deininger 2002). We also characterize functional consequences of the observed editing events and the factors that determine editing levels in Alu repeats and their modification patterns. The prevalence of Alu elements in primate genes, together with our experimental and computational analysis, suggests that the vast majority of primary human gene transcripts (greater than 85% of RNAs with average structure) are subject to A-to-I RNA editing. We show how editing might influence the alternative splicing of exonized Alu elements and discuss the implications of this extensive modification of mRNAs bearing repetitive elements for the regulation of gene expression. 
Results/Discussion Clusters of AtoG Discrepancies between Genomic and cDNA Sequences Are Due to A-to-I RNA Editing and They Are Located in Alu Repeat Elements A hallmark of an A-to-I RNA editing event is an AtoG transition when comparing genomic and cDNA sequences of the affected gene since inosine base-pairs with cytosine and therefore is replaced by guanosine during reverse transcription and PCR amplification. However, AtoG discrepancies between genomic and cDNA sequences can also be due to single-nucleotide polymorphisms (SNPs) or errors in databases. Therefore the search for edited sequences on a genome-wide basis is not feasible solely based on this single feature. However, in some cases of editing, not a single, but a cluster of AtoG discrepancies between genomic and cDNA sequences is evident within a stretch of a few hundred nucleotides (Patton et al. 1997; Morse et al. 2002; Rosenthal and Bezanilla 2002). Therefore, we decided to inquire whether clusters of AtoG transitions seen in cDNA/genomic DNA (gDNA) sequence comparisons might represent bona fide editing events, since multiple base changes, all being of the AtoG type, are not likely accounted for by cosegregating SNPs or sequencing errors. In an initial screen for candidate genes, we used the Human Unidentified Gene-Encoded (HUGE) database of ca. 3,000 human cDNAs derived from the Kazusa cDNA sequencing project (Kikuno et al. 2002). Several examples of cDNA sequences were found that within a window of 200–300 nt differ at several positions from the genomic sequence, such that the cDNA harbors a G where the genomic counterpart specifies an A. AtoG differences that coincide with an annotated A/G SNP were filtered out. Table 1 shows a list of all 26 genes from the HUGE database showing greater than two AtoG transitions in the exonic regions. Remarkably, we found that in all cases except one (KIAA0001) the location of the AtoG cluster coincides with the position of an Alu repeat element in the cDNA. 
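The screening idea described above can be made concrete with a minimal Python sketch. It assumes the genomic and cDNA sequences are already aligned without gaps and have equal length; the function names, the 300-nt window and the minimum of three sites (reflecting "greater than two AtoG transitions") are illustrative choices, and SNP filtering is not shown.

```python
from typing import List


def atog_positions(genomic: str, cdna: str) -> List[int]:
    """Return 0-based positions where the genome has A and the cDNA has G."""
    if len(genomic) != len(cdna):
        raise ValueError("sequences must be aligned to the same length")
    return [
        i
        for i, (g, c) in enumerate(zip(genomic.upper(), cdna.upper()))
        if g == "A" and c == "G"
    ]


def has_atog_cluster(
    positions: List[int], window: int = 300, min_sites: int = 3
) -> bool:
    """True if at least min_sites AtoG positions fall within one window."""
    ordered = sorted(positions)
    for i in range(len(ordered) - min_sites + 1):
        # Span from the i-th site to the (i + min_sites - 1)-th site.
        if ordered[i + min_sites - 1] - ordered[i] < window:
            return True
    return False
```

A single AtoG difference is compatible with a SNP or a sequencing error, so only sequences for which has_atog_cluster returns True would be carried forward as editing candidates in this sketch.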
As with Alu elements, most AtoG transition clusters are localized in 5′-UTR and 3′-UTR sequences and few in coding regions. Alu elements are short interspersed elements found in all primates, which are approximately 300 nt in length (Batzer and Deininger 2002). There are about 1.4 million copies of Alus from several closely related subfamilies present in the human genome, comprising approximately 10% of its mass (Lander et al. 2001). The enrichment of Alu repeats in gene-rich regions of the genome (Chen et al. 2002) makes their prevalence in transcribed sequences even more pronounced. Their high CpG dinucleotide content renders Alu sequences targets for methylation and implicates them in the regulation of gene expression (Rubin et al. 1994). Clusters of A/G discrepancies that mapped to Alu repeats had been noted before in the HUGE database (Kikuno et al. 2002). Furthermore, of ten newly identified editing targets in C. elegans (Morse and Bass 1999; Morse et al. 2002) and 19 in human brain (Morse et al. 2002), most were located in repeat elements. These findings suggested that repetitive elements, such as Alus, might be frequent targets for A-to-I RNA editing. To better understand the connection between Alus and the observed AtoG clusters, we analyzed experimentally the cDNAs from all 25 candidate genes for RNA editing in human brain. Total RNA and gDNA were isolated from the same human brain specimen to eliminate false positives from unmapped A/G SNPs. For all 25 genes in vivo RNA editing was detected by single-run sequencing of gene-specific RT-PCR products, and for five of them the editing efficiency was quantitatively evaluated through repeated experiments. Extents of editing ranged from less than 2% to 90% at individual sites (Table 1; Figures 1–3). 
Intramolecular Pairs of Oppositely Oriented Alus Are Responsible for Alu Element Editing Since a prerequisite for A-to-I RNA editing is the presence of a partially base-paired RNA foldback structure (Higuchi et al. 1993; Bass 2002), the observed modifications in Alu repeats might be the result of two oppositely oriented, base-pairing repeat elements located within the same RNA molecule. For each of the 25 genes with edited, exonic Alu elements we find such oppositely oriented Alu repeats in the same pre-mRNA, many of which are located in intronic sequences. To determine if the predicted Alu pairs and the calculated foldback structures (Figures 1A, 3, and 4) actually form in vivo, we analyzed experimentally the predicted Alu partners from the pre-mRNA for four of the identified editing targets (LUSTR, KIAA0500, Bruton's tyrosine kinase [BTKI], and KIAA1497). In each case we found that the closest, oppositely oriented Alu repeat undergoes A-to-I RNA editing as well. Because of the abundance of Alu elements in human pre-mRNAs, most primary transcripts contain one or more pairs of oppositely oriented Alus. If a majority of them is indeed subject to A-to-I RNA editing in vivo, it should be possible to predict RNA edited genes by identifying inverted pairs of Alu repeats in pre-mRNA transcripts. As a proof of principle, the analysis was extended to four arbitrarily chosen genes (p53, SIRT2, NFκB, and paraplegin [SPG7]) containing pairs of Alu repeats as seen schematically in Figure 4B–4E. In all four cases, editing in the Alu elements that are predicted to form a dsRNA foldback structure is readily detectable. Many primary gene transcripts allow several energetically favorable foldback structures to be predicted for a given Alu that involve different combinations of Alu pairs. Do all these alternative Alu-pair foldback structures exist in vivo, and are they therefore subject to RNA editing? 
To address this question we examined the editing pattern of the G-protein-coupled receptor 81 (GPR81; identified through a computational search as described below). GPR81 contains four Alu elements, one sense and three antisense oriented, in the 3.6-kb pre-mRNA and was selected based on Alu repeat configuration and transcript length. If the alternative foldback structures depicted in Figure 5 coexist in vivo, all four Alu elements should show signs of editing with the level of editing indicating how prominent each of the alternative structures is. According to the analysis of GPR81 pre-mRNA, all three configurations form in vivo with variant II possibly being the dominant one since AluSp and AluJo show the highest levels of editing (Figure 5). These results suggest that Alu elements in human mRNAs are subject to RNA editing by ADARs because of foldback structures formed between two oppositely oriented Alus present within the same primary transcript. Editing of Alus Is Tissue Dependent and It Alters Codons and Pre-mRNA Splice Sites of Alternatively Spliced Alu Exons Exonic Alu repeat elements are predominantly located in the 5′- and 3′-UTRs of mRNAs, and as a result, most cases of Alu editing occur in noncoding regions. However, some editing events predict amino acid changes (Table 1). Among the identified genes for which we performed a detailed, quantitative editing analysis several unique and recurring features emerge regarding the locations and functional implications of the editing events. The LUSTR1 cDNA codes for a G-protein-coupled seven-transmembrane receptor (also termed GPR107 or KIAA1624), with three AtoG discrepancies located within an alternatively spliced AluJo-derived exon that leads to the in-frame insertion of 29 amino acids between transmembrane regions V and VI of the protein (see Figure 1A). 
The experimental analysis revealed a total of ten editing sites within this Alu element (see Figure 1B), including two major sites that lead to amino acid changes (H/R and Q/R sites). Interestingly, editing levels at all positions were significantly different in human brain (19%–58%) compared to lung (less than 5%), suggesting a tissue-specific regulation of editing (see Figure 1B). Analyzing the RNA editing pattern of LUSTR1 pre-mRNA revealed additional intronic editing sites, one of which represents the splice acceptor adenosine (AG to IG) in intron 15 (22% edited in brain; see Figure 1B). Editing at this position is predicted to lead to the exclusion of the Alu exon, indicating that the alternative splicing of exon 15a might be coregulated by RNA editing of its splice junction. This is to our knowledge the first documented example where A-to-I RNA editing acts to destroy a pre-mRNA splice signal. A picture similar to LUSTR1 emerges from analysis of the gene for human inhibitor of BTKI (also termed KIAA1417; Liu et al. 2001; Strausberg et al. 2002). Again, an alternatively spliced Alu exon (located between constitutive exons 22 and 23) is affected. This time two AluSx elements are positioned in opposite orientation at the start and end of the exon (see Figure 3). Inclusion of the exon using the splice acceptor site provided by AluSx− leads to the premature termination of translation within this exon with all editing sites located in the 3′-UTR. Editing levels at 20 sites throughout the Alu element range from less than 5% to 31% in human brain, whereas cDNAs isolated from human lung again displayed few editing sites with low editing levels of less than 5%. A splice site is also subject to editing in BTKI, this time affecting an additional alternative splice acceptor site within AluSx−. On the pre-mRNA level this position is edited to 15%. 
However, in transcripts that use the weak upstream splice acceptor site (underlined with a dashed line in Figure 3B; as in the HUGE database clone hh15303), the additional alternative splice site (underlined with a solid line in Figure 3B) is highly edited, raising the possibility that edited BTKI pre-mRNA preferentially follows the alternative splicing pathway (data not shown). The analysis of GPR81 revealed another case of Alu exon alternative splicing and, surprisingly, a new mechanism showing how RNA editing might affect RNA processing. Within the AluSp+ element located in the 3′-UTR of GPR81 transcripts a splice donor site (AT to IT) is generated in 57% of primary transcripts by RNA editing. This is predicted to give rise to alternatively spliced mRNA products represented by GenBank entry AF385431 (see Figure 5B). This is, to our knowledge the first reported case of potential splice donor site creation by RNA editing. It is possible that here the Alu element was inserted into the 3′-UTR exon of the GPR81 gene and has evolved into a state where it is a single mutation away from initiating the birth of a novel intron. Posttranscriptionally RNA editing provides the final base change to create the new splice site. This scenario is supported by the fact that in mice the GPR81 gene lacks introns. It is intriguing that we find cases where editing in alternatively spliced Alu exons, or within adjacent splice sites, interferes with or counteracts exon formation of an Alu repeat. It suggests that RNA editing might be more generally involved in the regulation of Alu exonization. Recently, it has been shown that more than 5% of the alternatively spliced exons in the human genome are Alu derived (Sorek et al. 2002). Exonization of Alu repeats occurs via activating mutations in mostly antisense-oriented, intronic Alus generating a novel splice acceptor site (Mitchell et al. 1991; Lev-Maor et al. 
2003), and it has been speculated that exonization of transposable elements in general is a major mechanism for the generation of novel exons (Kreahling and Graveley 2004). A large number of intronic Alu elements are just a single mutation away from being exonized (Lev-Maor et al. 2003; Kreahling and Graveley 2004), and in some cases the constitutive splicing of an intronic Alu has been shown to cause a genetic disorder (Mitchell et al. 1991; Knebelmann et al. 1995; Vervoort et al. 1998). In this context RNA editing may partially counteract genomic mutations that lead to the incorporation of deleterious novel exons while maintaining their potential to form exons with beneficial functions through further mutation. Furthermore, RNA editing in Alus might be involved also in the generation of novel introns as seems to be the case in GPR81. Statistically, however, the exonization of intronic Alus would be much more frequent than the intronization of exonic Alus because of the abundance of Alus in introns. A Transcriptome Wide Screen for Edited Alu Repeats The results presented above show that clusters of AtoG mismatches in cDNA/gDNA sequence comparisons represent an effective way to identify authentic editing cases with a low rate of false positives. Since all clusters of AtoG discrepancies mapped to repeat elements, we wondered how prevalent the editing of Alu or other repeat elements is in the human transcriptome. Therefore, we devised a database search procedure to identify pairs of inverted repetitive elements in human mRNAs exhibiting AtoG transitions. Initially, a limited search was carried out for closely spaced (less than 2 kb) inverted pairs of human Alu, MIR, and L1 repeat elements that overlap with exonic sequences and for which an mRNA sequence can be found in GenBank entries. This search, involving about one-third of all repeat elements in the human genome, identified 71 mRNAs with exonic repetitive-element pairs (51 Alu, six L1, six MER, and eight MIR). 
From those mRNAs, 27 displayed clusters of AtoG changes, all in Alu elements. Fourteen of these genes were chosen for experimental analysis, and all 14 proved to be subject to A-to-I RNA editing (Table 2). Since these initial results indicated a high prevalence of editing in Alu elements, we decided to carry out a comprehensive search involving all elements present in cDNA sequences. We analyzed the total of 103,723 human mRNA sequences (from the University of California, Santa Cruz [UCSC] Genome database [Kent et al. 2002], July 2003 assembly) for overlaps with repetitive elements of the L1, Alu, MaLR, and MIR families. For Alus, 17,406 mRNAs (16.8%) contained a total of 31,666 complete or partial repetitive-element sequences. Comparing the cDNA sequences with their genomic counterpart revealed that the number of AtoG discrepancies within Alu repeats is more than seven times higher than the average number of the other transitions (23,204 versus 3,271 [the average for GtoA, CtoT, and TtoC transitions]). In fact, the number of AtoG mismatches is higher than all other eleven types of nucleotide discrepancies combined (Figure 6A). While the finding that non-AtoG transitions (GtoA, CtoT, and TtoC) are approximately three times more frequent than transversions is in line with results from previous studies analyzing gDNA sequences (Lander et al. 2001; Venter et al. 2001), there is no explanation for the observed excess of AtoG mismatches relative to other base transitions. Alu sequences carry 22–23 CpG dinucleotides, which are known to show high mutation rates because of C-methylation, and as a consequence, these positions should display an elevated frequency of SNPs. Nevertheless, an elevated number of SNPs would lead to a rise in AtoG as well as GtoA mismatches when comparing a representative population of cDNAs with the corresponding genomic sequence. 
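As a minimal sketch of the cDNA/gDNA comparison described above, the following Python function tallies mismatch types between two sequences that are assumed to be already aligned and of equal length. The function name `count_mismatches` and the toy sequences are hypothetical; the program used in the study is not shown in the text, and real data would first require an alignment step.

```python
from collections import Counter

BASES = "ACGT"

def count_mismatches(gdna: str, cdna: str) -> Counter:
    """Tally mismatch types (e.g. 'AtoG') between aligned genomic and cDNA sequences."""
    if len(gdna) != len(cdna):
        raise ValueError("sequences must be aligned to the same length")
    counts = Counter()
    for g, c in zip(gdna.upper(), cdna.upper()):
        # Skip matching positions, gaps and ambiguous bases
        if g != c and g in BASES and c in BASES:
            counts[f"{g}to{c}"] += 1
    return counts

# Made-up toy sequences: inosine is read as G in cDNA, so edited
# adenosines appear as AtoG discrepancies.
genomic = "TTAGACATAGGCA"
cdna = "TTGGACGTAGGCG"
tally = count_mismatches(genomic, cdna)
```

Summing such tallies over all elements of a family gives the per-type totals that are compared in the text (AtoG versus GtoA, CtoT, and TtoC).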
Thus, we concluded that the excess exclusively in the number of AtoG discrepancies in Alus over other base changes may reflect cases of bona fide A-to-I editing at the RNA level. We then devised a statistical approach to distinguish repetitive elements that show AtoG mismatches due to sequencing errors and SNPs from those that have undergone A-to-I RNA editing. The method was based on the observation above that Alu elements subject to RNA editing undergo multiple base modifications that result in a cluster of AtoG discrepancies (5–30) between cDNA and gDNA. The probability that a cluster of several AtoG discrepancies is due to sequencing errors or SNPs (in the absence of an increased number of other nucleotide discrepancies indicating low-quality sequence data) is negligible. Thus the number of clustered AtoG changes can be used to distinguish genuinely edited elements from elements with aberrant or non-editing-related base changes. For each Alu element with AtoG discrepancies, we computed a χ2 test comparing the observed number of AtoG discrepancies with the expected number, based on the number of non-AtoG mismatches present in the same sequence. Elements with a χ2 higher than the critical value for α = 0.00001 (corresponding to a probability of one in 100,000 that the observed AtoG transitions are due to SNPs or sequencing errors) were selected as “edited” and will be called so throughout the rest of the manuscript. Using this approach we found that out of those 17,406 mRNAs with one or more exonic Alu elements, 1,445 (8.3%) mRNAs are edited within one or more of the Alu sequences (for a full list of edited mRNAs see Table S1). When looking at all the 31,666 Alu elements present within these 17,406 RNAs, we find that 1,925 (6.1%) Alu elements are “edited,” while another 4,574 Alu elements (14.4%) show AtoG discrepancies but fail to pass our probability cutoff (Figure 6B). 
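The selection step can be illustrated with a simplified sketch. It assumes a one-degree-of-freedom goodness-of-fit χ2 over two categories (AtoG versus all other mismatches) and takes the expected AtoG fraction `p_atog` as an input. The exact expected-value formulas of the study are not reproduced here, so the function names, the two-category layout, and the value of `p_atog` used in any call are assumptions for illustration only.

```python
import math

def chi_square_atog(n_atog: int, n_other: int, p_atog: float) -> float:
    """Goodness-of-fit chi-square (1 df) for AtoG versus other mismatches."""
    total = n_atog + n_other
    expected_atog = total * p_atog            # requires 0 < p_atog < 1
    expected_other = total * (1.0 - p_atog)
    return ((n_atog - expected_atog) ** 2 / expected_atog
            + (n_other - expected_other) ** 2 / expected_other)

def p_value_df1(chi2: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))

def is_edited(n_atog: int, n_other: int, p_atog: float, alpha: float = 1e-5) -> bool:
    """Flag an element only if AtoG is in excess and the test passes the cutoff."""
    total = n_atog + n_other
    if total == 0 or n_atog <= total * p_atog:
        return False
    return p_value_df1(chi_square_atog(n_atog, n_other, p_atog)) < alpha
```

With a hypothetical `p_atog` of 0.1, an element carrying ten AtoG changes and no other mismatch is flagged, whereas an element with two AtoG changes among ten mismatches is not.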
Thus, the total of 6,499 elements (or 20.5%) represents the upper limit of potentially edited Alus in our sample (Figure 6B). The total number of Alus with GtoA discrepancies in the same sequence sample is 2,002, and we considered this value to reflect base changes that are due to SNPs and sequencing errors. Assuming a similar number for random AtoG and GtoA mismatches, we can subtract this number from the total count of potentially edited Alus, obtaining 4,497 cases (14.2%) as the approximate number of actually edited elements. In order to validate our screening approach, we performed an identical analysis for GtoA, CtoT, and TtoC mismatches. Compared to the 1,925 AtoG-edited Alu elements in mRNA, we found 12 GtoA, 11 CtoT, and 11 TtoC cases of “editing.” These cases may represent false positives and thus set the error level of our screen to less than 0.6%. These results suggest that out of the 103,723 human mRNAs at least 1.4% are A-to-I edited within an exonic Alu element. Apart from Alu repeats, many more low- and high-frequency repeats exist in the human genome (Venter et al. 2001) and might give rise to RNA foldback structures that result in exonic A-to-I editing. Therefore, the total number of mRNAs edited in exonic repeat sequences is probably higher than the value obtained from our analysis of Alu elements. Most Alu repeats are located in introns, and it is there where the bulk of RNA editing is expected to occur. The average number of Alu repeats per gene is 12.4 estimated for Chromosomes 21 and 22 (Grover et al. 2003). This value is comparable to the 19.3 Alus/gene estimated from our data (2,003,976/103,723: total number of Alus (nonunique) in mRNA boundaries/number of mRNAs) for the whole genome. 
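The counts quoted in this paragraph can be re-derived with a few lines of arithmetic. The sketch below only restates numbers given in the text; the variable names are illustrative.

```python
alus_total = 31_666          # exonic Alu elements examined
edited_high_conf = 1_925     # elements passing the chi-square cutoff
atog_below_cutoff = 4_574    # elements with AtoG discrepancies below the cutoff
gtoa_alus = 2_002            # Alus with GtoA discrepancies (background estimate)

# Upper limit of potentially edited Alus, then background-corrected estimate
upper_limit = edited_high_conf + atog_below_cutoff
corrected = upper_limit - gtoa_alus
fraction_edited = corrected / alus_total

# Share of all mRNAs carrying a high-confidence edited exonic Alu
edited_mrnas = 1_445
mrnas_total = 103_723
fraction_mrnas_edited = edited_mrnas / mrnas_total

# Alus (nonunique) within mRNA boundaries per mRNA
alus_per_gene = 2_003_976 / mrnas_total
```

The results are 6,499 and 4,497 elements, roughly 14.2% of exonic Alus, about 1.4% of mRNAs, and about 19.3 Alus per gene.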
Considering that based on our analysis 14.2% of exonic Alus are edited, and assuming similar editing rates for intronic Alus, we can estimate that the probability of an average pre-mRNA to be edited is approximately 1 − 0.858^19.3, or 94.7% (85.0% with the 12.4 Alu/gene estimate). While this value is an approximation, assuming that all genes have similar structures, and does not take into account editing in other repeat elements, it does reflect the magnitude of repetitive-element editing.

Distance, Conservation, and Tissue Localization Influence Which Pairs of Alu Elements Are Edited

To gain insight into the factors that determine which Alus are subject to RNA editing, and under what circumstances, the identified set of 1,925 high-confidence cases of editing in Alu elements (contained in the 1,445 mRNAs listed in Table S1) was used for further computational analysis. It was assumed that the observed editing is the result of RNA foldback structures formed between intramolecular inverted Alu repeats, as we have demonstrated for all the experimentally analyzed cases. If this hypothesis is correct, then the distance between an Alu and its closest inverted pairing element should be a critical determinant for how likely it is that a given element will be targeted by the RNA editing machinery. To test this hypothesis the closest inverted Alu was identified within the same gene for all 31,666 Alu elements. A properly oriented element was found for 19,231 of those Alu elements, and a plot was made showing the percentage of edited Alus as a function of the distance between elements (Figure 7A). The highest level of editing (approximately 16%) was found for Alu pairs 300–400 nt apart, which corresponds to slightly more than the size of a full-length Alu repeat. Editing levels subside with increasing distance as the probability for base-pairing between the two Alu elements apparently decreases. 
Alu pairs with distances below 300 nt indicate partial Alu elements, and the observed decrease in editing levels is likely because of the smaller, less energetically stable foldback structures. These results suggest that the optimal configuration of an Alu-pair stem loop involves two full-length Alus forming the stem separated by a short (10–50 bp) intervening loop sequence. Interestingly, as the distance increases we ultimately arrive at a low-level plateau of approximately 1% editing without any further drop in editing levels. RNA editing in trans caused by base-pairing Alus located in separate RNA molecules might be responsible for this “background” editing. A-to-I editing in trans does occur on pre-annealed RNA duplexes in vitro (Bass and Weintraub 1987; Nishikura et al. 1991) and could occur also in vivo if such intermolecular RNA duplexes form. In Xenopus one case of potential trans editing has been described, involving RNA duplexes formed between sense and antisense transcripts of bFGF (Saccomanno and Bass 1999). The distance dependence of the extent of editing clearly suggests that the formation of Alu–Alu stem loop structures predominantly results from intramolecular Alu inverted repeats with an upper limit of approximately 1% editing that could be due to intermolecular Alu pairings. To our knowledge, these results describe for the first time the distance relationship of long-range RNA folding interactions in vivo and how their stability is influenced by distance. More important, considering the high frequency of Alu elements in primate RNA sequences and the low levels of potential intermolecular editing observed, we conclude that intermolecular duplexes between complementary RNA sequences do not form in the nucleus at a significant rate. This raises the question of how the regulation of thousands of human messages proposed by Yelin and colleagues involving antisense transcripts works (Yelin et al. 2003). 
It might also explain why cases of editing involving endogenous sense/antisense RNA duplexes have not been reported despite evidence for extensive antisense transcription. The editing of RNAs by ADARs has been shown to be dependent on the double-stranded character of the substrates, such that editing levels and promiscuity increase with the extent of the base-paired region (Bass 2002). The human Alu family is composed of several subfamilies of different genetic ages, and their consensus sequences contain diagnostic changes distinguishing one subfamily from another. The extent of base-pairing between two oppositely oriented Alu elements, and in turn the extent of A-to-I editing, depends on their sequence homology, and it is expected that highly diverged elements would form less stable foldback structures. The relationship between observed editing level in our set of 31,666 Alu repeats and the sequence divergence of each Alu repeat from the consensus of its respective subfamily is shown in Figure 7B. A decrease in editing levels is seen with an increase in diversity, suggesting that an Alu element with lower sequence homology to most other Alu repeats has a lower probability of forming a suitable editing substrate. Unexpectedly, we also observed a drop in editing levels for Alu elements with low divergence from their subfamily consensus sequence. This trend may be caused by the distribution of Alu divergence. Within the human genome the majority of Alu elements have diverged by 10%–15% from their subfamily consensus (Stenger et al. 2001). Therefore, Alu elements with lower than average divergence have a lower likelihood of encountering another element of similar divergence, resulting in low editing levels for this subset. We obtained similar results when we compared the editing levels of Alu elements with the sum of the divergence of the edited Alu and its closest inverted Alu element (data not shown). 
In agreement with these conclusions, we find that the most populated Alu subfamily (AluSx) and the subfamilies closely related to AluSx sequences (AluSq, Sc, and Sg) show the highest levels of editing (Figure 7C). The pool of mRNAs used in this study represents a heterogeneous collection of sequences from different tissues and cell types. In analyzing the editing of Alu elements as a function of tissue origin (Figure 7D), significant differences in editing levels were found. The highest editing activities were seen in brain tissues, in trachea and thymus. These results are in accordance with prior studies that have measured the overall activity of RNA editing enzymes in selected mammalian tissues as judged by the amount of inosine detectable in the poly(A)+ fraction of RNA (Paul and Bass 1998). The two human enzymes with A-to-I RNA editing activity (ADAR1 and ADAR2) display a different but overlapping activity profile on known substrates, and their expression is highest in brain (ADAR2) and in cells of the immune system (ADAR1; Bass 2002). Furthermore, ADAR1 was found to be induced during inflammation leading to high activity in blood cells and thymus (Yang et al. 2003). These findings are also in agreement with our experimental results, which show much higher editing levels in brain-derived RNAs than in the same mRNA isolated from lung tissue (see Figures 1–3). The pool of edited Alu elements was analyzed for other features that might influence editing levels, such as the position of the edited Alu within the mRNA (3′-UTR, 5′-UTR, and coding region) or its orientation in relation to the mRNA (sense, antisense). No significant correlation of Alu editing was detected with any of these features (data not shown). 
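The distance analysis described in this section (percentage of edited Alus as a function of the distance to the closest inverted Alu) can be sketched as a simple binning step. The function name, the 100-nt bin size, and the toy records below are assumptions chosen for illustration and are not taken from the study.

```python
from collections import defaultdict

def editing_by_distance(pairs, bin_size=100):
    """Percentage of edited Alus per distance bin.

    pairs: iterable of (distance_nt, is_edited) tuples, one per Alu
           for which an inverted partner was found.
    Returns {bin_start_nt: percent_edited}, ordered by distance.
    """
    totals = defaultdict(int)
    edited = defaultdict(int)
    for distance, is_edited in pairs:
        start = (distance // bin_size) * bin_size
        totals[start] += 1
        if is_edited:
            edited[start] += 1
    return {start: 100.0 * edited[start] / totals[start] for start in sorted(totals)}

# Made-up toy records, not data from the study
toy = [(320, True), (350, False), (390, False), (380, False),
       (1250, False), (1290, False)]
profile = editing_by_distance(toy)
```

The same grouping could be applied to divergence or tissue of origin by replacing the distance value with the corresponding attribute.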
Editing of Alu Repeats Shows Sequence and Structure Preferences

The availability of such a large collection of A-to-I edited sequences resulting from this analysis allowed us to examine the modification pattern of edited Alu elements for potential editing hot spots or base preferences. To this end we first aligned all 141 edited Alu sequences (greater than 260 bp) in RNAs originating from Chromosome 1 and mapped the edited sites on their consensus sequence (Figure 8). Interestingly, certain adenosines are targeted in greater than 30% of the edited RNAs while other adenosines do not show any evidence of editing. The four editing hot spots all map to TA dinucleotides that are located in conserved Alu regions (greater than 80% identity), suggesting that they are base-paired in the average foldback structure. This confirms the previously proposed T/A 5′-neighbor preference for both ADARs (Polson and Bass 1994). Surprisingly, most of the 22 CpGs of the consensus sequence coincide with the location of high-frequency editing events. CpGs are known to be targeted by cytosine DNA methylation, which results in a high mutation rate, turning CpGs into either CA or TG dinucleotides (Batzer and Deininger 2002). Since less than 50% of the transcripts carry a CA or TG (edited in reverse complement) at these CpG consensus sites, the editing efficiency (edited adenosines/total adenosines) at these positions is comparable to that at the hot spots (Figure 8A, arrows). As a result of the high CpG mutation rate, frequently the Alu foldback structure of the unedited RNA is predicted to carry A–C mismatches at these positions. Editing at these sites restores the CpG repeat (CA→CI) on the RNA level and converts the A–C mismatch to an I–C base pair. Energy calculations for several predicted Alu pairs show, surprisingly, that the stability of the foldback structure is not diminished by editing but often increased because of the high frequency of I/C pair formation (data not shown). 
It is therefore unlikely that in the case of Alu foldback structures, RNA editing serves to resolve RNA secondary structures that interfere with the processes of splicing or translation of these RNAs, as suggested previously (Morse et al. 2002). Two typical configurations of editing sites observed in Alu elements are depicted in the magnifications of Figure 8B where either A–U pairs are turned into I–U wobble pairs in conserved regions of the sequence (ii), or A–C mismatches are converted into I–C pairs within nonconserved Alu regions (i). While the above analysis shows the qualitative features of the editing sites in Alus, determination of cis preferences was carried out by extracting 14,774 pentanucleotide sequences with the edited adenosine as the middle base and estimating the frequency of each base at positions −2, −1, 1, and 2 relative to the editing site. To correct for Alu sequence bias we performed the same analysis for a randomly chosen adenosine for each edited adenosine in our sample. We then subtracted those frequencies to obtain unbiased editing preferences (Figure 9). The presence of large, unpaired poly(A)+ tails in Alus obscures our analysis for adenosines surrounding edited A's but is informative regarding other base preferences. Position −1 shows a strong preference for C and T and aversion for G in agreement with previous studies (Bass 2002). Interestingly, we observe preferences for G in position +1 and for C or G at positions −2 and +2, which have not been described before. This preference pattern appears not to be linked to any Alu-specific structural feature and therefore possibly reflects the editing enzyme cis preferences. We also identify a preference for an editing site to be preceded or followed by another editing site (Figure 9). This data-rich assessment of sequence preferences for edited sites might be useful in an ab initio identification of new editing sites. 
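The neighbor-preference calculation described above (base frequencies at positions −2, −1, +1, and +2 around edited adenosines, minus the same frequencies around control adenosines) can be sketched as follows. The function names are hypothetical, and the choice of control sites is left to the caller; the study used one randomly chosen adenosine per edited adenosine.

```python
from collections import Counter

OFFSETS = (-2, -1, 1, 2)

def neighbor_frequencies(seq: str, sites) -> dict:
    """Base frequencies at offsets -2, -1, +1, +2 around the given adenosine indices."""
    counts = {off: Counter() for off in OFFSETS}
    used = 0
    for i in sites:
        # Require a complete pentanucleotide centred on an A
        if i - 2 < 0 or i + 2 >= len(seq) or seq[i] != "A":
            continue
        used += 1
        for off in OFFSETS:
            counts[off][seq[i + off]] += 1
    if used == 0:
        return {off: {} for off in OFFSETS}
    return {off: {b: n / used for b, n in counts[off].items()} for off in OFFSETS}

def preference(edited_freq: dict, background_freq: dict) -> dict:
    """Edited-site frequency minus background frequency, per offset and base."""
    return {
        off: {b: edited_freq[off].get(b, 0.0) - background_freq[off].get(b, 0.0)
              for b in "ACGT"}
        for off in OFFSETS
    }
```

Positive values in the result indicate bases that are enriched next to edited sites relative to the background, negative values indicate depleted bases.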
Taken together, our results identify loose RNA duplexes carrying A–C mismatches or A/U-rich regions as favored editing targets. The high incidence of “corrective” editing at mutated CpG consensus positions in Alus raises the possibility that posttranscriptional restoration of CpG repeats in Alu primary transcripts by RNA editing contributed to the surprising retention of CpGs in Alus during evolution (Batzer and Deininger 2002). This might constitute an important consequence of A-to-I editing in view of the role of CpG islands in the regulation of gene expression.

Potential Functional Implications of RNA Editing in Repetitive Elements

Considering the available data on in vitro editing activities of ADARs on dsRNA molecules of different sequences and structures, it is not surprising that highly base-paired RNA foldback structures such as the ones induced by Alu inverted repeats are substrates for the editing enzymes. However, it is remarkable and maybe surprising that these predicted structures are edited in vivo at significant levels. This indicates that many of these structures do form in vivo and are readily accessible for ADARs in the nucleus. Alu elements are ideal for the formation of editable RNA structures because of their large numbers, size, and degree of conservation. We find no evidence for a sequence or otherwise specific interaction of the editing machinery with Alu sequences. Thus, other repetitive elements able to form similar structures should also be targets of A-to-I editing. Our data suggest, however, that editing levels in all other major repeat-element families that dominate the human genome (LINE, LTR, and other short interspersed elements) are very low compared to editing levels seen in Alu repeats (see Figure 6A and unpublished data). 
The selectivity for Alus might be explained based on the distribution features of each repetitive-element family: For example, full-length L1 repeats are approximately 6 kb in length, and as a consequence, most of the time they have a low chance of having a base-pairing sequence in proximity. MIR repeats, although found in significant numbers, which potentially could form foldback structures, have a low average level of conservation (30%–40% divergence) and so may be inadequately double stranded to be a substrate. MaLR elements of the LTR superfamily are present in numbers such that the average distance between an inverted pair is very high (approximately 10 kb). However, our analysis suggests that all repetitive elements might become targets of RNA editing at different stages in evolution. Young repetitive elements in their expansionary phase of evolution display features that we identify as important for being editing targets. Based on these observations it will not be surprising if repeat elements that show low levels of editing in humans are major targets in other organisms.

For mRNA fractions, we estimated the inosine content due to Alu editing as follows: In 103,723 mRNAs we found 23,204 AtoG mismatches, while the same sequence sample has an average for the other transitions of 3,271. Assuming an average mRNA size of 4 kb, the ratio of inosine in the sample is estimated to be one inosine every 20,814 nucleotides (103,723 × 4,000/[23,204 − 3,271]) generated by editing in Alu sequences. This estimation for Alu editing is in the range of one inosine in 17,000 nt (brain), one in 33,000 nt (lung, heart), to one inosine in 150,000 nt (skeletal muscle) as was experimentally determined by Bass and colleagues in the poly(A)+ fraction of rat RNAs (Paul and Bass 1998). Since the rat genome lacks Alus, the total amount of inosine generated in human mRNAs may be much higher than in rats, unless a class of edited sequences in rats exists with a similar prevalence to Alus in humans. 
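The inosine-content estimate above is a single ratio and can be checked directly. The sketch below uses the dataset size of 103,723 mRNAs and the 4-kb average mRNA length assumed in the text; the variable names are illustrative.

```python
n_mrnas = 103_723            # mRNA sequences analysed
mean_mrna_length = 4_000     # average mRNA size in nt, as assumed in the text
atog_mismatches = 23_204     # AtoG mismatches in Alu elements
background = 3_271           # average count of the other three transitions

# AtoG mismatches in excess of the background are attributed to editing
excess_atog = atog_mismatches - background

# Nucleotides of mRNA per inosine generated by Alu editing
nt_per_inosine = n_mrnas * mean_mrna_length / excess_atog
```

The excess is 19,933 mismatches, which gives roughly one inosine per 20,800 nucleotides, in line with the figure quoted above.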
In any case, our data imply that most of the inosine detected in mRNA transcripts can be explained by the widespread A-to-I editing of repetitive elements. Repeat-element editing might therefore point toward an important housekeeping function for RNA editing. In contrast, the well-studied examples of editing that lead to single nucleotide and codon changes in mRNA might be less frequent cases of editing events. While a significant amount of editing occurs in mRNAs that contain repetitive elements in their exons, our results predict that the bulk of A-to-I editing takes place in intronic sequences missing from cDNA databases. This is suggested by the experimental results regarding the LUSTR, GPR81, p53, SIRT2, NFκB, and paraplegin genes, for which intronic data was available (see Figures 1A, 4, and 5A). This extensive editing of repetitive elements in pre-mRNAs creates an enormous pool for the generation of gain-of-function mutations. The involvement of editing in creating or destroying splicing sites of alternatively spliced Alu exons, along with internal editing of those exons, suggests an intriguing new mechanism for accelerated evolution. We are now in a position to analyze the extent to which this process occurs within the human transcriptome. Such a role in “stimulating” evolution, however, is unlikely to be related to the “daily” function of A-to-I RNA editing. It has been shown that hyperedited, inosine-containing RNAs are retained in the nucleus by a protein complex containing the inosine binding protein p54 (Zhang and Carmichael 2001). In view of the widespread editing of Alus this offers an intriguing mechanism to preclude aberrantly spliced mRNAs and, more generally, repetitive-element-containing RNAs from exiting the nucleus. This model, though, suggests that intronic RNA editing occurs frequently in other organisms and in other repetitive-element types as well, something that remains to be shown. 
A connection between A-to-I RNA editing and RNAi has recently been suggested through studies in C. elegans where inactivation of the editing machinery leads to transgene silencing (Knight and Bass 2002), and subsequent inactivation of the RNAi pathway restored transgene expression (Tonkin and Bass 2003). Furthermore, retrotransposon LTR sequences were shown to induce natural RNAi due to RNA duplex formation (Sijen and Plasterk 2003). The RNAi machinery has been implicated in gene silencing in two independent modalities: at the RNA level through degradation of mRNAs and at the chromatin structure level through induction of methylation (Dykxhoorn et al. 2003; Ekwall 2004). Both silencing pathways might be affected by editing of repetitive-element foldback structures. Silencing of RNAs containing such inverted repeats might be prevented through their modification by RNA editing and their subsequent nuclear retention (Zhang and Carmichael 2001) or by rendering those RNAs inadequate substrates of the RNAi machinery. It is possible that the observed embryonic lethality and apoptosis in A-to-I editing-deficient mice (Wang et al. 2000, 2004; Hartner et al. 2004) is related to the breakdown of this control mechanism leading to the posttranscriptional silencing of essential genes. The work presented here has been based on the analysis of cellular mRNAs that contain Alu repeat elements. However, the underlying principles probably also apply to Alu RNAs generated from transcriptionally active Alu elements. Alu elements do not encode transcription termination signals (Deininger 1989), and thus read-through transcription from transposition-competent Alu repeats can result in intramolecular Alu pairs, leading to the editing of a sequence that subsequently becomes retrotranscribed. 
Editing of primary transcripts of repetitive elements may have an important role in the control of their proliferation and a dedicated analysis of such transcripts for editing events represents an important future direction. A recent study by Levanon et al. (2004) reported a computational approach for the identification of heavily edited genes in the human transcriptome and found that editing mostly occurs in Alu repeat elements (greater than 92% of the substrates identified), giving us the opportunity to compare the two approaches and datasets. The computational strategy used by Levanon et al. (2004) differs substantially from ours both in the sequence dataset employed and in the methodology applied. The use of expressed sequence tags (ESTs; in contrast to our use of mRNA sequences) offers a much larger primary dataset for analysis; however, single-pass sequences have a higher error rate (Liang et al. 2000), and EST databases are biased toward sequences near 3′-termini of mRNAs (Liang et al. 2000). Levanon et al. (2004) selected candidate sequences for editing by identifying inverted repeats followed by the evaluation of AtoG mismatch rates, whereas we directly evaluated the AtoG mismatch content in repetitive elements irrespective of the presence of a nearby pairing sequence. The approach of Levanon et al. (2004) allows the discovery of edited inverted repeats that do not belong to any of the repetitive-element families (although the previously known brain substrates were missed), but it does not identify cases where a base-pairing sequence is not evident because of truncated cDNA and EST sequences and incomplete knowledge of gene boundaries. We found that for approximately one-third of the edited Alu elements a pairing Alu cannot be located within the gene boundaries as determined by known mRNAs, although in most of the cases it can be identified at the genome level. 
A comparison of the edited gene/mRNA datasets of the two studies shows a 34.5% overlap when gene names and symbols are compared. It should be noted, though, that editing of the same gene might reflect editing at different sites or within different Alu elements of the same gene. The two approaches are overlapping as well as complementary. Taken together, they have probably uncovered the most significant part of the heavily edited exonic sequences for which sequence data are available. From our analysis we estimate an additional approximately 4,000 edited Alu elements besides the 1,925 Alus that we have selected as a very high confidence set. Thus, it is important to note that the heavily edited sequences represent the tip of an iceberg with many more mRNAs in the human transcriptome being edited at single or a small number of positions.

Materials and Methods

RNA editing analysis

Human brain samples were provided by the Harvard Brain Tissue Resource Center, Belmont, Massachusetts, United States; human lung cDNA was from Clontech (Palo Alto, California, United States). Total RNA isolation and reverse transcription have been described previously (Ausubel et al. 1995; Maas et al. 2001). Gene-specific PCR was performed as described earlier (Maas et al. 2001), and a list of oligonucleotide primer sequences used in this study is available on request. RNA editing analysis was done by direct sequencing of gene-specific, gel-purified RT-PCR products as described (Maas et al. 2001), using an automated ABI310 (Applied Biosystems, Foster City, California, United States) capillary electrophoresis sequencer. Human gDNA used for gene-specific PCR was isolated from the same tissues according to standard protocols (Ausubel et al. 1995).

Computational procedures

For analysis of the pool of human cDNA sequences we developed a program named Procedures for Repetitive Element Foldback Analysis (PREFA). 
We used the set of cDNA sequences from the UCSC database (July 2003) comprising 103,723 sequences (after removal of duplicate entries). The set of repetitive elements (for Alus 1,163,041 unique elements) and related information of the human genome (created with RepeatMasker based on the Repbase [Jurka 2000] release of June 2002) was obtained from the same source. For each examined repetitive-element family we first selected the subset overlapping partially or fully with genes. For Alus the number is 2,003,976, including duplicates, or 572,107 unique sequences. From this subset we then selected those overlapping with exons (31,666). The RNA and genomic sequence for each element was extracted and compared base by base for mismatches. A small number of cases with very high non-AtoG mismatches (greater than 20/element) were discarded as misaligned or erroneous. From the repetitive elements showing at least a single AtoG change we selected those where mismatch distribution cannot be accounted for by SNPs and sequence errors using the following procedure: The overall expected ratio of AtoG discrepancies relative to the total number of mismatches was calculated from the whole sample, assuming the expected AtoG mismatches to be approximately equal to the average of the rest of the transitions: The expected probability for an AtoG mismatch at a single position in a given element was calculated from the total number of mismatches found in the element in cases where other mismatches were present (2) or from the whole sample where only AtoG mismatches were found (3): Here nAtoG and nOther is the total number of AtoG and non-AtoG mismatches found for this element: Given the probability p for an AtoG mismatch to occur at any given position, the expected values for the number of AtoG were calculated: A χ2 test was calculated for each element and those with a χ2 value exceeding the critical value (for α = 0.000001) were selected as edited, and these values correspond to 
approximately more than five AtoG changes in the absence of any other change in the approximately 300 bp of an Alu. For each element in the high-confidence set the closest inverted element was identified among the elements present in the same gene boundaries. The distance separating the pair was calculated from the location of the first base of each element, according to the genomic sequence numbering and irrespective of their orientation. The divergence of each element was derived from the corresponding entry in the UCSC annotation database (ChrN_rmsk) representing mismatches per hundred bases. Tissue of origin of the RNAs was also derived from the UCSC mRNA annotation. For RNAs described to originate from multiple tissues, the corresponding RNAs were included in the count for each of those tissues. RNAs originating from a specific subregion of a tissue, such as subareas of the brain, were counted within the subregion but not in the whole-tissue set of RNAs. Alignment of the Chromosome 1-derived Alu sequences was performed with the MegAlign program of the DNASTAR (Madison, Wisconsin, United States) package (Lasergene) using the CLUSTAL algorithm (Jeanmougin et al. 1998). Further manual adjustments were necessary owing to the presence of simple repeats in Alu sequences. Analysis of the alignment and base counts surrounding the editing sites was done with PREFA.

Supporting Information

Table S1 Database of Computationally Identified Editing Targets

The database lists the GenBank accession numbers, gene names, gene product description, chromosome location, and type of Alu element and location within the mRNA sequence, the identity of the most likely pairing Alu elements within the same gene, and the distance in base pairs (bp) between the pairing Alus. The positions of all predicted editing sites within the individual sequences can be viewed by pasting the accession number into the UCSC genome browser (Kent et al. 
2002) at http://genome.ucsc.edu/cgi-bin/hgGateway and following the link to mRNA/Genomic alignment. We found that six cDNAs map on two chromosomes (AB095924, AK021666, AK055562, AK092837, AK094425, and BC039501); details are given for the most plausible assignment. We have observed that in the 43 cases that we experimentally analyzed, usually additional editing sites were identified when directly sequencing gene-specific PCR products. (276 KB XLS).

Accession Numbers

The GenBank (http://www.ncbi.nlm.nih.gov/Genbank) accession numbers for the genetic sequences discussed in this paper are LUSTR (AB046844), KIAA0500 (AB007969), BTKI (AB037838), KIAA1497 (AB040930), and GPR81 (BC0067484). The Entrez Gene (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene) ID numbers for ADAR1, p53, SIRT2, NFκB, and SPG7 are 103, 7157, 22933, 4790, and 6687, respectively.
                Author and article information

                Contributors
                Journal
                Genome Med
                Genome Medicine
                BioMed Central
                1756-994X
                2013
                29 November 2013
                : 5
                : 11
                : 105
                Affiliations
                [1 ]Department of Gene Expression and Regulation, The Wistar Institute, Spruce Street, Philadelphia, PA 19104-4268, USA
                Article
                gm508
                10.1186/gm508
                3979043
                24289319
                e9018799-6c02-4994-bd63-0970f064e1de
                Copyright © 2013 BioMed Central Ltd.
                History
                Categories
                Review

                Molecular medicine
