Introduction

The mammalian genome displays a complex and extensive pattern of interlaced transcription of protein-coding genes and thousands of non-coding RNA (ncRNA; see Materials and Methods for definitions) loci [1]. Exons from ncRNA loci may overlap exons from other transcripts, including those from protein-coding genes, on the same (sense) or opposite (antisense) strand. They may also be contained within introns of other transcripts. Other ncRNAs are transcribed from bidirectional promoters: their transcriptional events, and those for neighbouring transcripts from the opposite strand, are initiated in close genomic proximity. Several recent studies investigated whether cis-antisense, intronic, or bidirectional ncRNAs regulate the transcription of protein-coding genes whose loci they overlap [2],[3]. These report complex relationships between the expression profiles of ncRNAs and their overlapping protein-coding genes in adult mice. Further work, however, is clearly needed on other types of ncRNAs, in particular intergenic and long (>200 nt) ncRNAs transcribed from outside protein-coding loci, and those expressed during development.

If most long ncRNAs convey biological functions, then the molecular mechanisms by which they do so remain almost completely unknown. For the few with established mechanisms, a general theme has emerged of them acting as transcriptional regulators of protein-coding genes (reviewed in [4]). For many such ncRNAs, the genomic location of their transcription has proved key to their mechanism. When promoters of non-coding and coding transcripts are closely juxtaposed on the chromosome, for example, transcriptional events initiated from them may be coupled.
This has been shown to occur following chromatin remodelling of chromosomal domains [5]–[7], or because of collisions between transcriptional machineries processing along sequence in close proximity [8], or because of transcriptional interference, in which transcription proceeds through a promoter sequence and thereby suppresses transcription initiation from it [8]. Other long ncRNAs are cis-regulators of transcription via indirect means involving their participation in ribonucleoprotein complexes [9],[10]. Still others, such as NRON or 7SK, act in trans: they regulate the expression of target genes or gene products from chromosomes other than the ones from which they are transcribed [11]–[13].

Cis-regulation by ncRNAs of protein-coding gene transcription is well established in imprinting [14] and for developmental genes, such as Dlx5 and Dlx6 [9], yet these represent transcriptional events that overlap on the genome. By way of contrast, we sought statistical evidence that pairs of adjacent, yet distinct, coding and non-coding loci often give rise to separate transcripts with similar spatiotemporal expression patterns indicative of positive co-operativity of transcriptional regulation. (Of course, negative co-operativity by, for example, transcriptional interference is also likely. However, such instances tend to be harder to establish experimentally owing to low levels of ncRNA expression.) We considered that if evidence of transcriptional co-operativity were forthcoming, then specific pairs of coding and non-coding transcripts could be prioritised for experimentation. In such studies, it is important to demonstrate that long ncRNAs and mRNAs are transcribed exclusively from separate promoters. Otherwise, similarities in their expression profiles may not represent distinct transcriptional events but instead single transcripts spanning both coding and non-coding exons.
We recently demonstrated several evolutionary signatures of functionality for a large set of mouse long ncRNAs and their promoters [15]. These long ncRNA sequences are largely full-length [16], map to genomic loci lying outside of protein-coding gene models and consequently are unlikely to act as antisense transcripts of a neighbouring gene locus. Although some of these ncRNAs may result from uncoordinated and inconsequential transcription, evidence of transcriptional regulation [17] and constraints on splicing motifs [15] cannot be explained by such transcriptional ‘noise’.

We were interested in whether long intergenic ncRNAs are located randomly with respect to protein-coding genes. If not, this might suggest a trend for long ncRNAs to act in cis with neighbouring protein-coding genes. To improve our chances of detecting non-uniformities of chromosomal location, we considered long ncRNAs whose genomic sequences are evolutionarily constrained and thus are more likely to be functional. If long ncRNAs possess, in general, cis-regulatory roles, one might expect their transcribed genomic regions to lie in proximity to their functionally linked protein-coding genes, and their tissue expression profiles to be similar. Finally, it might also be expected that functional long ncRNAs would tend to be linked to certain subsets of protein-coding genes that convey particular biological functions.

We investigated this cis-regulatory hypothesis for a set of 659 evolutionarily constrained long ncRNAs and found large-scale and experimental evidence for co-regulation of non-coding and protein-coding transcript pairs. For the first time, we show that these constrained long ncRNAs are not evenly distributed on the genome but rather tend to be concentrated near to genes with similar expression patterns and from particular functional classes. These findings immediately provide new and unbiased criteria for prioritising long ncRNAs for experimental investigation.
Hundreds of constrained long ncRNAs can now be targeted for detailed examination, specifically those that either (i) are expressed in the brain during development and are transcribed in proximity to transcription factor genes, or (ii) are expressed outside of the CNS in adult individuals and lie adjacent to signalling genes.

Results

This study examined large numbers of mouse long intergenic ncRNAs, partitioned by the availability or otherwise of evidence for their expression in the brain or during development, and of evidence for sequence constraint. Previous studies had focused specifically on the expression of antisense, bidirectional and intronic ncRNAs in 56-day-old adult mice or during mouse embryonic stem cell differentiation [2],[3]. For each set of ncRNA loci we examined the null hypothesis that they are located at random relative to protein-coding genes. Instead, we find strong and significant co-expression and functional biases. We show experimentally that these biases do not derive from single transcriptional events.

Constrained ncRNAs are enriched in predicted RNA secondary structures

We started by analysing 3,122 long ncRNAs transcribed from intergenic regions (see Materials and Methods) that, when considered together, exhibit evolutionary constraint [15]. Among these ncRNAs, we then identified 659 long ncRNAs that individually show evidence of constraint (hereafter termed constrained long ncRNAs): individually, their mouse-human nucleotide substitution rate is significantly (p 9 tags). (0.03 MB DOCX)

Figure S3 Co-expression of further protein-coding/non-coding RNA transcript pairs in the developing (Panels A, B, C) and adult (Panels D, E, F) CNS. Brightfield images of in situ hybridization from adjacent wild-type sections are shown.
(A) Expression of the ncRNA AK082989 appeared ubiquitous in an E13.5 embryo, although Zic4, the adjacent protein-coding gene, showed a highly specific pattern of expression in the spinal cord and forebrain at the same time-point, as was described previously (Gaston-Massuet et al., 2005). (B) At P12, Meis1 is only expressed above background levels in the developing cerebellar granule cell layer, where the ncRNA AK042766 is also expressed. (C) Grik2, however, is expressed ubiquitously in the brain, although the adjacent ncRNA AK047467 is only found at low levels in the cerebellar granule cell layer at P12. (D) Both Hip2 and its paired ncRNA, AK045758, are expressed at high levels in the cortex and the hippocampus. (E) Eif2c3 is ubiquitously expressed in the brain, as is the genomically adjacent transcribed ncRNA, AK047638. (F) Adr also shows a ubiquitous expression pattern, although expression of its paired ncRNA, AK162901, is not detected in the adult brain, consistent with the RT-PCR results (Figure S3). In all cases, the sense strand negative control probe failed to show specific staining (data not shown).

Gaston-Massuet C, Henderson DJ, Greene ND, Copp AJ (2005) Zic4, a zinc-finger transcription factor, is expressed in the developing mouse nervous system. Dev Dyn 233: 1110-5.

(3.43 MB TIF)

Figure S4 RT-PCR and 5′ RACE analysis of protein-coding and non-coding transcripts. (A) Total RNA was purified from the tissues and the developmental time-points indicated. RT-PCR was performed using primers spanning from the 3′ UTR of the protein-coding gene to the adjacent ncRNA genomic sequence. Control amplifications using the same primer pairs from genomic DNA (gDNA) and from a reaction containing no reverse transcriptase (-RT) are also shown. Importantly, RT-PCR of each protein-coding gene and ncRNA was performed from the same tissue.
Apart from Add2/AK013768, no evidence for read-through from the 3′ UTR to the ncRNA was observed that would account for the in situ hybridisation results obtained (Figure 5, Figure 6). (B) 5′ RACE products of all 12 ncRNAs analysed in this study (adjacent protein-coding genes are indicated in brackets). Total RNA was purified from the tissue corresponding to the in situ hybridisation data: adult brain (AK018196 - AK162901), P12 cerebellum (AK149041, AK042766 and AK047467) and E13.5 brain (AK082938, AK049627 and AK082969). In these reactions, a nested reverse primer approximately 300 bp from the predicted ncRNA transcription start site and a nested forward primer specific for the cap-ligated RACE anchor primer were used. A reaction containing no reverse transcriptase (-RT) is also shown for each primer pair. RACE reactions containing no TAP enzyme showed no amplification products (data not shown). (1.16 MB TIF)

Table S1 Brain-expressed ncRNAs are more likely to be constrained than ncRNAs expressed elsewhere (χ² test, p = 3×10⁻³). This observed bias is independent of the lengths of these constrained ncRNAs since the length distributions of brain- and non-brain-expressed ncRNAs are indistinguishable (p = 0.4, Kolmogorov-Smirnov test). Transcripts classified as constrained or non-constrained were divided further into those transcribed in the same (sense) or opposite (antisense) direction relative to the transcriptional orientation of the most proximal protein-coding gene. Cases where an ncRNA is located near to protein-coding genes that are transcribed on both strands have been excluded. An asterisk (*) indicates a significant association with the direction of transcription of the proximal annotated protein-coding gene (see Materials and Methods).
Non-constrained, brain-expressed ncRNAs show no directional preference, whereas non-brain-expressed ncRNAs show a small but significant bias in the opposite orientation (54% transcribed in antisense, p = 6×10⁻³). (0.03 MB XLS)

Table S2 Constrained ncRNAs that are expressed in brain or in non-brain tissues during development show a significant tendency to lie adjacent to protein-coding genes that are highly expressed in specific tissues (p<10⁻³; EFDR<0.04). Shown is the significant over-representation of ncRNAs in proximity to protein-coding genes that are expressed in these tissues, obtained by comparing the observed densities with the densities expected from randomly sampled G+C matched sequences; also shown are the lower and upper confidence intervals (CIs) at the 95% level and the standard deviation. (0.02 MB XLS)

Table S3 Brain-expressed and constrained ncRNAs show a tendency to be transcribed near to protein-coding genes expressed in brain tissues. Shown are significant (p-value<10⁻², EFDR = 0.53) and non-significant (highlighted in grey) enrichments. The observed densities of ncRNAs transcribed in proximity to protein-coding genes expressed in particular tissues have been compared to expected densities from randomly sampled G+C matched sequences (see Materials and Methods). Also shown are lower and upper confidence intervals (CIs) at the 95% level, and standard deviations (StdDev). Terms highlighted in bold correspond to results shown in Figure 4 (p-value<10⁻³, EFDR = 0.05). (0.02 MB XLS)

Table S4 Experimental EST and CAGE TC (tag cluster) support for six non-coding transcripts (AK018196, AK045528, AK013768, AK149041, AK082938, AK049627) for which in situ hybridizations (ISHs) were performed (see Figure 5, Figure 6).
Each of the six brain-derived and evolutionarily constrained ncRNA transcripts was further investigated for additional experimental evidence in the form of ESTs and CAGE TCs and the results are summarized in separate tables. For each EST and CAGE TC, its accession code, coordinates, strand, tissue type and stage are reported, and additionally for each EST its position (5′ or 3′) relative to the ncRNA is shown. (0.08 MB XLS)

Table S5 ncRNA data sets used in this study: evolutionary and functional properties. The four sets contain ncRNAs that are (i) constrained and derived from brain-associated tissues, (ii) constrained and derived from tissues outside the CNS, (iii) non-constrained and derived from brain-associated tissues and (iv) non-constrained and derived from tissues outside the CNS. For each ncRNA we give (i) its accession code, (ii) its genome coordinates (assembly mm5), (iii) its strand and (iv) whether it overlaps with: 1. EvoFold predictions of RNA secondary structure (EvoFold), 2. human copy number variants (CNVs), 3. segmental duplications (SDs), 4. PhastCons multispecies conserved elements (MCSs), and 5. indel-purified segments (IPSs). Overlap is indicated by the integer 1, lack of overlap by 0. (0.35 MB XLS)

Table S6 ncRNA data sets used in this study: accession codes of all ncRNAs in these four data sets. The four sets contain ncRNAs that are (i) constrained and derived from brain-associated tissues, (ii) constrained and derived from tissues outside the CNS, (iii) non-constrained and derived from brain-associated tissues and (iv) non-constrained and derived from tissues outside the CNS. In particular, the two unconstrained data sets are listed in their entirety since in Table S5 only those that are homologous to human sequence are shown. (0.16 MB XLS)

Text S1 Functional associations and transcript read-through.
(0.03 MB DOC)
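
As an illustration of the comparison of observed with expected densities summarised in Tables S2 and S3, the following is a minimal Python sketch. It is not the procedure used in this study, for which see Materials and Methods. It assumes that the number of ncRNAs lying near a given class of protein-coding genes has already been counted once for the real data and once for each set of randomly sampled, G+C matched sequences. The function name `empirical_enrichment`, its arguments, the one-sided empirical p-value, the percentile-based interval and the simulated counts are all assumptions made for this example, and the interval may differ from the confidence intervals reported in the tables.

```python
import random
import statistics


def empirical_enrichment(observed, null_counts):
    """Compare an observed count with counts obtained from random samples.

    observed    -- count for the real ncRNA set (hypothetical input)
    null_counts -- one count per randomly sampled, matched sequence set
    """
    n = len(null_counts)
    expected = statistics.mean(null_counts)
    std_dev = statistics.stdev(null_counts)

    # One-sided empirical p-value: fraction of random samples whose count
    # is at least as large as the observed one (+1 avoids reporting p = 0).
    at_least = sum(1 for c in null_counts if c >= observed)
    p_value = (at_least + 1) / (n + 1)

    # Approximate 95% interval of the randomised counts from their percentiles.
    ordered = sorted(null_counts)
    lower = ordered[int(0.025 * (n - 1))]
    upper = ordered[int(0.975 * (n - 1))]

    return {
        "expected": expected,
        "std_dev": std_dev,
        "ci95": (lower, upper),
        "p_value": p_value,
    }


if __name__ == "__main__":
    random.seed(0)
    # Simulated stand-in for 999 random samples; not data from this study.
    null = [sum(random.random() < 0.05 for _ in range(200)) for _ in range(999)]
    print(empirical_enrichment(25, null))
```

With 999 random samples the smallest p-value this sketch can report is 10⁻³, so the number of samples bounds the resolution of the test.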