112
views
0
recommends
+1 Recommend
0 collections
    0
    shares
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      De Novo Assembly and Transcriptome Analysis of Contrasting Sugarcane Varieties

      research-article

      Read this article at

      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          Sugarcane is an important crop and a major source of sugar and alcohol. In this study, we performed de novo assembly and transcriptome annotation for six sugarcane genotypes involved in bi-parental crosses. The de novo assembly of the sugarcane transcriptome was performed using short reads generated using the Illumina RNA-Seq platform. We produced more than 400 million reads, which were assembled into 72,269 unigenes. Based on a similarity search, the unigenes showed significant similarity to more than 28,788 sorghum proteins, including a set of 5,272 unigenes that are not present in the public sugarcane EST databases; many of these unigenes are likely putative undescribed sugarcane genes. From this collection of unigenes, a large number of molecular markers were identified, including 5,106 simple sequence repeats (SSRs) and 708,125 single-nucleotide polymorphisms (SNPs). This new dataset will be a useful resource for future genetic and genomic studies in this species.

          Related collections

          Most cited references34

          • Record: found
          • Abstract: found
          • Article: found
          Is Open Access

          Pervasive Transcription of the Human Genome Produces Thousands of Previously Unidentified Long Intergenic Noncoding RNAs

          Introduction A large fraction of the human genome consists of intergenic sequence. Once referred to as “junk DNA”, it is now clear that functional elements exist in intergenic regions. In fact, genome wide association studies have revealed that approximately half of all disease and trait-associated genomic regions are intergenic [1]. While some of these regions may function solely as DNA elements, it is now known that intergenic regions can be transcribed [2]–[7], and a growing list of functional noncoding RNA genes within intergenic regions has emerged [8]. Despite this progress, a complete understanding of the extent of intergenic transcription and the identity of these transcripts has remained elusive. The first attempts to analyze the extent and nature of intergenic transcription utilized tiling array technology [2]–[5]. These studies suggested that intergenic transcription is pervasive, but concerns about cross-hybridization have fueled a debate about the data [9]–[12]. Furthermore, in order to avoid technical difficulties associated with analyzing repeat sequence using tiling arrays, the studies were restricted to evaluating less than half of the genome. More recently, a few studies have focused on evaluating the extent of intergenic transcription using sequencing-based approaches, but with the exception of the recently published ENCODE project results [13], [14], these studies have thus far been limited to very narrow preselected regions of the genome and a small number of tissues [6], [7]. Overcoming these prior shortcomings, the ENCODE project used RNA-seq analysis in combination with other technologies to profile 15 human cell lines, providing evidence for transcription across 83.7% of the human genome and firmly establishing the reality of pervasive transcription [14]. Long intergenic noncoding RNAs (lincRNAs) are defined as intergenic (relative to current gene annotations) transcripts longer than 200 nucleotides in length that lack protein coding capacity. LincRNAs are known to perform myriad functions through diverse mechanisms ranging from the regulation of epigenetic modifications and gene expression to acting as scaffolds for protein signaling complexes [8], [15]. The first attempts to generate lincRNA annotation sets either profiled lincRNAs specific to a small number of tissues or required that transcripts harbor specific structural features such as splicing and polyadenylation [16]–[18]. The GENCODE consortium (GENCODE v7) has manually curated approximately five thousand lincRNAs that are not restricted to particular tissues or structural features, however this annotation set contains only a small fraction of all lincRNAs because it does not take advantage of RNA-seq data to identify novel transcripts [19], [20]. The limited scale of current lincRNA annotations, including GENCODE, is clearly incompatible with the massive amount of intergenic transcription observed by the ENCODE project. It should therefore be expected that the genome encodes far more lincRNAs than are currently known. In order to bridge the gap between the observation of pervasive intergenic transcription by the ENCODE project and the currently limited set of annotated lincRNAs, we performed an analysis of a unique set of RNA-seq data derived from both novel and published datasets that complements and significantly expands prior efforts [14], [16], [19]. This analysis resulted in a clear corroboration of the observations of pervasive transcription across the human genome by the ENCODE project [14]. Furthermore, analysis of previously annotated putative lincRNAs, including those of the ENCODE project [19], in addition to de novo discovery of novel lincRNAs from RNA-seq data has resulted in the compilation of the most comprehensive catalog of human lincRNAs. Owing to the extended breadth of tissues sampled and relaxed constraints on transcript structure, we find significantly more lincRNAs than all previous lincRNA annotation sets combined. Our analyses revealed that these lincRNAs display many features consistent with functionality, contrasting prior claims that intergenic transcription is primarily the product of transcriptional noise [12]. In sum, our findings corroborate recent reports of pervasive transcription across the human genome and demonstrate that intergenic transcription results in the production of a large number of previously unknown lincRNAs. We provide this vastly expanded lincRNA annotation set as an important resource for the study of intergenic functional elements in human health and disease. Results Quantitation of the Extent of Transcription of the Human Genome We have analyzed six novel RNA-seq datasets generated as part of the Human Epigenome Atlas (http://www.genboree.org/epigenomeatlas/index.rhtml) and 121 previously published RNA-seq datasets representing 23 human tissues under multiple conditions and consisting of over 4.5 billion uniquely mapped reads (Table S1). This set of RNA-seq data allowed for detection of both rare and tissue-specific transcription events that would otherwise be undetectable. In contrast to the limited reach of prior tiling array studies [2]–[5], we analyzed the much larger portion (83.4%) of the genome to which RNA-seq reads can be uniquely mapped thus providing a broader view of the transcriptome. At a threshold of one RNA-seq read, we observed reads mapping to 78.9% of the genome and, if additional evidence of transcription is taken into account including the full structures of known genes, spliced ESTs and cDNAs, we found evidence that 85.2% of the genome is transcribed (Figure 1A). This result closely agrees with the recently published findings from the ENCODE project in which evidence for transcription of 83.7% of the genome was uncovered [14]. Interestingly, even with 4.5 billion mapped reads, we observe an increase in genomic coverage at each lower read threshold implying that even more read depth may reveal yet higher genomic coverage. (Figure S1). 10.1371/journal.pgen.1003569.g001 Figure 1 The human intergenic transcriptome. (A) 85.2% of the genome has evidence of transcription, with RNA-seq reads mapping directly to 78.9% of genomic sequence. The remaining genomic coverage is comprised of known genes, spliced ESTs and spliced cDNAs. The grey circle represents the portion of the genome (83.4%) that is uniquely mappable with RNA-seq reads. (B) Protein coding (NM gene) exon, intron and intergenic region expression level distribution. Regions that have high expression have a larger fraction of base calls appearing at higher read depths. Protein coding gene exons have the highest proportion of bases with high read depth, while introns and intergenic regions have relatively more bases of low read depth though each contain many highly expressed regions. Base calls = (# of genomic positions at a specific read depth)(read depth). (C) Most intergenic transcription is outside of annotated noncoding RNA genes. The fraction of intergenic base calls within RefSeq noncoding RNA genes (NR genes) compared to other intergenic regions are compared. In (A–C), only uniquely mappable portions of the genome are considered (see Methods). As expected, protein coding gene exons contain the largest fraction of highly expressed bases (Figure 1B) as well as a disproportionately large fraction of total reads relative to their small ( 1, 3,676 lincRNAs at FPKM>10, and 925 lincRNAs at FPKM>30 (Dataset S2 and Figure S3). Surprisingly, greater than 94% of the final set of merged lincRNAs at each expression level consists exclusively of novel de novo assembled transcripts discovered from the RNA-seq data in this study (Table S3 and Dataset S2). Rather than being clustered near currently annotated genes, these lincRNAs are spread throughout intergenic sequence. 58.1% of FPKM>1 lincRNAs, 61.9% of FPKM>10 lincRNAs, and 67.7% of FPKM>30 lincRNAs are greater than 30 kilobases from the nearest protein coding gene on either strand. We annotated the lincRNAs as belonging to the same “group” (see Methods) if they are within 1 kilobase of each other to account for the possibility that some proximal lincRNA annotations may be partial structures of larger transcripts (see Discussion). This grouping resulted in 35,585 distinct lincRNA groups at FPKM>1, 2,970 at FPKM>10, and 764 at FPKM>30, and the lincRNAs in the catalog are named according to these groups (Dataset S2). These annotations are likely to be incomplete due to limitations in transcript assembly from RNA-seq data; indeed, some annotations may be fragments of larger overlapping lincRNA transcripts. Therefore, the actual number of independent lincRNAs may differ from the above numbers, and future work is needed to more fully define complete, independent lincRNA transcript annotations (see Discussion). Evaluation of LincRNA Filtering Approach We evaluated the stringency with which our filtering process removed protein coding transcripts by analyzing ribosomal profiling data from HeLa cells (Figure 2B) [22]. As expected, lincRNAs resemble the 3′ untranslated region exons of protein coding genes, with very few transcripts showing significant engagement with the ribosome. This finding is in agreement with the recent observation that GENCODE long noncoding RNAs (a subset of our catalog) generally lack mass spectrometry based evidence for translation [23]. In contrast, a recent study found that many previously annotated mouse lincRNAs bind the ribosome [24]. While the biological significance of this discrepancy is unknown, it may be the result of differences in the stringency of the filtering approach employed in the generation of the lincRNA annotations under consideration. Further confirming the stringency of our filters, a computational analysis of protein coding potential using the program PhyloCSF revealed that our set of filtered lincRNAs lack predicted protein coding capacity (Figure 2C). From these analyses we conclude that our filtering approach effectively removed protein coding transcripts from the catalog. Additional LincRNA Catalogs and Resources While the remainder of this study focuses on this catalog of putative lincRNAs (Dataset S2), we have provided multiple alternative lincRNA catalogs. These include a combined catalog of the lincRNAs identified in this study merged (see Methods) with a set of additional lincRNAs identified in Cabili, et al. [16] which passed all of our filters except were not expressed at FPKM>1 in any of the RNA-seq datasets analyzed here. The added lincRNAs are expressed at FPKM>1 in one or more of the RNA-seq datasets analyzed in Cabili et al. [16], which are entirely distinct from the datasets analyzed here, and are therefore likely to be genuine lincRNAs by our criteria. This catalog (Dataset S3) includes 54,784 lincRNAs at FPKM>1 (920 additional lincRNAs compared to Dataset S2), 3,764 lincRNAs at FPKM>10 (88 additional lincRNAs), and 942 lincRNAs at FPKM>30 (17 additional lincRNAs). In addition, we have included a catalog of spliced lincRNAs that are expressed at FPKM>1 in at least one dataset (4,576 lincRNAs, Dataset S4), of which 61% are exclusively composed of de novo assembled transcripts discovered in this study. We have also compiled a catalog of lincRNAs expressed at FPKM>1 in at least two datasets (26,455 lincRNAs, Dataset S5), of which 97% are exclusively de novo assembled transcripts discovered here. Additionally, an alternative lincRNA catalog containing only those lincRNAs expressed significantly higher than randomly sampled intergenic regions (see Methods) were included (5,267 lincRNAs, Datasets S6, S7). Furthermore, as an additional resource we provide the expression level (FPKM and raw RNA-seq read counts) of all lincRNAs (in Dataset S2) and RefSeq protein coding genes across all 127 RNA-seq datasets (Dataset S8). LincRNAs Are Specifically Regulated The degree to which intergenic transcription is functional remains uncertain and controversial [9]–[12], [25]. In order to evaluate whether the lincRNAs identified in the present study are specifically regulated as opposed to transcriptional noise, we determined if the lincRNA genes harbor canonical epigenetic marks for activation and repression with the reasoning that noise transcripts should lack coherent epigenetic modification patterns. Consistent with observations based on earlier long noncoding RNA annotations [18], [19], [26], [27], analysis of ChIP-seq and RNA-seq data [28], [29] revealed that the catalog of lincRNAs shows patterns of epigenetic modification similar to protein coding genes (Figure 3A). Activating histone marks, H3K4me3 and H3K36me3, are both significantly enriched within highly expressed lincRNAs. Similarly, the repressive mark H3K27me3 is significantly enriched within lowly expressed lincRNAs. Thus, the expression of lincRNAs appears to be specifically regulated. 10.1371/journal.pgen.1003569.g003 Figure 3 LincRNAs possess features inconsistent with transcriptional noise. (A) ChIP-seq and RNA-seq data from IMR90 cells [28], [29] were analyzed for lincRNAs and RefSeq NM genes. *P = 4.01E-7, ** P = 4.52E-9, *** P = 2.43E-14, **** P 1 in at least one of the two fractions for each cell type were analyzed (16,819 NM genes and 127 lincRNAs). Individual lincRNA and NM gene ratios of FPKMs in polyA+/polyA− fractions are plotted. Pearson correlation value for lincRNAs = 0.622 (P = 5.551E-15) and for NM genes = 0.702 (P FPKM 10 and nearly 1,000 expressed at >FPKM 30, rivaling the expression of many messenger RNAs. We chose to apply an expression cutoff to remove very lowly expressed transcripts from the catalog of lincRNAs. However, it may be the case that there exist many functional lincRNAs with very low expression levels, below our expression filter cutoff. For example, the functional human lincRNA HOTTIP is expressed in approximately one out of three cells [37]. Furthermore, recent findings have shown that the intergenic transcriptome may be vastly more complex than currently appreciated when very lowly expressed transcripts are considered [7]. It is possible that some of these are functional transcripts despite their apparent low expression, perhaps having brief bursts of expression during stages of the cell cycle or functioning in single cells in a heterogeneous population as has been previously observed [14]. Therefore, while we have provided the most complete lincRNA catalog to date, there may be additional lowly expressed, yet potentially functional lincRNAs that were excluded here. In order to minimize any potential contamination of the lincRNA catalog with protein coding transcripts, the filtering approach used was very aggressive. In fact, most previously annotated noncoding RNAs failed to pass our filters and were therefore excluded from the lincRNA catalog (Table S3 and Dataset S9). The vast majority of these transcripts (including most GENCODEv6 “lincRNAs” and “processed transcripts”) overlap known or predicted protein coding genes, pseudogenes, or non-lincRNA noncoding RNAs (e.g. microRNAs)(Table S3). Some of these removed transcripts may be functional long noncoding RNAs, such as GAS5 (removed because it contains 10 snoRNA genes within its introns). However, in order to most confidently identify only lincRNAs, rather than potential unannotated extensions of known genes, these were removed. Of those previously annotated noncoding RNAs that are intergenic, more than half contain predicted ORFs longer than 100 amino acids. For example, two previously characterized functional human lincRNAs were found to contain ORFs longer than 100 amino acids, Xist and HOTAIR. These results demonstrate that our filtering approach, which eliminates all transcripts with ORFs larger than 100 amino acids, may have removed some lincRNAs with large, nonfunctional ORFs. However, the use of a 100 amino acid ORF cutoff, a commonly used threshold to define potential protein coding genes, is justifiable because ORFs of this size infrequently occur by chance and instead indicate potential for protein coding capacity [38], [39]. Rather than discard all transcripts with large ORFs, as we did here, one option to discriminate between transcripts that are coding versus noncoding is to analyze the frequency of synonymous codon substitutions (PhyloCSF) [40]. However, this approach is limited to ORFs that can be aligned across species, potentially missing recently evolved or otherwise nonconserved novel protein coding genes. Importantly, our approach of removing all transcripts with large open reading frames effectively removed transcripts with significant predicted coding potential (Figure 2C), indicating that using an ORF size cutoff is at least as conservative as filtering based on PhyloCSF analysis. The lack of engagement of the ribosome, observed with ribosomal profiling data, confirms the stringency of the ORF cutoff filter (Figure 2B). Further analysis of these removed large ORF-containing intergenic transcripts is outside the scope of this study, but we have included these annotations for investigators interested in further analyzing their coding potential in search of novel protein coding genes (Dataset S10). Despite the fact that most previously annotated noncoding RNAs failed to pass our filters, our lincRNA catalog contains significantly more lincRNAs than previously known (>94% of lincRNAs are entirely novel at each expression level). This is the result of two unique features of our study. First, the RNA-seq read depth and diversity of tissues surveyed allowed for the detection of rare and tissue specific transcripts that were previously unknown. Many of these novel transcripts passed all filters and are annotated as novel lincRNAs in our catalog. Second, in contrast to prior lincRNA annotation efforts that were restricted to identification of only spliced or polyadenylated lincRNAs [16], [19], [41], we sought to generate annotations of a more complete set of human lincRNAs regardless of splicing or polyadenylation status. The reasons for taking this approach are manifold. Two of the most well known and abundant functional human lincRNAs, NEAT1 and MALAT1, are single exon genes (as are approximately 5% of protein coding genes) [42], suggesting that non-spliced transcripts may make up an important class of lincRNA. Additionally, numerous functional nonpolyadenylated noncoding RNAs have been described [30], [43]. Even long noncoding RNAs which can be spliced are often found in their unprocessed forms [44], a distinct property of long noncoding RNAs that would result in missed lincRNAs if splicing were a required attribute. Therefore, we chose not to exclude any lincRNAs from this catalog due to lack of splicing or polyadenylation. Importantly, because nonspliced, nonpolyadenylated transcripts could theoretically be erroneously de novo assembled from reads derived from contaminating genomic DNA in RNA-seq data, we took multiple measures to mitigate any contributions of genomic DNA contaminant reads (see Methods). Due to inherent limitations of de novo transcriptome assembly using short reads of finite depth, it is not always possible to unequivocally determine the complete structure of a transcript. This is particularly true for lowly expressed transcripts where the number of reads available is limited, and for genomic regions to which reads cannot be uniquely mapped. In the case of shallow read depth, exons of multi-exonic transcripts may lack reads connecting the exons, and de novo assembly could result in separate annotation of each exon as a distinct transcript. In support of this, we found that lower expressed lincRNAs discovered from de novo transcript assembly were less likely to have multi-exonic structures (Table S5). Additionally, the annotated 5′ and 3′ ends of the lincRNAs may represent truncations of the full length transcripts. Indeed, our analysis of PET tag data revealed that while the majority of our lincRNA catalog is overlapped by at least one PET tag, in most cases there is minimal PET tag support for the annotated 5′ and 3′ ends of the lincRNAs (Table S6). It is therefore the case that some lincRNA annotations in the catalog we provide (Dataset S2), particularly single exon lincRNA annotations, may represent fragments of larger transcripts. Furthermore, considering the reported prevalence of low level overlapping transcripts throughout intergenic sequence [7], it is not clear that full lincRNA structures can be unequivocally deconvoluted using short read RNA-seq technology. The determination of full lincRNA structures will be an important future effort in the field and may rely upon new datasets of longer read length and greater read depth, use of multiple orthogonal data types in the same tissue, new technologies such as ultra long read next generation sequencing, and further improvements in software for de novo transcript assembly. In addition, the majority of RNA-seq data we analyzed lacks strand information and as a result most of the lincRNAs in our catalog are of ambiguous strandedness. Prior annotations have relied upon splice site orientation to infer the strandedness of the transcript [16]. While this is a reasonable approach that we too have adopted when applicable in the present lincRNA catalog, stranded RNA-seq data is needed to most confidently assign strandedness to de novo assembled transcripts. While determining the isoforms and full structures of all lincRNAs is clearly desirable, these incomplete lincRNA structure annotations are nonetheless of tremendous practical value. Knowledge of the structure of a portion of a transcript is often sufficient to test for differential expression or perform RNAi knockdown experiments, and facilitates the cloning and sequencing of the full length transcript. Because of this, instead of placing additional restrictions upon lincRNA annotations, our filtering strategy was aimed toward identification of as many transcripts as possible that fit within the definition of a lincRNA. However, for investigators interested in more refined lincRNA annotations, we have provided multiple more restrictive lincRNA catalogs (Datasets S4, S5, S6). A key question in the field is whether the transcripts resulting from pervasive transcription of intergenic regions are functional or the result of noisy transcription. The lincRNAs we describe are specifically regulated and contain conserved sequence, attributes inconsistent with transcriptional noise (Figure 3). Furthermore, lincRNAs were found to be strongly enriched for intergenic TASs compared to nonexpressed intergenic regions (Figure 4). This striking finding supports the possibility that many intergenic SNPs mark regions that function as lincRNAs rather than DNA elements. Because nearly half of all TASs are intergenic, it is possible that lincRNAs play a significant role in the majority of human traits and diseases thus far analyzed in GWASs. One functional lincRNA (MIAT) was first identified during the experimental interrogation of an intergenic TAS [35], and another lincRNA PTCSC3, was identified nearby a TAS found from a papillary thyroid carcinoma GWAS, perhaps representing the first of many such discoveries to come from intergenic TASs. The finding that lincRNAs are strongly enriched for TASs provides a new opportunity to revisit intergenic trait-associated regions with unknown functional mechanisms by testing whether the overlapping lincRNA is involved in the observed phenotype. This noncoding RNA catalog represents a major step toward achieving a more complete understanding of this exciting frontier. We have identified a large number of putative lincRNAs with characteristics suggesting functionality. However, many of these lincRNAs are low expressed and definitive proof of functionality for a lincRNA requires functional experiments. High throughput functional genomic approaches, such as RNAi and cDNA overexpression screens, will serve as crucial tools for future efforts to uncover the roles of lincRNAs in diverse biological systems. With the requisite technology now available for these next generation experimental approaches, the time is ripe for this dark matter of the human genome to step further into the spotlight. Materials and Methods RNA-seq and Ribosomal Profiling Read Alignment and Processing 127 RNA-seq sequence files (5 novel and 122 publicly available datasets, Table S1) were aligned to hg18 with TopHat v1.1.4 allowing only uniquely mapped reads using the option -g 1 (all other parameters were default, see the TopHat manual http://tophat.cbcb.umd.edu/manual.html). Detailed information pertaining to each dataset, including novel datasets, is available in the sources provided in Table S1. These RNA-seq datasets were chosen because they sampled a wide breadth of human tissues and cell types, have well documented experimental methods used for their generation, and were publicly available. While datasets with longer reads and deeper read depth were preferred because they allow for more complete de novo transcript assembly, some datasets with short reads and shallow read depths were included in order to sample as many tissue types as possible. Datasets derived from tissues with mutated genomes, such as cancers, were included to capture tissue specific expression even though some reads from mutated genomic positions would fail to map to the reference hg18 genome. SAMtools v0.1.7 and BEDTools v2.12.0 were used to process aligned read files. Quantitation of the Transcribed Fraction of the Genome The uniquely mappable human genome, defined here as the portions of the genome to which RNA-seq reads can be uniquely mapped, was derived for hg18 from http://www.imagenix.com/uniqueome/downloads/hg18_uniqueome.unique_starts.base-space.50.2.positive.BED.gz [45]. It contains 2,570,174,327 bp or 83.4% of the total human genomic sequence. To determine the genomic coverage of RNA-seq data, all aligned RNA-seq reads were combined and read coverage at each genomic base position was determined with the BEDTools function genomeCoverageBed. Split reads (i.e. exon-exon junction spanning reads) were counted such that intronic sequence was included as part of the reads. In Figure 1A, “All genes, ESTs, cDNAs” includes GENCODE v10 genes (excluding pseudogenes), RefSeq NM and NR genes, UCSC Known Genes, spliced H-Invitational cDNAs, spliced ESTs (UCSC Genome Browser “Spliced EST” track), and previously annotated spliced lincRNAs [16]. In all cases, intronic sequences of genes, cDNAs and ESTs were included. LincRNA Discovery Transcripts annotated in public databases and literature sources that could be lincRNAs were compiled Ensembl v61 “processed_transcript” and “lincRNA” categories, GENCODE v6 “processed_transcript” and “lincRNA” categories, RefSeq NR and XR genes, H-Invitational “noncoding” transcripts, ultra conserved elements (UCEs), and published lincRNAs from Khalil et al. [18] and Cabili et al. [16]. LiftOver was used to map hg19 coordinates to hg18 for Ensembl, GENCODE, H-Invitational and Cabili et al. [16] transcripts. RefSeq XR sequences in hg19 were aligned to hg18 with BLAT v34 and the top scoring alignment was used. Ultra conserved elements sequences were retrieved from http://biodev.cbm.fvg.it, aligned to hg18 with BLAT v34 and the top scoring alignment was used. Khalil et al. [18] exons were grouped by their overlapping defined transcribed regions to build transcript structures. Novel transcripts from de novo transcriptome assembly of RNA-seq data were compiled De novo transcriptome assembly was performed on RNA-seq data with Cufflinks v1.0.1 using the upper quartile normalization (-N) and fragment bias correction (-b) options. This transcript assembly was performed using reads that were prealigned to hg18 using TopHat as described above. In cases where multiple datasets of the same library type from the same tissue were available, these datasets were combined to increase read depth for de novo assembly (see Table S2). For paired end read datasets, only properly paired and singleton reads as defined by SAMTools were used. Transcripts were filtered to remove overlap with non-lincRNA genes or pseudogenes and short transcripts Transcripts less than 200 nt in length were removed. Remaining transcripts were removed if they were within 1 kb of RefSeq NM genes on the same strand or, in the case of transcripts with ambiguous strandedness, on either strand relative to the NM gene. Transcripts on the opposite strand of an NM gene were removed if they overlapped the NM gene by at least one base. In addition, transcripts overlapping at least one base of any of the following were removed, regardless of strandedness: Ensembl v61 genes except “lincRNA” and “processed_transcript”, non-human RefSeq genes aligned to hg18 with BLAT (UCSC Genome Browser “Other RefSeq” track), alternative and extended 5′ and 3′ UTRs of known human genes from UTRdb, RefSeq NR and XR transcripts annotated as “pseudogenes”, and Ensembl v54 coding sequences. Transcripts containing large ORFs were removed Two steps of filtering were performed to remove both putative protein coding transcripts and their UTRs. First, large ORFs (>100 amino acids) were identified in all transcripts in all reading frames using EMBOSS getorf v6.1.0. In order to account for potentially truncated ORF-containing transcripts in which the start or stop codon may be outside the annotated region, the presence of greater than 300 nt downstream of a start codon without an interrupting stop codon, or 300 nt upstream of a stop codon without an interrupting start codon, sufficed to call a putative ORF. Transcripts with putative large ORFs were removed. These putative large ORF containing intergenic transcripts, some of which may be novel protein coding genes, are provided as a resource in Dataset S10. In order to remove potential UTRs of these large ORF-containing transcripts from the lincRNA catalog, the remaining transcripts were filtered to remove any that overlapped a large ORF-containing transcript. Transcripts overlapping extended protein coding gene structures were removed RNA-seq reads may extend beyond annotated 5′ and 3′ ends of annotated protein coding gene structures representing possible extended UTRs as well as, in the case of spliced reads mapping to the gene from distal sites, unannotated exons. In order to avoid cataloging transcripts in these regions as lincRNAs, we created a filter based on extended boundaries of protein coding genes using RNA-seq data. To do this, de novo transcriptome assembly with Cufflinks v1.1.0 using RefSeq NM genes as a reference annotation (-g), upper quartile normalization (-N), and fragment bias correction (-b) was performed on all polyA+ RNA-seq libraries in Table S2. RefSeq NM gene annotations were used as the reference annotation for this transcript assembly because these represent a limited, high confidence set of protein coding gene annotations. This set of extended protein coding gene boundaries (Dataset S1) was used as a filter to remove transcripts that overlap any extended protein coding gene by at least one base regardless of strandedness. Transcripts not expressed at FPKM>1 in at least one dataset were removed In order to determine transcript expression levels, mapped RNA-seq reads were distributed to transcripts using a modified version of HTSeq v0.5.3p that allows for reads that are mapped to shared portions of overlapping transcripts to be counted as a full read for each overlapping transcript. This was necessary to properly assign reads to each of multiple redundant annotations of transcripts present in the combined set from all public databases and de novo assemblies prior to the merging of overlapping lincRNA annotations (described below). These redundant annotations are the result of the repeated de novo assembly of the same transcript in multiple different datasets or redundant existing annotations in public databases, each of which have slightly different genomic coordinates yet may represent the same transcript. As such, all reads were distributed fully to each redundant annotation rather than proportioned between them. Read counts were converted to FPKMs using total mapped reads for each dataset calculated by the SAMTools flagstat function and custom scripts. Transcripts not expressed at FPKM>1 in at least one dataset were removed. As a result of this FPKM>1 minimum filter, 99.975% of de novo assembled lincRNAs (pre-merging) have at least 5 reads supporting their expression in at least one of the combined datasets in Table S2, and >99.1% have at least 10 reads in one dataset. Transcripts were further categorized as FPKM>1, FPKM>10, and FPKM>30 in at least one dataset where each category is inclusive of all transcripts in higher categories. Overlapping transcripts passing all filters at each expression cutoff were merged and grouped by proximity To identify a minimal set of distinct lincRNAs, overlapping transcripts were merged if 50% of an exon of a transcript overlapped an exon of another transcript. Furthermore, merged transcripts within 1 kb of each other were placed in the same group but received distinct transcript numbers, and are named based on the FPKM expression level they were derived from, e.g. FPKM1_group_32871_transcript_1. Merging, grouping and naming was performed separately on all FPKM>1 transcripts, FPKM>10 transcripts, and FPKM>30 transcripts. Filtering statistics are presented in Table S3. The catalog of merged lincRNAs at each expression cutoff is in BED format for genome build hg18 in Dataset S2. The FPKM>1 catalog of lincRNAs was used for all analyses in this study unless stated otherwise. The lincRNA annotations are provided as BED files in the hg18 genome annotation rather than hg19 because the UCSC Genome Browser currently has more data “tracks” available for hg18. However, the lincRNA annotations may be readily converted to hg19 or other genome annotations by users with the LiftOver tool: http://genome.ucsc.edu/cgi-bin/hgLiftOver. After merging these expression filtered, overlapping lincRNAs, FPKMs were recalculated (Dataset S8) for the merged lincRNAs using the modified HTSeq program described above. Due to the incomplete nature of the lincRNA structures resulting from de novo assembly, overlapping and nearby lincRNAs were considered to represent different potential models of the same lincRNA gene (rather than isoforms). Therefore, in the rare instances where two or more lincRNA models partially overlap but do not satisfy our merging criteria (above), the reads mapping to these overlapping portions were fully assigned to each lincRNA. Identifying lincRNAs expressed significantly above other intergenic regions For each RNA-seq dataset (Table S1), an empirical background distribution of expression values was generated using one million size-matched annotations shuffled randomly across intergenic sequence. The intergenic sequence used includes all portions of the uniquely mappable genome excluding RefSeq NM, NR and XR genes, Ensembl v61 genes including “lincRNAs” and “processed transcripts”, GENCODEv6 genes including “lincRNAs” and “processed transcripts”, H-Invitational “noncoding” transcripts, alternative and extended 5′ and 3′ UTRs of known human genes from UTRdb, extended protein coding gene structures derived from RNA-seq data (extended gene filter, described above), and published lincRNAs from Khalil et al. [18] and Cabili et al. [16]. To determine which putative lincRNAs (in Dataset S2, FPKM>1) were expressed significantly above background in at least one dataset the probability of observing a transcript at any given expression level was estimated using the dataset-specific background distribution and adjusted for multiple testing according to the Bonferroni correction assuming one test per RNA-seq dataset. Those lincRNA annotations with a corrected P value 1 in any of the datasets analyzed here and were therefore removed from the lincRNA catalog in Dataset S2. However, some of these transcripts were reported as expressed at FPKM>1 in at least one of the datasets analyzed in Cabili et al. [16], all of which are distinct from the datasets analyzed here. These additional lincRNAs were merged with the lincRNAs in the catalog in Dataset S2 resulting in an additional 920 lincRNAs in 741 groups at FPKM>1, 88 lincRNAs in 82 groups at FPKM>10, and 17 lincRNAs in 17 groups at FPKM>30. This expanded lincRNA catalog is in BED format for genome build hg18 in Dataset S3 and was not used further for any analyses in this study. Note on Genomic DNA Contamination in RNA-seq Datasets Genomic DNA contamination is a potential source of false positive expression signal in RNA-seq data that may contribute to de novo assembly of erroneous transcripts. In principle, only exon-exon junction spanning reads can be unequivocally determined as derived from RNA. Proper de novo assembly of both nonspliced and spliced (aside from the exon-exon junction spanning reads) transcripts may therefore suffer if significant genomic DNA contamination is present. Because our analysis utilized a wide range of novel and previously existing RNA-seq datasets of unknown genomic DNA contamination content, we took multiple steps to mitigate this possibility. First, for all RNA-seq datasets, we analyzed the distribution of reads between protein coding exons compared to other regions with the expectation that read distributions should be similar between RNA-seq datasets generated from libraries of the same type (e.g. polyA+ selected). A dataset with an unusually high percentage of intronic and intergenic reads could contain significant genomic DNA contamination. Our analysis of the datasets used in this study revealed that, as expected, polyA+ specific RNA-seq datasets have a higher fraction of reads mapping to protein coding gene exons than rRNA-depleted or polyA− specific datasets. Furthermore, no obvious outlier datasets were found for any of the library types. The results of this analysis ensured that no datasets with high genomic DNA contamination were used in this study (Figure S2). Next, as described in Figure 2A and in the Methods, we applied both size and expression cutoffs for all lincRNAs. The size cutoff prevents miscalling errant single reads, either from genomic DNA contamination or from read mapping artifacts, as lincRNAs while the expression cutoff removes lincRNAs that are assembled from rare genomic DNA-derived reads. The combination of these approaches served to minimize the contribution of genomic DNA to the lincRNA catalog. Analysis of Distribution of LincRNAs Between Polyadenylated and Nonpolyadenylated RNA-seq Data H9 ESC and HeLa RNA-seq data from fractions exclusively containing polyA− or polyA+ transcripts were analyzed [46]. Transcripts with RNA-seq reads in all four datasets and with FPKM>1 in at least one of the two fractions for each cell type were analyzed for Figure 3B (16,819 NM genes and 127 lincRNAs). For Figure S5, transcripts with reads in both fractions and FPKM>1 in at least one of the two fractions for a specific cell type were included in the analysis of that cell type (20,470 NM genes and 849 lincRNAs in H9 ESCs; 18,294 NM genes and 1,009 lincRNAs in HeLa). The whiskers of the box and whisker plot extend to +/−1.5 times the interquartile range or the most extreme datapoint. Paired-End Ditag (PET) Cluster Analysis Publicly available paired-end ditag (PET) cluster annotations derived from 7 cell lines or tissues, generated by the ENCODE project, were downloaded from http://genome.ucsc.edu/cgi-bin/hgFileUi?db=hg19&g=wgEncodeGisRnaPet. The PET cluster annotation files used were (by cell or tissue type): A549 (wgEncodeGisRnaPetA549CellPapClusters.bedCluster), H1_hESC (wgEncodeGisRnaPetH1hescCellPapClustersRep1.bed), HeLa-S3 (wgEncodeGisRnaPetHelas3CellPapClustersRep1.bed), IMR90 (wgEncodeGisRnaPetImr90CellPapClusters.bedCluster), MCF-7 (wgEncodeGisRnaPetMcf7CellPapClusters.bedCluster), Prostate (wgEncodeGisRnaPetProstateCellPapClustersRep1.bed), SK-N-SH (wgEncodeGisRnaPetSknshCellPapClusters.bedCluster). Further description of these PET clusters, including how the annotations were generated, is available at the UCSC Genome Browser site here http://genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid=321010719&c=chr21&g=wgEncodeGisRnaPet. BEDTools was employed to compute overlap between lincRNA and RefSeq NM gene 5′ and 3′ ends and PET cluster 5′ and 3′ end ‘blocks’. In the case of ambiguous stranded lincRNAs, both potential orientations were allowed for determining overlap with the 5′ and 3′ ends of PET clusters. Ribosome Profiling Analysis Ribosome profiling data and matched mRNA-seq data from HeLa cells corresponding to the experiments (mock transfected 12 hr time point) presented in Guo et al. [22] were downloaded from the NCBI GEO (GSE22004). The expression level of the filtered set of lincRNAs and of RefSeq NM transcripts was evaluated as above. The 803 lincRNAs expressed at an FPKM>1 and a sample of 1292 RefSeq NM transcripts expressed at an FPKM>1 (divided into their constituent CDS and 3′ UTR regions) were broken up into 30 bp windows with a 1 bp offset. A modified version of HTSeq (described above) was used to count reads aligning to each window for both RNA-seq and ribosomal profiling data. The ratio of ribosome-associated reads over mRNA-seq reads was evaluated for each window and the maximum ratio for a given transcript was taken as a measure of ribosome engagement. The whiskers of the box and whisker plot in Figure 2B extend to +/−1.5 times the interquartile range with outliers depicted as dots. Wilcoxon rank sum test was used to calculate P values. Computational Analysis of Coding Potential The program PhyloCSF (9/16/2010 release) [40] was used to computationally evaluate the coding potential of the filtered lincRNA transcripts. A BED file describing these lincRNA transcripts as well as a random sample of 8310 RefSeq NM transcripts was loaded onto the Galaxy webserver (https://main.g2.bx.psu.edu/) and the tool ‘Stitch Gene Blocks’ was used to retrieve multiple alignment files with sequence entries for the following genome builds based on the 44 way Multiz alignment to hg18: hg18 panTro2 rheMac2 tarSyr1 micMur1 otoGar1 tupBel1 mm9 rn4 dipOrd1 cavPor3 speTri1 oryCun1 ochPri2 vicPac1 turTru1 bosTau4 equCab2 felCat3 canFam2 myoLuc1 pteVam1 eriEur1 sorAra1 loxAfr2 proCap1 echTel1 dasNov2 choHof1. Genome build names were converted to common names and PhyloCSF was run using the options –orf = StopStop3 and –frames = 6. Chromatin Modification Analysis ChIP-seq data from IMR90 cells [28] was retrieved from the NCBI SRA (Table 1) and aligned to hg18 using Bowtie v0.12.7 allowing only uniquely mapped reads (-k 1). A modified version of HTSeq v0.5.3p (described above) was used to count reads mapping to lincRNAs and RefSeq NM genes. The ratio of IP reads to matched input control reads was used as the measure of ChIP signal. RNA-seq data from IMR90 cells [29] was also analyzed to obtain FPKM values for lincRNAs and RefSeq NM genes using the same procedure used for lincRNA discovery. The whiskers of the box and whisker plot extend to +/−1.5 times the interquartile range or the most extreme data point. 10.1371/journal.pgen.1003569.t001 Table 1 Datasets used for chromatin modification analysis. Mark Sample ID SRA File ID(s) H3K4me3 214 SRR029610, SRR029618 H3K9me3 805 SRR037619 H3K36me3 214 SRR037546, SRR037550, SRR037553, SRR037592 H3k27me3 803 SRR037555, SRR037560 Input 803 SRR037639 Input 805 SRR037640 Input 214 SRR037634, SRR037635, SRR037636 Tissue Clustering by LincRNA Expression RNA-seq datasets from B cells, H1 ESCs, and brain (see Table S1) were clustered by lincRNA expression levels. LincRNAs with FPKM>10 in one or more of the 7 RNA-seq datasets analyzed in Figure 3B were used to generate the heatmap and dendrogram. These 7 datasets were chosen for this analysis because they have replicates from each tissue and have deep read counts for all replicates (Table S1), important features for accurate measurement of differential expression. Using Gene Cluster 3.0, FPKM values were log2 transformed and the genes (rows) and samples (columns) were normalized by multiplying each log2 transformed FPKM value by a scale factor such that the sum of the squares of the values in each row and column are 1.0. Euclidean distance using centroid linkage was calculated for all samples and the heatmap and dendrogram was generated with Java TreeView. Red corresponds to fully induced expression and blue corresponds to fully repressed expression. Conservation Analysis Base-wise conservation scores (PhyloP score calculated with PHAST), based on the multiple alignment of placental mammal genomes, were downloaded from the UCSC Genome Browser. The 50 bp window in each lincRNA transcript with the highest average PhyloP score was identified. The process was repeated for RefSeq NM genes and a set of size-matched (to lincRNAs) repetitive elements from RepeatMasker (UCSC Genome Browser). PhyloP scores for the maximally conserved 50 bp windows of each lincRNA are listed in Table S4. SNP Analysis Enrichment of trait-associated SNPs A table containing all trait-associated SNPs with P 0.05) from HapMap release #27 was downloaded from the BioMart HapMap site (http://hapmap.ncbi.nlm.nih.gov/biomart/martview) and the number of common SNPs within RefSeq NM gene exons, lincRNA exons and background loci divided by the number of genomic bases in each of these categories was determined. Fisher's exact test was used to calculate P values and error bars in Figure S7 represent 95% binomial proportion confidence intervals. Supporting Information Dataset S1 Extended protein coding gene boundary filter (BED format; hg18). (TXT) Click here for additional data file. Dataset S2 Primary catalog of lincRNAs identified and analyzed in this study (53,864 FPKM>1, 3,676 FPKM>10, and 925 FPKM>30 transcripts) (BED format; hg18). (ZIP) Click here for additional data file. Dataset S3 Catalog of lincRNAs in Dataset S2 after merging with additional lincRNAs found to be expressed at FPKM>1 exclusively in Cabili et al. [16] (54,784 FPKM>1, 3,764 FPKM>10, and 942 FPKM>30 transcripts) (BED format; hg18). (ZIP) Click here for additional data file. Dataset S4 Catalog of lincRNAs in Dataset S2 (FPKM>1) that are spliced (4,576 transcripts) (BED format, hg18). (TXT) Click here for additional data file. Dataset S5 Catalog of lincRNAs in Dataset S2 that are expressed at FPKM>1 in at least two RNA-seq datasets (26,455 transcripts) (BED format, hg18). (TXT) Click here for additional data file. Dataset S6 Catalog of lincRNAs in Dataset S2 (FPKM>1) that are statistically significantly (p 1) and NM genes in all individual datasets (TXT). Please note that these are large files: the compressed FPKM file is 32 MB (94 MB uncompressed) and the compressed counts file is 7 MB (29 MB uncompressed). (ZIP) Click here for additional data file. Dataset S9 GENCODEv6 “lincRNAs” and “processed transcripts” that were removed at each step of filtering. (A) Unfiltered GENCODEv6 “lincRNAs” and “processed transcripts” (39,472 transcripts) (BED format; hg18) (TXT). (B) GENCODEv6 “lincRNAs” and “processed transcripts” that overlap RefSeq NM (protein coding) genes by at least 1 base pair on either strand (27,267 transcripts) (BED format; hg18) (TXT). (C) GENCODEv6 “lincRNAs” and “processed transcripts” that overlap (see Methods) one or more elements of an expanded set of protein coding genes (UCSC, RefSeq, Ensembl, GENCODE), pseudogenes, UTRs (UTRdb), or non-lincRNA noncoding RNAs (33,245 transcripts) (BED format; hg18) (TXT). (D) GENCODEv6 “lincRNAs” and “processed transcripts” that passed the protein/pseudogene/non-lincRNA ncRNAs/ 100 amino acids in length (964 transcripts) (BED format; hg18) (TXT). (E) GENCODEv6 “lincRNAs” and “processed transcripts” that do not themselves contain an ORF>100 amino acids, but overlap another annotated or de novo lincRNA that contains an ORF>100 amino acids (2,700 transcripts) (BED format; hg18) (TXT). (F) GENCODEv6 “lincRNAs” and “processed transcripts” that passed the prior filters but overlap an extended protein coding gene structure (149 transcripts) (BED format; hg18) (TXT). (G) GENCODEv6 “lincRNAs” and “processed transcripts” passing all prior filters except not found expressed at FPKM>1 in any dataset (1,469 transcripts) (BED format; hg18) (TXT). (H) GENCODEv6 “lincRNAs” and “processed transcripts” passing all filters and expressed at FPKM>1 in at least one dataset (945 transcripts) (BED format; hg18) (TXT). (ZIP) Click here for additional data file. Dataset S10 Catalog of intergenic transcripts containing ORFs longer than 100 amino acids (105,265 transcripts) (BED format; hg18). (TXT) Click here for additional data file. Figure S1 Fraction of the human genome with mapped RNA-seq reads at varying minimum read thresholds. The 4.5 billion mapped reads from all 127 RNA-seq datasets were combined and aligned to the uniquely mappable portion of the human genome (see Methods). The fraction of the uniquely mappable genome with at least the minimum read threshold is plotted. The data does not plateau at low minimum read thresholds, indicating that deeper sequencing would result in a further increase in the fraction of genome covered. For split reads (reads spanning an intron), the intervening (intronic) sequence was either inferred to have been transcribed (Including Inferred Bases) or was not (Excluding Inferred Bases). At the 1 read minimum read count threshold, 67.1% and 78.9% of the genome have read coverage when excluding or including inferred bases, respectively. (TIF) Click here for additional data file. Figure S2 Fraction of RNA-seq reads mapping to protein coding (RefSeq NM) gene exons versus intronic and intergenic regions for 127 RNA-seq datasets grouped by RNA-seq library type. Read counting was performed using a modified version of HTSeq v0.5.3p (see Methods). Isoforms of protein coding genes were flattened before reads were counted such that reads were distributed only once per gene even if multiple isoforms exist. PolyA+ selected libraries (enriched for mRNAs) contain a higher fraction of reads mapping to protein coding gene exons while ribosomal RNA-depleted RNA-seq libraries and polyA− selected libraries contain a higher fraction of intronic and intergenic reads. In all cases, due to the generally high expression levels of protein coding genes, protein coding gene exons contain a disproportionate number of mapped reads relative to the genomic space they occupy ( 1) expressed at varying minimum FPKM levels. The fraction of lincRNAs in Dataset S2 that are expressed at or above the corresponding FPKM level in at least one dataset is plotted. (TIF) Click here for additional data file. Figure S4 LincRNAs have tissue specific expression patterns. LincRNA expression levels (FPKMs) were used to cluster replicates of RNA-seq data from B cells, H1 embryonic stem cells and brain tissue. Agglomerative hierarchical clustering of both lincRNAs (rows) and samples (columns) by Euclidean distance was performed with log2 transformed lincRNA FPKM values for lincRNAs with FPKM>10 in at least one of the analyzed samples. The heatmap displays red for fully induced lincRNAs and blue for fully repressed lincRNAs, where rows and columns were normalized (see Methods). (TIF) Click here for additional data file. Figure S5 Polyadenylation of lincRNAs versus protein coding genes. Distribution of ratios of FPKMs in polyA+/polyA− fractions for lincRNAs and NM genes in HeLa and H9 ESCs. Transcripts with reads in both fractions and FPKM>1 in at least one of the two fractions for a specific cell type were included in the analysis of that cell type (20,470 NM genes and 849 lincRNAs in H9 ESCs; 18,294 NM genes and 1,009 lincRNAs in HeLa). Whiskers extend to +/−1.5 times interquartile range or most extreme data point. (TIF) Click here for additional data file. Figure S6 Comparison of conservation of the full lincRNA catalog (53,864 lincRNAs, Dataset S2, FPKM>1) to GENCODEv6 lincRNAs. The maximally conserved 50 bp windows in each lincRNA, RefSeq NM gene and repetitive element (nonconserved control sequences) were determined. Only the GENCODE lincRNAs that passed all lincRNA filters (2,414 GENCODE lincRNAs, Table S3) were evaluated. (TIF) Click here for additional data file. Figure S7 Distribution of common SNPs between lincRNA exons, NM gene exons, and nonexpressed intergenic regions. HapMap II SNPs with minor allele frequency >0.05 located within NM gene exons, lincRNA exons, or background loci (nonexpressed intergenic regions), normalized by total number of base pairs in each region, were counted (*P = 0.0173, ** P 1). 532 lincRNAs do not contain 50 contiguous bases with PhyloP scores and therefore are not listed. (XLSX) Click here for additional data file. Table S5 Fraction of de novo assembled lincRNAs (pre-merging) discovered by de novo assembly in each combined dataset (see Table S2) that are spliced. (XLSX) Click here for additional data file. Table S6 LincRNA (Dataset S2) and RefSeq NM gene analysis for experimental support of 5′ and 3′ end annotations using combined paired-end ditag (PET) data from 7 tissues/cell lines generated by the ENCODE project (see Methods). (XLSX) Click here for additional data file.
            Bookmark
            • Record: found
            • Abstract: found
            • Article: not found

            Transposable Elements Are Major Contributors to the Origin, Diversification, and Regulation of Vertebrate Long Noncoding RNAs

            Introduction There is a growing appreciation that the functional repertoire of metazoan genomes includes much more than protein-coding sequences [1]–[3]. Recent functional genomic studies have revealed, in particular, the widespread occurrence, bewildering diversity, and functional significance of noncoding RNA [4]. In addition to small regulatory RNAs, such as tRNAs or microRNAs, the genome encodes a myriad of long noncoding RNAs (lncRNAs) that are greater than 200 nt in length [for review: 5]–[7]. The most recent, though still conservative, catalogues predict between 5,000 and 10,000 discrete lncRNA loci in the human genome [8]–[10]. The majority of lncRNAs in these manually curated reference sets are intergenic units often referred to as large intergenic noncoding RNAs (lincRNAs) because they do not overlap with known protein-coding genes. Comparable numbers of lncRNA loci are expected to occur in the mouse and other vertebrate genomes [9], [11]–[16] and hundreds of loci with similar properties have also been identified in model invertebrates such as Drosophila melanogaster [17] and Caenorhabidtis elegans [18], as well as in the model plant Arabidopsis thaliana [19]. Although once dismissed as transcriptional ‘noise’, there is mounting evidence that many lncRNAs are important functional molecules engaged in diverse regulatory activities. First, the majority of functionally characterized lncRNAs exhibit precise spatiotemporal patterns of expression and, often, discrete cellular localization [9], [11]–[13], [20]–[25]. Second, the structure, biogenesis and processing of lncRNAs are very similar to that of protein-coding genes and indicate that most lncRNAs are produced from independent transcription units. For example, lncRNAs are typically transcribed by RNA polymerase II, under the control of diverse combinations of transcription factors that actively bind to promoters and enhancers, with canonical chromatin modifications [10]–[12], [26], [27]. LncRNA transcripts are also alternatively spliced, polyadenylated, and subject to other post-transcriptional modifications [10], [12], [28]. Third, lncRNA exons generally display a clear signal of purifying selection, implying structural and/or functional sequence constraint, albeit less stringent than on protein-coding exons [10], [12], [29]–[33]. Moreover, some lncRNA genes are evolutionarily ancient. A small but increasing number of loci orthologous to human lncRNAs have been identified in the mouse, and the origins of some human lncRNAs can be traced to the common ancestor of mammals, amniotes, or even vertebrates [9], [10], [14], [16], [34], [35]. Finally, a growing body of genetic and biochemical work on individual lncRNAs, as well as more systematic approaches to explore lncRNA function and their association with disease, point to crucial regulatory activities, notably in cell differentiation and embryonic development [for review: 7], [23], [36]–[44]. While the precise molecular functions of lncRNAs are still poorly understood, even less is known about their origin and evolution. Four non-mutually exclusive hypotheses have been proposed for the emergence of lncRNAs [6], [14]: (i) transformation of a protein-coding genes; (ii) duplication of another lncRNA; (iii) de novo origin from sequences previously untranscribed or devoid of exonic sequences; (iv) emergence from transposable element (TE) sequences. Individual examples illustrating each of these mechanisms have been described. For example, Xist, a lncRNA controlling mammalian X inactivation, originated in the eutherian ancestor from a mixture of exons derived from a decayed protein-coding gene [45] together with a variety of transposable elements (TEs) progressively accumulated and ‘exonized’ at this locus [46]. However, with the exception of a few emblematic and intensively studied lncRNAs such as Xist, the origins of most lncRNAs remain elusive. In one of the most systematic efforts to trace the origins of lncRNAs, Ulitsky et al. [14] found that a minority (∼15%) of zebrafish lncRNAs showed significant sequence similarity to another lncRNAs or protein-coding genes in the zebrafish genome. Likewise, Derrien et al. [10] reported that human lncRNAs rarely have extensive sequence similarity to each other outside of shared repetitive elements. Collectively these observations suggest that, in contrast to protein-coding genes, novel lncRNA genes do not commonly arise by duplication, but rather may emerge de novo from previously non-exonic sequences and/or from TEs. TEs occupy a substantial fraction of vertebrate genomes (e.g. at least half of the human genome [47], [48]) and are increasingly recognized as important players in the origin of functional novelties [for review: 49]–[52]. Several instances of TEs co-opted for cellular function on a genome-wide scale have been documented, notably as a source of cis-elements regulating adjacent host genes, such as promoters [53], [54], transcription factor binding sites [55]–[57], enhancers [58], [59] or insulators [60], [61]. TEs can also be ‘exonized’ into novel coding and non-coding exons [for review: 49], [62], [63]. As a source of non-coding exons, TEs have been shown to contribute substantially to untranslated regions [64]–[67] and to alternatively spliced exons of protein-coding genes [66]–[70], as well as to microRNA genes [71], [72]. In this study we provide evidence for the widespread involvement of TEs in the assembly, diversification, regulation, and potential function of lncRNAs. Results Datasets We focus on three vertebrate species -human, mouse and zebrafish- for which extensive lncRNA datasets are available (Table 1). Each set has been ‘manually’ curated based on a combination of bioinformatics and high-throughput genomics experiments, such as deep sequencing of polyadenylated RNAs (RNA-seq), chromatin state maps and cap-analysis of gene expression (CAGE) or paired-end ditags to determine transcript termini. For human, we primarily analyzed the most recent Gencode catalog of lncRNAs (v13) produced from 15 cell lines as part of the ENCODE project [10], [73], [74]. We replicated most analyses on another large set of lncRNAs assembled by Cabili et al. [9] from 24 human tissues and cell types. Importantly, the Gencode and “Cabili” sets differ slightly in the way they were curated and they are only partially overlapping [10]. Indeed we found that 64.9% of the Gencode v13 genes have no overlap with genes in the Cabili set, and conversely 47.3% of the Cabili genes have no overlap with the Gencode v13 set. While the Cabili set only contains “intergenic” (l i ncRNA) units (no overlap with known protein-coding genes), the Gencode catalog includes also “genic” lncRNAs, i.e. those overlapping or nested within protein-coding genes [10], Figure S1. Thus, these two sets may be viewed as complementary rather than redundant, acting as “biological replicates” for our study. For mouse, we primarily studied lincRNAs from Ensembl (release 70) and replicated some analyses on lincRNAs from adult liver tissue compiled by Kutter et al [16]. For zebrafish, we merged the sets of developmentally expressed lncRNAs from Pauli et al. [24] and lincRNAs from Ulitsky et al. [14] (see Methods for more details). 10.1371/journal.pgen.1003470.t001 Table 1 Number of genes and transcripts in studied datasets. Datasets Genes# Transcripts Human, Gencode v7 [10] 9,277 14,880 Human, Gencode v13 12,393 19,835 Human, lincRNAs, Cabili et al. (2011) [9] 8,263 14,353 Mouse, lincRNAs, Ensembl release 70 1,671 2,167 Mouse, lincRNAs, Kutter et al. (2012) [16] 293 293 Zebrafish [14], [24] 1,402 1,780 # For zebrafish gene annotation, see Methods. Other numbers are from the datasets themselves. A substantial fraction of vertebrate lncRNAs contain exonized TE sequences We inferred the TE content of lncRNAs by calculating the fraction of lncRNA transcripts with exons overlapping at least 10-bp of DNA annotated as TE by RepeatMasker (see Methods). We found that 75% of human (Gencode v13) lncRNA transcripts contain an exon of at least partial TE origin, which is considerably much higher than any other type of RNAs such as small ncRNAs (tRNAs, sno/miRNAs), pseudogenes, coding exons (less than 1%), as well as UTRs, the non-coding parts of mRNAs (Figure 1A). The median length of TE-derived fragments in human lncRNAs is 112 nucleotides and the average is 150 nucleotides. While the majority of human lncRNA transcripts are comprised of a relatively small percentage of TE-derived sequences, 3,789 out of 19,835 transcripts examined (∼19%) are composed of ≥50% of TE-derived sequences (Figure 1B). Similarly, 68.23% and 66.5% of mouse and zebrafish lncRNA transcripts, respectively, contain exonic sequences of at least partial TE origin (Figure 1A). 10.1371/journal.pgen.1003470.g001 Figure 1 TE occurrence in lncRNAs. See text, Methods and Table 1 for more details about lncRNA datasets. A. Percentage of transcripts with at least one exon overlapping with a TEs fragment (at least 10 bp). In red, lncRNAs (human = Gencode v13; mouse = both sets). Rest corresponds to human Refseq 57: in green, small non-coding RNAs (tRNAs and sno/miRNAs); in blue, protein-coding genes (pc genes) separated in exon types (coding and non-coding = UTRs); in black, pseudo = pseudogenes. B. Distribution of percentage of human lncRNA transcripts (Gencode v13) derived from TEs (more than 0% to more than 95%). The number of transcripts with more than 80% and more than 50% TE-derived DNA exons are indicated. Distribution is also shown for the subset of 36 studied lncRNAs presented in Table 2 and Table S2. To measure the total coverage of TE-derived sequences in lncRNA exons in each species, we intersected TE annotations from RepeatMasker (with a minimum overlap of 10 bp, see Methods) with the genomic coordinates of all lncRNA exons, and for comparison, with UTRs and coding exons of RefSeq protein-coding genes. The results show that, in all three species TE coverage is considerably higher for lncRNA exons than for protein-coding exons, but still lower than in the whole genome (Figure 2). The fraction of lncRNA exon sequence covered by TEs is also at least twice higher than in their UTRs. 10.1371/journal.pgen.1003470.g002 Figure 2 Coverage of different TE classes in genome, lncRNA, and protein-coding exons in human, mouse, and zebrafish. For genomes, total length (100%) corresponds to total length of assembly without gaps (human: 2,897 Mb. Mouse: 2,620 Mb. Zebrafish: 1,401 Mb). For lncRNAs, total length of genomic projection of all of exons are considered (human, Genc. = Gencode v13: 14.2 Mb. Human, Cabili set: 8.5 Mb. Mouse, Ens70 = Ensembl 70: 2.8 Mb. Mouse, Kutter: 0.15 Mb. Zebrafish: 2.3 Mb). For protein coding genes (pc genes), total length of CDS exons, 5′ and 3′UTR respectively are as follow: human, 30.9 Mb, 5.2 Mb, 24.6 Mb. Mouse: 30.5 Mb, 4.0 Mb, 21.6 Mb. Zebrafish: 19.1 Mb, 33.6 Mb, 12.5 Mb. Only pc genes from Refseq annotations with CDS and UTR features are considered (see Methods). Percentage of coverage of all TEs is indicated above bars. We noticed that the Cabili set [9], which consists exclusively of intergenic units (lincRNAs) shows greater TE coverage (35.1%; Figure 2) than the Gencode v13 set (28.9%; Figure 2), suggesting that intergenic lncRNAs may have a higher TE content than ‘genic’ lncRNAs (i.e. those overlapping protein-coding genes). Consistent with this idea, the TE coverage of intergenic lncRNAs in the Gencode v13 set is 31.8%, while genic lncRNAs are comprised of 25.9% of TE-derived sequences (Table S1). Thus, human lincRNA transcripts tend to be richer in TE sequences than genic lncRNAs. We wondered whether this trend could merely reflect a higher TE density in intergenic regions in general. It does not appear to be the case because the TE coverage of introns and surrounding sequences of Gencode intergenic lncRNAs is similar to that of protein-coding genes or genic lncRNAs (Table S1). These observations suggest that TEs are more prevalent in intergenic lncRNAs than in genic lncRNAs. A survey of individual human lncRNAs previously characterized in the literature recapitulates the omnipresence and high prevalence of TE sequences we detect in the ab initio lncRNA catalogs (Figure 1B, Table 2 and for an expanded version, Table S2). The presence of exonized TEs has been reported for some of these lncRNAs, such as XIST [46], UCA1 [75], HULC [76], PCAT-14 [77] and SLC7A2-IT1A/B [78]. But for the majority, there has been no previous mention of embedded TEs, even though some of these mature lncRNA transcripts are almost entirely composed of TE sequences. For example, the first three exons (out of four, i.e. ∼86% of the sequence) of the mature transcript of BANCR, which is involved in melanoma cell migration [79], are derived from a MER41 long terminal repeat/endogenous retrovirus (LTR/ERV) element (Figure 3A). The mature transcript of lincRNA-RoR, which has been shown to modulate the reprogramming of human induced pluripotent stem cells [41], is made from an assemblage of 6 different TEs together accounting for 2,057 nt (79.7%) of its length (Figure 3B) [see also ref. 7]. Importantly, the structures of BANCR and lincRNA-RoR transcripts have been validated by a combination of RACE and RT-PCR experiments and their function investigated by siRNA knockdowns and rescue experiments [41], [79]. These transcripts were independently retrieved and their structure accurately predicted in the Cabili and Gencode v13 sets, respectively. In mouse, Fendrr lincRNA, which has a very restricted pattern of expression in lateral mesoderm [80], initiates within a MTEa (ERVL-MaLR) and 4 different TEs account for 808 nt (33.7%) of its length (data not shown). In summary, our analyses point to an extensive contribution of TEs to the content of mature lncRNA transcripts, including many of those with established regulatory functions. 10.1371/journal.pgen.1003470.g003 Figure 3 Examples of lncRNAs with embedded TEs. Genomic DNA is represented as a grey line, transcripts are represented by a black line, with arrows showing sense of transcription and in grey boxes the exons of the mature transcript. TEs as colored boxes (orange-red: DNA TEs. Yellow: SINEs. Pink-purple: LTR/ERVs. Green: LINEs). Only TEs overlapping with lncRNA exons are represented. See also Table S2 for details of TEs in these lncRNAs. A. BANCR [79]. B. lnc-RoR [41]. Apes = gibbon, gorilla, orangutan, bonobo, chimpanzee, human. C. lnc-ES3 [43]. 10.1371/journal.pgen.1003470.t002 Table 2 TE content of known lncRNAs in human. Gene/ID# Range of % of TE based DNA of mature transcripts when applies TSS in TE: class (number of transcripts) polyA in TE: class (number of transcripts) number of transcripts: with TE/total PCAT14 99.6–99.96 LTR (2) LTR (2) 2/2 BANCR 86.4 LTR (1) - 1/1 Lnc-RoR 79.7 LTR (1) LINE (1) 1/1 PTCSC3 60.6 - LTR (1) 1/1 BACE1-AS 56.4 SINE (1) - 1/1 UCA1 51.8 LTR (1) - 1/1 HULC 45.7 LTR (1) - 1/1 LINC00458 (LncRNA-ES3) 45.3–57 LTR (3) - 3/3 ncRNA-7 (LINC00651) 41.1 lincRNA-p21 40.9 - SINE (1) 1/1 NEAT1 38.1 - - 1/2 PCAT1 37.1–81.7 - DNA (1) 2/2 AK023948 29.7 - - 1/1 KCNQ1OT1 29.5 - - 1/1 Tie-1AS 24.4–96.3 LINE (2) - 3/3 PTENP1 21 - LINE (1) 1/1 linc-CCDC90A-1 (LncRNA-ES1) 13.8 - LTR (1) 1/1 OIP5-AS1 (Cyrano) 12.5–64.7 - LTR (1); SINE (3); LINE (1) 10/10 SLC7A2-IT1 11.6–18.6 - LINE (2) 2/2 HOTAIRM1 9 - - 1/5 BIRC6 (megamind) 8.9 LINE (1) - 1/1 HAR1 7.2 - DNA (1) 1/1 BDNFOS (BDNF-AS) 6.4–71.1 - - 9/11 MEG3 5.6–47.1 SINE (2); LINE (1) SINE (6) 23/28 GAS5 5.3–40.9 - SINE (3) 23/29 Xist 5.3–17.4 - - 2/8 MALAT1/NEAT2 4.1–7.7 - - 2/3 ANRIL (CDKN2B-AS1, p15AS1, Mycn) 4–41.1 SINE (1) SINE (7); DNA (1) 16/17 TUG1 1.9–40.6 - LTR (1) 6/7 # Known lncRNAs with no detectable exonized TE are not shown, but include: HOTAIR, Zeb2AS1, TERC, PANDA, H19, LncRNA-ES2 and UNCA-RC. See Table S2 for an expanded version of this table and references. TE sequences in lncRNAs evolve under modest yet greater functional constraint than their non–TE sequences We next sought to evaluate the functional potential of TEs embedded in lncRNA transcripts. Several studies have reported that lncRNA exons show a signature of evolutionary constraint based on interspecies conservation [10], [12], [29], [30] as well as reduced nucleotide diversity in the human population [31]–[33] compared to randomly sampled regions of the genome or surrounding non-exonic sequences. Nonetheless, the level of constraint acting on lncRNA exons assessed through these analyses was much weaker than on protein-coding exons, presumably reflecting greater malleability of lncRNAs. To compare the level of selective constraint acting on TE-derived sequences to non-TE derived sequences in human lncRNAs (Gencode v13) and to various other types of genomic regions, we aggregated conservation scores per nucleotide calculated by phyloP across an alignment of 10 primate genomes (see Methods). As expected, we found that both TE and non-TE sequences in lncRNA exons were much less conserved than coding exons or UTRs of protein-coding transcripts (Figure 4A). Strikingly though, we found that TE sequences within lncRNA exons were significantly more conserved than either a size-matched random set of genomic regions or a neutral set of TE sequences residing in lncRNA introns (permutation test, p 0.5.). TEs promote the lineage-specific diversification of lncRNAs Because transposition represents a major source of lineage-specific DNA, we wanted to evaluate its contribution to the evolution of the vertebrate lncRNA repertoire. Our examination of TE-derived sequences in studied human lncRNAs reveals that many of these elements are restricted to primates (36.3% for Gencode v13, Figure S4), suggesting that TEs play an important role in the diversification and possibly the birth of primate-specific lncRNAs. Few of the human lncRNAs functionally characterized have identifiable orthologs in non-primate species, but Xist and cyrano provide solid examples of functional lncRNAs with ancient evolutionary origins. Xist is involved in X-chromosome inactivation and originated in the common ancestor of eutherian mammals [45], [46]. Previous in silico reconstruction of the Xist locus in the eutherian ancestor suggested that several TEs were already present at the dawn of the Xist gene and likely contributed to the assembly of the first functional Xist transcript [46]. Other TEs embedded in Xist exons are lineage-specific and therefore must have contributed to the diversification of the transcript during eutherian evolution. For example, a primate-specific FLAM_C element makes up nearly half (114 nt) of the first Xist exon in human (Table S2). cyrano is one of a small subset of zebrafish lncRNAs sharing significant sequence similarity and synteny with apparent orthologs in mouse and human [14]. Most of the sequence similarity between species is limited to a central region of the last exon (see PhastCons plot in Figure 9). In zebrafish embryos, cyrano is expressed in the nervous system and notochord and morpholino-mediated knockdowns followed by rescue experiments indicate that this lncRNA plays a role in neurodevelopment, a function possibly conserved in mammals [14]. We find that the conserved exonic region of cyrano is flanked by lineage-specific TEs embedded in this orthologous exon in each of the three species examined (Table 2, Figure 9). These examples illustrate how TEs can be incorporated long after the birth of lncRNAs to diversify their sequence in a lineage-specific fashion. 10.1371/journal.pgen.1003470.g009 Figure 9 Lineage-specific TE insertions in cyrano. Symbols and graphics are as in Figure 3. The structure of cyrano (lnc-oip5) [14] is based on coordinates of Gencode v13 transcript OIP5-AS1-001. Vertebrate PhastCons: peaks of sequence conservation across 46 vertebrate genomes displayed in the UCSC genome browser. Among functionally characterized human lncRNAs, we uncovered numerous instances where the TSS resides in primate-specific TEs (Table S2). In most of those cases, the TE provides the only identified TSS for that lncRNA locus, suggesting a pivotal role for these TEs in the biogenesis and most likely the birth of these lncRNAs during primate evolution. These include six of the eight known lncRNAs containing the largest TE amounts listed in Table 2, which all have their TSS located within the LTR of an ERV element. Intriguingly, these instances include two different lncRNAs that are highly expressed in human embryonic stem cells (ESCs) and have been experimentally shown to be implicated in the maintenance of ESC pluripotency: lncRNA-RoR [41] and lncRNA-ES3 [43]. The transcripts cloned for lncRNA-RoR and lncRNA-ES3 both initiate within LTR7/HERVH elements (Figure 3B and 3C). Furthermore, we found that these same LTR7 elements have donated the DNA binding sites for the ‘master’ transcriptional regulators of pluripotency NANOG, OCT4, and SOX2 mapped previously to the proximal promoter of lncRNA-RoR [41] (Figure 3B). Ng et al. [39] mapped two binding sites for NANOG in the promoter region of lncRNA-ES3 that we find to reside within the LTR7 driving this locus (Figure 3C). The contribution of LTR7 to the regulation of these lncRNAs in ESCs is consistent with two recent studies showing that TEs, and LTR/ERV elements in particular, play an extensive role in the primate-specific wiring of the core transcriptional network of human ESCs [57], [84]. In fact, Kunarso et al. [57] identified LTR7/HERVH as one of the most over-represented TE families seeding OCT4 and NANOG binding sites throughout the human genome. Our results indicate that this ERV family also contributed to the recruitment of primate-specific lncRNAs into the pluripotency network of human ESCs [see also ref. 7]. Some TEs confer lncRNAs the potential to form secondary structures Since lncRNAs act at the RNA level, we hypothesized that TEs may participate in the folding of lncRNAs into secondary structures, which could be important for their function. One prediction of this hypothesis is that lncRNAs with high TE content may fold into more stable structure than those with low TE content. To test this, we selected from the Gencode v13 set the top 100 lncRNAs with highest TE content and the top 100 lncRNAs with lowest TE content (see Methods) and compared the minimum free energy (MFE) of predicted secondary structures computed by the program randfold [85] for each of these individual lncRNAs. For each input sequence, randfold attributes a p-value to a predicted MFE by comparing it with a MFE obtained for the same sequence randomly reshuffled 99 times (See ref. [85] and Methods). The average p-value for high TE content lncRNAs was significantly lower than the one of low TE content lncRNAs (p = 0.0022, Wilcox rank sum test) (Figure 10A). The average length of the lncRNAs in the two datasets was also substantially different (913 nt and 1,913 nt for high and low TE content respectively), but there was no correlation between RNA length and p-value for the 200 lncRNAs examined (data not shown), ruling out a possible bias introduced by lncRNA length. Together these results indicate that TEs generally stabilize lncRNA structure in human, which supports the hypothesis that some of the TEs embedded in lncRNA exons contribute to the folding of lncRNAs into secondary structures. 10.1371/journal.pgen.1003470.g010 Figure 10 TE contribution to predicted lncRNA secondary structures. A. High and low TE content groups of 100 lncRNA were extracted from Gencode v13 set (TE content from 96.74% to 100% and from 0.49% to 2.27% respectively; see Table S7). P-values were calculated by Randfold and provide an indication of predicted secondary structure stability. The boxplot depicts the maximum, upper quantile, median, lower quantile and minimum value in a standard way. The mean of these 2 groups are significantly different by Wilcox rank sum test (p = 0.0022). B. Predicted secondary structures (RNAfold [115]) and compensatory mutations for two zebrafish lncRNAs containing ANGEL (DNA TEs) elements. In structures, TE derived regions are marked by solid line and base pairing probability by color spectrum (from 0 in violet to 1 in red). Zoom-in windows show part of stem with compensatory mutations: nucleotide substitution are boxed and the corresponding nucleotide found in ANGEL consensus are shown under/above actual RNA sequence. Sites of compensatory mutations are marked by asterisks and written p-values are adjusted by Bonferroni methods. To explore further this hypothesis, we studied a family of DNA transposons in zebrafish, called Angel, which occur in high copy numbers and are known to have the potential to form a stable stem-loop structure at the RNA level due to their long inverted repeats [86]. We reasoned that the incorporation of Angel elements in lncRNAs might in some case have conferred a functional benefit by increasing RNA stability. We identified 71 zebrafish lncRNAs containing exonized Angel elements. As expected, RNA folding programs predict that these lncRNAs have the potential to form a long stem-loop structure by intramolecular pairing of the Angel inverted repeats (see examples in Figure 10B). Furthermore, by comparing the sequence of these elements to that of their ancestral (consensus) progenitor, we identified two instances of Angel elements in lncRNAs where base substitutions in one of the arms of the predicted stem-loop structure were accompanied by compensatory substitutions on the other arm allowing the maintenance of base-pairing within the stem-loop structure (Figure 10B). To rule out the possibility that these substitutions occurred not at these loci, but prior to transposon insertion in a progenitor element that would have amplified or duplicated, we used BLAT [87] to search the zebrafish Zv9 genome assembly for paralogous Angel elements that might be sharing the same substitutions. In each case, we found that the compensatory substitutions we identified were unique to the Angel copies residing within the examined lncRNAs (data not shown), suggesting that these mutations occurred after transposon insertion. The probability that these compensatory substitutions would have occurred by chance alone in these two Angel elements is 0.001 and 0.036 after Bonferroni correction, respectively (see Methods). Furthermore, 12 of the 16 concerted mutations were from A/T to C/G base pairs, which is consistent with the idea that they increased the stability of the stem-loop structure. These data suggest that these Angel elements indeed fold into the predicted secondary structures in vivo and have been maintained over time by natural selection, plausibly for the proper function of the lncRNAs. To seek another, independent line of evidence for the involvement of TEs in forming secondary structures potentially important for lncRNA function, we next looked for sites of adenosine-to-inosine (A-to-I) editing in lncRNAs. This form of RNA editing is catalyzed by the ADAR family of adenosine deaminases that act on double-stranded RNA templates [88]. In humans, it has been reported that A-to-I RNA editing occurs predominantly within Alu elements embedded in the 3′ UTR of protein-coding transcripts [89]–[91]. This bias has been explained by the relatively high frequency of Alu elements in transcribed regions of the human genome, which often occur in inverted pairs and thereby can form long RNA duplexes providing templates for ADARs [92]. We used DARNED, a database of RNA editing sites in humans [93], to identify 2,941 A-to-I editing sites in mature lncRNA transcripts. As observed previously for mRNAs, most (82%) of the edited sites in lncRNAs occur within Alu elements, although we also found evidence of A-to-I editing within a wide range of TE types embedded in lncRNAs (Table 3 and Table S6). This may be explained by the fact that non-Alu TE sequences are much more frequent in lncRNAs than in mRNAs, even when UTRs are considered separately (Figure 2 and Figure S1). Indeed, we found that the density of edited sites within Alu, non-Alu TE, or non-TE sequences fall within the same order of magnitude in lncRNAs and UTRs (Table 3). In several cases individually examined, we found that editing sites in TE sequences map preferentially within regions of the lncRNA predicted to form stem-loop structures by virtue of the inclusion of two inverted copies of the same TE family in the lncRNA (see two examples in Figure 11). The finding that TE sequences, and in particular Alu elements, embedded in lncRNAs are frequent templates for A-to-I editing confirms that TEs are commonly engaged in intra- or inter-molecular base pairing interactions to form stable dsRNA structures. 10.1371/journal.pgen.1003470.g011 Figure 11 Long stem in two human lincRNAs (Cabili set) formed by inverted TEs. Two examples of heavily edited human lincRNA transcripts with editing sites located in TEs. RNA structures are predicted by RNAfold [115]. Nucleotide color in structures indicates base pairing probability (from 0 to 1). Inverted TE pairs are marked by solid lines, the arrow illustrating TE strand. The stem pair of TCONS_00017795 is composed by inverted Alu elements, while the structure of TCONS_0001109 is formed by 2 LTRs (MLT2B3). 10.1371/journal.pgen.1003470.t003 Table 3 Editing sites in exons. Total editing sites lincRNA 5′UTR 3′UTR Alu 1521 29 5297 other TE 162 0 121 nonTE 187 87 688 Editing sites per 100 kbp# lincRNA 5′UTR 3′UTR Alu 298 59 472 other TE 6.19 0 6.84 nonTE 3.23 1.75 3.17 # Editing sites density in Alu and non-Alu sequences are similar among non-coding transcript. The density of Alu in different sequences are about 10-3, while in other TE elements and non TE sequences the density is about 10-5. Discussion Recent high-throughput efforts to characterize the transcriptome of multicellular eukaryotes have uncovered thousands of lncRNA genes [5]–[7], [17]–[19]. While current lncRNA catalogs, such as those we used, are still far from exhaustive and almost certainly contain false positives, they indicate that the abundance and complexity of lncRNAs in mammalian genomes may rival or exceed that of protein-coding genes [9], [10], [13], [16]. The precise functions of the vast majority of lncRNAs remain to be determined, but evidence from genetic, genomic, and biochemical experiments, as well as analyses of sequence constraint, suggest that many lncRNAs perform important functions, most notably in the control of protein-coding gene expression during development and differentiation [for review: 5], [6], [7]. Despite the functional importance of some known lncRNAs, the basic mechanisms of lncRNA evolution have been largely unexplored. The few studies that have examined the evolutionary dynamics of lncRNAs paint a picture of evolutionary volatility, where large cohorts of lncRNAs seem to appear, disappear, or rapidly diversify, pointing to a potentially important role of lncRNAs in lineage-specific regulatory innovation [6], [10], [16]. Here we present a systematic assessment of TE contribution to the makeup, evolutionary origins, and regulation of vertebrate lncRNAs. While this paper was in its final stage of preparation, a study by Kelley and Rinn [94] analyzed in some detail the contribution of TEs to human and mouse multi-exonic intergenic lncRNAs. The two studies complement each other in that we analyzed different lncRNA sets, but arrive at the same general conclusion that TEs are important players in the composition and diversification of lncRNAs, highlighting a new way mobile elements have influenced genome evolution and shaped a likely crucial layer of genomic regulation. TEs are ubiquitous in vertebrate lncRNAs, but the type and amount of TE vary among species While TEs are seldom found in protein-coding transcripts (even in their UTRs, see Figure 1 and Figure 2), they are ubiquitous in lncRNAs of all three vertebrates examined (Figure 1A), accounting for a large fraction of total lncRNA sequence (Figure 2). Thus, high TE prevalence is probably a common characteristic of vertebrate lncRNA repertoires that distinguish them from mRNAs and smaller ncRNAs, such as tRNAs or microRNAs, which are typically TE-depleted (Figure 1A). We found that all major TE classes are found in lncRNAs in each of the three vertebrate species surveyed, and their relative abundance mirrors that of the entire genome (Figure 2). Nonetheless, in each species we identified TE families that were statistically enriched (up to 32 fold) in lncRNA exons relative to their coverage or density in the whole genome (Figure 7 and Figure S3). Interestingly, these over-represented families belong to different TE classes in the species examined, for example, LTR/ERV in human and mouse and DNA transposons in zebrafish (compare colors per species in Figure 7 and Figure S3). The predominance of DNA transposons in zebrafish is expected based on the prevalence of DNA transposons in this genome (see Figure 2, Figure 7 and Figure S1). However our results show that LTR/ERVs contribute disproportionally to lncRNAs in human and mouse, which is in agreement with the recent results reported by Kelley and Rinn [94]. Interestingly, human lncRNAs are mostly enriched for the ERV I subclass (alpharetroviruses), compared to mouse where ERV 2, ERV 3 or ERV K TEs are enriched (Figure 7 and Figure S3). ERV 1 subclass of elements is less abundant in the mouse genome [95] and strongly repressed in mouse ESCs [96], [97]. Therefore, it is not surprising that this type of retroviral elements do not contribute more to mouse lincRNAs. While LTR/ERV elements are also generally silenced in most human tissues, a subset of families is known to escape silencing and to become transcriptionally active in some tissues, cell types, or at certain developmental stages [54], [98]–[100]. These properties may derive from the intrinsic capacity of retroviruses to hijack host transcriptional activators in order to promote their own expression in a cell-type or developmentally restricted fashion [51], [52], [55]. For example, hundreds of ERV I elements recruit the pluripotency factors OCT4 and NANOG in human ESCs, but rarely do so in mouse ESCs [57]. This mechanism can readily explain why lncRNA-RoR and lncRNA-ES3 and hundreds of other lncRNAs associated with ERV I elements (such as LTR7/HERVH) are highly transcribed in human ESCs (Figure 3, Table 2) [see also ref. 94]. This trend is also globally apparent through the enrichment of LTR elements (including LTR7/HERVH) in 286 human lncRNAs upregulated in ES cells [annotations from Table S1, 10] (data not shown). In sum, the interspecific variations we observe in the coverage and type of TEs in lncRNAs likely reflect a variety of factors; both methodological, such as the breadth of cell types and tissues examined, and biological such as the abundance and intrinsic properties of certain TEs residing in the genome. “lncRNA first” versus “TE first”: Divergence or emergence? Two scenarios can explain the prevalence of TEs in lncRNAs. The first is that TE insertion in pre-existing lncRNAs has relatively little deleterious effect on lncRNA function allowing TEs to accumulate over time as waves of transposition break in the genome. We call this scenario the ‘lncRNA first’ model because it implies that the origin of the lncRNA predates the incorporation of TE(s) in their exons. In the second and not mutually exclusive scenario, the ‘TE first’ model, lncRNAs are assembled from TEs that inserted before the birth of the lncRNAs. Several observations and examples outlined below indicate that both models contribute to the pervasive occurence of TEs in lncRNAs. The “lncRNA first” scenario is supported by a comparison of the few lncRNAs known to be of relatively ancient origin, exemplified by Xist or cyrano, which have assimilated lineage-specific TE insertions sequentially during evolution (see Figure 9 and Table S2). Typically, these exonized TEs correspond to the most variable regions of the transcript sequence flanking more deeply conserved core sequences (see [14] and Figure 9). On a broader scale, we observe that TEs predominantly contribute to the last exon of lncRNAs (56.5% of TE amount, see Figure S5). The biased incorporation of TEs to the 3′ region of transcripts is also apparent for mRNAs, where exonized TEs are more abundant in 3′ UTRs (Figure 1, Figure 2 and Figure S5), as reported previously [65]–[67]. These data suggest that TE-derived sequences are preferentially acquired at the 3′ end of pre-existing transcripts, either because this region is more permissive to TE insertion and/or because TEs are somehow predisposed for this type of exonization events, for example owing to the presence of cryptic acceptor splice sites facilitating their capture [70], [101], [102]. In any case, this 3′ bias is consistent with the ‘lncRNA first’ model whereby TEs are secondarily acquired by existing lncRNAs. On the other hand, several observations support the ‘TE first’ model. First, we identified thousands of lncRNA transcripts that are mostly or entirely composed of TEs (Figure 1B). It is difficult to conceive that these lncRNAs would have emerged from ancestral non-TE regions later replaced or obliterated by secondary TE insertions. More likely, these lncRNAs were born from material providing by pre-existing TE insertions. In support to this idea, we identified 4,404 human Gencode v13 lncRNA transcripts with TE-derived TSS, with 1,777 of these (40.4%) derived from primate-specific TE families (Table S4). In addition, we found 2,213 human lncRNA transcripts whose first exons are entirely derived from TEs, and 965 of these (43.6%) are derived from primate-specific TE families (Table S4). These values are very similar when only genes with a unique TSS are considered and we retrieved comparable numbers in the Cabili set (Table S4). Since these TEs provide the only TSS assigned for these transcripts, we propose that these lncRNAs were born from the transcriptional activity brought upon TE insertion in the genome. Interestingly, 36.8% (857/2,331) of the TE-derived unique TSS map within LTR/ERV elements, while this type of elements account for only 8% of all TEs in the human genome (see Figure 2). Thus, it appears that the tissue-specific transcriptional activity of LTR/ERV elements [52], [54], [100] represents a major force driving the birth of lncRNAs. These data also imply that a substantial fraction of human lncRNAs are of recent origin, because ∼40% of TEs driving human lncRNAs are primate-specific and some are even restricted to hominoids (e.g. Figure 3A and 3B, Table S2). In summary, our data suggest that, in some instances, TE insertion events have been a source of diversification of ancestral lncRNAs, while in others TE insertions have triggered the emergence of brand new lncRNAs during evolution. In order to better quantify the relative importance of either process to lncRNA evolution, it will be necessary to infer systematically the age of lncRNAs using a comparative RNA-seq approach [16]. TE–mediated regulation of lncRNA genes It has been extensively documented that mammalian TEs represent an abundant source of cis-regulatory sequences driving or modulating the expression of adjacent protein-coding genes [reviewed in 49], [56]. Our study provides evidence that TEs located in the vicinity of lncRNAs may also frequently contribute to the transcriptional regulation of these genes. As discussed above, LTR/ERV elements appear to make a disproportionate contribution to lncRNA regulation relative to other TE types and in some cases they may be solely responsible for the cell-type specificity of lncRNA expression. This is exemplified by lncRNA-RoR whose transcription in hESCs is driven by a LTR7/HERVH element occupied by the pluripotency factors OCT4, NANOG and SOX2 (Figure 3C and [41], [94]). Thus, much like LTR/ERV elements have been implicated in the wiring of protein-coding genes into specific regulatory networks [55], [57], [59], [84], they have also recruited lncRNAs serving important developmental function, notably in the pluripotency network [41], [43], [94]. Possible functions of TEs embedded in lncRNAs Perhaps the most pressing question to address in the future is to what extent TEs may contribute to the function of lncRNA and how? Our analysis shows that TEs embedded in lncRNAs frequently supply sequences and signals essential for the transcription (e.g. TSS) and processing (e.g. splice, polyA sites) of the lncRNAs (Figure 5). However it does not prove that TE sequences per se are indispensable for lncRNA function, if such function even exists. Many studies have used various approaches and statistics to show that lncRNA exons, as a whole, display weak but significant signals of purifying selection suggesting that at least a fraction of lncRNA sequences is subject to functional constraint detectable at the primary DNA level [10], [12], [29]–[33]. Our analysis confirms the existence of a signal of purifying selection acting on human lncRNA exons, but more importantly we observe that this signal is higher in TE-derived than in non-TE derived lncRNA sequences (Figure 4A) suggesting that a subset of TE sequences in lncRNAs are structurally or functionally constrained. TE-derived sequences could serve as the functional elements of lncRNAs in numerous ways. For example, TE sequences might provide interaction interfaces with proteins involved in post-transcriptional or transcriptional regulation, such as the chromatin modifiers often found in complex with lncRNAs [37], [103]. Their inclusion may also provide opportunities for base-pairing interaction with single-stranded DNA or RNA containing similar repeats in inverted orientation. Such duplexes might act as a platform to recruit protein effector complexes to genomic or RNA targets. For example, Alu elements embedded within several human lncRNAs form a group called 1/2sbs-RNAs that base-pair with complementary Alu elements located in the 3′-UTR of several protein-coding transcripts to form duplexes creating a binding site for the Staufen1-mediated RNA decay machinery, which in turn promote post-transcriptional repression of the targeted mRNAs [104]. Given the abundance of Alu and other high copy number TEs in lncRNAs, such trans-regulatory effects may be widespread and affecting a large number of mRNAs containing complementary TEs in their UTRs. It was also shown recently that a B2 SINE embedded in a mouse lncRNA antisense to Uchl1 is required for post-transcriptional up-regulation of UCHL1 protein synthesis, an activity that can be transferred to an artificial antisense green fluorescent protein transcript containing the B2 SINE element [105]. We identified 361 mouse lncRNAs containing B2 SINEs (16.7%; see Tables S5 and S9), raising the possibility that these elements confer similar post-transcriptional regulatory activity to other lncRNAs. Finally, another recent study identified a point mutation associated with a lethal form of infantile encephalopathy within a primate-specific LINE-1 retrotransposon transcribed as part of a lncRNA in the human brain [78]. The precise function of the LINE-1 element in this lncRNA is unknown, but knockdown of the lncRNA resulted in increased neuronal apoptosis, an effect consistent with the encephalopathy phenotype. Interestingly, the point mutation detected in affected individuals was predicted to destabilize a secondary structure in the corresponding lncRNA, suggesting that the LINE-1 element may contribute to lncRNA folding that is important for its function in the brain. Similarly, we identified several instances in zebrafish and in human where TEs embedded in lncRNAs are predicted to be involved in the formation of stem-loop structures that have been maintained in evolution through compensatory mutations and therefore are likely to be functionally significant. We also found that these structures often lead to RNA editing of lncRNAs, which to our knowledge is a novel observation that may be relevant to the function of some lncRNAs [88], [92]. We also show that human lncRNAs fold into more stable structure than those with low TE content, suggesting that these individual examples of TEs apparently co-opted for the cellular function of lncRNAs likely represent only the tip of a large iceberg. Future work is bound to unravel a variety of mechanisms through which TEs embedded in lncRNAs have become involved in regulating the expression of vertebrate genomes. Conclusions There is growing evidence that vertebrate genomes contain a large number of long non-coding RNA genes (lncRNAs) that play important gene regulation roles, however, remarkably little is known about the origins of these genes. Our study reveals that TEs, through their capacity to move and spread in genomes in a lineage-specific fashion, as well as their ability to introduce regulatory sequences upon chromosomal insertion, represent a considerable force shaping the lncRNA repertoire of human, mouse and zebrafish. These results suggest that the apparent paucity of ancient lncRNA genes may be explained in part by rapid turnover mediated by lineage-specific TEs and imply that the regulatory networks in which lncRNA genes act may be rapidly diverging between species. Methods lncRNA datasets The datasets used in this study are as follow: human, Gencode release 13 (from ftp://ftp.sanger.ac.uk/pub/gencode) and Cabili et al. (2011) [9]. Mouse, Ensembl release 70 (ftp://ftp.ensembl.org/pub/release-70/gtf/mus_musculus/) and Kutter et al (2012) [16], both sets filtered to keep only intergenic lncRNAs. Coordinates from Kutter et al. were converted from mm9 to mm10 using the liftover tool from UCSC (http://genome.ucsc.edu/cgi-bin/hgLiftOver?hgsid=325693955). Zebrafish sets are from Pauli et al. (2012) [24] and Ulitsky et al. (2011) [14]. To limit redundancy, in case of overlap of exons between transcripts of the two sets, only transcripts from Pauli et al. (2012) were kept. Additional descriptors of the datasets are provided in Table 1. TE annotation TE annotations used in this study are obtained from the outputs of the RepeatMasker (RM) software [106] produced for the following genome assemblies: human, hg19 assembly, RM v.330, repbase libraries 20120124, from RM website (http://www.repeatmasker.org/species/homSap.html). Mouse, mm10 assembly, RM v.330, repbase libraries 20110920, from UCSC website (http://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/). Zebrafish, danRer7: RM v.329, repbase libraries RB20090604, from UCSC website (http://hgdownload.cse.ucsc.edu/goldenPath/danRer7/bigZips/). These RM outputs were filtered to remove non-TE elements (Low Complexity, Satellites, Simple Repeats and ncRNA). For mouse, MutSatRep1, CYRA11_Mm and YREP_Mm are also removed. To minimize multiple counting of single TE copies artificially fragmented in the RepeatMasker outputs we merged consecutive pieces of the same TE separated by less than 10 nt. Counts of TEs in transcripts The TE content of lncRNA transcripts (datasets described above) and human RefSeq 57 ncRNAs (22,486 in total), pseudogenes (13,430), CDS and UTRs (20,848 protein coding genes) was determined by intersecting these sets with each species' TE annotations (described above) using the ‘Table Browser’ at the UCSC Genome Bioinformatics Site (http://genome.ucsc.edu/index.html) [107]. Only overlaps of minimum 10 bp were kept. Coverage of TEs in exons and surrounding sequences Protein-coding gene (pc genes) models were filtered to retain only those with 5′ and 3′ UTRs, from following releases: human, Refseq 49 (hg19, gtf file from UCSC genome browser); mouse, Refseq 57 (mm10, gtf file from UCSC genome browser); zebrafish, Ensembl 68 (danRer7). All nucleotide amounts correspond to genomic amount. Introns and upstream or downstream intergenic regions were processed through Galaxy [108]–[110] to remove all RefSeq genes exons (CDS and UTRS: Refseq 51 for human and zebrafish, Refseq 57 for mouse) as well as lncRNA exons of the datasets considered. Intergenic sequences (upstream or downstream, up to 10 or 1 kb) correspond to the longest fragment between TSS or polyA and another feature (RefSeq entries as well as lncRNA exons of the dataset considered). These sets (exons, introns, intergenic sequences) were then joined in Galaxy with filtered RepeatMasker outputs described above keeping only fragments with at least 10 nt of overlap, to calculate TE coverage of exons. See Tables S7, S8, S9, S10, S11 for transcripts TE content data. Conservation PhyloP By comparing their PhyloP scores across an alignment of 10 primate genomes, the conservation of human (Gencode v13) TE-derived lncRNA exonic segments was compared to non-TE derived lncRNA segments, RefSeq 57 5′- and 3′-UTRs, protein-coding exons and a set of random genomic fragments size-matched to the TE-derived lncRNA segments. We also generated a set of TE-derived lncRNA intronic segments, non overlapping with splicing sites and corresponding to inactive chromatin to obtain a most neutral set to compare with exonic TE-derived lncRNA segments [32] (all annotated chromatin marks from 9 cell lines were subtracted: GM12878, H1-hESC, HMEC, HSMM, HUVEC, HepG2, K562, NHEK, NHLF; ENCODE version Jan. 2011). Precompiled PhyloP scores were obtained from the ‘phyloP46wayPrimates’, ‘phyloP46wayPlacental’, and ‘phyloP46wayAll’ tracks available from the UCSC Genome Bioinformatics Site (http://genome.ucsc.edu/index.html) [107] and intersected with gene annotations using bedtools (http://code.google.com/p/bedtools/) [111]. Boxplots were made in R (http://www.r-project.org). Statistical test used: permutation test with 1000 permutations were performed in R. TE contribution to functional features: Exon counts TEs were not assigned a strand allowing them to overlap genomic features on either strand. TEs found in a genomic feature were classified based on their position in the feature, as schematized in Figure 5A. Both lncRNA and protein-coding genes were filtered before both the random and non-random analyses. In case of multiple splice forms a random mRNA was kept. Additionally protein coding genes that did not have both a 5′ UTR and 3′ UTR were excluded from the analysis. Sets of protein-coding genes are as follows: human, Refseq release 49, mouse, Refseq release 46, zebrafish, Ensembl release 68. For the random sets, all TEs were shuffled within chromosomes (excluding gaps) while preventing TE overlap. This process was repeated 5,000 times for each set, using a custom perl script (see http://www.yandell-lab.org/publications/index.html). The standard error for the random sets across all categories (Figure 5A) always plateaued before 1,000 replicates (data not shown). The probability of observing the non-random counts was calculated using the random sets. The p-value represents number of times a lower category count was observed in the random set out of 5,000 replicates. With the exception of ‘exonized’, ‘TSS’ and ‘polyA’ categories in mouse (p-values = 1, 0.001 and 0.298 respectively) and ‘exonized’ category in zebrafish (p-value = 0.001), the p-values were systematically 1 or the 25 most over represented are shown. A. Human, set from Gencode v13. B. Human, set of Cabili et al. (lincRNAs). C. Mouse, lincRNAs from Ensembl. D. Zebrafish. (TIF) Click here for additional data file. Figure S4 Amount of lineage specific and ancient TEs in human lncRNAs and protein-coding genes genomic environment. G: genome. See Methods “Coverage of TEs in exons and surrounding sequences” for details on sets. Ancient TEs correspond to TEs shared between placental mammals (Eutherians). Gencode v13 set for lncRNA. (TIF) Click here for additional data file. Figure S5 Relative amount of TEs depending on exon type. TE contribution means that 100% is the total coverage of a given class of TEs. For example, ∼20% (19.5) of TE amount is in first exon for lncRNAs, whereas for pc genes it is 6.4% (“All” TEs). lncRNAs are Gencode v13 set. A. Coverage. B. Counts. (TIF) Click here for additional data file. Table S1 TE coverage and TE counts in genic and intergenic human lncRNAs and surrounding regions. (XLSX) Click here for additional data file. Table S2 Detailed TE content of known lncRNAs presented in Table 2. (XLSX) Click here for additional data file. Table S3 Numbers of exons overlapping with TEs (data corresponding to Figure 5). (XLSX) Click here for additional data file. Table S4 Counts of functional features provided by ancient or primate TEs in human. (XLSX) Click here for additional data file. Table S5 Editing sites in Cabili lincRNA set. (XLSX) Click here for additional data file. Table S6 361 mouse lincRNAs with B2 SINEs elements (ENSEMBL release70). (XLSX) Click here for additional data file. Table S7 Details of TE content of human Gencode v13 lncRNAs. (XLSX) Click here for additional data file. Table S8 Details of TE content of human lincRNAs from Cabili et al. (2011). (XLSX) Click here for additional data file. Table S9 Details of TE content of mouse lincRNAs from ENSEMBL release 70. (XLSX) Click here for additional data file. Table S10 Details of TE content of mouse lincRNAs from Kutter et al. (2012). (XLSX) Click here for additional data file. Table S11 Details of TE content of zebrafish lincRNAs. (XLSX) Click here for additional data file.
              Bookmark
              • Record: found
              • Abstract: found
              • Article: found
              Is Open Access

              De Novo Assembly of Chickpea Transcriptome Using Short Reads for Gene Discovery and Marker Identification

              Chickpea ranks third among the food legume crops production in the world. However, the genomic resources available for chickpea are still very limited. In the present study, the transcriptome of chickpea was sequenced with short reads on Illumina Genome Analyzer platform. We have assessed the effect of sequence quality, various assembly parameters and assembly programs on the final assembly output. We assembled ∼107million high-quality trimmed reads using Velvet followed by Oases with optimal parameters into a non-redundant set of 53 409 transcripts (≥100 bp), representing about 28 Mb of unique transcriptome sequence. The average length of transcripts was 523 bp and N50 length of 900 bp with coverage of 25.7 rpkm (reads per kilobase per million). At the protein level, a total of 45 636 (85.5%) chickpea transcripts showed significant similarity with unigenes/predicted proteins from other legumes or sequenced plant genomes. Functional categorization revealed the conservation of genes involved in various biological processes in chickpea. In addition, we identified simple sequence repeat motifs in transcripts. The chickpea transcripts set generated here provides a resource for gene discovery and development of functional molecular markers. In addition, the strategy for de novo assembly of transcriptome data presented here will be helpful in other similar transcriptome studies.
                Bookmark

                Author and article information

                Contributors
                Role: Editor
                Journal
                PLoS One
                PLoS ONE
                plos
                plosone
                PLoS ONE
                Public Library of Science (San Francisco, USA )
                1932-6203
                2014
                11 February 2014
                : 9
                : 2
                : e88462
                Affiliations
                [1 ]Center for Molecular Biology and Genetic Engineering (CBMEG), University of Campinas (UNICAMP), Campinas, SP, Brazil
                [2 ]Centro Avançado da Pesquisa Tecnológica do Agronegócio de Cana (IAC/Apta), Ribeirão Preto, SP, Brazil
                [3 ]Departamento de Biotecnologia e Produção Vegetal e Animal, Centro de Ciências Agrárias, Universidade Federal de São Carlos, Araras, SP, Brazil
                [4 ]Departamento de Genética, Escola Superior de Agricultura Luiz de Queiroz, Universidade de São Paulo, Piracicaba, SP, Brazil
                [5 ]Departamento de Biologia Vegetal, Instituto de Biologia, Universidade Estadual de Campinas (UNICAMP), Campinas, SP, Brazil
                University of North Carolina at Charlotte, United States of America
                Author notes

                Competing Interests: The authors have declared that no competing interests exist.

                Conceived and designed the experiments: AAFG MSC LRP APdS RV. Performed the experiments: EAC MCM TWAB. Analyzed the data: CBCS EAC LECC RV. Contributed reagents/materials/analysis tools: EAC MCM TWAB. Wrote the paper: CBCS EAC RV.

                Article
                PONE-D-13-33543
                10.1371/journal.pone.0088462
                3921171
                24523899
                cbab82ae-1894-44e0-ae7f-13243f41c06b
                Copyright @ 2014

                This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

                History
                : 15 August 2013
                : 7 January 2014
                Page count
                Pages: 10
                Funding
                The authors gratefully acknowledge the Fundação de Amparo a Pesquisa do Estado de São Paulo (FAPESP, http://www.fapesp.br) for the financial support grants 2008/52197-4 (AS) and 2008/58031-0 (RV) and for the graduate scholarships to CBCS, EAC, MCM, and TWB, and to the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPQ, http://www.cnpq.br) for the research fellowships to AAG, APS, and RV. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
                Categories
                Research Article
                Agriculture
                Agricultural Biotechnology
                Marker-Assisted Selection
                Biology
                Biotechnology
                Plant Biotechnology
                Marker-Assisted Selection
                Computational Biology
                Genomics
                Genome Analysis Tools
                Transcriptomes
                Genetics
                Plant Genetics
                Crop Genetics
                Genomics
                Genome Analysis Tools
                Transcriptomes
                Plant Science
                Agronomy
                Plant Breeding

                Uncategorized
                Uncategorized

                Comments

                Comment on this article