Unprecedented High-Resolution View of Bacterial Operon Architecture Revealed by RNA Sequencing

There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

Abstract

INTRODUCTION Escherichia coli burst into the realm of model organisms with the discovery of conjugation by Joshua Lederberg in 1946 (1). Just 15 years later, Francois Jacob and Jacque Monod proposed the operon model in E. coli (2). For two-thirds of a century, E. coli has been an important vehicle for scientific investigation, playing a role in research resulting in no fewer than 10 Nobel prizes (1 – 10). The E. coli K-12 genome was among the early ones sequenced (11) and E. coli is unique among model organisms, possessing biochemical or genetic evidence for functions for ca. 75% of its known genes, making it arguably the best understood organism (12). Examination of its genome sequence confirmed what had long been surmised, that genes of related function are frequently arranged in operons (13 – 15). Soon after the discovery of the lac operon, it became clear that not all operons are transcribed as discrete units of information neatly arranged end to end on the genome. First, it was recognized that regions of phage lambda are transcribed on complementary strands (16). Over the next 40 years, operons were studied, one or two at a time, in line with the technology of the day, revealing occasional glimpses of transcriptional complexity arising from overlapping, divergent (17, 18) and convergent operons (19, 20). Second, the perception of transcriptome complexity was forever changed when it was found that at least one antisense (AS) transcription start site is associated with nearly one-half (46%) of Helicobacter pylori genes (21). There is also a substantial amount of AS transcription in E. coli (22 – 24). While some researchers suggested that extensive AS transcription is a “by-product” of the transcription machinery, largely because AS transcripts did not appear to be conserved in enteric bacteria (25), others concluded the opposite, that AS RNA has an important role in transcriptional regulation (26 – 32). The recent identification and sequencing of 316 potentially functional double-stranded RNAs in E. coli is a step toward laying the argument to rest (33). The “excludon” concept of AS RNA control of divergent operons ascribes an important function to overlapping, complementary transcripts (34). A recent study of Staphylococcus aureus suggests that AS transcripts drive RNase III-mediated RNA processing, although a comparison of the AS RNA content of selected bacteria led the authors to infer that the mechanism is prevalent in Gram positives but absent in Gram negatives (30). Amid the mounting evidence for transcriptional complexity in bacteria and the finding that AS transcripts are prevalent in bacteria, we undertook a comprehensive transcriptome analysis of E. coli. RNA sequencing (RNA-Seq) offers tremendous power for high-resolution transcriptome analysis. However, the fullness of its power has yet to be realized for E. coli, because all previous studies of the E. coli transcriptome failed to annotate both the 5′ and 3′ transcript ends and hence operons were not precisely mapped. We therefore developed an organizational schema described herein to precisely map RNA-Seq reads across entire operons, including both the 5′ and 3′ transcript ends, and to annotate these data in the context of the operon arrangement on the transcriptome. Though others used tiling microarray technology to address bacterial transcriptome organization (28, 35), tiling arrays did not have the resolving power to define transcript ends precisely or to elucidate operons with multiple promoters. More recent transcriptome mapping studies of E. coli relied on 5′ end mapping to identify transcription start sites (TSSs) (36, 37). However, our own critical examination of these data sets revealed extensive discrepancies that call into question many candidate TSSs, reinforcing the need for alternative promoter-mapping strategies (38). Recent RNA-Seq analyses of E. coli were also unfortunately not designed to map transcript ends accurately because they relied on randomly primed cDNA synthesis (39) or they had a resolution of only ca. 50 nucleotides due to low sequence read coverage (40). The recent development of differential RNA-Seq technology allowed mapping TSSs in Helicobacter pylori (21) and Salmonella enterica (29, 41); however, operon architecture was not determined because the 3′ ends were not mapped. In evaluating these approaches, we recognized that identification of both 5′ and 3′ transcript ends is essential for precise mapping of transcriptional regulatory features. Considering the foundational role of E. coli in the life sciences, high-resolution RNA-Seq will stimulate progress by unambiguous mapping of the features that control transcription. To annotate operons and characterize their response to carbon starvation, we obtained a time series of RNA samples from wild-type E. coli K-12 BW38028 cultures grown to stationary phase on chemically defined, carbon source-limited (glucose) minimal medium. We chose these conditions because they are intrinsic to the physiology that allows E. coli to colonize the mammalian intestine yet survive in the environment until encountering a new host and, in the case of E. coli pathogens, cause disease (42). We analyzed these RNA samples by deep sequencing with a strand-specific RNA ligation approach (43) that ensures full read coverage and precise mapping of both the 5′ and 3′ transcript ends. In practice, only three transcriptional features are needed to define operon architecture, regardless of complexity. These are the 5′ ends (promoters), the 3′ ends (terminators), and sufficient RNA-Seq read coverage to connect the ends, which together define operons (Fig. 1). Our analyses revealed an unprecedented high-resolution view of E. coli operon architecture. Our analytical approach allowed us to test the hypothesis that bacterial operon structure accommodates substantial transcriptional complexity. We offer our annotated E. coli K-12 operon map as a community resource upon which others can participate in annotating additional transcriptional regulatory features (GEO accession no. GSE52059). FIG 1 Single-nucleotide resolution of promoters and terminators in example complex operons. (A) The bolA operon contains transcription units (TUs) P-453657:T-454091 (red arrow) and S-453688:T-454091 (orange arrow). RNA-Seq data are shown in a JBrowse visualization of positive-strand (red) transcription in logarithmic- and stationary-phase samples (average from three replicates). The base count data were normalized and log2 transformed such that track heights in JBrowse are directly comparable. (B) bolA promoter region showing primary promoter P-453576 and secondary promoter S-453658 at single-nucleotide resolution (drawn to scale). (C) Plot of promoter usage (average count of 10 bases beginning at TSS) and TU usage (average count of bases within TU) for 10 growth curve time points showing bolA induction upon entry into stationary phase (see Fig. S1 for growth curve). (D) Terminator usage (average counts of 10 bases preceding and following terminator) is shown for T-1066062, which is shared by converging operons agp on positive strand (red) and wrbA-yccJ on negative strand (blue). RESULTS AND DISCUSSION Single-nucleotide resolved RNA-Seq data sets. E. coli K-12 has served as an important model organism for more than a half century and was the first bacterium analyzed by DNA microarray technology (44, 45). While several other bacteria have now been analyzed by RNA-Seq (21, 26, 28, 31, 41, 47 – 49), the limited RNA-Seq studies of E. coli have not provided single-nucleotide resolution (39, 40). Herein, we used a strand-specific RNA ligation-based RNA-Seq strategy, which when coupled with a robust analytical approach, allowed us to define transcriptional features across the whole E. coli genome at single-nucleotide resolution. We acquired time series of RNA samples from duplicate cultures of E. coli K-12 BW38028 and its isogenic rpoS mutant BW39452 during logarithmic- and stationary-phase growth on glucose-limited minimal medium (see Fig. S1 in the supplemental material). In total, we sequenced 26 RNA samples to generate a data set of 72.1 million uniquely mapped sequence reads corresponding to 5.5 gigabases of RNA-Seq data (see Table S1). Appropriate temporal expression of bolA, a known glucose starvation-inducible gene (50), confirmed that our RNA-Seq data correctly represented the growth conditions (Fig. 1). Our ongoing analyses of the RpoS regulon will be reported elsewhere. The correlation between replicate cultures was >0.96 (see Fig. S1); this level of biological replication provides a reliable view of the E. coli K-12 transcriptome (Fig. 2). The data are available at GEO (accession no. GSE52059). FIG 2 Genome-wide promoter locations and annotated transcriptome map of a selected region. (A) Promoters aligned by genome location. Line heights correspond to normalized, TEX-enriched promoter usage values (see text for details), shown for logarithmic phase (black) and stationary phase (orange). (B) Annotated regulatory features of a selected region of the genome. Positive-strand RNA-Seq data (red) and negative-strand data (blue) were normalized for comparison between logarithmic- and stationary-phase samples. Primary promoters and corresponding TUs (red) are indicated by arrows extending from promoter to terminator, as are secondary promoters (orange), internal promoters (purple), and AS promoters (green). Beginning on the left, rmf is transcribed from a primary promoter and depending on growth conditions terminates either before or within the ycbZ-fabA operon, which has a primary promoter upstream of ycbZ, an internal promoter within ycbZ, and a secondary promoter upstream of fabA. matP is transcribed from primary and secondary promoters. ompA is transcribed from a secondary promoter in logarithmic phase and is cotranscribed from the primary promoter of the sulA-ompA operon during stationary phase. An AS TU that overlaps the sulA sense transcript is turned on in stationary phase. The sxy and yccF-yccS operons converge. Finally, mgsA is transcribed as an independent TU from a secondary promoter in logarithmic phase and also is expressed in the yccT-mgsA operon from a promoter that is active only in stationary phase. (C) Plot of TU base counts for ycbZ-fabA operon, colorized according to color scheme in panel B; (D) TU plot of sulA-ompA operon; (E) TU plot of yccFS operon; (F) TU plot of yccT-mgsA operon. We developed an in-house computational tool to convert the binary read alignment (BAM) files to base count (WIG) files to facilitate single-nucleotide resolution analyses. We normalized our base count data by using a strategy analogous to the total count approach (51) for normalizing gene-specific read alignments. Accordingly, the resulting WIG files contain only the base location and the number of times each base is read (sequenced) and are >100-fold smaller than the sample read alignment (SAM) files. Advantages of this simple base count approach are severalfold: first, the data are inherently more computable; second, normalization of base count data makes all samples directly comparable and eliminates transcription unit (TU) length bias; third, the base counts of individual features can be computed and queried at any desired resolution from single nucleotide to an entire operon. Since we analyzed RNA-Seq reads at the base count level, the normalized base counts can be readily averaged across any range of bases to calculate the relative usage of transcriptional features, including promoters, terminators, TUs, and operons (Fig. 1). We empirically determined the number of bases used to calculate promoter usage by comparing the single base count value at the TSS to 3-, 5-, 10-, and 20-base averages, each beginning at the TSS. In practice, the shorter base count lengths were highly variable, presumably because of staggered starts that are occasionally observed in primer extension experiments (52) and were frequently observed in the RNA-Seq data sets. However, a 20-base-count length was too long to allow discrimination of closely spaced promoters. We therefore used 10-base average counts for quantifying promoter usage (Fig. 1). The same 10-base average worked well for calculating terminator efficiency by comparing the 10-base average counts before and after the termination site (Fig. 1 and 3). We used these base count values to calculate the usage of individual transcription features as well as the impact of operon structure on relative TU and gene expression. Promoter mapping. Our search for promoters was driven by mapping of putative TSSs on the basis of (i) enrichment with terminator exonuclease (TEX), which degrades RNA molecules with 5′-monophosphate ends and consequently enriches for 5′-triphosphate ends corresponding to the nucleotide initiated de novo by RNA polymerase (18); (ii) promoter motif analysis; (iii) consensus among replicate data sets; and (iv) sigma factor-specific RNA polymerase binding (SELEX). None of these approaches alone is comprehensive, because each gives rise to false-positive results and fails to find all TSSs (20). For example, TEX treatment does not enrich for some TSSs because RppH phosphatase activity removes the 5′-triphosphates (53). Additionally, not all promoters have consensus motifs that can be identified by computer algorithms (54), nor do all promoters bind RNA polymerase in vitro (55). To facilitate promoter mapping, we wrote a simple algorithm to search for changes in base count values exceeding 2-fold in replicate TEX-enriched and coverage data sets (n = 14, wild-type [WT] and rpoS culture samples from log and stationary phase). The TSSs of highly expressed genes were apparent in all 14 replicates. However, since the 14 samples represented both logarithmic- and stationary-phase samples, expression of some promoters was condition specific. In order to generate transcriptome maps that are condition independent for annotating the response to many conditions in the future, we chose a consensus in which three replicates of either logarithmic- or stationary-phase samples have TSSs at the identical base locations as a starting point for promoter mapping. This conservative strategy revealed 11,329 putative TSSs, a value that is similar to the number of promoters found by Thomason and Storz (submitted for publication), and includes known promoters of weakly expressed genes. However, this number far exceeds the expected promoter density on the E. coli genome, thus exemplifying the need to use a multifaceted approach to confirm promoters. To identify candidate promoters missed by TSS mapping of regions that had few RNA-Seq reads, we employed genomic SELEX screening, which was developed for quick identification of genes under the control of specific transcription factors (57). Confirmation of tentative TSS’s by RNAP binding was previously employed for promoter mapping of Salmonella enterica serovar Typhimurium (29). Sites that bound RpoD in vitro, exceeding a conservative signal-to-background ratio threshold of 3.0, and corresponded to RNA-Seq reads expressed in vivo identified 1,254 additional candidate promoters (see Fig. S2 in the supplemental material). Next, we used a bioinformatics approach to search the 50-bp sequences immediately upstream of the 12,583 putative TSSs for promoter motifs by using FIMO software (58) to screen against a library of E. coli promoter motifs available at DPInteract (59). We found it was necessary to modify the RpoD promoter library according to the characterization of 554 promoters by Mitchell et al. (60), which demonstrated that the RpoD consensus promoter has −10 and −35 regions with spacing of 14 to 20 bases between promoter elements. The search output was restricted to promoter sequences correctly positioned within ±3 bases of the TSS, with E values corresponding to P values of <0.02. This multifaceted approach yielded 5,653 putative RpoD-dependent promoters, which we evaluated further by manual annotation, which involved direct visual observation. A visual graphic environment (JBrowse [61]) interfaced to an Oracle database facilitated manual annotation documentation. From the list of candidate promoters, we created a JBrowse track at the corresponding base locations, each displaying a “clickable” URL call to the database that automatically recorded the base location and allowed manual entry of metadata, including the type of promoter, regulatory information supported by differential expression analysis, and comments. We annotated only promoters that could be experimentally associated with operons, by using RNA-Seq data as described in the next section. This comprehensive strategy yielded 2,122 vegetative promoters (Fig. 2), which more than doubled the 811 individually characterized E. coli promoters annotated in RegulonDB and calls into question several thousand candidate promoters that were identified by less reliable high-throughput strategies (35, 38). The promoter data set (see Table S2) is dominated by primary promoters (P), defined as the furthest upstream promoter in an operon (66.3%), with a lower number of secondary promoters (S) that are intergenic and downstream of P promoters (19.6%), internal promoters (I) that are intragenic (9.8%), and AS (4.2%) promoters. All possible arrangements and orientations of these promoter types were observed and collectively generated substantial complexity in the transcriptome (Fig. 2). It is well known that promoter strength, i.e., quality, varies greatly (60) and that variability is reflected in our data set. To quantify promoter quality, we scored the four criteria (metrics) used to map candidate promoters (see Table S2). The promoter quality score was calculated by applying a weighted matrix on the basis of 10 points, where TEX enrichment carries a weight of 4, the promoter motif score carries a weight of 3, the TSS consensus (between replicates) score carries a weight of 2, and the SELEX score carries a weight of 1. The resulting analyses yielded promoters scored on a scale of 0 to 10. The TEX enrichment metric reflects the number of instances among four TEX replicates in which the ratio of TEX-treated versus non-TEX-treated base counts (10-base-count average beginning at the TSS) for a sample exceeded 2-fold. The promoter motif score was calculated in quartiles of E values for RpoD-dependent promoter motifs as determined by using FIMO. The TSS consensus score was calculated as the number of occurrences of a TSS at a precise base location divided by the total number of samples evaluated (n = 14). The final metric was the presence or absence of SELEX-determined RpoD binding, which was scored as a 1 or 0. The 2,122 promoters ranged in score from 10 to 0.14, with the top 10% of promoters scoring above 7.8, the bottom 10% scoring below 2.9, and the average promoter scoring 5.5. We found no strong correlation between promoter usage (average count of first 10 transcribed bases) and promoter confidence scores or promoter motif scores (see Fig. S3 in the supplemental material), which is in agreement with an earlier report (60). However, we did find a weak correlation between promoter usage and TU usage (average count of bases from promoter to terminator) (see Fig. S3). We confirmed that TU usage and RNA half-life (62) (measured under similar conditions) do not correspond, as noted previously. Nevertheless, promoter and TU usage values do reflect the physiologically relevant transcript level, as the RNA concentration in the cell is determined both by the frequency of transcription initiation and the rate of RNA decay, which vary substantially for different transcripts (62). Operon mapping. To annotate operons, it was also necessary to map the 3′ transcript ends, which allowed documenting the connections between promoters and the corresponding downstream terminators (Fig. 1). Our criteria for operon annotation were (i) the P promoter must be followed by sequence read coverage across the entire operon, (ii) the mapped 3′ ends must extend beyond the stop codon of the last gene in the operon, (iii) the S and I promoters must increase read coverage of the downstream bases, and (iv) internal terminators must decrease coverage of downstream bases without interrupting contiguous coverage by readthrough transcripts. Our search for 3′ ends that can be associated with annotated promoter(s) by deep sequence read coverage throughout the operon led to mapping 1,774 candidate terminators (see Table S3 in the supplemental material), 264 of which lie within operons and apparently permit partial readthrough transcription of downstream genes, as demonstrated for the sdhCDAB-sucABCD operon (Fig. 3). We evaluated the 1,774 3′ ends by using TransTermHP (63) and confirmed that 623 have sequences characteristic of intrinsic terminators, which extends the number of annotated E. coli terminators previously annotated (227 [38]) by nearly 8-fold. It has been predicted that roughly one-half of terminators are intrinsic (64). The remaining 1,151 terminators that were not confirmed by TransTermHP are candidates for ones requiring Rho or another protein factor for termination. Since there is no computational approach to identify factor-dependent terminators, the data in Table S3 represent the most extensive genome-wide prediction of nonintrinsic terminators. FIG 3 Balanced transcript coverage of the sdhCDAB-sucABCD operon achieved by complex interaction of internal terminator and secondary promoter. (A) JBrowse instance showing coverage data; (B) terminator usage in logarithmic (WT_log_cmb_pos) and stationary (WT_stat_cmb_pos) phase; (C) TU coverage time series. The preceding analyses of only logarithmic- and stationary-phase samples revealed a total of 6,463 regulatory features, including 2,122 promoters (see Table S2), 1,774 terminators (see Table S3), and 2,566 transcription units (TUs) corresponding to 1,510 operons (see Table S4). The sequence reads covered more than 90% of bases within 90% of the annotated operons. The 1,510 operons cover 2,985 of 4,457 genes (approximately two-thirds) annotated on the reference genome. As more data sets and growth conditions are analyzed, our simple organizational schema should make it straightforward to add newly identified regulatory features to the E. coli K-12 transcriptome map. For ready distribution, we converted our data sets to GenBank format by using “promoter,” “terminator,” and “operon” as feature keys (65). This data format allows annotation of any number of experimental parameters that affect the usage of these features. Our E. coli K-12 transcriptome annotation GenBank feature table is available from GEO (accession no. GSE52059). Operon organization examples. The data in Fig. 2 unequivocally confirm that the E. coli genome is organized in operons. In its original conception, the operon has a regulatory region with a single promoter that initiates transcription of a polycistronic mRNA covering the lac operon genes and ends with a single terminator. Indeed, many E. coli operons fit this model or are even simpler if they contain a single gene (monocistronic). However, the whole E. coli transcriptome reveals densely packed regulatory features that cannot be discerned from the genome sequence alone (Fig. 2). Complex operons result from transcripts initiated by S and I promoters, as well as internal terminators. For example, sulA and ompA are independently transcribed during logarithmic phase, with each gene having its own promoter and terminator. However, during stationary phase, the sulA TU reads through a nonintrinsic sulA terminator to form the sulA-ompA transcript, driven by an S promoter that increases expression of the ompA TU (Fig. 2). An AS transcript that fully overlaps the sulA coding sequence is also switched on in stationary phase. This arrangement of the sulA-ompA operon and AS transcript was postulated as a means for posttranscriptional control of the synthesis of the cell division inhibitor SulA (66), which is further supported by our results. Our organizational schema makes the previously unannotated sulA AS transcript (12) and similar regulatory features readily apparent on the sulA-ompA transcriptome map (Fig. 2). Such differential expression of TUs within operons can provide bacteria with the ability to modulate gene expression to cope with physiological complexity (28, 30, 34, 41). It is especially notable that Fig. 2 reveals the E. coli transcriptome for only two growth conditions, log phase and stationary phase, due to carbon source limitation. Our analyses show that 29% of operons have more than one promoter and that 15% of operons have more than one terminator under these conditions (Fig. 4). Further, many operons are subject to multiple regulatory inputs (38) that have not been examined here. Differential mRNA decay can also provide an additional layer of control within operons (62). No doubt, future RNA-Seq analysis of the myriad responses to numerous regulatory signals is likely to reveal substantially more variation in operon architecture, as seen for Salmonella (41). FIG 4 Computational analysis of single-nucleotide resolution data reveals complex operon architecture. (A) Operons organized by increasing complexity; (B) TU usage plot of ligT-sfsA-dksA-yadB-pcnB-floK operon. The primary TU corresponding to the entire operon is shown in red. The differentially expressed dksA-specific TU driven by promoter I-161376 is shown in purple. The pcnB-folK TU driven by S-159171 is shown in orange. Note that transcript levels of dksA increase upon entry into stationary phase, whereas pcnB-folK decreases. (C) JBrowse instance showing ligT-sfsA-dksA-yadB-pcnB-floK operon; (D) TU usage plot of ybdK-ybdJ-ybdF-nrsB-mbcM operon. Note the primary TU corresponding to the entire operon (red) decreases only slightly during transition from logarithmic phase into stationary phase, because it is comprised of two differentially expressed TUs, one of which increases and the other decreases during growth: the nfsB-mbcM-specific transcript (orange) essentially disappears in stationary phase, whereas the ybdK-specific transcript (blue) is induced in stationary phase. (E) JBrowse instance of ybdK-ybdJ-ybdF-nrsB-mbcM operon. Intricacy is readily apparent for operons with internal promoters and terminators. For example, three promoters upstream of the ahpCF operon contribute to its expression in an additive fashion (Fig. 5). Such an arrangement allows differential control of alkylhydroperoxidase production in response to stationary phase, osmotic stress, and oxidative stress (67). Likewise, three promoters contribute to ybfE-fldA-uof-fur operon expression during logarithmic phase, allowing for continuation of uof-fur TU expression, decline of fldA expression, and turnoff of ybfE expression in the stationary phase (Fig. 5). Although cotranscription of the complex ybfE-fldA-uof-fur operon was not previously recognized (68), it makes sense for uof-fur to be transcribed independently of ybfE-fldA under certain conditions, because fur encodes a negative regulator of genes for iron uptake. Furthermore, uof expression is controlled indirectly by the trans-acting noncoding RNA RhyB, which is itself Fur regulated, thus forming a negative feedback loop responsive to iron limitation (68). The ability to unravel condition-specific terminator usage by our organizational schema is illustrated for the internal terminator of the sdhCDAB-sucABCD operon, which encodes three enzymes of the tricarboxylic acid (TCA) cycle (Fig. 3). This arrangement explains how intrinsic termination can allow one operon to function independently as two operons under appropriate conditions (69). These examples demonstrate how our promoter and terminator usage calculations can reveal new biological insights from the RNA-Seq transcriptome analyses. FIG 5 Three promoters contribute to expression levels of genes within the ahpCF and the ybfE-fldA-uof-fur operons. (A) WT time series of TU base counts of three overlapping TUs within the ahpCF operon; (B) usage of 3 ahpC promoters (10-base average from TSS +1 to +10) during logarithmic phase (time point 4); (C) TU coverage time series of the ybfE-fldA-uof-fur operon; (D) differential usage of three promoters within the ybfE-fldA-uof-fur operon during log phase. Promoter usage and TU coverage calculations are described in the legend to Fig. 1. Catalogue of operon architecture. High-resolution mapping of well-characterized regions of the genome provided glimpses of intricate operon arrangements (Fig. 2 to 5). Our analyses of E. coli operons at single-nucleotide resolution further revealed numerous instances of complexity genome-wide. Single-gene operons with a single promoter and terminator make up 47% of all operons, while 17% are “traditional” operons with more than one gene and a single promoter and terminator (Fig. 4). The remaining operons (36%) are more complex: 21% have multiple (as many as 8) promoters, 7% have multiple (as many as 4) terminators, and 8% have both multiple promoters and multiple terminators. The average operon has 1.98 genes (see Table S4 in the supplemental material). In our data set, the most complex operon, which encodes several core cellular functions, has 14 genes, 8 promoters, 4 terminators, and 23 TUs (yjeF-yjeE-amiB-mutL-miaA-hfq-hflX-hflK-hflC-yjeT-purA-nsrR-rnr-rlmB operon; see Fig. S4). Differential TU expression within operons can result from the activity of S and I promoters, internal terminators, and combinations of these regulatory features. For example, Fig. 4 illustrates how an I promoter and internal terminator can function together to increase expression of the DksA-specific TU in stationary phases. For the ybdK operon, Fig. 4 shows differential expression of the 5′ and 3′ TUs of the operon caused by transcription from an S promoter and an internal terminator. This arrangement of features results in a complete inversion in expression of the 2 TUs between logarithmic and stationary phases. These findings suggest that operon architecture permits E. coli to adjust relative levels of gene expression within the same operon in response to environmental conditions. To quantify differential gene expression within E. coli operons, we compared the base counts of TUs within the same operon under the same growth condition and tabulated the complexity that arises from internal promoters and terminators (see Table S6 in the supplemental material). Of 548 complex operons displaying multiple TUs due to having multiple promoters or terminators (Fig. 4), 327 showed more than 2-fold differential expression of 1 TU compared to other TUs within the same operon (see Table S6). Of 633 operons containing more than one gene, we observed 2-fold or greater differential gene expression within 315 operons (e.g., see Fig. 4). In the case of polycistronic operons that have only a single promoter and terminator, it appears that differential decay of the processed transcripts is responsible. In total, 43% (642 of 1,510) of all E. coli operons show a complex gene expression regulatory pattern (see Table S6). Clearly, differential expression of TUs and genes within the same operon is common in E. coli. Our analyses provided the opportunity to map potential AS transcription across the transcriptome. In many cases, AS transcripts completely overlap and are complementary to sense strand transcripts that encode proteins; however, these AS transcripts do not appear to encode proteins. For example, the long AS RNA that is complementary to sulA does not appear to be translated, because it has no properly positioned ribosome binding site nearby a start codon and thus is likely to be a long noncoding RNA (lncRNA). We found 18 transcripts for annotated protein-coding genes and small RNAs that completely overlap operons transcribed in the opposite direction (see Table S5). As a result of this arrangement, the 18 corresponding operons contain long noncoding AS transcripts that overlap the coding sequences on the opposite strand. Since genome annotation relies heavily on identification of coding sequences, we predicted that our transcriptome analysis would reveal unannotated genes. Indeed, we identified 96 transcripts that do not correspond to genes on the reference genome and were previously unannotated in E. coli K-12 (see Table S5). These include 89 AS transcripts that have an average length of 397 bases, the longest of which is 1,168 bases. The remaining 7 transcripts are completely intergenic and do not overlap annotated genes. None of the 96 transcripts appear to code for protein because they all have multiple stop codons in all three reading frames. Of the 89 AS transcripts, 21 are convergent with known operons that code for proteins, 7 are divergent with mapped operons, and 40 completely overlap annotated operons. The remaining 21 AS transcripts overlap known genes that could not be annotated into operons by RNA-Seq. The genomic regions corresponding to 72% of these lncRNAs are highly conserved in >50 E. coli and Shigella genomes. It was proposed previously that bacterial lncRNAs may be functional (30, 34), yet this was questioned by others (25). Similar lncRNAs have also been found in eukaryotes, and although they are not well understood, they are thought to play a role in regulating gene expression (70). A recent study of terminator efficiency showed that only 3% of E. coli terminators are “strong” (71). Inefficient termination would explain how convergent operons sometimes have overlapping transcription (19, 20). Therefore, we tested the hypothesis that partial termination between convergent operons would generate complementary 3′ transcript ends and add further complexity to the transcriptome. Figure 1 shows an intrinsic terminator located between convergent operons, which terminates transcription by 4-fold yet allows readthrough transcription of 329 bases of AS RNA for the 3′ end of the convergent operon transcript. Our analyses of 370 instances of convergent operons revealed that 75% show transcription into an adjacent operon to generate complementary 3′ transcript ends that overlap by an average of 286 bases, the longest of which is 1,395 bases (see Table S5). In genome regions where there are many highly transcribed operons, it is more likely to observe convergent transcription. Of the genome regions corresponding to these convergent operons, 74% are highly conserved at the nucleotide sequence level in >50 E. coli (and Shigella) complete genomes. It is thus reasonable to suggest that overlapping transcription of convergent operons is a general property in bacteria. Transcription of divergent operons can result in overlapping transcripts (17, 18). Complementary transcripts generated by divergent promoters recently have been called “excludons,” which are thought to act as negative regulators of genes on the opposite strand (34). Our analyses of 388 instances of divergent operons revealed that 35% have promoters arranged such that their 5′ transcript ends overlap by an average of 168 bases, the longest of which is 1,012 bases (see Table S5). The genome regions corresponding to 81% of these overlapping divergent operons are highly conserved in >50 E. coli (and Shigella) genomes. The finding of sequence conservation says nothing about functionality, but our finding that over one-third of divergent operons generate overlapping complementary transcripts does suggest that excludons may be prevalent in bacteria. Comparison to other data sets. We compared our AS transcript annotations to other high-quality data sets using a conservative approach. We compared our data sets to highly expressed and experimentally verified AS transcripts from those studies. A contemporaneous single-nucleotide analysis of the E. coli transcriptome by Storz, Sharma, and colleagues (submitted for publication) focused on AS transcripts. They found that most previously annotated sRNAs are present at high levels, so we compared our AS RNA data set to the most highly expressed AS RNAs in their study. Our data corroborate 74 of their 127 highest-expressed AS RNAs. Furthermore, we corroborated 6 of 14 candidate AS RNAs tested on Northern blots by the Storz group. However, while their gels verified 6 of the 14, we corroborated only 2 of those 6, indicating that there is substantial variability in these high-throughput data sets. A recent coimmunoprecipitation study of the double-stranded E. coli transcriptome revealed 316 double-stranded RNAs, including partially and fully overlapping transcripts as well as many generated by divergent and convergent operons (33). Our analyses predicted AS RNAs corresponding to 13 of 21 double-stranded RNAs that were verified in Northern blot analysis (33). It is tempting to speculate that AS RNAs that are corroborated by RNA-Seq studies, are verified by Northern blot analysis, and correspond to highly conserved genome sequences are functional. However, functions have been confirmed for only a limited number of AS RNAs (56, 72). It remains to be seen how many of the AS RNAs identified by RNA-Seq will prove to be expressed in the same cell as the sense transcript and display a yet unknown phenotype. Bacterial operons compared to eukaryotic genes. It did not escape our attention that the widespread occurrence of bacterial operons with multiple TUs in some ways resembles alternative splicing of eukaryotic transcripts. From both bacterial operons and eukaryotic genes arise primary transcripts that are divided into alternative transcripts by the activity of transcriptional regulatory features, i.e., internal promoters and terminators in bacteria and RNA splice junctions in eukaryotes. The potential for eukaryotic gene complexity is reflected in the number of exons per gene. The number of exons per gene in Saccharomyces cerevisiae is 1.1 (73), which is considerably fewer than the 1.7 TUs per operon in E. coli. In contrast, higher organisms have 4 to 9 introns per gene (74), making them more complex than E. coli. Perhaps the loss of exons that is proposed to have happened in budding yeasts during evolution from more primitive eukaryotes accentuates their difference from E. coli and higher organisms (75). We conclude that E. coli possesses operon complexity comparable to analogous gene structures in budding yeasts. Concluding statement. This study reveals the power of single-nucleotide resolved RNA-Seq data sets for pinpointing transcriptional features across the genome, which we used to annotate operons by precisely mapping their 5′ and 3′ ends. We found an astounding level of overlapping transcription where complementary RNAs are transcribed from both strands, such as those generated by several hundred convergent and divergent operons. We discovered more than 100 long AS transcripts overlapping operons that also were transcribed on the sense strand. In sum, we found that approximately one in three (519 out of 1,510) operons at least partially overlaps with other operons to generate AS RNA. These AS transcripts are highly conserved in E. coli and appear to be noncoding RNA, suggesting that they are involved in regulation of gene expression, as has been proposed for excludons in bacteria (34) and lncRNAs in eukaryotes (70). We also found 7 transcripts that did not correspond to an annotated gene and therefore represent previously unrecognized yet potentially functional operons. The transcriptome intricacy we observed in E. coli appears to be a general property of the domain bacteria, as the transcriptomes of several other bacteria appear to be similarly intricate (21, 26, 28, 31, 41, 47 – 49). Whether the same is true of the Archaea must await high-resolution RNA-Seq analysis of representatives of this domain of life (83). Since operon arrangements are more highly conserved than gene repertoires (76), it is interesting to speculate that the requirements of primordial life led to the evolution of an operon architecture in bacteria which accommodates substantial variation in gene expression. MATERIALS AND METHODS Bacterial strains and growth conditions. To annotate operons and characterize their response to carbon starvation, wild-type E. coli BW38028 and E. coli BW39452 (ΔrpoS::cat) were grown in 1 liter of morpholinepropanesulfonic acid (MOPS) minimal medium (77) containing 0.2% glucose in a fermenter at 37°C with constant pH and aeration. MOPS medium solutions were modified as described elsewhere (78), which permits preparation of 40× “M” stock solution, giving the same final medium recipe (77). Cultures were sampled at 10 time points during growth of E. coli BW38028 and at five time points for E. coli BW39452, as shown in Fig. S1 in the supplemental material. Logarithmic- and stationary-phase samples were duplicated from replicate cultures. RNA sequencing. RNA was prepared by using an RNeasy kit (Qiagen, USA). Because small RNAs may be preferentially lost during column purification, they are likely underrepresented in our data sets. Replicates of logarithmic- and stationary-phase RNA were treated with Terminator 5′-phosphate-dependent exonuclease (Epicenter, USA) to enrich 5′-triphosphate mRNA fragments for TSS mapping. RNA sequencing libraries (see Table S1) were prepared by using the strand-specific, ligation-based SOLiD Total RNA-Seq kit. Paired-end sequencing was performed on the SOLiD 4 Genetic Analyzer at Purdue University Genomics Facility. Raw data processing. Sequence reads were aligned to the E. coli MG1655 reference genome (U00096.2) with Bowtie version 1.8 (79). For the first pass, we used paired-end color space mapping with a distance cutoff of 350 bases between read mates. Bowtie parameters were set to include only perfect matches and retained only one alignment where a read mapped to more than one genome location. In practice, we found the efficiency of paired-end mapping was between 3 and 10%. To improve the overall alignment, we mapped the orphan 5′- and 3′-end reads in two additional passes with Bowtie (one for the 5′ reads and one for the 3′ reads). The output of the three passes through Bowtie was three SAM files for each sample. Overall, we achieved 40 to 60% mapping efficiency with this three-pass strategy. SAMtools (80) utilities were used to convert SAM files to BAM format and to sort and index them. The binary read alignment (BAM) files were displayed in Integrated Genome Viewer (IGV version 2) for primary analysis and quality control. The BAM files were then converted to base count (WIG) files. We accomplished this by using an in-house script to extract strand-specific base count data from BAM files (outputs are positive- and negative-strand WIG files). First, our solidbam2wig.pl script reads in the paired-end BAM file and counts the nucleotides spanning inserts between the mated 5′ and 3′ reads. Next, the script pulls in the orphan 5′ and 3′ reads from the respective BAM files and increments the base counts at each base location without duplicating the reads already incremented from the paired ends. Base count data were then normalized based on the assumption that reads are randomly distributed across the genome and that if sequencing was sufficiently deep, all expressed transcripts would be represented in the data set (39). In practice, SOLiD sequencing did not generate data sets in which the lowest-abundance transcripts were fully covered by contiguous reads. In addition, inefficient ribo-depletion can bias the number of reads that map to non-rRNA genes. Our normalization strategy accounts for both of these factors by maximizing TU coverage and removing rRNA reads during data processing. Our in-house script, normWIG.pl, reads in the raw WIG files. A simple global normalization approach was utilized that multiplied the count at each base location by 1 billion and divides that value by the sum of base counts at all base locations in the file. This normalization strategy is analogous to the total count approach used for normalizing gene-specific read alignments (51). In this way, the base counts are expressed as parts per billion. For display in JBrowse (61), the normalized WIG files were converted to BIGWIG files by using SAMtools (80). Analysis of the data was conducted in a graphic user interface consisting of JBrowse (61) and an Oracle database. SELEX. Genomic SELEX was previously described (81). Antibodies against RpoD sigma, RpoS sigma, and core enzyme subunits were produced in rabbits by injecting purified sigma proteins (82). Nucleotide sequence accession number. RNA sequencing data and curated results were deposited at Gene Expression Omnibus, accession no. GSE52059. SUPPLEMENTAL MATERIAL Table S1 RNA-Seq data sets. Table S1, XLSX file, 0 MB. Table S2 Promoters. Table S2, XLSX file, 0.6 MB. Table S3 Terminators. Table S3, XLSX file, 0.2 MB. Table S4 Transcription units. Table S4, XLSX file, 0.3 MB. Table S5 Overlaps. Table S5, XLSX file, 0.1 MB. Table S6 Complexity analysis. Table S6, XLSX file, 0.6 MB. Figure S1 Growth conditions for total RNA sampling and base count data replicates. (A) Wild-type E. coli BW38028 was grown on MOPS glucose minimal medium in a 2-liter Biostat B fermenter (Braun Biotech) with a 1-liter working volume at 37°C, pH was kept constant at 7.4 by the addition of 1 M NaOH, and dissolved oxygen was maintained above 40% of saturation by adjusting the agitation speeds in the range of 270 to 500 rpm with fixed 1.5 liters/min airflow. Total RNA was prepared from culture samples taken at 10 time points, indicated by red arrows. Replicate samples from duplicate cultures were taken at times indicated by asterisks. (B) E. coli BW39452 (ΔrpoS::cat) grown as described for panel A; (C) normalized transcription unit (TU) usage values from replicate 1 plotted against values from replicate 2 for logarithmic-phase samples; (D) normalized TU usage values from replicate 1 plotted against values from replicate 2 for stationary-phase samples. All annotated TUs (see Table S2) are plotted. The trend line is shown as a solid black line. The correlations are R = 0.97 for stationary-phase samples and R = 0.96 for logarithmic-phase samples. Download Figure S1, PDF file, 0.1 MB Figure S2 Genomic Selex data for RpoD RNA polymerase holoenzyme binding in vitro. Line heights corresponding to binding values are shown according to their binding locations on the genome, in base pairs. Download Figure S2, PDF file, 0.5 MB Figure S3 Comparison of promoter usage to promoter metrics and TU usage. (A) Promoter motif score versus promoter usage; (B) promoter confidence score versus promoter usage; (C) TU usage versus promoter usage. Usage values were determined from normalized, log2 base count data, as described in detail in the text. Download Figure S3, PDF file, 0.1 MB Figure S4 Complex yjeF-yjeE-amiB-mutL-miaA-hfq-hflX-hflK-hflC-yjeT-purA-nsrR-rnr-rlmB operon. This operon has 8 promoters and 4 terminators and contains 23 transcription units created by transcription initiation from S and I promoters, as well as termination and transcriptional readthrough at internal terminators. Download Figure S4, PDF file, 0.1 MB

Related collections

Most cited references 70

Record: found
Abstract: found
Article: not found

Assessing computational tools for the discovery of transcription factor binding sites.

Martin Tompa, Nan Li, Timothy L. Bailey … (2005)

The prediction of regulatory elements is a problem where computational methods offer great hope. Over the past few years, numerous tools have become available for this task. The purpose of the current assessment is twofold: to provide some guidance to users regarding the accuracy of currently available tools in various settings, and to provide a benchmark of data sets for assessing future tools.

0 comments Cited 375 times – based on 0 reviews      Review now

Bookmark

Record: found
Abstract: found
Article: not found

Self-splicing RNA: autoexcision and autocyclization of the ribosomal RNA intervening sequence of Tetrahymena.

Kelly Kruger, Paula J. Grabowski, Arthur J. Zaug … (1982)

In the macronuclear rRNA genes of Tetrahymena thermophila, a 413 bp intervening sequence (IVS) interrupts the 26S rRNA-coding region. A restriction fragment of the rDNA containing the IVS and portions of the adjacent rRNA sequences (exons) was inserted downstream from the lac UV5 promoter in a recombinant plasmid. Transcription of this template by purified Escherichia coli RNA polymerase in vitro produced a shortened version of the pre-rRNA, which was then deproteinized. When incubated with monovalent and divalent cations and a guanosine factor, this RNA underwent splicing. The reactions that were characterized included the precise excision of the IVS, attachment of guanosine to the 5' end of the IVS, covalent cyclization of the IVS and ligation of the exons. We conclude that splicing activity is intrinsic to the structure of the RNA, and that enzymes, small nuclear RNAs and folding of the pre-rRNA into an RNP are unnecessary for these reactions. We propose that the IVS portion of the RNA has several enzyme-like properties that enable it to break and reform phosphodiester bonds. The finding of autocatalytic rearrangements of RNA molecules has implications for the mechanism and the evolution of other reactions that involve RNA.

0 comments Cited 362 times – based on 0 reviews      Review now

Bookmark

Record: found
Abstract: found
Article: not found

Comprehensive comparative analysis of strand-specific RNA sequencing methods

Joshua Levin, Moran Yassour, Xian Adiconis … (2010)

Strand-specific, massively-parallel cDNA sequencing (RNA-Seq) is a powerful tool for novel transcript discovery, genome annotation, and expression profiling. Despite multiple published methods for strand-specific RNA-Seq, no consensus exists as to how to choose between them. Here, we developed a comprehensive computational pipeline to compare library quality metrics from any RNA-Seq method. Using the well-annotated Saccharomyces cerevisiae transcriptome as a benchmark, we compared seven library construction protocols, including both published and our own novel methods. We found marked differences in strand-specificity, library complexity, evenness and continuity of coverage, agreement with known annotations, and accuracy for expression profiling. Weighing each method’s performance and ease, we identify the dUTP second strand marking and the Illumina RNA ligation methods as the leading protocols, with the former benefitting from the current availability of paired-end sequencing. Our analysis provides a comprehensive benchmark, and our computational pipeline is applicable for assessment of future protocols in other organisms.