Introduction Cellular transcriptomes are dictated by complex interactions between signal transduction pathways, general and specific transcription factors, chromatin remodeling proteins, and the RNA polymerase complexes. Precise transcriptional responses are achieved in part by targeting transcription factor complexes to the cis-regulatory regions of target genes via specific binding site sequences. The importance of these binding site sequences is reflected in the conservation of sequence motifs in coregulated genes and through evolution. A number of studies have examined transcriptomic changes to breast cancer cells following estrogen treatment [1–7]. Estrogen receptors (ERs) (specifically ERα and ERβ) are ligand-dependent transcription factors that mediate cellular responses to hormone exposure in vertebrate development, physiological processes, and endocrine-related diseases. ERα, in particular, has been implicated in the etiology of breast cancer and is a major prognostic marker and therapeutic target in disease management . At the molecular level, ERs interact either directly with genomic targets encoded by estrogen response elements (EREs) (5′-GGTCAnnnTGACC-3′) or indirectly by tethering to nuclear proteins, such as AP-1 and Sp1 transcription factors [9–11]. The mechanisms of ER binding site specificity, however, are not clear since these binding site sequence motifs are ubiquitous in the genome, and there is no discernable difference between functional and nonfunctional sites by computational modeling approaches. This ambiguity is likely due to a lack of systemic information on binding site usage and architecture and mechanistic complexity involving additional transcription factors and epigenetic modifications [12,13]. Chromatin immunoprecipitation (ChIP) assays have facilitated characterizations of in vivo protein-DNA complexes such as histone modifications and recruitment of transcription factor complexes to specific binding sites . Coupled with microarray technology, the ChIP-on-chip experiments have resulted in more global binding site maps for a number of human transcription factors in proximal promoters and in specific genomic regions. For example, Carroll and colleagues recently mapped ER binding sites first in human Chromosomes 21 and 22 and more recently across the entire human genome [15,16]. In spite of the success of ChIP-on-chip studies, there remain caveats regarding probe specificity and performance, including constraints on probe design in certain genomic regions and potential biases introduced by amplification protocols. Thus, the analysis of ERα binding sites using alternative genome-wide technologies is warranted. Previously we developed a high throughput cloning and sequencing approach for mapping full-length transcripts. By employing specialized cloning techniques and vectors, paired-end diTags (PETs) from ends of transcripts and this gene identification signature (GIS) can be sequenced and mapped precisely to the genome . The GIS-PET technique increases sequencing efficiency by 30-fold as compared to sequencing the entire transcript insert. We subsequently showed that binding site fragments from ChIP experiments can also be subjected to PET analysis to generate an unbiased whole-genome map of p53 tumor suppressor protein binding sites and demonstrated an association of binding sites and adjacent target genes with p53 functions in patient tumor samples . The PET technology has also been utilized to study OCT4 and Nanog binding sites in stem cell biology . To obtain a global map of ERα binding sites in breast cancer cells, we applied ChIP-PET to generate a library of ERα binding sites in MCF-7 cells following estrogen treatment. We then combined the binding site data with hormone-responsive and breast tumor sample microarray gene expression studies to discern correlations between ER binding, transcriptional activity, and disease status and outcome in patients. We also compared our findings with data from ChIP-on-chip studies to evaluate the performance of each respective technology. Herein, we describe the comprehensive cartographic results and outline the insights they provide into binding site usage and molecular mechanisms of ER transcriptional regulatory functions. Results ChIP-PET Analysis Mapped ER Binding Sites across the Human Genome Hormone-deprived MCF-7 cells were treated with 10 nM estradiol for 45 min, and then DNA-bound receptor complexes were isolated through ChIP using anti-ERα antibodies (HC-20, Santa Cruz Biotechnology, http://www.scbt.com). Prior to generating the PET library for sequencing, we qualified the ChIP products by measuring DNA fragment size and enrichment of known ER binding site in the pS2/TFF1 gene promoter after immunoprecipitation to ensure sample quality. The ChIP DNA fragments ranged from 300 bp to 1 kb, and there was an 80-fold enrichment of ER binding at the known pS2/TFF1 ERE as compared to the irrelevant antibody control. Once ChIP DNA quality had been confirmed, the PET library was constructed and sequenced as described previously . A total of 635,371 PETs were sequenced, of which 361,241 (~56.86%) were unambiguously mapped to unique loci in the human genome (hg17/NCBI build 35) and localized to 136,158 distinct genomic coordinates. One of the first questions we asked was whether the sequencing of these 635,371 clone “equivalents” in the form of pair-end tags provided sufficient representation of the chip library. To assess the degree of saturation of the PET library sequenced (the total number of distinct ChIP DNA fragments that can be captured from the library), we fitted a Hill Function  using extrapolated and historical sequencing data (see Figure S1). The degree of saturation of the ER ChIP PET library is estimated at 73.24% (136,158 actual/185,915 expected), suggesting that ~73% of the extrapolated hypothetical limit of coverage by our library was sequenced. Sequencing results are summarized in Figure 1A. Overlapping uniquely mapped PETs that form PET clusters (see Figure 1B) have been shown to be highly enriched for “true” binding sites . We previously set the selection parameter (i.e., number of PETs within a given cluster) for high probability binding regions by using the goodness-of-fit analysis employing a mixture of two standard Pareto distributions to model the signal and noise within the dataset . MCF-7 cells, however, pose a special analytical challenge in that regions of gene amplification in the cell line also appeared to amplify cluster PET numbers from low quality binding sites (Figure S2). Therefore, we devised an alternative strategy that normalizes regions of gene amplification so that all chromosomal regions can be directly compared. Using this “adaptive maximum overlap PET (moPET) threshold” approach (unpublished data) and setting the false-positive rate of 100 kb away. These findings initially suggest that DNA-bound ER can interact with the transcriptional machinery through both proximal- and distal-acting mechanisms, and these interactions are not likely to be limited by binding site orientation (5′ or 3′) relative to the TSSs. Intriguingly, functional ER binding sites were rarely in exons and when in exons were in probable untranslated regions. We did not detect any binding regions that mapped to a protein-coding domain of a transcriptional unit. These observations further suggest a dynamic selection of ER binding sites that excludes exonic regions and raise the possibility that transcription factor binding sites (TFBSs) in exons may undergo negative selection during evolution. The ERE Sequence Motif Is Enriched in Validated ChIP-PET Binding Sites As a preliminary assessment on the fidelity of the 1,234 high quality binding regions, we considered presence of putative ERE motif as a proxy for a real binding event. A total of 13 base pair sites that were at most two Hamming distance away (i.e., two base deviation) from the consensus ERE (GGTCA-nnn-TGACC) were called putative EREs. Upon scanning the 1,234 binding regions, we discovered that 884 (~71%) binding regions contained at least one ERE-like sequence, 25% encoded putative half-ERE sites, and the remaining 4% bore no ERE sequence motifs whatsoever (Figure 2). To further confirm the validity of the discovered binding regions, we selected 107 out of the 1,234 high quality clusters for further ChIP-qPCR validation (Figure 3A). All clusters containing a full putative ERE (47 sites) showed significant (>2-fold over control) enrichment, and based on this 100% success rate the 884 genomic loci of ER binding containing consensus ERE motifs are highly likely to encompass true ER binding sites. We also tested 37 sites with half EREs and validated ER binding in 84% (31 of 37) of selected sites. A similar success rate was found for the non-ERE ChIP-PET clusters, as 19 out of 23 tested sites (83%) showed binding. The median fold enrichment of the validated sites containing a full ERE was 81-fold, which was considerably higher than the median fold enrichment observed for half- and non-ERE-containing PET clusters (36- and 51-fold, respectively). These results support the idea that EREs tend to encode high affinity ER binding sites whereas the half- and non-ERE binding likely support moderate affinity binding, perhaps through indirect tethering mechanisms. This is further supported by the enrichment of EREs as the number of PETs in the moPET clusters increase, corresponding to higher ChIP efficiency and potentially reflecting the higher affinity binding. The positive gradient of the curve supports the notion that higher moPET clusters are more likely to contain full ERE-like sequences (p = 3.204e−8) (Figure 3B). Thus, the canonical ERE sequences appears to be a hallmark of ER control on a genome-wide scale. Figure 2 The ERE Sequence Motif Is Enriched in ER Binding Sites Overall, 71.6% of the 1,234 high confidence ChIP-PET clusters encode at least one ERE motif, and 3.73% contained no ERE or half site motif. Figure 3 Putatively Higher Affinity Binding by ER Is Associated with the ERE (A) ChIP-PET ER binding sites are validated by conventional ChIP followed by quantitative PCR. Data shown represent average of duplicate experiments. Binding sites are considered validated if the binding ratio (ER ChIP/irrelevant antibody control) ≥ 2. Validated sites are grouped by presence of EREs (allowing for up to two base deviation from consensus), half ERE sites, and no ERE motifs. (B) The frequency of ERE motifs increase as the size of binding site clusters increase. ERE motifs are only present in 7.05% similarly sized genomic fragments shown as the background. Out of the ten clusters that failed validation, eight of the loci that were misclassified as true binders belonged to the moPET 3 category, which would be considered in the borderline confident range; one was a moPET 4 cluster, whereas a false positive in a moPET7 was located in an amplified region of the MCF-7 genome on Chromosome 20 (which we have shown overestimates the binding efficiency of a DNA fragment to ER). When adjusted for the frequency of full, half, and no ERE motifs in the PET-defined binding loci, our validation rate for binding calls is 94%. These validation results are in line with our previous whole-genome ChIP-PET analysis of p53, Oct4, and Nanog binding sites [18,19] and compares favorably with other genome-wide technologies such as ChIP-on-chip. To examine whether ER ChIP-PET binding sites containing putative full EREs harbor transcriptional enhancer activities, we PCR amplified 11 binding sites from MCF-7 genomic DNA and cloned them upstream of the pGL4-TATA luciferase reporter. For negative and positive controls we used pGL4-TATA and pGL4-2ERE-TATA, respectively. These constructs were transiently transfected into hormone-depleted MCF-7 cells, treated with either ethanol or E2 for 18–24 hours, and then assayed for luciferase activity. A total of eight out of 11 ER ChIP-PET binding site reporter constructs tested were E2 responsive (Figure 4A). To show that the transcriptional enhancer activities of the ER ChIP-PET binding sites are mediated via EREs, we mutated the putative ERE motifs in the eight E2 responsive constructs and transfected them into MCF-7 cells. Destroying the putative EREs resulted in either significant reduction or complete loss of estrogen response, thus demonstrating the EREs of the ER ChIP-PET binding sites are responsible for their enhancer properties (Figure 4B). Figure 4 ERE Sequences in ER ChIP-PET Binding Sites Are Functional Transcriptional Enhancers (A) ER ChIP-PET binding sites were cloned into the pGL4-TATA luciferase reporter construct and transfected into MCF-7 cells that have been grown in hormone depleted medium for at least three days. The cells were treated with either ethanol or 10 nM estradiol for 18–24 h before harvesting for luciferase activity. pGL4-TATA and pGL4-2ERE-TATA (two copies of the vitellogenin ERE cloned upstream of TATA box of pGL4-TATA) were used as negative and positive controls, respectively. The cells were also cotransfected with the HSV-TK renilla vector as an internal control for transfection efficiency. The data represent the average of three individual experiments ± standard error of mean. The binding site coordinates and their adjacent genes are: ERE1 and ESR1, Chromosome 6: 152029288–152029705; ERE2 and ESR1, Chromosome 6: 152071268–152071889; ERE3 and FOXA1, Chromosome 14: 37189409–37189699; ERE4 and GREB1, Chromosome 2: 11589053–11589737; ERE5 and GREB1, Chromosome 2: 11621762–11622024; ERE6 and GREB1, Chromosome 2: 11622967–11623504; ERE7 and GREB1, Chromosome 2: 11630097–11630780; ERE8 and PGR, Chromosome 11: 100554271–100554807; ERE9 and PGR, Chromosome 11: 100712072–100712428; ERE10 and CA12, Chromosome 15: 61467060–61467460; ERE11 and TFF1, Chromosome 21: 42669273–42670075. (B) Putative ERE motifs in ER ChIP-PET binding sites were mutated and examined in transient transfection studies as described in (A). Comparative Analysis of ChIP-PET and ChIP-on-Chip Data Reveals Differential Binding Site Discovery by These Technologies Other technologies have been used to map ER binding sites on a genomic scale. Carroll et al. using ChIP and human genome tiling arrays have mapped ER binding sites in Chromosomes 21 and 22 and across the human genome in MCF-7 cells [15,16]. We first compared our mapped ER binding sites to the previously published results in Chromosomes 21 and 22 . In the 1,234 binding sites mapped by ChIP-PET, we detected 36 ER binding sites in these chromosomes and found that 20 binding sites were identified by both techniques (Figure 5A) and 16 sites were unique to ChIP-PET. We then tested ER binding to technology-specific and overlapping sites by conventional ChIP to determine the validity of the mapped sites. We selected 25 sites detected by one technique or by both for further validation by ChIP and quantitative PCR (Figure 5B). The six sites identified by both technologies were validated as bona fide ER binding sites and had the greatest fold enrichment (median ≈ 300×). All nine of the sites discovered only by ChIP-PET were validated compared to nine of the ten selected sites discovered by ChIP-on-chip alone. The median fold enrichment of the sites identified solely by the ChIP-PET approach was higher than that of the ChIP-on-chip (medians ~45× versus ~22×). A total of four of the ten sites discovered only by ChIP-on-chip were also associated with ChIP-PET clusters but not deemed high probability sites since three had two PETs in their cluster, and one site was a one PET singleton (unpublished data). Moreover, these four ChIP-on-chip sites overlapping with lower probability PET clusters had higher fold enrichment for ER binding than the remaining sites with only ChIP-on-chip supporting evidence (~37× versus ~12× median), suggesting that conjoint assignment of sites by the two technologies even at suboptimal thresholds may identify higher quality ER binding sites. To assess the two technology platforms across the entire genome, we also compared the 1,234 ChIP-PET ER binding sites identified in this study to the recently published whole-genome ChIP-on-chip ER binding site map of 3,665 binding sites  and found 624 (50.6%) ChIP-PET sites in common with the ChIP-on-chip data and 610 (49.4%) sites unique to the PET technology (see Figure 5C). These results are consistent with the data of Chromosomes 21 and 22 where 44.4% (16/36) of the ChIP-PET sites are unique (see Figure 5A). Figure 5 Comparison of ER Binding Sites Discovered by ChIP-PET and Published ChIP-on-Chip Experiments Indicate Sites Common to Both Technologies and Platform-Form Specific Sites (A) In human Chromosome 21 and 22 studies, 57 ER binding sites were discovered by Carroll et al. , and 36 sites were identified in this study. There is an overlap of 20 sites between the two studies. (B) Validation of select binding sites from both studies by ChIP and qPCR suggest that the common sites (both) are high affinity sites, whereas sites unique to each technology tend to have more moderate affinity for ER. (C) Venn diagram of the overlap between the 3,665 sites discovered by Carroll et al. and the 1,234 sites identified in this study. The 624 binding sites identified in this study actually correspond to 633 binding regions reported by Carroll et al. . It is likely that the difference between the two platforms are due to lower affinity ER binding sites being more susceptible to constraints and limitations inherent in the detection technologies and to possible differences in biological handling of cell lines in each study. Moreover, there appears to be content differences between the discovery capacity of the two technologies. An inherent disadvantage of the ChIP-on-chip approach is that arrays mask sites that contain repetitive sequences, whereas the output of the ChIP-PET technology is completely unbiased in regard to the presence or absence of repetitive sequences. To this point, we found that ~27.9% of the base pairs in bona fide binding sites discovered by the pair-end tag approach were associated with repetitive sequences, whereas 5.3% of those in Carroll et al. bore repeats. Taken together, these results further suggest that the ChIP-PET and the ChIP-on-chip are complementary methods for the identification of TFBSs in a whole genome manner. Integration of Binding Site and Gene Expression Data Indicates Diversity in Regulatory Region Architecture and Transcriptional Response The binding site map denotes ER transcriptional regulatory potential for a large number of genes. To determine the specific transcriptional responses corresponding to estrogen treatment and ER recruitment to cis-regulatory sites in MCF-7 cells, we performed gene expression profiling experiments using Affymetrix U133 microarrays on a time course following estradiol exposure. Differentially expressed genes were selected based on a q-value cut-off of less than 2% using a stringent significance analysis of microarrays analysis algorithm. We identified 802 probe sets, representing 544 unique named genes, whose expression levels were up-regulated in response to 10 nM E2 treatment for 12, 24, or 48 h, and 1,168 probe sets corresponding to 704 unique named genes, were down-regulated following hormone treatment. When combined with the ER binding site mapping data (within 100 kb of and closest to high quality binding regions identified by ChIP-PET), 171 up-regulated genes and 116 down-regulated genes were associated with high confidence ER binding sites. Table S2 contains the complete listing of estrogen responsive genes with adjacent ER binding sites identified in this study. We next examined whether there was a preference for positioning of the ER binding sites in up-regulated versus down-regulated genes. Our analysis revealed that there was a statistically significant association between the presence of an ER ChIP-PET cluster near an up-regulated gene and an under-representation of ER ChIP-PET clusters associated with down-regulated genes (p = 8.379e−08) (see Table 1). The over-representation of ER ChIP-PET clusters that can be associated with E2-up-regulated genes is particularly noticeable at the early 12-h time point. That more binding site clusters are associated with E2 up-regulated genes (60%) compared with down-regulated genes (40%) suggests that the ER protein more frequently is directly involved in the transcriptional regulation of E2 up-regulated genes as compared to down-regulated genes. This finding is in concordance with a previous study we conducted in T-47D human mammary carcinoma cells, where we found that out of 89 genes identified as direct target genes (i.e., E2-responsive and cycloheximide insensitive), 59 (66.3%) were up-regulated and only 30 (33.7%) were down-regulated by E2 treatment . Table 1 Statistics Showing Enrichment of ER Binding Sites Adjacent to Up-Regulated E2 Responsive Genes To further confirm the observed association of binding sites with the transcriptional response, we examined the association of the 107 sites validated by conventional ChIP qPCR to E2 regulated gene expression data. Of the 107 sites tested, 22 sites were found to be associated with an E2-regulated probe from the Affymetrix dataset. Out of these 22 sites, 16 were up-regulated by E2 whereas six were down-regulated by E2-treatment. The 16 sites associated with E2 up-regulated probes all showed high levels of enrichment (~25–473×) when analyzed by ChIP qPCR, whereas the six sites associated with down-regulated genes showed lower levels of enrichment (~1–25×) (Figure 6A). These data suggested that some potential mechanistic association exist between high affinity ER binding sites and induction of gene expression by ER. Figure 6 Comparative Analysis of Binding Site Affinity and Location Adjacent to E2 Up-Regulated Genes Versus Down-Regulated Genes (A) Binding site affinity measured by ChIP and qPCR for 22 ChIP-PET sites within 100 kb of E2 responsive genes detected in the microarray studies. Up-regulated genes are denoted in red bars and down-regulated genes in green bars. Each binding site is furthered characterized for the presence of EREs, half EREs, or no EREs. (B) Locations of binding sites adjacent to up- (blue line) and down- (red line) regulated genes are mapped relative to the transcripts. Relative location to a random set of genes from the UCSC KGs database is included as a reference. Exploring the association between ER binding sites and the directionality of the transcriptional response further, we mapped the locations of the binding sites relative to the start and termination sites of E2 up- and down-regulated genes as assessed by time course studies using Affymetrix expression arrays. As shown in Figure 6B, binding clusters associated with E2-induced genes are found, significantly above background, 50 kb upstream of the TSS, 50 kb downstream of the TSS within the early introns, and within 25 kb downstream of the termination site of their associated genes. Intriguingly, approximately 35 ER binding sites were found within 5 kb of the transcriptional initiation sites of up-regulated genes (Figure 6B). No such enrichment was detected with down-regulated genes or with genes randomly selected from the UCSC KG database. Taken together, our results suggest that ER binds to many sites in the genome, most bearing a discernable ERE-like sequences, and that genes induced by estrogen are significantly more likely to have an ER binding site within 50 kb of the TSS. Manual analysis of the intronic binding sites showed no evidence for internal alternative TSSs. Genes down-regulated by estrogen show no such positional enrichment and appear to be associated with lower ChIP efficiency ER binding sites. Moreover, genes repressed by estrogen are usually down-regulated later than those induced (48 h versus 12 h; as also observed elsewhere [3,16]). This suggests that genes repressed by ER may require further synthesis and recruitment of other factors to the ER binding sites and that the mechanism of gene repression is topographically distinct from that of gene induction. Supporting this is our observation that binding sites for up-regulated genes have higher moPET counts than binding sites for down-regulated genes (p = 0.0005). When sampled for validation by quantitative PCR, ER binding sites associated with induced genes had much higher fold enrichment for ER occupancy than repressed genes (see Figure 6A). Genes Adjacent to ER Binding Sites Are Associated with Breast Cancer Biology We posited that the genes adjacent to the ER binding sites identified by ChIP-PET are putative ER target genes and should reflect ER function in vivo. To assess this possibility, we examined whether the behavior of the collection of adjacent genes could determine the ER status of human breast cancers. All genes within 100 kb of an ER binding site or with an ER binding cluster within an intron were used to cluster 251 breast tumors from a cohort from Uppsala, Sweden previously analyzed for gene expression using Affymetrix U133 A and B arrays . Our results showed that this proximate gene list could easily segregate ER-positive and ER-negative breast tumors (see Figure 7A), whereas a random gene list from the U133 A and B gene set could not (unpublished data). Statistical analysis using the Fisher's exact test showed a highly significant separation based on ER-status with p = 3.914e−12. Moreover, patients with ER-positive like expression profiles, based on the ER-associated genes, have better disease-specific survival over ten years of follow-up as compared to those with the ER-negative like profiles (p = 0.0057) (Figure 7B). This is consistent with all current knowledge of the impact of ER-status in breast cancer prognosis. These results provide strong evidence that the ERα binding sites identified using ChIP-PET enrich for ER responsive genes that are associated with the biology of human breast cancers. Figure 7 ER Binding Sites Are Adjacent to Genes Associated with ER-Status and Disease Outcome in Breast Cancer Patients (A) Expression profiles of genes adjacent to the 1,234 ER binding sites ( 5 kb) or within intragenic regions. (1.3 MB AI). Click here for additional data file. Table S1 Complete Table of 1,234 High Confidence ChIP-PET Clusters Denoting ER Binding Sites in MCF-7 Cells (210 KB XLS) Click here for additional data file. Table S2 List of Estrogen Responsive Genes Identified in Microarray Experiments with Adjacent ER Binding Sites (47 KB XLS) Click here for additional data file. Table S3 Nonuniform Distribution of TFBSs in 1,234 ChIP-PET Clusters The Kolmogorov-Smirnov Test was employed to test whether the observed putative binding sites locations follow uniform distribution. (15 KB XLS) Click here for additional data file. Table S4 Nonuniform Distribution of TFBSs Relative to the Main EREs of the Binding Regions, Based on Kolmogorov-Smirnov Test, as Described Earlier (16 KB XLS) Click here for additional data file. Table S5 Nonuniform Distribution of TFBSs Relative to the Main Half EREs of the Binding Regions, Assessed under the Kolmogorov-Smirnov Test (15 KB XLS) Click here for additional data file.