94
views
0
recommends
+1 Recommend
0 collections
    0
    shares
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data

      research-article

      Read this article at

      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          Motivation: Next-generation sequencing has become an important tool for genome-wide quantification of DNA and RNA. However, a major technical hurdle lies in the need to map short sequence reads back to their correct locations in a reference genome. Here, we investigate the impact of SNP variation on the reliability of read-mapping in the context of detecting allele-specific expression (ASE).

          Results: We generated 16 million 35 bp reads from mRNA of each of two HapMap Yoruba individuals. When we mapped these reads to the human genome we found that, at heterozygous SNPs, there was a significant bias toward higher mapping rates of the allele in the reference sequence, compared with the alternative allele. Masking known SNP positions in the genome sequence eliminated the reference bias but, surprisingly, did not lead to more reliable results overall. We find that even after masking, ∼5–10% of SNPs still have an inherent bias toward more effective mapping of one allele. Filtering out inherently biased SNPs removes 40% of the top signals of ASE. The remaining SNPs showing ASE are enriched in genes previously known to harbor cis-regulatory variation or known to show uniparental imprinting. Our results have implications for a variety of applications involving detection of alternate alleles from short-read sequence data.

          Availability: Scripts, written in Perl and R, for simulating short reads, masking SNP variation in a reference genome and analyzing the simulation output are available upon request from JFD. Raw short read data were deposited in GEO ( http://www.ncbi.nlm.nih.gov/geo/) under accession number GSE18156.

          Contact: jdegner@ 123456uchicago.edu ; marioni@ 123456uchicago.edu ; gilad@ 123456uchicago.edu ; pritch@ 123456uchicago.edu

          Supplementary information: Supplementary data are available at Bioinformatics online.

          Related collections

          Most cited references13

          • Record: found
          • Abstract: found
          • Article: not found

          High-Resolution Mapping of Expression-QTLs Yields Insight into Human Gene Regulation

          Introduction Genetic variation that affects gene regulation plays an important role in the genetics of disease and adaptive evolution [1],[2],[3]. However, unlike protein-coding sequences, we still know little about how to identify the DNA sequence elements that control gene expression. It is still difficult to predict with any confidence which SNPs are likely to affect gene expression, without performing targeted experimental assays. To address this gap, recent experimental and computational approaches have made progress on identifying elements that may be functional, for example through experimental methods that identify transcription factor binding sites [4],[5], by in vivo testing of possible enhancers [6] and by computational analysis of sequence data [7],[8],[9]. However, our understanding of the importance of different types of functional elements in gene regulation remains rudimentary. As a complementary approach, genome-wide studies of gene expression are now starting to provide information on genetic variation that impacts gene expression levels [10]. Recent studies in a variety of organisms have shown that levels of gene expression are often highly heritable [11],[12],[13],[14], and that for many genes it is possible to map cis- and trans-acting factors using linkage [13],[15],[16],[17],[14] or association mapping [12],[18],[19],[20],[21]. Recent studies of experimental crosses in yeast and mice have used the locations of SNPs within eQTL genes to provide further information about the identity of functional elements [22],[23]. In studies of human lymphoblastoid cells, it has been reported that most strong signals of association lie within 100 kb of the transcribed region [12], and that eQTLs cluster roughly symmetrically around the TSS [20]. In this study, we applied a new Bayesian framework to identify and fine map human lymphoblast eQTLs on a genome-wide scale. In effect, we treat the SNP data as a tool for assaying the functional impact of individual nucleotide changes on gene regulation. Our analysis focuses on the impact of common SNPs on gene expression levels. By using naturally occurring variation, we test the effects of several million variable sites in a single data set. Our results provide a detailed characterization of the types of SNPs that affect gene expression in lymphoblast cell lines. Results We analyzed gene expression measurements from lymphoblastoid cell lines representing 210 unrelated individuals studied by the International HapMap Project [24],[25]. These expression data, first reported by [19], were generated using the Illumina Sentrix Human-6 Expression BeadChip. For each sample we also used SNP genotype data from the Phase II HapMap Project, consisting of 3.3 million genotypes per individual [25]. After remapping the Illumina probes onto human mRNA sequences from RefSeq, we created a cleaned set of expression data for 12,227 distinct autosomal genes that had a unique RNA sequence in RefSeq (see Methods). For most analyses we removed 634 genes that had one or more HapMap SNPs within the expression probe and 147 very large genes (>500 kb), leaving us with a core data set of 11,446 genes. We then set out to identify SNPs that affect measured mRNA levels in cis. As an operational definition, we considered the “cis-candidate region” to start 500 kb upstream of the transcription start site (TSS) and to end 500 kb downstream of the transcription end site (TES). Consistent with previous work [20],[12], our preliminary analysis found that most detectable eQTLs lie within this region. Although the HapMap samples represent four different populations, originating from Africa, Europe and east Asia, our main analyses pooled the data into a single sample. To avoid false positives due to population-level expression differences [26],[20],[27], for each gene we transformed the African, European and east Asian expression data separately to standard normal distributions prior to combining the samples (Methods). Our rationale for combining samples was that we should achieve better power and better localization of signals than if we analyzed the populations separately. In doing so, we assume that functional variants usually have similar effects in different populations, an assumption that is parsimonious, and has empirical support [20], Figure S1. The overall results for analyses of individual populations are very similar (see Figures S2, S3, and S4). The Distribution of cis-Acting eQTLs For each of the 11,446 genes, we tested for putative cis-acting eQTLs by regressing measured mRNA levels against SNP genotypes, independently for each SNP in the cis-candidate region, using a standard linear regression model. Consistent with previous reports [20], we found a substantial number of genes with strong evidence for containing at least one eQTL. A total of 744 genes (6.5%) had at least one SNP with a p-value 50 kb from the corresponding gene. Third, as shown in Figure S9, there is a significant enrichment of eQTL SNPs in exons compared to introns. We will return to this observation later in the paper. 10.1371/journal.pgen.1000214.g002 Figure 2 Locations of the most significant eQTL SNPs for small, medium, and large genes. Each plot shows, for genes with an eQTL, the distribution of locations of the most significant SNP. The x-axis of each plot divides a typical cis-candidate region into a series of bins as described. The y-axis plots the number of SNPs in each bin that are the most significant SNP for the corresponding gene and that have a p-value one gene in cis. Statistical Analysis Notation The data consist of SNP genotypes and gene expression measurements for n individuals at each of K genes. Let yik denote the normalized gene expression data for individual i (i in 1,…, n) at gene k (k in 1,…, K). Yk will denote the vector of gene expression values (y.k ) across the n individuals at gene k. Next, let Mk be the number of genotyped SNPs in the cis-candidate region of gene k. We denote the entire matrix of genotype data for these Mk SNPs with the vector Gk , and individual genotypes as gijk for individual i at SNP j of gene k. Genotypes are coded as having 0, 1, or 2 copies of the minor allele. P-Value Method In the first part of the paper we used standard linear regression to test the gene expression data at each gene for association with SNPs in the cis-candidate region, as follows. The effect of individual i's genotype at SNP j (gijk ) on his/her gene expression level (yik ) is assumed to follow an additive linear model: (1) where μ is the mean expression level at that gene for individuals with g = 0, where ajk is the additive effect of the minor allele at SNP j and εijk is the residual. A standard p-value from a 1 df test can then be obtained for the hypothesis that SNP j is an eQTN for gene k (ajk ≠ 0). We used the following procedure to generate the results plotted in Figure 2. For each gene with expression data we assigned each SNP in the cis-candidate region to a single bin (see below). Let m be the total number of SNPs that fall into bin b, summing across all genes. (Note that most SNPs are in the cis-candidate regions of multiple genes and hence can contribute data to multiple bins.) Next, for each gene, we tested every SNP for association with gene expression. If the p-value of the most significant SNP was 1 SNPs, we considered that the signal was divided equally among the n most significant SNPs (i.e., a fraction 1/n of the signal was assigned to each SNP). Suppose that, by this way of counting, there are s signals in bin b. Prior to reporting the data, we also applied a correction for the possibility of spurious signals due to ungenotyped SNPs in the expression array probe. We used the 634 genes with a known HapMap SNP inside the probe to create a profile of the abundance of spurious signals as a function of distance from the probe. This profile was used to adjust the observed number of signals, s, to a corrected number s′, that removes the predicted number of spurious signals in each bin (see Figure S19 and Text S1 for details). In practice, we estimate that the contribution of spurious signals does not substantially change the overall uncorrected distribution of signals. Finally, we computed the fraction of most significant SNPs in bin b as s′/m. Bin Definitions To display the distribution of signals in Figures 2 and the left panel of Figure 3 we subdivided the cis-candidate region into discrete bins as follows. First, since there is dramatic variation in gene sizes, we analyzed genes in three separate categories based on transcript length: small genes (0–20 kb), medium genes (20–100 kb) and large genes (100–500 kb). Then, within each gene size category we divided the entire cis-candidate region into a series of bins, anchored at the TSS and TES. SNPs outside the transcript were assigned to bins based on their distance from the TSS (for the upstream region) or TES (downstream). Bins outside the transcript were 1 kb wide for small and medium genes and 15 kb wide for large genes. Transcribed regions were split into fixed numbers of bins: each small gene was split into ten bins of equal size, medium genes into 25 bins and large genes into 15 bins. Hence, bins inside the transcript indicate the fractional location of SNPs relative to the TSS and TES, and the physical sizes of the bins vary across genes. The bin sizes were chosen so that the average physical sizes of internal and external bins are roughly the same within each gene size category. Hierarchical Model We present here an overview of the hierarchical model. Complete details on the models are provided in the Supplementary Methods section (Text S1). Bayesian Regression Model The hierarchical model applies the Bayesian regression framework of Servin and Stephens [29]. The effect of individual i's genotype at SNP j (gijk ) on his/her gene expression level (yik ) is assumed to follow a linear model: (2) where μ is the mean expression level at that gene for individuals with g = 0, and where ajk and djk are the additive and dominance effects of the minor allele at SNP j. The residual, εijk , is assumed to be N(0,1/τ) and independent for each yik , where 1/τ is the variance of expression levels within each genotype class. The indicator function I(gijk  = 1) is defined as 1 if the genotype is heterozygous (gijk  = 1) and 0 otherwise. Let denote the probability of the expression data Yk under the null hypothesis that there are no cis-eQTNs in gene k (i.e., ajk  = djk  = 0 for all j). Similarly, let denote the probability of the expression data Yk assuming that SNP j is the eQTN. In this case, the effect sizes ajk and djk are modeled as being drawn from mixtures of normal distributions centered on 0 (see Text S1 for details). The Bayes factor (BF) for SNP j in gene k is defined as (3) and measures the relative support for the hypothesis that SNP j is an eQTN for gene k, versus the null hypothesis. We use priors on effect sizes that allow the BF to be calculated analytically (see Text S1). The Hierarchical Model We describe first the basic version of our hierarchical model. All the results presented in this paper additionally include a correction for the possibility that genes might show signals due to undetected SNPs in the probe. We describe that extension later in the Methods, briefly, and in detail in the Supplementary Methods (Text S1). Our basic model assumes that there are two mutually exclusive categories of genes. With probability Π0 there is no eQTN in the cis-candidate region, and with probability Π1 = 1−Π0 there is a single eQTN. Then the likelihood of the expression data at gene k is (4) where denotes the probability of the expression data Yk given that there is no eQTN in gene k and denotes the probability of the expression data given that there is exactly one eQTN. Note that our model allows for at most one eQTN per gene. If in fact there is more than eQTN, our model will usually assign the signal to the strongest of these. In practice, we see little variation in average effect size as a function of location, so this modeling simplification is unlikely to seriously distort the results. Given that there is a single eQTN in gene k, the probability of the observed expression data, , can be written as (5) where is the probability of the expression data given that SNP j is an eQTN, and π jk is the prior probability that SNP j is an eQTN, given that exactly one SNP in gene k is an eQTN. A key feature of the hierarchical model is that the probability that SNP j is an eQTN, π jk , is allowed to depend on the physical location of SNP j relative to one or more “anchor” points, and other relevant annotations (see Text S1). Suppose that we consider L different kinds of annotation, and let the indicator δjkl equal 1 if SNP j at gene k has the lth annotation, and equal 0 otherwise. Then define (6) where Λ = (λ1,…,λ L ) is a vector of annotation effect parameters. We use a logistic model to relate π jk to these annotation indicators, namely, (7) As detailed in the Supplementary Methods (Text S1), we parameterized the effect of distance from the anchor locations using a series of discrete bins that represent absolute physical distance from the relevant anchor. The bins nearest to the anchor are 1 kb wide, and increase in width to 10 kb and finally 100 kb with increasing distance from the anchor. For the two-anchor models, each SNP belongs to two position bins, each of which indicates distance from one anchor. Likelihood for the Hierarchical Model Substituting the above expressions for into (4), the likelihood for the hierarchical model is (8) (9) where Θ denotes the model parameters and BF jk is the BF from the Bayesian regression (3). To be explicit, the model parameters Θ include the annotation parameters Λ, the proportion Π0 and other parameters related to the Bayes factor computation (see Text S1). The likelihood of the entire data set is the product of (9) across all K genes. We fit the hierarchical model by maximizing the log-likelihood (10) with respect to the model parameters Θ. (Note that the first term, involving does not depend on Θ, and so need not be evaluated.) Accounting for the Effects of SNPs in Probes Since undetected SNPs in the probe sequence sometimes generate eQTLs, the results that we report include a modification to account for this effect. We used the 634 genes that have a known SNP in the probe region as training data to help parameterize the model. We assume that these represent ∼1/3 of all probes with common SNPs [25]. Suppose that with probability there is a gene inside the probe sequence (this is set to 1 for the training data), and suppose that when there is a SNP in the probe, there is a probability Π s that this generates a spurious signal. Then let be the probability of a spurious signal. We consider that we are only interested in real signals if there is no spurious signal, so we write the probability of the data as (11) where the first term is the likelihood when there is no spurious signal (as in Equation 4), and where the second term gives the likelihood ( ) when there is a spurious signal. Likelihood Maximization To maximize [10] we used an iterative strategy based on a point-by-point golden maximization strategy [46]. To speed convergence of the maximization process, we initialized the parameters using naive estimates of the λs based on the logarithm of the odds ratio computed assuming Π0 = 0. Posterior Probabilities Once the likelihood has been maximized, we can compute the posterior probability of a given SNP j to be an eQTN for gene k. In the case without spurious signals this is (12) and the general version is given in the Supplementary Methods (Text S1). Sequence Conservation and Transcription Factor Binding To compute the average sequence conservation as a function of position for Figure 4B, we estimated the average number of substitutions per site across the phylogeny of seven mammalian species (human, chimpanzee, macaque, mouse, rat, dog, and cow), using data and alignments from the UCSC browser. This was done for the main set of 11,446 genes analyzed in this paper. For each gene, 5 kb on each side of the TSS (and separately for the TES) was split into non-overlapping 50-bp bins. We then concatenated all the sites across all genes that lay in the same bin. After excluding sites in coding exons we estimated the average number of substitutions at each site using baseml, a program in the PAML package [47]. We obtained results on transcription factor binding density using ChIP-chip data collected by the ENCODE project (4). We used data for eight transcription factors that showed large numbers of binding fragments at a 1% false discovery rate in the ENCODE study. The left-hand panel of Figure 4C is essentially a replotting of data presented in Figure 5 of (4), while the right-hand panel shows analogous data plotted with respect to the TES. Software Availability The methods reported here are implemented in the package eQTNMiner, which is available from JBV on request. Supporting Information Figure S1 About 60% of the eQTNs are shared between at least two populations. Venn diagram of the set of eQTNs detected separately in each population. To generate the diagram, we admitted a SNP to the analysis (as an eQTL) if either the p-value in the combined sample (pooling the 3 populations) is lower than 7×10−6 or the p-value in a single population is lower than the p-value cutoff corresponding to a gene FDR of 5% within each population. We then considered two populations to share an eQTL if any single population has a p-value <1×10−2. Finally, for each gene having at least one such eQTL, we defined the eQTN as the SNP with the largest number of shared populations (sharing weight between multiple SNPs if there is a tie). (0.12 MB PNG) Click here for additional data file. Figure S2 Expression QTNs in the combined Japanese plus Chinese analysis panel (ASN) show similar patterns to those in the full data. The left panel (p-value method) was prepared in the same way as Figure 2 of the main paper and the right panel (hierarchical model with TSS+TES) was prepared in the same way as Figure 3 (left panel) of the main paper. Both display results analyzing only the Asian data. For the left panels we used a p-value cutoff of 1.25×10−5 obtained by permutations when analyzing only the Asian data and corresponding to a gene FDR of 5%. (0.43 MB PNG) Click here for additional data file. Figure S3 Expression QTNs in the European-derived sample (CEU) show similar patterns to those in the full data. The left panel (p-value method) was prepared in the same way as Figure 2 of the main paper and the right panel (hierarchical model with TSS+TES) was prepared in the same way as Figure 3 (left panel) of the main paper. Both display results analyzing only the European data. For the left panels we used a p-value cutoff of 3.5×10−6 obtained by permutations when analyzing only the European data and corresponding to a gene FDR of 5%. (0.46 MB PNG) Click here for additional data file. Figure S4 Expression QTNs in the Nigerian sample (YRI) show similar patterns to those in the full data. The left panel (p-value method) was prepared in the same way as Figure 2 of the main paper and the right panel (hierarchical model with TSS+TES) was prepared in the same way as Figure 3 (left panel) of the main paper. Both display results analyzing only the Nigerian data. For the left panels we used a p-value cutoff of 3.825×10−6 obtained by permutations when analyzing only the Nigerian data and corresponding to a gene FDR of 5%. (0.43 MB PNG) Click here for additional data file. Figure S5 Illustration of the ability of the HM to accurately estimate the distribution of eQTNs when all the actual eQTNs are genotyped. This figure is based on a simulated dataset assuming that for all genes the actual eQTN is genotyped (see Text S1). In both panels the black histograms represent the number of actual eQTNs using 1 kb bins anchored from the TSS (this is identical for both panels). A. P-value method: the green curve displays the number of most significant SNPs detected by the p-value method. As expected, due to LD and the stringency of the p-value cut-off, the profile is less peaked than the actual distribution. B. Hierarchical model: using our hierarchical model with the TSS-only model (see Methods) we are able to catch most of the actual eQTNs. The red curve indicates the expected number of eQTNs computed using the posterior probabilities from the hierarchical model. Notice that the hierarchical model provides a better picture of the distribution of signals. (0.15 MB PNG) Click here for additional data file. Figure S6 50% of the most significant SNPs lie within 7.5 kb of the actual eQTNs. Both panels are based on the results from the p-value method applied to a simulated dataset (see Text S1). The top panel plots the histogram of the fraction of most significant SNPs as a function of distance from the actual eQTNs. The bottom panel plots the corresponding cumulative probability. (0.05 MB PNG) Click here for additional data file. Figure S7 No obvious impact of the eQTN location on the mapping precision. Cumulative plot of the distance between the most significant SNPs and the actual eQTNs according to the eQTN location (upstream of the TSS, downstream of the TSS, within an exon, and within an intron). This plot was generated by averaging results from the p-value method applied to 10 simulated dataset (see Text S1). For the legend, the percentage between brackets give the fraction of actual eQTNs in the corresponding category. (0.08 MB PNG) Click here for additional data file. Figure S8 Impact of the local recombination rate on the eQTN mapping precision. Boxplot of the physical distance between the tag SNP and the actual eQTN as a function of the average recombination rate (cM/Mb) around the actual eQTN in a simulated dataset assuming that all eQTNs are not genotyped (see Text S1). We divided the data into four categories of equal sizes (from low to high level of recombination rate, the range of the recombination rate in each class is indicated along the x-axis below each box). As expected, the higher the recombination rate, the lower the expected distance between the tag SNP and the actual eQTN. (0.05 MB PNG) Click here for additional data file. Figure S9 There is a deficit of most-significant SNPs in internal introns, and an enrichment of such SNPs in last exons (p-value method). This figure is based on the subset of 295 genes for which there is a unique most significant SNP (and for which the smallest p-value is <7×10−6) that fall into the gene transcript region. For the five panels, the blue arrows represent the observed number of most significant SNPs in the five gene functional elements for which at least 5 most significant SNPs have been found. Here these counts have been corrected for putative spurious signal due to an unobserved SNP inside the probe (leading to the removal of {similar, tilde operator } 46 genes). Under the null hypothesis that these most significant SNPs are randomly distributed into the eight possible gene functional elements, we carried out a simple Monte-Carlo procedure where for each of the 295 genes we picked at random a SNP inside the gene transcript region to be the most significant SNP (and weight it by the probability that the gene has genuine signal according to the location of the observed most significant SNP with respect to the probe (see Text S1). The histograms depict the distribution of the numbers of most significant SNPs across 1000 simulated configurations. (0.12 MB PNG) Click here for additional data file. Figure S10 When distance is measured from the TSS (or TES) only, the TES (or TSS) peak is hidden due to the great variability in gene lengths. The plots show the fraction of SNPs with eQTN signals as a function of position in the cis-candidate region. The candidate region is divided into a series of 1 kb bins across the x-axis that indicate position relative to the TSS (or TES). For each bin we plot the proportion of SNPs that have the smallest p-value for the corresponding gene, and for which p<7×10−6 (gene FDR of 5%). (0.07 MB PNG) Click here for additional data file. Figure S11 Illustration of the ability of the HM to accurately estimate the distribution of eQTNs even when only 30% of the actual eQTNs are genotyped. These plots are based on a simulated dataset assuming that across all genes only 30% of the true eQTNs are genotyped (see Text S1). In both panels the black histograms represent the number of actual eQTNs using 1 kb bins anchored from the TSS (this is identical for both panels). A. P-value method: the green curve displays the number of most significant SNPs detected by the p-value method. As expected, due to the uncomplete SNP coverage, LD and the stringency of the p-value cut-off, the profile is less peaked than the actual distribution. B. Hierarchical model (TSS-only version): the red curve indicates the expected number of eQTNs computed using the posterior probabilities from the hierarchical model. The hierarchical model provides us with a much more accurate representation of the actual eQTN distribution. (0.20 MB PNG) Click here for additional data file. Figure S12 Simulated dataset with eQTNs symmetrically distributed around the TSS. The three left panels plot the true (simulated) probability to be the actual eQTN according to the gene size category. The three right panels plot the probability to be the most significant SNP (i.e the SNP with the smallest p-value inside the cis-candidate region) in genes having at least one SNP with a p-value lower than 7×10−6 (as for Figure 2 in the main text). Although only 30% of the actual eQTNs are observed, the distribution of the most significant SNPs (right panels) lines up pretty well with the distribution of the actual eQTNs (left panels). Furthermore, the distribution of signals for this TSS-only model is quite different than seen in the real data, consistent with our results that the TSS-only model does not provide a good description of the data. See Text S1 for a description of our simulation process. (0.43 MB PNG) Click here for additional data file. Figure S13 Numbers of SNPs inside each of the 9 mutually exclusive gene-related annotations as a function of position within the gene. SNPs inside coding exon are classified into synonymous and non-synonymous SNPs. Notice that ∼84% of genic SNPs occur inside internal introns. (0.12 MB PNG) Click here for additional data file. Figure S14 Fine-scale structure of eQTN peaks near the TSS and TES, and comparison to four types of functional annotation. The left- and right-hand columns show data for 5 kb on either side of the TSS and TES, respectively (averaging across all gene sizes). Locations inside genes are colored green and outside genes are black. A. Posterior expected fractions of SNPs in each bin that are eQTNs, as estimated by the hierarchical model (see Methods). Each bin is 25 bp wide. B. Probability that a SNP falls into a (putative) functional site: CpG island (CpG), conserved non-coding element (CNC), predicted cis-regulatory module (pCRM) and micro RNA binding site (miRNA). (0.27 MB PNG) Click here for additional data file. Figure S15 Genes with CpG islands spanning the TSS are expressed at higher average levels and are more likely to contain eQTLs than genes without a CpG island at the TSS. Results for genes with a CpG island ON the TSS are displayed in red while results for genes without a CpG island spanning the TSS (OFF) are displayed in black. These results are based by computing seperately for the two gene categories the posterior probabilities from the hierarchical model. A. Estimated probability for each gene category to have an eQTN anywhere in the cis-candidate region. B. Box plots of the means and the standard deviations of the log hybridization intensities for the two gene categories. Genes ON CpG have higher mean expression and standard deviations than Gene OFF CpG. C. After adjusting for the different overall rates of eQTNs, the distribution of signal locations in the two classes of genes is very similar. The plots show the fraction of SNPs with eQTN signals as a function of position in the cis-candidate region, based on the hierarchical model. In order to make the two classes of genes more comparable, the plots are conditional on the gene having an eQTN. Top panel shows results for the 7,069 genes with a CpG island spanning the TSS (ON CpG) and bottom panel shows results for the 4,377 genes without a CpG island spanning the TSS (OFF CpG). (0.27 MB PNG) Click here for additional data file. Figure S16 Schematic explanations of our gene structure annotation. The plot shows three pairs of hypothetical genes consisting of, respectively, 1, 2 and 6 exons. In each pair, the upper version of the gene shows the exon/intron structure (from RefSeq) and the translation start and stop sites (vertical red lines). The lower version of the gene shows how we annotate the gene structure (see color code at right of figure). A verbal explanation is also provided in the main text. (0.17 MB PNG) Click here for additional data file. Figure S17 Locations of the most significant eQTL SNPs for small, medium, and large genes using a p-value cutoff of A) 1×10−2 and B) 1×10−4. For A and B, the three panels was prepared in the same way as Figure 2 of the main paper. (0.44 MB PNG) Click here for additional data file. Figure S18 Locations of the most significant eQTL SNPs for small, medium, and large genes using a p-value cutoff of A) 1×10−6 and B) 1×10−8. For A and B, the three panels was prepared in the same way as Figure 2 of the main paper. (0.41 MB PNG) Click here for additional data file. Figure S19 Distribution of most significant eQTL SNPs around probes. The black bars indicate the numbers of spurious eQTL signals as a function of distance from the probes, among the 634 genes with a known SNP in the probe. The sum of the red+green bars gives the numbers of most significant eQTL SNPs among the remaining 11,446 genes; the red component is our estimate of the fraction that is spurious. (See section ‘Spurious Signal’ in Text S1 for further description.) (0.18 MB PNG) Click here for additional data file. Table S1 Table of descriptive statistics for each of the 9 mutually exclusive gene structure annotations for the 11,446 genes of our data set. The “Exp nber” and “Fraction” columns of the table are based on the posterior probabilities to be a genuine eQTN from the hierarchical model: left side for TSS-only+annotation model and right side for TSS+TES+annotation model. (0.03 MB PDF) Click here for additional data file. Table S2 Table of descriptive statistics for each of the 8 mutually exclusive gene structure annotations for the 11,446 genes of our data set. (0.03 MB PDF) Click here for additional data file. Table S3 Table of descriptive statistics for each of the 5 functional annotations for the 11,446 genes of our data set. (0.04 MB PDF) Click here for additional data file. Text S1 Supplementary methods. (0.15 MB PDF) Click here for additional data file.
            Bookmark
            • Record: found
            • Abstract: not found
            • Article: not found

            Allelic variation in human gene expression.

              Bookmark
              • Record: found
              • Abstract: found
              • Article: not found

              Differential Allelic Expression in the Human Genome: A Robust Approach To Identify Genetic and Epigenetic Cis-Acting Mechanisms Regulating Gene Expression

              Introduction Understanding the genetic causes of phenotypic variation in humans still remains a major challenge for human genetics. In hundreds of cases, a single DNA sequence polymorphism affecting a protein coding sequence has been linked to a clear simple Mendelian phenotype (see e.g. [1]) and, for a much smaller but increasing number of cases, to more complex phenotypes [2]–[4]. Recent developments in high-density genotyping technologies have led to the completion of several whole genome association studies that test hundreds of thousands of markers for a specific disease. While earlier studies essentially focused on variants in coding sequences and regions immediately surrounding candidate genes, whole genome scans interrogate, in an unbiased way, most of the human genome including large regions of non-coding DNA that had not been studied previously. Interestingly, some of the strongest signals observed in these association studies are located in non-coding regions, either in large introns (e.g. [5]–[7]) or far away from any annotated loci (e.g. [8] and references therein). The mechanisms connecting these polymorphisms to the etiology of the diseases are still unclear but regulation of gene expression remains an obvious candidate. It is thus becoming particularly important to have a powerful and reliable method to easily test the influence of DNA polymorphisms on gene expression. One of the approaches commonly used to identify regulatory polymorphisms is to look for statistical associations between variation in gene expression and individual genotypes [9],[10]. This method offers the advantage of simultaneously analyzing thousands of genes using gene expression arrays and has yielded fascinating results in yeast [11],[12] and mouse [13]–[16]. Its application in humans [17]–[24] suffers from relatively low statistical power due to potential inter-individual differences in a large number of causal variants involved in the regulation of a specific gene [25], their modest effects and the burden of the multiple testing correction necessary to take into account the large number of independent tests performed. In addition, since this approach requires extensive genotype information for all individuals, it is costly to apply to new samples. An alternative approach is to compare the relative expression of the two alleles in one individual: the effect of a polymorphism affecting in cis the regulation of a particular transcript can be detected by measuring the relative expression of the two alleles in heterozygous individuals using a transcribed SNP as a marker [26]–[32]. Several studies have used this approach in humans but have been criticized for their low throughput or the apparent high variability. Here we describe a novel array-based method that allows high-throughput assessment of differential allelic expression. We used a modified version of the Illumina GoldenGate genotyping platform, the Allele-Specific Expression (ASE) assay, to assess the extent of differential allelic expression for over 1300 genes in more than 80 human lymphoblastoid cell lines (LCLs). Our analyses include 352 genes located in ENCODE regions and chromosome 21 that have been previously screened for cis-regulatory polymorphisms using total gene expression [19]. This allows us to directly compare the advantages and drawbacks of the two approaches in terms of range, sensitivity and robustness. We specifically address the issue of experimental noise and reproducibility of the findings and show that biology, not experimental variability, is responsible for the patterns observed. We discuss the relevance of our results for the identification of the molecular mechanisms regulating gene expression, as well as their implications for future genetic studies. Results Assessment of Differential Allelic Expression Using Illumina ASE Assay We first assessed the extent of differential allelic expression at 1,432 exonic SNPs using 81 individual LCLs with the Illumina ASE technology (Figure 1 and Table S1 for the composition of the Illumina ASE Cancer Panel). This technology uses primer extension assays with fluorescence-labeled allele-specific primers to measure the proportion of each allele separately at the genomic and transcriptomic levels (Figure 2). Five hundred and twelve SNPs (in 345 genes) displayed an expression level significantly higher than background in at least three heterozygous individuals and were further investigated (see Materials and Methods for details). The extent of differential allelic expression at each SNP was obtained by comparing the relative amount of each allele in RNA to the ratio observed in DNA. 10.1371/journal.pgen.1000006.g001 Figure 1 Experiment design and results obtained for the two panels used in the study. The Overall column corresponds to the combination of the two panels. The detailed composition of each panel is presented on Supplemental Table S1. 10.1371/journal.pgen.1000006.g002 Figure 2 Overview of the Illumina Allele-Specific Expression assay. Genomic DNA and total RNA are separately converted into biotinylated DNA and amplified using fluorescence-labeled universal primers following extension and ligation of allele-specific assay oligo-nucleotides. PCR products are captured by locus-specific beads and the fluorescence of each dye (i.e. allele) at each locus is measured by quantitative fluorescence imaging (see Materials and Methods for details). As a first effort to determine if the assay could reliably be used to assess differential expression we generated spike mixes using varying proportions of total RNA extracts from two individuals. For 20 exonic SNPs located in expressed transcript, the two individuals are homozygous for the different alleles (i.e. respectively AA and BB), while for 192 SNPs one individual is heterozygous and the other homozygous (i.e. either AB and AA, or AB and BB). Since the expression of each gene may differ between the two individuals, one does not expect to observe an exact translation of the proportions of total RNA mixed to the “allelic” expression level. However, the allelic expression differences estimated for the different spike mixes should be the proportional to each other. For all homozygous/homozygous mixes (20 out of 20 SNPs) and 83% of the heterozygous/homozygous mixes (159 out of 192), we observed a significant linear correlation (p 0.05) are indicated with r2 = 0. These analyses rely on the observation of the three genotypes in the population (i.e. AA, AB and BB). To also include SNPs with a lower minor allele frequency for each it was not possible to observe homozygotes for the minor allele in our small sample, we designed a second analysis method using solely the heterozygous individuals. If the alleles are differentially regulated we expect to observe in some cases a very large variance in the ratio of the two alleles in the population. We used this approach to determine SNPs for which the heterozygous individuals harbor a variance of the allelic ration higher than expected using a Maximum expectation algorithm (see Materials and Methods for details). This approach does not allow us to quantity the overall extent of differential allelic expression but identifies 8 genes with differential allelic expression that were not identified by the previous method. When one considers the estimates of allelic expression obtained using different SNPs in a same transcript showing significant differences in allele expression (i.e. with a population-average ratio greater than 60∶40), we note that 36 out of 44 correlations between individuals estimates are significant (for an average r2 of 0.83). Individuals showing a large allelic expression difference at one SNP display similar patterns at all heterozygous positions of the transcript (an example is shown on Figure 5). This observation supports our findings that the experimental variability is low in the Illumina ASE assay and that this assay allows quantitative assessment of differential allelic expression. Consequently, the population-average estimates of allelic imbalance obtained with different markers in the same transcript tend to be similar (Table S2) but can vary since different individuals will be included in the average (depending on whether they are heterozygotes at this marker). 10.1371/journal.pgen.1000006.g005 Figure 5 Estimates of allelic expression using two SNPs located in the IL1A gene. Each blue cross displays one individual heterozygous at two SNPs in IL1A. The x-axis represents the estimate of allelic imbalance using rs1304037, the y-axis the allelic imbalance measured using rs17561. The two axes cross at 50∶50 corresponding to an equal expression of both alleles, the allelic ratio 100∶0 corresponds to complete transcriptional silencing of one allele, 0∶100 to the silencing of the other allele. Validation of Allelic Imbalance Estimates Using Quantitative Sequencing To further assess the validity of our results, we randomly selected 25 genes tested on the Illumina ASE platform and used quantitative sequencing of RT-PCR products [33] to measure allelic imbalance for the same SNP in the same individuals (Figure S3). The selected genes consisted of eight autosomal genes with significant allelic imbalance and 17 genes for which the level of differential allelic expression did not reach our significance threshold. We analyzed the same 81 individual LCLs using RNA from the same extract as for the Illumina assay. Overall, we observed a strong correlation between the estimates of allelic imbalance obtained for each individual using the two methods for the genes with a ratio of allelic expression larger than our 40∶60 cut-off (r2>0.8 for 6 out of 8 genes, see Figure 6A as a example). The correlations were not statistically significant for the genes for which the average difference in allelic imbalance did not reach our significance threshold (16 out of 17 genes, Figure 6B): minor deviations observed in the allelic ratio for these genes likely correspond to random variations and are therefore not expected to be reproducible. The strength of the correlation (measured by Pearson's r2) for all 25 genes is shown on Figure 4. One SNP in CD44 (rs8193) displayed a low but significant correlation between our estimates of allelic imbalance obtained from the Illumina assay and those using quantitative sequencing (p 0.9), supporting the idea that differential allelic expression is little influenced by variations in culture environment. Comparison of Differential Allelic Expression with Total Gene Expression Association An alternative approach to identifying cis-regulatory polymorphisms is to test for statistical association in a population between total gene expression measurements and the genotypes at markers in or surrounding the transcript. Interestingly, one of the genes showing the most marked difference in allele expression in our analysis is one of the 14 genes identified by Cheung and colleagues in a previous genome-wide study [17]. One comprehensive analysis was recently conducted for 512 RefSeq genes in ENCODE regions and chromosome 21 using LCLs from 60 unrelated individuals genotyped by the HapMap project [19]. In order to compare the respective strengths and weaknesses of total gene expression mapping and differential allelic expression, we designed a second panel that includes SNPs in the same genomic regions to analyze the same individual LCLs (Figure 1 and Table S1 for details). Using the information from the HapMap phase I (release 16) to select common exonic SNPs, we were able to include 228 and 124 genes from, respectively, ENCODE regions and chromosome 21, while Stranger and colleagues selected 321 and 191 genes (after screening for genes with high variance in their expression among individuals, see Materials and Methods for details). From the regions analyzed by Stranger and al., two-hundreds and ninety SNPs (in 170 genes) showed an expression level significantly higher than the background in three or more heterozygous individuals and were further investigated. Forty-nine out of 170 genes show significant level of differential allelic expression including 6 out of the 21 genes identified by Stranger and colleagues and present in our panel. Additionally, TTC3 which shows significant association between total gene expression and genotypes in the study by Stranger et al. shows patterns of allelic expression consistent with differential allelic expression (including a very high correlation between the extent of differential allelic expression estimated using different SNPs) on the Illumina ASE assay, even though it did not pass the significance cut-off. Overall in this second panel, 497 SNPs in 317 genes were expressed in three or more heterozygous individuals (out of 1536 SNPs in 674 genes) and 78 SNPs in 65 genes showed a significant level of differential allelic expression (Figure 1). Intronic SNPs Can Be Used To Assess Differential Allelic To test whether intronic SNPs could be used instead of exonic SNP, we included for each gene on the second panel one intronic SNP. In general, intronic SNPs were less successfully analyzed and passed our expression threshold only for genes highly expressed in LCLs (Figure S5). This finding is consistent with previous observations [30] and the low proportion of unspliced mRNA (heteronuclear RNA) in cells relative to spliced transcripts. If the intronic SNP of a gene was detected in the RNA extract, it typically yields estimates of differential allelic expression very similar with those obtained using exonic SNPs. Differential Allelic Expression in the Human Genome Overall, 177 out of 1,009 expressed SNPs (in 140 out of 643 genes expressed, 22%) display population-average ratios of allelic expression larger than 40∶60 or an higher than expected variance in allelic expression among heterozygous individuals and are thus unlikely to result solely from stochastic variation in the experiment (Figure S6). Table 1 shows the 133 SNPs (100 genes) with significant allelic imbalance after manual curation to remove possible false positives due to a low number of individual homozygous for the minor allele (this list is likely over-conservative and the complete data is presented in Table S2). 10.1371/journal.pgen.1000006.t001 Table 1 Genes with significant allelic expression difference. rs Gene Panel Chr # Hets Variance & Mean Average AI rs2292305 THBS1 Cancer 15 13 Mean shifted NA rs2288539 NR2F6 Cancer 19 18 Mean shifted NA rs2459216 OAT Cancer 10 6 High variance NA rs3750105 PEG10 Cancer 7 8 High variance NA rs3745410 LILRB3 Encode 19 5 High variance NA rs2839500 TMPRSS3 Encode 21 13 High variance NA rs546782 FGF9 Cancer 13 10 High variance NA rs17756426 DDX43 Encode 6 6 High variance NA rs1893963 DSC2 Cancer 18 14 High variance NA rs9978281 C21orf7 Encode 21 12 High variance NA rs2834601 CLIC6 Encode 21 8 High variance NA rs2070807 DNASE1L1 Encode X 5 High variance NA rs705 SNRPN Cancer 15 42 High variance 95∶05 rs7810469 PEG10 Cancer 7 12 High variance 95∶05 rs13073 PEG10 Cancer 7 35 High variance 95∶05 rs2240176 FLJ35801 Encode 22 18 95∶05 rs311683 DDX43 Encode 6 15 High variance 90∶10 rs5956583 BIRC4 Cancer X 21 High variance 90∶10 rs1056831 CHI3L2 Cancer 1 39 90∶10 rs1050757 G6PD Encode X 6 Mean shifted 90∶10 rs1474593 BIRC4 Encode X 19 High variance 90∶10 rs8371 BIRC4 Cancer X 20 High variance 85∶15 rs1800291 F8 Encode X 6 85∶15 rs2734647 MECP2 Cancer X 12 High variance 85∶15 rs1057403 BTK Cancer X 4 85∶15 rs1059701 IRAK1 Cancer X 14 High variance 85∶15 rs700 BTK Cancer X 22 High variance 85∶15 rs9018 FHL1 Cancer X 4 85∶15 rs2734647 MECP2 Encode X 10 85∶15 rs3813455 GAB3 Encode X 10 High variance 85∶15 rs5945431 PLXNA3 Encode X 9 High variance 85∶15 rs5958343 BIRC4 Cancer X 26 High variance 85∶15 rs5987266 PLXNA3 Encode X 10 High variance 85∶15 rs311686 DDX43 Encode 6 17 High variance 85∶15 rs9856 BIRC4 Cancer X 24 High variance 85∶15 rs6151429 ARSA Encode 22 24 85∶15 rs17330644 BIRC4 Encode X 15 High variance 80∶20 rs1050705 F8 Encode X 12 High variance 80∶20 rs11887 VBP1 Cancer X 12 High variance 80∶20 rs12877 DNASE1L1 Cancer X 4 80∶20 rs10798 KCNQ1 Cancer 11 31 High variance 75∶25 rs6571303 CXorf12 Encode X 8 75∶25 rs9394782 NCR2 Encode 6 15 75∶25 rs183436 ABCG1 Encode 21 36 75∶25 rs6691569 FCRL3 Encode 1 12 75∶25 rs17561 IL1A Cancer 2 21 High variance 75∶25 rs2278699 ZAP70 Cancer 2 3 75∶25 rs8535 CHI3L2 Cancer 1 38 High variance 75∶25 rs1056825 CHI3L2 Cancer 1 38 High variance 75∶25 rs1304037 IL1A Cancer 2 26 High variance 75∶25 rs1571858 GSTM3 Encode 1 17 High variance 75∶25 rs3817405 PLXDC2 Cancer 10 12 High variance 75∶25 rs10863 MEST Cancer 7 29 High variance 70∶30 rs10336 CXCL9 Cancer 4 11 High variance 70∶30 rs5351 EDNRB Cancer 13 22 High variance 70∶30 rs1022477 RIBC2 Encode 22 26 70∶30 rs6007897 CELSR1 Encode 22 10 High variance 70∶30 rs11264793 FCRL3 Encode 1 24 70∶30 rs4445669 IGSF4 Cancer 11 40 High variance 70∶30 rs4767884 PXN Cancer 12 26 70∶30 rs1041985 CDH2 Encode 18 38 70∶30 rs140519 KLHDC7B Encode 22 30 70∶30 rs17197 PTGER2 Encode 14 10 70∶30 rs1803965 MGMT Cancer 10 5 70∶30 rs2837029 C21orf13 Encode 21 15 Mean shifted 70∶30 rs724558 SERPINB10 Encode 18 23 70∶30 rs6007594 C22orf8 Encode 22 28 70∶30 rs7561 LAMB1 Cancer 7 21 High variance 70∶30 rs165602 NEFH Encode 22 6 70∶30 rs10593 ITGB1BP1 Cancer 2 23 High variance 65∶35 rs1042531 PCK1 Encode 20 9 65∶35 rs1025689 IL17RB Cancer 3 29 65∶35 rs225334 TFF2 Encode 21 17 65∶35 rs17207369 LILRP2 Encode 19 18 65∶35 rs3856806 PPARG Encode 3 5 High variance 65∶35 rs300239 ENC1 Cancer 5 27 65∶35 rs677688 IMPACT Cancer 18 7 65∶35 rs6104 SERPINB2 Cancer 18 19 High variance 65∶35 rs8097425 SERPINB10 Encode 18 24 65∶35 rs1071676 IL1B Cancer 2 29 High variance 65∶35 rs9612234 GNAZ Encode 22 16 65∶35 rs2024233 WNT2 Encode 7 24 High variance 65∶35 rs7927012 TRIM6 Encode 11 30 65∶35 rs2024233 WNT2 Cancer 7 15 High variance 65∶35 rs162549 CYP1B1 Cancer 2 22 High variance 65∶35 rs2075760 PLSCR3 Cancer 17 19 65∶35 rs2832236 C21orf7 Encode 21 40 High variance 65∶35 rs15017 MOXD1 Encode 6 6 High variance 60∶40 rs1053474 IMPACT Cancer 18 31 60∶40 rs7120209 TRIM6 Encode 11 13 60∶40 rs958 MAPK10 Cancer 4 28 60∶40 rs7914 MCAM Cancer 11 31 High variance 60∶40 rs12593359 RAD51 Cancer 15 44 60∶40 rs1368439 IL12B Encode 5 23 60∶40 rs1103229 PPIL2 Encode 22 26 60∶40 rs2839600 NDUFV3 Encode 21 17 60∶40 rs3734744 MOXD1 Encode 6 11 60∶40 rs8807 HLA Cancer 6 15 60∶40 rs743616 ARSA Encode 22 37 60∶40 rs6214 IGF1 Cancer 12 31 High variance 60∶40 rs2822445 RBM11 Encode 21 38 60∶40 rs4947963 EGFR Encode 7 19 60∶40 rs5275 PTGS2 Cancer 1 41 60∶40 rs2257505 MGC33648 Encode 5 29 60∶40 rs1029365 FLJ21062 Encode 7 30 60∶40 rs2839536 TSGA2 Encode 21 19 60∶40 rs6518322 LOC284837 Encode 21 29 60∶40 rs2258119 C21orf91 Encode 21 25 60∶40 rs2229730 CSK Cancer 15 4 60∶40 rs2206593 PTGS2 Cancer 1 10 High variance 60∶40 rs2829877 JAM2 Encode 21 17 60∶40 rs1053395 TUBB4 Cancer 19 32 60∶40 rs1044104 BMP6 Cancer 6 22 60∶40 rs1801719 F2R Cancer 5 31 60∶40 rs235768 BMP2 Cancer 20 7 High variance 60∶40 rs2239730 ZNF215 Cancer 11 41 60∶40 rs2270121 GAS7 Cancer 17 38 60∶40 rs963075 SERPINB10 Encode 18 32 High variance 60∶40 rs4820268 TMPRSS6 Encode 22 38 60∶40 rs4798 ITGB1BP1 Cancer 2 34 60∶40 rs9782 ASCL1 Cancer 12 28 60∶40 rs180817 BCR Cancer 22 15 60∶40 rs406271 TFRC Cancer 3 33 60∶40 rs1476217 FGF2 Cancer 4 43 High variance 60∶40 rs2239731 ZNF215 Cancer 11 44 60∶40 rs10916 CYP1B1 Cancer 2 13 High variance 60∶40 rs2230033 KCNJ15 Encode 21 33 60∶40 rs3088440 CDKN2A Cancer 9 31 60∶40 rs2855658 CYP1B1 Cancer 2 37 High variance 55∶45 rs14983 MMP7 Cancer 11 22 High variance 55∶45 rs3747676 FGF2 Cancer 4 42 High variance 55∶45 rs10502001 MMP7 Cancer 11 19 High variance 55∶45 rs2066575 DLEU1 Cancer 13 27 High variance 55∶45 #Hets: number of heterozygous individuals which express the transcript. Variance & Mean: indicates whether the analyses of variance/mean allelic expression detected significant deviation of the expression of both alleles (see Materials and Methods for details). Average AI: population-average difference in allelic expression using all individuals heterozygous at this position (rounded down). These values correspond to values reported on the y-axis on Figure 4 and Figure S6. Many of the genes with the highest extent of allelic imbalance in LCLs are located on the X-chromosome. While it is known that one allele at most X-linked genes is silenced in females by inactivation of one entire chromosome [34],[35], we would expect that a polyclonal cell population (in which half of the cells inactivate one X chromosome and the other 50% inactivate the alternate X chromosome) would give a similar level of expression for both alleles. However, all X-linked genes on our two SNP panels (22 SNPs in 12 genes) were among the top 5% of genes with most dramatic allelic imbalance patterns. The extent of allelic imbalance at a given gene varies among individual LCLs but interestingly, the patterns of allelic imbalance are very consistent across genes for a given individual (Figure S7). Additionally, the inheritance of the expressed allele (determined, when possible, using the pedigree information for the two families included in this study) appeared random. It has been previously proposed that the extent of clonality of a cell line could explain the patterns of allelic imbalance at genes with random mono-allelic expression [30]: clonal cells will all have the same X chromosome inactivated and thus display very high ratios of allelic imbalance. In contrast, cell-lines composed of a polyclonal population of lymphoblasts will have one or the other of their X chromosomes inactivated in different cells and thus an apparent expression of both alleles (i.e., a low extent of differential allelic expression). Our observations at X-linked genes are consistent with this hypothesis and the biased clonality of these LCLs, which were created over 20 years ago and passaged numerous times (see also [30]). The two autosomal genes displaying the most dramatic allelic imbalance patterns have previously been shown to be imprinted in humans: PEG10 [36] and SNRPN [37]. In addition, KCNQ1, MEST and ZNF215 which are imprinted in humans [38]–[40] also show significant differences in allelic expression (Table 1). The mode of inheritance of the expressed allele also corresponds, in each case, to what has been described for the expression of these genes: for PEG10 and SNRPN, heterozygous individuals express the paternally-inherited allele (i.e. maternally imprinted) while for KCNQ1 the maternally-inherited allele is expressed. Our limited pedigree information is not conclusive for MEST and ZNF215. The only other known imprinted gene analyzable in our panel, PLAGL1 [41],[42] did not pass the significance threshold (i.e. an allelic ratio greater than 60∶40) but shows a population average allelic imbalance larger than 55∶45 and a high correlation between the two SNPs analyzable in the panel (rs2076684 and rs9373409); therefore it likely represents a significant difference in allelic expression. The 83 remaining genes (103 SNPs) with significant population-average allelic imbalance included several genes for which allelic imbalance had been shown in previous studies (e.g. IL1A or IGF1 described in [30]). For some genes (e.g., CHI3L2), one allele/haplotype is clearly expressed more than the other in heterozygotes and the inheritance pattern in families supports a genetic cause for allelic imbalance. For other genes, neither the direction of allelic imbalance nor the pedigree analysis allowed us to easily differentiate the genetic/epigenetic cause of the differential allelic expression (Table 1). For 56 genes with significant differences in allelic expression we tested whether differential allelic expression could be statistically associated to one of the SNP in the vicinity of the gene genotyped by the HapMap project (see Materials & Methods for details). The results of these tests for SERPINB10 and ABCG1 are shown on Figure 7 and the strongest nominal association for each gene is displayed on Table 2. Twenty-three genes still display statistical significant associations after Bonferroni correction for multiple testing (highlighted in green on Table 2) showing a clear enrichment relative to the 2–3 associations expected by chance. Our power to detect a significant association between a HapMap SNP and the under-/over-expressing chromosome in this setting is low due to our reduced sample size (only the heterozygous individuals are taken into account in this analysis) and the number of regulatory haplotypes identified is thus likely underestimated. Additionally, many SNPs are tested for each gene and it is thus possible that some of the regulatory haplotypes result from spurious associations (i.e. they are false positives). One argument against a very high rate of false positive in our analysis is that imprinted genes such as MEST or PEG10 do not show any signal of association (Figure S8) consistent with the fact that the cis-regulatory mechanism at these genes is not encoded in the DNA sequence. To further investigate the validity of our association, we attempted to independently confirm these regulatory haplotypes by testing for the statistical association between one SNP in the regulatory haplotype and gene expression level. We used gene expression measurements performed at the Wellcome Trust Sanger Institute (kindly provided by M. Dermitzakis) on the same individual cell lines assayed by Illumina gene expression arrays. For each gene, we tested whether the homozygotes for the regulatory haplotype associated with low allelic expression in heterozygotes show a significantly lower gene expression level than the homozygous individuals for the regulatory haplotype associated with high allelic expression. We also performed locus-specific RT-PCR and quantified the level of gene expression using SYBR-Green for eleven genes for which differential allelic expression was significantly associated with allelic expression but for which expression data were not available (2 genes) or genes with strong association with a regulatory haplotype but that were not validated using the Sanger dataset (9 genes). Overall, out of the 47 genes with a significant association between a SNP (or several, defining the regulatory haplotype) and differential allelic expression at the nominal cut-off, 10 were confirmed using gene expression measurements while 5 other genes showed a trend but did not reach statistical significance (Table 2). 10.1371/journal.pgen.1000006.g007 Figure 7 Association mapping of allelic imbalance to regulatory haplotypes for SERPINB10 (A) and ABCG1 (B). The green track shows the –log(p.value) for the association of the alleles for each SNP with the over-/under-expressing chromosome (i.e. the higher the bar, the strongest the association). The linkage disequilibrium pattern for the CEPH individuals genotyped by the HapMap project is displayed below (using r2). 10.1371/journal.pgen.1000006.t002 Table 2 Results of the association mapping and validation assays. Gene Chr # Hets Association Sanger RT-PCR Mapped to ABCG1 21 18 2.20E-10 6.81E-01 5.78E-01 Intron ARSA 22 17 2.26E-05 NA Gene ASCL1 12 13 1.69E-02 1.37E-02 BCR 22 6 6.06E-02 9.83E-01 BMP6 6 5 4.76E-02 8.62E-01 C21orf13 21 9 4.11E-05 No data 3.78E-02 Gene-5′ C21orf7 21 12 3.71E-01 7.06E-01 C21orf91 21 17 1.54E-08 8.47E-04 Gene C22orf8 22 8 1.55E-04 No data Gene CDH2 18 22 5.32E-05 7.00E-01 Gene CXCL9 4 7 4.66E-03 3.43E-01 Whole Region CYP1B1 2 17 2.11E-06 1.43E-02 Gene-5′ DDX43 6 9 3.47E-01 5.56E-01 EDNRB 13 14 7.03E-03 6.17E-01 Whole Region EGFR 7 6 4.55E-01 6.21E-01 F2R 5 10 1.08E-05 7.32E-01 9.65E-02 3′UTR FCRL3 1 14 6.88E-05 No data 8.04E-03 Gene-3′ FGF2 4 22 6.00E-03 7.63E-01 Gene-3′ FLJ35801 22 13 4.72E-02 1.23E-01 GAS7 17 12 7.40E-07 2.96E-01 1.45E-01 3′UTR GNAZ 22 5 7.94E-03 6.42E-01 1.16E-01 Gene-3′ IGF1 12 13 2.38E-01 9.17E-01 IGSF4 11 15 1.70E-03 2.91E-01 2.59E-01 Gene-3′ IL12B 5 13 1.92E-07 4.51E-02 Whole Region IL17RB 3 14 5.98E-06 6.81E-01 7.33E-01 5′ IL1A 2 9 1.49E-01 8.74E-01 IMPACT 18 11 7.52E-03 6.63E-01 5′distal ? ITGB1BP1 2 18 1.58E-03 No data Gene JAM2 21 9 2.26E-03 NA 5′ ? KCNJ15 21 13 1.69E-02 3.25E-01 Gene KCNQ1 11 21 2.07E-02 1.59E-01 KLHDC7B 22 21 3.72E-12 No data Gene-3′ LILRP2 19 10 2.30E-02 No data LOC284837 21 16 3.73E-02 No data MAPK10 4 11 2.84E-06 5.25E-01 Gene MCAM 11 12 6.44E-04 5.21E-01 Gene/3′ MEST 7 9 1.53E-01 5.50E-01 MGC33648 5 21 4.40E-06 3.59E-05 5′ MMP7 11 8 4.66E-03 7.46E-01 MOXD1 6 8 6.99E-03 1.12E-01 5′ NDUFV3 21 11 2.84E-06 3.73E-05 Gene-3′ PEG10 7 22 4.59E-02 NA PLSCR3 17 10 1.98E-02 2.22E-01 PLXDC2 10 8 7.69E-02 3.40E-02 PTGS2 1 5 2.06E-01 NA PXN 12 14 4.23E-04 9.24E-01 Gene-3′ RBM11 21 10 1.08E-05 4.59E-01 1.71E-01 3′ RIBC2 22 12 7.40E-07 No data Gene SERPINB10 18 22 1.84E-05 5.98E-11 5′ SERPINB2 18 10 1.19E-04 2.98E-01 Gene-5′ SNRPN 15 27 3.94E-02 3.78E-01 TMPRSS6 22 12 7.40E-07 8.42E-01 3′UTR TRIM6 11 8 2.56E-02 9.84E-01 TSGA2 21 8 1.55E-04 9.81E-01 Gene WNT2 7 9 9.05E-03 3.60E-01 6.45E-01 5′ Gene ZNF215 11 20 3.73E-02 1.45E-03 1.82E-02 Gene-5′ #Hets: number of heterozygous individuals for the gene considered among the HapMap individuals analyzed. Association: lowest nominal p-value obtained by Fischer exact test of the association of the alleles of all SNPs with over-/under-expressing chromosomes. Associations remaining significant after Bonferroni correction for multiple testing are highlighted in green. Sanger: p-value obtained in the linear regression of the Sanger gene expression measurements with the alleles of the SNP most strongly associated with allelic expression. RT-PCR: p-value obtained in the linear regression of the locus-specific RT-PCR gene expression measurements with the alleles of the SNP most strongly associated with allelic expression. Mapped to: broad localization of the regulatory haplotype with regards to the gene with significant difference in allele expression. Discussion Biology, Not Experimental Noise, Is Responsible for Differential Allelic Expression We analyzed differences in relative allelic expression (or allelic imbalance) at 1,380 human genes using 2,968 SNPs and more than 80 lymphoblastoid cell lines from individuals with European ancestry. Using quantitative sequencing we validated our results for a subset of genes and showed that the experimental variability in both settings is low and that the Illumina ASE assay and quantitative sequencing of RT-PCR products yield reproducible estimates of allelic imbalance consistent with each other. Overall, the experimental noise is much lower than the difference in relative allelic expression observed at many loci and therefore cannot be responsible for it. Additionally, the high concordance of the results obtained using different SNPs in the same transcript supports our findings that allelic imbalance, as we estimated it, is not an experimental artefact but reflects inherent biological differences in the relative expression of both alleles in heterozygous individuals. We also showed that lymphoblastoid cell lines, despite being simplified biological materials, are suitable resources to investigate mechanisms of gene regulation. Here, we demonstrated that our estimation of allelic imbalance is little affected by growth conditions and that LCLs harvested from different passages yield very similar results. Finally, the results efficiently recapitulate the consequences of the epigenetic mechanisms established in the individuals from which the cells have been derived (see also [43]). We are therefore confident that, overall, the patterns of allelic imbalance we observed are neither experimental artifacts, nor specific to the material studied, but represent a common biological phenomenon affecting human gene expression. Differential Allelic Expression Identifies Consequences of Epigenetic Mechanisms of Gene Regulation We showed that LCLs derived from female individuals still harbor the consequences of X-inactivation at all X-linked genes investigated, with one allele being transcriptionally silenced [34]. The extent of allelic imbalance detected at X-linked genes can vary among LCLs due to the various degrees of clonality of these cells but clonal LCLs consistently show complete silencing of one allele at all X-linked genes investigated (Figure S7). In addition, imprinting, established in the germ lines of the parents of the individuals from which the cells are derived [44], is also maintained in LCLs. In our experiments, PEG10, SNPRN, MEST and KCNQ1 show reduced or absent expression of one allele and, when the mode of inheritance can be determined, it corresponds to the imprinting mechanism described in the literature (i.e. PEG10 and SNPRN are maternally imprinted, KCNQ1 is paternally imprinted). We thus observe extensive differential allelic expression (i.e. allelic ratio larger than 70∶30) for all genes whose expression is known to be epigenetically regulated. This clearly shows that analysis of differential allelic expression is a suitable method for identifying the consequences of epigenetic mechanisms of gene regulation. The Illumina ASE assay would thus provide an efficient method to screen tumor tissues and identify patterns of differential allelic expression resulting from aberrant methylation or loss of imprinting that are known to be involved in the etiology of cancers [45]–[47]. Interestingly, IMPACT which shows significant extent of allelic imbalance at two SNPs (rs677688 and rs1053474) in our study, is known to be imprinted in mice [48] but not in humans [49]. The mode of inheritance of the over-expressed alleles could not be determined using the two families available in our study (i.e. the parents were always homozygous for the same allele). The attempt to map differential expression to a regulatory haplotype was not successful and is consistent with an epigenetic mechanism of gene regulation. More investigations are required to determine whether the pattern of allelic imbalance observed for IMPACT results from incomplete silencing of one allele following imprinting in the parental germ-lines or whether it results from random mono-allelic expression or another mechanism of gene expression regulation. Regulatory Polymorphisms Determine Allelic Expression for Some Human Genes Our analysis of 643 genes expressed in LCLs shows that, for a large proportion of them (∼20%), the two alleles are differentially expressed in most heterozygous individuals. For 18 genes, differential expression resulted from a known epigenetic silencing of one of the two alleles, either through X-inactivation in females or imprinting. The mechanisms leading to allele-specific expression at all other genes could be driven by a polymorphism affecting the cis-acting regulation (e.g. a SNP in a transcription factor or a miRNA biding site) or simply result from random silencing of one of the two alleles. We tested 56 genes for association of differential allelic expression patterns observed with a cis-acting regulatory polymorphism using genotypes generated by the HapMap project (see Materials and Methods for details). For 23 of these genes we identified a region statistically associated with differences in allele expression that could indicate the existence of a regulatory haplotype (i.e., a region of one chromosome likely containing the polymorphism(s) causing the differential cis-regulation). These regions are often tens of kb long, consistent with previous descriptions of the linkage disequilibrium patterns in humans [50]. Although this approach does not identify the actual polymorphism(s) responsible for the differential cis-regulation, examination of these regulatory haplotypes provides some valuable insights on the mechanisms leading to differential expression and can guide future investigations. For example, the regulatory haplotype for GAS7 is almost exclusively restricted to the 3′UTR of the gene and may indicate that the patterns of allelic imbalance observed are due to differential mRNA processing, stability or the presence of a 3′ enhancer. In contrast, the regulatory haplotype identified for MGC33648 is located in the 5′ region and does not seem to overlap with the gene itself. This might be indicative of alternative promoter usage or differential transcription efficiency (e.g. due to differential transcription factor binding site affinity). Allelic Imbalance Is Complementary of Total Gene Expression Association Several recent studies have used large-scale associations between gene expression and extensive genotype information to investigate gene regulation in humans, some of them using cell lines included in our study. In particular, Stranger and colleagues analyzed 630 genes located in ENCODE regions, on chromosome 21 and in one portion of chromosome 20. They found evidence of cis-acting regulation for 63 genes [19]. 2005). We were able to analyze 21 of these genes in our experiment. Six of them also showed evidence of cis-acting regulation (e.g. SERPINB10 or TSGA2) in our study while a seventh gene (TTC3) showed patterns consistent with differential allelic expression but did not reach our significance threshold. The remaining 14 genes did not show evidence of differential allelic expression in our analysis. Alternatively, we identified 10 new genes located in ENCODE region or chromosome 21 that showed significant level of differential allelic expression but were not detected in the Stranger study. Several non-exclusive reasons could explain the discrepancies between the results of the two approaches. First, it is worth noting that, even if the same individuals are analyzed by allelic-specific expression and gene expression association, the power to detect cis-acting effect differs depending on the allele frequency of the marker used: in gene expression association analysis all individuals are analyzed but the power in the regression analysis depends on their genotypes (e.g. the genotypes AA, AB and BB are encoded in the linear regression as 0, 1 and 2) while in allelic expression analysis only the individuals heterozygotes at the marker considered are analyzed. This can become particularly problematic to study differential allelic expression at some genes since it requires a relatively common exonic SNP to detect allelic imbalance. In this context, it is worth noting that intronic SNPs can successfully be used for genes that are highly expressed (see also [30]). Second, associations of gene expression to genotypes depends greatly on the linkage disequilibrium (LD) patterns and requires extensive genotype information from all the individuals in order to include one marker in LD with the regulatory polymorphism. Allelic expression, on the other hand, directly investigate cis effect directly at the gene level and thus only requires physical link between the gene and the regulatory polymorphism affecting it (i.e. they need to be on the same chromosome). Finally, the differences between allelic expression and gene expression mapping might indicate that some genes are also regulated by trans-acting mechanisms that differ among individuals: differential allelic expression is influenced only by cis-acting mechanisms of gene regulation while gene expression is influenced by cis- and trans-acting gene regulation. It is thus not unlikely that individual differences in trans-acting regulation swamp the signal from cis-acting polymorphisms. In this context, it is noteworthy that total gene expression mapping has been much more successful in mice and yeast for which the genetic heterogeneity is much lower and can be controlled (reviewed in [9],[10],[51]). In humans, or in any other outbred population, genetic heterogeneity greatly limits the identification of cis-acting mechanisms using gene expression data while measurements of differential allelic expression are unaffected. We showed here that allelic expression assays are complementary from gene expression mapping and that the Illumina ASE assay overcomes two of the major limitations and criticisms of the former methodologies used to assess differential allelic expression: it allows a robust and high-throughput estimation of allelic imbalance: it is now possible to reliably screen hundreds of RNAs for several hundreds of genes in a couple of days. Additionally, when several SNPs can be used to assess differential allelic expression, the assay becomes very robust since each marker provides an independent estimation and one can test the correlation among estimates obtained at different positions. It is worth noting here that since this assay relies on the comparison of allelic ratio in DNA and RNA of each individual, it internally controls for the existence of polymorphisms in the primer sites or copy number variation encompassing the gene studied (that will affect equally DNA and RNA). Likely, the greatest advantage of the analysis of differential allelic expression over total gene expression is its flexibility. To identify differential regulation of gene expression using total gene expression, one needs extensive genotype information to test whether, at any polymorphic position, the gene expression differences among individuals segregate according to their genotype. This precludes a quick assessment of the expression of one locus in one cohort of particular interest or using a specific tissue. In contrary, differential allelic expression offers the advantage that any one gene can be quickly assessed in any cohort or tissue by simply comparing the expression of the two alleles in each individual (the amount of genetic information recently made available by the HapMap project allows a quick and easy selection of markers likely to be polymorphic for a given gene). The determination of regulatory haplotypes would still require extensive information concerning surrounding polymorphisms but the initial screening to determine whether one transcript is differentially cis-regulated can be done very efficiently with a handful of markers. Conclusion We showed that differential allelic expression is a robust approach to identify cis-acting mechanism of gene regulation. It complements gene expression association studies and offers additional perspectives, notably on epigenetic mechanisms of gene regulation. It could thus be particularly interesting to apply this assay to tumors to detect mis-regulated genes due to aberrant methylation patterns or loss of imprinting. In addition, our approach is applicable to any new cohort or tissue since it is self-sufficient to identify differential cis-regulation and does not require additional genotyping. It can be easily used to follow-up interesting non-coding regions associated to a particular disease and test if they are involved in the etiology of the disease through some regulatory effects on neighboring genes. Materials and Methods Sample Preparation 83 lymphoblastoid cell lines (LCL) derived from blood samples from the CEPH collection were selected for this project. They included 60 unrelated individuals obtained from Utah residents with ancestry from western and northern Europe for which DNA was genotyped for millions of SNPs covering the entire genome by the International HapMap Project. Additionally, 21 LCLs from CEPH pedigrees 1420 and 1444 were included to provide complete information on two three-generation CEPH families. Cells were grown at 37°C and 5% CO2 in RPMI 1640 medium (Invitrogen, Burlington, Canada) supplemented with 15% heat-inactivated fetal bovine serum (Sigma-Aldricht, Oakville, Canda), 2 mM L-glutamine (Invitrogen, Burlington, Canada) and penicillin/streptomycin (Invitrogen, Burlington, Canada). The cell growth was monitored with a hemocytometer and the cells were harvested when the density reached 0.8–1.1 × 106 cells/mL. Cells were then resuspended and lysed in TRIzol reagent (Invitrogen, Burlington, Canada). For all LCLs, three successive growths were performed (corresponding to the 2nd, 4th and 6th passages) after thawing frozen cell aliquots. Illumina Allele-Specific Expression (ASE) Assay We estimated allelic imbalance at 1,380 genes (two panels of ∼1,500 SNPs, Figure 1) using the Illumina ASE assay (Figure 2). The experiment is similar to the one used for large-scale SNP genotyping [52] and gene expression profiling [53] except that DNA and RNA are independently assessed and compared to each other. RNA was first converted into biotinylated cDNA [53] while DNA was treated according to the usual GoldenGate assay protocol [52]. Biotinylated DNA (derived from genomic DNA or mRNA) was immobilized on paramagnetic beads and pooled SNP-specific oligonucleotides were annealed on the DNA. Hybridized oligonucleotides were then extended and ligated to generate DNA templates, which were amplified using universal fluorescently-labeled primers. Finally, single-stranded PCR products were hybridized to a Sentrix Array Matrix [52], and the arrays were imaged using the BeadArray Reader Scanner [54]. 96 samples (DNA or RNA) were analyzed per Sentrix Array for ∼1,500 SNPs. All RNA measurements were performed in duplicates. Analyses of ASE Results To estimate the extent of allelic imbalance in heterozygote individuals at each SNP of the Illumina ASE panel, we developed algorithms using two different approaches: i) we used information from individuals of all three genotypes (AA, AB and BB), and/or ii) we used only the heterozygote individuals. We first determined whether a given gene was expressed above a determined background in a given individual. To do so, we made use of the fact that the genotypes were known (from the DNA analysis) and developed a locus-specific expression background cut-off: homozygote individuals (i.e. AA or BB) can only express the corresponding allele, respectively A or B, at the RNA level (if at all). We thus determined a background fluorescence level (i.e. corresponding to random noise) for each allele (i.e. A and B) by measuring the emission in the corresponding dye (respectively, Cy3 and Cy5) in individuals homozygous for the other allele (respectively BB and AA). This is represented schematically on Figure S9. To avoid false positive results due to the inclusion of transcripts not expressed in the cell lines considered, we used a conservative approach and arbitrarily fixed the background emission cutoff to the maximum emission of the absent allele of all homozygotes, plus the mean emission of the absent allele divided by the number of homozygotes (to weight the uncertainty in the determination of the “maximum noise” by the numbers of individuals used to determine it). This procedure allowed us to independently estimate the background emission of each allele/dye specifically for each SNP, which is particularly important because the fluorescence emission can differ drastically between the dyes and among loci (data not shown). We then proceeded to the detection call using the background cut-offs: individuals with genotypes AA were considered to express a given transcript if the emission was larger than twice the cutoff background emission of A, individuals with genotypes AB if the fluorescence was larger than the sum of the background emission of A and the background emission of B, and individuals with genotypes BB if the emission is larger than twice the background emission of B. Since the inclusion in the analyses of transcripts expressed at low level (or not expressed at all) is very problematic, we excluded from our analyses all loci for which less than 75% of the individuals had discordant replicate expression (i.e., one replicate above expression background, the other under the cut-off value). The first method used to determine whether some heterozygote individuals expressed significantly differently the two alleles is locus-specific but requires having at least one individual expressed from each homozygote genotype (AA and BB). In this case, we determined the median log ratio of the two dyes for each homozygote clusters at the DNA and RNA level ( ) as well as the median absolute deviations (MAD). We used medians and MADs, instead of means and standard deviations, to down weight the influence of possible outliers. We then determined a range of “expected” (i.e. non significant) variation of allelic expression for the heterozygote individuals. We calculated the equation of the lines joining the median values plus/minus two MAD of AA and BB and estimated the range, for the log ratio of the dyes at the RNA level, between the lines at the value corresponding to the median of DNA in heterozygote individuals (Figure S10). If the observed log ratio of dyes for a given heterozygote individual fell outside the expected range of variation in absence of AI (Figure S8), we scored each heterozygote individual separately to obtain a quantitative estimation of allelic imbalance using the ratio: This simple estimate indicates both the magnitude of the allelic imbalance (i.e. the fold difference) and its direction (i.e. which allele is more expressed than its counterpart). In order to assess allelic imbalance for SNPs with low minor allele frequencies (for which homozygote individuals with the minor allele may not be present in a small sample size panel), we developed a second method based solely on the heterozygote individuals. If a given transcript is affected by allelic imbalance we expect that either the variance of the log ratio of dyes for heterozygote RNAs to be greatly increased relative to the variance of homozygote RNAs, or, if one allele is systematically more expressed than the other, the mean value of these log ratios to be drastically shifted from its expected intermediate position (between the mean for AA and the mean for BB homozygote RNAs). For all SNPs with at least five individuals with the same genotype expressed, we estimated the standard deviation of the log ratio of dyes for DNA and RNA. The distribution of the log ratio of the standard deviations (i.e. log σDNA/σRNA) over all loci for heterozygous individuals differed from those observed using homozygous individuals and did not seem to fit a normal distribution (Figure S11). Based on the assumption that this distribution may include some loci in allelic imbalance (and thus with a higher than expected RNA variance), we fitted a mixture of two Gaussians on our dataset (i.e., one corresponding to the loci with allelic imbalance, the second including all other loci) using a Maximum Expectation algorithm implemented in R (mixdist package). For our data, the best fit was obtained with a minor distribution (including ∼3% of the loci) corresponding to the most extremely negative log ratios of variances (i.e., that the RNA standard deviation was larger than expected). For each locus, we then used the probability of belonging to the “higher-than-expected RNA variance” distribution as an indication of allelic imbalance. Quantitative Sequencing of RT-PCR Products We assessed the extent of allelic imbalance by quantitative sequencing following the method described in Ge et al. [33]. Briefly, we isolated RNA using TRIzol reagent following the manufacturer's instructions. We assessed RNA quality with an Agilent 2100 Bioanalyzer (Agilent, Palo Alto, USA) before synthesizing first strand cDNA using random hexamers (Invitrogen, Burlington, Canada) and Superscript II reverse transcriptase (Invitrogen, Burlington, Canada). For each locus, we designed locus-specific primers, in the exon/UTR containing the SNP analyzed, at least 50 bp away from the SNP studied. 5 ng of genomic DNA and 10 ng of total cDNA were then amplified by PCR using Hot Start Taq Polymerase (Qiagen, Mississauga, Canada) with an activation step (95°C for 15 minutes) followed by 40 cycles (95°C for 30 s, 55°C for 30 s and 72°C for 45 s) and a final extension step (72°C for 6 minutes). PCR products were purified using Exonuclease I and Shrimp Alkaline Phosphatase (USB, Cleveland, USA) and sequenced using either one of the former primers or a nested primer, on an Applied Biosystems 3730xl DNA analyzer. We used PeakPeaker v.2.0 [33] with the default settings to quantify the relative amount of the two alleles measured from the chromatogram after peak intensity normalization. To estimate the experimental variability of the entire experimental setup we used a hierarchical strategy for two genes (cf. Figure S12): for two/three individual cell lines, we extracted independently RNAs three times and performed, on each extract, three independent RT-PCRs. All cDNA obtained were then split into three aliquots, each amplified independently by locus-specific PCR. These PCR products were finally sequenced each three times (i.e. three independent sequencing reactions). To estimate the variability at each experimental stage we calculated the mean standard variation normalized to the mean using the independent triplicates. To calculate the variance in the higher hierarchical levels (PCR, RT-PCR), we averaged the values from the lower level (e.g., to estimate the variability at the PCR level, we compared the means of the three sequencing values performed on each of the three PCRs: [s1,s2,s3] vs [s4,s5,s6] vs [s7,s8,s9]). The results are presented in Text S1. Association Mapping of Differential Allelic Expression We attempted to map allelic imbalance to regulatory haplotypes for all genes with significant differences in allelic expression that fulfilled these criteria: i) they are mapped on the build 34 of the human genome, ii) the SNP used in the Illumina ASE assay has also been genotyped by the HapMap [55] and iii) there are more than four HapMap individuals heterozygous at the marker SNP. For each gene, we retrieved the haplotype information from the phased chromosomes of each of the 57 HapMap CEPH individuals for 100,000 bp upstream and downstream of the SNP used to assess allelic imbalance. When a transcript contains more than one SNP or if two SNPs used to assess allelic imbalance at two transcripts are separated by less than 200,000 bp, the region retrieved spans from the most upstream marker plus 100,000 bp to the most downstream marker minus 100,000 bp. For each individual LCL, the over expressed and under expressed haplotype/chromosome were identified and each SNP was tested for segregation of the alleles in under- and over-expressed chromosomes using a Fischer's exact test. Between 47 and 592 SNPs were tested for each gene (mean  =  229) and the associations remaining significant after Bonferroni correction for multiple testing are shown in green in Table 2. Validation of Regulatory Haplotypes Illumina total gene expression data were obtained from the Wellcome Trust Sanger Institute for the 60 unrelated CEPH individuals genotyped by the HapMap project and included in our assay. We also determined the total expression for 10 genes using Real-Time PCR and SYBR Green labeling on an ABI 7900HT (Applied Biosystems, Foster City, CA) instrument. 8–10 ng of first strand cDNA were amplified using 0.32 µM of gene specific primers and Power SYBR Green PCR master mix (Applied Biosystems) according to the manufacturer's instructions. The amplifications started by 95°C for 10 min followed by 40 cycles at 95°C for 20 s, 58°C for 30 s and 72°C for 45 s. We performed the Real-Time PCR assays for the 60 individuals LCLs genotyped by the HapMap projects and analyzed 6 replicates per each sample. A standard curve was established using a dilution series of total cDNA of known concentration. The Ct for each replicate was transformed to a relative concentration using the estimated standard curve function (SDS 2.1, Applied Biosystems) and normalized based on 18S rRNA Taqman (Applied Biosystems) expression data obtained for each sample to account for well to well variability. Software All analysis scripts are available upon request. PeakPicker v.2.0 is available at http://www.genomequebec.mcgill.ca/EST-HapMap/. Supporting Information Text S1 Experimental variability using quantitative sequencing of RT-PCR products. (0.03 MB DOC) Click here for additional data file. Figure S1 Correlation between the estimates of allelic expression and the proportions of total RNA extract mixed. The graph displays the p-values of the linear regressions between the allelic ratios and the proportions of mixed RNA. Mixes homozygous-homozygous are shown in red, mixes heterozygous-homozygous are in blue. (1.94 MB TIF) Click here for additional data file. Figure S2 Estimation of experimental variability in the Illumina ASE assay. Average difference between duplicates for 411 SNPs analyzed using the Illumina ASE Cancer Panel. The variability is shown for each SNP as the fraction of the difference between the median dye ratio for homozygotes for one allele and the median dye ratio for homozygotes for other allele (e.g., a variability of 0.1 could artificially generate an allelic ratio of 60∶40 in heterozygotes). (1.86 MB TIF) Click here for additional data file. Figure S3 Assessment of differential allelic expression using quantitative sequencing of RT-PCR products. First strand cDNA is synthesized from total RNA extract using random hexamers and amplified by locus-specific primers surrounding a particular coding SNP. The allelic ratio is estimated directly from the sequencing trace file with the software PeakPicker v2.0. (3.26 MB TIF) Click here for additional data file. Figure S4 Influence of the culture conditions. The figure shows the correlation between the estimates of allelic imbalance using quantitative sequencing for cells harvested after 4 (“Harvest 2”, x-axis) and 6 (“Harvest 3”, y-axis) passages. Each blue cross stands for one heterozygous individual for the gene IGF1 (A), IL1A (B) and CHI3L2 (C). (2.50 MB TIF) Click here for additional data file. Figure S5 Exonic vs. intronic SNP. The graph shows the average number of individuals expressing a detectable transcript using an exonic SNP or an intronic SNP. (1.95 MB TIF) Click here for additional data file. Figure S6 Population-average estimates of allelic imbalance at 777 SNPs (both panels combined). See legend of Figure 4. (1.93 MB TIF) Click here for additional data file. Figure S7 Clonality and X-linked genes. The allelic imbalance estimates for 11 X-linked SNPs (in 7 genes) are displayed on the y-axis for every female individual (x-axis) (if the individual is heterozygous at the position considered). (2.50 MB TIF) Click here for additional data file. Figure S8 Association mapping of allelic imbalance to regulatory haplotypes for MEST (A) and PEG10 (B). (4.54 MB TIF) Click here for additional data file. Figure S9 Method used for the detection of transcript expression. See Materials and Methods for details. (1.86 MB TIF) Click here for additional data file. Figure S10 Individual assessment of differential allelic expression on the Illumina ASE assay. See Materials and Methods for details. (1.93 MB TIF) Click here for additional data file. Figure S11 Variance-based assessment of differential allelic expression on the Illumina ASE assay. See Materials and Methods for details. (1.91 MB TIF) Click here for additional data file. Figure S12 Estimation of experimental variability in quantitative sequencing assay. We performed, for two genes (and five individuals), triplicates of each experimental step: from one cell harvest we extract RNA three times independently. Each extract was then subject to three independent RT-PCRs and each aliquot was amplified three times by locus-specific PCR. Finally, PCR products were sequenced three times and allelic imbalance estimated using PeakPicker v2.0. (1.54 MB TIF) Click here for additional data file. Table S1 List of the 2,968 SNPs analyzed using the Illumina ASE assay. Origin. Displays if the gene is located in a ENCODE region, on chromosome 21 or 22 and whether the genes was included for its potential involvement in disease etiology. Intron/exon. SNPs in 3′UTR are shown as “exon”. (0.09 MB PDF) Click here for additional data file. Table S2 All SNPs expressed in at least three heterozygous individuals (0.05 MB PDF) Click here for additional data file.
                Bookmark

                Author and article information

                Journal
                Bioinformatics
                bioinformatics
                bioinfo
                Bioinformatics
                Oxford University Press
                1367-4803
                1367-4811
                15 December 2009
                6 October 2009
                6 October 2009
                : 25
                : 24
                : 3207-3212
                Affiliations
                1Department of Human Genetics, 2Committee on Genetics, Genomics and Systems Biology and 3Howard Hughes Medical Institute, University of Chicago, 920 E. 58th St., CLSC 507, Chicago, IL 60637, USA
                Author notes
                *To whom correspondence should be addressed.

                Associate Editor: Limsoon Wong

                Article
                btp579
                10.1093/bioinformatics/btp579
                2788925
                19808877
                ffbf0890-366a-417d-a0ae-4b64c8044de1

                This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License ( http://creativecommons.org/licenses/by-nc/2.5/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

                History
                : 25 June 2009
                : 17 September 2009
                : 30 September 2009
                Categories
                Original Papers
                Genome Analysis

                Bioinformatics & Computational biology
                Bioinformatics & Computational biology

                Comments

                Comment on this article