
      Genome-Wide mRNA Expression Correlates of Viral Control in CD4+ T-Cells from HIV-1-Infected Individuals


          Abstract

          There is great interindividual variability in HIV-1 viral setpoint after seroconversion, some of which is known to be due to genetic differences among infected individuals. Here, our focus is on determining, genome-wide, the contribution of variable gene expression to viral control, and to relate it to genomic DNA polymorphism. RNA was extracted from purified CD4+ T-cells from 137 HIV-1 seroconverters, 16 elite controllers, and 3 healthy blood donors. Expression levels of more than 48,000 mRNA transcripts were assessed by the Human-6 v3 Expression BeadChips (Illumina). Genome-wide SNP data was generated from genomic DNA using the HumanHap550 Genotyping BeadChip (Illumina). We observed two distinct profiles with 260 genes differentially expressed depending on HIV-1 viral load. There was significant upregulation of expression of interferon stimulated genes with increasing viral load, including genes of the intrinsic antiretroviral defense. Upon successful antiretroviral treatment, the transcriptome profile of previously viremic individuals reverted to a pattern comparable to that of elite controllers and of uninfected individuals. Genome-wide evaluation of cis-acting SNPs identified genetic variants modulating expression of 190 genes. Those were compared to the genes whose expression was found associated with viral load: expression of one interferon stimulated gene, OAS1, was found to be regulated by a SNP (rs3177979, p = 4.9E-12); however, we could not detect an independent association of the SNP with viral setpoint. Thus, this study represents an attempt to integrate genome-wide SNP signals with genome-wide expression profiles in the search for biological correlates of HIV-1 control. It underscores the paradox of the association between increasing levels of viral load and greater expression of antiviral defense pathways. It also shows that elite controllers do not have a fully distinctive mRNA expression pattern in CD4+ T cells. 
Overall, changes in global RNA expression reflect responses to viral replication rather than a mechanism that might explain viral control.

          Author Summary

          There has been recent progress in understanding the genetic factors that modulate susceptibility to HIV-1 infection. Genetic variation explains, to a certain extent, differences in disease progression among individuals. Less is known regarding the contribution of differences in gene expression to viral control. The present study evaluated, genome-wide, gene expression levels in CD4+ T cells, the main target of HIV-1. Thereafter, it searched for genetic variants that would modify gene expression. Specific expression profiles were associated with high levels of viremia, in particular the upregulation of genes of the antiviral defense. In contrast, no expression profile was associated with effective viral control. Multiple genetic variants modulated gene expression in CD4+ T cells; however, none had a strong influence on viral control. This integrated genome-wide assessment suggests that viral replication drives gene expression rather than expression pointing to mechanisms of viral control.

          Most cited references (37)


          Direct multiplexed measurement of gene expression with color-coded probe pairs.

          We describe a technology, the NanoString nCounter gene expression system, which captures and counts individual mRNA transcripts. Advantages over existing platforms include direct measurement of mRNA expression levels without enzymatic reactions or bias, sensitivity coupled with high multiplex capability, and digital readout. Experiments performed on 509 human genes yielded a replicate correlation coefficient of 0.999, a detection limit between 0.1 fM and 0.5 fM, and a linear dynamic range of over 500-fold. Comparison of the NanoString nCounter gene expression system with microarrays and TaqMan PCR demonstrated that the nCounter system is more sensitive than microarrays and similar in sensitivity to real-time PCR. Finally, a comparison of transcript levels for 21 genes across seven samples measured by the nCounter system and SYBR Green real-time PCR demonstrated similar patterns of gene expression at all transcript levels.

            Tetherin inhibits retrovirus release and is antagonized by HIV-1 Vpu.

            Human cells possess an antiviral activity that inhibits the release of retrovirus particles, and other enveloped virus particles, and is antagonized by the HIV-1 accessory protein, Vpu. This antiviral activity can be constitutively expressed or induced by interferon-alpha, and it consists of protein-based tethers, which we term 'tetherins', that cause retention of fully formed virions on infected cell surfaces. Using deductive constraints and gene expression analyses, we identify CD317 (also called BST2 or HM1.24), a membrane protein of previously unknown function, as a tetherin. Specifically, CD317 expression correlated with, and induced, a requirement for Vpu during HIV-1 and murine leukaemia virus particle release. Furthermore, in cells where HIV-1 virion release requires Vpu expression, depletion of CD317 abolished this requirement. CD317 caused retention of virions on cell surfaces and, after endocytosis, in CD317-positive compartments. Vpu co-localized with CD317 and inhibited these effects. Inhibition of Vpu function and consequent mobilization of tetherin's antiviral activity is a potential therapeutic strategy in HIV/AIDS.

              High-Resolution Mapping of Expression-QTLs Yields Insight into Human Gene Regulation

Introduction

Genetic variation that affects gene regulation plays an important role in the genetics of disease and adaptive evolution [1],[2],[3]. However, unlike protein-coding sequences, we still know little about how to identify the DNA sequence elements that control gene expression. It is still difficult to predict with any confidence which SNPs are likely to affect gene expression, without performing targeted experimental assays. To address this gap, recent experimental and computational approaches have made progress on identifying elements that may be functional, for example through experimental methods that identify transcription factor binding sites [4],[5], by in vivo testing of possible enhancers [6] and by computational analysis of sequence data [7],[8],[9]. However, our understanding of the importance of different types of functional elements in gene regulation remains rudimentary.

As a complementary approach, genome-wide studies of gene expression are now starting to provide information on genetic variation that impacts gene expression levels [10]. Recent studies in a variety of organisms have shown that levels of gene expression are often highly heritable [11],[12],[13],[14], and that for many genes it is possible to map cis- and trans-acting factors using linkage [13],[15],[16],[17],[14] or association mapping [12],[18],[19],[20],[21]. Recent studies of experimental crosses in yeast and mice have used the locations of SNPs within eQTL genes to provide further information about the identity of functional elements [22],[23]. In studies of human lymphoblastoid cells, it has been reported that most strong signals of association lie within 100 kb of the transcribed region [12], and that eQTLs cluster roughly symmetrically around the TSS [20].

In this study, we applied a new Bayesian framework to identify and fine map human lymphoblast eQTLs on a genome-wide scale.
In effect, we treat the SNP data as a tool for assaying the functional impact of individual nucleotide changes on gene regulation. Our analysis focuses on the impact of common SNPs on gene expression levels. By using naturally occurring variation, we test the effects of several million variable sites in a single data set. Our results provide a detailed characterization of the types of SNPs that affect gene expression in lymphoblast cell lines.

Results

We analyzed gene expression measurements from lymphoblastoid cell lines representing 210 unrelated individuals studied by the International HapMap Project [24],[25]. These expression data, first reported by [19], were generated using the Illumina Sentrix Human-6 Expression BeadChip. For each sample we also used SNP genotype data from the Phase II HapMap Project, consisting of 3.3 million genotypes per individual [25]. After remapping the Illumina probes onto human mRNA sequences from RefSeq, we created a cleaned set of expression data for 12,227 distinct autosomal genes that had a unique RNA sequence in RefSeq (see Methods). For most analyses we removed 634 genes that had one or more HapMap SNPs within the expression probe and 147 very large genes (>500 kb), leaving us with a core data set of 11,446 genes.

We then set out to identify SNPs that affect measured mRNA levels in cis. As an operational definition, we considered the “cis-candidate region” to start 500 kb upstream of the transcription start site (TSS) and to end 500 kb downstream of the transcription end site (TES). Consistent with previous work [20],[12], our preliminary analysis found that most detectable eQTLs lie within this region.

Although the HapMap samples represent four different populations, originating from Africa, Europe and east Asia, our main analyses pooled the data into a single sample.
To avoid false positives due to population-level expression differences [26],[20],[27], for each gene we transformed the African, European and east Asian expression data separately to standard normal distributions prior to combining the samples (Methods). Our rationale for combining samples was that we should achieve better power and better localization of signals than if we analyzed the populations separately. In doing so, we assume that functional variants usually have similar effects in different populations, an assumption that is parsimonious and has empirical support ([20]; Figure S1). The overall results for analyses of individual populations are very similar (see Figures S2, S3, and S4).

The Distribution of cis-Acting eQTLs

For each of the 11,446 genes, we tested for putative cis-acting eQTLs by regressing measured mRNA levels against SNP genotypes, independently for each SNP in the cis-candidate region, using a standard linear regression model. Consistent with previous reports [20], we found a substantial number of genes with strong evidence for containing at least one eQTL. A total of 744 genes (6.5%) had at least one SNP with a p-value 50 kb from the corresponding gene. Third, as shown in Figure S9, there is a significant enrichment of eQTL SNPs in exons compared to introns. We will return to this observation later in the paper.

Figure 2 (doi:10.1371/journal.pgen.1000214.g002) Locations of the most significant eQTL SNPs for small, medium, and large genes. Each plot shows, for genes with an eQTL, the distribution of locations of the most significant SNP. The x-axis of each plot divides a typical cis-candidate region into a series of bins as described. The y-axis plots the number of SNPs in each bin that are the most significant SNP for the corresponding gene and that have a p-value one gene in cis.

Statistical Analysis

Notation

The data consist of SNP genotypes and gene expression measurements for n individuals at each of K genes.
Let $y_{ik}$ denote the normalized gene expression data for individual i (i in 1,…, n) at gene k (k in 1,…, K). $Y_k$ will denote the vector of gene expression values ($y_{\cdot k}$) across the n individuals at gene k. Next, let $M_k$ be the number of genotyped SNPs in the cis-candidate region of gene k. We denote the entire matrix of genotype data for these $M_k$ SNPs by $G_k$, and individual genotypes as $g_{ijk}$ for individual i at SNP j of gene k. Genotypes are coded as having 0, 1, or 2 copies of the minor allele.

P-Value Method

In the first part of the paper we used standard linear regression to test the gene expression data at each gene for association with SNPs in the cis-candidate region, as follows. The effect of individual i's genotype at SNP j ($g_{ijk}$) on his/her gene expression level ($y_{ik}$) is assumed to follow an additive linear model:

$$y_{ik} = \mu + a_{jk}\, g_{ijk} + \varepsilon_{ijk} \tag{1}$$

where μ is the mean expression level at that gene for individuals with g = 0, $a_{jk}$ is the additive effect of the minor allele at SNP j, and $\varepsilon_{ijk}$ is the residual. A standard p-value from a 1 df test can then be obtained for the hypothesis that SNP j is an eQTN for gene k ($a_{jk} \neq 0$).

We used the following procedure to generate the results plotted in Figure 2. For each gene with expression data we assigned each SNP in the cis-candidate region to a single bin (see below). Let m be the total number of SNPs that fall into bin b, summing across all genes. (Note that most SNPs are in the cis-candidate regions of multiple genes and hence can contribute data to multiple bins.) Next, for each gene, we tested every SNP for association with gene expression. If the p-value of the most significant SNP was 1 SNPs, we considered that the signal was divided equally among the n most significant SNPs (i.e., a fraction 1/n of the signal was assigned to each SNP). Suppose that, by this way of counting, there are s signals in bin b.
Prior to reporting the data, we also applied a correction for the possibility of spurious signals due to ungenotyped SNPs in the expression array probe. We used the 634 genes with a known HapMap SNP inside the probe to create a profile of the abundance of spurious signals as a function of distance from the probe. This profile was used to adjust the observed number of signals, s, to a corrected number s′ that removes the predicted number of spurious signals in each bin (see Figure S19 and Text S1 for details). In practice, we estimate that the contribution of spurious signals does not substantially change the overall uncorrected distribution of signals. Finally, we computed the fraction of most significant SNPs in bin b as s′/m.

Bin Definitions

To display the distribution of signals in Figure 2 and the left panel of Figure 3 we subdivided the cis-candidate region into discrete bins as follows. First, since there is dramatic variation in gene sizes, we analyzed genes in three separate categories based on transcript length: small genes (0–20 kb), medium genes (20–100 kb) and large genes (100–500 kb). Then, within each gene size category we divided the entire cis-candidate region into a series of bins, anchored at the TSS and TES. SNPs outside the transcript were assigned to bins based on their distance from the TSS (for the upstream region) or TES (downstream). Bins outside the transcript were 1 kb wide for small and medium genes and 15 kb wide for large genes. Transcribed regions were split into fixed numbers of bins: each small gene was split into ten bins of equal size, medium genes into 25 bins and large genes into 15 bins. Hence, bins inside the transcript indicate the fractional location of SNPs relative to the TSS and TES, and the physical sizes of the bins vary across genes. The bin sizes were chosen so that the average physical sizes of internal and external bins are roughly the same within each gene size category.
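
The p-value method described above (a per-SNP additive regression, followed by splitting one unit of signal equally among tied most-significant SNPs) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `cis_snp_pvalues` and `split_signal` are hypothetical, the significance cutoff is passed in as a parameter rather than taken from the paper, and the spurious-signal correction (s to s′) is deliberately left out.

```python
import numpy as np
from scipy import stats


def cis_snp_pvalues(expr, genotypes):
    """Per-SNP p-values from a simple additive linear regression.

    expr      : (n,) normalized expression values for one gene
    genotypes : (n, M) minor-allele counts (0, 1, or 2) for the M cis SNPs
    Returns an (M,) array; monomorphic SNPs get NaN because no slope
    can be estimated for them.
    """
    n_snps = genotypes.shape[1]
    pvals = np.full(n_snps, np.nan)
    for j in range(n_snps):
        g = genotypes[:, j]
        if g.max() == g.min():
            continue  # no genotype variation at this SNP
        # Two-sided test of slope == 0 in y = mu + a * g + residual
        pvals[j] = stats.linregress(g, expr).pvalue
    return pvals


def split_signal(pvals, cutoff):
    """Assign one unit of signal to the most significant SNP(s) of a gene.

    If the smallest p-value is below `cutoff` (an assumed, user-supplied
    threshold), each of the n tied best SNPs receives weight 1/n;
    otherwise every SNP receives weight 0.
    """
    weights = np.zeros(len(pvals))
    if np.all(np.isnan(pvals)):
        return weights
    best = np.nanmin(pvals)
    if best < cutoff:
        tied = np.isclose(pvals, best, rtol=1e-12, atol=0.0)  # NaN never ties
        weights[tied] = 1.0 / tied.sum()
    return weights
```

Summing these weights per bin over all genes would give the uncorrected count s for each bin; dividing by the number of SNPs m in the bin gives the plotted fraction before any spurious-signal adjustment.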
Hierarchical Model

We present here an overview of the hierarchical model. Complete details on the models are provided in the Supplementary Methods section (Text S1).

Bayesian Regression Model

The hierarchical model applies the Bayesian regression framework of Servin and Stephens [29]. The effect of individual i's genotype at SNP j ($g_{ijk}$) on his/her gene expression level ($y_{ik}$) is assumed to follow a linear model:

$$y_{ik} = \mu + a_{jk}\, g_{ijk} + d_{jk}\, I(g_{ijk} = 1) + \varepsilon_{ijk} \tag{2}$$

where μ is the mean expression level at that gene for individuals with g = 0, and where $a_{jk}$ and $d_{jk}$ are the additive and dominance effects of the minor allele at SNP j. The residual, $\varepsilon_{ijk}$, is assumed to be N(0, 1/τ) and independent for each $y_{ik}$, where 1/τ is the variance of expression levels within each genotype class. The indicator function $I(g_{ijk} = 1)$ is defined as 1 if the genotype is heterozygous ($g_{ijk} = 1$) and 0 otherwise.

Let $P_0(Y_k)$ denote the probability of the expression data $Y_k$ under the null hypothesis that there are no cis-eQTNs in gene k (i.e., $a_{jk} = d_{jk} = 0$ for all j). Similarly, let $P_1(Y_k \mid j)$ denote the probability of the expression data $Y_k$ assuming that SNP j is the eQTN. In this case, the effect sizes $a_{jk}$ and $d_{jk}$ are modeled as being drawn from mixtures of normal distributions centered on 0 (see Text S1 for details). The Bayes factor (BF) for SNP j in gene k is defined as

$$\mathrm{BF}_{jk} = \frac{P_1(Y_k \mid j)}{P_0(Y_k)} \tag{3}$$

and measures the relative support for the hypothesis that SNP j is an eQTN for gene k, versus the null hypothesis. We use priors on effect sizes that allow the BF to be calculated analytically (see Text S1).

The Hierarchical Model

We describe first the basic version of our hierarchical model. All the results presented in this paper additionally include a correction for the possibility that genes might show signals due to undetected SNPs in the probe. We describe that extension later in the Methods, briefly, and in detail in the Supplementary Methods (Text S1). Our basic model assumes that there are two mutually exclusive categories of genes.
With probability $\Pi_0$ there is no eQTN in the cis-candidate region, and with probability $\Pi_1 = 1 - \Pi_0$ there is a single eQTN. Then the likelihood of the expression data at gene k is

$$P(Y_k) = \Pi_0\, P_0(Y_k) + \Pi_1\, P_1(Y_k) \tag{4}$$

where $P_0(Y_k)$ denotes the probability of the expression data $Y_k$ given that there is no eQTN in gene k and $P_1(Y_k)$ denotes the probability of the expression data given that there is exactly one eQTN. Note that our model allows for at most one eQTN per gene. If in fact there is more than one eQTN, our model will usually assign the signal to the strongest of these. In practice, we see little variation in average effect size as a function of location, so this modeling simplification is unlikely to seriously distort the results.

Given that there is a single eQTN in gene k, the probability of the observed expression data, $P_1(Y_k)$, can be written as

$$P_1(Y_k) = \sum_{j=1}^{M_k} \pi_{jk}\, P_1(Y_k \mid j) \tag{5}$$

where $P_1(Y_k \mid j)$ is the probability of the expression data given that SNP j is an eQTN, and $\pi_{jk}$ is the prior probability that SNP j is an eQTN, given that exactly one SNP in gene k is an eQTN.

A key feature of the hierarchical model is that the probability that SNP j is an eQTN, $\pi_{jk}$, is allowed to depend on the physical location of SNP j relative to one or more “anchor” points, and other relevant annotations (see Text S1). Suppose that we consider L different kinds of annotation, and let the indicator $\delta_{jkl}$ equal 1 if SNP j at gene k has the lth annotation, and equal 0 otherwise. Then define

$$x_{jk} = \sum_{l=1}^{L} \lambda_l\, \delta_{jkl} \tag{6}$$

where $\Lambda = (\lambda_1, \ldots, \lambda_L)$ is a vector of annotation effect parameters. We use a logistic model to relate $\pi_{jk}$ to these annotation indicators, namely,

$$\pi_{jk} = \frac{\exp(x_{jk})}{\sum_{j'=1}^{M_k} \exp(x_{j'k})} \tag{7}$$

As detailed in the Supplementary Methods (Text S1), we parameterized the effect of distance from the anchor locations using a series of discrete bins that represent absolute physical distance from the relevant anchor. The bins nearest to the anchor are 1 kb wide, and increase in width to 10 kb and finally 100 kb with increasing distance from the anchor.
For the two-anchor models, each SNP belongs to two position bins, each of which indicates distance from one anchor.

Likelihood for the Hierarchical Model

Substituting the above expressions for $P_1(Y_k)$ into (4), the likelihood for the hierarchical model is

$$P(Y_k \mid \Theta) = \Pi_0\, P_0(Y_k) + \Pi_1 \sum_{j=1}^{M_k} \pi_{jk}\, P_1(Y_k \mid j) \tag{8}$$

$$\phantom{P(Y_k \mid \Theta)} = P_0(Y_k) \left[ \Pi_0 + \Pi_1 \sum_{j=1}^{M_k} \pi_{jk}\, \mathrm{BF}_{jk} \right] \tag{9}$$

where Θ denotes the model parameters and $\mathrm{BF}_{jk}$ is the BF from the Bayesian regression (3). To be explicit, the model parameters Θ include the annotation parameters Λ, the proportion $\Pi_0$ and other parameters related to the Bayes factor computation (see Text S1). The likelihood of the entire data set is the product of (9) across all K genes. We fit the hierarchical model by maximizing the log-likelihood

$$\log L(\Theta) = \sum_{k=1}^{K} \log P_0(Y_k) + \sum_{k=1}^{K} \log \left[ \Pi_0 + \Pi_1 \sum_{j=1}^{M_k} \pi_{jk}\, \mathrm{BF}_{jk} \right] \tag{10}$$

with respect to the model parameters Θ. (Note that the first term, involving $P_0(Y_k)$, the probability of the data under the null, does not depend on Θ, and so need not be evaluated.)

Accounting for the Effects of SNPs in Probes

Since undetected SNPs in the probe sequence sometimes generate eQTLs, the results that we report include a modification to account for this effect. We used the 634 genes that have a known SNP in the probe region as training data to help parameterize the model. We assume that these represent ∼1/3 of all probes with common SNPs [25]. Suppose that with probability $\Pi_p$ there is a SNP inside the probe sequence (this is set to 1 for the training data), and suppose that when there is a SNP in the probe, there is a probability $\Pi_s$ that this generates a spurious signal. Then let $\Pi_p \Pi_s$ be the probability of a spurious signal. We consider that we are only interested in real signals if there is no spurious signal, so we write the probability of the data as

$$P(Y_k \mid \Theta) = (1 - \Pi_p \Pi_s) \left[ \Pi_0\, P_0(Y_k) + \Pi_1\, P_1(Y_k) \right] + \Pi_p \Pi_s\, P_s(Y_k) \tag{11}$$

where the first term is the likelihood when there is no spurious signal (as in Equation 4), and where the second term gives the likelihood ($P_s(Y_k)$) when there is a spurious signal.

Likelihood Maximization

To maximize (10) we used an iterative strategy based on a point-by-point golden maximization strategy [46].
To speed convergence of the maximization process, we initialized the parameters using naive estimates of the λs based on the logarithm of the odds ratio computed assuming $\Pi_0 = 0$.

Posterior Probabilities

Once the likelihood has been maximized, we can compute the posterior probability of a given SNP j to be an eQTN for gene k. In the case without spurious signals this is

$$P(\text{SNP } j \text{ is the eQTN} \mid Y_k, \hat{\Theta}) = \frac{\Pi_1\, \pi_{jk}\, \mathrm{BF}_{jk}}{\Pi_0 + \Pi_1 \sum_{j'=1}^{M_k} \pi_{j'k}\, \mathrm{BF}_{j'k}} \tag{12}$$

and the general version is given in the Supplementary Methods (Text S1).

Sequence Conservation and Transcription Factor Binding

To compute the average sequence conservation as a function of position for Figure 4B, we estimated the average number of substitutions per site across the phylogeny of seven mammalian species (human, chimpanzee, macaque, mouse, rat, dog, and cow), using data and alignments from the UCSC browser. This was done for the main set of 11,446 genes analyzed in this paper. For each gene, 5 kb on each side of the TSS (and separately for the TES) was split into non-overlapping 50-bp bins. We then concatenated all the sites across all genes that lay in the same bin. After excluding sites in coding exons we estimated the average number of substitutions at each site using baseml, a program in the PAML package [47].

We obtained results on transcription factor binding density using ChIP-chip data collected by the ENCODE project [4]. We used data for eight transcription factors that showed large numbers of binding fragments at a 1% false discovery rate in the ENCODE study. The left-hand panel of Figure 4C is essentially a replotting of data presented in Figure 5 of [4], while the right-hand panel shows analogous data plotted with respect to the TES.

Software Availability

The methods reported here are implemented in the package eQTNMiner, which is available from JBV on request.

Supporting Information

Figure S1 About 60% of the eQTNs are shared between at least two populations. Venn diagram of the set of eQTNs detected separately in each population.
To generate the diagram, we admitted a SNP to the analysis (as an eQTL) if either the p-value in the combined sample (pooling the 3 populations) is lower than 7×10⁻⁶ or the p-value in a single population is lower than the p-value cutoff corresponding to a gene FDR of 5% within each population. We then considered two populations to share an eQTL if any single population has a p-value <1×10⁻². Finally, for each gene having at least one such eQTL, we defined the eQTN as the SNP with the largest number of shared populations (sharing weight between multiple SNPs if there is a tie).

Figure S2 Expression QTNs in the combined Japanese plus Chinese analysis panel (ASN) show similar patterns to those in the full data. The left panel (p-value method) was prepared in the same way as Figure 2 of the main paper and the right panel (hierarchical model with TSS+TES) was prepared in the same way as Figure 3 (left panel) of the main paper. Both display results analyzing only the Asian data. For the left panels we used a p-value cutoff of 1.25×10⁻⁵ obtained by permutations when analyzing only the Asian data and corresponding to a gene FDR of 5%.

Figure S3 Expression QTNs in the European-derived sample (CEU) show similar patterns to those in the full data. The left panel (p-value method) was prepared in the same way as Figure 2 of the main paper and the right panel (hierarchical model with TSS+TES) was prepared in the same way as Figure 3 (left panel) of the main paper. Both display results analyzing only the European data. For the left panels we used a p-value cutoff of 3.5×10⁻⁶ obtained by permutations when analyzing only the European data and corresponding to a gene FDR of 5%.

Figure S4 Expression QTNs in the Nigerian sample (YRI) show similar patterns to those in the full data.
The left panel (p-value method) was prepared in the same way as Figure 2 of the main paper and the right panel (hierarchical model with TSS+TES) was prepared in the same way as Figure 3 (left panel) of the main paper. Both display results analyzing only the Nigerian data. For the left panels we used a p-value cutoff of 3.825×10⁻⁶ obtained by permutations when analyzing only the Nigerian data and corresponding to a gene FDR of 5%.

Figure S5 Illustration of the ability of the HM to accurately estimate the distribution of eQTNs when all the actual eQTNs are genotyped. This figure is based on a simulated dataset assuming that for all genes the actual eQTN is genotyped (see Text S1). In both panels the black histograms represent the number of actual eQTNs using 1 kb bins anchored from the TSS (this is identical for both panels). A. P-value method: the green curve displays the number of most significant SNPs detected by the p-value method. As expected, due to LD and the stringency of the p-value cut-off, the profile is less peaked than the actual distribution. B. Hierarchical model: using our hierarchical model with the TSS-only model (see Methods) we are able to catch most of the actual eQTNs. The red curve indicates the expected number of eQTNs computed using the posterior probabilities from the hierarchical model. Notice that the hierarchical model provides a better picture of the distribution of signals.

Figure S6 50% of the most significant SNPs lie within 7.5 kb of the actual eQTNs. Both panels are based on the results from the p-value method applied to a simulated dataset (see Text S1). The top panel plots the histogram of the fraction of most significant SNPs as a function of distance from the actual eQTNs. The bottom panel plots the corresponding cumulative probability.

Figure S7 No obvious impact of the eQTN location on the mapping precision. Cumulative plot of the distance between the most significant SNPs and the actual eQTNs according to the eQTN location (upstream of the TSS, downstream of the TSS, within an exon, and within an intron). This plot was generated by averaging results from the p-value method applied to 10 simulated datasets (see Text S1). In the legend, the percentages in brackets give the fraction of actual eQTNs in the corresponding category.

Figure S8 Impact of the local recombination rate on the eQTN mapping precision. Boxplot of the physical distance between the tag SNP and the actual eQTN as a function of the average recombination rate (cM/Mb) around the actual eQTN in a simulated dataset assuming that all eQTNs are not genotyped (see Text S1). We divided the data into four categories of equal sizes (from low to high level of recombination rate; the range of the recombination rate in each class is indicated along the x-axis below each box). As expected, the higher the recombination rate, the lower the expected distance between the tag SNP and the actual eQTN.

Figure S9 There is a deficit of most-significant SNPs in internal introns, and an enrichment of such SNPs in last exons (p-value method). This figure is based on the subset of 295 genes for which there is a unique most significant SNP (and for which the smallest p-value is <7×10⁻⁶) that falls into the gene transcript region. For the five panels, the blue arrows represent the observed number of most significant SNPs in the five gene functional elements for which at least 5 most significant SNPs have been found. Here these counts have been corrected for putative spurious signal due to an unobserved SNP inside the probe (leading to the removal of ∼46 genes).
Under the null hypothesis that these most significant SNPs are randomly distributed into the eight possible gene functional elements, we carried out a simple Monte-Carlo procedure where for each of the 295 genes we picked at random a SNP inside the gene transcript region to be the most significant SNP (and weighted it by the probability that the gene has genuine signal according to the location of the observed most significant SNP with respect to the probe; see Text S1). The histograms depict the distribution of the numbers of most significant SNPs across 1000 simulated configurations.

Figure S10 When distance is measured from the TSS (or TES) only, the TES (or TSS) peak is hidden due to the great variability in gene lengths. The plots show the fraction of SNPs with eQTN signals as a function of position in the cis-candidate region. The candidate region is divided into a series of 1 kb bins across the x-axis that indicate position relative to the TSS (or TES). For each bin we plot the proportion of SNPs that have the smallest p-value for the corresponding gene, and for which p<7×10⁻⁶ (gene FDR of 5%).

Figure S11 Illustration of the ability of the HM to accurately estimate the distribution of eQTNs even when only 30% of the actual eQTNs are genotyped. These plots are based on a simulated dataset assuming that across all genes only 30% of the true eQTNs are genotyped (see Text S1). In both panels the black histograms represent the number of actual eQTNs using 1 kb bins anchored from the TSS (this is identical for both panels). A. P-value method: the green curve displays the number of most significant SNPs detected by the p-value method. As expected, due to the incomplete SNP coverage, LD and the stringency of the p-value cut-off, the profile is less peaked than the actual distribution. B.
Hierarchical model (TSS-only version): the red curve indicates the expected number of eQTNs computed using the posterior probabilities from the hierarchical model. The hierarchical model provides us with a much more accurate representation of the actual eQTN distribution.

Figure S12 Simulated dataset with eQTNs symmetrically distributed around the TSS. The three left panels plot the true (simulated) probability to be the actual eQTN according to the gene size category. The three right panels plot the probability to be the most significant SNP (i.e., the SNP with the smallest p-value inside the cis-candidate region) in genes having at least one SNP with a p-value lower than 7×10⁻⁶ (as for Figure 2 in the main text). Although only 30% of the actual eQTNs are observed, the distribution of the most significant SNPs (right panels) lines up pretty well with the distribution of the actual eQTNs (left panels). Furthermore, the distribution of signals for this TSS-only model is quite different from that seen in the real data, consistent with our results that the TSS-only model does not provide a good description of the data. See Text S1 for a description of our simulation process.

Figure S13 Numbers of SNPs inside each of the 9 mutually exclusive gene-related annotations as a function of position within the gene. SNPs inside coding exons are classified into synonymous and non-synonymous SNPs. Notice that ∼84% of genic SNPs occur inside internal introns.

Figure S14 Fine-scale structure of eQTN peaks near the TSS and TES, and comparison to four types of functional annotation. The left- and right-hand columns show data for 5 kb on either side of the TSS and TES, respectively (averaging across all gene sizes). Locations inside genes are colored green and outside genes are black. A.
Posterior expected fractions of SNPs in each bin that are eQTNs, as estimated by the hierarchical model (see Methods). Each bin is 25 bp wide. B. Probability that a SNP falls into a (putative) functional site: CpG island (CpG), conserved non-coding element (CNC), predicted cis-regulatory module (pCRM), or microRNA binding site (miRNA). (0.27 MB PNG)
Figure S15. Genes with CpG islands spanning the TSS are expressed at higher average levels and are more likely to contain eQTLs than genes without a CpG island at the TSS. Results for genes with a CpG island spanning the TSS (ON) are displayed in red, while results for genes without a CpG island spanning the TSS (OFF) are displayed in black. These results were obtained by computing the posterior probabilities from the hierarchical model separately for the two gene categories. A. Estimated probability for each gene category to have an eQTN anywhere in the cis-candidate region. B. Box plots of the means and the standard deviations of the log hybridization intensities for the two gene categories. Genes ON CpG have higher mean expression and higher standard deviations than genes OFF CpG. C. After adjusting for the different overall rates of eQTNs, the distribution of signal locations in the two classes of genes is very similar. The plots show the fraction of SNPs with eQTN signals as a function of position in the cis-candidate region, based on the hierarchical model. To make the two classes of genes more comparable, the plots are conditional on the gene having an eQTN. The top panel shows results for the 7,069 genes with a CpG island spanning the TSS (ON CpG) and the bottom panel shows results for the 4,377 genes without a CpG island spanning the TSS (OFF CpG). (0.27 MB PNG)
Figure S16. Schematic explanation of our gene structure annotation. The plot shows three pairs of hypothetical genes consisting of, respectively, 1, 2 and 6 exons.
In each pair, the upper version of the gene shows the exon/intron structure (from RefSeq) and the translation start and stop sites (vertical red lines). The lower version of the gene shows how we annotate the gene structure (see color code at right of figure). A verbal explanation is also provided in the main text. (0.17 MB PNG)
Figure S17. Locations of the most significant eQTL SNPs for small, medium, and large genes using a p-value cutoff of A) 1×10⁻² and B) 1×10⁻⁴. For A and B, the three panels were prepared in the same way as Figure 2 of the main paper. (0.44 MB PNG)
Figure S18. Locations of the most significant eQTL SNPs for small, medium, and large genes using a p-value cutoff of A) 1×10⁻⁶ and B) 1×10⁻⁸. For A and B, the three panels were prepared in the same way as Figure 2 of the main paper. (0.41 MB PNG)
Figure S19. Distribution of most significant eQTL SNPs around probes. The black bars indicate the numbers of spurious eQTL signals as a function of distance from the probes, among the 634 genes with a known SNP in the probe. The sum of the red and green bars gives the numbers of most significant eQTL SNPs among the remaining 11,446 genes; the red component is our estimate of the fraction that is spurious. (See section ‘Spurious Signal’ in Text S1 for further description.) (0.18 MB PNG)
Table S1. Descriptive statistics for each of the 9 mutually exclusive gene structure annotations for the 11,446 genes of our data set. The “Exp nber” and “Fraction” columns of the table are based on the posterior probabilities of being a genuine eQTN from the hierarchical model: left side for the TSS-only+annotation model and right side for the TSS+TES+annotation model. (0.03 MB PDF)
Table S2. Descriptive statistics for each of the 8 mutually exclusive gene structure annotations for the 11,446 genes of our data set. (0.03 MB PDF)
Table S3. Descriptive statistics for each of the 5 functional annotations for the 11,446 genes of our data set. (0.04 MB PDF)
Text S1. Supplementary methods. (0.15 MB PDF)
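The Monte-Carlo null described in the supporting figure legends (a random SNP drawn per gene, weighted by the probability that the gene carries a genuine signal, repeated over 1000 configurations) can be illustrated with a minimal sketch. This is not the authors' implementation; the actual procedure is described in Text S1. The data layout (a list of per-gene annotation labels plus one weight per gene), the function name `simulate_null_counts`, and the toy values are all assumptions made for illustration.

```python
import random
from collections import Counter


def simulate_null_counts(genes, n_sim=1000, seed=0):
    """Simulate the null distribution of 'most significant SNP' annotations.

    genes: list of (annotations, weight) tuples (hypothetical layout), where
           `annotations` holds one functional-element label per SNP in the
           gene transcript region and `weight` stands in for the probability
           that the gene has a genuine signal.
    Returns one Counter per simulated configuration, mapping each annotation
    label to its weighted count.
    """
    rng = random.Random(seed)
    configurations = []
    for _ in range(n_sim):
        counts = Counter()
        for annotations, weight in genes:
            # Under the null, every SNP in the gene is equally likely
            # to be the most significant one.
            picked = rng.choice(annotations)
            counts[picked] += weight
        configurations.append(counts)
    return configurations


# Toy input: three hypothetical genes with made-up labels and weights.
toy_genes = [
    (["intron", "intron", "3'UTR"], 0.9),
    (["5'UTR", "intron"], 0.5),
    (["coding", "intron", "intron", "intron"], 0.8),
]
sims = simulate_null_counts(toy_genes, n_sim=1000, seed=1)

# Weighted counts for one annotation across all simulated configurations;
# a histogram of these values is the null to compare the observed count to.
intron_counts = [s["intron"] for s in sims]
```

Comparing the observed weighted count for an annotation with the spread of `intron_counts` (or the equivalent list for any other label) indicates whether that annotation is enriched relative to random placement.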

                Author and article information

                Contributors
                Role: Editor
                Journal
                PLoS Pathog
                PLoS Pathogens
                Public Library of Science (San Francisco, USA )
                1553-7366
                1553-7374
                February 2010
                26 February 2010
                Volume: 6
                Issue: 2
                Article number: e1000781
                Affiliations
                [1 ]Institute of Microbiology, University Hospital and University of Lausanne, Lausanne, Switzerland
                [2 ]Institute for Genome Sciences & Policy, Duke University, Durham, North Carolina, United States of America
                [3 ]Department of Biostatistics and Bioinformatics, Duke University, Durham, North Carolina, United States of America
                [4 ]Genomics Platform, University of Geneva, Geneva, Switzerland
                [5 ]Division of Infectious Diseases, University Hospital Zurich, University of Zurich, Zurich, Switzerland
                Fred Hutchinson Cancer Research Center, United States of America
                Author notes

                Conceived and designed the experiments: MR DBG AT. Performed the experiments: MR PD. Analyzed the data: MR KKD JF ELH DG. Contributed reagents/materials/analysis tools: SF DBG AT. Wrote the paper: MR KKD JF AT. Performed the genome-wide genotyping: JF PD KVS. Reviewed the manuscript for important intellectual content and approved of the final version: ELH SF PD KVS DG HFG. Organized the clinical cohort: MR HFG AT.

                ¶ Membership of the Swiss HIV Cohort Study is provided in the Acknowledgments.

                Article
                09-PLPA-RA-0878R3
                10.1371/journal.ppat.1000781
                2829051
                20195503
                b3c398d7-0fc7-4ca0-a433-709ac4b17e8a
                Rotger et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
                History
                Received: 2 June 2009
                Accepted: 20 January 2010
                Page count
                Pages: 11
                Categories
                Research Article
                Genetics and Genomics/Gene Expression
                Genetics and Genomics/Genetics of Disease
                Genetics and Genomics/Population Genetics
                Infectious Diseases/HIV Infection and AIDS

                Infectious disease & Microbiology
