+1 Recommend
0 collections
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      DNA methylation arrays as surrogate measures of cell mixture distribution

      Read this article at

          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.



          There has been a long-standing need in biomedical research for a method that quantifies the normally mixed composition of leukocytes beyond what is possible by simple histological or flow cytometric assessments. The latter is restricted by the labile nature of protein epitopes, requirements for cell processing, and timely cell analysis. In a diverse array of diseases and following numerous immune-toxic exposures, leukocyte composition will critically inform the underlying immuno-biology to most chronic medical conditions. Emerging research demonstrates that DNA methylation is responsible for cellular differentiation, and when measured in whole peripheral blood, serves to distinguish cancer cases from controls.


          Here we present a method, similar to regression calibration, for inferring changes in the distribution of white blood cells between different subpopulations (e.g. cases and controls) using DNA methylation signatures, in combination with a previously obtained external validation set consisting of signatures from purified leukocyte samples. We validate the fundamental idea in a cell mixture reconstruction experiment, then demonstrate our method on DNA methylation data sets from several studies, including data from a Head and Neck Squamous Cell Carcinoma (HNSCC) study and an ovarian cancer study. Our method produces results consistent with prior biological findings, thereby validating the approach.


          Our method, in combination with an appropriate external validation set, promises new opportunities for large-scale immunological studies of both disease states and noxious exposures.

          Related collections

          Most cited references 45

          • Record: found
          • Abstract: found
          • Article: found

          Hallmarks of Cancer: The Next Generation

          The hallmarks of cancer comprise six biological capabilities acquired during the multistep development of human tumors. The hallmarks constitute an organizing principle for rationalizing the complexities of neoplastic disease. They include sustaining proliferative signaling, evading growth suppressors, resisting cell death, enabling replicative immortality, inducing angiogenesis, and activating invasion and metastasis. Underlying these hallmarks are genome instability, which generates the genetic diversity that expedites their acquisition, and inflammation, which fosters multiple hallmark functions. Conceptual progress in the last decade has added two emerging hallmarks of potential generality to this list-reprogramming of energy metabolism and evading immune destruction. In addition to cancer cells, tumors exhibit another dimension of complexity: they contain a repertoire of recruited, ostensibly normal cells that contribute to the acquisition of hallmark traits by creating the "tumor microenvironment." Recognition of the widespread applicability of these concepts will increasingly affect the development of new means to treat human cancer. Copyright © 2011 Elsevier Inc. All rights reserved.
            • Record: found
            • Abstract: found
            • Article: not found

            Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles.

            Although genomewide RNA expression analysis has become a routine tool in biomedical research, extracting biological insight from such information remains a major challenge. Here, we describe a powerful analytical method called Gene Set Enrichment Analysis (GSEA) for interpreting gene expression data. The method derives its power by focusing on gene sets, that is, groups of genes that share common biological function, chromosomal location, or regulation. We demonstrate how GSEA yields insights into several cancer-related data sets, including leukemia and lung cancer. Notably, where single-gene analysis finds little similarity between two independent studies of patient survival in lung cancer, GSEA reveals many biological pathways in common. The GSEA method is embodied in a freely available software package, together with an initial database of 1,325 biologically defined gene sets.
              • Record: found
              • Abstract: found
              • Article: not found

              Capturing Heterogeneity in Gene Expression Studies by Surrogate Variable Analysis

              Introduction Large-scale gene expression studies allow one to characterize transcriptional variation with respect to measured variables of interest, such as differing environments, treatments, time points, phenotypes, or clinical outcomes. However, a number of unmeasured or unmodeled factors may also influence the expression of any particular gene. Besides inducing widespread dependence in measurements across genes [1,2], these influential factors create additional sources of differential expression, which, unlike gene-specific fluctuations, represent common sources of variation in gene expression that can be observed among multiple genes. We call “primary measured variables” (or primary variables) those variables that are explicitly modeled in the analysis of an expression study. These variables may or may not be associated with any given gene's expression variation. We classify all the remaining sources of expression variation into three basic types. “Unmodeled factors” are sources of variation explained by measured variables, but are not explicitly included in the statistical model (e.g., because their relationship to expression is intractable or the relevant measured variables were excluded because of sample size restrictions). “Unmeasured factors” are sources of expression variation that are not measured in the course of the study, so we also call these unmodeled factors. Finally, “gene-specific noise” refers to random fluctuations in gene expression independently realized from gene to gene. As a simple example meant only for illustrative purposes, consider a human expression study where disease state on a particular tissue type is the primary variable. Suppose that in addition to changes in expression being associated with disease state, the age of the individuals also has a substantial influence on expression. Thus, some genes exhibit differential expression with respect to disease state, some with respect to age, and some with respect to both. If age is not included in the model when identifying differential expression with respect to disease state, we show that this may (a) induce extra variability in the expression levels due to the effect of age, decreasing our power to detect associations with disease state, (b) introduce spurious signal due to the fact that the effect of age on expression may be confounded with disease state, or (c) induce long-range dependence in the apparent “noise” of the expression data, complicating any assessment of statistical significance for differential expression. In practice, even if age were known, it may be one of dozens of available measured factors, making it statistically intractable to determine which to include in the model. Furthermore, even measured factors such as age may act on distinct sets of genes in different ways, or may interact with an unobserved factor, making the effect of age on expression difficult to model. “Expression heterogeneity” (EH) is used here to describe patterns of variation due to any unmodeled factor. Major sources of expression variation are due to technical [3,4], environmental [5,6], demographic [7,8], or genetic [9–11] factors. It is well known that sources of variation due to experimental design or large-scale systematic sources of signal may be present in expression data [3,4,12,13], sometimes even after normalization has been applied [14]. Genetic factors can also have a large-scale impact on gene expression levels. Specific genetic loci have been shown to influence the expression of hundreds or thousands of genes in several organisms [10,11,15]. Expression heterogeneity is particularly pronounced in human expression data, especially in the study of complex systems, such as cancer or responses to stress [16–18]. Recently, Lamb et al. proposed the “Connectivity Map” for identifying functional connections between cancer subtypes, genetic background, and drug action [19]. Lamb et al. noted EH (e.g., due to cell type and batch effects) presented a major hurdle for extracting relevant biological signal from the Connectivity Map. In each of these studies, expression variation with respect to one or at most a handful of variables is explored. However, it is likely that in each study multiple sources of EH will act on distinct, but possibly overlapping, sets of genes. Normalization techniques are commonly used to detect and adjust for systematic expression variation due to well-characterized laboratory and technical sources [12,13,20]. However, to date there has been no approach for identifying and accounting for all sources of systematic expression variation, including variation due to unmeasured or unmodeled factors of both biological and technical sources. We show here that biological sources of variation not modeled in the analysis can be just as problematic as technical sources of variation. Here, we introduce “surrogate variable analysis” (SVA) to identify, estimate, and utilize the components of EH. Figure 1 shows the effects of failing to account for unmodeled factors in a differential expression analysis, and the potential benefits of the SVA approach. EH causes drastic increases in the variability of the ranking of genes for differential expression (Figure 1A), distorts the null distribution potentially causing highly conservative or anticonservative significance estimates (Figure 1B), and reduces the power to distinguish true associations between a measured variable of interest and gene expression (Figure 1C). However, employing SVA in these studies produces operating characteristics nearly equivalent to what one would obtain with no EH at all. Figure 1 Impact of Expression Heterogeneity One thousand gene expression datasets containing EH were simulated, tested, and ranked for differential expression as detailed in Simulated Examples. (A) A boxplot of the standard deviation of the ranks of each gene for differential expression over repeated simulated studies. Results are shown for analyses that ignore expression heterogeneity (Unadjusted), take expression heterogeneity into account by SVA (Adjusted), and for simulated data unaffected by expression heterogeneity (Ideal). (B) For each simulated dataset, a Kolmogorov-Smirnov test was employed to assess whether the p-values of null genes followed the correct null Uniform distribution (Text S1). A quantile–quantile plot of the 1,000 Kolmogorov-Smirnov p-values are shown for the SVA-adjusted analysis (solid line) and the unadjusted analysis (dashed line). It can be seen that the SVA-adjusted analysis provides correctly distributed null p-values, whereas the unadjusted analysis does not due to EH. (C) A plot of expected true positives versus FDR for the SVA-adjusted (solid) and -unadjusted (dashed) analyses. The SVA-adjusted analysis shows increased power to detect true differential expression. We apply SVA to three distinct expression studies [7,21,22], where each study contains clear patterns of EH (Figure S1). These studies represent major classes of gene expression studies performed in practice: genetic dissection of expression variation, differential expression analysis between disease classes, and differential expression over time. We show that SVA is able to accurately identify and estimate the impact of unmodeled factors in each type of study, using only the expression data itself. We further show that SVA improves accuracy and consistency in detecting differential expression. SVA orders the significant gene lists to more accurately and reproducibly reflect the ordering of the genes with respect to their true differential expression signal. SVA is particularly useful in producing reproducible results in microarray studies, because adjusting for surrogate variables reduces differential expression due to sources other than the primary variables. These results indicate that EH is prevalent across a range of studies and that SVA can be used to capture and account for these patterns to improve the characterization of biological signal in expression analyses. Results Surrogate Variables We have developed an approach called surrogate variable analysis that appropriately borrows information across genes to estimate the large-scale effects of all unmodeled factors directly from the expression data. Figure 2A shows a simulated example of EH. The primary variable distinguishes the first ten arrays from the last ten (Figure 2B); however, the unmodeled factor may have a variety of effects on expression (Figure 2C). The SVA approach flexibly captures signatures of EH, including highly irregular patterns not following any simple model, by estimating the signatures of EH in the expression data themselves rather than attempting to estimate specific unmodeled factors such as age or gender. After the surrogate variables are constructed, they are then incorporated into any subsequent analysis as covariates in the usual way. The SVA algorithm, described in mathematical detail in Materials and Methods, can conceptually be broken down into four basic steps: (Step 1) Remove the signal due to the primary variable(s) of interest to obtain a residual expression matrix. Apply a decomposition to the residual expression matrix to identify signatures of EH in terms of an orthogonal basis of singular vectors that completely reproduces these signatures. Use a statistical test to determine the singular vectors that represent significantly more variation than would be expected by chance. (Step 2) Identify the subset of genes driving each orthogonal signature of EH through a significance analysis of associations between the genes and the EH signatures on the residual expression matrix. (Step 3) For each subset of genes, build a surrogate variable based on the full EH signature of that subset in the original expression data. (Step 4) Include all significant surrogate variables as covariates in subsequent regression analyses, allowing for gene-specific coefficients for each surrogate variable. Figure 2 Example of Expression Heterogeneity (A) A heatmap of a simulated microarray study consisting of 1,000 genes measured on 20 arrays. (B) Genes 1–300 in this simulated study are differentially expressed between two hypothetical treatment groups; here the two groups are shown as an indicator variable for each array. (C) Genes 201–500 in each simulated study are affected by an independent factor that causes EH. This factor is distinct from, but possibly correlated with, the group variable. Here, the factor is shown as a quantitative variable, but it could also be an indicator variable or some linear or nonlinear function of the covariates. The four-step procedure is necessary both to ensure that the surrogate variables indeed estimate EH and not the signal from the primary variable (Step 1), to ensure an accurate estimate of each surrogate variable by identifying the specific subset of genes driving each EH signature (Step 2), to allow for correlation between the primary variable and the surrogate variables by building the surrogate variables on the original expression data (Step 3), and to take into account the fact that a surrogate variable may have a different effect on each gene (Step 4). The third and fourth steps are particularly important for maintaining unbiased significance with SVA, as demonstrated below. Definition of a Correct Procedure The overall goal of SVA is to provide a more accurate and reproducible parsing of signal and noise in the analysis of an expression study when EH is present. One way in which signal is commonly quantified is through a significance analysis [23]. The most basic definition of a significance analysis being performed “correctly” is if the null distribution is calculated properly [24]. A straightforward means for determining whether this is true is to assess whether the p-values corresponding to true null hypotheses are Uniformly distributed between zero and one. Indeed, p-values are specifically defined so that those corresponding to true null hypotheses have a Uniform(0,1) distribution if and only if the null distribution has been correctly calculated [25]. Throughout this paper, we examine the distribution of p-values from null genes to determine whether various procedures are able to recover the correct null distribution in the presence of EH. To assess statistically any deviations from the Uniform distribution for the null p-values, we apply a nested Kolmogorov-Smirnov test that is robust to chance fluctuations that may be present in a single simulated dataset (see Text S1). Simulated Examples We performed a simulation study to investigate the properties of SVA with respect to large-scale significance testing. Specifically, we show that the SVA algorithm (a) accurately estimates signatures of expression heterogeneity, (b) corrects the null distribution of p-values from multiple hypothesis tests, (c) improves estimation of the false discovery rate (FDR) [23,26], and (d) is robust to confounding between the primary variable and surrogate variables. The primary variable for our simulation was a binary variable indicating two disease classes. We simulated 1,000 expression studies, drawn from the same hypothetical population. For each study, we simulated expression for 1,000 genes on 20 arrays divided between the two disease states. The first 300 genes were simulated to be differentially expressed between disease states and genes 200–500 were affected by an independent unobserved factor to simulate a randomized study (Materials and Methods). Surrogate variables accurately estimated. We first assessed the accuracy of the surrogate variables estimated from SVA. In 99.5% of the simulated studies, a permutation procedure [27] correctly identified one significant surrogate variable. Since there is only one unmodeled factor that was simulated in this study, we assessed the accuracy of the surrogate variable estimation by correlation. (If there is more than one surrogate variable or more than one unmodeled factor, then one must assess the accuracy by using some sort of multiple regression and calculating an R 2 value.) The average correlation between the estimated surrogate variable and the true unmodeled factor over all 1,000 experiments was 0.95 with a standard deviation of 0.05. Each surrogate variable is a weighted average of the expression measurements over a subset of genes. We chose a liberal adaptive cutoff for determining the number of genes affected by each orthogonal EH signal to avoid overfitting. The SVA algorithm correctly identifies the genes affected by the unmodeled factor. On average, 30.5% of the truly affected genes were identified as affected, whereas only 9.9% of the truly unaffected genes were identified as affected. Correct p-value distribution. It is well known that in a significance analysis, p-values corresponding to null genes should be Uniformly distributed (i.e., “flat”) [28]. Statistics has classically dealt with effects from unmodeled factors by performing randomized studies. In our simulations, the unmodeled factor was independently realized, which is equivalent to randomizing the unmodeled factor with respect to the primary variable. Because of this, the p-values corresponding to any single null gene over many simulated datasets follow the Uniform distribution. However, for any given experiment, a single randomization is applied to all genes. Therefore, the thousands of p-values resulting from a single microarray study are not the same as thousands of p-values resulting from independent randomizations of an unmodeled factor. The dependence across genes induced by EH can result in major fluctuations and bias in the p-values for the null genes for any single expression study, even in a well designed, randomized study. This bias generally takes the form of a global deviation of the null p-values from the Uniform distribution. Specifically, if the unmodeled factor is correlated with the primary variable, the null p-values will be too small, biased towards zero. If the unmodeled factor is uncorrelated with the primary variable, the null p-values will be too big, biased towards one. These noteworthy fluctuations and biases in the null p-values can be seen in nine representative datasets from our simulation study in Figure S2 and across all 1,000 simulated datasets in Figure 1B. The bias results in incorrect assessment of significance, regardless of the particular significance measure chosen [24]. By applying the SVA algorithm to adjust the significance analysis, the p-values from the null genes for any single experiment are now corrected toward the Uniform distribution. This can be seen when SVA is applied to these same nine datasets in Figure S3 and across all 1,000 simulated datasets in Figure 1B. Figure 1B shows that the null p-values consistently follow the Uniform distribution when SVA is applied, but they consistently do not follow the Uniform in a typical unadjusted analysis. To confirm that SVA is robust to the distribution of the gene specific error, we ran a second independent simulation study where the residuals were drawn from a published microarray study (Materials and Methods). Figure S4 shows that behavior of the null p-values is corrected by SVA, which is qualitatively the same as in the case of purely simulated data. It should also be noted that p-values corresponding to differentially expressed genes will be similarly affected; the loss in power can be seen in Figure 1C. Although, note that power versus FDR is calculated in Figure 1C when we know the correct answers, which clearly will not be reflected in actual studies where an unadjusted analysis produces an incorrect set of null p-values. Therefore, the application of SVA can result in empirical increases or decreases in power, depending on whether the null p-values are spuriously pushed towards zero or one, even though SVA tends to only provide increases in the true power. Gene ranking more accurate and stable. Perhaps most importantly, SVA also results in a more powerful and reproducible ranking of genes for differential expression. This can be seen in Figures 1A and S5; SVA-adjusted analyses provide gene rankings comparable to the scenario where there is no heterogeneity, whereas an unadjusted analysis allows for incorrect and highly variable gene rankings. This is arguably the most important feature of SVA, since an accurate and reproducible gene ranking is key for making biological inference when only a subset of genes will be selected for future study. In other words, these results suggest that SVA would yield results reproducible on the level that we would expect given that the primary variable is the only source of signal. Improved FDR estimation. There has been much recent interest in the effect of expression dependence across genes on estimates of multiple testing significance measures. Large-scale dependence has been shown to be particularly problematic for estimating FDR, as dependence across genes increases the variance of most standard FDR estimators [1,2,29–35]. EH represents large-scale dependence across genes that may significantly affect estimates of the FDR and related measures. To evaluate the potential impact of SVA in this situation, we performed a simulation study as described above. However, in this case, to create large-scale dependence, we let genes 201–1,000 be affected by the unmodeled factor. SVA reduces the variability in both the estimate of the proportion of null hypotheses and the q-values for each study (Figure S6). Furthermore, the behavior of the SVA-adjusted FDR estimates is almost identical to the behavior under the scenario with no EH. Robustness to confounding in observational studies. To assess the accuracy of the SVA algorithm in the case where the primary variable and unmodeled factors are heavily correlated, we performed a second simulation study. The set-up for the second simulation study was identical to that for the original study above, except in this case the unmodeled factor was simulated such that the average correlation with the primary variable was 0.50 with a standard deviation of 0.16. Under this model, the unobserved factor is both correlated with the primary variable and affects an overlapping set of genes. This is representative of the potential confounding present in observational microarray studies (see Disease Class below) and that which happens by chance in a non-negligible subset of randomized studies. Even in this set-up, the permutation hypothesis test correctly identified a single surrogate variable in 94.5% of the simulated datasets. Further, the average correlation between the estimated surrogate variable and the true unmodeled factor over 1,000 datasets was 0.94 with a standard deviation of 0.22. Thus, SVA accurately estimates the unobserved factor even when there is strong dependence between the primary and unobserved factors, with a subset of genes affected by both. SVA also provided a correct Uniform distribution of null p-values as in the above randomized study scenario. Proof of Concept: Genetics of Gene Expression in Yeast Several recent studies have carried out the genetic dissection of expression variation at the genome-wide level [10,11,15]. Brem et al. [10, 21] measured expression genome wide in 112 segregants of a cross between two isogenic strains of yeast. They also obtained genotypes for each segregant at markers covering 99% of the genome (Materials and Methods). It was shown that many gene expression traits are cis-linking, i.e., the quantitative trait locus (QTL) linkage peak coincided with the physical location of the open reading frame for the expression trait [36]. At the same time, it was also shown that a number of gene expression traits are trans-linking, with linkage peaks at loci distant from the physical location of their open reading frames. In particular, several “pivotal” loci each appear to influence the expression of hundreds or even thousands of gene expression traits. Similar highly influential loci have been observed in other organisms [11,15]. These pivotal loci act as a major source of EH, regardless of whether genotypes have been measured in an expression study. As proof of concept, the Brem et al. [10,21] dataset was used to show that well-defined EH exists in actual studies and that SVA can properly capture and incorporate this EH structure into the statistical analysis of measured variables of interest. First, we analyzed the full dataset to identify the expression traits under the influence of these pivotal trans-acting loci, as well as the patterns of EH induced by these loci. Then we applied SVA to only the expression data, ignoring the genotype data to identify relevant surrogate variables capturing EH. Linkage analysis was performed again including the surrogate variables as covariates, showing that the effects from the pivotal loci are now negligible. In other words, SVA was able to capture and remove the effects of these few pivotal loci without the need for genotypes. A number of expression traits have significant trans-linking eQTL mapping to pivotal loci on Chromosomes II, III, VIII, XII, XIV, and XV (Figure 3A). In the SVA-adjusted analysis, the majority of the trans-linkages to the pivotal loci have been eliminated (Figure 3B). The pervasive trans-linkage signal mapping to the pivotal loci can be viewed as global expression heterogeneity. The reduction in trans-linkage to these loci in the SVA-adjusted significance analysis indicates that SVA effectively captures genetic EH. Figure 3 SVA Captures EH Due to Genotype (A) A plot of significant linkage peaks (p-value k), we conservatively force monotonicity among the p-values. Thus, set pk = max (pk −1, pk ) for k = 2,...,n-df. 10. For a user-chosen significance level 0≤α≤1, call eigengene k a significant signature of residual EH if pk ≤ α. Algorithm to construct surrogate variables. 1. Form estimates and by fitting the model xij .= μi + fi (yj ) + eij , and calculate the residuals r ij = xij − (yj ) to remove the effect of the primary variable on expression. Form the m × n residual matrix R, where the (i, j) element of R is rij . 2. Calculate the singular value decomposition of the residual expression matrix R = UDV T . Let e k = (e k 1 ,...,e kn ) T be the kth column of V (for k=1,...,n). These e k are the residual eigengenes and represent orthogonal residual EH signals independent of the signal due to the primary variable. 3. Set to the number of significant eigengenes found by the above algorithm. Note that “significant” means that the eigengene represents a greater proportion of variation than expected by chance. For each significant eigengene e k k=1,..., . 4. Regress e k on the x i (i = 1,...,m) and calculate a p-value testing for an association between the residual eigengene and each gene's expression. This p-value measures the strength of association between the residual eigengene e k and the expression for gene i. 5. Let π 0 be the proportion of genes with expression not truly associated with e k ; form an estimate , as described previously [23] and estimate the number of genes associated with the residual eigengene by . Let be the indices of the genes with smallest p-values from this test. 6. Form the × n reduced expression matrix . Since is an estimate of the number of genes associated with residual eigengene k, the reduced expression matrix represents the expression of those genes estimated to contain the EH signature represented by some h k as described above. As was done for R, calculate the eigengenes of X r , and denote these by for j=1,...,n. 7. Let j * = argmax1≤j≤n cor and set . In other words, set the estimate of the surrogate variable to be the eigengene of the reduced matrix most correlated with the corresponding residual eigengene. Since the reduced matrix is enriched for genes associated with this residual eigengene, this is a principled choice for the estimated surrogate variable that allows for correlation with the primary variable. 8. In any subsequent analysis, employ the model xij = μi + fi (yj ) + , which serves as an estimate of the ideal model xij = μi + fi (yj ) + . The singular value decomposition is employed in these SVA algorithms. It may be possible to utilize other decomposition methods, but since the singular value decomposition provides uncorrelated variables that decompose the data in an additive linear fashion with the goal of minimizing the sum of squares, we found this to be the most appropriate decomposition. If the primary variables are modeled for data that are not continuous, then it may make sense to decompose the variation with respect to whatever model-fitting criteria will be employed Software. SVA has been made freely available as an R package at Supporting Information Figure S1 Examples of Expression Heterogeneity Heatmaps of hierarchically clustered gene expression data for a random subset of 1,000 genes from three studies are shown. (A) Hedenfalk et al. [22] compared gene expression across tumor subtypes defined by germline BRCA mutations (yellow divides BRCA tumor subtypes), (B) Brem et al. [10,21] measured expression in naturally recombining yeast populations, and (C) Rodwell et al. [7] measured gene expression in kidney samples for patients ranging in age from 27–92 y. (849 KB PDF) Click here for additional data file. Figure S2 Unadjusted p-Values Show Bias and Fluctuations Histograms of the null p-values for nine independent realizations of the simulated gene expression data. The null p-values should be Uniformly distributed, or “flat,” for each experiment. However, across independently simulated datasets, the null p-values range from being conservatively biased to anticonservatively biased depending on the configuration of the unmeasured or unmodeled factor. (17 KB PDF) Click here for additional data file. Figure S3 SVA-Adjusted p-Values Are Uniform Histograms of the null p-values for nine independent realizations of the simulated gene expression experiment, adjusted by SVA. The p-values for the null genes in each simulated experiment are Uniformly distributed. None of these deviates from the Uniform according to a Kolmogorov-Smirnov test. (18 KB PDF) Click here for additional data file. Figure S4 Behavior of Simulated Null p-Values from Microarray Data For each simulated dataset based on the permuted residuals from the Hedenfalk et al. study, a nested Kolmogorov-Smirnov test was employed to assess whether the p-values of null genes followed the correct null Uniform distribution. A quantile–quantile plot of the one thousand Kolmogorov-Smirnov p-values are shown for the SVA-adjusted analysis (solid line) and the unadjusted analysis (dashed line). The grey line represents the expected quantiles. It can be seen that the SVA-adjusted analysis provides correctly distributed null p-values, whereas the unadjusted analysis does not, due to EH. (62 KB PDF) Click here for additional data file. Figure S5 Effect of EH on Gene Ranks A plot of the true rank (according to signal-to-noise ratio) versus the significance test–based average rank (black) plus or minus one standard deviation (red) for each differentially expressed gene in simulated studies (A) affected by EH with an unadjusted analysis, (B) affected by EH with an SVA-adjusted analysis, and (C) unaffected by EH. (439 KB PDF) Click here for additional data file. Figure S6 Effect of SVA on FDR Calculations (A) A histogram of the estimates of the proportion of true nulls π 0 for studies affected by EH. (B) A histogram of the estimates of the proportion of true nulls π 0 for studies affected by EH, after adjusting for SVA. (C) A histogram of the estimates of the proportion of true nulls π 0 for studies without EH. (D) A plot of observed FDR versus true FDR (grey) and average observed FDR versus true FDR (red) for simulated studies affected by EH. (E) A plot of observed FDR versus true FDR (grey) and average observed FDR versus true FDR (red) for simulated studies affected by EH, adjusted by SVA. (F) A plot of observed FDR versus true FDR (grey) and average observed FDR versus true FDR (red) for simulated studies without EH. (265 KB PDF) Click here for additional data file. Figure S7 BRCA Surrogate Variables (A) A plot of the top surrogate variable from the breast cancer data of Hedenfalk et al. [22]; triangles are BRCA1, pluses are BRCA2. (B) A plot of the expression for eukaryotic translation initiation factor 2, EIF2S2, which follows a similar pattern to the top surrogate variable. (87 KB PDF) Click here for additional data file. Figure S8 SVA-Induced Change in Gene Ranking for Differential Expression A plot of the p-value rankings for the SVA-adjusted versus unadjusted significance analysis of the breast cancer data [22], showing substantial differences in the rankings obtained from the two analyses. The red line represents equality of ranking between the two procedures. (181 KB PDF) Click here for additional data file. Figure S9 Regression on Standard Eigengenes (Version 1) Adjusted p-Values Histograms of the null p-values for nine independent realizations of the simulated gene expression experiment, after adjustment by the first regression on standard eigengenes algorithm. (16 KB PDF) Click here for additional data file. Figure S10 Regression on Standard Eigengenes (Version 2) Adjusted p-Values Histograms of the null p-values for nine independent realizations of the simulated gene expression experiment, after adjustment by the second regression on standard eigengenes algorithm. (16 KB PDF) Click here for additional data file. Figure S11 “Empirical Null” p-Values Show Bias For 1,000 simulated datasets based on the Normal residuals, a nested Kolmogorov-Smirnov test was employed to assess whether the p-values of null genes followed the correct null Uniform distribution. A quantile–quantile plot of the one thousand Kolmogorov-Smirnov p-values are shown for the SVA-adjusted analysis (solid line) and the “Empirical Null” technique [31,32]. The grey line represents the expected quantiles. It can be seen that the SVA-adjusted analysis provides correctly distributed null p-values, whereas the “Empirical Null” adjusted null p-values do not. (62 KB PDF) Click here for additional data file. Text S1 Supplementary Text for Capturing Heterogeneity in Gene Expression Studies by Surrogate Variable Analysis (50 KB PDF) Click here for additional data file.

                Author and article information

                [1 ]College of Public Health and Human Sciences, Oregon State University, Corvallis, OR 97331, USA
                [2 ]Department of Pathology and Laboratory Medicine, Brown University, Providence, RI 02912, USA
                [3 ]Section of Biostatistics and Epidemiology, Dartmouth Medical School, Hanover, NH 03755, USA
                [4 ]Department of Epidemiology, University of Minnesota, Minneapolis, MN 55455, USA
                [5 ]Department of Neurological Surgery, University of California San Francisco, San Francisco, CA 94158, USA
                [6 ]Department of Epidemiology, Brown University, Providence, RI 02912, USA
                BMC Bioinformatics
                BMC Bioinformatics
                BMC Bioinformatics
                BioMed Central
                8 May 2012
                : 13
                : 86
                Copyright ©2012 Houseman et al.; licensee BioMed Central Ltd.

                This is an Open Access article distributed under the terms of the Creative Commons Attribution License (, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

                Research Article

                Bioinformatics & Computational biology


                Comment on this article