Capturing Heterogeneity in Gene Expression Studies by Surrogate Variable Analysis

Abstract

It has unambiguously been shown that genetic, environmental, demographic, and technical factors may have substantial effects on gene expression levels. In addition to the measured variable(s) of interest, there will tend to be sources of signal due to factors that are unknown, unmeasured, or too complicated to capture through simple models. We show that failing to incorporate these sources of heterogeneity into an analysis can have widespread and detrimental effects on the study. Not only can this reduce power or induce unwanted dependence across genes, but it can also introduce sources of spurious signal to many genes. This phenomenon is true even for well-designed, randomized studies. We introduce “surrogate variable analysis” (SVA) to overcome the problems caused by heterogeneity in expression studies. SVA can be applied in conjunction with standard analysis techniques to accurately capture the relationship between expression and any modeled variables of interest. We apply SVA to disease class, time course, and genetics of gene expression studies. We show that SVA increases the biological accuracy and reproducibility of analyses in genome-wide expression studies.

Author Summary

In scientific and medical studies, great care must be taken when collecting data to understand the relationship between two variables, such as a drug and its effect on a disease. In any given study there will be many other variables at play, such as the effects of age and sex on the disease. We show that in studies where the expression levels of thousands of genes are measured at once, these issues become surprisingly critical. Due to the complexity of our genomes, environment, and demographic features, there are many sources of variation when analyzing gene expression levels. In any given study, it is impossible to measure every single variable that may be influencing how our genes are expressed. Despite this, we show that by considering all expression levels simultaneously, one can actually recover the effects of these important missed variables and essentially produce an analysis as if all relevant variables were included. As opposed to traditional studies, the massive amount of data available in this setting is what makes the method, called surrogate variable analysis, possible. We hypothesize that surrogate variable analysis will be useful in many large-scale gene expression studies.
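The idea that an unmeasured variable can be recovered from the expression data themselves can be illustrated with a small simulation. The sketch below is not the published SVA algorithm, which involves additional steps; it only shows the core intuition under simplifying assumptions: a single hidden factor `g` that is balanced across the two groups of the measured variable `y`, linear effects, and Gaussian noise. All names (`residualize`, `top_sample_pattern`, the gene and sample counts, the effect sizes) are hypothetical and chosen for illustration. The code removes the effect of the measured variable from every gene, then uses power iteration to find the dominant pattern across samples in what remains.

```python
import random

def center(v):
    m = sum(v) / len(v)
    return [x - m for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def corr(a, b):
    a, b = center(a), center(b)
    return dot(a, b) / (dot(a, a) * dot(b, b)) ** 0.5

def residualize(row, y):
    """Remove the intercept and the linear effect of y from one gene."""
    rc, yc = center(row), center(y)
    slope = dot(rc, yc) / dot(yc, yc)
    return [r - slope * v for r, v in zip(rc, yc)]

def top_sample_pattern(R, iters=50):
    """Power iteration on R^T R: dominant pattern across samples."""
    n = len(R[0])
    v = [random.gauss(0, 1) for _ in range(n)]
    for _ in range(iters):
        u = [dot(row, v) for row in R]                      # R v (one value per gene)
        w = [sum(u[i] * R[i][j] for i in range(len(R)))     # R^T u (one value per sample)
             for j in range(n)]
        norm = dot(w, w) ** 0.5
        v = [x / norm for x in w]
    return v

random.seed(0)
n_genes, n_samples = 200, 20
y = [0] * 10 + [1] * 10    # measured variable of interest (e.g. disease class)
g = [1, -1] * 10           # hidden factor (e.g. batch), balanced across y; assumed unmeasured

X = []
for i in range(n_genes):
    beta = 1.0 if i < 40 else 0.0                       # first 40 genes respond to y
    gamma = random.gauss(0, 1) if i >= 100 else 0.0     # last 100 genes respond to g
    X.append([beta * y[j] + gamma * g[j] + random.gauss(0, 0.5)
              for j in range(n_samples)])

R = [residualize(row, y) for row in X]   # expression with the modeled effect of y removed
sv = top_sample_pattern(R)               # candidate "surrogate" for the hidden factor
hidden_corr = abs(corr(sv, g))
print(f"|correlation| between recovered pattern and hidden factor: {hidden_corr:.3f}")
```

In this toy setting the recovered pattern tracks the hidden factor closely because many genes share it, which is the sense in which the large number of genes makes recovery possible. The recovered pattern could then be included as a covariate alongside `y` in a per-gene model; real data would additionally require deciding how many such patterns to keep and handling hidden factors that are correlated with the variable of interest.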

Author and article information

Contributors

: Role: Editor

Journal

Journal ID (nlm-ta): PLoS Genet

Journal ID (publisher-id): pgen

Title: PLoS Genetics

Publisher: Public Library of Science (San Francisco, USA)

ISSN (Print): 1553-7390

ISSN (Electronic): 1553-7404

Publication date (Print): September 2007

Publication date (Electronic): 28 September 2007

Publication date (Electronic preprint): 1 August 2007

Volume: 3

Issue: 9

Electronic Location Identifier: e161

Affiliations

[1] Department of Biostatistics, University of Washington, Seattle, Washington, United States of America

[2] Department of Genome Sciences, University of Washington, Seattle, Washington, United States of America

North Carolina State University, United States of America

Author notes

* To whom correspondence should be addressed. E-mail: jstorey@u.washington.edu

Article

Publisher ID: 07-PLGE-RA-0237R2 Serial Item and Contribution ID: plge-03-09-20

DOI: 10.1371/journal.pgen.0030161

PMC ID: 1994707

PubMed ID: 17907809

SO-VID: 2258cc7c-3a85-42cb-8a7f-9a9b04a1d39c

Copyright: © 2007 Leek and Storey. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

History

Date received: 9 April 2007

Date accepted: 1 August 2007

Page count

Pages: 12

Custom metadata

Citation: Leek JT, Storey JD (2007) Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet 3(9): e161. doi:10.1371/journal.pgen.0030161

ScienceOpen disciplines: Genetics

