123
views
0
recommends
+1 Recommend
0 collections
    0
    shares
      • Record: found
      • Abstract: found
      • Article: not found

      Heritability and Genomics of Gene Expression in Peripheral Blood

      research-article
      1 , 2 ,   3 , 4 , 5 , 5 , 5 , 5 , 6 , 5 , 1 , 7 , 8 , 8 , 5 , 5 , 9 , 3 , 5 , 7 , 8 , 7 , 9 , 7 , 9 , 10 , 4 , 3 , 1 , 6 , 7 , 9 , 11 , 8 , 7 , 6 , 7 , 5 , 6 , 7
      Nature genetics
      gene expression, peripheral blood, twin study, heritability, expression quantitative trait loci, eQTL

      Read this article at

      ScienceOpenPublisherPMC
      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          We assessed gene expression profiles in 2,752 twins, using a classic twin design to quantify expression heritability and quantitative trait loci (eQTL) in peripheral blood. The most highly heritable genes (~777) were grouped into distinct expression clusters, enriched in gene-poor regions, associated with specific gene function/ontology classes, and strongly associated with disease designation. The design enabled a comparison of twin-based heritability to estimates based on dizygotic IBD sharing and distant genetic relatedness. Consideration of sampling variation suggests that previous heritability estimates have been upwardly biased. Genotyping of 2,494 twins enabled powerful identification of eQTLs, which were further examined in a replication set of 1,895 unrelated subjects. A large number of local eQTLs (6,988) met replication criteria, while a relatively small number of distant eQTLs (165) met quality control and replication standards. Our results provide an important new resource toward understanding the genetic control of transcription.

          Related collections

          Most cited references52

          • Record: found
          • Abstract: found
          • Article: found
          Is Open Access

          Trait-Associated SNPs Are More Likely to Be eQTLs: Annotation to Enhance Discovery from GWAS

          Introduction Results of genome-wide association studies (GWAS) in complex traits published to date have provided us with surprisingly little new information on the nature of the genetic component to these phenotypes, despite the large number of single nucleotide polymorphisms (SNPs) found to be reproducibly associated with such traits. In some cases, this reflects the fact that major aspects of the biological basis for disease were already well understood; results of GWAS in autoimmune disorders, for example, have reinforced the central importance of the immune system and its regulation. For a few disorders, results of GWAS have highlighted biological contributing factors to disease that had not previously been recognized as central, such as the complement system in macular degeneration [1] or autophagy in Crohn's disease [2]–[4]. While it does seem ungrateful to question the utility of GWAS when they have yielded so many more reproducible associations than we have achieved with any other approach, the fact is that our primary goal in conducting GWAS for a complex trait – achieving a comprehensive understanding of the genetic basis for that trait – remains elusive for most of the traits that have been examined. The failure to achieve this goal is made particularly acute by the recognition that the loci that have been successfully identified not only provide us with little insight into the genetic basis for the trait, but also account for little of the overall heritability [5]. Although there are important caveats to these grim statistics – most published studies have conducted discovery research in populations of European descent, and some important disorders have not even been examined using GWAS yet – the collective experience has led to sharp disagreement as to whether there is sufficient value in targeting near-term investments in genomics to GWAS, or whether such investments are better targeted to sequencing [6]–[8]. The key issue in this controversy is whether the genetic risk factors not yet discovered are largely similar in frequency and effect size to those that have been discovered using GWAS (i.e. common alleles with low risk) or are, instead, rarer alleles that would be best identified through sequencing studies. Studies we report here suggest that we have not yet exhausted the signals that can be discovered through GWAS. Annotating SNPs with information on expression can improve our ability to more easily distinguish those associations likely to be replicated, and provide us with a better understanding of the genes and mechanisms driving the associations we discover. Moreover, it appears that for at least a subset of complex disorders, there are many more common variants truly associated with disease and highly likely to be expression quantitative trait loci (eQTLs). These variants can be identified and characterized using existing GWAS and tools such as the SCAN database (SNP and Copy number ANnotation; http://www.scandb.org) [9]. Results SNPs Associated with Complex Traits Are More Likely to Be eQTLs than Frequency-Matched SNPs from GWAS Platforms The 1598 SNPs characterized in the GWAS catalog [10] as showing association with complex traits included 625 that would be classified as eQTLs with a p-value threshold of 10−4 using CEPH (Centre d'Etude du Polymorphisme Humain) samples of European descent from Utah (CEU) or 972 using the combined sample of the CEU plus the Yoruban samples from Ibadan Nigeria (YRI) data to define eQTLs, 46 (83 for the combined CEU+YRI) that would be classified as eQTLs with a p-value threshold of 10−6 and 17 (18 for the combined CEU+YRI) that would be classified as eQTLs with a p-value threshold of 10−8. As summarized in Figure 1, we observed significantly more eQTLs among these 1598 trait-associated SNPs than expected given their minor allele frequency (MAF) distribution. Given that many gene transcript levels show substantial correlations, it is, perhaps, not surprising that we also observed that SNPs associated with many transcripts (sometimes called eQTL hot spots or master regulators) were also enriched among trait-associated SNPs. Using a more stringent definition for defining trait-associated SNPs (up to 10−8) reduced their absolute numbers of such SNPs, but increased the significance of the observed enrichment of eQTLs among trait-associated SNPs. 10.1371/journal.pgen.1000888.g001 Figure 1 Trait-associated SNPs are more likely to be eQTLs. The distribution of the number of eQTLs (defined as p 0.3 and for each replicate conducted 100 simulations conditional on MAF as for Figure 1 (see Materials and Methods for simulation details). The eQTL enrichment was preserved in LD-pruned sets of cataloged variants, for eQTL definition thresholds with p 0.05) on high throughput platforms, associated with traits in the GWAS catalog, and associated with phenotypes selected to represent those for which lymphoblastoid cell lines (LCLs) might be expected to be a poor proxy for the human tissues most relevant for disease (neurological/psychiatric phenotypes, cancers) and those for which LCLs might be expected to be a good proxy for the human tissues most relevant to disease (autoimmune disorders). As is apparent from the table, neurological and psychiatric phenotypes as well as cancers (regardless of tumor site) showed levels of enrichment similar to those observed for the overall trait-associated SNP set, although as expected, autoimmune disorders showed somewhat greater levels of enrichment. Division of the GWAS catalog SNPs into those identified through studies on autoimmune disorders and those identified through studies on other complex traits reveals that the apparent enrichment of eQTLs (defined at a threshold of p .05) 1,213,906 595,285 (.490) 29,347 (.024) All Catalog SNPs 1598 972 (0.608) 83 (0.052) Autoimmune Disorders 259 165 (0.637) 21 (0.081) Cancers 93 56 (0.602) 4 (0.043) Neurological/Psychiatric Disorders 63 41 (0.651) 2 (0.032) Using eQTLs from CEU only Platform SNPs (MAF >.05) 1,213,906 345,249 (.284) 12,749 (.011) All Catalog SNPs 1598 625 (0.391) 46 (0.029) Autoimmune Disorders 259 116 (0.448) 17 (0.066) Cancers 93 30 (0.323) 3 (.032) Neurological/Psychiatric Disorders 63 20 (0.317) 1 (0.016) SNPs from the NHGRI catalog were classified according to the type of disease leading to their inclusion as a trait-associated SNP to investigate whether eQTLs identified in LCLs are more enriched in disorders for which LCLs are more likely to be an appropriate tissue match for the disease. Studies were conducted separately using eQTLs identified in the combined CEU+YRI samples (upper lines), and those identified only in the CEU (lower lines). Only SNPs with MAF >0.05 are included in these studies, and eQTL p-value thresholds of 10−4 and 10−6 are shown. Using eQTL Information Improves Ability to Identify Reproducible Signals in GWAS In order to understand the practical utility of our observation that trait-associated SNPs are more likely to be eQTLs, we examined in more detail results summaries from the Wellcome Trust Case Control Consortium (WTCCC) GWAS [11], using the data on Crohn's disease as a primary example application. If SNPs associated with Crohn's disease are no more likely than non-associated SNPs to be eQTLs, we would expect the proportion of SNPs associated with Crohn's disease at p 3. SNPs have been ordered according to their WTCCC Crohn's association p-values from the most to the least significant, and divided in groups of 10,000. For each group bin of 10,000 trait-associated SNPs, the number of SNPs with expression score larger than 3 was calculated and the results are shown in the scatterplot. The horizontal line illustrates the expectation based on the observed eQTL function scores in all SNPs in the WTCCC Crohn's dataset. The dotted lines represent estimated 95% confidence bands obtained using simulations. Analyses on other phenotypes from the WTCCC association studies [11] confirmed that our proposed annotations benefit phenotypes beyond Crohn's disease. Figure 4 illustrates that both type 1 diabetes (T1D) and rheumatoid arthritis (RA) had significantly more SNPs than expected with phenotype associations (p 3. The plots show results of eQTL enrichment analysis for the remaining Wellcome Trust phenotypes beginning with the SNPs most strongly associated with disease (similar to analyses for Crohn's disease summarized in Figure 3). For each disease, SNPs have been ordered according to their association p-values from the most to the least significant, and divided in groups of 10,000. For each group, the number of SNPs with expression score larger than 3 was calculated and the results are shown in the scatterplots. The horizontal lines illustrate the expectation based on the observed scores in all SNPs in the relevant WTCCC dataset. The dotted lines represent estimated 95% confidence bands obtained using simulations. Discussion Our results build on those of previous studies, particularly those of Schadt and colleagues [12]–[15] that focused on using transcriptome information to improve understanding of genetic signals, but the rapid accumulation of trait-associated SNPs through GWAS, coupled with the systematic efforts to catalog these variants [10],[16] has enabled us to generalize the concept that information on eQTLs will have utility for understanding the genetic component to complex traits. Our primary observations – that SNPs reproducibly associated with complex human traits are more likely to be eQTLs – are novel, although perhaps not unexpected given that few of even those SNPs most reproducibly associated with complex traits have been attributed to missense or nonsense variants. Our subsequent observations made in association studies in which we annotated SNPs with eQTL information, imply that at least some complex disorders have substantial numbers of undiscovered susceptibility loci that can be more easily discovered and characterized by annotating SNPs with information on eQTL scores. This is also a novel observation. Moreover, use of information on expression appears to benefit our understanding of not only more tractable disorders, such as Crohn's disease and autoimmune disorders, for which relatively large numbers of loci have already been identified at the cost of genotyping 10,000 or fewer samples, but also for less tractable disorders such as hypertension and bipolar disease, for which fewer associations have been characterized despite the large numbers of genotyped samples (>29,000 in GWAS and >34,000 for replication in studies of hypertension [17]). Use of expression information should move us closer to that elusive goal of developing a more comprehensive understanding of the genetic basis for complex traits in two ways: (1) by identifying and characterizing a larger number of contributing loci, we should be better able to discern the key biological functions affected by genetic risk factors and (2) SNP signals may be more accurately characterized with respect to target genes and mechanism of effect, further enhancing our ability to discern relevant biological functions. Thus, our results provide a strong additional rationale for the inclusion of eQTL information in the annotation of SNPs from GWAS. Moreover, even the analyses we have completed to characterize trait-associated SNPs as being more likely than allele frequency matched SNPs from the same platform to be eQTLs have revealed information about the potential biological rationale for some of the observed associations. For example, an intronic SNP, rs3129934, on chromosome 6 that is a cis-acting eQTL for multiple HLA transcripts with a high eQTL functional score (24.9) shows the strongest association with T1D in the MHC region in the WTCCC study [11] and is also the SNP most strongly associated with increased risk for multiple sclerosis (MS) in a pooling-based GWAS [18]. A variety of the approaches we used in conducting these studies should be of general utility in enrichment studies. It is clear, for example, that the MAF spectrum of SNPs associated with complex human traits is quite different than that of SNPs on the products used to conduct GWAS, or in the overall HapMap. The much higher minor allele frequencies observed for SNPs reproducibly associated with complex traits than for SNPs on high throughput platforms is a natural consequence of the increased power to detect associations for SNPs with higher minor allele frequencies. It is clearly critical to condition on the MAF in enrichment studies in which simulations are conducted using SNPs from high throughput platforms or from the HapMap, and to date, this has not been widely appreciated. The correlation among SNPs due to LD is a further complicating factor in conducting enrichment studies. We are confident that failure to adequately account for LD will not affect our estimate of the proportion of SNPs that are eQTLs, but rather the variance of this estimate. There are, however, more subtle biases related to LD that might have affected our studies. For example, true associations may be more easily detected in regions of the genome with high LD because these regions are more likely to have good coverage on high throughput platforms for GWAS. Although it is unclear that this would bias results of our studies, it is important to note that these types of subtle biases can affect results of enrichment studies in unpredictable ways. The challenges of conducting eQTL studies are also widely appreciated. The eQTLs identified in array-based transcriptome studies in LCLs are likely to be a limited subset of the eQTLs that will ultimately be identified using more sensitive techniques in comprehensive collections of human tissues. Moreover, a variety of studies have highlighted the potential sensitivity of eQTL studies to cell and tissue subtypes [15],[19],[20]. Although several studies have suggested that eQTLs will be largely tissue specific [21],[22], others note that substantial numbers of eQTLs are shared across tissues [15],[23]. Similarly, some have found that LCLs are prone to such high levels of non-genetic variability (growth parameters, EBV levels, ATP levels, etc.) that they might be considered unreliable for genomic studies [24], while others highlight the reproducibility of results of transcriptome studies [25],[26] and find that LCLs show little variability as a consequence of non-genetic factors [27]. To the degree that the transcriptome is a reflection of the local and temporal environment, it could be argued that eQTLs identified in generic studies will have little value for enhancing understanding of complex trait studies unless eQTLs are identified in relevant tissues of the individuals participating in the complex trait studies. Our results provide little support for such arguments, and instead highlight the utility of LCLs as a reasonable model for the study of expression. And while LCLs are clearly not the ideal tissue for expression studies in many complex disorders, we note that we plan to serve results of eQTL studies in a variety of tissues in SCAN, and consider the studies summarized in this manuscript to be an initial effort that will be enriched by subsequent studies in additional tissues. It is possible that diseases without evidence for eQTL enrichment using data from LCLs will ultimately yield such evidence when data from more relevant tissues is used; alternatively, at least a subset of those disorders without evidence for enrichment for eQTLs may turn out to derive substantial heritability from rare variants. In any case, results of our studies clearly support additional investment in eQTL identification in other tissues, as well as in the development of public access databases and software tools, such as those in SCAN (http://www.scandb.org), that allows results of such studies to be productively used by the scientific community. Materials and Methods SNPs Reproducibly Associated with Complex Human Traits The National Human Genome Research Institute (NHGRI) has collected the results of published GWA studies in a publicly available online database (http://www.genome.gov/gwastudies) [16]. This catalog is continually updated; the version we utilize for the present study was downloaded on June 29, 2009. We used the NHGRI default set of SNPs reported to be associated to complex traits with a p-value of at least 10−5 for all figures and tables in the manuscript, but we note that results show more significant enrichment for more stringent definitions for trait-associated SNPs (we tested 10−5 through 10−8). The SCAN Database We have created an online database to serve the results of our gene-expression studies in LCLs. SCAN (SNP and Copy number ANnotation database; http://www.scandb.org) can be queried through either the SNP or gene to retrieve information on the relationship between SNPs and transcript levels at user-specified p-value thresholds [9]. SCAN enables batch queries of genes to retrieve a list of SNPs that predict the expression of the genes at a user-specified threshold. SCAN also holds multi-locus disequilibrium measures [28] to summarize reported LD information among SNPs and to characterize coverage of genes by the high-throughput genotyping platforms. The expression data served in SCAN had been assayed in HapMap LCLs (87 CEU and 89 YRI) with the Affymetrix GeneChip Human Exon 1.0 ST Array [29]. The exon array measures both exon-level and gene-level expression, includes approximately 1.4 million probesets, and profiles over 17,000 gene transcripts. Genome-wide association analyses of 13,080 transcript clusters (gene-level) with reliable expression and more than 2 million SNPs in the CEU or YRI populations were conducted using the Quantitative Trait Disequilibrium Test (QTDT) [30]. A transcript cluster or probeset in the exon array is defined as reliably expressed in LCLs if the log2-transformed expression signal is greater than 6 in at least 80% of the 176 HapMap samples included analyses. Each transcript cluster includes a set of probesets (exon-level) containing all known exons and 5′- and 3′- untranslated regions (UTRs) in the genome. In contrast to other arrays (e.g., U95 and U133 series arrays), the probes on the exon array cover entire gene regions and not just 3′-UTRs. Since SNPs in probes can generate spurious eQTL signals by affecting hybridization, we downloaded SNP data from dbSNP (build 129, human genome assembly build 36) to identify probes in probesets that hybridize to regions containing SNPs. The genomic coordinates of the probes were downloaded from the Affymetrix website (http://www.affymetrix.com). We filtered these probes before conducting the expression analyses generating the results served in SCAN. Analysis of eQTL Enrichment All analyses have been conducted using all available information (eQTLs for both CEU and YRI samples) and also using only information from the CEU samples. Results are substantially similar, but because most GWAS reported to date have been conducted in samples of recent European descent, and to be conservative, summary figures use only the eQTL information from the CEU samples. We also provide details for some of the key eQTL studies in tabular form (Table 1), where we provide summary information for enrichment studies using CEU+YRI eQTL data and for enrichment studies using only the CEU eQTL data. The MAF distribution for SNPs showing reproducible associations with complex human traits is markedly different from the MAF distribution for all SNPs on the high density SNP genotyping platforms, or the minor allele frequency distribution of SNPs included on high throughput platforms used for GWAS (Figure S1). Thus, any study focused on enrichments among SNPs reproducibly associated with complex human phenotypes must be conditioned on MAF. To enable us to conduct simulations conditional on MAF, we constructed MAF bins as follows: allele frequencies were calculated from HapMap genotype data on the CEU. Only the parents' genotypes were included in the frequency calculations since the children's alleles are not independent. We used the toolkit PLINK [31] to calculate frequencies. We classified SNPs on Affymetrix Genome-Wide Human SNP Array 6.0 and Illumina's High Density Human 1M-Duo into non-overlapping minor allele frequency (MAF) bins, each of width .05, using the MAFs of the SNPs in the CEU samples, as most of the published reports summarized in the NHGRI catalog are from GWAS on individuals of recent European descent. We conducted simulations to test for an enrichment of eQTLs among the NHGRI variants associated with complex traits by generating 1000 randomized SNP sets Si each of the same size as the original NHGRI list (1598) containing variants matched on MAF distribution, sampled without replacement from the set of typed SNPs on each high-throughput platform, which had been grouped into discrete MAF bins. For each set, we determined the number of eQTLs, Qi , at a p-value threshold. We tested the robustness of the eQTL enrichment across a range of p-value thresholds (10−8 to 10−4), as shown in Figure 1. These simulations (N = 1000) yield an empirical p-value, calculated as the proportion of simulations in which the number of eQTLs exceeds the observed number, Q, in the NHGRI list. Because some GWAS report associations for SNPs that were interrogated through imputation rather than direct genotyping, we repeated the analysis described above except that we used the entire set of HapMap SNPs to generate the 1000 randomized SNP sets with MAF matched to catalog SNPs, rather than only the SNPs included on high throughput GWAS platforms. Results of these studies show an even more significant enrichment of eQTLs among catalog SNPs (Figure S2). Thus, we focus in this paper on the more conservative results obtained using the simulations generated from SNPs on high-throughput GWAS platforms. To investigate whether the eQTL enrichment we observe in the NHGRI variants is driven by linkage disequilibrium (LD), an LD-pruned SNP set was generated from the NHGRI list using r2 4 MB from (or on another chromosome than) any gene with which the SNP shows transcript level association. In order to devise a one-dimensional score that summarizes as much of the eQTL signal information as possible for prioritizing GWAS signals according to their “functionality”, we first note that neither the scale of the score nor the distribution of the score for non-functional SNPs are relevant – just the appropriate ranking of SNPs with a function in regulating transcript levels. The building blocks for the SNP score are: po which is the overall smallest eQTL p-value (it can be either cis or trans), pc which is the smallest cis p-value, No which is the overall number of transcripts used in the analysis, and Nc which is the number of cis transcripts (i.e. the number of transcripts annotated to genes in the vicinity of the SNP under investigation). The score we propose: builds on Bonferroni corrections applied separately for trans and cis signals (there are 13,080 transcripts used in the analysis). Note that the scores are truncated at zero. Utilizing Expression Information in GWAS In order to determine whether our newly derived functional score can aid in identifying associations likely to be replicated, we first examined the summary data for Crohn's disease from the WTCCC association study [11] (downloaded when results data were publicly available). After QC (less than 5% missing data, good clustering), we filtered SNPs at a 1% MAF threshold, and removed markers with no rs ID, leaving 391,878 SNPs. We matched these SNPs using the rs IDs within the SCAN database, leaving us with 386,306 SNPs. We then ordered the SNPs according to their score S from largest to smallest score. The top SNPs based on the eQTL score show an excess of small association p-values with Crohn's disease (for example, in the top ten SNPs, six have p-values smaller than 0.01). We divided the SNPs into groups of 10,000 (in the order of their eQTL scores) and counted the number of SNPs in each group with p 3, >2, and >1 among the top 1000 SNPs for Crohn's disease (Figure 3). We repeated the Crohn's disease enrichment studies in the other WTCCC phenotypes. As with Crohn's disease, we first conducted analyses checking for an excess of small association p-values in the SNPs that have high expression annotation scores (Figure 4). Finally, we ranked the SNPs according to their association p-values and looked for an enrichment of high expression scores (Figure 5). To insure that results of enrichment studies were not unduly influenced by large numbers of correlated SNPs in the MHC region with high eQTL scores, we repeated the analyses summarized in Figure 2 and Figure 4 after removing all SNPs on chromosome 6 (see Figure S5). Results were relatively unchanged. Supporting Information Figure S1 Minor allele frequency (MAF) distributions for Affymetrix 6.0 SNPs (A), Illumina 1M SNPs (B), and NHGRI Associated SNPs (C). (1.57 MB TIF) Click here for additional data file. Figure S2 (A) The distribution of the number of eQTLs (at p<10−4 observed for each of 1,000 draws of 1,598 SNPs from bins matched for minor allele frequency to the 1,598 SNPs downloaded from the NHGRI catalog (bins include all HapMap SNPs) is shown in the bar graphs, with the actual number of eQTLs observed in the 1,598 SNPs from the NHGRI catalog shown as a solid circle. (B) Identical to Figure 1, leftmost panel, with analysis as above, except that SNPs for the simulation were drawn from SNPs on high-throughput GWAS panels. (2.69 MB TIF) Click here for additional data file. Figure S3 Observed numbers of eQTLs among all trait-associated SNPs trimmed for LD are shown as solid black circles with the bar graph showing the distribution of eQTLs among SNPs chosen at random from the same allele frequency bins from among all SNPs included on high-throughput GWAS products. Enrichment of eQTLs among trait-associated SNPs is preserved even in the absence of LD among trait-associated SNPs. (2.09 MB TIF) Click here for additional data file. Figure S4 Observed numbers of master regulators among all trait-associated SNPs are indicated by the solid black circles and the distribution of the number of master regulators (defined as SNPs predicting 10 transcripts and SNPs predicting 100 transcripts, both for p<10−4) observed in 1,000 draws of 1,598 SNPs from minor-allele-frequency-matched bins (including all SNPs on Illumina 1M and Affymetrix 6.0 products) is plotted in the bar graphs. (2.14 MB TIF) Click here for additional data file. Figure S5 This is similar to Figure 2, except that all SNPs on chromosome 6 have been removed from calculations, demonstrating that the Crohn's enrichment for eQTLs is not dependent on SNPs in the HLA region. Results were similar for T1D and RA. (0.88 MB TIF) Click here for additional data file.
            Bookmark
            • Record: found
            • Abstract: found
            • Article: not found

            Genetics of gene expression and its effect on disease.

            Common human diseases result from the interplay of many genes and environmental factors. Therefore, a more integrative biology approach is needed to unravel the complexity and causes of such diseases. To elucidate the complexity of common human diseases such as obesity, we have analysed the expression of 23,720 transcripts in large population-based blood and adipose tissue cohorts comprehensively assessed for various phenotypes, including traits related to clinical obesity. In contrast to the blood expression profiles, we observed a marked correlation between gene expression in adipose tissue and obesity-related traits. Genome-wide linkage and association mapping revealed a highly significant genetic component to gene expression traits, including a strong genetic effect of proximal (cis) signals, with 50% of the cis signals overlapping between the two tissues profiled. Here we demonstrate an extensive transcriptional network constructed from the human adipose data that exhibits significant overlap with similar network modules constructed from mouse adipose data. A core network module in humans and mice was identified that is enriched for genes involved in the inflammatory and immune response and has been found to be causally associated to obesity-related traits.
              Bookmark
              • Record: found
              • Abstract: found
              • Article: not found

              A Scan for Positively Selected Genes in the Genomes of Humans and Chimpanzees

              Introduction Genes, or regions of the genome, that have been affected by natural selection may show an excess of functionally important molecular changes, beyond what would be expected in the absence of selection. Genomic regions with such an excess of changes are said to have experienced positive selection, i.e., selection in favor of new genetic variants. The most common statistical technique for detecting positive selection takes advantage of the fact that mutations in coding regions of genes come in two classes: nonsynonymous mutations that change the resulting amino acid sequence of the protein and synonymous mutations, which do not change the encoded protein. An excess of nonsynonymous mutations over synonymous mutations, beyond what would be expected if the two types of mutations occur at the same rate, provides strong evidence for the past action of positive selection at the protein level. Using this logic, there have recently been numerous studies documenting positive selection in a variety of genes and organisms, including immune-response-related genes [1–3], viral genes [4–6], fertilization genes [7,8], and genes involved in sensory perception and olfaction in humans [9]. Clark et al. [10] compared 7,645 genes from humans to their orthologs from the chimpanzee and the mouse. For each gene, they tested if there was an excess of nonsynonymous substitutions on the evolutionary lineage leading to humans. They showed that there was an excess of putatively positively selected genes in several functional classes, including genes involved in sensory perception, olfaction, and amino acid catabolism. They also showed that human genes that have been targeted by positive selection are significantly more likely to harbor variation associated with known genetic diseases. We here report the results of an analysis of 20,361 human and chimpanzee genes (of which 6,630 later were eliminated in a very conservative quality control), which includes the 7,645 genes analyzed by Clark et al. [10]. While the objective of the study by Clark et al. [10] was to find genes that have experienced accelerated evolution on the human lineage, using the mouse as an outgroup, the aim of the current study is to find genes that have been targeted by positive selection at any point in time during the evolution of humans and chimpanzees, based on a larger set of genes. We use a likelihood ratio test to identify positive selection and do extensive simulations to find the appropriate critical values of the test. Positive selection is inferred if the ratio of nonsynonymous substitutions per nonsynonymous site to synonymous substitutions per synonymous site (dN /dS) is statistically significantly greater than one in a test of the neutral null hypothesis dN /dS = 1 [11,12]. The method used for detecting positive selection takes transition/transversion rate biases and unequal codon and amino acid frequencies into account. The test for positive selection applied in this study is a traditional test of dN /dS greater than one. It has more power than the test used in the Clark et al. study [10] if selection affects both the human and the chimpanzee lineages because it uses information from both lineages. Results Chimpanzee sequence was obtained by PCR using primers designed to flank exon sequence annotated in the human genome [10]. Our analysis begins with data from 20,361 coding regions, including 103,606 nucleotide differences and 403 indels among 17,687,331 aligned nucleotides. These numbers are significantly lower than the genome-wide averages [13,14], presumably due to selective constraints in the coding regions. The distributions of nonsynonymous and synonymous nucleotide differences among genes are shown in Figure 1. The average numbers of nonsynonymous and synonymous mutations per nucleotide site are 0.002578 and 0.003281, respectively. Eliminating reads without a hit to known genes in public databases (see Materials and Methods), there are 71,896 nucleotide differences in 13,731 genes. The remaining analysis is restricted to this set of genes. Among them, 5,574 were eliminated from the positive selection analysis because they had fewer than three mutations, and 797 were eliminated because the sequence was less than 50 bp long. Additionally, 45 genes were eliminated because they contained internal stop codons, presumably due to erroneous annotations or sequencing errors. Among the remaining 8,079 genes, 3,913 were also analyzed by Clark et al. [10]. The average level of sequence divergence was 0.60%, corresponding to a divergence level of 1.57% in silent sites. This figure matches well the level of divergence observed by Ebersberger et al. [14] for Chromosome 22 of 1.44% overall and 2.26% in CpG islands. Seven hundred thirty-three of the 8,079 genes evolved with dN /dS greater than one, but only 35 had p-values less than 0.05, as determined by a likelihood ratio test of the null hypothesis of dN /dS = 1 against the alternative hypothesis of dN /dS greater than one. The number of significant genes at the 5% level, in this one-sided test, is lower than the nominal level because the vast majority of genes are conserved and evolve with dN /dS less than one. Nonetheless, after using Simes's improved Bonferroni procedure [15] we can, at the 5% significance level, reject the hypothesis that none of the genes are evolving with dN /dS greater than one. This also implies that a 5% false discovery rate set is nonempty. Even though the level of divergence between humans and chimpanzees is very low, there is statistically significant evidence for positive selection in the DNA sequences of these two species. Results for all genes are available in Dataset S1. Biological Processes Affected by Positive Selection To identify functional groups of genes with an overrepresentation of putatively positively selected genes, we used the PANTHER [16,17] classification of biological processes and a Mann-Whitney U test (MWU) based on the p-values from the likelihood ratio test (Table 1). The classification based on the MWU identifies categories of genes with small p-values from the likelihood ratio test. It is important to notice that genes that evolve approximately neutrally will tend to have smaller p-values than genes evolving under strong functional constraints. The classification based on the MWUs, therefore, does not provide unambiguous evidence for positive selection, but it provides a key to which groups harbors the most candidates for positive selection. Immune-defense-related genes appear at the top of the list. It is not surprising that several of the genes experiencing most positive selection are involved in immune responses to viruses. Considering the speed at which many pathogens, such as viruses, evolve (e.g., [5]), a coevolutionary molecular arms race between pathogens and host cells might explain the presence of strong selection favoring new mutations in these genes. Other forces, including overdominant selection to diversify the spectrum of immune responses, may also cause positive selection in immune- and defense-related genes. Such explanations have previously been used to explain the presence of positive selection in the human major histocompatibility complex [18]. As in [10] we also identify genes involved in various forms of sensory perception, including olfaction and genes classified as “unknown biological function.” Many of the genes with unknown biological function show sequence similarity with known transcription factors (data not shown). Much of the selection on sensory genes is driven by the selection on olfactory receptors previously found by Gilad et al. [9]. In contrast to Clark et al. [10], we also find that genes involved in spermatogenesis appear to have an excess of positively selected genes. The genes involved in spermatogenesis showing the strongest evidence for positive selection include several KRAB-containing zinc finger proteins that serve as repressors of transcription and are believed to be involved in determining the differentiation of pluripotent stem cells [19]. Expression Patterns and Positive Selection We also categorized 3,464 of the 8,079 genes according to the tissue of expression in the Novartis Gene Expression Atlas [20]. Because of the relatively small number of tissue-selective genes in our dataset (204) and the large number of tissues analyzed (28), many tissues had fewer than 20 tissue-selective genes, providing little statistical power for further subdivision. Therefore, we examined instead whether the tissue of maximal expression for a gene was correlated with positive selection, since high expression levels and importance in tissue function are often, but not always, correlated. The set of genes that have their maximal expression in the testes is the only one showing an excess of positive selection, after a Bonferroni correction for multiple tests (Table 2). Genes with their maximal expression in the brain do not have an excess tendency toward positive selection. In fact, genes expressed in the brain seem to be among the most conserved genes with the least evidence for positive selection. MWUs, comparing genes with their maximal expression in the brain (83 genes) to all other genes, show that these genes tend to have significantly higher p-values of the likelihood ratio test for positive selection (p = 0.035), indicating high levels of selective constraint. Genes that are expressed in the brain at a level of twice the expression level found in blood show an even stronger tendency toward avoidance of positive selection (p = 0.0002). Although studies of gene expression in the brain tissue are complicated by low-abundance transcripts and heterogeneous specialized brain regions [21], the overall evidence points toward a deficiency of positively, or fast evolving, genes among those expressed in the brain. The causes for the cognitive differences may instead be sought in adaptive changes in just a few genes, in changes in gene expression [22], or in changes in copy number and/or organization of genes relating to cognitive function [23]. Dorus et al. [24] found that genes expressed in the nervous system showed a relative increase in the rate in primates relative to rodents when compared to housekeeping genes, but provided no direct evidence for positive selection on these genes. Nervous-system-specific genes appear to be so conserved that it is unlikely that direct evidence for positive selection will be discovered in this group of genes. Positive Selection in the X Chromosome We also tested if any chromosomes show an excess of genes with evidence for positive selection. The only chromosome enriched in genes with small p-values from the likelihood ratio test for positive selection is the X chromosome (p = 0.0049; MWU). Several factors influence the contrast between the X and autosomes in tests of selection, including hemizygosity of the X in males, resulting in more effective selection against deleterious recessive and in favor of positive recessive mutations [25]. Male hemizygosity also results in mutations, with male-specific effects being more readily fixed by selection on the X [26]. This increased efficiency of selection for male-specific genes on the X may explain the excess of X-linked genes expressed in spermatogonia [27]. The observation that reproductive proteins generally evolve at a greater rate, coupled with the overrepresentation of male-specific genes on the X, could produce the excess positive selection seen on the X. However, after eliminating all genes with highest expression levels in the testis, or annotated as functioning in spermatogenesis, there is still an excess of putatively positively selected genes on the X chromosome (p = 0.0131; MWU). Thus, it appears that the elevated positive selection on the X is likely due to the general tendency of mutations to be recessive, regardless of their tendency to be male-limited in expression. Although other factors, such as an elevated male mutation rate [28], differences in the efficacy of genetic hitchhiking between autosomes and the X chromosome [29], and correlations between recombination rate and divergence [30], may cause differences in variability and substitution rate between autosomes and the X chromosome, none of these factors alone can explain the excess of positively selected genes on the X chromosome. Analysis of the 50 Genes Showing Strongest Evidence for Selection We studied the 50 genes with the highest likelihood ratios in greater detail to further characterize the causes of positive selection and examine error rates (Table 3). To investigate the degree to which our results might be influenced by sequencing errors, we compared the data for these genes with the public data available for the same genes. In the regions with overlap between the public data and our data there were a total of 327 mutations in the public data and 306 mutations in our data. This demonstrates that there is not an excess of (potentially artifactual) mutations in our data in the genes that show evidence for positive selection. While most of the 50 genes also show strong evidence for positive selection in the public data, six of the genes do not. HC19953, HC2758, HC6579, HC7761, HC8067, and HC9844 do not have dN /dS ratios larger than one in the public data. In most cases, the difference is caused by the fact that our database and the public database contain different regions of the genes. Not all regions of a gene are expected to be targeted by positive selection, but this does not challenge the evidence for positive selection in the regions of the genes included in this analysis. In any case, using the public data would not change the qualitative conclusions of the analysis of the genes presented here. Immunity and Defense Genes Targeted by Positive Selection The top 50 genes include many genes that we might a priori expect to be targets of positive selection, including four genes involved in olfaction (OR2W1, OR5I1, OR2B2, and C20orf185) and several genes involved in host–pathogen interactions, such as CMRF35H, CD72 antigen, pre-T-cell antigen receptor α (PTCRA), APOBEC3F, and granzyme H (GZMH). Only one of these genes was among the 50 most significant entries in the Clark et al. [10] model 2 analysis. APOBEC3F encodes an antiviral factor that has previously been demonstrated to be under positive selection by Sawyer et al. [3] who note that this gene has been associated with anti-HIV activity. Presumably, most of these genes have been targeted by positive selection throughout the primate and mammalian phylogeny. The widespread evidence for positive selection in immune-related genes confirms the hypothesis that much positive selection in the human and mammalian genomes may be driven by a coevolutionary arms race between host immune system and pathogens. Spermatogenesis- and Apoptosis-Related Genes The list also contains many testis- or sperm-specific genes including Protamine-1 (PRM1), which previously has been shown to be under positive selection [31], possibly due to sperm competition (but see [32] for an alternative explanation). Other sperm-specific genes on the list include USP26, C15orf2, PEPP-2, TCP11, HYAL3, and TSARG1. The inclusion of these genes in the list of the genes showing the strongest evidence for positive selection is consistent with the results, based on the PANTHER annotation and the Novartis expression data, of excess positive selection in sperm/testis-specific genes. The possible causes include sperm competition (e.g., [31]), sexual conflict (e.g., [7,8]), selection for reproductive isolation, pathogen-driven selection in the reproductive organs, and selection related to the occurrence of mutations causing segregation distortion. We notice that at least one of these genes (TSARG1) is involved in apoptosis during spermatogenesis. Apoptosis of germ cells is conspicuous during normal spermatogenesis, eliminating up to 75% of the potential spermazoa [33–35], affecting cells both before and after the meiotic division [36]. It has been hypothesized that the main cause for the high rate of apoptosis during spermatogenesis is to maintain a proper cell-number ratio between maturing germ cells and Sertoli cells [35]. The natural process of elimination of germ cells by apoptosis creates a genomic conflict in which each individual germ cell will benefit from avoiding apoptosis, but apoptosis of a certain fraction of germ cells may be beneficial to the mature organism. New mutations occurring in cells during spermatogenesis, which reduces the probability of apoptosis, will be positively selected. This effect will be particularly strong for mutations in genes expressed after the meiotic division, potentially resulting in segregation distortion. A mutant with an even very small increase in the probability of escaping postmeiotic apoptosis will have a strong selective advantage. Compensatory mutations, reducing or eliminating the effect of the apoptosis avoidance mutation, may then later occur. These dynamics may lead to recurrent events of positive selection in genes affecting spermatogenesis apoptosis. The 40 genes in this study involved in inhibition of apoptosis show an excess of evidence for positive selection compared to other categories (p = 0.0047; see Table 2). Many of the genes showing most evidence for positive selection are known to be involved in either spermatogenesis, apoptosis, or both. For example, the apoptosis-related gene showing the strongest evidence for positive selection (DFFA) is an inhibitor of Fas-mediated apoptosis, which has been shown to be involved in apoptosis during spermatogenesis [36]. This may suggest that genomic conflict due to spermatogenesis apoptosis may be driving positive selection in many of the included genes. Cancer-Related Genes While we expected to find genes involved in olfaction, spermatogenesis, and immune defense among the 50 annotated genes showing the strongest evidence for positive selection, we were surprised to find a very large proportion of cancer-related genes, especially genes involved in tumor suppression, apoptosis, and cell cycle control. These genes include four putative tumor suppressors: HYAL3, DFFA, PEPP-2 (note that both HYAL3 and PEPP-2 also appear to be involved in spermatogenesis), and C16orf3, another gene associated with tumor progression (MMP26), and a gene with unknown function but high similarity to melanoma-associated antigens (FLJ32965). In addition, there are several genes involved in apoptosis (PPP1R15A, HSJ001348, TSARG1, and GZMH). Given that many of the genes have very little functional information, it is surprising to find such a large proportion of genes that may be related to tumor development and control. The factors causing positive selection on these genes are unknown, but genes important in tumor development and suppression may be positively selected due to other functional effects of the genes, particularly in immunity and defense or in spermatogenesis. Several of the genes involved in tumor suppression or progression show testis-specific expression, and models of genomic conflict may explain the presence of positive selection in these genes. It should be noted that there is no pattern of human-specific selection in these genes. The high number of nonsynonymous mutations in these genes is approximately evenly distributed between the human and the chimpanzee lineage (results not shown). PAML Analysis For each of the 50 genes, we searched public databases to find orthologous genes in other mammals. For 25 of the genes we were able to identify orthologs from mouse and rat, and for these 25 genes we estimated the dN /dS ratio of each lineage of the underlying phylogeny using PAML [37]. The dN /dS ratio was elevated (p if i is less than j. The polarity of the mutation was determined using the chimpanzee sequence as outgroup. Analysis of ascertainment bias. To assess the impact of the ascertainment scheme in the tests that contrast human polymorphism data to the human–chimp divergence, new datasets were simulated, using standard neutral coalescence simulations (e.g., [38]). Each simulated dataset generated one chimp sequence and 78 human sequences for each of the 13,731 genes. For each simulated gene, one human sequence was randomly chosen and compared to the chimp sequence using a chi-square statistic for the goodness-of-fit test of dN /dS = 1. The 50 genes with largest chi-square statistic among genes with dN /dS greater than one were selected for population genetic analysis. This scheme was repeated 1,000 times to investigate the effect of the ascertainment protocol of the 50 genes. The parameters of the simulations were estimated from the data, using the observed distribution of sequence lengths, and synonymous-site mutation rate and humans–chimp divergence time estimated from the concatenated data. The distribution of dN /dS ratios among genes was estimated assuming the dN /dS ratios follow a γ distribution among genes, keeping the synonymous rate constant among them. Power analysis. To analyze the power of the test for positive selection, we simulated pairs of sequences and performed likelihood ratio tests of H0: dN /dS equals one versus dN /dS is greater than one for each sequence pair. The simulations were done using the average value of synonymous sequence divergence observed in the data, while nonsynonymous divergence was varied. For more details regarding such simulations, see, e.g. [50]. PRF analysis. Assume nonlethal mutations enter a population of constant size 2N according to a Poisson process and are assigned to one of three categories: neutral (S = 0), positively selected with selection coefficient S +, and negatively selected with selection coefficient S –, according to probabilities p 0, p +, and p – (where p 0 + p + + p – = 1). Furthermore, assume mutations evolve independently. It follows from standard population genetic theory, the total law of probability, and the rules of conditional probability that the probability of an SNP being found at frequency i out of n chromosomes under this scheme [44] is where F(i,n,S) --> is given by The likelihood of observing counts x 1, x 2, . . ., xS where S is the total number of segregating sites out of n 1, n 2, …, nS chromosomes is, thus, The maximum likelihood value and the maximum likelihood parameter estimates can then be obtained by numerically maximizing this function with respect to the parameters. Likelihood ratio tests can be constructed by constraining certain of the parameters to take on particular values. For example, setting p 0 = 1 defines a model with no selected mutations. Likewise, setting p 0 + p – = 1 defines a model that allows negative selection, but no positive selection. This analysis assumes that mutations are independent. Because of linkage and the possibility of epistasis, the independence assumption may not be met by the data. However, a full analysis taking the correlation among SNPs into account is not computationally feasible. Fortunately, the average correlation is low between SNPs because they have been sampled among 50 genes distributed throughout the genome. The effect of the correlation among SNPs on this analysis should, therefore, be minimal. The maximum log likelihood value for the full model is –234.19. However, the maximum log likelihood values for models assuming only neutral mutations, or a single class of selected mutations, are –243.82 and –240.88, respectively. Under the independence assumption, both of these simpler models can be rejected against the model with three classes of mutations, using a likelihood ratio test (p = 0.0006 and p = 0.004). Supporting Information Dataset S1 Results File (3.1 MB XLS). Click here for additional data file. Dataset S2 Alignment File (9.8 MB ZIP). Click here for additional data file. Accession Numbers The sequence analyzed in this study has been submitted to GenBank (http://www.ncbi.nlm.nih.gov/Genbank/).
                Bookmark

                Author and article information

                Journal
                9216904
                2419
                Nat Genet
                Nat. Genet.
                Nature genetics
                1061-4036
                1546-1718
                25 April 2014
                13 April 2014
                May 2014
                01 November 2014
                : 46
                : 5
                : 430-437
                Affiliations
                [1 ]Bioinformatics Research Center and Department of Statistics, North Carolina State University, Raleigh, NC, USA
                [2 ]Department of Biological Sciences, North Carolina State University, Raleigh, NC, USA
                [3 ]Department of Genetics, University of North Carolina at Chapel Hill, NC, USA
                [4 ]Department of Genetics, Rutgers University, New Brunswick, NJ, USA
                [5 ]Department of Biostatistics, University of North Carolina at Chapel Hill, NC, USA
                [6 ]Department of Psychiatry, VU Medical Center, Amsterdam, Netherlands
                [7 ]Department of Biological Psychology, VU University, Amsterdam, Netherlands
                [8 ]Department of Computer Science, University of North Carolina at Chapel Hill, NC
                [9 ]Environmental and Occupational Health Sciences Institute, Rutgers University, New Brunswick, NJ, USA
                [10 ]Department of Pharmacotherapy & Outcomes Science, Virginia Commonwealth University, Richmond, VA, USA
                [11 ]Department of Computer Science, University of California, Los Angeles, USA
                Author notes
                Correspondence should be addressed to F.A.W. ( fred_wright@ 123456ncsu.edu ) or P.F.S ( pfsulliv@ 123456email.unc.edu )
                [12]

                These authors contributed equally to this work.

                Article
                NIHMS576016
                10.1038/ng.2951
                4012342
                24728292
                c195b6e3-555c-478d-803e-416a1fa721a0
                History
                Categories
                Article

                Genetics
                gene expression,peripheral blood,twin study,heritability,expression quantitative trait loci,eqtl

                Comments

                Comment on this article