Blog
About

29
views
0
recommends
+1 Recommend
0 collections
    0
    shares
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      Genome-wide association study identifies novel susceptibility loci for cutaneous squamous cell carcinoma

      Read this article at

      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          Cutaneous squamous cell carcinoma represents the second most common cutaneous malignancy, affecting 7–11% of Caucasians in the United States. The genetic determinants of susceptibility to cutaneous squamous cell carcinoma remain largely unknown. Here we report the results of a two-stage genome-wide association study of cutaneous squamous cell carcinoma, totalling 7,404 cases and 292,076 controls. Eleven loci reached genome-wide significance ( P<5 × 10 −8) including seven previously confirmed pigmentation-related loci: MC1R, ASIP, TYR, SLC45A2, OCA2, IRF4 and BNC2. We identify an additional four susceptibility loci: 11q23.3 CADM1, a metastasis suppressor gene involved in modifying tumour interaction with cell-mediated immunity; 2p22.3; 7p21.1 AHR, the dioxin receptor involved in anti-apoptotic pathways and melanoma progression; and 9q34.3 SEC16A, a putative oncogene with roles in secretion and cellular proliferation. These susceptibility loci provide deeper insight into the pathogenesis of squamous cell carcinoma.

          Abstract

          Cutaneous squamous cell carcinoma is the second most common type of skin cancer. In this genome-wide association study, which includes over 7,000 cases, the authors identify 4 new susceptibility loci for this cancer and also provide independent replication of 9 previously reported susceptibility loci.

          Related collections

          Most cited references 35

          • Record: found
          • Abstract: found
          • Article: not found

          Mapping the Genetic Architecture of Gene Expression in Human Liver

          Introduction Recent large-scale, genome-wide association studies have now delivered a number of novel findings across a diversity of diseases, including age-related macular degeneration [1–3], heart disease [4,5], host control of HIV-1 [6], type I and II diabetes [7,8], and obesity [9]. However, despite this astonishing rate of success, the major challenge still remains to not only confirm that the genes implicated in these studies are truly the genes conferring protection from or risk of disease, but to elucidate the functional roles that these implicated genes play with respect to disease. Most of the genetic association studies reporting novel, highly replicated associations to disease traits do not provide experimental data supporting the putative functional roles a given candidate susceptibility gene may play in disease onset or progression. Even in cases where susceptibility genes are well studied, with well known functions, nailing down how these genes confer disease susceptibility can linger for years, or even decades, as has been the case for genes like ApoE, an Alzheimer disease susceptibility gene identified more than 15 years ago [10]. Complex networks of molecular phenotypes—gene expression (mRNA, ncRNA, miRNA, and so on), protein expression, protein state, and metabolite levels—respond more proximally to DNA variations that lead to variations in disease-associated traits. These intermediate phenotypes respond to variations in DNA that in turn can induce changes in disease associated traits. Because a majority of single nucleotide polymorphisms (SNPs) detected as associated with disease traits from the recent wave of genome-wide association studies (GWASs) do not appear to affect protein sequence, it is likely that these SNPs either regulate gene activity at the transcript level directly or link to other DNA variations involved in this type of regulatory role. Therefore, to uncover the genetic determinants affecting expression in a metabolically active tissue that is relevant to the study of obesity, diabetes, atherosclerosis, and other common human diseases, we profiled 427 human liver samples on a comprehensive gene expression microarray targeting more than 39,000 transcripts, and we genotyped DNA from each of these samples at 782,476 unique SNPs. The relatively large sample size of this study and the large number of SNPs genotyped provided the means to assess the relationship between genetic variants and gene expression with more statistical power than many previous studies allowed [11–13]. A comprehensive analysis of the liver gene expression traits revealed that thousands of these traits are under the control of well-defined genetic loci, with many of the genes having already been implicated in a number of human diseases. Here we demonstrate directly how integrating genotypic and expression data in mouse and human can provide much-needed functional support for candidate susceptibility genes identified in a growing number of genetic loci that have been identified as key drivers of disease from GWASs. Specifically, we highlight how the gene RPS26 and not ERBB3 is most strongly supported by our data as a susceptibility gene for a novel type 1 diabetes (T1D) locus that was recently identified in a large-scale GWAS [14] and subsequently extensively replicated in a number of cohorts [15]. We also identify SORT1 and CELSR2 as candidate susceptibility genes for a locus recently associated with coronary artery disease [16] and plasma low-density lipoprotein (LDL)-cholesterol levels [17,18]. Results To characterize the genetic architecture of gene expression in human liver, we compiled a tissue-specific human liver cohort (HLC), which comprised 427 Caucasian subjects (Table S1). DNA and RNA were isolated from all liver tissue samples. Each RNA sample was profiled on a custom Agilent 44,000 feature microarray composed of 39,280 oligonucleotide probes targeting transcripts representing 34,266 known and predicted genes, including high-confidence, noncoding RNA sequences. Each DNA sample was genotyped on the Affymetrix 500K SNP and Illumina 650Y SNP genotyping arrays. Analysis was restricted to those SNPs that had a genotyping call rate greater than 75%, a minor allele frequency greater than 4%, and that did not deviate significantly from Hardy-Weinberg equilibrium in the HLC. A total of 310,744 and 557,240 SNPs met these criteria from the Affymetrix and Illumina sets, respectively, resulting in a set of 782,476 unique SNPs (85,508 SNPs were in the intersection), referred to here as the analysis SNP set. Genome-Wide Screen for Putative cis- and trans-Acting Expression Quantitative Trait Loci To identify expression quantitative trait loci (eQTL) that have putative cis and trans [19] regulatory effects on the liver gene expression traits, we tested all expression traits for association with each of the SNPs in the analysis SNP set typed in the HLC. The strongest putative cis eQTL for a given expression trait was defined as the SNP most strongly associated with the expression trait over all of the SNPs typed within 1 megabase (Mb) of the transcription start or stop of the corresponding structural gene. The association p-values were adjusted to control for testing of multiple SNPs and expression traits using two different methods: (1) a highly conservative Bonferroni correction method to constrain the study-wise significance level, and (2) an empirical false discovery rate (FDR) method that constrains the overall rate of false positive events. For cis eQTL, we only test for associations to SNPs that are within 1 Mb of the annotated start or stop site of the corresponding structural gene. To achieve a study-wise significance level of 0.05, the Bonferroni adjusted p-value threshold was computed as , where Ni denotes the number of SNPs tested for trait i, over all 39,280 expression traits tested. At this threshold, 1,350 expression traits corresponding to 1,273 genes were identified. The Bonferroni adjustment method can be conservative when there is dependence among the expression traits and among the SNP genotypes. Given that strong correlation structures exist among expression traits and among SNP genotypes in a givenlinkage disequilibrium (LD) block, the Bonferroni adjustment may be overly conservative. Therefore, we used an empirical FDR method based on permutations that accounts for the correlation structures among the expression traits and among the SNP genotypes. We constrained the empirically determined FDR to be less than 10% (see Methods). At this level, we identified 3,210 expression traits corresponding to 3,043 genes that were significantly associated with at least one SNP near the corresponding gene region (referred to here as a putative cis eQTL). The full list of association results are provided in Table S2. The magnitude of the effects ranged from SNPs that explained roughly 2% of the in vivo expression variation (p ∼ 0.003) to those that explained roughly 90% of the expression variation (p 100 kb away. That is, greater than 30% of all cis eSNPs fall greater than 100 kb away from the transcription start and stop sites of the corresponding gene. Therefore, at least for expression traits, the nearest SNP rule for inferring genes given an association finding would result in an unacceptably high miss-call rate. Genes with expression values that are strongly associated with variations in DNA provide a different path to elucidate the gene or genes and their respective functions underlying genetic loci associated with disease in a more objective fashion. Identifying candidate susceptibility genes for T1D. In one of the largest GWASs carried out to date, the Wellcome Trust Case Control Consortium (WTCCC) studied 14,000 cases and 3,000 shared controls with respect to seven common diseases [14]. T1D was one of the key disease focuses of this study, with a number of replications reported simultaneously in a separate follow-up study [15]. In addition, a number of T1D susceptibility genes identified prior to the WTCCC study have been identified and more thoroughly replicated, including the HLA class II genes INS, CD25, CTLA4, PTPN22, and IFIH1. Given that the SNPs genotyped in the WTCCC study were also genotyped in the HLC, we examined the extent to which the T1D SNPs identified in the WTCCC study were associated with the expression traits corresponding to the genes implicated in the study. Table 1 highlights nine genes previously identified as T1D susceptibility genes or inferred as T1D susceptibility genes from the WTCCC study. The expression levels for five of these genes (CTLA4, HLA-DRB1, IL2RA, LONRF2, and CHST10) in the HLC were associated with the corresponding T1D-associated SNP. For IL2RA and the four other genes (AFF3, ADAD1, PTPN2, and IL2), the expression levels were associated with other SNPs in the region of the T1D-associated SNPs. We also examined whether other genes in the vicinity of the T1D-associated SNPs had expression levels that were also associated with these SNPs. An additional four genes highlighted in Table 1 (RPS26, CLECL1, IGF2AS, and Hct1837134) were identified in this way, in addition to two HLA class II genes highlighted in Table 2 (HLA-DQB1 and HLA-DQA2). Given the role that HLA class II genes are known to play in T1D, we also examined the 14 HLA class II gene expression traits represented on the array used in this study, and we found that 11 of them gave rise to significant genetic associations (Table 2). Some of the associations were striking and highlight additional SNPs that may be of interest in genetic disease association studies. For example, greater than 50% of the HLA-DRB5 expression variation observed in the HLC could be explained by a single cis eSNP (rs9271366). Table 1 Expression Traits Corresponding to Genes Implicated in the T1D WTCCC Study [14,15] or Close to Genes Associated with Either SNPs That Were Associated with T1D in the WTCCC Study or with SNPs Close the T1D-Associated SNPs Table 2 Significant Associations Detected in the HLC for 11 of the 14 HLA Class II Gene Expression Traits Represented on the Microarray Used in This Study The absence of an association between a T1D-associated SNP and the HLC expression values corresponding to a candidate susceptibility gene for that SNP cannot be taken as strong evidence against the gene's candidacy as a susceptibility gene. The underlying causal change in DNA may not affect expression levels of the gene in question, or the variation in expression may be specific to a given tissue not profiled or to conditions not reflected in the HLC. However, strong associations between T1D-associated SNPs and expression levels of genes near the SNP provide direct functional support for a gene's involvement in disease susceptibility. For example, rs3764021 was identified as a T1D susceptibility locus in the WTCCC study and then extensively replicated [15]. CLEC2D was inferred as the most likely susceptibility gene at this locus. However, CLEC2D expression in the HLC data was not associated with this SNP; but a flanking gene, CLECL1 was significantly associated (p = 5.78 × 10−17; Table 1). Given that CLEC2D and CLECL1 are in the same gene family, the strong association between the T1D SNP and CLECL1 expression data suggest that CLECL1 may be a better candidate susceptibility gene to examine. In cases where disease-associated traits and expression traits are scored in the same cohort, there is the potential to directly infer causal relationships between genes and disease [25]. However, even without disease trait data in tissue-specific cohorts like the HLC, an integrative genomics approach can be used to identify the most likely candidate susceptibility gene for a given locus. For example, one of the more novel regions associated with T1D from the WTCCC study was Chromosomes 12q13 (rs2292239). ERBB3, a receptor tyrosine-protein kinase with a presumed role in immune signaling, was identified as the most plausible susceptibility gene at this locus. While ERRB3 expression in the HLC was not associated with this SNP, the expression of a flanking gene, RPS26, was significantly associated with this SNP (p = 4.03 × 10−22; Table 1). In fact, 40% of the in vivo expression variation for RPS26 in the HLC was explained by this single T1D associated SNP, and this SNP was the most strongly associated with RPS26 expression out of the greater than 800,000 SNPs genotyped in the HLC, . The association to RPS26 expression suggests that this gene warrants further study in the context of T1D. However, these data on their own are still far from conclusive, given there may be DNA variants that affect RPS26 expression independently of T1D, but where these variants are in strong LD with the DNA variants explaining the T1D susceptibility. Therefore, to further explore the role RPS26 and ERBB3 may play in T1D, we examined the expression data for these genes in an expression atlas for human, monkey, and mouse, where for each species, between 45 and 60 tissue samples were profiled [26,27]. Although both genes are expressed in mouse, monkey, and human tissues, the expression of RPS26 is >1–2 log units higher in the pancreas and islets of Langerhan compared to ERBB3 (Figure S1), with ERBB3 observed as lowly expressed in islets as measured in the mouse body atlas. Given the central role that pancreas and islets play in T1D, these results further suggest RPS26 as a candidate susceptibility gene for T1D. What the genetic association and atlas data lack is a more refined context within which to assess the functional role a given gene plays in a system. We have previously described a method to reconstruct probabilistic, causal networks by integrating genetic and gene expression data [25,28–30]. Examining candidate susceptibility genes in the context of these networks can provide insights into the pathways in which they operate. We constructed whole-gene networks from three F2 intercross populations constructed from the B6, C3H, and CAST strains (see Methods for details). Liver and adipose expression data were generated from these populations and integrated with the genotypic data also generated in these populations to reconstruct the networks as previously described [28,30]. We then examined RPS26 and ERBB3 in the context of these networks (Figure 1). Figure 1 Local Networks for Rps26 and Erbb3 Derived from Causal, Probabilistic Whole-Gene Networks Constructed from the Liver, Adipose, Muscle, and Brain Gene Expression Data Generated from the BXH/wt and BXC Mouse Crosses (A) The Rps26 subnetwork includes a number of known T1D associated genes (green nodes), and RPS26 in this subnetwork is directly linked to H2-Eb1, a mouse ortholog of HLA-DRB1, a previously identified T1D susceptibility gene that is also strongly associated with a cis eSNP in the HLC (Table 2). The known T1D genes annotated by the Gene Ontology are significantly enriched in this subnetwork (Table 3). (B) The Erbb3 subnetwork is not associated with any pathways known or predicted to be involved in T1D. Figure 1A highlights how RPS26 is directly connected to a number of known T1D genes. For example, RPS26 is directly connected to a mouse ortholog of HLA-DRB1, a gene previously associated with T1D and highlighted in this present study as having liver expression values that are strongly associated with a highly replicated T1D SNP (Table 1). In fact, the genes comprising the local network structure around RPS26 are enriched for genes annotated as T1D genes, in addition to being enriched for genes operating in a number of pathways commonly associated with T1D (Table 3). On the other hand, whereas ERBB3 also resided in the context of a well defined subnetwork (Figure 1B), the genes comprising this subnetwork were not enriched for any T1D associated pathways. Table 3 GO Biological Process Categories Enriched in the RPS26 Subnetwork Depicted in Figure 1A Identifying candidate susceptibility genes for coronary artery disease and LDL cholesterol levels. Another GWAS involving the WTCCC resulted in the identification of seven loci associated with coronary artery disease (CAD) [16]. The seven top-hitting SNPs associated with CAD at each of the seven loci in this study were represented on the Affymetrix 500K array. Therefore, we examined the HLC data to identify expression traits that were significantly associated with any of the seven CAD-associated SNPs. Given the roughly 40,000 expression traits examined at each of the seven SNPs (280,000 tests in all), we set a nominal p-value threshold of 0.05/280,000 = 1.79 × 10−7 for significance. Only one of the seven SNPs identified in the WTCCC CAD study, rs599839 on Chromosome 1p13.3, was significantly associated with any of the HLC expression traits (Figure 2). Four different expression traits were identified as significantly associated with rs599839 (Table 4). One of the four expression traits corresponded to a gene, PSRC1, that had been identified as a candidate susceptibility gene in the WTCCC CAD study [16]. Figure 2 PSRC1, CELSR2, and SORT1 Liver Expression Is Associated with a CAD Risk Allele and Plasma LDL Cholesterol Levels The CAD risk allele for SNP rs599839 was established in a previous WTCCC study [16] (lilac panel). In the HLC, this same SNP is strongly associated with PSRC1, CELSR2, and SORT1 expression, with the CAD risk allele associated with lower relative expression (pink panel). In the BXH/wt cross designed to study metabolic traits that increase cardiovascular risk (green panel), all three of these expression traits were strongly correlated with plasma LDL cholesterol levels, a major CAD risk factor (scatter plots associated with the green panel). Given the association of these genes to plasma LDL-cholesterol levels, we examined whether rs599839 was associated with LDL cholesterol in a previously published GWAS [35] and found this SNP was significantly associated with LDL cholesterol levels, where the CAD risk allele was associated with higher LDL cholesterol levels in this cohort. Lower levels of CELSR2 and SORT1 expression were associated with the risk allele in humans, and with higher LDL cholesterol levels in mouse, making them ideal candidate susceptibility genes for the CAD and LDL cholesterol associations to this locus. On the other hand, lower levels of PSRC1 expression were associated with the risk allele in humans, but with lower LDL cholesterol levels in mouse, suggesting that PSRC1 is not the gene increasing CAD risk, but instead may be acting to protect against it. Table 4 Significant Associations Detected between Liver Expression Traits in the HLC and the CAD-Associated SNP, rs599839, on Chromosome 1p13.3 To further characterize the association of these four expression traits with CAD-associated traits, we examined the activity of these genes in the BXH/wt cross (see Methods for details), a cross designed specifically to study metabolic traits that increase risk of cardiovascular disease. The liver expression levels of Psrc1, Sort1, and Celsr2, but not Sypl2, in the BXH/wt cross were significantly associated with plasma LDL cholesterol levels (Table 4), a major CAD risk factor. However, while Psrc1 expression levels were positively correlated with plasma LDL cholesterol levels, Sort1 and Celsr2 expression levels were negatively correlated. In addition, for liver expression traits in the BXH/wt cross significantly correlated with these three genes, the Sort1 and Celsr2 correlation signatures were most significantly enriched for the GO Biological Process category “cell surface receptor linked signal transduction” (1.4-fold enrichment, p = 1.91 × 10−5, and 1.6-fold enrichment, p = 4.57 × 10−10, for Sort1 and Celsr2, respectively), while the Psrc1 correlation signature was most enriched for the “cell cycle” category (3-fold enrichment, p = 0.00044), suggesting that Sort1 and Celsr2 may be involved in similar biological processes that are distinct from processes involving Psrc1. To further elucidate the involvement of these genes in metabolic phenotypes associated with CAD, we examined Psrc1, Celsr2, and Sort1 in the context of the probabilistic, causal network constructed as described above for the Erbb3/Rps26 example. All three genes not only fell in the same subnetwork, they were all directly connected to the same gene, 2010200O16Rik, demonstrating that these genes are tightly co-regulated, possibly driven by common regulatory factors (Figure 3A). This same subnetwork also included genes like Tgfbr2, Pparg, Lpl, Ppm1l, and Alox5ap, all of which have been previously identified and validated as being associated with traits related to obesity, diabetes, cholesterol levels, and cardiovascular disease [25,31–33]. More generally, Psrc1 and Sort1 participate in a previously defined macrophage-enriched metabolic (MEM) subnetwork validated as causal for obesity-, diabetes-, and atherosclerosis-related traits [34]. In fact, the subnetwork depicted in Figure 3A is composed of 1,346 genes, with 226 of these genes overlapping the set of 1,406 genes composing the MEM subnetwork (82 would have been expected by chance). This 2.76-fold enrichment in this case is highly significant, with a Fisher exact test p = 8.20 × 10−47. Figure 3 Local Networks for PSRC1, CELSR2, and SORT1 Derived from Causal, Probabilistic Whole-Gene Networks in Mouse and Human (A) Mouse network for Psrc1, Celsr2, and Sort1 derived from the liver, adipose, muscle, and brain gene expression data generated from the BXH/wt and BXC mouse crosses. (B) Human network for PSRC1, CELSR2, and SORT1 derived from the HLC and from a previously published adipose and blood tissue cohort [21]. To establish whether PSRC1, CELSR2, and SORT1 are closely connected in human transcriptional networks as they are in mouse, we constructed a probabilistic, causal network from the HLC and from a previously published adipose and blood tissue cohort [21], using previously described methods [25,28–30]. As depicted in Figure 3B, PSRC1, CELSR2, and SORT1 fall in the same subnetwork and are closely connected, as in the mouse network. In addition, the genes comprising this human subnetwork are enriched for genes that fall in the mouse network depicted in Figure 3A (Fisher exact test p = 1.78 × 10−8). Further, the human subnetwork is also enriched for genes falling in the MEM module (Fisher exact test p = 5.03 × 10−8), confirming the association to metabolic phenotypes detected in the mouse network. These data combined suggest that PSRC1, CELSR2, and SORT1 operate in a conserved subnetwork causally associated with cholesterol levels, obesity, diabetes and atherosclerosis. Given the strong association between plasma LDL cholesterol levels and the expression of Psrc1, Sort1, and Celsr2 expression in the BXH/wt cross, we examined a recent GWAS available in the public domain in which LDL cholesterol levels were monitored [35]. A significant association was detected between rs599839 genotypes and LDL cholesterol levels in this human cohort (p = 9.0 × 10−8)[35]. Interestingly, the common allele for rs599839 was associated with higher LDL cholesterol levels [35], consistent with the association of this allele with increased CAD risk. Low SORT1, CELSR2, and PSRC1 expression levels in the HLC are also associated with the rs599839 common allele. However, given low Sort1 and Celsr2 expression levels in the BXH/wt cross are associated with increased LDL cholesterol levels (whereas low Psrc1 expression levels are associated with low LDL cholesterol levels), SORT1 and CELSR2 are the most logical candidate susceptibility gene in the 1p13.3 locus (Figure 2), although direct experimental manipulation of these two genes would be required to provide more direct functional support that these genes are involved in modulating LDL cholesterol levels. The association of this locus with LDL cholesterol levels as well as liver expression levels of SORT1, CELSR2, and PSRC1 were recently reported in multiple independent studies [17,18]. Discussion Previous studies on the genetics of gene expression in humans have focused primarily on lymphoblastoid cell lines or other blood-derived samples [13,14,17]. We have provided a large-scale assessment of the genetics of gene expression in human liver, a metabolically active tissue that is critical to a number of core biological processes and that plays a role in a number of common human diseases. After profiling 427 human liver samples on a comprehensive gene expression microarray and genotyping the DNA from these samples at greater than one million SNPs, we identified a significant genetic signature underlying the expression of more than 6,000 genes, with many of these genes already implicated as causal for a number of different diseases, including heart disease, breast cancer, inflammatory bowel disease, age-related macular degeneration, schizophrenia, and Alzheimer disease. This set of data highlights the utility of monitoring molecular phenotypes that underlie the higher order clinical states of a system. Whereas the eQTL data in the human liver cohort is valuable in its own right, when integrated with other GWAS data and with genetics of gene expression and clinical data in segregating mouse populations, there is the potential to directly identify experimentally supported candidate susceptibility genes for disease. We demonstrated directly how genetics of gene expression data can complement multiple GWAS datasets by highlighting SORT1 and CELSR2 as candidate susceptibility genes for CAD and LDL cholesterol levels at a recently identified locus associated with CAD [16]. In this instance, the association to LDL cholesterol levels is novel and based on publicly available GWAS data and a mouse cross designed specifically to study lipid and other metabolic syndrome traits. In addition to the CAD locus, we highlighted RPS26 as a candidate susceptibility gene for T1D from a novel, highly replicated T1D locus on Chromosome 12q13, which was identified in a separate GWAS [15]. Not only was the expression of this gene in the HLC strongly associated with the T1D SNP at this locus, but it was observed to operate in a part of the molecular network that is significantly enriched for genes associated with T1D (like HLA-DRB1), whereas the gene inferred as the most likely susceptibility gene at that locus (ERBB3) [15] was not supported by any of our experimental data. Recent studies have demonstrated that ribosomal proteins may be involved in auto-immune diseases like systemic lupus erythematosus [36]. In addition, recent work has demonstrated a connection between endoplasmic reticulum (ER) stress in the cytoplasm and diabetes, where protein unfolding in response to ER stress is hypothesized to disrupt processes associated with diabetes [37]. Given RPS26's protein translation role as part of the ribosomal complex on the ER, its association to T1D is particularly intriguing. The unfolded protein response has also been linked to inflammation and oxidative stress [38], hence the putative connection between RPS26 and an auto-immune disease like T1D is worthy of further consideration. Cells with high secretory capacity like pancreatic beta cells are also more likely to be susceptible to ER stress, making the link between RPS26 and T1D even more plausible. In fact, previous work has indicated higher ER stress levels in T1D patients [39]. It is important to note that a lack of association between expression traits in the HLC and disease-associated SNPs is not a valid filter for excluding a gene as a candidate disease susceptibility gene, given that variation in a gene leading to disease may affect protein function and not expression, or it may affect expression in a different tissue or under different environmental conditions. However, the approach of analyzing the genetics of gene expression in human populations does provide a more objective view into the functioning of genes in a given disease-associated region. This view has the potential to lead to higher confidence candidates in the absence of direct functional support for any one gene, which is typically the case in GWASs where the SNPs identified have no known functional role. Given the potential that genetics of gene expression studies have to affect our understanding of common human diseases, generating even larger-scale molecular profiling datasets in segregating populations may provide a path to more rapidly elucidating not only the genetic basis of disease, but the impact the genetic basis of disease has on molecular networks that in turn induce variations in disease associated traits. Materials and Methods HLC and tissue collection. The HLC was assembled from a total of 780 liver samples (1–2 g) that were acquired from Caucasian individuals from three independent liver collections at tissue resource centers at Vanderbilt University, the University of Pittsburgh, and Merck Research Laboratories (Table S1). The Vanderbilt samples (n = 504) included both postmortem tissue and surgical resections from organ donors and were obtained from the Nashville Regional Organ Procurement Agency (Nashville, Tennessee), the National Disease Research Interchange (Philadelphia, Pennsylvania), and the Cooperative Human Tissue Network (University of Pennsylvania, Ohio State University, and University of Alabama at Birmingham). The Pittsburgh samples were normal postmortem human liver and were obtained through the Liver Tissue Procurement and Distribution System (Dr. Stephen Strom, University of Pittsburgh, Pittsburgh, Pennsylvania). The University of Pittsburgh samples (n = 211) were all postmortem, as were the Merck samples (n = 65), which collected by the Drug Metabolism Department and reported previously [40]. All samples were stored frozen at −80 °C from collection until processing for RNA and DNA; some samples had been stored for over a decade before being processed for this study. Demographic data varied across centers for these samples and were missing in many cases. In cases where age, sex, or ethnicity data were not available in the patient records, we imputed it from the gene expression and/or genotype data (see below). Of the 780 samples collected, high-quality DNA was isolated on 548 samples, and 517 of these were successfully genotyped on the Affymetrix genotyping platform (see Methods below). Of the 517 successfully genotyped samples, high-quality RNA was isolated and successfully profiled on 427 samples. This set of 427 genotyped and expression profiled samples comprised the HLC. Table S1 gives a summary of the demographics and other annotations on the 427 individuals that were successfully genotyped and expression profiled. All counts and descriptive statistics include the imputed data. All samples and patient data were handled in accordance with the policies and procedures of the participating organizations. Mouse crosses and tissue collection. C57BL/6J (B6) mice were intercrossed with C3H/HeJ (C3H) mice to generate 321 F2 progeny (161 females, 160 males) for the BXH wild type (BXH/wt). C57BL/6J (B6) mice were intercrossed with Castaneus (CAST) mice to generate 442 F2 progeny (276 females, 166 males) for the BXC cross. All mice were maintained on a 12 h light–12 h dark cycle and fed ad libitum. BXH mice were fed Purina Chow (Ralston-Purina) containing 4% fat until 8 wk of age. From that time until the mice were killed at 20 wk, mice were fed a western diet (Teklad 88137, Harlan Teklad) containing 42% fat and 0.15% cholesterol. BXC mice were fed Purina Chow until 10 wk of age, and then fed western diet (Teklad 88137, Harlan Teklad) for the subsequent 8 wk. Mice were fasted overnight before they were killed. Their livers were collected, flash frozen in liquid nitrogen, and stored in −80 °C prior to RNA isolation. The BXH cross on an ApoE null background (BXH/apoE) was previously described [41]. Briefly, C57BL/6J ApoE null (B6.ApoE–/–) were purchased from Jackson Laboratory. C3H/HeJ ApoE null (C3H.Apo E–/–) were generated by backcrossing B6.ApoE–/– to C3H for ten generations. F1 mice were generated from reciprocal intercrossing between B6.ApoE–/– and C3H.ApoE–/–, and F2 mice were subsequently bred by intercrossing F1 mice. A total of 334 (169 female, 165 male) were bred, and all were fed Purina Chow containing 4% fat until 8 wk of age, and then transferred to western diet containing 42% fat and 0.15% cholesterol for 16 wk. Mice were killed at 24 wk, and liver, white adipose tissue, and whole brains were immediately collected and flash-frozen in liquid nitrogen. All procedures of housing and treatment of animals were performed in accordance with Institutional Animal Care and Use Committee regulations. Microarray design, RNA sample preparation, hybridization, and expression analysis. Array design and preparation of labeled cDNA and hybridizations to microarrays for the human liver cohort. RNA preparation and array hybridizations were performed at Rosetta Inpharmatics. The custom ink-jet microarrays used in this study were manufactured by Agilent Technologies and consisted of 4,720 control probes and 39,280 noncontrol oligonucleotides extracted from mouse Unigene clusters and combined with RefSeq sequences and RIKEN full-length cDNA clones (Table S4). Liver samples extracted from the 427 Caucasian individuals were homogenized, and total RNA extracted using TRIzol reagent (Invitrogen) according to manufacturer's protocol. Three micrograms of total RNA was reverse transcribed and labeled with either Cy3 or Cy5 fluorochrome. Purified Cy3 or Cy5 complementary RNA was hybridized to at least two single microarrays with fluor reversal for 24 h in a hybridization chamber, washed, and scanned using a laser confocal scanner. Arrays were quantified on the basis of spot intensity relative to background, adjusted for experimental variation between arrays using average intensity over multiple channels, and fitted to an error model to determine significance (type I error), as previously described [42]. Gene expression is reported as the mean-log ratio relative to the pool derived from 192 liver samples selected for sex balance from the Vanderbilt and Pittsburgh samples, because the RNA from the Merck samples had been amplified at an earlier date. The error model used to assess whether a given gene is significantly differentially expressed in a single sample relative to a pool composed of a randomly selected subset of samples has been extensively described and tested in a number of publications [42–44]. The age, sex, race, center, alcohol use, drug use, and steatosis variables presented in Table S1 were tested for association to the gene expression traits. Only age, sex, race, and center were significantly associated with the expression traits beyond what would be expected by chance. As a result, all gene expression traits were adjusted for these covariates. The lack of association between the expression traits and alcohol use, drug use, and steatosis was somewhat surprising, but may be due to the sparseness of these data, resulting in a lack of power to detect significant associations. Array design and preparation of labeled cDNA and hybridizations to microarrays for the mouse liver and adipose tissue samples. RNA preparation and array hybridizations were again performed at Rosetta Inpharmatics. The custom ink-jet microarrays used in the BXH/wt, BXH/apoE, and BXC crosses were manufactured by Agilent Technologies. The array used for the BXH/apoE and BXH/wt samples consisted of 2,186 control probes and 23,574 noncontrol oligonucleotides extracted from mouse Unigene clusters and combined with RefSeq sequences and RIKEN full-length cDNA clones (Table S5). The array used for the BXC cross consisted of 39,280 noncontrol oligonuceotides again extracted from the mouse Unigene clusters and combined with RefSeq sequences and RIKEN full-length cDNA clones (Table S6). Mouse adipose and liver tissues from all of the crosses were homogenized, and total RNA extracted using Trizol reagent (Invitrogen) according to manufacturer's protocol. Three micrograms of total RNA was reverse transcribed and labeled with either Cy3 or Cy5 fluorochrome. Labeled complementary RNA (cRNA) from each F2 animal was hybridized against a cross-specific pool of labeled cRNAs constructed from equal aliquots of RNA from 150 F2 animals and parental mouse strains for each of the three tissues for each cross. The hybridizations for the BXH/apoE cross were performed in fluor reversal for 24 h in a hybridization chamber, washed, and scanned using a confocal laser scanner. The hybridizations for the BXH/wt and BXC crosses were performed to single arrays (individuals F2 samples labeled with Cy5 and reference pools labeled with Cy3 fluorochromes) for 24 h in a hybridization chamber, washed, and again scanned using a confocal laser scanner. Arrays were quantified on the basis of spot intensity relative to background, adjusted for experimental variation between arrays using average intensity over multiple channels, and fitted to a previously described error model to determine significance (type I error) [42]. Gene expression measures are reported as the ratio of the mean log10 intensity (mlratio). DNA processing. DNA isolation. DNA isolation was performed at Rosetta Inpharmatics. DNeasy tissue kits from QIAGEN were used to carry out all DNA extractions. For each liver sample, 20–30 mg of liver was placed in a 1.5-ml microcentrifuge tube along with 80 μl buffer ATL and 20 μl proteinase K. The contents of each tube were then mixed thoroughly by vortexing, followed by incubation at 55 °C until the tissue was completely lysed. Transcriptionally active tissues such as liver and kidney contain high levels of RNA, which will co-purify with genomic DNA. Because RNA-free genomic DNA was required for processing, 4 μl RNase A (100 mg/ml) was added and mixed by vortexing, followed by incubation for 2 min at room temperature before continuing. Samples were then vortexed and 200 μl buffer AL was added to the sample and mixed thoroughly. After 10 min incubation at 70 °C, 200 μl ethanol (96%–100%) was then added and mixed again. The mixture was placed into the DNeasy Mini column and centrifuged at 6,000g (8,000 rpm) for 1 min. The DNeasy Mini spin column was then placed in a new 2-ml collection tube, and 500 μl buffer AW1 was added, followed by placement in a centrifuge for 1 min at 6,000g (8,000 rpm). The DNeasy Mini spin column was then placed in a new 2-ml collection tube again, and 500 μl buffer AW2 was added and centrifuged for 3 min at 20,000g (14,000 rpm) to dry the DNeasy membrane. Then the DNeasy Mini spin column was placed in a clean 1.5-ml or 2-ml microcentrifuge tube and 200 μl buffer AE was pipetted directly onto the DNeasy membrane. This was incubated at room temperature for 1 min and then centrifuged for 1 min at 6,000g (8,000 rpm) to elute. Two 200-μl elutions were performed followed by ethanol/sodium acetate precipitation and resuspension of the resultant pellet with TE buffer. Genotyping data from the Affymetrix 500K panel. SNP genotyping was performed with the commercial release of the Affymetrix 500K genotyping array. The genotyping was carried out at the Perlegen genotyping facility in Mountain View, California. Genotyping was attempted on 548 samples. 18 samples were unable to be genotyped because of poor DNA quality, and an additional 13 samples were removed after genotyping because their overall call rate did not exceed the 90% cutoff we required. We then applied SNP-wise quality checks on the 517 samples that were successfully genotyped. The Affymetrix 500K array consisted of 500,568 SNPs in total, 429,545 SNPs provided quality data from the genotyping assay, and we rejected those SNPs with a call rate 4%, and there was not significant deviation from Hardy-Weinberg equilibrium at the 0.0001 significance level. The sample set for analysis was restricted to the 427 HLC samples that had both genotype and gene expression data available, passed the criteria outlined above and those that were identified as Caucasian, or imputed to be Caucasian when data was missing (see below). Data preprocessing. Sex confirmation. Sex identifiers were available for most of the liver samples obtained from the three study centers. We independently confirmed the sex of each individual providing a liver sample by two methods. First, we looked for expression of Y-specific genes in the liver gene expression based on three probes representing three distinct transcripts. Second, we scored heterozygosity of X-chromosome markers. We excluded any individual for which there was a discrepancy in any of the three measures of sex in order to ensure a coherent data set for analysis and that we had excluded as many potential cases of annotation or sample-handling errors as possible. For samples where sex was not noted in the records, we imputed the sex call if both the genotype and gene-expression data were concordant. Ethnicity. Ethnicities were confirmed or imputed using STRUCTURE [45]. A panel of 106 autosomal markers was randomly selected from around the genome to be unlinked and ancestry informative. Markers were selected from the HapMap data [46] that were present on the Affy 500K panel such that the minor allele frequency was >0.05 and the absolute allele frequency difference in the Caucasians and African Americans ∼0.5, with average minor allele frequency 0.5 (standard deviation = 12). Several K were tested (K = 1–6) with burn-in 100,000 and 100,000 reps of MCMC before any information was collected. In all cases, the greatest support was for K = 2. Admixture was detected for some individuals in some runs and some individuals were reclassified. For those unknown and reclassified, population reassignment was made if the probability of group membership was >0.9 for that individual. This resulted in 469 individuals assigned to the Caucasian group, 28 individuals assigned to the African descent or African American group, and 18 individuals assigned as “unknown”. The data set for further analysis was restricted to Caucasian samples. Age. Ages were imputed using the Elastic Net method [47]. This method performs model selection and parameter estimation in a manner that is a combination of ridge-regression and the lasso. The prediction method is also explained in [47]. For computational reasons, λ was set to zero, in which case the Elastic Net method reduces to the lasso method. For most applications, experience demonstrates that the optimal value for λ is zero or quite near zero. Ages were imputed using separate models for each data source, due to evidence of a source effect, and each sex separately. In cases where the sex was missing or the reported sex was different from the sex implied by the expression data, the sex implied by the expression data was used. This was done so that in the case the annotation data and expression data were mismatched, the imputed age would correspond to the data used to predict it. The 5,000 genes with the highest correlation to age were used as potential regressors. Cross-validation was used to select the number of steps in the model selection procedure. The number of predictors in the model was between 67 and 76 for the four different models. The percentage of variation explained in the training set is quite high (97%–99%) for three of the models. For the fourth, the model for Vanderbilt females, the percentage of variation explained was slightly lower, 0.92. This is a vast improvement over more naïve imputation methods that are used when adjusting for covariates with missing data, where mean values of the nonmissing data are used to fill in the missing values. Very few of the predictors we constructed were common between the different models. Given the number of predictors with high correlation to age, this is not surprising. Nonetheless, within a given data source (i.e., Pittsburgh or Vanderbilt samples), the male model is a reasonable predictor for the ages of the females and vice-versa. This same trend did not hold for predicting the ages of same-sex individuals across data sources. Statistical and data visualization methods. Expression trait processing. Expression traits were adjusted for age, sex, and medical center. Residuals were computed using rlm function from R statistical package (M-estimation with Tukey's bisquare weights). In examining the distributions of the mean log ratio measures for each expression trait in the HLC set, we noted a high rate of outliers. As a result, we used robust residuals and nonparametric tests to carry out the association analyses in the HLC. For each expression trait, residual values deviating from the median by more than three robust standard deviations were filtered out as outliers. Genome-wide eQTL association analysis. The Kruskal-Wallis test was used to determine association between adjusted expression traits and genotypes. We chose this nonparametric method because of its robust nature to underlying genetic model and trait distribution. p-Values were computed using nag_mann_whitney (for loci with two observed genotypes) and nag_kruskal_wallis_test (for loci with three observed genotypes) routines from NAG C library (http://www.nag.co.uk). We used FDR for multiple-test correction. FDR was estimated as the ratio of the average number of eQTLs found in datasets with randomized sample labels to the number of eQTLs identified in the original data set. Since the number of tests was large (∼1,010), we found the empirical null distribution was very stable and three permutation runs were sufficient for convergence to estimate FDR. FDR computation was performed separately for cis (<1 Mb probe to SNP distance) and trans associations resulting in nominal p-value cutoffs of 5.0 × 10−5 and 1.0 × 10−8 for cis and trans eQTLs, respectively. Targeted set association analysis. The 3,346 SNPs identified in the first round of analysis as associating with expression traits in cis at an FDR < 0.1 were picked for a second round of analysis. To assess the significance of the resulting set of expression traits detected as associated with this set of SNPs, sets of randomly selected SNPs of size 3,346 with MAF distributions identical to the original set were generated. All sets of SNPS were then analyzed using the same method described above for genome-wide associations. Identifying differentially expressed genes. To assess whether a gene in a given sample was differentially expressed, we used a previously described and validated error model for testing whether the mean log ratio of the intensity measures between the experiment and reference channels was significantly different from zero [42,43,48]. Based on this error model we obtained p-values for each of the individual gene expression measures in each sample as previously described [33]. We then computed the standard deviation of –log10 of the p-value for each gene expression measure over all samples profiled for a given tissue, and then rank ordered all of the genes profiled in each tissue based on this standard deviation value (rank ordered in descending order). Genes that fall at the top of this rank ordered list can be considered as the most differentially expressed or variable genes in the study. We have previously shown that this type of ordering approach well captures the most active genes in a set of samples [33]. For demonstrating the number of genome-wide significant eQTLs and eSNPs as a function of differential gene expression, we binned the expression traits into quartiles (Q1-Q4) based on the rank-ordered gene list, with each bin containing 10,025 genes and the bins increasing in significance with respect to differential expression, from Q1 to Q4. Visualization of networks. Networks were visualized using the Target Gene Information (TGI) Network Analysis and Visualization (NAV) desktop application developed at Rosetta Inpharmatics. This tool enables rapid, real-time, graphical analysis of pathway network models built from a comprehensive and fully integrated set of public and proprietary interaction databases available through a back-end central database, described in detail in a separate report. Additionally, the TGI NAV tool supports experimentally generated systems biology data such as the statistical associations and causal relationships described here. TGI NAV enables integration and visualization of orthogonal data sets using network models as a framework and facilitates dissection of networks into smaller, functionally significant subnetworks amenable to biological interpretation. To construct the local networks for H2-Eb1, Erbb3, and Rps26, the whole-gene probabilistic causal networks were loaded into the database and the TGI NAV tool was used to extract all edges from this network involving the central gene of interest. In the case of the Erbb3 network, the local network was expanded by extracting all additional edges involving any genes directly connected to Erbb3. Note that while the underlying networks describe causal relationships between transcripts, TGI NAV was used to translate this network into the space of genes using an integrated mapping database that clusters transcripts into gene models utilizing their genomic coordinates. As a result, multiple causal relationships between gene pairs can be observed in cases where multiple transcripts for a single gene were profiled. Visualization properties of nodes (e.g., color) are specified in TGI NAV either for individual nodes, or in a data-driven manner by associating attributes, such as KEGG pathway membership, with groups of nodes and mapping visualization properties to these attributes. Supporting Information Figure S1 Atlas of Gene Expression for Rps26 and Erbb3 For all panels, the horizontal bar for each row represents the mean expression value and the horizontal line indicating the standard deviation. The red arrow off to the left highlights the pancreas tissue. (A) Expression levels of Rps26 in 60 murine tissues and cell lines. The tissues and cell lines are given along the y-axis, and the mean relative transcript abundances are given along the x-axis. (B) Expression levels of Erbb3 in 60 murine tissues and cell lines. The tissues and cell lines are given along the y-axis, and the mean relative transcript abundances are given along the x-axis. (C) Expression levels of Rps26 in 46 monkey tissues and cell lines. The tissues and cell lines are given along the y-axis, and the mean relative transcript abundances are given along the x-axis. (D) Expression levels of Erbb3 in 46 monkey tissues and cell lines. The tissues and cell lines are given along the y-axis, and the mean relative transcript abundances are given along the x-axis. (E) Expression levels of RPS26 in 50 human tissues and cell lines. The tissues and cell lines are given along the y-axis, and the mean relative transcript abundances as determined by each of six individual reporters on the microarray that target RPS26 are given along the x-axis. (F) Expression levels of Erbb3 in 50 human tissues and cell lines. The tissues and cell lines are given along the y-axis, and the mean relative transcript abundances are given along the x-axis. (441 KB PDF) Click here for additional data file. Table S1 Population Demographics of the HLC (81 KB XLS) Click here for additional data file. Table S2 Association Results for HLC Expression and Genotyping Data (1.64 MB XLS) Click here for additional data file. Table S3 Expression Traits Corresponding to Genes Associated with Human Diseases Are under Significant Genetic Control in the HLC (172 KB DOC) Click here for additional data file. Table S4 Genes Represented on the HLC Microarray Described in the Main Text (7.97 MB XLS) Click here for additional data file. Table S5 Genes Represented on the BXH/apoE Microarray Described in the Main Text (9.33 MB XLS) Click here for additional data file. Table S6 Genes Represented on the BXH/wt and BXC Microarray Described in the Main Text (6.58 MB XLS) Click here for additional data file. Accession Numbers All microarray data associated with the HLC have been deposited into the Gene Expression Ominbus database under accession number GSE9588.
            Bookmark
            • Record: found
            • Abstract: found
            • Article: found
            Is Open Access

            Web-Based, Participant-Driven Studies Yield Novel Genetic Associations for Common Traits

            Introduction Many common human traits have long been understood to have a genetic basis, yet in only a few cases have influential genes been identified. Even pigmentation, for which almost 40 years ago Cavalli-Sforza [1] estimated that there were four genes underlying variation, has yielded associations only recently [2]–[5] (showing that pigmentation is rather polygenic [6]). We have conducted, within a novel, web-based research framework, genome-wide association studies of 22 common human traits. These traits were selected based on indications of heritability or a simple mode of inheritance, ease of phenotype data collection via web-based self-report, and broad interest. Data for these studies was collected within a research framework wherein research participants, derived from the customer base of 23andMe, Inc., a direct-to-consumer genetic information company, consented to the use of their data for research and were provided with access to their personal genetic information (Figure 1). They were then given the option of contributing phenotype data via a series of web-based surveys. The result is a single, continually expanding cohort, containing a self-selected set of individuals who participate in multiple studies in parallel. 10.1371/journal.pgen.1000993.g001 Figure 1 Web-based accrual of genotype and phenotype data via a personal genetic information service. (A) Individual participant signs up for service via web; (B) Participant receives sample collection kit; (C) Participant consents, via web, to use of anonymized genotype and survey responses for research; (D) Participant sends saliva sample to contracted lab; (E) Participant logs on to service web site with option to respond to one or more surveys, prior to having access to personal genetic data; (F) Laboratory processes sample, generating data for single nucleotide polymorphisms (SNPs); (G) Encrypted genotype data transferred from laboratory to secure server; (H) Participant logs on to service web site with option to access personal genetic data (both raw genotypes and customized reports); (I) Participant logs on and has the option to respond to one or more surveys; (J) New genetic reports posted, new surveys posted; (K) Genotype data and survey responses for individuals, coded and stripped of individually identifying information, are transferred to research team. Shaded boxes indicate participant actions; clear boxes indicate lab or service actions. Dashed boxes indicate optional participant actions within framework of service access. The parallel and continual nature of this research framework facilitates the rapid recruitment of participants to many studies at once. Furthermore, the presentation of interpreted genetic data to the participants creates incentive for them to return to the website, lowering the marginal cost of recontacting for additional analyses. The participant-driven nature of this study design and the resulting heterogeneity of the data sets require that care be taken to eliminate population stratification and other possible sources of bias. However, the challenges of eliminating such stratification and biases are balanced by the continuous accrual of new data as participants sign up and respond to new surveys. From the initial set of surveys released, we report results on the 22 traits meeting our sample size criteria (over 1500 unrelated northern European respondents with the additional requirement of at least 500 cases for binary traits). The phenotypes that met these criteria are described below. Pigmentation Pigmentation has been a fruitful area for genetic research since the 19th century, when scientists realized that the mice with varied coat colors that “mouse fanciers” had been developing for centuries provided easily tracked phenotypes for genetic analysis [7]. Many genetic variants underlying “normal” variation in human pigmentation have recently been discovered [2]–[5]. These variants account for a significant part of the known variation in pigmentation, (approximately 30% for hair color, 60% for eye color, see Results), but much remains unexplained. Hair morphology Human hair varies in thickness as well as in the extent of curl, which is related to the shape of the hair (round versus flat cross-section). Over 100 years ago Davenport and Davenport [8] reported a study of hair morphology in families, concluding that straight hair was recessive to curly hair. More recently, a candidate gene approach discovered that EDAR is associated with hair thickness in Asians [9]. Ability to smell the urinary metabolites of asparagus The study of the ability to smell the urinary metabolites of asparagus (probably mostly methanethiol, a sulfur-containing compound) also dates back over 100 years [10]. Since then, authors have debated whether the variation among humans is in the ability to produce methanethiol (thought from family studies to be inherited in an autosomal dominant manner [10]) or in the ability to smell that compound [11]. No previous studies have reported genes or single nucleotide polymorphisms (SNPs) associated with this sensory ability. Photic sneeze Listed under “ACHOO (Autosomal-dominant Compelling Helio-Ophthalmic Outburst) syndrome” in Online Mendelian Inheritance in Man (OMIM), the “photic sneeze reflex” refers to the tendency to sneeze when moving from relative darkness into bright light—most often sunlight. Aristotle discussed the trait in a section of his Book of Problems called “Problems concerning the nose,” hypothesizing that heat-generated movement led to tickling of the nose. No previous studies have reported genes or SNPs associated with this particular reflex. Other phenotypes The other traits analyzed here fall into three broad categories. The first category consists of laterality preferences: handedness, footedness, ocular dominance, and hand-clasp (which thumb is on top when clasping one's hands). The second group consists of simple physical characteristics: whether participants have had cavities, have worn braces, have had wisdom teeth removed, have astigmatism, wear glasses, have attached earlobes, and suffer from motion sickness while riding in a car. The third group consists of personality traits and preferences: optimism, a preference for sweet versus salty food, and preference for night-time versus morning-time activity. None of these traits have well-established associations with SNPs, although handedness [12] and diurnal preference [13] have putative genetic associations. Results Of the 22 studies, eight yielded positive results, with novel associations discovered for four traits and replications for five traits (Table 1). All five replications are for pigmentation-related traits. The novel associations reveal SNPs associated with hair morphology, detection of a urinary metabolite of asparagus, photic sneeze reflex, and freckling. Manhattan and qqplots for the novel associations and replications are shown in Figure 2, Figure 3, and Figure 4. 10.1371/journal.pgen.1000993.g002 Figure 2 Manhattan plots for new associations. Shown are (A) hair curl, (B) asparagus anosmia, (C) photic sneeze reflex, and (D) freckling. Plots show scores ( p-values) for all SNPs by physical position. All plots are trimmed at a maximum score of 15. For regions with a more significant association, the strongest score in that region is shown above the region. 10.1371/journal.pgen.1000993.g003 Figure 3 Manhattan plots for replications. Shown are (A) hair color, (B) red hair, (C) blue to brown eye color, and (D) green versus blue eye color. Plots show scores ( p-values) for all SNPs by physical position. All plots are trimmed at a maximum score of 15. For regions with a more significant association, the strongest score in that region is shown above the region. 10.1371/journal.pgen.1000993.g004 Figure 4 Quantile-quantile plots for new associations and replications. Shown are (A) hair curl, (B) asparagus anosmia, (C) photic sneeze reflex, (D) freckling, (E) hair color, (F) red hair, (G) blue to brown eye color, and (H) green versus blue eye color. All plots are trimmed at a maximum score of 15 to better show details. Approximate 95% CIs are indicated in blue. The red line passes through the median p-value. 10.1371/journal.pgen.1000993.t001 Table 1 Statistics on the 22 studies (with or without associations reaching genome-wide significance). Phenotype Size Top hit # Loci Loci Eye color, blue to brown 4402 6 OCA2, SLC24A4, IRF4, SLC45A2, TYR, (TYRP1) Freckles 4405 90.68 5 IRF4, MC1R, ASIP, BNC2, (TYR) Hair color, blond to brown 3044 87.07 5 OCA2, IRF4, SLC45A2, SLC24A4, MC1R Red hair 4422 86.28 2 MC1R, ASIP Eye color, green/blue 2826 51.52 3 OCA2, SLC24A4, TYR Hair curl 5385 41.80 3 TCHH, WNT10A, (OFCC1) Asparagus anosmia 4742 23.18 1 OR2M7 Photic sneeze reflex 5390 10.93 2 2q22.3, (NR2F2) Footedness 3079 6.75 0 Attached earlobes 3915 6.59 0 Morningness 4264 6.50 0 Braces 4011 6.45 0 Optimism 3936 6.29 0 Astigmatism 7701 6.17 0 Prefer sweet snacks 3100 6.07 0 Wisdom teeth 3983 5.89 0 Cavities 5366 5.81 0 Glasses 5386 5.76 0 Ocular dominance 3126 5.70 0 Hand-clasp 5256 5.66 0 Motion sickness 2987 5.55 0 Handedness 4268 5.30 0 Loci are called significant if they contain a SNP with p-value over 8.4 and suggestively significant if they have one between 7.1 and 8.4. Loci that were not previously associated with the given trait are in bold, those where we report a remapping of a previous hit are in italics, and suggestively significant loci are in parentheses. Size refers to the total number of individuals in the study. The “top hit” refers to the largest p-value for the given trait. The genomic control inflation factor, , [59] was between 1.0 and 1.02 for all studies. For more details, including , numbers of cases and controls, and covariates used in the analyses, see Supplementary Table 1 of Text S6. The studies were performed on multiple overlapping datasets drawn from a single cohort. The cohort was derived from the subset of the customer base of 23andMe, Inc. who took surveys relevant to the 22 traits considered here. Of these individuals, only those assessed to be of northern European ancestry were included. In addition, individuals were eliminated until any pair of participants shared at most 700 cM of full or half identity by descent (IBD), approximately the lower end of sharing between a pair of first cousins. Average IBD between a pair of participants was 0.146cM, median IBD was 0 and only 123 pairs of individuals shared more than 100 cM. The resulting cohort consisted of a total of 9126 individuals who had answered at least one of the surveys considered here. Each individual was genotyped on the Illumina HumanHap550+ BeadChip platform (consisting of the HumanHap550 panel along with a custom set of approximately SNPs selected by 23andMe). After quality control (see Methods), SNPs were used from this platform. Phenotypes were collected using 13 surveys posted on the 23andMe website. From these surveys, 22 traits met our criteria for inclusion, which required over unrelated participants who responded to the relevant survey questions and were assessed to be of northern European origin. In addition, for binary phenotypes we required at least participants with each outcome before analysis. See Text S5 for full descriptions of the phenotypes. Detailed summaries of results are shown in Table 2, Table 3, and Table 4. These summaries include the SNPs selected within each associated region to give the best predictive model while attempting not to over-fit (using a stepwise regression approach with the AIC, see Methods). The reader should be warned that this approach is anti-conservative about the number of effects fitted and, in particular, all SNPs appearing in these tables are not necessarily independently associated. Throughout, “score” refers to the negative p-value for the association between a SNP and a trait. We also include Bayes factors (see Methods) in the tables and the plots of imputed and genotyped SNPs within each associated region due to their usefulness in comparing associations and their ability to incorporate uncertainty arising from imputation. Because we performed multiple (22) parallel studies, we used a very conservative threshold of for a SNP before it was claimed to reach genome-wide significance (see Methods). Associations significant under this correction for a single study but not for all studies (that is, with scores between and ) are called “suggestive” and are shown in Table 5. 10.1371/journal.pgen.1000993.t002 Table 2 Selected SNPs reaching genome-wide significance for hair curl, asparagus anosmia, photic sneeze reflex, and freckling. Phenotype SNP Chr Position Locus Alleles MAF Score BF OR Hair curl rs17646946 1 150,329,391 TCHH G/A 0.20 41.80 39.90 −0.294 — rs499697 1 150,759,778 LCE3E A/G 0.29 9.95 7.61 0.126 — rs7349332 2 219,464,627 WNT10A C/T 0.14 13.48 11.01 0.193 — Asparagus rs4481887 1 246,563,486 OR2M7 G/A 0.26 23.18 21.74 −0.514 0.598 rs4309013 1 246,547,391 OR2M7 T/C 0.27 22.27 20.78 −0.496 0.609 rs4244187 1 246,560,134 OR2M7 C/T 0.27 22.02 20.44 −0.493 0.611 Sneeze rs10427255 2 145,841,993 2q22.3 T/C 0.46 10.93 8.65 0.280 1.323 Freckling rs12203592 6 341,321 IRF4 C/T 0.18 90.68 87.04 1.611 — rs872071 6 356,064 IRF4 A/G 0.49 14.38 12.52 0.508 — rs3778607 6 348,799 IRF4 G/A 0.47 12.67 10.65 −0.471 — rs9328192 6 379,364 IRF4 A/G 0.50 9.39 7.11 0.395 — rs9405675 6 389,600 IRF4 A/G 0.36 8.72 6.86 −0.392 — rs12931267 16 88,346,233 MC1R C/G 0.08 61.08 60.89 1.875 — rs8049897 16 88,551,703 MC1R G/A 0.15 29.79 27.76 1.033 — rs1805008 16 88,513,645 MC1R C/T 0.07 23.18 21.05 1.212 — rs1805009 16 88,514,047 MC1R G/C 0.02 17.32 14.41 2.017 — rs11547464 16 88,513,592 MC1R G/A 0.01 12.85 9.91 2.430 — rs11861084 16 88,403,211 MC1R C/A 0.41 11.27 8.90 −0.445 — rs7204478 16 88,322,986 MC1R C/T 0.45 10.61 8.40 0.427 — rs7195066 16 88,363,824 MC1R C/T 0.32 10.12 7.76 −0.449 — rs11648785 16 88,612,062 MC1R C/T 0.31 8.96 6.70 −0.424 — rs8060934 16 88,447,526 MC1R T/C 0.48 8.86 6.60 −0.390 — rs154659 16 88,194,838 MC1R T/C 0.25 8.62 6.45 0.433 — rs619865 20 33,331,111 ASIP G/A 0.10 13.26 10.78 0.771 — rs4911442 20 32,818,707 ASIP A/G 0.12 10.24 7.98 0.629 — rs291671 20 31,414,506 ASIP A/G 0.10 12.04 9.89 0.772 — rs6088316 20 31,890,503 ASIP A/G 0.17 10.40 8.37 0.568 — rs761238 20 31,983,649 ASIP T/G 0.33 9.90 7.56 0.428 — rs17305657 20 31,270,249 ASIP T/C 0.09 9.82 7.56 0.697 — rs4812405 20 34,709,999 ASIP C/A 0.07 9.70 7.40 0.783 — rs1474976 20 34,195,786 ASIP A/C 0.11 9.50 7.23 0.620 — rs4911414 20 32,193,105 ASIP G/T 0.34 8.52 6.26 0.393 — rs2153271 9 16,854,521 BNC2 T/C 0.41 9.40 7.28 −0.402 — “Score” is the negative p-value for the (linear or logistic) regression coefficient where genotypes were coded as 0, 1, or 2 according to the number of copies of the minor allele. Alleles are listed as major/minor. “OR” refers to the allelic odds ratio (for binary traits). “BF” is the Bayes factor; position is the NCBI build 36 human assembly position. Hair curl was coded as a quantitative trait on a 6 point scale where zero was straight and five was “very tight curls.” “Sneeze” refers to photic sneeze reflex and “Asparagus” to asparagus anosmia, the lack of the ability to smell the urinary metabolites of asparagus. Freckling was coded as a quantitative trait on a 17 point scale where zero referred to the lack of freckles. Confidence intervals for ORs and can be found in the tables of the Text S6. Displayed SNPs were selected among all significant hits using a stepwise regression procedure. 10.1371/journal.pgen.1000993.t003 Table 3 Selected SNPs reaching genome-wide significance for replications of hair color associations. Phenotype SNP Chr Position Locus Alleles MAF Score BF OR Hair color rs12913832 15 26,039,213 OCA2 G/A 0.23 87.07 85.76 0.974 — rs7183877 15 26,039,328 OCA2 C/A 0.08 21.15 18.47 0.759 — rs4778138 15 26,009,415 OCA2 A/G 0.12 14.40 11.83 0.510 — rs12203592 6 341,321 IRF4 C/T 0.18 27.64 27.17 0.589 — rs16891982 5 33,987,450 SLC45A2 G/C 0.03 19.41 16.76 1.104 — rs12896399 14 91,843,416 SLC24A4 G/T 0.44 12.34 9.83 −0.308 — rs12931267 16 88,346,233 MC1R C/G 0.08 9.48 7.33 −0.556 — Red hair rs12931267 16 88,346,233 MC1R C/G 0.08 86.28 89.80 0.556 — rs4238833 16 88,578,190 MC1R T/G 0.37 42.77 41.02 0.228 — rs8049897 16 88,551,703 MC1R G/A 0.15 42.38 40.65 0.309 — rs1805008 16 88,513,645 MC1R C/T 0.07 37.48 36.19 0.387 — rs4408545 16 88,571,529 MC1R C/T 0.50 24.23 22.14 −0.164 — rs1805009 16 88,514,047 MC1R G/C 0.02 17.43 14.48 0.508 — rs2353033 16 87,913,062 MC1R T/C 0.43 13.93 11.55 0.123 — rs2159116 16 88,359,011 MC1R C/A 0.17 12.58 10.37 0.157 — rs7196459 16 88,668,978 MC1R G/T 0.08 11.31 8.94 0.208 — rs7195066 16 88,363,824 MC1R C/T 0.32 11.10 8.69 −0.118 — rs447735 16 88,261,850 MC1R T/C 0.44 9.85 7.49 −0.103 — rs11648785 16 88,612,062 MC1R C/T 0.31 8.47 6.21 −0.103 — rs4347628 16 88,098,136 MC1R C/T 0.44 8.42 6.28 −0.095 — rs291671 20 31,414,506 ASIP A/G 0.10 14.70 12.45 0.215 — rs17305657 20 31,270,249 ASIP T/C 0.09 12.31 9.89 0.197 — rs619865 20 33,331,111 ASIP G/A 0.10 9.89 7.59 0.165 — See Table 2 for details. Hair color was coded on an eight point scale from light to dark. Red hair was coded on a four point scale based on the amount of red contained. 10.1371/journal.pgen.1000993.t004 Table 4 Selected SNPs reaching genome-wide significance for replications of eye color associations. Phenotype SNP Chr Position Locus Alleles MAF Score BF OR Eye color rs12913832 15 26,039,213 OCA2 G/A 0.23 2.491 — rs8039195 15 26,189,679 OCA2 T/C 0.14 291.35 293.35 2.088 — rs16950987 15 26,199,823 OCA2 G/A 0.05 125.44 121.47 2.169 — rs2240204 15 26,167,627 OCA2 G/A 0.05 117.51 113.57 2.127 — rs6497253 15 25,962,144 OCA2 G/A 0.21 43.47 40.32 0.710 — rs1470608 15 25,961,716 OCA2 G/T 0.14 42.49 39.33 0.839 — rs4778232 15 25,955,360 OCA2 C/T 0.21 38.75 35.90 0.663 — rs1597196 15 25,968,517 OCA2 G/T 0.17 31.88 28.81 0.648 — rs7170451 15 25,865,819 OCA2 G/A 0.27 26.48 23.52 0.503 — rs1800407 15 25,903,913 OCA2 C/T 0.08 23.49 20.78 0.823 — rs4778220 15 25,894,733 OCA2 A/G 0.17 20.58 17.83 0.531 — rs728405 15 25,873,448 OCA2 A/C 0.22 19.91 17.34 0.466 — rs1800404 15 25,909,368 OCA2 T/C 0.21 14.47 11.91 0.411 — rs1107267 15 25,645,234 OCA2 G/A 0.26 11.76 10.33 0.338 — rs3930739 15 25,713,937 OCA2 C/T 0.46 11.38 8.96 −0.288 — rs17565953 15 25,692,250 OCA2 G/T 0.40 10.94 8.68 −0.289 — rs977588 15 25,852,901 OCA2 A/C 0.41 8.55 6.48 0.252 — rs12896399 14 91,843,416 SLC24A4 G/T 0.44 15.95 13.49 −0.343 — rs4904868 14 91,850,754 SLC24A4 C/T 0.45 13.49 10.95 0.317 — rs12203592 6 341,321 IRF4 C/T 0.18 14.69 13.26 −0.424 — rs16891982 5 33,987,450 SLC45A2 G/C 0.03 11.83 9.42 0.840 — rs1393350 11 88,650,694 TYR G/A 0.27 8.49 6.25 −0.278 — Green eyes rs12913832 15 26,039,213 OCA2 G/A 0.23 51.52 41.34 2.131 8.425 rs7183877 15 26,039,328 OCA2 C/A 0.08 29.54 21.57 2.275 9.732 rs12896399 14 91,843,416 SLC24A4 G/T 0.44 22.82 19.49 −0.546 0.579 rs1847134 11 88,644,901 TYR A/C 0.32 14.95 13.13 −0.461 0.631 rs1827430 11 88,658,088 TYR A/G 0.38 10.97 8.93 −0.377 0.686 rs11018528 11 88,570,025 TYR A/G 0.30 10.48 8.73 −0.391 0.676 rs7120151 11 88,380,027 TYR G/A 0.29 9.61 8.19 −0.375 0.687 rs1806319 11 88,677,584 TYR T/C 0.37 9.35 7.37 −0.348 0.706 See Table 2 for details. Eye color was coded on a seven point scale from blue to brown. Green eye color was treated as a binary trait, where cases had green eyes and controls blue. 10.1371/journal.pgen.1000993.t005 Table 5 Suggestive associations: those significant under the Bonferroni correction for a single study but not for all 22 studies. Phenotype SNP Chr Position Locus Alleles MAF Score BF OR Haircurl rs1556547 6 10,378,363 OFCC1 A/G 0.40 7.81 5.61 −0.101 — Eye color rs10960751 9 12,665,264 TYRP1 C/T 0.37 7.54 5.35 0.240 — Freckle rs1042602 11 88,551,344 TYR C/A 0.37 7.25 5.50 −0.356 — Sneeze rs11856995 15 94,126,647 NR2F2 T/C 0.30 7.13 5.71 −0.244 0.784 We have used a conservative threshold for reporting associations due to our increased multiple testing burden arising from the large number of traits analyzed; however, several SNPs score high enough that they would have been deemed significant as part of a single study. Shown are novel associations for haircurl and photic sneeze reflex plus replications for freckling and eye color. Phenotype data The collection of a broad range of data from each participant allows us to control for some sources of bias and to assess the error rate of the phenotype collection. To control for sources of bias, phenotypes were checked for correlations with a set of covariates (age, sex, and principal components of population variation). Covariates showing significant correlations (at the level) were included in the analysis (Supplementary Table 2 in Text S4). In addition, by asking participants the same question multiple times in different ways, we were able to assess the repeatability of responses. We asked about eye color, hair color, freckles, handedness and age twice each in different places or ways. A total of 177 people were removed from analysis based on a single discordant answer to one of these questions. Overall, a total of only 0.72% of participants answered any pair of these questions inconsistently. One source of bias unique to our replication studies was the fact that participants were shown their genetic data along with analyses of their data for approximately 100 traits and diseases. In some cases, this led to a severe bias. For example, a survey examining perceived performance in sprint versus long-distance races was placed on a web page within 23andme.com where customers were shown their genotype for rs1815739 (a SNP in ACTN3 [14]). If they logged on before their genotype data had been processed, they saw the survey question alongside sample data. If their data was available, they were predicted to fall in a category including either world-class sprinters (carriers of the C allele) or endurance athletes (T homozygotes). The response distribution differed significantly ( ) between respondents who had seen their genotypes with the suggested outcome versus those who hadn't. The results (Supplementary Table 1 in Text S5) of this comparison are consistent with large fractions (24.2% of C carriers, 41.2% of T homozygotes) of respondents answering differently than they would have if they had not seen their genotype data and interpretation. Six of the 13 surveys were posted on pages where customers were shown their genotypes and predictions for related conditions. Due to the possibility of bias from this prediction, primary analysis of these six surveys considered only those participants who took the surveys before receiving their genetic data (so they only saw a sample prediction for the phenotype). As a result, none of these traits made our sample size cutoff. For the 22 phenotypes considered here, participants were shown predictions for hair color, eye color, and freckling, although they were on separate pages from the surveys for these phenotypes. There was no evidence that for any of these traits participants who saw their genotypes gave different responses from those who did not (Methods). Therefore, we did not restrict attention to only those who hadn't seen their genotypes. Survey response rates correlated with sex, age and the first (north-to-south) principal component of population structure. That is, women, people of northern European ancestry, and older people were more likely to answer more surveys than men, people of central European ancestry, and younger people (p-values , , and ), respectively. A genome-wide association study (GWAS) using the number of surveys answered as the trait analyzed did not show any significant associations (with p-values under ) when these covariates were taken into account. Hair curl We found regions associated with hair curl near the genes TCHH, LCE3E and WNT10A, as well as a region suggestively associated near OFCC1. We found an association between a SNP near TCHH, rs17646946 and hair curl, with score 41.8, see Figure 5 and Table 6. The minor allele is associated with straighter hair, with each A conferring a reduction in curliness of about 0.29 points on a scale from 0 to 5. There is evidence of a second, possibly independent, association in this region: the SNP rs499697, about 430kb away near LCE3E, has a score of 9.9. Here the minor, derived allele is associated with curlier hair, 0.13 points per G. These SNPs lie in the epidermal differentiation complex, which contains a large number of genes required for late epidermal differentiation. Many of these genes are involved in the production of the cornified envelope (CE), the highly cross-linked outermost layer of skin that provides mechanical protection from the environment, or are involved in cross-linking the CE with the network of keratin filaments in the cells. 10.1371/journal.pgen.1000993.g005 Figure 5 Bayes factors for genotyped and imputed SNPs for hair curl around TCHH. Plotted are Bayes factors for all SNPs in HapMap in the region. Red/blue circles represent typed SNPs, magenta/green squares imputed SNPs; Red/magenta points have log Bayes factors over 6. Genes are marked in the top track with gene names inside the figure. Linkage disequilibrium (relative to the SNP with the largest Bayes factor) is plotted in the middle track. 10.1371/journal.pgen.1000993.t006 Table 6 Counts for hair curl versus genotype at rs17646946. GG (%) GA (%) AA (%) straight 923 (27.3) 797 (44.9) 111 (49.8) slightly wavy 1629 (48.1) 743 (41.8) 96 (43.0) wavy 568 (16.8) 174 (9.8) 10 (4.5) big curls 187 (5.5) 42 (2.4) 5 (2.2) small curls 61 (1.8) 17 (1.0) 1 (0.4) very tight curls 16 (0.5) 3 (0.2) 0 (0.0) The LD block containing the most significant SNPs includes four genes: S100A11, TCHHL1 (trichohyalin-like 1), TCHH (trichohyalin), and RPTN (repetin). All four are putative calcium-binding proteins that contain two EF hand domains. S100A11, TCHH, and RPTN have all been shown to be associated with the CE [15]–[17]. Trichohyalin and repetin are both expressed at high levels in hair follicles—specialized epidermal structures that produce hair—specifically in the inner root sheath layer [17], [18]. We also found an association between rs7349332 and hair curl, with score 13.4, in an intron of WNT10A. The minor allele is associated with curly hair, each T is associated with a 0.2 point change. See Figure 6 and Table 7. 10.1371/journal.pgen.1000993.g006 Figure 6 Bayes factors for genotyped and imputed SNPs for hair curl around WNT10A. For details, see Figure 5. 10.1371/journal.pgen.1000993.t007 Table 7 Counts for hair curl versus genotype at rs7349332. CC (%) CT (%) TT (%) straight 1432 (36.1) 376 (28.3) 25 (26.3) slightly wavy 1812 (45.7) 614 (46.2) 41 (43.2) wavy 513 (12.9) 222 (16.7) 17 (17.9) big curls 141 (3.6) 88 (6.6) 6 (6.3) small curls 53 (1.3) 23 (1.7) 4 (4.2) very tight curls 11 (0.3) 6 (0.5) 2 (2.1) Finally, near OFCC1 (orofacial cleft candidate 1), rs1556547 has a score of 7.8 and the minor allele is associated with straighter hair (0.1 points per G, Table 8) and Figure 7. 10.1371/journal.pgen.1000993.g007 Figure 7 Bayes factors for genotyped and imputed SNPs for hair curl around OFCC1. For details, see Figure 5. 10.1371/journal.pgen.1000993.t008 Table 8 Counts for hair curl versus genotype at rs1556547. AA (%) AG (%) GG (%) straight 593 (30.6) 898 (34.6) 343 (40.2) slightly wavy 901 (46.5) 1192 (45.9) 376 (44.0) wavy 302 (15.6) 356 (13.7) 94 (11.0) big curls 96 (5.0) 105 (4.0) 34 (4.0) small curls 35 (1.8) 40 (1.5) 5 (0.6) very tight curls 9 (0.5) 8 (0.3) 2 (0.2) Asparagus anosmia Odorous urine after eating asparagus is thought to be due to the excretion of methanethiol, a volatile sulfur-containing compound that smells like rotten or boiling cabbage [19]. In mammals, odor detection is mediated through olfactory receptors (ORs), seven-transmembrane domain proteins that are expressed in the cell membranes of olfactory neurons and are responsible for detection of odorants. A number of ORs are known to have specific odor sensitivities, and account for some of the variation in an individual's ability to detect those odors [20]. We have found a region on chromosome 1 (Figure 8) containing a cluster of olfactory receptor genes that is significantly associated with the ability to smell asparagus metabolites. This region contains 39 OR genes (ten annotated as pseudogenes). The LD block that contains the most significant SNP found in this study rs4481887 (with a score of 23.21) includes ten genes, all olfactory receptors: OR2M2, OR2M3, OR2M4, OR2T33, OR2T12, OR2M7, OR5BF1, OR2T4, OR2T6, and OR2T1. Two of these (OR2M3, OR2T6) are annotated as pseudogenes in ORDB [21]. The most significant association, at rs4481887 (about 9kb upstream from OR2M7), has score 23.2 and appears to be acting in a dominant fashion decreasing the odds of anosmia, with an odds ratio under a dominant model of 0.48 (the odds ratio of 0.60 reported in Table 2 is under an additive model). See Table 9 for details. 10.1371/journal.pgen.1000993.g008 Figure 8 Bayes factors for genotyped and imputed SNPs for asparagus anosmia around OR2M7. For details, see Figure 5. 10.1371/journal.pgen.1000993.t009 Table 9 Counts for asparagus anosmia versus genotype at rs4481887. GG (%) GA (%) AA (%) Can smell 1458 (56.7) 1313 (70.9) 231 (74.0) Anosmic 1114 (43.3) 540 (29.1) 81 (26.0) Another nearby SNP, rs7555310, with a similar odds ratio but slightly weaker score of 21.7, is a non-synonymous change in OR2M7. This SNP changes a valine to an alanine and is predicted to lie in one of the transmembrane domains of this protein. Though the valine is well conserved in mammals, the functional significance of this conservative amino acid change is not clear. Due to the extensive linkage disequilibrium in this region, it is impossible to tell without additional evidence which gene most likely codes for the receptor that detects this odorant. Photic sneeze reflex For photic sneeze reflex, we find a novel association with rs10427255 (score 10.9 and an OR of 1.32). This SNP lies in a large intergenic region of 2q22.3 between ZEB2 and (the annotated pseudogene) PABPCP2 (725kb and 1.2MB away, respectively). We also find a suggestive association with rs11856995 (score 7.13 and OR of 0.78). It also lies in a large intergenic region of 15q26.2, with the nearest gene being NR2F2, some 560kb away. Details for the SNPs in these regions are shown in Figure 9, Table 10, and Table 11. 10.1371/journal.pgen.1000993.g009 Figure 9 Bayes factors for genotyped and imputed SNPs for two photic sneeze associations. Regions shown are (A) near ZFHX1B/ZEB2 on 2q22.3 and (B) near NR2F2 on 15q26.2. For details, see Figure 5. 10.1371/journal.pgen.1000993.t010 Table 10 Counts for photic sneeze reflex versus genotype at rs10427255. TT (%) TC (%) CC (%) No sneeze 1168 (72.4) 1795 (67.9) 677 (59.9) Sneeze 446 (27.6) 849 (32.1) 453 (40.1) 10.1371/journal.pgen.1000993.t011 Table 11 Counts for photic sneeze reflex versus genotype at rs11856995. TT (%) TC (%) CC (%) No sneeze 1714 (65.2) 1530 (67.8) 397 (78.8) Sneeze 916 (34.8) 727 (32.2) 107 (21.2) Freckling We find one new association for freckling and replicate two known regions. The novel association is at rs2153271, in an intron of BNC2 (Zinc finger protein basonuclin-2), with a score of 9.4 and an estimated of −0.4 (on a 17 point scale). See Supplementary Table 2 in Text S6 for details. Our most significant association, rs12203592, with score 90.7, lies in an intron in IRF4 (Figure 10). This SNP was previously associated with hair color, eye color, and tanning response to sunlight [5]. A more mildly associated SNP, rs1540771 (with score 13.2), in this region has previously been associated with freckling (as well as eye color, sensitivity to sun, and hair color) [4], however rs12203592 (60kb away) was not typed in that analysis. For eye and hair color and tanning ability it was suggested [5] that in fact rs12203592 was in closer LD with the causal SNP. Here we confirm this finding for hair and eye color and establish the same for freckling. 10.1371/journal.pgen.1000993.g010 Figure 10 Bayes factors for genotyped and imputed SNPs for freckling around BNC2. For details, see Figure 5. The other loci we associate with freckling are MC1R, ASIP, and TYR, all known associations [2], [22]. Although the SNPs selected by the regression procedure as most influential are slightly different than those for red hair, the sets are quite similar for these highly correlated phenotypes. Hair color We confirm known associations for hair color, both blond to brown and non-red to red. For blond to brown, excluding red, we find hits in five regions: OCA2/HERC2, IRF4, SLC24A4, SLC45A2, and MC1R (aside from MC1R, the same set of regions as [5] in their analysis excluding red hair). A multiple regression using the seven SNPs in Table 3 (with sex and five principal components) estimates that these five regions together explain about 28.1% of the variance in hair color (blond to brown) within northern Europe. In the OCA2/HERC2 region, rs12913832, first found by [3], has a score of and of , explaining % of the variance. These numbers (as well as those for the other SNPs) concord well with those in [5] (which estimated % of the variance was explained by this SNP and using a five point scale from dark to light as compared to our eight point scale from light to dark). For IRF4, rs12203592 has an estimated of 0.59 and explains 3.9% of the variance. For SLC24A4, rs12896399 has an estimated of and explains 1.7% of the variance. In SLC45A2, rs16891982 has and explains % of the variance. Finally, rs12931267 near MC1R has of and explains % of the variance. Sulem et al. [4] found associations for hair color in four of these five regions (excluding SLC45A2) as well as KITLG. The SNP rs12821256 in KITLG showed a mild but significant association with hair color in our study, with a score of 4.2 and of (95% CI from to ). This is similar to the relatively weak effect for this SNP found in [5]. For the other direction of variation in hair color (red versus non-red hair) we found many associated SNPs in the MC1R region, long known to be associated with this phenotype [2]. Although some of the SNPs contributing to the model in this region lie far from MC1R, some of the biggest effects are from rs1805008 and rs1805009, non-synonymous changes in the MC1R gene. This region is strongly associated with common variation in red hair [4], [5]. We also replicated the claim that a large haplotype containing the pigmentation gene ASIP is associated with red hair (also associated with burning and freckling in [22]). While rs291671 is about 900kb away from ASIP, it appears to be tagging the same haplotype found there. Eye color We confirm known associations for each of two traits: blue to brown and green versus blue eye color (Table 4). For blue to brown, we confirm six regions: OCA2/HERC2, SLC24A4, IRF4, SLC45A2, TYR, and TYRP1. In HERC2, the well-known SNP rs12913832 has a score over 300 and of (thus the A allele is associated with darker eyes). This single SNP explains % of the variance in eye color. In SLC24A4, rs12896399 has a score of 15.9 and of , explaining 1.6% of the variance in eye color. In IRF4, rs12203592 has a score of 14.7 and of , explaining 1.4% of the variance. In SLC45A2, rs16891982 has a score of 11.8, of , and explains % of the variance. In TYR, rs1393350 has a score of 8.5, of , and explains % of the variance. Finally, as shown in Table 5, (as it does not quite meet our strict genome-wide significance levels), rs10960751 in TYRP1 has a score of 7.5, of , and explains % of the variance. For green versus blue eyes, three regions show associations: OCA2/HERC2, SLC24A4, and TYR. For this trait, SNP rs12913832 in OCA2/HERC2 has a score of 51.5 and an estimated allelic OR of . The SNP rs1667394 in this same region has an estimated OR of (4.85–10.06), very close to the ORs in [4], which range from to in their three populations. In SLC24A4, rs12896399 has a score of 22.8 and OR of 0.58 (0.52–0.65), again similar to [4], where the ORs range from to . In TYR, we report rs1847134 with a score of . The nearby SNP rs1393350, which was reported to have ORs between and in Sulem et al. has an OR of 0.64 (0.57–0.72) in our data. Discussion We have conducted genome-wide association studies for 22 traits. The sheer size of this study was possible due to an original, web-based, parallel design. We have found novel associations for hair curl, freckling, photic sneeze reflex, and the ability to smell the urinary metabolites of asparagus. In addition, we have replicated a wide array of associations for pigmentation traits; these replications show great consistency with the numbers reported in other studies. Associations Hair curl We have found associations between hair curl and SNPs near the genes TCHH, LCE3E, WNT10A, and OFCC1. that together explain about 6% of the variance in hair curl within northern Europe. The association with TCHH replicates the finding of [23] (which also mentions WNT10A as a non-significant but interesting association). In the TCHH region, the strongly associated imputed SNP rs11803731 (in an exon of TCHH, causing the change M790L) is of note. This position is a possible helix initiator (between two prolines and a probable helix [24]), and differences in the ability to initiate the helix due to mutations at this position could lead to changes in protein behavior. Experiments with in vitro cultures of curly and straight human hairs have shown that hair shape appears to be programmed from the basal area of the follicle, which includes the inner root sheath. Curly hairs originate from follicles that have a golf-club like bend at the base, while straight hairs originate from straight follicles [25]. Asymmetry in the thickness of the inner root sheath, which surrounds and provides protection for the growing hair at the base of the follicle, may play a role in moulding the shape of the hair [26]. Wool follicles from sheep also have a curved bulb, and exhibit an asymmetry in the thickness of trichohyalin containing structures, with more trichohyalin on the concave side [27]. Biochemical studies of the trichohyalin protein suggest that trichohyalin provides mechanical strength to the hair follicle by cross-linking the CE with the cytoplasmic keratin filament network in inner root sheath cells [28]. The bulk of this evidence points to TCHH as the most likely candidate for controlling hair curl in this region. However, due to the related nature of these genes, it is difficult to rule out S100A11, TCHHL1, or RPTN as playing a role. A second association 400kb away appears to be independent, but again there are many genes nearby (LCE3E, LCE5A, etc.) that could play a role. For WNT10A, there appears to be a direct connection to hair morphology: mutations in this gene are known to cause odonto-onycho-dermal dysplasia, which includes dry, misformed hair as a symptom [29]. Also, WNT10A is upregulated in hair follicles at the beginning of a new growth cycle [30]. Finally, a translocation breakpoint in OFCC1 has been associated with orofacial clefting [31]. Orofacial clefting and ectodermal dysplasia can appear together, for example in EEC, Rapp-Hodgkin, and Hay-Wells syndromes [32]. Asparagus anosmia It has been debated whether variance in the production or the detection of methanethiol explains the differences in the reporting of the ability to detect asparagus metabolites in urine. One study suggested that production is an autosomal dominant trait [10]; a second study concluded that the variation is instead in the ability to detect the compound and that it is a roughly bimodal trait with respect to the dilution at which the odor is detected [11]. We have identified a locus associated with this trait. This locus lies within a region containing many olfactory receptors and appears to act in a dominant fashion. Both of these facts suggest that the genetic variation in this trait is in the ability to detect the odorant. Freckling We have found a novel association between freckling and rs2153271 in an intron of BNC2. BNC2 is a potential transcriptional regulator in keratinocytes based on its close similarity to BNC1 (basonuclin 1) [33] (which is involved in keratinocyte proliferation), however there is evidence that BNC1 and BNC2 have different functions [34]. As the pigment that shows up in freckles is housed in keratinocytes, there could well be a link between BNC2 and freckling. We have also found a strong association between rs12203592 in an intron of IRF4 and freckling. This region has previously been associated with freckling, however this SNP appears to more closely tag the causal variant than the previous association in this region [4]. Like many other genes associated with pigmentation, IRF4 appears to be regulated by MITF [35]. Also, there is a striking, sudden drop-off in significance around this SNP that does not appear to be due to the presence of any recombination hotspots nearby. There is evidence of substantial copy-number variation (CNV) in this region (cf. [36], [37]). Although we are unable to detect CNVs here, it could explain the LD pattern. Hair color We replicated many known associations for hair color, both blond to black and red versus non-red. Of interest is the fact that rs12821256 in KITLG, discovered in [4], does not have a large effect size either here or in [5]. This raises the question of whether the OR of 1.9 reported in [4] is an overestimate or whether the discordant results are due to differences in the populations studied (as Han et al. studied a U.S. population probably similar to our cohort while Sulem et al. studied Icelandic and Dutch populations). Photic sneeze There is evidence that photic sneeze reflex is genetic: Peroutka and Peroutka [38] concluded that this trait is inherited in an autosomal dominant fashion. We find one region associated with photic sneeze reflex about 725kb away from ZEB2 and a second suggestively associated about 550kb away from NR2F2. ZEB2 is mutated in individuals with Mowat-Wilson syndrome, which has seizures as a common symptom. There may be a link between photosensitive epileptic seizures and photic sneeze reflex (triggered by a sudden switch from being dark-adjusted to light) [39], providing a possible link between this region and photic sneeze reflex. NR2F2 (also known as COUP-TFII), also has a possible link to sneezing: it interacts with NR2F6 [40] and NR2F6 knockout mice show defects in the locus ceruleus (part of the brainstem) [41]. Certain stimulations cause signals to be sent to the brainstem to cause a sneeze; it is thought that photic sneeze reflex also progresses through this fashion [42], [43]. Also, the locus ceruleus is disrupted in Rett syndrome, which also has seizures as a common symptom [44], [45]. Handedness Handedness has putative genetic associations [12] (an association found only in dyslexic siblings). The haplotype there associated with left-handedness (when paternally inherited) was at most moderately associated with left-handedness in our data (estimated OR of 1.17, 95% CI of 0.96–1.42, p-value of 0.06). However, we are underpowered to detect such a small OR in this fairly rare (8% frequency) haplotype. Looking at the laterality measures (handedness, footedness, ocular-dominance, and hand-clasp) as a whole, there is no overlap between the marginal hits for any of these phenotypes: no SNP has a score of above 5 for more than a single one of these measures. Research framework In this initial set of results, we have shown parallel, web-based phenotype collection to be quick and reliable. The population structure present in the wider dataset makes statistical analysis more involved than in a typical GWAS, however it does not influence the results. Furthermore, this complex population structure facilitates studies of multiple ethnicities simultaneously. It also compels the development of robust methods for genetic research in populations reflecting the complexities of human populations. This design has several statistical advantages over traditional studies. By centralizing many studies, we have the ability to avoid publication bias by reporting statistics on a large collection of independent association studies, including both positive and negative results. Such a bias is a concern [46]. For example, imagine that 20 separate groups each perform a GWAS on a phenotype with no true association. One of them will probably find a result by chance, and this will be the only result submitted and published. While false positives are unavoidable in a statistical test, we can reduce them by incorporating the total number of simultaneous trials into the multiple testing calculation. We thus formalize the notion of a significance and testing burden across multiple scans, rather than with respect to a single GWAS. Ignoring linkage disequilibrium (i.e., if one assumes the approximately 10 million tests performed in this paper are all independent), one would expect to see one or two suggestive associations (with scores between 7.1 and 8.4) by chance in about half of all studies similar to ours. This is a conservative estimate; the false-discovery rate (FDR) analysis estimates a FDR of about 5.2% for a cutoff of 7.1, meaning that we would expect 5.2% of the associations with scores in this range to be false positives. In addition, genotypes for cases and controls are collected and treated the same way, avoiding sources of bias (cf. [47]). In fact, an individual will typically be a case for several studies and a control for several others. This is an extension of the model used in [48], where controls were shared across seven studies. This new research model raises interesting methodological questions. For example, since new participants join the study continuously, results are constantly changing. For this study, we chose to set an end date and inclusion criteria in advance to avoid possible bias due to choosing a stop date according to the latest results. However, it is an interesting problem to design a criteria for significance using a continually expanding cohort to replace the traditional design of phased data collection with discovery followed by replication. For example, Figure 11 shows the history of two SNPs from this study that both had scores over 8.4 at one point in time. Further data shows that one appears to have been a true positive and one a false positive. 10.1371/journal.pgen.1000993.g011 Figure 11 Association over time for two SNPs and two traits. (A) Association between rs4481887 and asparagus anosmia, steadily increasing in certainty. (B) Association between rs6650382 and photic sneeze: a very promising initial result, but , the log odds ratio, regressed towards 0. Both traits were assessed in the same survey, so they had approximately the same number of responses at all points in time. Scores ( p-values) and regression coefficients are plotted using the genotype and phenotype data that was available at various points in time. The red line indicates our significance threshold of 8.4. While we have not formalized this notion in this paper, we note that the novel associations we report here have been steadily increasing in significance (cf. Figure 11A and Supplementary Figure 7 in Text S6) and that they have several other significant SNPs in the same LD block. These facts are strong evidence that these novel associations are not the result of some genotyping artifact. Our research model makes possible studies that might be infeasible otherwise due to the low marginal cost of asking additional questions over the web and the speed of broadcasting recruitment messages in parallel online. This speed and flexibility allows us to easily study traits for which funding may not be readily available, such as asparagus anosmia. We believe that providing participants with well-explained descriptions of their genetic data can substantially benefit genetic research as a whole. It is an opportunity to harness the public interest in genetics and make the participants a part of the study. This participation provides a wonderful chance to educate the public about genetics, statistics and research and to give something back to the individuals who contribute to genetic research. Methods Cohort Participants were drawn from the customer base of 23andMe. After purchase, customers were shipped a kit for saliva collection, containing a bar-coded tube to be returned and a code to claim that kit online. They entered this code on the 23andme.com website, created an account, and agreed to a Consent and Legal Agreement. Upon signing up for 23andWe (the portion of the website that allows participants to enter phenotype data through a series of surveys), participants were reminded that they had consented to the use of survey responses for research. The Consent and Legal Agreement stated that participants' genotype data and whatever phenotype data they entered would be used for internal research after being coded and stripped of individually identifying information (“anonymized”). Individually identifying information refers to personal information that is collected during purchase, such as name, credit card information, billing and shipping addresses, and contact information such as an email address or telephone number. The consent form stated that aggregate genetic data might be shared publicly (as it is in this paper). However, it provided that individual-level genetic data would not be shared with outside researchers without separate consent. For this study, all access to individual-level data occurred at 23andMe by full-time 23andMe employees. One author (IP) was working as a 23andMe consultant and did not access individual-level data. Another author with a secondary, non-23andMe affiliation (JM) was acting in her capacity as a 23andMe employee during her work on this study. As part of the consent form, participants were told they could withdraw from all future research at any time by emailing 23andMe. This information was also present as a FAQ on the 23andMe website. A small number of customers have elected this option and those who did so before January 30, 2010 were not included in any analysis in this paper. All anonymized data was placed in a secure research environment accessible only by 23andMe scientists. These scientists had no access to individually identifying information that was held separately at the company. The scientists also had no interaction with study participants. To obtain formal recognition of this strict partitioning of the research, following consultation with the editors at PLoS, we sought and were granted an independent determination by a commercial Institutional Review Board (IRB) that this research did not involve human subjects under the Department of Health and Human Services definition, 45 CFR 46.102(f). This definition states that research performed on anonymized data with no contact between investigators and participants does not constitute research on human subjects. It is 23andMe policy that every scientist who obtains access to the restricted research environment first undergo online human subject assurance training made available by the Office for Human Research Protections (OHRP). 23andMe is committed to an ongoing evaluation of our process of Institutional Review and oversight of Consent as part of our commitment to protection of participant rights. There are significant novel challenges in this process relating to the evolution of policies concerning privacy of genetic data, the ongoing nature of our study, and the return of data to participants. Our study participants arguably have greater opportunity to review and consent to their involvement than most participants in genetic studies. However, they also have greater access to their own genetic data and hence to personal results that may impact their self-perception. We expect the definition of human subjects research to evolve along with standards for protecting individual identity and providing public access to GWAS results. We are continually evaluating our protocols with an external, AAHRPP accredited IRB in an effort to set the standard for web-based genetic studies. Genotyping and SNP quality control DNA extraction and genotyping were performed on saliva samples by National Genetics Institute (NGI), a CLIA licensed clinical laboratory and a subsidiary of Laboratory Corporation of America. Samples were genotyped on the Illumina HumanHap550+ BeadChip platform, which included SNPs from the standard HumanHap550 panel in addition to a custom set of about 25,000 SNPs selected by the 23andMe staff. Every sample that failed to reach 98.5% call rate was re-analyzed. Individuals whose analyses failed repeatedly were re-contacted by 23andMe customer service to provide additional samples, as is done for all 23andMe customers. Two slightly different versions of the genotyping platform were used in this study. See Text S3 for details about the two versions and quality control measures used to eliminate a subset of poorly performing assays. SNPs with a call rate under 98% were excluded from analysis, as were those with minor allele frequency under or a p-value for Hardy-Weinberg equilibrium (using the test from [49] within all the northern Europeans in the database) under . Both minor allele frequency and Hardy-Weinberg statistics were calculated within our dataset. Due to the two slightly different platforms in the analysis the no-call rate was calculated only among individuals genotyped on a given platform. However, all SNPs discussed in this paper are contained in the intersection of the two platforms. In addition, a total of 1553 SNPs with Mendelian discordance rates (the fraction of trios within the entire 23andMe customer database in which the called SNPs followed an impossible inheritance pattern) of at least 1% were discarded. In the end, SNPs were used with an average call rate per person on these SNPs of 99.91% and a median call rate of 99.97%. Statistical analysis All p-values were calculated using linear or logistic regressions as appropriate (or the corresponding score tests in the case of analyses without covariates). For linear regressions with covariates, we used a simple multilevel model: first we performed the regression of the phenotype solely on the covariates and then regressed the residuals against individual SNPs. This multilevel model is similar to the model in EIGENSTRAT [50]. In addition to being faster this multilevel model avoids multicollinearity. Regressions were performed using R [51], PLINK [52], [53], or internal software packages. The estimated regression coefficient for each SNP, denoted by throughout, refers to a coding of genotypes as 0, 1, 2 counting the number of minor alleles present. The strand used was determined from NCBI build 36 of the human genome. Codings for the phenotypes discussed are given below and in Text S5. Bayes factors (shown in region plots and in Text S6) were calculated using the default prior in BIMBAM [54], [55]. Regions of interest were further analyzed by imputation against the phased HapMap CEU subset [56] using BIMBAM. Manhattan and quantile-quantile plots were trimmed at a p-value of in order to better show the details and also since for extremely small p-values the standard approximations become less exact. For the tables (e.g., Table 2), the set of significant SNPs was pared down within each region using a backwards stepwise regression procedure that attempted to minimize the AIC (Akaike's information criterion, using the step command in R). As input to this procedure, all SNPs within the region with p-values under were included and only those that contributed to the optimal model were displayed. Of particular note is that the displayed SNPs were not necessarily all significant upon correction for the other SNPs in that region, only that the AIC judged that a SNP usefully contributed to the joint model. See Table S1 for all SNPs in these studies with scores above 6.0. Multiple testing We give two estimates for the multiple testing burden over all analyses. Most conservatively, a Bonferroni correction that takes into account the number of SNPs tested and the number of phenotypes tested yields a score of corresponding to a significance level of 0.05 for all 22 studies simultaneously. In order to estimate the chance that the suggestive associations are true positives, we calculated the false-discovery rate (FDR) [57] over the set of p-values for all 22 studies. For each study we excluded all SNPs within 100kb of a SNP with score over 8.4 from the FDR analysis (removing the “true positives” and most SNPs in LD with them from the analysis). The analysis concluded that a cutoff of 7.1 corresponds to an estimated FDR of 5.2% (meaning that 5.2% of the SNPs with scores between 7.1 and 8.4 would be expected to be false positives). A cutoff of 6.5 raises the FDR to 10.4% and a cutoff of 6.0 raises the FDR to 19.8%. Analysis of related individuals We measured identity by descent (IBD) for all pairs of participants using a novel algorithm that acts on unphased data by comparing homozygous calls in a window (Text S2). A set of “unrelated” participants was defined by requiring that no two individuals share over 700 cM IBD, counting both full (diploid) and half (haploid) levels of identity by descent. This level of relatedness (approximately 20% of the genome) corresponds approximately to the minimal expected sharing between first-cousins in an outbred population (Supplementary Figure 1 of Text S2). Population stratification Extensive population structure exists in the customer base as a whole, which includes individuals from around the world. This structure might yield spurious associations if not taken into account [50]. We selected a subset of individuals having northern European ancestry (including western Europe as well) using multi-dimensional scaling (MDS) and a support vector machine (SVM) trained on three datasets containing individuals of known ancestry. Two of these datasets were the 1043 HGDP-CEPH individuals [58] and 326 individuals of European ancestry from Illumina's iControlDB and Peter Gregersen. The third dataset consisted of several hundred customers who reported having four grandparents either of Ashkenazi descent or having lived in a single European country for several countries chosen to complement the existing datasets. See Text S1 for details. Only unrelated (in the aforementioned sense) individuals from this subset were considered during the present association studies. Although the sample included only individuals of northern European ancestry, mild population structure still existed. Inspection of the eigenvalues showed that the first five principal components captured the majority of this structure (Supplementary Figure 2 of Text S1). Thus, to guard against spurious associations we included each individual's first five principal components as covariates in the regression model for those traits that showed association with the principal components. For each phenotype, we also computed genomic control inflation factors ( ) [59]. They were quite close to unity for the adjusted models (Table 1). See Text S1 for details. Phenotypes Phenotype data was collected via 13 surveys administered to research participants via the 23andMe.com web site. See Figure 1 and Text S5 for further details. Inclusion criteria Due to the frequent release of new questions and the continual accumulation of responses, criteria for inclusion of genotype and phenotype data were established before the writing of this paper. We analyzed all phenotypes derived from the 13 surveys released between May and October of 2008 that met our criteria. Only responses and genotypes obtained before January 30, 2010 were used. Our requirements for selection of phenotypes were as follows: At least 1500 responses among unrelated northern Europeans. For binary phenotypes, at least 500 cases. For surveys that were displayed alongside genetic predictions, only respondents who answered before their data was ready were included. For phenotypes with significant correlations with covariates, only respondents who had provided those covariates were included. Twenty-two traits met these criteria, they came from the six surveys “Ten Things About You,” “Ocular Dominance,” “Handedness,” “Optimism,” “Ten More Things About You,” “Footedness” and “Pigmentation.” The phenotypes analyzed in the paper are described in detail below, see Text S5 for details on the others. For the analysis of whether seeing genotypes influenced survey responses, we looked at SNPs that were reported to the customer as influencing five traits analyzed here: rs12896399 and rs1393350 (green eye color), rs12913832 (eye color), rs1805007 (red hair), rs1667394 (hair color), rs4778138 and rs1805007 (freckling). For each SNP, we tested whether seeing their data influenced their responses, controlling for their genotype at that SNP, sex, age, and five principal components. Unadjusted p-values (using a logistic regression) were all over 0.1 except for rs1667394 and hair color, which had a p-value of 0.02. After adjusting for the seven tests performed, we fail to reject the hypothesis that seeing the data did not influence people's responses. Hair curl Participants were asked “Is your hair naturally straight or curly?” Answer choices were presented as a series of six pictures with accompanying descriptive text. Analysis used the codings (0 = “stick straight,” 1 = “Slightly wavy,” 2 = “Wavy,” 3 = “Big curls,” 4 = “Small curls,” 5 = “Very tight curls”) in a linear regression. The pictures are shown in Supplementary Figure 1 of Text S5. Ability to smell the urinary metabolites of asparagus The question asked was “Have you ever noticed a peculiar odor when you pee after eating asparagus?” This phenotype was scored as cases and controls where those who could not smell the odor after eating asparagus were considered cases. Participants who answered that they did not know or did not eat asparagus were excluded. Freckling Participants were asked to compare the amount of freckling on their face, arms, and shoulders to three series of images (Supplementary Figure 2 of Text S5). There were six images in the “arms” and “shoulders” categories and seven images in the “face” category. Each category was scored from zero to five or six and then the three categories were summed, leading to a score between zero and sixteen ranging from not-freckled to heavily freckled. Hair color We performed two analyses of hair color: blond to brown and red versus not red. For blond to brown, we regressed an ordinal coding (0 = blond, 1 = dark blond, 2 = light brown, 3 = reddish brown, 4 = medium brown, 5 = dark brown, 6 = black) for hair color on the ordinal coding for genotype (0, 1, 2) as well as five principal components and sex. For red versus non-red hair color, participants were asked to “describe the amount of red in my hair (before I went gray, if I am gray now)” with available choices “No red at all,” “A tinge of red,” “Some red” and “A lot of red” which were coded from 0 to 3 in a linear regression. Eye color We performed two analyses of eye color: blue to brown and blue versus green. Eye color was assessed by asking participants to match their eye color to a set of 7 pictures (without accompanying text) ranging from blue to dark brown. These were coded as 0 = blue through 6 = dark brown for the main analysis. See Supplementary Figure 4 of Text S5 for the pictures. For blue versus green eye color, blue eyes were treated as controls and greenish-blue or green eyes were treated as cases. Photic sneeze reflex Participants were asked one question for this trait: “Do you have a tendency to sneeze when exposed to bright sunlight?” Available answers were “Yes” and “No, what are you talking about?” People who did sneeze were treated as cases, those who did not were controls. Sweet taste preference Participants were asked “When you're in the mood for a snack, what kind of snack do you usually reach for?” Available answers were “Sweet” “Salty or savory” “Both” or “Neither.” People answering either both or neither were disregarded. People reaching for sweet snacks were treated as cases. Handedness Handedness was scored on an eight-point scale (where 1 was “pure right” and 8 was “pure left”) using the questions and scoring from [60]. The haplotype analysis used Beagle [61] to phase data on chromosome 2 between positions 80375000 and 80523000. To calculate an odds ratio, we collapsed the eight-point scale to a binary scale, using people scored as pure right-handed or right-handed with weak left-handed tendencies as controls and all others as cases. Supporting Information Table S1 All significant and marginal SNPs for the 22 studies. All SNPs with p-values under 10−6 for the 22 studies. (0.04 MB XLS) Click here for additional data file. Text S1 Selection of study individuals by ancestry. (1.17 MB PDF) Click here for additional data file. Text S2 Relatedness. (0.06 MB PDF) Click here for additional data file. Text S3 Genotyping and SNP quality control. (0.04 MB PDF) Click here for additional data file. Text S4 Phenotypes. (0.06 MB PDF) Click here for additional data file. Text S5 Questions for individual phenotypes. (0.18 MB PDF) Click here for additional data file. Text S6 Details for specific associations. (6.54 MB PDF) Click here for additional data file.
              Bookmark
              • Record: found
              • Abstract: found
              • Article: not found

              New common variants affecting susceptibility to basal cell carcinoma.

              In a follow-up to our previously reported genome-wide association study of cutaneous basal cell carcinoma (BCC), we describe here several new susceptibility variants. SNP rs11170164, encoding a G138E substitution in the keratin 5 (KRT5) gene, affects risk of BCC (OR = 1.35, P = 2.1 x 10(-9)). A variant at 9p21 near CDKN2A and CDKN2B also confers susceptibility to BCC (rs2151280[C]; OR = 1.19, P = 6.9 x 10(-9)), as does rs157935[T] at 7q32 near the imprinted gene KLF14 (OR = 1.23, P = 5.7 x 10(-10)). The effect of rs157935[T] is dependent on the parental origin of the risk allele. None of these variants were found to be associated with melanoma or fair-pigmentation traits. A melanoma- and pigmentation-associated variant in the SLC45A2 gene, L374F, is associated with risk of both BCC and squamous cell carcinoma. Finally, we report conclusive evidence that rs401681[C] in the TERT-CLPTM1L locus confers susceptibility to BCC but protects against melanoma.
                Bookmark

                Author and article information

                Journal
                Nat Commun
                Nat Commun
                Nature Communications
                Nature Publishing Group
                2041-1723
                18 July 2016
                2016
                : 7
                Affiliations
                [1 ]Department of Dermatology, Stanford University School of Medicine , Stanford, California 94305, USA
                [2 ]Department of Epidemiology, Richard M. Fairbanks School of Public Health, Melvin & Bren Simon Cancer Center, Indiana University , Indianapolis, Indiana 46202, USA
                [3 ]23andMe Inc. , Mountain View, California 94041, USA
                [4 ]Department of Epidemiology and Biostatistics, Tianjin Medical University Cancer Hospital and Institute, National Clinical Research Center for Cancer, Tianjin & Key Laboratory of Cancer Prevention and Therapy , Tianjin 300060, China
                [5 ]Department of Dermatology, Warren Alpert Medical School, Brown University , Providence, Rhode Island 02903, USA
                [6 ]Department of Epidemiology, School of Public Health, Brown University , Providence, Rhode Island 02903, USA
                [7 ]Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School , Boston, Massachusetts 02115, USA
                [8 ]Department of Epidemiology, Harvard T.H. Chan School of Public Health , Boston, Massachusetts 02115, USA
                [9 ]Department of Biostatistics, Harvard T.H. Chan School of Public Health , Boston, Massachusetts 02115, USA
                Author notes
                [*]

                These authors contributed equally to this work

                [†]

                These authors jointly supervised the work

                Article
                ncomms12048
                10.1038/ncomms12048
                4960294
                27424798
                Copyright © 2016, Nature Publishing Group, a division of Macmillan Publishers Limited. All Rights Reserved.

                This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/

                Categories
                Article

                Uncategorized

                Comments

                Comment on this article