Introduction Differences in population genetic structure and substructure between cases and controls can lead to false positive association tests [1–5]. Interest in this issue has accelerated with the application of whole genome association (WGA) screens for deciphering the genetics of complex diseases. The importance of recognizing and controlling for population structure is magnified when population controls are not closely matched to cases, a process that requires multiple demographic considerations and similar sample acquisition methods. These conditions are difficult and often not practical to fulfill completely. Since many studies focus on participants of European descent, the potential impact of European substructure on association testing has specifically engendered interest [6,7]. In fact, the current study was undertaken as part of an effort to effectively ascertain and adjust for differences in population substructure among cases and controls in our studies of the genetics of rheumatoid arthritis in a participant set that predominantly includes participants of European descent. Recent studies have addressed differences in population substructure and methods to control for these differences in association testing [8–13]. Population substructure can be explored and ascertained using a variety of algorithms that apply principal component analysis (PCA) or non-hierarchical cluster analysis based on allele frequencies in individuals and groups. Unlike other multi-locus adjustments (e.g. genomic control methods ) these newer approaches adjust for the fact that some SNPs have large frequency variations across different populations compared to other SNPs . The ability of these methods to control for large differences in population substructure has been at least partially demonstrated by both real data and simulations [6,11,12]. However, the practical application of these methods and limitations requires more extensive exploration in a variety of real datasets. Recent studies by our group and others have led to the identification of SNP subsets that can provide European substructure information [6,7]; this is consistent with previous work suggesting distinct clines of genetic variation within Europe [15–20]. These European substructure ancestry informative markers (ESAIMs) may be particularly important in large replication studies in which independent sets of case and control genotypes are necessary to confirm and further define associations without the benefit of genome-wide SNP typing. Previous studies have been limited to initial SNP genotyping sets of less than 10,000 SNPs [6,7]. The current study uses 300K to 500K genome-wide SNP data to enhance the ability to define elements of European substructure that were not evident or poorly defined using smaller sets of SNPs. Results Principal Component and Cluster Analyses Show Major Differences between European Populations A set of 952 self-identified participants of diverse European descent genotyped with >300K SNPs was used for the first phase of European population substructure analysis. This participant group predominantly included European Americans as well as smaller numbers of individuals from Italy and Spain (see Methods). In order to reduce potential noise created by continental admixture this study included only those individuals who did not have evidence of non-European continental ancestry (see Methods). The genotypes were examined using the principle component analysis (PCA) algorithm implemented in the EIGENSTRAT program , a computational method that enables rapid analyses of very large datasets. Using multiple criteria including ANOVA, a split half reliability test (see Methods) and a test for normality of distribution, substructure was present in multiple principle components (Table 1). However, most of the variance among the populations was observed in the first principal component (PC). This PC accounted for >5 fold the variance of the second PC. Table 1 Evaluation of Principal Components Analyses in European Populations Using 300 K SNPs The clustering of individuals for PC1 and PC2 corresponded to self-reported regional and ethnic origins (Figure 1A and 1B). This is best illustrated when considering only those participants with the same grandparental country of origin and those individuals that indicated Ashkenazi Jewish ancestry (Figure 1B). Similar to our previous studies using smaller sets of SNPs, the clustering of individuals of Ashkenazi Jewish ancestry does not correspond to grandparental European country of origin, which was diverse . Figure 1 European Substructure Analysis of a Diverse Set of Individuals of European Descent (A) Graphic representation of the first two PCs for 952 individuals genotyped with 300K SNPs. (B) Color code shows subgroup of individuals with more detailed grandparental origin information. Each color-coded individual had 4GP of origin information with the exception of the AJA group. The individuals included 14 Spanish (SPN), 28 Italian (ITN), eight Greek (GRK), 11 German (GERM), 52 IRISH, five United Kingdom (UK), three Scandinavian (SCAN), and two Netherland (NETH). For the Ashkenazi Jewish individuals, 38 had 4GP information (AJA_4GP), and 220 participants were self identified as Ashkenazi Jewish (AJA) but without other information. (C) The STRUCTURE analyses shows results from the same participant set using three random sets of >3,500 SNPs for assessment of the number of population groups (K). The ordinate shows the Ln probability (mean +/− SD) corresponding to the number of clusters. (D) STRUCTURE results under the assumption of two population groups (K = 2). The proportion of each cluster group (population) for each individual is shown by the color code. The first PC showed a gradient that distinguished “southern” or Mediterranean origin from “northern” European ancestry (Figure 1B). The mean +/− SD of the first PC scores for those individuals with the same (or adjoining for Scandinavian) 4 grandparent (GP) country of origin or 4GP Ashkenazi ancestry information were: Irish (51 individuals), mean −0.022 +/− 0.002; Scandinavian (3 individuals), mean, −0.022 +/− 0.002; United Kingdom (5 individuals), −0.020 +/− 0.002; German (11 individuals), −0.016 +/− 0.004; Spanish (14 individuals), 0.004 +/− 0.003; Italian (28 individuals), 0.015 +/− 0.006; Greek (9 individuals), 0.022 +/- 0.011; and Ashkenazi (38 individuals), 0.045 +/− 0.003. For participants self-identified as of Ashkenazi heritage, but who lacked 4 GP information (234 individuals), the mean PC1 score value was 0.043 +/− 0.008. The same dataset was also examined using a Bayesian clustering algorithm (STRUCTURE) . For these analyses we examined three sets of >3500 SNPs that were selected randomly except for the criterion that the minimum inter-SNP distance was >500 Kb (see Methods). This was done to both ensure genome-wide distribution and eliminate linkage disequilibrium between SNPs. This analysis similar to our previously reported studies was most consistent with two population groups (K = 2) explaining the major substructure in this set of European individuals (Figure 1C). The distribution of the individuals (K = 2) was similar to that shown on the first axis of the PCA (Figure 1D) and the individual population contributions were highly correlated with the first PC scores (r2 > 0.95 for each of the three random sets compared with the for the 500K SNP data analyzed by PCA). We also explored whether the PCA was affected by either inclusion or exclusion of specific population groups or the number of individuals in different population groups. Most prominently, a major difference in the relationships among the populations for the second PC was observed when either Ashkenazi Jewish individuals or Irish individuals were excluded (Figure 2). These results suggest some caution in interpretation of specific clines and particular relationships among different European groups (see discussion). Figure 2 Comparison of Principal Component Analysis Excluding Different Individual Groups Color key shows groups as defined in Figure 1. (A) All individuals with 4GP information. (B) Same individual set except exclusion of individuals of Irish descent. (C) Same individual set with exclusion of Ashkenazi Jewish individuals. Identification of SNPs Distinguishing Northern from Southern European Origin For many association tests including candidate genes and replication studies for candidate chromosomal regions it is useful to identify smaller numbers of SNPs that can distinguish European substructure. Previous studies including our own utilized genome-wide SNP sets of ≤10K SNPs. To identify a more robust set of SNPs that could distinguish the largest component of substructure observed in the current data we used the genotypic differences observed in >300K SNPs between two groups of individuals, 150 Ashkenazi Jewish and 125 Northern European individuals. The Ashkenazi Jewish individuals were chosen since 1) this individual group was most clearly distinguishable from the Northern European individuals, 2) might more closely represent an “older” population of Mediterranean origin and 3) we had substantial number of genotyped individuals to enable a good representation of this population. To select the most informative SNPs distinguishing between these groups we determined the informativeness (In)  for each of >300K SNPs. The 20,000 SNPs with the highest In values were then selected to capture the most informative SNPs. To ensure both a more uniform genome-wide distribution and minimize linkage disequilibrium the set of putative European substructure ancestry informative markers (ESAIMS) were chosen to obtain the markers with highest In with a minimum inter-SNP distance >500 Kb. This resulted in a set of 1441 SNPs (Table S1). The STRUCTURE results (K = 2) from individuals with 4 grandparental data (not used for ESAIM selection) showed separation of most of the 220 self-identified individuals of Ashkenazi Jewish heritage (mean 83% south; median, 87%) from 37 individuals of Western, Northern or Central heritage belonging to the “northern” group (mean 4% south; median, 3%), and 51 individuals of Greek, Italian, or Spanish origin were intermediate (mean 41% south; median, 42%) (Figures 3 and S1). These 1441 north/south-ESAIM showed small confidence limits in the assignments; of the total of 677 individual individuals not used in ESAIM selection the maximum 90% Bayesian confidence interval (CI) was 21.1% (e.g. 13.7 % south, 90% CI 2.6 % – 23.0 %) and the median CI was 13.9%. Smaller north/south-ESAIM sets showed strong correlations with the 1441 set e.g. 384 ESAIMs (r2 = 0.970) (Figure S2). However, the smaller north/south-ESAIM sets showed somewhat broader confidence limits (e.g. 384 north/south ESAIM set showed maximum CI = 38.9% and a median CI = 17.1%. However, these differences are unlikely to affect most studies. The larger number of north/south-ESAIMs may be useful if a very homogeneous set of individuals of a particular ethnic group is desired for a specific study. Figure 3 STRUCTURE Analysis Using 1,400 ESAIMs Selected for North/South Information Analysis was performed without any prior population assignment using STRUCTURE under the assumption of two population groups (K = 2). The results are shown for only individuals not used in selection of the north/south-ESIAMs. The individual individuals and 90% confidence limits are shown for selected groups with ethnic and grandparental origins (see Figure S1 for entire results). The individuals grouped by self identification included: Ashkenazi 4GP; Ashkenazi Jewish (without 4GP information) (AJA); Greek (GRK); Italian (ITN); Spanish (SPN); German (GERM); Scandinavian (SCAN); United Kingdom (UK); and Irish. Further Analysis of Northern European Populations Although the STRUCTURE analysis was most consistent with two population groups explaining most of the substructure within Europe, the distribution of individuals from different countries of origin along the second axis in the PCA (Table 2; Figure 1B) suggested that further analysis of substructure was warranted. This substructure was examined using individuals of “northern” European ancestry in the context of a large dataset of rheumatoid arthritis cases and controls (over 2000 total individuals) that were recently genotyped with >500K SNPs as part of the NARAC studies (see Methods). For these PCA we examined only those European individuals that showed >90% membership in the northern European group by STRUCTURE analysis using the 1441 north/south-ESAIMs. This criterion closely matched the individual distribution along the first principal component axis of this dataset (Figure S3). Controlling for this first vector in analysis of cases vs. controls decreased the inflation of the median chi-square distribution using the genomic controls parameter (λgc) from 1.43 to 1.15. Table 2 Summary of Principal Component and STRUCTURE Results for Northern European Population Groups PCA of the “north” only subset showed substantial substructure differences in the distribution of North American Rheumatoid Arthritis (NARAC) cases and controls along the first PC (Figure 4). Importantly, we controlled for this difference in our genome-wide association scan and excluded SNPs that showed association based on this substructure difference . The distribution of individuals in this PC showed a distinct pattern with respect to the context of country of origin information that was available for a subset of control individuals (Figure 4B). Most notably, Irish individuals were distinguished from those of eastern, northern and central European descent. These relationships were further defined by inclusion of additional individuals with the same country of origin genotyped with the 300K SNP set (Table 2). Similar results were also observed using a STRUCTURE analysis of the same dataset (Table 2). The results suggest that the difference in numbers of individuals of Irish ancestry was primarily responsible for the major difference in substructure observed in the NARAC cases and controls . Controlling for this aspect of substructure the λgc in this individual set decreased from 1.15 to 1.07. Since the sample set had a disproportionately large contribution of participants of Irish ancestry we also examined a small set of individuals with nearly proportionate representation of Irish, German, Eastern European, and United Kingdom individuals. Similar to the results on the larger set of individuals, these PCA results showed a west-east gradient (Figure S4). Here however, there was no difference observed between the Irish and UK individuals. Thus, these results further indicate that the number of individuals from each individual group may partially alter relationships among individual groups. Figure 4 Analysis of European Substructure in Northern European Individuals (A) The first two PCs are depicted for RA cases and NYCP controls. (B) Color codes show the Irish contribution to each individual with at least two GP country of origin information in the sample set shown in (A), e.g., the 2GP Irish individuals have 2GP Irish origin and 2GP unknown or USA origin; Not Irish includes only individuals without known Irish ancestry and with at least 2GP information; mixed Irish are those individuals with at least one GP Irish and one GP non-Irish. (C) Analysis using 1,211 ESAIMs selected for differences along PC1 in northern European individuals (see Results). Principal Component Analysis also Shows Other Aspects of Genomic Architecture Inspection of the second axis of the Northern European subset (see Figure 4A and 4B), also showed an unexpected grouping of individuals on the Y axis into three separate groups. When we ascertained informative SNPs between the top and bottom groups, all of the SNPs with In values >0.02 were found to be located in a 3.8 Mb segment of human Chromosome 8 (8.135 – 11.936 Mb). This region has been previously shown to contain a common inversion within European populations [24,25]. When only SNPs within this interval were used the distribution of the individuals formed the same grouping of three clusters as found using the entire 500K set (data not shown). As expected two dominant haplotypes (A and B) were ascertained with twenty selected markers with very large Ins and described the same three individual groups (AA, AB, and BB) and were highly correlated (r2 = 0.83) (Figure 5). Although the λgc in the entire NARAC case-control dataset is decreased from 1.073 to 1.048 by considering this axis, our analyses indicate that the position of individuals on this axis is almost completely due to this localized inversion. This region is presumably identified by PCA because of the long stretch of linkage disequilibrium caused by the chromosomal inversion. Figure 5 Principal Component Analysis Shows Chromosome 8 Inversion The selected informative SNPs from a 3.8 Mb segment of Chromosome 8 shows the same PC score distribution as the entire SNP set for the second PC in analysis of “northern” European individuals. The graph shows the position of each of 382 tested individuals for the second axis in the PCA using 500K SNPs (ordinate) and the position based on analysis using 20 selected SNPs from the 3.8 Mb segment of Chromosome 8 (abscissa). The 20 selected SNPs were those with the highest In between the outer groups in an independent dataset separated by a minimum of 50 kb. Selection of ESAIMs for Northern European Population Studies Another set of ESAIMs (north-ESAIMs) was ascertained using the results of the first PC scores of the “northern” European only analysis. We selected two disparate sets of individuals comprised of 93 and 132 individuals, by randomly selecting half of the individuals with PC scores one standard deviation above the mean and half of the individuals with PC scores one standard deviation below the mean. We used this procedure to provide both a distribution of allele frequencies in the disparate individual sets as well as maintain a well distributed set of individuals for evaluating the functional performance of the putative north-ESAIMs. These ESAIMs were then selected using the same method (In values) and criterion (minimum inter-SNP distance = 500 kb). We initially examined the best 1250 north-ESAIMs with 1608 individuals that had not been included in any of the ESAIM selections. Initial evaluation of this north-ESAIM set showed a distortion of the PCA in which the individuals were divided in three groups diagonally across the first two axes. Deletion of markers within the Chromosome 8 inversion (see above) resulted in a set of 1211 SNPs that no longer showed this pattern. This north-ESAIM set distinguished “northern” European individuals in a pattern similar to that observed using the 500K SNP set along the first PC (Figure 4C and see Table S2 for SNP list). The PC scores using these north-ESAIMs in the “northern” European only set correlated with the 500K first PC result; r2 = 0.46 (p 0.05) or the STRAT analysis (p > 0.05). For IRF4, the association becomes stronger after correcting for the north/south difference (Eigen statistic 1). However, when the second PC is considered the signal is greatly diminished. With the combined ESAIMs (north/south and north), the evidence for association is also greatly diminished by the EIGENSTRAT analysis and eliminated in the STRAT analysis. We examined different sets of ESAIMs including several different combinations of north/south ESAIMs and north-ESAIMs. The results were identical when 192 north/south ESAIMs or 384 north/south ESAIMs were used for the PC1 correction (data not shown). However, as expected based on our PCA results, decreasing the 1211 north-ESAIMs led to poorer PC2 correction and less complete correction of the false positive IRF4 association (see footnote Table 3). Together these results suggest the potential application of ESAIMs in association studies of candidate genes or in replication studies (see discussion). Discussion The current study provides additional insight into European substructure and differences among different ethnic groups that may impact our understanding of the genetics of complex diseases. First, together with our recent report of a whole genome association study for RA in European Americans, this report emphasizes the importance of controlling for substructure in the ascertainment of putative susceptibility associated SNPs. Most notably without an analysis of substructure, IRF4 would appear as a very strong candidate for this disease. However, the large differences in allele frequency for this gene are largely due to the difference in allele frequency among different European subpopulation groups. Furthermore, this difference is accentuated when the northern population subgroup is examined. When only NYCP controls are considered an IRF4 SNP (rs12203592) showed the largest allele frequency difference between the Irish individuals and those individuals of Northern, Central European and Eastern European descent (δ = 0.40, Fst = 0.27). Using an algorithm based on the PCs, EIGENSTRAT, this SNP no longer appears significantly associated with RA. The difference in allele frequency for IRF4 within European populations has recently also been described by the Welcome Trust Case Control Consortium study . The current study extends and complements other studies showing evidence of European substructure. Overall, the current results are consistent with a major north/south (or northwest/southeastern) gradient as the largest difference within European groups confirming both our previous studies and others using up to 10,000 markers and is generally consistent with much earlier studies using classic gene-frequency data [15,18]. The current results differ from previous studies in defining a northern European axis that was critically important in the case control analyses . The relationship between the population groups was consistent when analysis was restricted to “northern” European population groups. As discussed further below, when more disparate populations are examined (including different “southern” populations) these relationships are not as clearly defined (see Figure 2). Thus, differences in these results compared to other studies can in part be attributed to both inclusion of different population groups and perhaps complex relationships reflecting different population origins that includes migration, admixture, and isolation. In addition, the much larger SNP set, 300K compared to maximum of 10K SNPs in previous studies, is also likely to have exposed aspects of substructure not evident in other studies. For PC's >1, comparison of sets of 40K SNPs (Table S3). Our results also suggests the potential for further definition of more homogeneous population groups for genetic studies that may theoretically decrease both type 1 and type 2 error rates. Geneticists have long recognized that different population groups may provide enhanced opportunities to uncover susceptibility loci based on more limited genetic heterogeneity. For complex genetic diseases some specific studies may focus on particular population groups to enhance the power to find important gene variants. For example, the study of Crohn's disease [#266600] in Ashkenazi Jewish individuals has the advantage of examining a potentially more homogeneous population with a higher frequency of this particular disease than in a mixed European population. This approach is supported by our results suggesting that a very large proportion of this particular ethnic group can be distinguished by analysis of substructure. Moreover, our results provide the ability to further define and restrict this study population by allowing the identification and exclusion of subgroup outliers in association tests in studies of complex genetics in Ashkenazi Jewish populations. In addition, pre-genotyping of potential cases and controls with as few as several hundred north/south-ESAIMs could enable pre-identification of a more homogenous subgroup for WGA or be utilized in candidate SNP replication studies to reduce error rates. With respect to identification of population substructure there are several limitations in the current study. First, analyses are based on a diverse set of individuals of European descent with variable ancestral contributions from different European countries that is only partially defined. This limits certain conclusions with regards to specific aspects of substructure related to population subgroups. However, we believe that the concordant grouping of the majority of participants with grandparental information provides strong support for the major relationships and differences in these population groups. The overall strong correlation between results using principal components and those using a Bayesian clustering algorithm provide additional confidence in the general results. Second, the PCA is sensitive to differences in the inclusion or exclusion of specific population groups. When the second axis is considered for the entire European group, we observed changes in the country-of-origin order for the northern group with respect to the southern group in subset analyses (Figure 2). We speculate that this observation may reflect the difference in the origins of the additional substructure in the northern group compared to the other elements of substructure in the southern group. This result suggests that overall geographic suggestions of clines based on principal components must be cautiously interpreted. Third, the PCA can be dramatically affected by differences in relatively small genomic regions that may not reflect true population substructure. This is illustrated by our finding that the second axis in the “northern” European analysis (also observed for the third axis in the entire European set) is dependent solely on a 90% European ancestry by STRUCTURE analysis were included in these studies. A total of 51 NYCP individuals (from 1255) and 31 RA individuals (from 900) were excluded. Genotyping. Genotyping was performed according to the Illumina Infinium 2 assay manual (Illumina, San Diego), as previously described . The dataset was filtered for individuals with >10% missing genotypes, and SNPs with >10% missing data, Hardy-Weinberg equilibrium (HWE) (p < 0.00001) and individual samples for evidence of possible DNA contamination, cryptic family relationships. Statistical analyses. Fst was determined using Genetix software  that applies the Weir and Cockerham algorithm , and δ was calculated by determining the absolute value of the allele frequency difference between two populations. A measure of informativeness for each SNP (In) was determined using an algorithm previously described . Linkage disequilibrium was examined using the Genetix software . False discovery rate statistics  were determined using HelixTree 5.0.2 software (Golden Helix, Bozeman, MT, USA). Population structure was examined using STRUCTURE v2.1 [21,40]. Each STRUCTURE analysis was performed without any prior population assignment and was performed using 10,000 replicates and 5000 burn-in cycles under the admixture model applying the infer α option with a separate α estimated for each population under the F model (where α is the Dirichlet parameter for degree of admixture). Runs were performed under the λ = 1 option where λ parameterizes the allele frequency prior and is based on the Dirichlet distribution of allele frequencies. A uniform prior distribution of allele frequencies over all loci is used when λ = 1. Structured association was performed using the STRAT software  that performs association tests with population structure information that is provided by a prior analysis with STRUCTURE . For initial STRUCTURE analyses we selected random SNPs based on a minimum inter-SNP distance 500 kb; there was no evidence for LD among adjacent markers in each self identified ethnic set (r2 < 0.2). The selected sets contained 3500 to 4500 SNPs that were suitable for STRUCTURE analyses. Larger SNP sets have extraordinary computational time requirements for accurate estimates of the parameter values when applied to studies with large sample sizes. PCA, PCA control for association testing, and determination of the genomic control parameter (λgc)  was determined using the EIGENSTRAT statistical package . Several tests were used to assess the significance of PCA. As suggested previously , both analysis of variance (ANOVA) and a split half reliability test adjusted by the Spearman-Brown formula  were performed. The ANOVA examined the statistical significance of the difference in PC scores among individual groups pre-assigned based on self-identification. The split half reliability test can determine whether independent (non-overlapping) SNP sets provide the same or different results. Unlike ANOVA this test does not rely on correct pre-knowledge of group assignment. For the absence of population structure the null hypothesis is that there will be no correlation in the PCA results. The split half reliability test was performed three times using 1) alternate chromosomes, 2) alternate half chromosomes, and 3) half genome SNP sets. These sets were chosen to eliminate any dependency in each test between the two half datasets based on linkage disequilibrium. Thus, correlation of the independent SNP sets should be due to similar substructure. In addition, the current study also examined whether the distribution of individuals in each principal component (PC) was normally distributed using the Shapiro and Wilk's W-statistic test for normality . In the absence of population structure, the null hypothesis is that the data will be normally distributed. PCA can be sensitive to quality control issues that can give rise to spurious clustering . Several factors in our design and execution mitigate against this possibility. First the individuals from different ancestry groups and the Irish group in particular were randomly distributed over plates. Furthermore, the genotyping of approximately half the individuals was performed separately. Comparison of the first run and the second run showed very similar results with respect to the distribution of self identified ancestry groups. As indicated in the methods, we used both genotype completeness as well as a loose (p < 0.00001) HW exclusion to exclude SNPs with genotype artifacts. Finally, as shown in Table S3, independent random sets (three) showed very strong correlations with the 300K set for PC1 and PC2 (r2 all above 0.93). Supporting Information Figure S1 Complete STRUCTURE Results Using 1,441 ESAIMs Selected for North/South Information (909 KB TIF) Click here for additional data file. Figure S2 Correlation of Individual Substructure Information Using Different Numbers of ESAIMs Informative for “North/South” European Substructure (777 KB TIF) Click here for additional data file. Figure S3 Graphic Depiction of the “Southern” European Individuals Excluded on the Basis of STRUCTURE Results Using North/South ESAIMs (1.4 MB TIF) Click here for additional data file. Figure S4 Analysis of European Substructure in 42 Northern European Individuals (1.1 MB TIF) Click here for additional data file. Table S1 North/South European Substructure Ancestry Informative Markers (147 KB RTF) Click here for additional data file. Table S2 European Substructure Ancestry Informative Markers Distinguishing Northern European Populations (131 KB RTF) Click here for additional data file. Table S3 Correlation of Results Using Different Numbers of Random SNPs (53 KB RTF) Click here for additional data file.