- Record: found
- Abstract: found
- Article: not found

Authors: S Pichler, Jill C. Graff, Joe T R Clarke, C. Brayne, A L DeStafano, G. Spalletta, Caroline Graff, B Kunkle, F Lin, H Soininen, S. Choi, D Harold, Erwin R. Schmidt, P Barberger Gateau, Kristelle Brown, Sam D Blacker, Paul Gilbert, P Caffarra, P Hyslop, S Sorbi, Chien-Jen Wang, Lihong Wang, M Nalls, Teresa A. Myers, Peter Holmans, Minerva M. Carrasquillo, D Tsuang, F Pasquier, K Ritchie, M Arfan Ikram, M Bullido, Alison M Goate, Oscar Lopez, M Ingelsson, O Valladares, Garry S A Myers, Carol Brayne, C Tzourio, M Hiltuenen, Mark Bennett, S Moebus, Walter Kukull, J. Bis, Gui F. Zou, L Farrer, Edward C Holmes, Thomas A Smith, Jonathan L. Haines, P Proitsi, M Bihoreau, Christiaan van der Schoot, M. Deniz, J Dartigues, John Powell, Janet Johnston, Tatiana M. Foroud, A Gerrish, M. Owen, A Means, Debby Tsuang, John Weger, V. Deramecourt, L Lannefelt, A Rotter, Gary W. Beecham, M Huentelman, D Wallon, B Psaty, Dan Rujescu, J Clarimon, Lisa Jones, M Dunstan, D Galimberti, Martin Wells, M D Zompo, P Jäger, K Sleegers, C Berr, G. Mark Lathrop, C Dufouil, Kevin Brown, Margaret A. Pericak-Vance, T Montine, Curtis S. Younkin, D Gallacher, N D Amin, Edward A. Bayer, Antonio F. Moron, John Hardy, William Evans, G Eiriksdottir, Laura Cantwell, Kathy-Anne Clarke, Alexis Brice, Kevin Morgan, H Hampel, John S. Kauwe, Rozemeire G M Del Nero, V Chouraki, Andrea N. Jones, W Gu, Meei-Ju Lin, B Nacmias, I Mateo, David E. Larson, M. Mancuso, N Fievet, Mike A Nalls, Edmond Choi, F Panza, J. Collinge, A Goate, P Bossù, W Kukull, Michel Hofman, Douglas R. Green, Nick Fox, Ana Frank García, Duncan Ruiz, Elizabeth L Cantwell, Ellen E. Martin, D Zelenika, T Mosley, S M O'Donovan, Badri N. Vardarajan, K Faber, Sebastian Russo, Hakon Hakonarson, Richard Harris, C Razquin, Rosemeire C P Pastor, P. Bosco, B McGuiness, Meic H. Schmidt, Adam Naj, John. Sims, Celia Coto, Tracy Keller, Andrea L. Launer, C Bellenguez, Anita C. Thomas, Kathryn Lunetta, Bernardo Alvarez, S Lovestone, Tara G. Martin, C Ibrahim, Christopher Lambert, M Mayhaus, Filip De Keyser, Tim Harris, Y Kamatani, Melanie L. Dunstan, W. Maier, P Jonsson, A. Pilotto, Jean Ruiz, M Ilyas Kamboh, P Amouyel, Steffany Bennett, David Craig, John R. Gilbert, Markus Nöthen, Ekaterina Rogaeva, B Grenier-Boley, A Lleó, Gerard D. Schellenberg, S Seshadri, A Brice, Clinton F Matthews, L Yu, Tony George, P Mecocci, Michael Gill, Valentina Moskvina, E Boerwinkle, D Hannequin, David C Rubinsztein, Carlos Cruchaga, M Riemenschneider, O Combarros, K. Hamilton, L Letenneur, R Mayeux, Helen C O'Donovan, J S K Kauwe, L Fratiglioni, Gregory P. Passmore, S Love, Thomas Montine, F. Sánchez, M Tsolaki, M Boada, Stephen Todd, Sally Mead, D Beekly, K Lunetta, Leena Ibrahim, C Reitz, K Bettens, Preston T. J. A. Williams

Publication date: 2013-11-30

Journal: Nature genetics

Keywords: Genetic Loci, Polymorphism, Single Nucleotide, Middle Aged, Male, Humans, statistics & numerical data, Genome-Wide Association Study, Genetic Predisposition to Disease, Female, Cohort Studies, Case-Control Studies, genetics, epidemiology, Alzheimer Disease, Aged, 80 and over, Aged, Age of Onset

Eleven susceptibility loci for late-onset Alzheimer's disease (LOAD) were identified by previous studies; however, a large portion of the genetic risk for this disease remains unexplained. We conducted a large, two-stage meta-analysis of genome-wide association studies (GWAS) in individuals of European ancestry. In stage 1, we used genotyped and imputed data (7,055,881 SNPs) to perform meta-analysis on 4 previously published GWAS data sets consisting of 17,008 Alzheimer's disease cases and 37,154 controls. In stage 2, 11,632 SNPs were genotyped and tested for association in an independent set of 8,572 Alzheimer's disease cases and 11,312 controls. In addition to the APOE locus (encoding apolipoprotein E), 19 loci reached genome-wide significance (P < 5 × 10(-8)) in the combined stage 1 and stage 2 analysis, of which 11 are newly associated with Alzheimer's disease.

- Record: found
- Abstract: found
- Article: not found

Shaun Purcell, Benjamin M. Neale, Kathe Todd-Brown … (2007)

Whole-genome association studies (WGAS) bring new computational, as well as analytic, challenges to researchers. Many existing genetic-analysis tools are not designed to handle such large data sets in a convenient manner and do not necessarily exploit the new opportunities that whole-genome data bring. To address these issues, we developed PLINK, an open-source C/C++ WGAS tool set. With PLINK, large data sets comprising hundreds of thousands of markers genotyped for thousands of individuals can be rapidly manipulated and analyzed in their entirety. As well as providing tools to make the basic analytic steps computationally efficient, PLINK also supports some novel approaches to whole-genome data that take advantage of whole-genome coverage. We introduce PLINK and describe the five main domains of function: data management, summary statistics, population stratification, association analysis, and identity-by-descent estimation. In particular, we focus on the estimation and use of identity-by-state and identity-by-descent information in the context of population-based whole-genome studies. This information can be used to detect and correct for population stratification and to identify extended chromosomal segments that are shared identical by descent between very distantly related individuals. Analysis of the patterns of segmental sharing has the potential to map disease loci that contain multiple rare variants in a population-based linkage analysis.

- Record: found
- Abstract: found
- Article: not found

Bryan Howie, Peter Donnelly, Jonathan Marchini (2009)

Introduction Genome-wide association studies have identified many putative disease susceptibility loci in recent years [1]–[3]. This approach to studying disease has succeeded largely because of improved catalogues of human genetic variation [4] and advances in genotyping technology, but it has also been bolstered by the rise of genotype imputation methods [5]–[8], which have allowed researchers to tease increasingly subtle signals out of large and complex genetic datasets [9],[10]. Imputation methods work by combining a reference panel of individuals genotyped at a dense set of polymorphic sites (usually single-nucleotide polymorphisms, or “SNPs”) with a study sample collected from a genetically similar population and genotyped at a subset of these sites. Figure 1 shows a schematic example of such a dataset. Imputation methods predict unobserved genotypes in the study sample by using a population genetic model to extrapolate allelic correlations measured in the reference panel. The imputed genotypes expand the set of SNPs that can be tested for association, and this more comprehensive view of the genetic variation in a study can enhance true association signals and facilitate meta-analysis [9],[10]. 10.1371/journal.pgen.1000529.g001 Figure 1 Schematic drawing of imputation Scenario A. In this drawing, haplotypes are represented as horizontal boxes containing 0's and 1's (for alternate SNP alleles), and unphased genotypes are represented as rows of 0's, 1's, 2's, and ?'s (where ‘1’ is the heterozygous state and ‘?’ denotes a missing genotype). The SNPs (columns) in the dataset can be partitioned into two disjoint sets: a set T (blue) that is genotyped in all individuals and a set U (green) that is genotyped only in the haploid reference panel. The goal of imputation in this scenario is to estimate the genotypes of SNPs in set U in the study sample. To date, most imputation analyses have used reference panels composed of haplotypes from Phase II of the International HapMap Project, together with study samples genotyped on commercial genome-wide SNP arrays. Figure 1 depicts this arrangement, which we call Scenario A. To understand how imputation methods work in this setting, it helps to observe that the SNPs exist in a natural hierarchy, such that they can be partitioned into two disjoint sets: a set T that is typed in both the study sample and the reference panel, and a set U that is untyped in the study sample but typed in the reference panel. Informally, most imputation methods phase the study genotypes at SNPs in T and look for perfect or near matches between the resulting haplotypes and the corresponding partial haplotypes in the reference panel—haplotypes that match at SNPs in T are assumed to also match at SNPs in U. This is the fundamental basis of genotype imputation. Several important points emerge from this description. First, the accuracy with which the study haplotypes are phased at SNPs in T should determine how well they can be matched to haplotypes in the reference panel, which should in turn influence the accuracy of imputation at SNPs in U. Second, accounting for the unknown phase of the SNPs in T can be computationally expensive; if the haplotypes at these SNPs were known, most methods would be able to impute genotypes at SNPs in U more quickly. Third, many existing methods do not use all of the available information to phase the study genotypes at SNPs in T. In principle, a phasing algorithm should be able to “learn” about desirable phasing configurations for a given study individual by pooling information across the reference panel and all other individuals in the study, and the phasing accuracy should increase with the sample size; in standard practice, most imputation methods gain phasing information about each study individual only from the reference panel, and phasing accuracy does not depend on the size of the study sample. (This description applies to imputation methods based on hidden Markov models, or “HMMs” [6],[11]; non-HMM methods often discard other kinds of information.) The BEAGLE imputation model [12],[13] is one notable exception to this point, and we discuss its alternative modeling strategy in detail in this work. We have developed a new algorithm that seeks to improve imputation accuracy at untyped SNPs by improving phasing accuracy at typed SNPs, building on the points raised above. Most HMM-based imputation methods simultaneously estimate missing genotypes and analytically integrate over the unknown phase of SNPs in T. By contrast, we propose to alternately estimate haplotypes at SNPs in T and impute alleles at SNPs in U, assuming the haplotype guesses are correct. We account for the phasing uncertainty in the data by iterating these steps in a Markov chain Monte Carlo (MCMC) framework. Separating the phasing and imputation steps allows us to focus more computational effort on phasing and use more of the available information; the extra computation used in this step is largely balanced by the quick haploid imputation in the step that follows. This approach can improve imputation accuracy in Scenario A, as we show in the Results section, but another major motivation of this work is to extend IMPUTE [6] to handle “next-generation” association datasets. By this, we refer to studies in the near future that will have access to additional reference data that could inform imputation. Next-generation reference panels will present new challenges for imputation, including larger sample sizes; unphased and incomplete genotypes; and multiple reference panels containing different SNP sets. Our method aims to use the principles outlined above to address these challenges and improve imputation accuracy in next-generation studies. One new data configuration, which we call Scenario B and explore in detail in the current study, is presented in Figure 2; we will address other next-generation reference panels in the Discussion. In Scenario B, there are different amounts of genotype data in different cohorts of a study. For example, the Wellcome Trust Case Control Consortium (WTCCC) is currently performing an association study in which 6,000 controls will be genotyped on both the Affymetrix 6.0 and Illumina 1 M SNP chips, whereas disease cohorts will be genotyped only on either the Affymetrix 6.0 chip or the Illumina 670 k chip. In other words, a large set of controls will be genotyped at a subset of HapMap SNPs, and each case cohort will be genotyped at a subset of the SNPs typed in the controls. Published studies have already employed this design [14], and it may become more prevalent in the future as common sets of population controls become more widely available. 10.1371/journal.pgen.1000529.g002 Figure 2 Schematic drawing of imputation Scenario B. In this drawing, haplotypes are represented as horizontal boxes containing 0's and 1's (for alternate SNP alleles), and unphased genotypes are represented as rows of 0's, 1's, 2's, and ?'s (where ‘1’ is the heterozygous state and ‘?’ denotes a missing genotype). The SNPs (columns) in the dataset can be partitioned into three disjoint sets: a set T (blue) that is genotyped in all individuals, a set U2 (yellow) that is genotyped in both the haploid and diploid reference panels but not the study sample, and a set U1 (green) that is genotyped only in the haploid reference panel. The goal of imputation in this scenario is to estimate the genotypes of SNPs in set U2 in the study sample and SNPs in the set U1 in both the study sample and, if desired, the diploid reference panel. In Scenario B, the study individuals genotyped on a larger number of SNPs can be used as an unphased, or “diploid”, reference panel for imputation in the remaining samples (which do not necessarily have to be cases). As before, we approach such a dataset by partitioning the SNPs into disjoint sets, named with reference to the study sample: a set U1 that is untyped in the study sample and typed only in the haploid reference panel, a set U2 that is untyped in the study sample and typed in both the haploid and diploid reference panels, and a set T that is typed in all samples. We apply the same inference principles to Scenario B as to Scenario A: at each MCMC iteration we phase all of the observed data, pooling information across samples typed on common sets of SNPs to estimate each haplotype pair, then perform haploid imputation assuming that all of the haplotype guesses are correct. One novelty of this scenario is that, at SNPs in U2 , the reference panel may contain thousands of chromosomes, in contrast to HapMap Phase II panels that contain only 120–180 chromosomes each. In principle, this added depth should improve imputation accuracy at SNPs in U2 , with notable gains at rare SNPs. The latter point is especially relevant because rare SNPs are an important source of power in imputation analyses [5],[6]. Scenario B also introduces the problem of multiple reference panels genotyped on different, hierarchical sets of SNPs. Many next-generation imputation datasets will follow this paradigm, which presents modeling challenges that remain largely unexplored. In the sections that follow, we describe the details of our new method as applied to the scenarios in Figure 1 and Figure 2. We then compare the method with other imputation approaches on real datasets from the United Kingdom that emulate Scenarios A and B. We show that our method can attain higher accuracy than existing methods in Scenario A, but that the absolute gains are small, which we attribute to the inherent limitations of a small set of reference haplotypes. In an example of Scenario B, we demonstrate that our method can use a large unphased reference panel to achieve higher accuracy than imputation based on the HapMap alone. We also show that our method can impute genotypes more accurately than other sophisticated [11],[13] and simpler [15] methods applied to the same dataset, and that our approach has higher sensitivity and specificity to detect copies of the minor allele at rare SNPs. In addition, we present results that highlight important practical advantages of our imputation modeling strategy over the one used by BEAGLE. We have implemented our new imputation method as an update to our existing software package IMPUTE; the new program is called “IMPUTE version 2” (IMPUTE v2). We refer to our previously published method [6] as “IMPUTE version 1” (IMPUTE v1). Materials and Methods Software IMPUTE v1 and IMPUTE v2 are freely available for academic use from the website http://www.stats.ox.ac.uk/~marchini/software/gwas/gwas.html Scenario A In Scenario A, IMPUTE v2 estimates marginal posterior probabilities of missing genotypes by alternately phasing all of the SNPs in T in the study sample (simultaneously imputing any sporadically missing genotypes) and then imputing study genotypes at the SNPs in U, conditional on the haplotype guesses from the first step. To explain this process in more detail, we begin by defining , the set of known reference haplotypes at SNPs in T and U (i.e., the entire reference panel); , the set of known reference haplotypes at SNPs in T; and , the set of unobserved study haplotypes at SNPs in T. If there are NS individuals in the study sample, their haplotypes at SNPs in T can be represented as , where is the haplotype pair for study individual i. The method begins by choosing initial guesses for the haplotypes in – by default, we choose haplotypes that are consistent with the observed genotype data but phased at random. We then perform a number of MCMC iterations. Each iteration updates every study individual i (in some arbitrary order) in two steps: Sample a new haplotype pair for individual i at SNPs in T. This is accomplished by sampling from the conditional distribution , where is individual i's multilocus genotype at SNPs in T, contains current-guess haplotypes at SNPs in T for all study individuals except i, contains the reference panel haplotypes at SNPs in T, and ρ is the fine-scale, population-scaled recombination map for the region of interest. We describe this distribution further below. Impute new alleles (in two independent haploid steps) for SNPs in U, conditional on , , and ρ. We typically run the method for a relatively small number of burn-in iterations that invoke only the phasing step, followed by a larger number of main iterations that include both steps and contribute to the final imputation probabilities. We investigate the convergence properties of the method in Text S1, Figure S1, and Table S1. In Step 1, the algorithm phases individual i's observed genotype by sampling from . The model we use to specify this conditional distribution is essentially the same one used by IMPUTE v1 [6] – i.e., we use a hidden Markov model that is based on an approximation to the coalescent-with-recombination process [16]. This model views newly sampled haplotypes as “imperfect mosaics” of haplotypes that have already been observed. As with IMPUTE v1, we use an estimated fine-scale recombination map [17] for SNP-to-SNP transition probabilities and a result from population genetics theory [6] for emission probabilities, which model historical mutation. One difference between versions is that IMPUTE v1 analytically integrates over the unknown phase of the genotypes in the study sample, whereas IMPUTE v2 uses Step 1 to integrate over the space of phase reconstructions via Monte Carlo. This step is accomplished for each individual by sampling a pair of paths through the hidden states (haplotypes) of the model, then probabilistically sampling a pair of haplotypes that is consistent with the observed multilocus genotype. Path sampling is a standard operation for HMMs, although in this case the calculation burden can be reduced by careful inspection of the equations for the HMM forward algorithm [11]. By default, the state space of the model in Step 1 includes all of the known haplotypes in and the current-guess haplotypes in . The computational burden of these calculations (both in terms of running time and memory usage) grows quadratically with the number of haplotypes and linearly with the number of SNPs. We later propose approximations to make these calculations more tractable on large datasets. In Step 2, the algorithm uses each of the haplotypes in (which were sampled in Step 1) to impute new genotypes for SNPs in U. The HMM state space for this step includes only the reference panel haplotypes . The imputation is accomplished by running the forward-backward algorithm for HMMs independently on each haplotype in and then analytically determining the marginal posterior probabilities of the missing alleles – this process is simply a haploid analogue of the one used by IMPUTE v1. If we assume that both haplotypes were sampled from a population that conforms to Hardy-Weinberg Equilibrium (HWE), it is straightforward to convert these allelic probabilities to genotypic probabilities for individual i. Across iterations, we can then sum the posterior probabilities for each missing genotype as if they were weighted counts; at the end of a run, the final Monte Carlo posterior probabilities can be calculated by renormalizing these sums. By contrast with Step 1, the computational burden of these calculations grows only linearly with the number of haplotypes. Consequently, Step 2 can usually avoid the approximations needed to make Step 1 feasible, thereby allowing us to make full use of even very large reference panels. By using both the reference panel and the study sample to inform phasing updates in Step 1, IMPUTE v2 uses more of the information in the data than most comparable methods [6],[11], which typically account for phase uncertainty using only the reference panel. At the same time, each iteration is relatively fast because untyped SNPs are imputed in a haploid framework rather than the more computationally intensive diploid framework that is used by other HMM methods. For example, one iteration of IMPUTE v2 will typically finish faster and use less computer memory than a run of IMPUTE v1 on the same dataset, although IMPUTE v2 tends to be slower than IMPUTE v1 on the whole since the new method requires multiple iterations. We explore the computational burden of the method in detail in the Results section. Scenario B The structure of the dataset is more complex in this scenario than in the previous one, but we follow the same basic principles of imputation: phase the observed data, then impute alleles in each haplotype separately, conditioning on as much observed data as possible. Here, the goal of the phasing step is to end up with three sets of haplotypes: , the known haploid reference panel haplotypes at SNPs in T, U1 , and U2 ; , the unobserved diploid reference panel haplotypes at SNPs in T and U2 ; and , the set of unobserved study haplotypes at SNPs in T. If there are NDR individuals in the diploid reference panel, their haplotypes can be represented as , where is the haplotype pair for diploid reference individual i. The method begins by choosing initial guesses for the haplotypes in and – as before, we choose haplotypes that are consistent with the observed genotype data but phased at random. Each MCMC iteration now includes five steps. First, we update every diploid reference individual i: Sample a new haplotype pair for individual i at SNPs in T and U2 . This is accomplished by sampling from the conditional distribution . Impute new alleles (in two independent haploid steps) for SNPs in U1 , conditional on , , and ρ. In other words, we phase the observed data for diploid reference individual i by pooling information across both reference panels, then use these haplotypes in separate imputation steps based on the haploid reference panel. Up to this point, we have simply recapitulated Scenario A with different notation. Next, we update every study individual i: Sample a new haplotype pair for individual i at SNPs in T. This is accomplished by sampling from the conditional distribution . Impute new alleles (in two independent haploid steps) for SNPs in U2 , conditional on , , , and ρ. Impute new alleles (in two independent haploid steps) for SNPs in U1 , conditional on , , and ρ. As is Scenario A, burn-in iterations are used only for phasing (Steps 1 and 3), while subsequent iterations cycle through all five steps. In this algorithm, each study individual gains phasing information from all other individuals in the dataset, which can lead to very accurate haplotype estimates at typed SNPs when the total sample size is large. Once a study individual has sampled a new pair of haplotypes, the imputation step is broken into two parts: SNPs in U2 are imputed using information from both the haploid and diploid reference panels (Step 4), and SNPs in U1 are imputed using only the haploid reference panel (Step 5). This modeling choice highlights a core principle of our inference framework: we allow the method to naturally adapt to the amount of information in the data by conditioning only on observed genotypes, not imputed ones, at each step. Choice of conditioning states As noted above, the HMM calculations underpinning our method require more running time and computer memory as more haplotypes are added to the state space of the model. This can be a problem for the phasing updates, whose computational burden increases quadratically with the number of haplotypes included in the calculation. One solution, implemented in the phasing routine of the MACH software, is to use only a random subset of the available haplotypes for each update. For example, when sampling a new haplotype pair from in Step 1 of our algorithm for Scenario A, we could use a random subset of k haplotypes drawn from to build the conditional distribution, rather than the default approach of using all of the haplotypes. This approximation to the model will generally decrease accuracy, but it will also cause the computational burden of the phasing updates to increase linearly (for fixed k), rather than quadratically, with the number of chromosomes in the dataset. We have developed another approximation that also constrains phasing updates to condition on a subset of k haplotypes. Rather than selecting haplotypes at random, our approach seeks to identify the k haplotypes that are in some sense “closest” to the haplotypes of the individual being updated. In genealogical terms, this amounts to focusing attention on the parts of the underlying tree where that individual's haplotypes are located. The idea is that haplotypes that reside nearby in the genealogical tree will the most informative about the haplotypes of interest. The structure of the underlying genealogical tree is usually unknown (indeed, knowing the tree would essentially solve the phasing problem), so we frame the list of the k closest haplotypes as a random variable that gets updated for each individual at each MCMC iteration. To sample a new phase configuration for diploid individual i, we choose k conditioning states as follows: for each available non-self haplotype (including current-guess haplotypes for other diploid individuals), we calculate the Hamming distance to each of individual i's current-guess haplotypes and store the minimum of these two distances. Then, we use the k haplotypes with the smallest distances to build the HMM and sample a new pair of haplotypes for individual i. The transition and emission probabilities of our model [6] depend explicitly on k. The intuition is that, as k gets larger, jumps between different copied haplotypes should become less likely and those haplotypes should be copied with higher fidelity; this is because a chromosome will coalesce faster into a larger genealogy, leaving less time for recombination and mutation events to occur [18]. The underlying theory assumes that the haplotypes in question were sampled randomly from a population, which is clearly not the case when we select k haplotypes in the manner described above. To account for the fact that these haplotypes will find common ancestors (going backwards in genealogical time) more quickly than would k haplotypes chosen at random, we replace k with the total number of available haplotypes when specifying the HMM parameters for a phasing update. We refer to this approximation as informed selection of conditioning states. While this method is built upon genealogical intuitions, we emphasize that no explicit genealogies are constructed in our inference scheme. One way of understanding our approach is by comparison to the phasing method of Kong et al. [19]. Their method uses rule-based techniques to phase putative “unrelateds” by identifying long stretches of identity-by-state (IBS) sharing between individuals, under the assumption that such sharing is caused by recent common descent. Our Hamming distance metric can be viewed as a way of identifying near-IBS sharing, and our method combines information across multiple closely related individuals in a model-based way rather than seeking perfect IBS matching between specific individuals. In this sense, our approximation can be viewed as a flexible middle ground between full conditional modeling (which uses all of the available haplotypes to phase an individual) and the Kong et al. method (which may use only a small fraction of the available haplotypes to phase an individual). In our experience, imputation based on this informed method for choosing conditioning states is only trivially slower than otherwise identical analyses based on random state selection, effectively because the common HMM calculations take much longer than calculating all pairwise Hamming distances in the informed method. At the same time, the informed method can generally achieve the same phasing accuracy as the random method using many fewer states, or higher accuracy for a fixed number of states (data not shown). This is a major advantage because it is computationally expensive to add states to the model (i.e., to increase k). We therefore focus on the informed state selection method in this study, with the random method used only during MCMC burn-in, although both approaches are implemented in our software. We conduct an exploration of the parameter settings under informed selection, including the dependence of imputation accuracy on k, in Text S1, where we also discuss potential limitations of the informed state selection scheme. Modeling strategies for imputation datasets In order to understand the modeling choices underlying our new imputation algorithm, it is crucial to consider the statistical issues that arise in imputation datasets. For simplicity, we will discuss these issues in the context of Scenario A, although we will also extend them to Scenario B in the Results section. Fundamentally, imputation is very similar to phasing, so it is no surprise that most imputation algorithms are based on population genetic models that were originally used in phasing methods. The most important distinction between phasing and imputation datasets is that the latter include large proportions of systematically missing genotypes. Large amounts of missing data greatly increase the space of possible outcomes, and most phasing algorithms are not able to explore this space efficiently enough to be useful for inference in large studies. A standard way to overcome this problem with HMMs [6],[11] is to make the approximation that, conditional on the reference panel, each study individual's multilocus genotype is independent of the genotypes for the rest of the study sample. This transforms the inference problem into a separate imputation step for each study individual, with each step involving only a small proportion of missing data since the reference panel is assumed to be missing few, if any, genotypes. In motivating our new imputation methodology, we pointed out that modeling the study individuals independently, rather than jointly, sacrifices phasing accuracy at typed SNPs; this led us to propose a hybrid approach that models the study haplotypes jointly at typed SNPs but independently at untyped SNPs. We made the latter choice partly to improve efficiency – it is fast to impute untyped alleles independently for different haplotypes, which allows us to use all of the information in large reference panels – but also because of the intuition that there is little to be gained from jointly modeling the study sample at untyped SNPs. By contrast, the recently published BEAGLE [13] imputation approach fits a full joint model to all individuals at all SNPs. To overcome the difficulties caused by the large space of possible genotype configurations, BEAGLE initializes its model using a few ad-hoc burn-in iterations in which genotype imputation is driven primarily by the reference panel. The intuition is that this burn-in period will help the model reach a plausible part of parameter space, which can be used as a starting point for fitting a full joint model. This alternative modeling strategy raises the question of whether, and to what extent, it is advantageous to model the study sample jointly at untyped SNPs. One argument [20] holds that there is no point in jointly modeling such SNPs because all of the linkage disequilibrium information needed to impute them is contained in the reference panel. A counterargument is that, as with any statistical missing data problem, the “correct” inference approach is to create a joint model of all observed and missing data. We have found that a full joint model may indeed improve accuracy on small, contrived imputation datasets (data not shown), and this leads us to believe that joint modeling could theoretically increase accuracy in more realistic datasets. However, a more salient question is whether there is any useful information to be gained from jointly modeling untyped SNPs, and whether this information can be obtained with a reasonable amount of computational effort. Most imputation methods, including our new algorithm, implicitly assume that such information is not worth pursuing, whereas BEAGLE assumes that it is. We explore this question further in the sections that follow. Results To test our new imputation method, we compared it with established methods on realistic datasets that fit the two scenarios described above. Scenario A As an example of Scenario A, we used the 120 HapMap CEU parental haplotypes as a reference panel to impute genotypes in the WTCCC 1958 Birth Cohort (58 C) controls [1]. The 58 C samples were genotyped on the Affymetrix 500 K SNP chip, and the data were subjected to the SNP and sample filters specified in the WTCCC study [1]. Of the 1,502 58 C individuals, 1,407 were also genotyped on the Illumina 550 K chip, and 1,377 passed filtering in both datasets. We supplied only the latter set of individuals to the imputation methods, and we asked them to impute the 22,270 CEU HapMap SNPs on chromosome 10 that were represented on the Illumina chip but not the Affymetrix chip. We then used the imputed Illumina genotypes to evaluate the success of imputation based on the Affymetrix data. Program settings We used the following methods to perform the imputation: IMPUTE v1.0; MACH v0.1.10 with analytical (“mle”) imputation, where the model parameters were selected by running the “greedy” algorithm for 100 iterations on a random subset of 500 58 C samples, as suggested in the online tutorial that accompanies the software (http://www.sph.umich.edu/csg/abecasis/mach/tour/imputation.html); fastPHASE [11] v1.3.2 with 20 and 30 clusters (K = 20 and K = 30, in separate runs), 15 starts of the expectation-maximization (EM) algorithm to estimate model parameters, and 35 iterations per EM start (this version of fastPHASE automatically fits the clustering model to the reference panel and then imputes each study individual separately, conditional on the fitted model); BEAGLE v3.0.2 on default settings and with 50 iterations (rather than the default 10); and IMPUTE v2.0 with 40 and 80 conditioning states used for diploid updates at typed SNPs (k = 40 and k = 80, in separate runs) and 120 conditioning states (i.e., the full HapMap CEU panel) used for all haploid updates. We ran IMPUTE v2 with 10 burn-in iterations followed by 20 additional iterations. The first 3 burn-in iterations used random conditioning states for phasing updates, and all subsequent iterations used informed conditioning states. We discuss the motivations for these settings in Text S1. We declined to include other imputation methods in this analysis because the Genetic Association Information Network (GAIN) Imputation Working Group is planning to publish a similar comparison using a broad cross-section of methods; our goal here is mainly to benchmark a new method. In order to speed up the analysis via our parallel computing facilities, we split chromosome 10 into 20 non-overlapping analysis chunks. Each imputation run spanned 7 Mb, with an additional 250 kb buffer on either side that was used for inference but omitted from the results – this buffer guards against a deterioration of imputation quality near the chunk edges. We ran every algorithm using the same analysis chunks. Accuracy comparison The results of this analysis are shown in Figure 3. The x-axes in this figure display the discordance between imputed genotype calls and observed Illumina calls, which is a surrogate for imputation error rate; the y-axes display the percentage of genotypes for which no call was made. Each method's line is formed by considering several different calling thresholds for imputation posterior probabilities. For example, a certain number of maximum posterior probabilities will exceed a threshold of 0.9, and among these we can ask what percentage of the best-guess imputed genotypes disagree with the Illumina genotypes. This yields an x-coordinate, and the y-coordinate is simply the percentage of all imputed genotypes for which no posterior probability exceeds the threshold. We generated the lines on the plots by repeating these calculations for calling thresholds ranging from 0.33 to 0.99 for each method. 10.1371/journal.pgen.1000529.g003 Figure 3 Percentage discordance versus percentage missing genotypes for Scenario A dataset. (A) Full range of results, corresponding to calling thresholds from 0.33 to 0.99. (B) Magnified results for calling thresholds near 0.99. (C) Magnified results for calling thresholds near 0.33. On these plots, lines that are below and to the left of other lines are more desirable. One interpretation is that, for a given level of missing data, an imputation method with a line further to the left has lower discordance with the external genotypes. Such plots allow us to evaluate competing methods in a more nuanced way than just looking at best-guess genotypes (which is equivalent to setting a single calling threshold of 1/3). We strongly emphasize, however, that the point of this exercise is not to determine an “optimal” calling threshold and use this to make hard calls of imputed genotypes for downstream analyses. Imputation results inherently contain more uncertainty than experimental genotype calls, and a host of methods have been developed to appropriately take this uncertainty into account when doing things like association testing [6]. Such methods are implemented in our freely available association testing software, SNPTEST. Figure 3A shows the full results of this comparison. The curves are difficult to distinguish in this plot, so Figure 3B and 3C magnifies either end of the range to highlight the salient features. The grid lines in all three panels are shown at the same vertices to help convey the degree of magnification. The results can also be summarized by the best-guess error rate for each method (x-intercept on the plots): BEAGLE (default), 6.33%; BEAGLE (50 iterations), 6.24%; fastPHASE (K = 20), 6.07%; fastPHASE (K = 30), 5.92%; IMPUTE v1, 5.42%; IMPUTE v2 (k = 40), 5.23%; IMPUTE v2 (k = 80), 5.16%; MACH, 5.46%. Figure 3 shows that IMPUTE v1 (blue) achieved error rates that were consistently, if only slightly, lower than those of MACH (cyan) across the range of calling thresholds, and that both methods yielded lower error rates than fastPHASE (black) and BEAGLE (green). The IMPUTE v2 run with k = 40 (solid red line) attained similar accuracy to IMPUTE v1 at stringent calling thresholds (Figure 3B), although IMPUTE v2 gained a slight advantage at more lenient thresholds (Figure 3C). The IMPUTE v2 run with k = 80 (dotted red line) showed a small but consistent improvement over both IMPUTE v1 and the other IMPUTE v2 run. Computational requirements To describe the relative computational burdens of these methods, we re-ran each program on a more limited dataset on a single Linux server, which had four dual-core Intel Xeon processors (running at 2.33 GHz, with a 6.1 MB cache, and using a 64-bit architecture) and a total of 8 GB of RAM. Specifically, we repeated the analysis for the 4th, 8th, 12th, and 16th chunks, each of which encompasses a 7.5 Mb region of chromosome 10 (centered, respectively, at positions 22.22 Mb, 50.22 Mb, 78.22 Mb, and 106.22 Mb in NCBI Build 35 coordinates). The average running times and memory requirements for these analyses are shown in Table 1. 10.1371/journal.pgen.1000529.t001 Table 1 Running times and memory requirements for various algorithms in Scenario A. Method Avg. running time (min) Avg. required RAM (MB) BEAGLE 56 3100 BEAGLE (50iter) 392 3200 fastPHASE (K = 20) 397 8 fastPHASE (K = 30) 855 16 IMPUTE v1 43 1000 IMPUTE v2 (k = 40) 270 155 IMPUTE v2 (k = 80) 505 180 MACH 105 80 Running times are in minutes (min) and RAM requirements are in megabytes (MB). Each entry in the table is an average across four runs on different 7.5 Mb regions of chromosome 10. Each analysis included a reference panel of 120 chromosomes (CEU HapMap) and a study sample of 1,377 individuals genotyped on the Affymetrix 500 K SNP chip. Table 1 shows that IMPUTE v1 was the fastest of the methods considered here, followed by BEAGLE (default), MACH, IMPUTE v2 (k = 40), BEAGLE (50 iterations), fastPHASE (K = 20), IMPUTE v2 (k = 80), and fastPHASE (K = 30). Conversely, fastPHASE required the least computer memory, followed by MACH, IMPUTE v2, IMPUTE v1, and BEAGLE. Note that, while IMPUTE v2 with k = 40 took about six times as long as IMPUTE v1, it needed less than 16% of the RAM; this is mainly a consequence of modeling SNPs in U as haploid in version 2, as opposed to diploid in version 1. We also note that both fastPHASE and MACH spent most of their running time fitting their models to the HapMap, and that both methods could probably decrease running times (via more lenient settings) without sacrificing much accuracy. Scenario B We simulated Scenario B by modifying the WTCCC 58 C dataset as follows: First, we integrated the genotypes from the two SNP chips for the 1,377 shared 58 C individuals (see Text S1 for details), yielding a consensus set of 44,875 SNPs. Next, we split the 58 C samples into two groups: a diploid reference panel of 918 individuals (2/3 of the dataset) and a study sample of 459 individuals. To complete the reference panel, we added 120 haplotypes from the HapMap Phase II CEU data. We then created two Scenario B study sample datasets by masking the genotypes of SNPs unique to each chip in turn; there were 18,489 such SNPs on the Affymetrix chip and 22,219 such SNPs on the Illumina chip. Modeling considerations A full representation of Scenario B would include all HapMap SNPs that are polymorphic in the CEU panel. There were 138,592 such SNPs in our dataset, with 44,875 of these belonging to set U2 and the remaining 93,717 to set U1 . This data structure is problematic for most imputation methods because their modeling strategies are premised on a single reference panel in which most genotypes have been observed (i.e., some version of Scenario A). If the data from both reference panels in Scenario B were combined into a single panel, many reference SNPs (those in U1 ) would be missing large proportions of their genotypes, which could substantially decrease imputation accuracy in the study sample. Ad-hoc modifications of these approaches are not attractive either. For example, it would be possible for such methods to impute SNPs in U1 in the diploid reference panel and then combine the observed and imputed genotypes to impute SNPs in U1 and U2 in the study sample, but failing to account for the uncertainty in the imputed reference genotypes would probably lead to overconfident and lower-quality inferences. Alternatively, it would be possible to perform separate imputation runs on the SNPs in {U1 ,T} and the SNPs in {U2 ,T}, but this approach is neither elegant nor convenient in a large association study. To our knowledge, BEAGLE is the only method other than ours that has proposed a strategy for overcoming these difficulties. (This strategy is not discussed in the paper [13], but it is detailed in the documentation accompanying the BEAGLE v3.0 software.) When BEAGLE encounters multiple reference panels, as in Scenario B, it simply downweights the less complete panels during the burn-in stage of its model-fitting procedure. Specifically, every individual in the dataset is assigned a weight that reflects the completeness of that individual's genotypes – individuals with more missing data get lower weights, and therefore have less influence on the early steps of the model-fitting algorithm. This detail aside, BEAGLE still fits a joint model to the complete dataset in Scenario B, in contrast to the IMPUTE v2 approach of modeling the observed data jointly but the missing data independently. In light of these considerations, we decided to create two versions of our Scenario B dataset: one that includes the full set of HapMap SNPs, and one in which the HapMap dataset is restricted to SNPs that were genotyped on at least one of the chips (i.e., in which all SNPs in U1 have been removed). We used the latter dataset to broaden the range of methods that could be included in the comparison (at the cost of removing some of the complexity of Scenario B), and we used the former dataset to evaluate BEAGLE and IMPUTE v2 in a more realistic setting. Program settings In the restricted dataset, we used IMPUTE v1.0, IMPUTE v2.0, BEAGLE v3.0.2, fastPHASE v1.3.2, and PLINK [15] v1.03 to impute each chip's masked genotypes from the other chip's study sample genotypes and the reference panels. IMPUTE v2, BEAGLE, fastPHASE and PLINK used the 918 diploid individuals and the 120 HapMap CEU haplotypes as an expanded reference panel for imputation, while IMPUTE v1 was provided with only the HapMap reference panel. We ran IMPUTE v1, BEAGLE, and PLINK on their default imputation settings, and we also performed separate BEAGLE runs with 50 iterations (rather than the default 10). We ran fastPHASE with 20 and 30 clusters (K = 20 and K = 30, in separate runs), 15 starts of the EM algorithm to estimate model parameters, and 35 iterations per EM start. As in Scenario A, we first fit the fastPHASE clustering model to the reference data, then instructed the software to impute each of the 459 study individuals independently, conditional on the fitted model. Finally, we set IMPUTE v2 to use 40 and 80 conditioning states (k = 40 and k = 80, in separate runs) for phasing updates in both diploid panels and 1,956 reference panel states (2×918+120) for haploid imputation updates in the study sample. As before, we ran the algorithm for 30 MCMC iterations with the first 10 discarded as burn-in, and we specified that the algorithm should choose random conditioning states for phasing updates in the first 3 iterations and informed conditioning states thereafter. On the full Scenario B dataset, we ran BEAGLE and IMPUTE v2 using the faster settings described above: 10 iterations for BEAGLE and k = 40 for IMPUTE v2. For each SNP that was not typed in the study sample, IMPUTE v2 used all observed reference panel chromosomes in each imputation step (1,956 states at SNPs in U2 and 120 states at SNPs in U1 ). Accuracy comparison on restricted dataset The results of our restricted Scenario B comparison are shown in Figure 4, using the same discordance vs. missing genotype percentage format as Figure 3. Note that each panel in this figure follows a different scale. The top two panels (Figure 4A and 4B) share a common set of grid lines, and the bottom two panels (Figure 4C and 4D) share a finer set of grid lines. We omitted the results from the BEAGLE run with 50 iterations since they were only trivially better than the results based on default settings. 10.1371/journal.pgen.1000529.g004 Figure 4 Percentage discordance versus percentage missing genotypes for restricted Scenario B dataset. (A) Results for masked Illumina genotypes imputed from Affymetrix genotypes in the study sample. (B) Results for masked Affymetrix genotypes imputed from Illumina genotypes in the study sample. (C) Results for masked Illumina genotypes (SNPs with MAF<5% only) imputed from Affymetrix genotypes in the study sample. (D) Results for masked Affymetrix genotypes (SNPs with MAF<5% only) imputed from Illumina genotypes in the study sample. Figure 4A shows the results for all Illumina-only SNPs imputed from Affymetrix genotypes, and Figure 4B shows the equivalent results for Affymetrix-only SNPs imputed from Illumina genotypes. One striking difference between these plots is that the imputations based on Illumina genotypes (Figure 4B) are generally more accurate. There are a number of possible explanations for this trend: the Illumina chip has a higher SNP density, and imputation generally improves as more SNPs are observed; the Affymetrix chip contains a larger proportion of rare SNPs, which are easier to impute on the whole, as we discuss below; and, while the Illumina SNPs were specifically chosen to predict, or “tag”, many of the common Affymetrix SNPs via the HapMap, the reverse is not true. Regardless, one trend within Figure 4A and 4B is clear: with an expanded reference panel containing nearly 2,000 chromosomes, it is possible to improve imputation accuracy substantially over what is attainable with 120 chromosomes. For example, the IMPUTE v2 runs with k = 40 (solid red line) achieved best-guess discordance rates of 3.40% and 0.86% in Figure 4A and 4B, respectively, whereas the rates for IMPUTE v1 (which had access to only the HapMap reference panel; blue line) were 5.42% and 1.62%. BEAGLE (green), fastPHASE (black), and IMPUTE v2 (red) were all able to increase accuracy with the expanded reference panel, but the improvements for fastPHASE were smaller. BEAGLE (solid green line) and IMPUTE v2 with k = 40 (solid red line) yielded similar results: for BEAGLE, the best-guess discordance rates in Figure 4A and 4B were 3.46% and 0.93%. For IMPUTE v2, increasing the number of conditioning states used for phasing updates to k = 80 further reduced the discordance rates to 3.07% and 0.78%. Unlike the other imputation methods with access to the expanded reference panel, PLINK achieved lower accuracy than IMPUTE v1; in Figure 4A and 4B, PLINK's best-guess discordances were 7.83% and 2.45%. We tried varying PLINK's settings from their defaults, including settings that were much more computationally rigorous, but these additional runs led to negligible improvements. PLINK is faster than the other methods considered here, which are all based on HMMs, but it also uses a simpler population genetics model. The multinomial haplotype frequency model that PLINK uses for imputation has fared poorly in recent comparisons of phasing methods [21]; its role in this analysis was to see if any accuracy is lost by using a simpler method to speed up imputation in a large and complex dataset. Our results suggest that the model used by PLINK (which also underpins other imputation methods [7]) may be a liability in a dataset in which a large proportion of genotypes, including those in the reference panel, are unphased. However, we also note that PLINK's imputation functionality is still in beta testing. A recent study of Type 1 Diabetes [14] used a similar method to impute genotypes in a Scenario B dataset. Like PLINK, this method defines a multimarker tag for each SNP to be imputed, although in this case there is no phasing model since the tagging is based on correlations between unphased genotypes. It is not clear how this method would have fared in our comparison, but its similarities with PLINK imply that future studies might be better off using more sophisticated imputation methods. Figure 4C and 4D mirrors Figure 4A and 4B, respectively, but these results are restricted to imputed SNPs with minor allele frequencies (MAFs) less than 5%—Figure 4C is based on 1,113 SNPs and Figure 4D is based on 1,979 SNPs. The same relative patterns remain, although the discordance and missing data percentages are lower because it is easier to guess most of the genotypes correctly and with high confidence at a rare SNP than a common one, simply because most genotypes at a rare SNP will be homozygous for the common allele. Among the most accurate methods, the best-guess discordances based on Affymetrix genotypes (Figure 4C) were 1.01% (IMPUTE v2, k = 40), 0.84% (IMPUTE v2, k = 80), and 0.97% (BEAGLE), as compared to 1.73% for HapMap-based imputation with IMPUTE v1; the discordances based on Illumina genotypes (Figure 4D) were 0.48% (IMPUTE v2, k = 40), 0.38% (IMPUTE v2, k = 80), and 0.46% (BEAGLE), as compared to 0.93% for IMPUTE v1. Accuracy comparison on full dataset The results of our full Scenario B comparison are shown in Figure 5 in two panels that mirror Figure 4A and 4B. Although this dataset contains a large number of HapMap-only SNPs that were imputed here but not in the restricted dataset, we calculated discordance only at masked chip SNPs in the study sample, so the curves in Figure 4 and Figure 5 are based on exactly the same sets of masked genotypes. There are four curves in each panel: IMPUTE v2 in the full Scenario B dataset (k = 40; dashed red line); BEAGLE in the full dataset (default settings; dashed green line); BEAGLE in the restricted dataset (default settings; solid green line); and IMPUTE v1 in the restricted dataset (solid blue line). The first two curves (dashed lines) are the main focus of this comparison, and the latter two curves (solid lines) are carried over from Figure 4 for reference. 10.1371/journal.pgen.1000529.g005 Figure 5 Percentage discordance versus percentage missing genotypes for full Scenario B dataset. (A) Results for masked Illumina genotypes imputed from Affymetrix genotypes in the study sample. (B) Results for masked Affymetrix genotypes imputed from Illumina genotypes in the study sample. Solid lines were obtained from the restricted Scenario B dataset (Figure 4) and are shown for reference; dashed lines were obtained from the full Scenario B dataset. One important point about this figure is that the IMPUTE v2 curve is in exactly the same place as in Figure 4. It follows from our modeling approach that simply adding SNPs to the set U1 , as we have done here, will not affect the imputation of SNPs in U2 . Conversely, Figure 5 shows that adding SNPs to U1 actually makes BEAGLE's imputation results worse at SNPs in U2 : between the restricted and full datasets, the best-guess discordance increased from 3.46% to 4.01% in panel A and from 0.93% to 1.04% in panel B. We observed a similar decline in accuracy at rare SNPs, which are not shown separately in Figure 5. Hence, in the full Scenario B dataset, which we regard as a more realistic application of these methods, IMPUTE v2 achieves a best-guess discordance that is 15–18% smaller than BEAGLE's. In the Discussion, we propose an explanation for the change in BEAGLE's results between the full and restricted datasets. A major goal of performing imputation in Scenario B (and extensions thereof) is to simultaneously use all available reference data in an integrated modeling framework. As such, it is also important to assess the quality of imputation at SNPs in U1 (i.e., HapMap-only SNPs) in this context. To do so, we created a modified version of the full Scenario B dataset with observed Illumina genotypes in the study sample. We masked every 25th Illumina SNP in both the study sample and the diploid reference panel, then ran BEAGLE and IMPUTE v2 as before. We repeated these steps for each of the 24 other possible sets of masked SNPs (i.e., after shifting the starting index), so that every Illumina SNP was masked and imputed exactly once. Across these imputation runs, the best-guess discordance at masked SNPs in U1 was 2.87% for IMPUTE v2 and 3.60% for BEAGLE – i.e., the discordance for IMPUTE v2 was 20% smaller than the discordance for BEAGLE. Detecting minor allele copies at rare SNPs While Figure 4 confirms that a large reference panel can improve imputation accuracy at rare and common SNPs alike, it is instructive to examine where the gains at rare SNPs are made. To evaluate this question, we took the results from Figure 4C and 4D and classified the kinds of errors made by each method's best-guess imputations (i.e., at a calling threshold of 1/3). We focused primarily on the ability of each method to detect copies of the minor allele. This is clearly an important quantity, but it is obscured by gross measures of accuracy, which are inherently dominated by homozygote-common genotypes at rare SNPs. We examined two classifications of erroneous minor allele calls: false positives (homozygous common called as heterozygous) and false negatives (heterozygous called as homozygous common). The results are shown in Table 2, where the false positive and false negative rates are expressed as percentages of the total number of homozygous common and heterozygous genotypes, respectively. 10.1371/journal.pgen.1000529.t002 Table 2 False negative (FN) and false positive (FP) minor allele call rates at rare SNPs (MAF<5%) in Scenario B. Method Affymetrix 500 K data Illumina 550 K data FN calls (%) FP calls (%) FN calls (%) FP calls (%) BEAGLE 12.81 0.30 6.75 0.17 fastPHASE (K = 20) 21.12 0.28 11.07 0.15 fastPHASE (K = 30) 19.46 0.27 9.77 0.13 IMPUTE v1* 14.52 0.79 10.23 0.34 IMPUTE v2 (k = 40) 12.86 0.15 6.81 0.07 IMPUTE v2 (k = 80) 9.66 0.15 4.90 0.08 PLINK 32.79 0.53 25.63 0.40 The two columns on the left show results for Illumina-only genotypes imputed from Affymetrix 500 K data in the study sample, and the two columns on the right show results for Affymetrix-only genotypes imputed from Illumina 550 K data. The FN rates are expressed as percentages of genotypes that are truly heterozygous and the FP rates as percentages of genotypes that are truly homozygous common. * Unlike the other methods, IMPUTE v1 was not provided with the diploid reference panel. Consequently, these numbers are based on using a reference panel of 120 chromosomes to impute a study sample of 459 individuals. Several insights emerge from this table. First, IMPUTE v2 was consistently among the best methods for reducing both false negatives and false positives, suggesting that our new approach is generally more accurate than others at imputing rare SNPs. Second, while most methods were much more likely to make false negative calls than false positive calls, IMPUTE v1 was relatively more inclined toward false positives. This outcome is consistent with the use of a smaller reference panel, which will tend to overestimate the population frequencies of rare alleles. At the same time, IMPUTE v1 missed fewer true heterozygote calls than did fastPHASE or PLINK, despite using a much smaller reference panel. This tendency for fastPHASE and PLINK to mistake rare heterozygotes as homozygous for the major allele is the main factor separating these methods from IMPUTE v2 and BEAGLE in Figure 4C and 4D. Conversely, BEAGLE and IMPUTE v2 with k = 40 showed similar false negative rates, but IMPUTE v2 made half as many false positive rare allele calls. Putting these pieces together, it appears that IMPUTE v2 achieves higher accuracy than other methods at rare SNPs because it has both high sensitivity and high specificity for detecting rare alleles. These results show the strength of using a large reference panel for imputation, and of our particular approach to performing inference in that setting. Computational requirements As in Scenario A, we re-ran all programs on a single Linux server to assess their computational burdens in Scenario B. We used the same server and analysis chunks as before; the average running times and memory requirements across these four 7.5 Mb regions of chromosome 10 are shown in Table 3. (These numbers were averaged across both the masked Affymetrix and masked Illumina datasets, so eight runs contributed to each table entry.) Numbers in parentheses refer to the full Scenario B dataset, and all other numbers refer to the restricted dataset. Note that IMPUTE v1 was run on a version of the dataset that did not include the diploid reference panel. 10.1371/journal.pgen.1000529.t003 Table 3 Running times and memory requirements for various algorithms in Scenario B. Method Avg. running time (min) Avg. required RAM (MB) BEAGLE 21 (70) 2500 (3200) fastPHASE (K = 20) 530 12 fastPHASE (K = 30) 1100 20 IMPUTE v1* 5 260 IMPUTE v2 (k = 40) 409 (450) 80 (190) IMPUTE v2 (k = 80) 790 120 PLINK 1.5 8 Running times are in minutes (min) and RAM requirements are in megabytes (MB). Each entry in the table is an average across eight runs, including four runs on different 7.5 Mb regions of chromosome 10 for study samples with either Affymetrix-only or Illumina-only SNPs masked. Each analysis included a haploid reference panel of 120 chromosomes (CEU HapMap), a diploid reference panel of 1836 chromosomes, and a study sample of 459 individuals. Numbers in parentheses represent analyses that included all SNPs that are polymorphic in the HapMap CEU panel; for the rest of the analyses, only SNPs that were genotyped on either the Affymetrix 500 K or Illumina 550 K chip were included in the HapMap dataset. * Unlike the other methods, IMPUTE v1 was not provided with the diploid reference panel. Consequently, these numbers are based on using a reference panel of 120 chromosomes to impute a study sample of 459 individuals. PLINK was the fastest of these methods, followed by IMPUTE v1, BEAGLE, IMPUTE v2, and fastPHASE. PLINK also required the least RAM, followed by fastPHASE, IMPUTE v2, IMPUTE v1, and BEAGLE. While BEAGLE was quite fast, it also required more than ten times as much RAM as any other method (at least 2.5 GB per 7.5 Mb region of the genome). BEAGLE includes an option to decrease memory usage, but this would come at the cost of increased running time. We emphasize that IMPUTE v1 was among the fastest methods in this comparison only because it was assigned a much smaller problem: its reference panel contained 120 phased haplotypes, while every other method confronted a panel with 1,956 chromosomes, most of which were unphased. Using the knowledge that IMPUTE v1's computational burden grows quadratically with the number of chromosomes in the reference panel, we can project that it would have required over 1,300 minutes and 69,000 MB of RAM to run the program on a single 7.5 Mb analysis chunk with a reference panel of that size (which would also have needed to be phased ahead of time). This highlights a major advantage of our new modeling strategy: whereas IMPUTE v1 becomes computationally intractable as the reference panel grows, IMPUTE v2 remains competitive (both in computational burden and imputation accuracy) while allowing more flexibility (such as multiple, unphased, and/or incomplete reference panels). Another advantage of our approach can be seen by comparing the running times of the restricted and full datasets for BEAGLE and IMPUTE v2. The average BEAGLE run took 3.3 times longer in the full dataset than in the restricted dataset, whereas the IMPUTE v2 running time increased by factor of just 1.1. For comparison, the total number of SNPs in the dataset increased by a factor of 3.1. This contrast between the methods arises from the way they model SNPs in U1 : IMPUTE v2 models only the reference panel at such SNPs, whereas BEAGLE tries to model all individuals in the dataset. We regard the full dataset as a more realistic application of these methods, so we believe that the parenthetical running times in Table 2 offer the best comparison between BEAGLE and IMPUTE v2. Discussion In this study we introduced a new method for genotype imputation in large association studies. Our method, IMPUTE version 2, follows a flexible inference framework that uses more of the information in the data than many comparable methods, thereby improving accuracy, while remaining computationally tractable on large datasets. This approach is well-suited to the kinds of datasets that will become available in next-generation association studies: it can handle large reference panels, including ones with unphased and incomplete genotypes, and it can also integrate multiple reference panels containing different sets of SNPs. Scenario A The observation that IMPUTE v2 can achieve lower error rates than IMPUTE v1 in Scenario A validates our new approach. At the same time, the absolute improvement is small, as can be seen in Figure 3 by comparing the separation between IMPUTE v1 and v2 with the separation between IMPUTE v1 and MACH, which typically yield very similar results in our experience. We have also performed separate experiments in which IMPUTE v2 achieves much higher phasing accuracy than IMPUTE v1 at SNPs in T, but where the improvements in HapMap-based imputation of SNPs in U remain modest (data not shown). We suggest that this disconnect between phasing accuracy and imputation accuracy is caused by the inherent limitations of a small reference panel; in other words, we posit that existing models would not attain substantially lower imputation error rates with the current HapMap panel even if we knew the phase of the study genotypes perfectly. In the wake of these results, we suspect that the accuracy improvement of IMPUTE v2 over IMPUTE v1 is not practically meaningful for imputation based on the HapMap Phase II data. However, given that IMPUTE v1's computational requirements scale quadratically with the number of chromosomes in the reference panel while IMPUTE v2's requirements grow linearly, the newer version may become more computationally favorable as baseline reference panels grow in the future. For example, expanding the HapMap reference panel in this study to 800 chromosomes (which is roughly the size anticipated for each panel in the 1,000 Genomes Project) would lead to similar running times for both versions of IMPUTE, but version 2 would need only 2% of the computer memory required by version 1. At the same time, IMPUTE v2 would probably achieve higher accuracy, and its computational advantages over IMPUTE v1 would continue to grow with larger reference panels. Scenario B In our Scenario B dataset, we demonstrated that an expanded reference panel containing thousands of chromosomes can greatly improve accuracy over what is possible based on the HapMap alone, although these gains are limited to the subset of HapMap SNPs that are included on multiple genotyping chips. This finding is consistent with the conclusions of the recent BEAGLE paper [13]. IMPUTE v2 was consistently among the most accurate methods we considered. For example, IMPUTE v2 attained best-guess error rates that were 15–20% lower than those of its closest competitor (BEAGLE) in a realistic representation of Scenario B. Rare SNPs are of particular interest because of an increasing awareness that such SNPs may underlie common, complex diseases, and because imputation methods gain the most power over tagging approaches at such SNPs [6],[11]. Expanded reference panels ought to allow rare SNPs to be imputed much more accurately than they can be with the HapMap panel, and our method is able to exploit this information more effectively than competing methods. Relative to IMPUTE v1 (which had access to only the HapMap reference panel) and BEAGLE, the main improvement of IMPUTE v2 is to increase specificity by cutting down on false positive heterozygous calls; relative to fastPHASE and PLINK, the main improvement is to increase sensitivity by cutting down on false negative heterozygous calls. Modeling issues in imputation datasets Throughout this study we have touched on the fundamental modeling difficulties that arise in imputation datasets, and we have discussed various strategies that have been proposed to solve these problems. In particular, we have contrasted the BEAGLE approach of full joint modeling with the IMPUTE v2 approach, which phases the observed data jointly but imputes the missing alleles in different haplotypes independently. Based on the results seen here and elsewhere [13], we claim that BEAGLE gains very little useful information through joint modeling of entire imputation datasets. Consider these lines of evidence: In our Scenario B comparison, BEAGLE's accuracy at SNPs in U2 actually decreased when SNPs were added to U1 . This is highly counterintuitive: it is hard to explain why adding HapMap-only SNPs to a dataset, without changing any of the data in the rest of a region, should have a noticeable effect on the imputation of SNPs in an expanded reference panel, let alone a negative effect. Browning and Browning (2009) observed that BEAGLE attained better accuracy by subdividing a study sample and fitting the model separately to each subsample (along with the complete reference panel) than by simply fitting the model to the entire dataset – indeed, this subdividing strategy is now recommended as standard practice by the authors. The benefits of subdividing the sample were attributed to “model averaging”, but that is not an apt description of the process since each individual in the dataset is subjected to only a single model fit. Some model fits are probably better than others due to the stochastic nature of the algorithm, but some are also worse, so there is no reason to expect systematic improvements from this strategy if the model is working properly. Browning and Browning (2009) also observed that, for a fixed study sample of 188 individuals, BEAGLE's accuracy consistently improved relative to that of IMPUTE v1 as the size of the reference panel increased. No mechanistic rationale was provided to explain this trend. The first two points document strange behavior of the BEAGLE method: apparently, adding data – whether in the form of additional SNPs or additional individuals in the study sample – can cause BEAGLE's imputation accuracy to decrease. More specifically, it seems that increasing the proportion of missing data harms BEAGLE's inferences. This suggests an explanation for the third point above: as the reference panel grew and the study sample remained fixed, the total proportion of missing genotypes in the sample decreased, thereby generating datasets that were relatively less harmful to BEAGLE. In our view, these disparate observations point to a single underlying cause: joint modeling of untyped SNPs is generally ineffective, and it grows progressively worse as the space of missing genotypes expands. BEAGLE was competitive in our analyses, so its modeling strategy may have some merit, but it is also possible that BEAGLE's success came in spite of the joint modeling framework, not because of it. A better alternative might be to embed the same clustering model in a framework like the ones used by fastPHASE or IMPUTE v2. We suggest that further scrutiny be applied before a full joint model is used in general applications. Comparisons like ours, and others [13], are necessarily restricted to artificially small datasets, but we have shown that these “toy” datasets can mask problems that might occur in more realistic settings, which will often include larger amounts of missing data. In practice, the accuracy levels and running times achieved by BEAGLE in our study may represent best-case scenarios rather than standard results. These considerations apply to imputation datasets in general, but it is particularly interesting to examine them in the context of multiple reference panels genotyped on different sets of SNPs. BEAGLE's joint approach to such datasets is flexible, but we have seen that it can lose accuracy when certain kinds of new data are added. Conversely, IMPUTE v2's multi-panel modeling strategy responds intuitively to new sources of information like additional individuals or SNPs. This property makes it easy to predict how IMPUTE v2 will perform in larger and more complex datasets than the ones used here, whereas the same cannot necessarily be said for BEAGLE. More broadly, we believe that any imputation algorithm should strive to incorporate as much of the available reference information as possible while remaining easy to use. For example, in Scenario B it is desirable to simultaneously impute the SNPs in the expanded panel (to improve accuracy) and the SNPs represented only in the HapMap (to maintain genomic coverage). IMPUTE v2 provides an integrated framework for handling this kind of problem: it is flexible enough to handle numerous variations of Scenarios A and B, yet it remains tractable by focusing computational effort on the parts of the dataset that are most informative. Extensions The expanded reference panel we considered was constituted by controls genotyped on multiple SNP chips, but other kinds of new reference panels will also become available in the near future. For example, the HapMap Project has recently augmented its Phase II data with additional samples from both the original HapMap locations and new locations aimed at capturing more human genetic diversity. These samples have all been genotyped on multiple, largely non-overlapping SNP chips, and could be used for imputation in the same way as the controls in our Scenario B. In addition, the 1,000 Genomes project is currently pursuing whole-genome sequencing of hundreds of individuals sampled from broad geographic regions in Africa, East Asia, and Europe. One aim of the project is to generate high-quality haplotypes for these individuals, including near-complete coverage of SNPs with population MAFs of 1% or more. This resource will increase the utility of imputation approaches by expanding both the number of chromosomes in the reference set and the number of SNPs that can be imputed. Our method is well-suited to this kind of dataset: in addition to its ability to accurately impute rare SNPs, which will constitute most of the new variants in the 1,000 Genomes data, IMPUTE v2 expends relatively little computational effort on haploid imputation steps. This means that, for a given SNP chip typed in a given study sample, doubling the number of untyped variants in a phased reference panel will increase the computational burden of imputation by a factor of less than two. By contrast, other imputation methods (such as IMPUTE v1, BEAGLE, and fastPHASE) would slow down by a factor of at least two. One major use of our new method (and of imputation methods generally) will be to facilitate meta-analyses [9],[10], which combine samples from studies of similar diseases to increase the chances of detecting low-penetrance risk alleles. For this application, we might expect to repeat Scenario B for a number of different study samples genotyped on different SNP chips. Rather than re-phase the diploid reference panel for each study sample, we can save time by simply storing the posterior samples from a single run of phasing the reference panel, then read these sampled haplotypes from memory when processing each study sample. This functionality is implemented in our software. IMPUTE v2 is already fast enough to use in large association studies, but we also have plans to make it faster. We believe that the software can gain some speed simply by optimizing the code, but we also have plans to implement an analytical speed-up for the HMM forward-backward calculations [22] that may further decrease running times by a factor of five or so. Finally, while we described our imputation approach in terms of two specific scenarios involving the HapMap, it could in fact be generalized to include any number of reference panels of any type (phased/unphased, complete/incomplete) so long as their SNP sets follow a hierarchy such as the ones laid out in Figure 1 and Figure 2. We envision that IMPUTE v2 will be used in a variety of situations. For example, it may soon become standard practice to combine the HapMap Phase II and Phase III datasets to create a compound reference panel like the one in Scenario B, except with all of the reference data phased. Another plausible situation is the version of Scenario B that we described, in which a large set of controls is used to impute genotypes in cases; we discuss some concerns about association testing in this setting in Text S1 and Figure S2. IMPUTE v2 will also be applied in populations beyond the UK controls used in this study, and we expect that its performance will follow trends much like those observed for similar imputation methods [23],[24]. Our modeling strategy is flexible and fast, and it is general enough that it could be adopted by other imputation methods. We believe that this intuitive way of thinking about imputation datasets will benefit next-generation association studies, and that IMPUTE v2 will prove to be a useful tool for finding subtle signals of association. Supporting Information Figure S1 Percentage discordance between best-guess imputed and observed Illumina genotypes for various parameter settings of IMPUTE v2. These results were obtained from a 2 Mb region of chromosome 10 in the Scenario B dataset. (0.27 MB TIF) Click here for additional data file. Figure S2 Expected versus observed p-values for additive association tests between the 58 C and UKBS control groups, where the UKBS genotypes have been imputed from 58 C genotypes. (A) p-p plot for common (MAF≥5%) SNPs. (B) p-p plot for rare SNPs. The 95% concentration band is shown in grey, and the y = x line is shown in red. (0.13 MB TIF) Click here for additional data file. Table S1 Convergence statistics for various parameter settings of IMPUTE v2. For each combination of burn-in and main iterations, the number shown is the percentage of imputed genotypes for which the R convergence statistic was greater than 1.02 across 10 independent runs of the algorithm. The results are stratified into genotypes at 100 common SNPs (left) and genotypes at 24 rare SNPs (right); for rare SNPs, only genotypes that include the minor allele were used in the calculations. These results were obtained from a 2 Mb region of chromosome 10 in our Scenario B dataset, using IMPUTE v2 with k = 30 (results with k = 100 were similar). (0.03 MB PDF) Click here for additional data file. Text S1 Performance of IMPUTE v2 under various parameter settings; Convergence of IMPUTE v2 algorithm; Limits of informed conditioning approximation; Integrating genotypes from two SNP chips; Association testing of cases imputed from controls. (0.28 MB PDF) Click here for additional data file.

- Record: found
- Abstract: found
- Article: not found

Nick Patterson, Alkes L. Price, David Reich (2006)

Introduction A central challenge in analyzing any genetic dataset is to explore whether there is any evidence that the samples in the data are from a population that is structured. Are the individuals from a homogeneous population or from a population containing subgroups that are genetically distinct? Can we find evidence for substructure in the data, and quantify it? This question of detecting and quantifying structure arises in medical genetics, for instance, in case-control studies where uncorrected population structure can induce false positives [1]. It also arises in population genetics, where understanding of the structure may be important to the key scientific issues, especially uncovering the demographic history of the population under study. We focus on principal components analysis (PCA), which was first introduced to the study of genetic data almost thirty years ago by Cavalli-Sforza [2]. We will use PCA and “eigenanalysis” interchangeably. The latter term focuses attention on the fact that not just the eigenvectors (principal components) are important here, but also the eigenvalues, which underlie our statistical procedures. PCA has become a standard tool in genetics. In population genetics, we recommend a review paper [3] focusing on the use of “synthetic maps” which use PCA to study genetic geographic variation. Usually PCA been applied to data at a population level, not to individuals as we do here. Exceptions are [4,5]. In addition to single nucleotide polymorphisms (SNPs) and microsatellites, PCA has been applied to haplotype frequencies [6,7] and the distribution of ALU insertion polymorphisms [8] in order to study population structure. Most of the literature on PCA in genetics is applied, not methodological, and we know of no paper that concentrates as we do here on the statistical significance of the components. Data with hundreds or thousands of individuals and hundreds of thousands of markers are now becoming available, so that small but real effects will be detectable, and it is important to develop rigorous tests for population structure that will be practical, even on the largest datasets. This is our main aim in this paper. Using some recent results in theoretical statistics, we introduce a formal test statistic for population structure. We also discuss testing for additional structure after some structure has been found. Finally, we are able to estimate the degree of population differentiation that will be detectable for a given data size. Our methods work in a broad range of contexts, and can be modified to work with markers in linkage disequilibrium (LD). The methods are also able to find structure in admixed populations such as African Americans—that is, in which individuals inherit ancestry from multiple ancestral populations—as long as the individuals being studied have different proportional contributions from the ancestral populations. We believe that principal components methods largely fell out of favor with the introduction of the sophisticated cluster-based program STRUCTURE [9,10]. STRUCTURE and similar methods are based on an interpretable population genetics model, whereas principal components seems like a “black box” method. We will discuss how the models underlying the cluster methods, and the PCA technique we will describe, are much closer to each other than they may at first appear to be. Our implementation of PCA has three major features. 1) It runs extremely quickly on large datasets (within a few hours on datasets with hundreds of thousands of markers and thousands of samples), whereas methods such as STRUCTURE can be impractical. This makes it possible to extract the powerful information about population structure that we will show is present in large datasets. 2) Our PCA framework provides the first formal tests for the presence of population structure in genetic data. 3) The PCA method does not attempt to classify all individuals into discrete populations or linear combinations of populations, which may not always be the correct model for population history. Instead, PCA outputs each individual's coordinates along axes of variation. An algorithm could in principle be used as a post-processing step to cluster individuals based on their coordinates along these axes, but we have not implemented this. We note that STRUCTURE is a complex program and has numerous options that add power and flexibility, many of which we cannot match with a PCA approach. Perhaps the central goal of STRUCTURE is to classify individuals into discrete populations, but this is not an object of our method. We think that in the future both cluster-based methods such as STRUCTURE and our PCA methods will have a role in discovering population structure on genetic data, so that, for example, our PCA methods offer a good default for the number of clusters to use in STRUCTURE. In complex situations, such as uncovering structure in populations where all individuals are equal mixtures of ancestral populations, it may remain necessary to use statistical software that explicitly models admixture LD, such as [10–13], which allow estimation of local ancestry at arbitrary points of the genome. In this study we aim to place PCA as applied to genetic data on a solid statistical footing. We develop a technique to test whether eigenvectors from the analysis are reflecting real structure in the data or are more probably merely noise. Other papers will explore applications to medical genetics [14] and to the uncovering of demographic history. In this paper, our main purpose is to describe and to validate the method, rather than to make novel inferences based on application to real data, which we leave to future work. We show that statistically significant structure is real and interpretable, and also that our methods are not failing to recover real structure that is found by other techniques. Two important results emerge from this study. First, we show that application of PCA to genetic data is statistically appropriate, and provide a formal set of statistical tests for population structure. Second, we describe a “phase change” phenomenon about the ability to detect structure that emerges from our analysis: for a fixed dataset size, divergence between two populations (as measured, for example, by a statistic like FST ) that is below a threshold is essentially undetectable, but a little above threshold detection will be easy. Based on these results, we are able to give an estimate of how much data will be required to find population structure given a level of genetic divergence such as FST (as defined by Cavalli-Sforza, [15, p. 26, Equation 3].) The theory shows that the methods are sensitive, so that on large datasets, population structure will often be detectable. Moreover, the novel result on the phase change is not limited just to PCA, but turns out to reflect a deep property about the ability to discover structure in genetic data. For example, in the paper we present simulations that show the ability to detect structure occurs with the same dataset size when STRUCTURE and PCA are used; that is, the phase change manifests itself in the same place. The phase change effect was suggested by a recent paper in theoretical statistics [16], which demonstrated the phenomenon for a situation that is mathematically similar to ours. The theory has continued to develop and nearly all we need has now been proved, the most recent paper being [17]. We believe that the applications to genetics still pose some interesting questions for the theorists. While our methods are derived from asymptotic theory (where the datasets are very large), they also seem to work well on small datasets, and we would be interested in seeing a theoretical explanation. Results The basic technique is simple. We assume our markers are biallelic, for example, biallelic single nucleotide polymorphisms (SNPs). Regard the data as a large rectangular matrix C, with rows indexed by individuals, and columns indexed by polymorphic markers. For each marker choose a reference and variant allele. We suppose we have n such markers and m individuals. Let C(i,j) be the number of variant alleles for marker j, individual i. (Thus for autosomal data we have C(i,j) is 0,1 or 2.) For now suppose that there is no missing data. From each column we subtract the column means. So set for column j: and then the corrected entries are: Set p(j) = μ(j)/2, an estimate of the underlying allele frequency (autosomal data). Then each entry in the resulting matrix is Equation 3 is a normalization step, which is motivated by the fact that the frequency change of a SNP due to genetic drift occurs at a rate proportional to per generation. It also normalizes (at least if the data is in Hardy-Weinberg equilibrium) each data column to have the same variance. We note that Nicholson et al. use the same normalization, and motivate it similarly [18]. We verified (unpublished data) that the normalization improves results when using simulated genetic data, and that on real data known structure becomes clearer. (However all the results are just as mathematically valid even without the normalizations.) The methods also are applicable to data such as microsatellites, where there are more than two alleles at a single site. We use a device of Cavalli-Sforza [2,15], making a “marker” j out of each allele, and then setting C(i,j) to be the number of occurrences of the allele for sample i. We omit the normalization step of Equation 3 for microsatellites, merely subtracting the mean. The normalization has no clear justification for microsatellite data, and results on real data (unpublished) show that it produces worse performance in this case. An alternative, suggested by a referee, is to use the microsatellite allele length as a continuous variable, and carry out PCA directly after a suitable normalization. Now we carry out a singular value decomposition on the matrix M. (A standard reference for the numerical methods is [19]. Public domain software is readily available—we used the well-known package LAPACK, http://www.netlib.org/lapack.) We are chiefly interested here in the case that the number of samples is less than the number of markers: m 1 + 1/γ, then as m,n → ∞, the TW statistic becomes unbounded almost surely. That is, the behavior of L 1 is qualitatively different depending on whether l 1 is greater or less than 1 + 1/γ. This is a phase-change phenomenon, and we will define as the BBP threshold. This is an asymptotic result, showing that as the data size goes to infinity, the transition of the behavior, as l 1 varies, becomes arbitrarily sharp. The result, as stated above, is proved in [16] for data where the matrix entries are complex numbers, and statement (2) of the conjecture is proved in [17], which demonstrates that the behavior is qualitatively different according to whether l 1 is greater or less than 1 + 1/ γ. There seems little doubt as to the truth of statement (1) above. It has been shown (D. Paul, Asymptotic behavior of the leading sample eigenvalues for a spiked covariance model, http://anson.ucdavis.edu/~debashis/techrep/eigenlimit.pdf) that, under the assumptions of statement (1) above, the lead eigenvector of the sample covariance is asymptotically uncorrelated with the lead eigenvector of the theoretical covariance, but we believe that the question of the distribution of the leading eigenvalue is still open. Consider an example of two samples each of size m/2, diverged from each other at time τ, where unit time is 2N generations, and assume that N is the effective population size. We assume τ is small, from which it follows that We find that It follows that the BBP threshold is reached when This is interesting by itself: Define D, the data size, to be the product of the number of samples and number of SNPs genotyped. For two subpopulations of equal sample size, the phase change threshold is reached when 1/FST is equal to the square root of the data size D, independently of the number of individuals and markers, at least when both are large. At a fixed data size, the expected value of the leading eigenvalue of the data matrix (and the power to detect structure) of course is a continuous function of FST, but the BBP conjecture suggests that for large data sizes there will only be a small transition region. Above the region, detection of structure will be easy, and below it, impossible. Let us take nm = 220 (about one million genotypes), so that the BBP threshold is FST = 2−10. We let m = 2 k (k = 5…8) and set n = 220−k so that nm = 220. Now for each value of m, generate simulated data, varying FST from 2−13 to 2−7. For each simulation, we compute L 1, the TW statistic, and a p-value. We show the TW statistics in Figure 6. Figure 6 The BBP Phase Change We ran a series of simulations, varying the sample size m and number of markers n but keeping the product at mn = 220. Thus the predicted phase change threshold is FST = 2−10. We vary FS and plot the log p-value of the Tracy–Widom statistic. (We clipped −log10 p at 20.) Note that below the threshold there is no statistical significance, while above threshold, we tend to get enormous significance. The phase change is evident. Further, from [16, p. 1650ff] (also see [17, Equation 1.10]): above the BBP threshold we have that in probability as m,n → ∞. It then follows that above the BBP threshold, we can expect the TW statistic to be increasing with the number of individuals m if the data size mn is fixed. That is, increasing sample size, rather than marker number, is advantageous for detecting structure above the BBP threshold, but not below it. This effect is clearly visible in Figure 6 (note the behavior of the p-value for m = 256). We summarize: For two equal size subpopulations, there is a threshold value of FST, , below which there will be essentially no evidence of population structure. Above the threshold, the evidence accumulates very rapidly, as we increase the divergence or the data size. Above the threshold for fixed data size mn, the evidence is stronger as we increase m, as long as n ≫ m. Another implication is that these methods are sensitive. For example, given a 100,000 marker array and a sample size of 1,000, then the BBP threshold for two equal subpopulations, each of size 500, is FST = .0001. An FST value of .001 will thus be trivial to detect. To put this into context, we note that a typical value of FST between human populations in Northern and Southern Europe is about .006 [15]. Thus, we predict: most large genetic datasets with human data will show some detectable population structure. The BBP phase change is not just a phenomenon of the eigenvector-based analysis we are discussing here. We suspect that at least for biallelic unlinked markers, no methods for detecting structure will do much better than our TW-based techniques. This implies that no method will have any significant success rate if population divergence is below the BBP threshold, while above threshold, reasonable methods will succeed. To test this we made a series of simulations, each with 1,600 biallelic markers and two populations each of size 50. We varied FST and ran both our eigenanalysis and STRUCTURE. (See Methods for more detail about the simulations and analysis.) We were not successful in using STRUCTURE to produce a higher likelihood for the existence of two clusters rather than one except for the very largest FST levels. We wanted to place our methods and STRUCTURE on a “level playing field.” Our PCA methods return a leading eigenvector, while running STRUCTURE with K = 2 clusters, returns for each individual the probability of belonging to cluster 1. We used a nonparametric idea, applying a probit transform to both the output of both the PCA and of STRUCTURE, and then running an ANOVA analysis, both for PCA and STRUCTURE output. (The probit transform uses order statistics (ranks) to map the observations into points appropriate if the underlying distribution is standard normal. See, for example, [33].) This amounts to carrying out an unsupervised analysis and then checking to see if the recovered “structure” reflects the truth. Thus, we will compute three p-values: 1) a TW statistic from an unsupervised analysis; 2) an ANOVA p-value (F-statistic) after probit transform of the leading principal component; 3) an ANOVA p-value (F-statistic) after probit transform of the STRUCTURE cluster probabilities. Table 4 shows the results from a representative set of runs: we show the geometric mean of the p-value in simulations, based on a TW statistic (unsupervised) or a nonparametric ANOVA analysis, both for the eigenanalysis and for STRUCTURE. Table 4 BBP Phase Change: Eigenanalysis and STRUCTURE Here the BBP threshold is .0025. Below the threshold nothing interesting is found by the TW unsupervised statistic. Above the threshold, the TW statistic is usually highly significant, and the ANOVA analyses show that the true structure has become apparent. At the threshold we sometimes have recovered significant structure, but it will be hard (usually impossible) to tell if the structure is real or a statistical artifact. Below the threshold, the structure is too weak to be useful. In these runs, at the critical threshold, the eigenanalysis slightly outperformed STRUCTURE. We have not carefully investigated whether we could obtain better results by varying the STRUCTURE parameters. Summarizing: below the threshold, neither procedure succeeds with reasonable probability, at the threshold success is variable, and above the threshold success is nearly guaranteed. Admixture In an admixed population, the expected allele frequency of an individual is a linear mix of the frequencies in the parental populations. Unless the admixture is ancient—in which case the PCA methods will fail as everyone will have the same ancestry proportion—then the mixing weights will vary by individual. Because of the linearity, admixture does not change the axes of variation, or, more exactly, the number of “large” eigenvalues of the covariance is unchanged by adding admixed individuals, if the parental populations are already sampled. Thus, for example, if there are two founding populations, admixed individuals will have coordinates along a line joining the centers of the founding populations. We generated simulated data, by taking a trifurcation between populations (A,B,D) 100 generations ago. Population C is a recent admixture of A and B. The mixing proportion of A in an individual from C is Beta-distributed B(3.5,1.5) so that the average contribution of population A in an individual of population C is .7 (see Figure 7). Effective population sizes are 10,000 for each population. We then simulated data for 10,000 unlinked markers (more details are in the Methods section). FST between any pair of A,B,D is .005. We are attempting to mimic the data of Figure 5, and chose to run our analysis on simulated samples from populations B,C,D, not using samples from A. We expect two significant eigenvalues corresponding to the splits of populations B,C, and D. If population A is included in the analysis, we also get just two significant eigenvalues, as predicted by theory. This is what is observed (unpublished data), with, as predicted, the admixed population not adding to the number of axes of variation (the third eigenvalue is not significant). In Figure 8 we show a plot of the first two eigenvectors. Note the dispersion of population C along a line. This is diagnostic of admixture. The resemblance of Figures 5 and 8 is striking. Figure 7 Simulation of an Admixed Population We show a simple demography generating an admixed population. Populations A,B,D trifurcated 100 generations ago, while population C is a recent admixture of A and B. Admixture weights for the proportion of population A in population C are Beta-distributed with parameters (3.5,1.5). Effective population sizes are 10,000. Figure 8 A Plot of a Simulation Involving Admixture (See Main Text for Details) We plot the first two principal components. Population C is a recent admixture of two populations, B and a population not sampled. Note the large dispersion of population C along a line joining the two parental populations. Note the similarity of the simulated data to the real data of Figure 5. There remain issues to resolve here. Firstly, recent admixture generates large-scale LD which may cause difficulties in a dense dataset as the allele distributions are not independent. These effects may be hard to alleviate with our simple LD correction described below. STRUCTURE [10] allows careful modeling. Secondly, more ancient admixture, especially if the admixed population is genetically now homogeneous, may lead to a causal eigenvalue not very different from the values generated by the sampling noise. Suppose, for example, in our simulation above, we let population C mate panmictically for another 20 generations. Then we will get three clusters for A, B, C that are nearly collinear, but not exactly because of the recent 20-generation divergence, which is reflecting genetic drift unique to that population. A third issue is that our methods require that divergence is small, and that allele frequencies are divergent primarily because of drift. We attempted to apply our methods to an African-American dataset genotyped on a panel of ancestry-informative markers [34]. The Tracy–Widom theory breaks down here with dozens of “significant” axes that we do not believe have genetic meaning. Perhaps this is to be expected, as on our informative panel FST is big (.58) and the theory could be expected to perform poorly. In addition our methods are here not dealing adequately with LD caused by large admixture blocks. This is an issue for our TW techniques, but not for PCA as such. Indeed, on this dataset the correlation of our principal eigenvector with the estimated European ancestry for each individual recovered by the admixture analysis program ANCESTRYMAP [12] is a remarkable .995 (STRUCTURE produces similar results). ANCESTRYMAP has complex modeling of admixture LD, and was also provided with parental allele frequencies, but did no better than the simple PCA. (There is an issue of interpretation here: the leading eigenvector is almost perfectly correlated with ancestry, but to infer actual ancestry proportions an affine transform must be applied, translating and scaling the values. In practice, some parental allele frequencies will be needed to determine the appropriate transform. A similar issue arises with STRUCTURE if parental frequencies are unknown.) Finally, if “admixture LD” is present, so that in admixed individuals long segments of the genome originate from one founder population, simple PCA methods will not be as powerful as programs such as STRUCTURE [10], ADMIXMAP [11], and ANCESTRYMAP [12], where there is careful modeling of the admixture blocks and the transitions. The power of these methods lies in the fact that genome-wide samples may have similar proportions of inheritance from the ancestral populations, but locally they will inherit either 0, 1, or 2 alleles from each ancestral population. Methods that specifically attempt to assign local ancestries will be able to determine the specific patterns typical of each ancestral population locally. An interesting and challenging problem is to build tools that retain the power of these more complex models on admixed data and that also run rapidly on large datasets. Correcting for LD The theory above works well if the markers are independent (that is have no LD), but in practice, and especially with the large genotype arrays that are beginning to be available, this is difficult to ensure. In extreme cases uncorrected LD will seriously distort the eigenvector/eigenvalue structure, making results difficult to interpret. Suppose, for example, that there is a large “block” [35,36] in which markers are in complete LD, and we have genotyped many markers in the block. A large eigenvector of our Wishart matrix X will tend to correlate with the genotype pattern in the block (all markers producing the same pattern). This will distort the eigenvector structure and also the distribution of eigenvalues. We recommend the following if LD between markers is a concern in the data. Pick a small integer k > 0, corresponding to the number of adjacent markers one uses for adjustment (k = 1 will often suffice). In the data matrix M we will “predict” each column by running a multivariate regression on the k previous columns. We then will analyze the residuals. Concretely: we first form M, as in Equation 2. For each column j Set: Choose a to minimize and now calculate X = RR′ instead of MM′. It is first important to check that in the absence of LD the suggested correction does not seriously distort the Tracy–Widom statistic. In Figure 9A and 9B we show P–P plots, uncorrected, and with five levels (k = 1…5) of correction. The first figure is with 100 individuals and 5,000 markers, the second with 200 individuals and 50,000 markers. Then in Figure 10A and 10B we analyze a simulated dataset with severe LD. We generate blocks in perfect LD, in which the probability that a block contains L markers is 2−L. We show the corresponding plots. Note that here the uncorrected statistic is distributed quite differently than the Tracy–Widom distribution. Our suggested correction strategy seems to work well, and should be adequate in practice, especially as most large genotype arrays will attempt to avoid high levels of LD. We would recommend that before analyzing a very large dataset with dense genotyping, one should filter the data by removing a marker from every pair of markers that are in tight LD. Figure 9 LD Correction with no LD Present P–P plots of the TW statistic, when no LD is present and after varying levels (k) of our LD correction. We first show this (A) for m = 500, n = 5,000, and then (B) for m = 200, n = 50,000. In both cases the LD correction makes little difference to the fit. Figure 10 LD Correction with Strong LD (A) Shows P–P plots of the TW statistic (m = 100, n = 5,000) with large blocks of complete LD. Uncorrected, the TW statistic is hopelessly poor, but after correction the fit is again good. Here, we show 1,000 runs with the same data size parameters as in Figure 2A, m = 500, n = 5,000, varying k, the number of columns used to “correct” for LD. The fit is adequate for any nonzero value of k. (B) Shows a similar analysis with m = 200, n = 50,000. Comparison with STRUCTURE In the work above on the BBP phase change, we already showed some comparisons between STRUCTURE and our methods. A fair comparison to STRUCTURE is not easy, as the two programs have subtly different purposes and outputs. STRUCTURE attempts to describe the population structure by probabilistic assignment to classes, and we are attempting to determine the statistically significant “axes of variation,” which does not necessarily mean the same thing as assigning individuals to classes. Our impression, confirmed by Table 4, is that when our analysis finds overwhelmingly evident population structure, then STRUCTURE will as well, and when nothing at all is found, STRUCTURE will fail, too. In a problem where the effect is marginal, it may be hard to say which analysis is preferable. STRUCTURE is a sophisticated program with many features we have not attempted to match. STRUCTURE has an explicit probability model for the data, and this allows extra options and flexibility. It incorporates a range of options for ancestry and for allele frequencies, and has explicit options for modeling microsatellite distributions. On the other hand, eigenanalysis has advantages over STRUCTURE. First, it is fast and simple, and second, it provides a formal test for the number of significant axes of variation. One future possibility is to somehow incorporate recovered significant eigenvectors into STRUCTURE—in particular with regard to choosing the number of subpopulations, which is not statistically robust in the STRUCTURE framework. A sensible default for the number of clusters in STRUCTURE is one more than the number of significant eigenvalues under the TW statistic. Missing Data and Other Problems The most problematic issue when applying any method to infer population structure is that genotyping may introduce artifacts that induce apparent (but fallacious) population structure. Missing genotypes by themselves are not the most serious concern. Simply setting M(i,j) = 0 in Equation 3 if marker j is missing for individual i is reasonable if we are testing the null, that there is no structure, and the missing data is “missing at random.” Unfortunately “informative missingness” [37,38] is extremely frequent in genetic data. Probably the most common and serious issue is that with current technology, heterozygotes are more difficult to call than homozygotes. Thus, true heterozygotes are more likely to be called as missing. This is discussed in detail in [38], which is recommended as a very useful discussion of the issues, especially as they apply to medical genetics. If DNA quality (or quantity) varies among our samples, then certain individuals may have an unusual amount of missing data, and then appear as outliers in our eigenanalysis—we in fact have seen this in many runs on real data. Another issue that may produce confounding effects is if data from different populations or geographical areas is handled differently (which may be inevitable, especially in the initial processing); then, in principle, this may induce artifacts that mimic real population structural differences. Even restricting analysis to markers with no missing data, apart from an inevitable power loss, does not necessarily eliminate the problems. After all, if a subset (the missing data) is chosen in a biased way, then the complementary subset must also be biased. We have no complete solution to these issues, though there is no reason to think that our eigenvector-based methods are more sensitive to the problems than other techniques [9]. One check we do recommend is to generate a test matrix by taking the initial counts C(i,j) to be 0 if the corresponding data is present; otherwise, set C(i,j) = 1. This is equivalent to only focusing on the pattern of missing data. The eigenanalysis on this test matrix will show significant TW statistics if the missing data by itself is showing evidence of population structure. If so, the results should be regarded with some suspicion, especially if the eigenvectors show high correlation to the eigenvectors of the main analysis. We here echo [38] and recommend that the analyst should “control all aspects of source, preparation and genotyping, using the paradigms of blindness and randomization,” but, as the reference states, this will not always be possible. Another possible source of error, where the analyst must be careful, is the inclusion of groups of samples that are closely related. Such a “family” will introduce (quite correctly from an algorithmic point of view) population structure of little genetic relevance, and may confound features of the data of real scientific interest. We found that this occurred in several real datasets that we analyzed with eigenanalysis and in which related individuals were not removed. Discussion For many genetic datasets, it is important to try to understand the population structure implied by the data. STRUCTURE [9], since its introduction, has been the tool of choice, especially for small datasets. We think we have provided some evidence that PCA has advantages also, as it is fast, easily implemented, and allows accurate testing of significance of a natural null model. We can only uncover structure in the samples being analyzed. As pointed out in [39], the sampling strategy can affect the apparent structure. Rosenberg et al. [29] give a detailed discussion of the issue, and of the question of whether clines or clusters are a better description of human genetic variation. However, our “axes of variation” are likely to be relatively robust to this cline/cluster controversy. If there is a genetic cline running across a continent, and we sample two populations at the extremes, then it will appear to the analyst that the two populations form two discrete clusters. However, if the sampling strategy had been more geographically uniform, the cline would be apparent. Nevertheless, the eigenvector reflecting the cline could be expected to be very similar in both cases. Our methods are conceptually simple, and provide great power, especially on large datasets. We believe they will prove useful both in medical genetics, where population structure may cause spurious disease associations [1,40–43]; and in population genetics, where our statistical methods provide a strong indication of how many axes of variation are meaningful. A parallel paper [14] explores applications to medical genetics. Mathematical Details A moments estimator. We justify our estimator of the “effective number of markers.” Theorem 1. Let λ 1, λ 2,… λm be eigenvalues of an m × m Wishart matrix MM′, where M is m × n with entries that are Gaussian with mean 0 variance σ 2 . Define If n,σ 2 are unknown, estimates are: where With these values of n^ and , the observed values of L 1 and are equal to their expected values. Note that in this section we define our Wishart as MM′, not , as n is unknown. This scaling hardly matters in applications, as our procedures are always scale-invariant. That is, we avoid assumptions on the variance of the Gaussian entries of M. Proof: Let a = (a 1,a 2,…am ) be a random vector uniformly distributed on the unit m-sphere. is Beta (1/2,(m − 1)/2)-distributed, and it follows that Let . Then To obtain the distribution of s, unconditioned on λ, we see that we can write s as where D = diag(λ 1, λ 2,… λm ). After an orthogonal transformation where X is our Wishart matrix and b is uniform (isotropic) on the unit sphere. By properties of the Gaussian distribution, the distribution of s as given by Equation 20 is independent of b. We choose b to be (1,0,0,…). It follows that s/σ 2 is distributed as a variate so that s = 2σ 2 G where G is Γ(n/2)-distributed. Thus, Comparing Equations 17 and 21, this proves: and comparing Equations 19 and 22, we find: From Equations 23 and 24: so that a natural estimator for n is: We then obtain as an estimate for σ: If we set: then Equation 25 simplifies to: This completes the proof of Theorem 1. It would be interesting to estimate the standard error for n^ . We next show that normalizing the eigenvalues of an m × m Wishart to sum to m does not change the asymptotics of the largest eigenvalue. In our data analysis we always normalize the empirical eigenvalues in this way. Theorem 2. Consider a Wishart matrix X with eigenvalues λi, originating from an m × m matrix M whose entries are Gaussian with mean 0 and variance 1. That is, . Let λ 1 be the largest eigenvalue of X. Define Define τ by which normalizes L,L′ by the Johnstone normalization of Equation 7 with μ and σ defined as in Equations 5 and 6. Then L and L′ both tend in distribution to the Tracy–Widom distribution as m,n → ∞,n/m → γ > 1. That is, the normalization of Equation 28 does not change the asymptotic distribution of L. Proof: Let Then So Each entry of M is standard normal, and so T has mean 1 and standard deviation Let s = σ(m,n) be the scale factor of the Johnstone normalization. Then we can show (we used Maple) that as n → ∞, Write T = 1 + x so that x has mean 0 and standard deviation u. Thus, x/s → 0 in probability as m → ∞. We now show that this implies that τ − τ′ tends to 0 in probability. From the definition of μ(m,n) in Equation 5, we have μ(m,n) 4. Since as m → ∞, (L − μ(m,n))/σ(m,n) tends to TW in distribution, and σ(m,n) → 0, it follows that P(L > 10) tends to 0 as m → ∞. Similarly, P(T 0. From Equation 29: All three probabilities on the right hand side of Equation 30 can be made arbitrarily small for large enough m. By Johnstone's theorem, τ → TW in distribution, and so τ′ → TW also. The Spectrum of the Covariance Matrix We now turn to genetic (genotype count) data, and analyze the theoretical covariance matrix of the data. We concentrate on the covariance of the sample genotypes at a single biallelic marker. Note that in contrast to the results for a Wishart discussed in Theorem 2, we are now interested in a case where there is population structure, which implies dependence between the samples. Consider sampling a marker from samples belonging to K populations. Suppose the allele has frequency pi in population i. We sample diploid genotypes, obtaining counts Cj of the variant allele from sample j. We suppose sample j belongs to population i = i(j), and that the sample size for population i is M(i). We discuss the spectrum (eigenvalues) of the covariance matrix of the raw counts Cj . Note that this is for the theoretical covariance not the sample covariance. We must specify the covariance of the population frequency vector We assume that there is a hidden allele frequency P whose exact distribution will not be important to us, but is diffuse across the unit interval (0,1). Then conditional on P we assume that p has mean P(1,1,…,1) and covariance matrix P(1 – P)B where B is independent of P. This is a natural framework, used (filling in details variously) by Balding and Nichols [44], Nicholson et al. [18], and STRUCTURE [9] in the correlated allele mode. For small population divergence, we can take the diagonal entry Bii as the divergence (FST ) between P and pi . Set and assume that all τi are of order τ, which is small. Conditional on p, then the Cj are independent. Cj has mean p and variance 2p(1 − p) where p = pi (j). This assumes Hardy–Weinberg equilibrium in each of the K populations. Theorem 3. With the assumptions above, define so that Ci * has mean 0. Let V* be the covariance matrix of C* and set . Conditional on the root frequency P: 1. does not depend on P. 2. has an eigenvalue 0 with eigenvector . 3. has for each k (1 ≤ k ≤ K), M(k) − 1 eigenvalues equal to 1 − τk . (We will call these the small eigenvalues.) 4. has K − 1 eigenvectors that span a vector space F* consisting of vectors v of length M whose coordinates are constant on samples from each population, and such that the sum of the coordinates of v are 0. 5. If the matrix B (the scaled covariance of the population frequencies p) has rank r, then r − 1 of the eigenvalues of that correspond to eigenvectors in F* depend on B. (So if B has full rank, all these eigenvalues depend on B.) If we allow each sample size M(k) → ∞, then then all such eigenvalues also → ∞. (We will call the corresponding eigenvalues the large eigenvalues). Proof: Let V be the covariance matrix of the counts C. Regard V = || Vij || as a linear operator in the natural way. Write π(i) for the population index of sample i(1 ≤ i ≤ K). We can write V = Vij as D + W where D is a diagonal matrix with the diagonal element Dii = dπ (i) and Wij = qπ (i),π(j). So the covariance structure depends only on the population labels of the samples. It follows that the vector space of M long column vectors has an orthogonal decomposition into subspaces invariant under V consisting of: 1) a subspace F of vectors whose coordinates are constant within a population. F has dimension K; 2) subspaces Si (1 ≤ i ≤ K). Vectors of Si are zero on samples not belonging to population i, and have coordinate sum 0, which implies that they are orthogonal to F. It now follows that V has K eigenvectors in F, and for each k (1 ≤ i ≤ K), (M(k) − 1 eigenvectors in Sk each of which have the same eigenvalue λk . Conditional on p, V acts on Sk as 2pk (1 − pk )I where I is the identity matrix. (The factor 2 comes from the two chromosomes sampled for each individual.) Thus, Now and so the eigenvalues corresponding to eigenvectors of Sk are: V* and V act identically on Sk, the vectors of which have coordinate sum 0, so this proves assertion 3 of Theorem 3. We now consider the action of V on the K-dimensional subspace F. It is convenient to define , a quantity we will need repeatedly. Let for each k (1 ≤ k ≤ K), f [k] be the vector whose coordinates are 0 except for samples i where π(i) = k, and where for such samples: The vectors f [k] form an orthonormal basis for F. Write dk = f [k].C. Set E to be the diagonal matrix It is easy to calculate that the random variables dk have, conditional on P, covariance matrix R, where and D = diag(m(1),m(2),…m(K)). Here E corresponds to sampling noise. In the main paper we subtract the sample mean from the counts C. So define the M long vector 1 = (1,1,…,1). Then Set , this is a linear transform T where We are interested in the action of T on F. Write Then from Equation 32, regarding T as a K × K matrix (abusing notation): T = I − Q where I is the identity matrix and Set It now follows from Equation 31 that if R* is the covariance matrix of the dk, then This is enough to prove that does not depend on P (assertion 1 of Theorem 3). Next, T(1) – R* (1) = 0, which proves assertion 2 of Theorem 3. The space F* of vectors F orthogonal to 1 is invariant under V and , thus R* will have K − 1 eigenvectors of F* (assertion 4 of Theorem 3). If B has rank K (which will be true except in special cases), then TDBDT has rank K − 1 and if M(k) → ∞ for each k, then R* will have K − 1 nonzero eigenvalues which become arbitrarily large. More generally, if B has rank r, then the matrix TDBDT will have rank r − 1, and the r − 1 eigenvalues of R* that depend on B, again will become arbitrarily large as M(k) → ∞. Note that the matrix TET which arises from sampling noise is bounded. (In fact TET is a contraction and has all eigenvalues less than 1.) This completes the proof of Theorem 3. The case in which B does not have full rank occurs if there has been a genetically recent admixture between two or more populations. In this case, even if there are K clearly distinct populations, fewer than K − 1 eigenvalues will become large as the sample size increases. Definition of the TW Density For completeness, we define the TW density. Our description is taken from [22]. Let q(x) be the solution of the differential equation: with the boundary condition: and Ai(x) is the Airy function. Then the TW distribution is given by: A table of the TW right-tail area, and density, is available on request. Some Questions in Theoretical Statistics We believe this work raises some challenges to theoretical statisticians. Our results with genetic simulations would be even more convincing if there were theorems (say for the Wishart case where the data matrix has Gaussian entries) that showed: 1) that using the effective number of markers calculated by Equation 10 instead of the true number of markers does not affect the asymptotics; 2) that the BBP phase change holds for real Wishart matrices as well as for complex; 3) in Figures 2 and 3 the P–P plots show a noticeably better fit at the high end, corresponding to low p-values. Explain! Methods Datasets used. For the data used in Figure 4, we use the H952 subset of the CEPH–HGDP panel [30,31,45] where some atypical samples and pairs of close relatives have been removed. For the data used in Figure 5, we use an unpublished sample collected and genotyped by Dr. Jonathan Seidman and Dr. S. Sangwatanaroj. This consisted of 25 samples from Northern Thailand (after removing some individuals who are close relatives of people whose samples we retained) and 45 samples each from China and Japan (data drawn from the International Human Haplotype Map Project [32]). The Northern Thai samples were genotyped using an Affymetrix Xba chip. The dataset analyzed consisted of the overlap between the SNPs successfully genotyped in HapMap and the Affymetrix chip, and included 40,560 SNPs. For the data of Mark Shriver and colleagues [5], we analyzed only autosomal data where no SNP had any missing data. We removed one individual who was a duplicate, two Burunge and Mbuti samples that represented close relatives of other samples, and nine Nasioi individuals who our data suggest are part of one or two extended families. Algorithm details. In the eigenanalysis of the Shriver data, we examine no more than two markers as independent regression variables for each marker we analyze, insisting that any marker that enters the regression be within 100,000 bases of the marker being analyzed. This slightly sharpens the results. Varying these parameters made little difference. For all STRUCTURE runs, we ran with a burn-in of 10,000 iterations with 20,000 follow-on iterations, and no admixture model was used. Computations were carried out on a cluster of Intel Xeon compute nodes, each node having a 3.06-GHz clock. For our coalescent simulations, we assumed a phylogenetic tree on the populations, and at each simulated marker, ran the coalescent back in time to the root of the tree. At this point we have a set of ancestors A of the sampled chromosomes. We now assume that the marker is biallelic and that the population frequency f of the variant allele in the ancestral population is distributed uniformly on the unit interval. Sample the frequency f and then choose an allele for each ancestor of A, picking the allele for each ancestor with probability f. Now retain the marker if it is polymorphic in our samples. This process is mathematically equivalent to having a very large outgroup population diverging from the sampled populations at the phylogenetic root, with the population panmictic before any population divergence, and ascertaining by finding heterozygotes in the outgroup. If our simulated samples have n individuals, our procedure yields a sample frequency that is approximately uniform on (1,2,…,2n − 1). For the admixture analysis that created the plot of Figure 8 we had a population C that was admixed with founder populations A and B. For each individual of C, we generated a mixing value x that is Beta-distributed B(3.5,1.5). Now for each marker independently, the individual was assigned to population A with probability x or B with probability 1 − x. Supporting Information SMARTPCA, a software package for running eigenanalysis in a LINUX environment, is available at our laboratory: http://rd.plos.org/david_reich_laboratory.

PMC ID:: 3896259

DOI:: 10.1038/ng.2802

PubMed ID:: 24162737

Data availability:

ScienceOpen disciplines: Chemistry

Keywords: Genetic Loci, Polymorphism, Single Nucleotide, Middle Aged, Male, Humans, statistics & numerical data, Genome-Wide Association Study, Genetic Predisposition to Disease, Female, Cohort Studies, Case-Control Studies, genetics, epidemiology, Alzheimer Disease, Aged, 80 and over, Aged, Age of Onset