Introduction The evolution of modern human populations has been accompanied by dramatic changes in environment and lifestyle. In the last 100,000 years, behaviorally modern humans have spread from Africa to colonize most of the globe. In that time, humans have been forced to adapt to a wide range of new habitats and climates. Following the end of the last ice age, 14,000 years ago, there was a major warming event that raised global temperatures to roughly their current levels. Further dramatic changes occurred with the transition from hunter-gatherer to agricultural societies, starting about 10,000–12,000 years ago in the Fertile Crescent, and a little later elsewhere. This was also a period marked by rapid increases in human population densities. Increased population density promoted the spread of infectious diseases, as did the new proximity of farmers to animal pathogens [1,2]. Each of these kinds of changes likely resulted in powerful selective pressures for new genotypes that were better suited to the novel environments. Indeed, there are a number of recent reports of genes that show signals of very strong and recent selection in favor of new alleles: for example, in response to malaria [3–5]; at the lactase gene in response to dairy farming ; at a salt sensitivity variant in response to climate ; and in genes involved in brain development [8,9]. To date, the best examples of recent selection in humans have all been discovered in studies of candidate genes where there was a prior hypothesis of selection. Hence, very little is known about how widespread such signals are; nor is there unbiased information about what kinds of genes or biological processes are most involved in the adaptation of modern humans. It is unclear whether these genes are the same kinds of genes that were most important in the earlier evolution of the Homo lineage, as identified from comparisons with chimpanzees [10–12]. It is also not known to what extent recent selective events have been geographically restricted, as opposed to taking place in all populations. A number of recent studies have detected more signals of adaptation in non-African populations than in Africans [13–17], and some of those studies have conjectured that non-Africans might have experienced greater pressures to adapt to new environments than Africans have. In this study, we use newly available, dense, single nucleotide polymorphism (SNP) data from the International HapMap Project  to create a first-generation map of selection across the human genome. Our search is aimed at finding loci where there is strong, very recent, selection in favor of alleles that have not yet reached fixation. By doing so, we aim to provide preliminary answers to these questions about the nature and extent of recent adaptation in modern humans. The loci that we identify will start to fill in the details about the ways in which modern humans have adapted to the selective pressures in the most recent stage of our evolution. Furthermore, signals of selective sweeps in progress indicate the presence of genetic variants that must have some significant effect on human phenotypic variation. Though the actual target and the nature of the selection are usually not immediately clear, it may well be that many of these variants affect complex phenotypes of medical relevance [19,7,20–22]. For this reason, we believe that a “selection map” of ongoing sweeps in the human genome will become an important aspect of annotation for the human genome sequence. Genome-wide association studies should be sure to tag haplotypes that appear to be targets of selection, and such haplotypes will be of particular interest if they produce association signals in mapping studies. To this end, we have created files of tag-SNPs that can be used to tag the major selection signals in each HapMap population, and a web tool, Haplotter, at http://pritch.bsd.uchicago.edu/data.html, that can be used to query any HapMap SNP for evidence of selection. Results We analyzed genome-wide SNP data from Phase 1 of the International HapMap Project . These data consist of ∼800,000 polymorphic SNPs in a total of 309 unrelated individuals. For the purpose of our analyses, we grouped the data into three distinct population samples of unrelated individuals, as follows (see Materials and Methods): 89 Japanese and Han Chinese individuals from Tokyo and Beijing, respectively, henceforth denoted as East Asian (ASN), 60 individuals of northern and western European origin (CEU), and 60 Yoruba (YRI) from Ibadan, Nigeria. Except where stated, our analyses focused on the autosomes only. Our analysis was based on haplotypes estimated by the HapMap Consortium using the program Phase 2 [18,23]. We estimated genome-wide, high-resolution LD-based recombination maps separately for all 3 samples, using our implementation of the Li and Stephens algorithm . The goal in our study is to identify loci where strong selection has driven new alleles up to intermediate frequency. Such alleles might be on their way to fixation, or might become balanced polymorphisms. The classic signal of strong directional selection is that because the favored allele increases in frequency very fast, it tends to sit on an unusually long haplotype of low diversity. Meanwhile, chromosomes that do not carry the selected allele have levels of diversity and LD that are more typical of the genome as a whole. This type of signal has been used in the past to argue for selection in a number of genes in humans and in Drosophila [3,5,6,8,9,25–29]. Since our data consist of pre-ascertained SNPs, they do not directly contain information about the underlying levels of nucleotide diversity. Nonetheless, we can expect that favored alleles will generally sit within large shared haplotypes, and that these haplotypes will be in sharp contrast with the more variable haplotypes on the unselected background (Figure 1A). In order to pursue this type of signal for genome-wide SNP data, we have developed a new test statistic that we denote iHS (integrated haplotype score). The iHS was chosen after performing extensive simulations to determine the most powerful statistic from a number of new and previously published test statistics (see below and Figure S1). Our new test begins with the EHH (extended haplotype homozygosity) statistic proposed by Sabeti et al. . The EHH measures the decay of identity, as a function of distance, of haplotypes that carry a specified “core” allele at one end. For each allele, haplotype homozygosity starts at 1, and decays to 0 with increasing distance from the core site (Figure 1B). As shown in the figure, when an allele rises rapidly in frequency due to strong selection, it tends to have high levels of haplotype homozygosity extending much further than expected under a neutral model. Hence, in plots of EHH versus distance, the area under the EHH curve will usually be much greater for a selected allele than for a neutral allele. In order to capture this effect, we compute the integral of the observed decay of EHH away from a specified core allele until EHH reaches 0.05. This integrated EHH (iHH) (summed over both directions away from the core SNP) will be denoted iHH A or iHH D , depending on whether it is computed with respect to the ancestral or derived core allele. Finally, we obtain our test statistic iHS using When the rate of EHH decay is similar on the ancestral and derived alleles, iHH A /iHH D ≍ 1, and hence the unstandardized iHS is ≍ 0. Large negative values indicate unusually long haplotypes carrying the derived allele; large positive values indicate long haplotypes carrying the ancestral allele. Since in neutral models, low frequency alleles are generally younger and are associated with longer haplotypes than higher frequency alleles, we adjust the unstandardized iHS to obtain our final statistic which has mean 0 and variance 1 regardless of allele frequency at the core SNP: The expectation and standard deviation of ln(iHH A /iHH D ) are estimated from the empirical distribution at SNPs whose derived allele frequency p matches the frequency at the core SNP. The iHS is constructed to have an approximately standard normal distribution and hence the sizes of iHS signals from different SNPs are directly comparable regardless of the allele frequencies at those SNPs. Since iHS is standardized using the genome-wide empirical distributions, it provides a measure of how unusual the haplotypes around a given SNP are, relative to the genome as a whole, and it does not provide a formal significance test. In our data analysis, iHS is computed for every SNP with minor allele frequency > 5%, treating each SNP in turn as a core SNP. The iHS at each SNP measures the strength of evidence for selection acting at or near that SNP. However in simulations we have found that instead of treating each SNP separately, it is more powerful to look for windows of consecutive SNPs that contain numerous extreme iHS scores (Figure S2). This is because selective sweeps tend to produce clusters of extreme iHS scores across the sweep region, while under a neutral model, extreme iHS scores are scattered more uniformly (unpublished data). In principle, we might expect that large negative iHS scores, indicating that a derived allele has swept up in frequency, are of the most interest. However, in simulations, a sweep can also produce large positive iHS values at nearby SNPs if ancestral alleles hitchhike with the selected site. Furthermore, it is plausible that selection may sometimes switch to favor an ancestral allele that has been segregating in the population. For these reasons, we will treat both extreme positive, and extreme negative iHS scores as potentially interesting. It is now well-known that recombination rates are extremely heterogeneous across the genome, even at fine scales [30–32]. Such rate variation is a potential source of false positives when looking for regions with unusual haplotype structure as we are here. Our test is designed to control for rate variation in two ways. First we estimated high-resolution genetic maps based on LD patterns and used the estimated genetic distances when calculating iHS. By basing analysis on the fine scale genetic maps, the lengths of haplotypes that extend across large recombination coldspots are appropriately downweighted, and haplotypes that cross hotspots are upweighted. Second, since iHS is based on a ratio of haplotype homozygosities, the two alleles serve as internal controls for each other . Hence, inaccuracies in the estimated genetic map will tend to cancel out of the ratio, as will any other factors that cause the extent of haplotype homozygosity to be heterogeneous across the genome. Figure 2 plots the power of iHS and of two standard tests of selection based on summaries of the frequency spectrum. These simulations, as well as the simulations shown below, are designed to match the properties of the data as closely as possible (Materials and Methods). As shown, the iHS outperforms the frequency spectrum tests, as well as other EHH-based statistics, across a broad range of frequencies of the selective sweep (Figure 2, Figure S2). Furthermore, iHS is robust to regional variation in SNP ascertainment while tests of the frequency spectrum may not be. Nonetheless, iHS has limited ability in the HapMap data to detect low frequency sweeps and to detect sweeps that are very near fixation. The iHS statistic is constructed to provide a tool for identifying SNPs, or genomic regions, that are unusual relative to the genome as a whole, and not to provide formal significance testing relative to a theoretical model. We will show that in all populations there is an excess of extreme iHS signals relative to simulated models. However, since there is considerable uncertainty in simulated models, we prefer not to assign formal p-values to the signals that we find. Widespread Signals of Recent Selection We calculated iHS for all SNPs with minor allele frequency >5%. Figure 3 shows a summary of the extreme values on Chromosome 2, with similar plots across all autosomes and the X chromosome provided in Figures S3–S25. The plotted points are those with |iHS| > 2.5, and correspond to the most extreme 1% of iHS values. There is clear clustering of extreme values into distinct regions where many SNPs show evidence for selection. One such region, at 135–138 Mb in Europeans, contains the lactase gene (LCT), previously noted as a target of strong selection [6,33]. In principle, clustering of unusual iHS scores might occur even under a completely neutral model. However, several lines of evidence indicate that selection is indeed producing widespread signals in the data. First, simulations show that we observe more extreme values of the unstandardized iHS scores than expected under a range of neutral models (Figure 4). For each population, we performed neutral simulations that matched the observed SNP density and allele frequency spectrum, which included extensive recombination rate variation, including hotspots, and utilized a range of demographic models consistent with previous studies of demographic parameters (see Materials and Methods). The demographic models included a variety of bottleneck models for East Asians and Europeans, and models of constant size with recent growth for Yoruba [34,35]. Since previous genetic studies indicate that the Yoruba are likely to have the least complex demographic history [34–36], we focus mainly on simulation results for that population. We find that extreme values of the unstandardized iHS are more frequent in the real Yoruba data than in any of the simulated models, as expected if the largest iHS values are frequently due to selection. There is also an excess of extreme values in the Europeans, but in the East Asians some demographic models show as much variance as the real data (unpublished data). Second, in all three populations, there is greater clustering of extreme (standardized) iHS values in the real data than in neutral simulations with heterogeneous recombination rates. For example, in simulations matching aspects of the Yoruba data, only 0.1% of windows of 50 consecutive SNPs had more than 16 SNPs with |iHS| > 20. In the actual YRI data, we observed a 14-fold enrichment of such windows. Since this calculation is based on empirically standardized scores, the signal of extra clustering is distinct from the previous signal of overdispersion of the unstandardized scores. Simulations designed to match the East Asian and European datasets also indicate that there is excess clustering of extreme iHS values in the real data, though for these populations the relative enrichment is quantitatively smaller due to the extra variance seen in neutral bottleneck models (the data show 2.3-fold and 2.7-fold enrichment of the top 0.1% of windows, respectively). In summary, the visual sense of clustering of high |iHS| scores in Figure 3 does indeed exceed the level of clustering expected under neutrality, supporting a model in which distinct selective events produce large |iHS| scores across discrete regions. Third, in our data, extreme iHS scores frequently occur in regions where the frequency spectrum also indicates the action of selection (Figure 5). As shown in Figure 2, a version of Fay and Wu's H test  for ascertained SNPs (Hasc) provides a useful method for detecting sweeps where the selected site has reached high frequency. In simulations we find that high frequency selected sites tend to have both strongly negative iHS scores and strongly negative values of Fay and Wu's Hasc . However in neutral simulations, iHS and Hasc are essentially uncorrelated. In the Yoruba data, as many as one half of high frequency–derived SNPs with large iHS scores fall into the most extreme 1% of windows for Hasc , genome-wide. This correlation argues strongly that many of our extreme iHS scores are in fact the result of positive selection. In our data, an excess of significant Hasc scores is also seen for low-frequency alleles with positive iHS scores. This probably results from ancestral alleles that have hitchhiked to high frequency. Similar, but weaker, correlations are present in both the European and East Asian data (unpublished data). Fourth, there is a highly significant enrichment for extreme iHS values in genic regions (p 2 is 1.23-fold higher in genic SNPs than non-genic SNPs in Yoruba, 1.16-fold higher in Europeans, and 1.13-fold higher in East Asians. A further enrichment in extreme iHS values is found in SNPs in overlapping genes compared with SNPs in non-overlapping genes. To check that these results are not an artifact of the higher HapMap SNP density in genes, we thinned genic SNPs at random so that the proportion of genic SNPs matched the proportion of the genome containing genes. After reanalysis, the difference between genic and nongenic iHS values remained about the same as before (unpublished data). Last, various regions identified previously as likely targets of sweeps-in-progress are detected by our survey, including signals in Europeans in the lactase region [6,33], in the 17q21 inversion , and at CYP3A5 ; in the ADH cluster on Chromosome 4 in East Asians ; and in olfactory receptor clusters on Chromosomes 11p15 and 11q11 in Yoruba . However, we do not detect all previously identified selection candidates, for example failing to find signals at G6PD in Yoruba [3,5], and in two genes involved in brain development that were recently reported to be under recent positive selection [8,9]. Overview of Selection Regions To identify the strongest signals of selection, we divided the genome into non-overlapping windows of 100 kb. In this analysis windows were defined by physical location to facilitate the comparison of signals across populations. For each population, we identified the windows in the highest 1% of the empirical distribution for proportion of SNPs with |iHS| > 2. The positions of these windows on Chromosome 2 are indicated by vertical bars below each panel of Figure 3. Henceforth, we consider these windows to be candidates for containing selective sweeps. We find that 8 of 14 genomic regions listed as selection candidates by the HapMap Project are among the top 5% of our signals (Table 10 in ). A summary of some of our strongest signals is shown in Table 1, and a complete list is provided online in Protocols S1–S3). As an illustration, Figure 6 depicts the haplotype patterns, decay of EHH with distance, and plot of iHS scores for three strong candidate regions identified by our genome-wide scan. Analysis of the haplotype structure in candidate selection regions indicates that the events detected are typically very recent. We calculated the lengths of the haplotypes around the SNP with the largest negative iHS value in each region (Table S2). The average total distance between the first point to the left, and to the right of the core SNP at which EHH on the selected haplotypes drops below 0.25 is 0.52 cM in both East Asians and Europeans, and 0.32 cM in Yoruba. Hence, candidate sweep regions tend to be narrower in Yoruba than in the non-African populations, indicating that typical sweep events may be substantially younger in the non-African populations. (The size of the area affected by a strong sweep depends only weakly on the effective population size  and so the larger effective population size in Yoruba is not an explanation of the smaller average sweep size.) A fully rigorous estimation of the ages of the candidate sweeps is difficult with the current data. However, making the simplistic assumption of a star-shaped genealogy for the favored haplotypes and assuming a generation time of 25 y, suggests average ages of ≍6,600 years and ≍10,800 years in the non-African, and African populations, respectively (Materials and Methods). Simulations using SelSim (Materials and Methods) suggest that these haplotype spans are consistent with selection coefficients of 0.01–0.04, assuming a population size of 10,000. These estimated ages should not be taken to imply a burst of selection at a particular time; instead, these ages and selection coefficients might represent areas in the parameter space in which we have good power. The longest haplotypes around derived alleles at >50% frequency extend 1.39 cM in East Asians (near the Gaucher disease gene, GBA), 1.25 cM in Europeans (near NKX2–2, which is involved in insulin regulation), and 0.97 cM in Yoruba (in a gene desert on Chromosome 5p15). These long haplotypes indicate extremely strong selection on recent mutations, though it is difficult to be confident about the actual genetic target of the selective events. In summary, the selection events that we detect are generally very recent, substantially postdating the separation times of these populations, and falling mainly within the agricultural phase of human evolution. Figure 7 shows the extent of overlap of sweep regions across the three populations. Most of the sweep regions are found in only one of the three populations, consistent with the estimates indicating that these events generally postdate population separation. Nonetheless, there is a clear excess of sweeps that are shared between pairs of populations, or among all three populations. In principle, sharing of signals between populations might also be due to haplotypes that are inherited from the ancestral populations. However, this is probably a small effect since such unusually long haplotypes would be unlikely to survive the effects of recombination for >1,000 generations, separately in each population. Instead, the data suggest that most of the selective events that we detect are local to a single population, but that a significant fraction of the selective events are experienced by more than one population. This view is further supported by the relationship between iHS and allele frequency divergence measured using FST (Figures S26–S28). SNPs with high |iHS| in one population, but low |iHS| in another are likely to have high FST , indicating that the SNP has changed frequency rapidly since population separation. Among the modest number of SNPs that have extreme positive iHS or extreme negative iHS in both populations, there is not an excess of high FST , perhaps due to recent gene flow of alleles (and haplotypes) that are favored in multiple populations. Types of Genes under Selection Next, we modified our analysis to study what types of genes are most commonly involved in recent adaptation. For every gene we determined the number of SNPs with high |iHS| in a 50 SNP window centered on the gene (Materials and Methods). The genes in the upper 10% of the empirical distribution for number of significant SNPs were then considered to be candidate targets of selection. Our procedure was designed to be robust to variation in gene size and SNP density across gene categories. The PANTHER Gene Ontology database provides a classification of genes into 222 nested categories according to biological process . We tested whether any of these categories showed an enrichment of genes with signals of recent selection (Materials and Methods). In this analysis, one might be concerned that there is low power to detect enrichment, since the expected counts for many categories are low, despite considering such a large fraction (10%) of the genes as candidates for selection. Nonetheless, several categories show up as strongly significant in one or more populations (Table 2), including the related categories of chemosensory perception and olfaction; as well as gametogenesis, spermatogenesis, and fertilization. These types of processes have been identified as targets of selection in previous studies of human-chimpanzee divergence [10,11]. Overall, there is a modest enrichment for genes that show signals both of very recent selection in our study, and selection on the human lineage as a whole  (Table S3). In addition to the areas of overlap, we find enrichment in new categories not previously identified as targets of selection, including categories related to metabolism of carbohydrates, lipids, and phosphates, as well as vitamin transport. For some categories, the p-values are imprecise as there is clustering of related genes that are all significant (Table S4). We now describe in greater detail some of the categories of gene functions that show enrichment of selective signals. Except when noted, OMIM or EntrezGene provide references for the gene information given below [42,43]. The genes listed below are generally in the top 1%, and all are in the top 5% of signals in the relevant populations. However, some caution is required since the strongest signals often span both the target of selection as well as neighboring genes. Recent reports have shown that genes involved in fertility and reproduction are subject to rapid adaptive evolution in primates due to sexual competition and perhaps defense against pathogens [10,11,44]. We observe signals for selection targeting several aspects of fertility and reproduction, including the basic protein structure of sperm (RSBN1 in East Asians and Yoruba), sperm motility (SPAG4 in Europeans and East Asians; ODF2 in Europeans), sperm and egg viability (ACVR1 in Europeans, CPEB2 in Yoruba), regulation of the female immune response to sperm (TGM4), egg fertilization (the CRISP gene cluster near 6p21.3 in Europeans), and testis determination (NR0B1 in Europeans). Some of the strongest signals of recent selection appear in various types of genes related to morphology. For example, four genes involved in skin pigmentation show clear evidence of selection in Europeans (OCA2, MYO5A, DTNBP1, TYRP1). All four genes are associated with Mendelian disorders that cause lighter pigmentation or albinism, and all are in different genomic locations, indicating the action of separate selective events. One of these genes, OCA2, is associated with the third longest haplotype on a high frequency SNP anywhere in the genome for Europeans. A fifth gene, SLC24A5, has recently been shown by another group to impact skin pigmentation and to have a derived, selected allele near fixation in Europeans . Though iHS has reduced power for alleles near fixation, SNPs near this gene also show strong iHS signals in Europeans (Table S2). Various genes involved in skeletal development have also been targets of recent selection. Three related proteins involved in bone morphogenesis show signals of selection in Europeans (BMP3 and BMPR2) and in East Asians (BMP5). In addition, GDF5, a gene in which mutations cause skeletal malformations, shows strong signals of selection in both Europeans and East Asians. Other morphological features also appear to be targets of selection, including hair formation and patterning in Yoruba (the keratin cluster near 17q12; and FZD6). An important type of selective pressure that has confronted modern humans is the transition to novel food sources with the advent of agriculture and the colonization of new habitats [19,21]. As noted above, we see a strong signal of selection in the alcohol dehydrogenase (ADH) cluster in East Asians, including the third longest haplotype around a high frequency allele in East Asians. A variety of genes involved in carbohydrate metabolism have evidence for recent selection, including genes involved in metabolizing mannose (MAN2A1 in Yoruba and East Asians), sucrose (SI in East Asians), and lactose (LCT in Europeans). Processing of dietary fatty acids is another system with signals of strong selection, including uptake (SLC27A4 and PPARD in Europeans), oxidation (SLC25A20 in East Asians) and regulation (NCOA1 in Yoruba and LEPR in East Asians). The latter gene (LEPR) is the leptin receptor and plays an important role in regulating adipose tissue mass. Recent articles have proposed that genes involved in brain development and function may have been important targets of selection in recent human evolution [8,9]. While we do not find evidence for selection in the two genes reported in those studies (MCPH1 and ASPM), we do find signals in two other microcephaly genes, namely, CDK5RAP2 in Yoruba, and CENPJ in Europeans and East Asians . Though there is not an overall enrichment for neurological genes in our gene ontology analysis, several other important brain genes also have signals of selection, including the primary inhibitory neurotransmitter GABRA4, an Alzheimer's susceptibility gene PSEN1, and SYT1 in Yoruba; the serotonin transporter SLC6A4 in Europeans and East Asians; and the dystrophin binding gene SNTG1 in all populations. Several other biological processes that have not previously been proposed as targets of selection also show an enrichment for signals of selection. For example, the category of electron transport genes is significant in Europeans, due in large part to selection in CYP genes. CYP genes are mainly expressed in the liver and catalyze many reactions involved in breaking down foreign compounds, including the majority of pharmaceutical agents. Genes in this class with evidence for selection include four genes in the CYP450 gene cluster on Chromosome 1p33, as well as CYP genes in other genomic locations including CYP3A5, CYP2E1, and CYP1A2. Another category showing enrichment of selection signals is phosphate metabolism in East Asians and Europeans. Genes in the phosphatidylinositol pathway seem to be particularly overrepresented among the significant genes in this category, including INPP5E, PI4K2B, IHPK1, IHPK2, IHPK3 in East Asians and IMPA2 and SYNJ1 in Europeans. Curiously, Yoruba appears to have a greater number of signals on the X chromosome that map to genes compared with the other two populations. For example, the top 1% of 100-kb windows contain 15 genes in YRI (the largest of these windows containing only four genes), but six genes and two genes in Europeans and East Asians, respectively. Discussion In this paper we provide the first genome-wide map of incomplete selective sweeps in humans. We find widespread signals of selection in all three populations. These selective events are generally very recent, falling mainly within the Holocene era, and substantially postdating the separation of the three populations. Selective sweeps in Yoruba tend to be narrower and apparently older than in the non-African populations, perhaps explaining why previous low-resolution scans for selection have reported a deficit of selective events in African populations [13–16]. (Two of those studies also used African-American samples, in which it is possible that European admixture may dilute evidence for selection in Africans.) Though most selected regions are not shared across populations, there is still a clear excess of shared selective events. Indeed, since we have incomplete power to detect selection, it is likely that we tend to underestimate the degree of sharing across populations. Our analysis of the types of genes involved in recent selection provides a first insight into the type of biological processes that have been targets of selection in the latest stages of our evolution. Some of these functions (especially olfaction and fertilization-related genes) have also been found to show signals of sustained selection over much longer evolutionary timescales. However, other classes of selected genes (for example the skin pigmentation genes and metabolic genes) likely reflect the process of adaptation to modern conditions and new environments. Nonetheless, gaining a full view of the kinds of selective pressures that have faced modern humans, and our biological adaptations to those pressures, remains a challenging problem. For many selective signals there is uncertainty about the actual genetic target. Even when the target is clear, the nature of the adaptation is often not. Hence, interpreting the story of human adaptation promises to be an interesting research area for years to come. The tremendous shifts experienced by modern human populations in habitats, food sources, population densities, and pathogen exposures have surely led to direct selection pressures on medically relevant phenotypes. At least three of the regions that we identified as targets of selection have previously been associated with complex phenotypes, including CYP3A5 (salt-sensitive hypertension), ADH (alcoholism susceptibility), and the 17q21 inversion (recombination rates and fertility) [7,20,47]. These examples form part of an emerging pattern linking selection with complex trait phenotypes [21,22], as first proposed for diabetes in the “thrify genotype” hypothesis . Therefore, selected haplotypes identified by our scan can be thought of as “selection candidates” for involvement in complex traits. With this in mind, we have developed an online tool for evaluating iHS at any HapMap SNP, and we provide files of tag-SNPs that can be used in genome-wide association studies to tag all the haplotypes with the strongest selection signals in each population. Materials and Methods HapMap project data information All analysis is based on the HapMap Project Phase I/rel#16a datafiles (http://www.hapmap.org) . For the CEU and YRI samples, we analyzed the data from the 60 unrelated parents. Due to their close genetic similarity, and in order to have a single larger sample, we pooled the CHB (Han Chinese from Beijing) and JPT (Japanese from Tokyo) samples to form a single sample of 89 unrelated Asian individuals, denoted as the ASN sample. Haplotype phase estimation for all the data was performed by the HapMap consortium using Phase 2.0 [18,23]. The phasing procedure also imputed all missing genotypes at SNPs with 5%. To compute the iHH, what we do is equivalent to the following. We plot the decay of EHH to the left and right of the core SNP, for each allele, using all available SNPs. The x-axis is in units of genetic distance, 4Nr, estimated as described above. The EHH values at successive SNPs are joined by straight lines, and then we compute the total area under each curve, between the nearest points to the left and right of the core SNP where the EHH drops below 0.05. Since the population genetics of the X chromosome differs from that of the autosomes (e.g., due to smaller effective population size), we standardized the X chromosome separately from the autosomes, and excluded it from our overall genome-wide analyses. The SNP data include occasional large gaps between successive SNPs. If the region spanned by EHH > 0.05 reached a chromosome end or the start of a gap >200 kb, then no iHS value was reported for the core SNP. If there was a smaller gap (20–200 kb), then the genetic distance spanned by the gap was reduced by a multiplicative factor 20/g, where g is the gap size in kb. This is an ad hoc fix designed to eliminate spurious signals produced by occasional large gaps, while minimizing the loss of data. After applying the various filters, we were left with 676,718, 701,633, and 621,915 SNPs with iHS scores in the European, Yorubans, and East Asian populations respectively. Summaries of the frequency spectrum Both Tajima's D and Fay and Wu's H tests use summaries of the site frequency spectrum to identify signatures of selection [37,56]. However, these tests were designed for full-sequence data, and not for pre-ascertained SNPs. Nonetheless, we have tested the value of using these statistics to identify regions whose frequency spectra are strongly different from the bulk of the genome, suggesting the influence of selection. To compute these statistics on ascertained data, we treated the ascertained genotypes as if they were the complete sequence data, and then computed D and H in the usual way. The ascertainment-biased summary statistics are denoted Dasc and Hasc to indicate that they are not expected to follow the usual distributions. Since the ascertained data are enriched for high-frequency variants, the average Dasc is highly positive, whereas the average Hasc is highly negative. We obtained empirical distributions of both Dasc and Hasc in all 50-marker windows across the entire autosomal genome for each population. The empirical p-value for a particular window represents the proportional rank of the statistic in that window compared with the overall genome distribution (i.e., low p-values indicate a more negative D or H than the bulk of the empirical distribution). Identifying candidate signals of selection In order to identify genes or genomic regions with signals of selection, we focused on windows of consecutive SNPs. Where possible, we used a standard window size of 50 SNPs. For analyses of genes with ≤50 SNPs, we created windows of 50 SNPs centered on each gene. For the ∼5% of genes with >50 SNPs, we treated the window as containing all SNPs in the gene. To study the overlap of signals across populations, we modified our analysis to consider fixed, nonoverlapping, 100-kb windows so that the window positions would correspond exactly across populations. For both kinds of window definition, the size of signal in each region was quantified by the proportion of SNPs with |iHS| > 2. Simulations indicated that this criterion provides a powerful signal of selection. Separately for each population and analysis, we determined empirical cutoffs for the top 1%, 5%, etc., of signals genome-wide, and considered these strongest signals to indicate candidate selection regions. Since the numbers of SNPs varied considerably across the fixed 100-kb windows, we grouped windows into bins with similar numbers of SNPs, and determined empirical cutoffs separately for each bin. Regions with <10 SNPs were dropped, as were bins with <20 regions. The X-chromosome data was analysed separately. The thresholds for empirical cutoffs for the X chromosome were based on the autosomal cutoffs. 100-kb windows and genes in the X chromosome that were above these thresholds were considered to be significant. Estimating haplotype ages To obtain a crude estimate of the ages of sweeps, we assumed (1) a star phylogeny for the selected haplotypes, (2) that the two chromosomes which descended from a common ancestor and carry a selected allele are identical between the core SNP and the point of the nearest recombination event on either lineage, (3) that the two chromosomes start to differ immediately beyond the nearest recombination event. Hence, Pr[Homoz] = e −2rg where Pr[Homoz] is the probability that two chromosomes are homozygous at a recombination distance r from the selected site, given a common ancestor g generations before the present. Taking the generation time to be 25 y, the ancestor time in years becomes t = 25g. Then, for example in Africans, when Pr[Homoz] = 0.25, we observed that 2r = 0.32% (the distances in the text are the sum of r in both directions). Then t = −(25)(100)ln(.25/.32 ≍ 10,000 years. A similar calculation was performed for the other populations. Gene ontology analysis The PANTHER Gene Ontology database was downloaded from https://panther.appliedbiosystems.com in March 2005. For each PANTHER “biological-process,” category we determined both the total number of genes, and the number of genes in the top 10% of empirical signals (defined as above). One-tailed p-values for enrichment of particular categories were obtained either using the standard Chi-square approximation if the expected number of counts exceeded five, or otherwise by exact enumeration of the binomial probabilities. We used a Bonferroni correction to assess overall experiment-wide significance, though this is likely to be particularly conservative in this case since the 222 PANTHER categories are partially nested and overlapping. Haplotype plotting An online browser providing haplotype plots and iHS scores for all HapMap SNPs is at http://pritch.bsd.uchicago.edu/data.html. Supporting Information Figure S1 Power to Detect Selection Using Various Test Statistics (3 KB PDF). Click here for additional data file. Figure S2 Single versus Multipoint iHS Power Plot (2 KB PDF). Click here for additional data file. Figure S3 Plots of Chromosome 1 SNPs with Extreme iHS Values Illustrate Discrete Clusters of Signals (118 KB PDF). Click here for additional data file. Figure S4 Plots of Chromosome 2 SNPs with Extreme iHS Values Illustrate Discrete Clusters of Signals (174 KB PDF). Click here for additional data file. Figure S5 Plots of Chromosome 3 SNPs with Extreme iHS Values Illustrate Discrete Clusters of Signals (122 KB PDF). Click here for additional data file. Figure S6 Plots of Chromosome 4 SNPs with Extreme iHS Values Illustrate Discrete Clusters of Signals (120 KB PDF). Click here for additional data file. Figure S7 Plots of Chromosome 5 SNPs with Extreme iHS Values Illustrate Discrete Clusters of Signals (91 KB PDF). Click here for additional data file. Figure S8 Plots of Chromosome 6 SNPs with Extreme iHS Values Illustrate Discrete Clusters of Signals (116 KB PDF). Click here for additional data file. Figure S9 Plots of Chromosome 7 SNPs with Extreme iHS Values Illustrate Discrete Clusters of Signals (85 KB PDF). Click here for additional data file. Figure S10 Plots of Chromosome 8 SNPs with Extreme iHS Values Illustrate Discrete Clusters of Signals (129 KB PDF). Click here for additional data file. Figure S11 Plots of Chromosome 9 SNPs with Extreme iHS Values Illustrate Discrete Clusters of Signals (144 KB PDF). Click here for additional data file. Figure S12 Plots of Chromosome 10 SNPs with Extreme iHS Values Illustrate Discrete Clusters of Signals (76 KB PDF). Click here for additional data file. Figure S13 Plots of Chromosome 11 SNPs with Extreme iHS Values Illustrate Discrete Clusters of Signals (62 KB PDF). Click here for additional data file. Figure S14 Plots of Chromosome 12 SNPs with Extreme iHS Values Illustrate Discrete Clusters of Signals (92 KB PDF). Click here for additional data file. Figure S15 Plots of Chromosome 13 SNPs with Extreme iHS Values Illustrate Discrete Clusters of Signals (58 KB PDF). Click here for additional data file. Figure S16 Plots of Chromosome 14 SNPs with Extreme iHS Values Illustrate Discrete Clusters of Signals (44 KB PDF). Click here for additional data file. Figure S17 Plots of Chromosome 15 SNPs with Extreme iHS Values Illustrate Discrete Clusters of Signals (45 KB PDF). Click here for additional data file. Figure S18 Plots of Chromosome 16 SNPs with Extreme iHS Values Illustrate Discrete Clusters of Signals (31 KB PDF). Click here for additional data file. Figure S19 Plots of Chromosome 17 SNPs with Extreme iHS Values Illustrate Discrete Clusters of Signals (31 KB PDF). Click here for additional data file. Figure S20 Plots of Chromosome 18 SNPs with Extreme iHS Values Illustrate Discrete Clusters of Signals (79 KB PDF). Click here for additional data file. Figure S21 Plots of Chromosome 19 SNPs with Extreme iHS Values Illustrate Discrete Clusters of Signals (21 KB PDF). Click here for additional data file. Figure S22 Plots of Chromosome 20 SNPs with Extreme iHS Values Illustrate Discrete Clusters of Signals (27 KB PDF). Click here for additional data file. Figure S23 Plots of Chromosome 21 SNPs with Extreme iHS Values Illustrate Discrete Clusters of Signals (37 KB PDF). Click here for additional data file. Figure S24 Plots of Chromosome 22 SNPs with Extreme iHS Values Illustrate Discrete Clusters of Signals (51 KB PDF). Click here for additional data file. Figure S25 Plots of the X Chromosome SNPs with Extreme iHS Values Illustrate Discrete Clusters of Signals (67 KB PDF). Click here for additional data file. Figure S26 Plot Comparing Fst with iHS (Europeans and East Asians) (17 KB PDF). Click here for additional data file. Figure S27 Plot Comparing Fst with iHS (East Asians and Yoruba) (17 KB PDF). Click here for additional data file. Figure S28 Plot Comparing Fst with iHS (Europeans and Yoruba) (17 KB PDF). Click here for additional data file. Figure S29 Polarized Frequency Spectra for Each HapMap Population (15 KB PDF). Click here for additional data file. Figure S30 Combined Supporting Information File (2.0 MB PDF). Click here for additional data file. Protocol S1 Summary of Strongest Regions of Selection Genome-Wide in East Asians (42 KB TXT). Click here for additional data file. Protocol S2 Summary of Strongest Regions of Selection Genome-Wide in Europeans (44 KB TXT). Click here for additional data file. Protocol S3 Summary of Strongest Regions of Selection Genome-Wide in Yoruba (46 KB TXT). Click here for additional data file. Protocol S4 Demographic Simulation Information Supplement (2.2 MB DOC). Click here for additional data file. Table S1 iHS Signal in Genic versus Non-Genic Regions (135 KB DOC). Click here for additional data file. Table S2 Genetic Spans of the Longest Haplotypes in Each Population (49 KB DOC). Click here for additional data file. Table S3 Comparison with Clark et al. 2003 Study (60 KB PDF). Click here for additional data file. Table S4 Gene Clusters That Contribute to Significant GO Categories (42 KB DOC). Click here for additional data file.