Extracting research-quality phenotypes from electronic health records to support precision medicine

There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

Abstract

The convergence of two rapidly developing technologies - high-throughput genotyping and electronic health records (EHRs) - gives scientists an unprecedented opportunity to utilize routine healthcare data to accelerate genomic discovery. Institutions and healthcare systems have been building EHR-linked DNA biobanks to enable such a vision. However, the precise extraction of detailed disease and drug-response phenotype information hidden in EHRs is not an easy task. EHR-based studies have successfully replicated known associations, made new discoveries for diseases and drug response traits, rapidly contributed cases and controls to large meta-analyses, and demonstrated the potential of EHRs for broad-based phenome-wide association studies. In this review, we summarize the advantages and challenges of repurposing EHR data for genetic research. We also highlight recent notable studies and novel approaches to provide an overview of advanced EHR-based phenotyping.

Electronic supplementary material

The online version of this article (doi:10.1186/s13073-015-0166-y) contains supplementary material, which is available to authorized users.

Related collections

Most cited references 106

Record: found
Abstract: found
Article: not found

Measuring diagnoses: ICD code accuracy.

Kimberly O'Malley, Karon Cook, Matt D Price … (2005)

To examine potential sources of errors at each step of the described inpatient International Classification of Diseases (ICD) coding process. The use of disease codes from the ICD has expanded from classifying morbidity and mortality information for statistical purposes to diverse sets of applications in research, health care policy, and health care finance. By describing a brief history of ICD coding, detailing the process for assigning codes, identifying where errors can be introduced into the process, and reviewing methods for examining code accuracy, we help code users more systematically evaluate code accuracy for their particular applications. We summarize the inpatient ICD diagnostic coding process from patient admission to diagnostic code assignment. We examine potential sources of errors at each step and offer code users a tool for systematically evaluating code accuracy. Main error sources along the "patient trajectory" include amount and quality of information at admission, communication among patients and providers, the clinician's knowledge and experience with the illness, and the clinician's attention to detail. Main error sources along the "paper trail" include variance in the electronic and written records, coder training and experience, facility quality-control efforts, and unintentional and intentional coder errors, such as misspecification, unbundling, and upcoding. By clearly specifying the code assignment process and heightening their awareness of potential error sources, code users can better evaluate the applicability and limitations of codes for their particular situations. ICD codes can then be used in the most appropriate ways.

0 comments Cited 319 times – based on 0 reviews      Review now

Bookmark

Record: found
Abstract: found
Article: not found

Serving the enterprise and beyond with informatics for integrating biology and the bedside (i2b2).

Shawn N Murphy, Griffin Weber, Michael Mendis … (2010)

Informatics for Integrating Biology and the Bedside (i2b2) is one of seven projects sponsored by the NIH Roadmap National Centers for Biomedical Computing (http://www.ncbcs.org). Its mission is to provide clinical investigators with the tools necessary to integrate medical record and clinical research data in the genomics age, a software suite to construct and integrate the modern clinical research chart. i2b2 software may be used by an enterprise's research community to find sets of interesting patients from electronic patient medical record data, while preserving patient privacy through a query tool interface. Project-specific mini-databases ("data marts") can be created from these sets to make highly detailed data available on these specific patients to the investigators on the i2b2 platform, as reviewed and restricted by the Institutional Review Board. The current version of this software has been released into the public domain and is available at the URL: http://www.i2b2.org/software.

0 comments Cited 307 times – based on 0 reviews      Review now

Bookmark

Record: found
Abstract: found
Article: not found

Designing Genome-Wide Association Studies: Sample Size, Power, Imputation, and the Choice of Genotyping Chip

Chris Spencer, Zhan Su, Peter Donnelly … (2009)

Introduction The International HapMap project [1],[2] documented the strong correlations between alleles at polymorphic loci in close physical proximity along human chromosomes. As a consequence it is necessary to genotype only a subset of loci to capture much of the common variation in the genome. Combined with recent technological innovations this observation has made the concept of genome-wide association (GWA) studies a reality [3],[4]. Over the few last years these studies have been very successful in uncovering new disease genes for many different complex diseases [5]. Well over 300 such loci have already been published and many more studies are currently being planned. In the design of such studies two fundamental decisions have to be made: which loci to genotype, and in how many individuals. Both decisions have practical constraints. For example it is currently not possible to assay all known variation in the human genome at a reasonable cost and choices must be made between a set of commercially available genotyping chips. Similarly, sample sizes are often limited by the number of well characterized clinical samples. Therefore, ultimately, the researcher and funding bodies must ask how to use the financial and practical resources available in order to best further the understanding of the genetics of the disease or trait of interest. A primary consideration should be the power of the study: the probability of detecting a variant assumed to be causal. In comparing chips for GWA studies it has been common to ask what proportion of SNPs not directly genotyped are “captured” or “tagged” by the chip, i.e. are well predicted, via LD, by a SNP, or combination of SNPs, on the chip. To do so it is necessary to define the level of prediction required, or equivalently to set a threshold for the required level of correlation. Although arbitrary, this has often been set at 0.8 [6],[7],[8]. The resulting proportion of SNPs captured at this level is often referred to as the coverage of the chip. Having specified the threshold it is possible to estimate the coverage of a particular chip from HapMap data, although we note that some care is required to account for SNPs not in HapMap [8]. Here we focus instead on the power of particular chips to detect causal variants of different effect sizes, and the way in which this varies with study size and/or study cost and when using genotype imputation methods. Although coverage is straightforward to estimate, power is a complicated function of the set of SNPs on the chip, effect size, and sample size, and can only be assessed by simulation. It turns out that differences in coverage between chips are often not reflected in substantial differences in power and that the use of genotype imputation further reduces these differences. Study power is routinely used throughout science in experimental design and we argue that it should be the primary consideration in designing GWAs. This approach was used in settling several design questions in the Wellcome Trust Case Control Consortium [5]. Our results have been encapsulated in a user-friendly R package that allows the power of different chip and sample size combinations to be assessed given a total budget for the study. Knowledge of study power is also invaluable when analysing data from a study. Assessment of whether positive results at a particular significance level are “real” or due to chance requires knowledge of power [5], and the practical decision of how far down the list of potential associations one should go in replication studies should be informed by power considerations. Other comparisons of chips have been carried out but have either focussed exclusively on estimating coverage [8], have been limited in scope of which chips have been evaluated [9] or have used analytical calculations that do not properly take into account the complex LD structure of the human genome [10],[11] or failed to assess the impact of imputation correctly [11]. A recent paper [12] has used chip data to assess the performance of the chips but the small sample size (N = 359) means that these results cannot be used to assess power of new study designs of more realistic sizes. In addition, the simulations of quantitative phenotypes used the Signal to Noise Ratio (SNR) to measure effect size of the causal SNP which is non-standard and difficult to interpret. For binary traits, simulations assumed a disease prevalence of 25%, a relative risk of 3 and a sample size of only 75 cases and 75 controls. These parameter settings are not realistic for genome-wide association studies or useful when designing new studies. Results Theoretical Results Study power depends on assumptions about the underlying disease model, in addition to effect sizes and sample sizes. When the true causative SNP is not on the genotyping chip there will typically be several SNPs on the chip which are correlated with it. One or more of these could give a signal of significant association and hence allow detection of the locus. The LD structure of the human genome is sufficiently complicated that this effect cannot be captured analytically. It must be assessed via simulation studies. Nonetheless, there is one very simple situation for which analytical calculation is possible and helpful: that of the simplest disease model in which only a single SNP, correlated with the causal variant, is genotyped. For a design with the same number of cases and controls, under the disease model in which disease risk changes multiplicatively with the number of copies of the risk allele carried by an individual (this model is often referred to as the additive model because risk increases additively on the log scale), there is a known analytical relationship [13]: (1) where χ 2 is the chi-squared test statistic, the number of cases and controls, γ the effect size, p the allele frequency of the risk variant and γ 2 is the correlation between the marker and causal SNP. Although the real problem is much more complicated than this setting, Equation 1 does provide some useful intuition. Firstly, when the relative effect size is large ( ) the correlation between the marker and causal SNP may only need to be weak (r 2≪0.8) for the association to be detected (the expected test statistic is big). Equally, if the relative effect size is small ( ) then even strong or complete association (0.8 0.05. The top row shows power for case-control studies simulated in a Caucasian population based on the CEU HapMap panel. The bottom row relates to case-control studies simulated from the YRI HapMap panel. A second general feature of the power curves for Caucasian studies in Figure 2 is that aside from the Affymetrix 100 K chip (which is no longer available), there are not major differences in power across the other seven chips. For Caucasian samples the chips are typically ordered (with decreasing power): Illumina 1M, Illumina 650 k, Illumina 610 k, Affymetrix 6.0, Illumina 300 k, Affymetrix 500 k, but the absolute difference in power between the best and worst of these chips is often no more than around 10%. Put another way, for effect sizes in the range 1.3–1.5, a study with the Affymetrix 500 K chip would have the same power as one with the Illumina 1 M chip if its sample size were larger by 10–20%, with smaller increases in sample sizes giving studies with other chips the same power. Further, in Caucasian studies, power for all chips other than the Affymetrix 100 K chip is quite close to the best which could be obtained, namely by directly genotyping the causative SNP. Rare Alleles and Small Effect Sizes Equation 1 makes clear the dependence of power on the frequency of the risk allele. The results in Figure 2 are averaged over putative causative SNPs with a risk allele frequency (RAF) in the range 5–95%. Figure 3 shows that this hides quite different behaviour depending on whether the putative disease SNP is rare or common, and that the conclusions in the preceding subsection apply principally for common causative SNPs. The Figure shows a substantial difference in power for common and rare alleles with the same effect size and that power is minimal for the rare alleles when the effect size is small. These results refer to single-SNP analyses. While there are definitely more powerful analysis methods for rare alleles [14], this is not a major factor in the loss of power, and neither is the incomplete coverage of the SNPs on the commercially available chips: even using a sample size of 3000 cases and controls and genotyping the causal locus directly (black line) is unlikely to lead to a test statistic which will reach the small levels of significance thought appropriate for GWAS. 10.1371/journal.pgen.1000477.g003 Figure 3 Power for Common versus Rare alleles. Plots of power (solid lines) and coverage (dotted line) for increasing sample sizes of cases and controls (x-axis). From left to right plots are given for increasing effect sizes (relative risk per allele). Both power and coverage range from 0 to 1 and are given on the y-axis. Results are for single-marker test of association. The top two rows show the power for rare risk alleles (RAF 0.1). Rows 1 and 3 show power for case-control studies simulated in a Caucasian population based on the CEU HapMap panel. Rows 2 and 4 relate to case-control studies simulated from the YRI HapMap panel. There is an open question as to whether rarer causal alleles might have larger effect sizes than common causal alleles. If this were though plausible, then in assessing power overall for a particular chip, one could focus in Figure 3 on particular ranges of effect sizes for common causative alleles and a different range of effect sizes for rarer causative alleles. It is becoming clear that many loci harbouring common alleles affecting common diseases will have effect sizes in the range 1.1–1.2, and our simulations demonstrate that there is almost no power to detect these in studies of the size currently underway. As has already been shown empirically [19],[20] these loci can be found by meta-analyses and follow up in larger samples of GWA findings. Slightly larger relative risks do become detectable in large samples. For example the power to detect an effect of size 1.3 jumps from almost zero with 1000 cases and 1000 controls to over 50% in a study three times the size. Figure 3 also demonstrates that chip sets differ in the power they offer to detect associations at different frequencies. Most noticeably, when averaged over common alleles the Illumina 300 k chip set offers more power than Affymetrix 500 k. For rare alleles, the opposite is true with the Affymetrix 500 k chip having more power than the Illumina 610 k chip. This is most likely due to the way in which the Illumina SNP sets have been designed to specifically tag the common variation present in the HapMap panels. Power of Chips Compared to a ‘Complete Chip’ Immediately apparent is how close, for studies in Caucasian populations, the genotyping chips track the power afforded by the ideal “Complete chip” in a given study design and disease model. Figure 3 illustrates that the potential benefits of increasing SNP density on the chips or from using imputation [14] are greatest for low frequency SNPs. When focusing on common alleles, the potential benefits are greatest for the Affymetrix 100 k and 500 k chips and the Illumina 300 k chip and we show this when specifically consider imputation below (see Table 1). However, a clear consequence of these results is that for any of the chips in current use, increasing sample size is likely to have a bigger effect on power than increasing SNP density. 10.1371/journal.pgen.1000477.t001 Table 1 The table shows the power for each chip with a sample size of 2000 cases and 2000 controls and a relative risk at the causal SNP of 1.3 using a p-value threshold of 5×10−7. Chip Chip SNP Tests MultiMarker Tests IMPUTE Affy100 k 0.178 0.212 0.242 Affy500 k 0.363 0.378 0.450 Illu300 k 0.392 0.424 0.467 Illu610 k 0.439 0.455 0.488 Illu650 k 0.443 0.458 0.492 Affy6.0 0.420 0.433 0.478 Illu1M 0.457 0.461 0.493 Complete 0.499 0.499 0.499 Three different methods of analyzing the genotype data from each chip are shown: (a) testing just the SNPs on each chip, (b) using MultiMarker Tests in addition to the tests at each chip SNP, and (c) carrying out imputation using IMPUTE and testing all imputed SNPs in addition to those on each chip. The last line of the table shows the power that woud be obtained using the ‘Complete’ chip. Power versus Coverage A striking feature of Figures 2 and 3 is that substantial differences in coverage between different chips do not translate into big differences in power. Put another way, coverage is often a poor surrogate for power. As an example, the coverage in the CEU HapMap population (r 2≥0.8) provided by the Affymetrix 500 k and Illumina 610 k chips are 65% and 87% respectively, a difference of 22%. On the other hand, the difference in power e.g. for relative risk 1.5 and 1500 cases and controls, is only 7% (66% and 73% respectively). In one sense this shouldn't be surprising. Coverage is measured to a hard threshold: so if SNP has r 2 of 0.85 to its best proxy on one chip and 0.75 to its best proxy on another chip, it will be counted as “covered” by one chip but not by the other, whereas the difference in power is small. Coverage statistics also do not depend on study size or disease model. Figure 4 illustrates the differences in correlation structure for two chips. For each HapMap SNP we found it's best “tag” (the SNP on the chip with which it has the highest r 2) and generated a histogram of these maximized r 2 values. To recover coverage we simply count the proportion of SNPs for which the best tag r 2 is ≥0.8, coloured red in the bottom row of figure 4. In this sense, informally, it is useful to think of coverage as assuming that there is power one for every “tagged” SNP and no power for every other SNP. This is of course false, in ways which help to explain why coverage differences do not translate into power differences. When a SNP is common and the effect size is moderate or large, there will still be good power to detect it even if the best SNP on the chip only has r 2 = 0.5 or less. At the other extreme, for rare SNPs, unless the effect size is very large, power would be low even if the SNP had a perfect proxy on the chip. Thus even if these SNPs were well covered by one chip and completely missed by another they would not contribute to a difference in power between the chips because both chips would have power close to zero for them. The top row of Figure 4 shows the average power for SNPs in each LD bin. For the Affymetrix 500 K chip, there is a greater contribution to power from the sets of SNPs which are not well “covered”, than for the Illumina chip, and hance a smaller difference in power than in coverage. 10.1371/journal.pgen.1000477.g004 Figure 4 Histograms of the proportion of SNPs in the 22 1Mb regions (see Methods) in HapMap Phase II for which the maximum r2 with a SNP on the genotyping chip in in one of eleven bins (increasing in correlation (LD) from left to right). The same histograms are coloured in two ways. The top row shows in red the percentage of the SNPs in each bin detected (See Methods and text) when selected to be the causal SNP in our simulations (the proportion of the total volume of the bars coloured red is therefore an estimate of power). In the bottom row all r2 bins above 0.8 are coloured red (the proportion of the total volume of all the bars is therefore an estimate of coverage). Note that the use of HapMap data in choosing SNPs for the Illumina chip leads to a higher proportion of SNPs in high r2 bins. Case-Control Population For several reasons it is of interest to study the power of commercially available chips in different populations. Firstly the Illumina 100 k, 300 k and 610 k chips are aimed at capturing variation in the CEU population, whereas the Affymetrix 500 k chip is not designed with a specific population in mind. Furthermore the Illimina 650 k chip has a subset of SNPs targeted at capturing variation in the HapMap YRI (Yoruba, Africa) population. LD will not extend as far in the YRI collection [1] as in the CEU, reducing the coverage of a given set of SNPs. Figures 2 and 3 show the results of power calculation using the distribution of diversity in both the HapMap CEU and YRI populations. The results show that the increased ancestral recombination leads to a loss of power and coverage across all chips for a range of study designs. The difference between the power available from commercial genotyping chips and that achievable by exhaustively assaying all SNPs shows that increasing marker density may yield a better return than a similar approach in non-African populations. The Illumina 650 k chip, with the YRI fill-in illustrates these potential benefits, showing a marked increase in power over the 610 k. However the performance of the Illumina 300 k chip, designed using the CEU HapMap, falls below the Affymetrix 500 k when genetic diversity is modelled on the YRI HapMap panel. It is not yet clear how closely patterns of diversity and LD in other African populations mimic those in the Yoruba, and hence to what extent the power results will translate to studies in other populations. One general point is that the Illumina 650 k chip was designed specifically to capture common Yoruban variation, so one might expect power for this chip to decrease in other African populations, for which it is not specifically designed. On the other hand, the Affymetrix 500 k chip was not designed using this data, so there would be not a systematic effect changing power estimates for other African populations. As a consequence, differences in power between the Illumina 650 k chip and Affymetrix chips may well be smaller in other African populations. The Gain from Using Multi Marker Methods and Genotype Imputation Multi-marker methods, which use combinations of SNPs, have been suggested as an efficient way to increase both coverage and power [15]. Figures S6 and S7 show the results of simulations that implement the multi-marker tests. In these figures the dotted lines, which represent coverage, are higher for all chips in comparison to single marker approaches (Figures 2 and 3) consistent with previous observations. We find that multi-marker approaches also increase statistical power to detect disease loci, but that the increase is modest relative to coverage, and the broad conclusions above are not much affected. Interestingly, when comparing across genotyping platforms, we find for example that the Affymetrix 500 k chip gains more by combining SNPs than the Illumina 300 k chip. Genotype imputation methods [21],[14] are now being widely used in the analysis of genome-wide association studies [5] and meta-analysis of such studies [22],[20]. These methods can be thought of as a more sophisticated version of Multi Marker tests but are relatively much more computationally demanding. We carried out an evaluation of the boost in power that can be gained by imputation using the program IMPUTE [14]. For our simulations with a sample size of 2000 cases and 2000 controls and a relative of the causal SNP or 1.3 we ran IMPUTE on the genotype data from each of the chips under study using the CEU HapMap as the basis for imputation. We then carried out a test of association at all the imputed SNPs in addition to the SNPs on each chip. We used our program SNPTEST to carry out tests of association at imputed SNPs to properly account for the uncertainty that can occur at such SNPs[14]. The results of the simulations are shown in Table 1 and shows that the use of IMPUTE provides a noticeable boost in power over testing just the SNPs on each chip or using Multi Marker tests (as defined in [15]). This agrees with our previous results [14]. It is also very noticeable that imputation reduces the differences in power between the chips and that the use of imputation produces a level of power that is almost as high as our hypothetical ‘complete’ chip. We also note that the boost in power is more substantial than that estimated in another recent study [11]. A close look at the details of this other study shows that the only imputed SNPs used were those (a) which had real genotype data from one of the other chips, and (b) the imputed and real data at the SNP agreed with an r 2>0.8. So for example, for the Affy 500 k chip only genotypes at 427,838 imputed SNPs were used, rather than all those available from HapMap (approximately 2.5 milion SNPs), as normal practice when carrying out imputation. Using such a filter clearly creates a bias towards imputed SNPs that are almost perfect tags for SNPs on the chip so it is not surprising that this study shows such small increases in power when using imputation. Unequal Case Control Sample Sizes One option open to researchers who would like to increase power in the context of limited case series is just to increase the control collection. This strategy might include using cases for one disease as extra controls for another (assuming suitably different disease aetiologies and similar population history). We investigated the utility of such an approach by performing simulations with 1000 cases and an increasing number of control (Figure S8). Although the gains are not as strong as increasing both the case and control sample sizes (Figure 2), the ability to reject the null hypothesis of no association increase considerably with the size of the control panel. For example, adding an extra 2000 controls to a case-control study with sample size 1000–1000 increases power to detect an effect of 1.5 typically by 20%. Subject to care in their use, the growing availability of genotyped sets of controls promises to make this a possibility worth investigating for many studies. Designing a New Study The results of our simulations can be used to assess the power of a range of possible designs for a given budget and have been encapsulated in a user friendly R package for this purpose (see Software section). Table 2 shows the study size and power that can be achieved on a budget of $2,000,000 for each of the chips assuming the disease causing allele of has a relative risk of 1.5, a risk allele frequency of at least 0.05 and that a p-value threshold of 5×10−7 is used to define power. Since the different chips vary in their prices and their per sample processing costs we obtained quotes from service providers for the various chips and averaged them (see Text S1). The prices were based on quotes for 4000 chips and quotes were converted to US dollars using current exchange rates where necessary. We obtained 5 different quotes for the Affymetrix chips and 6 different quotes for the Illumina chips. 10.1371/journal.pgen.1000477.t002 Table 2 The table shows the power that can be achieved by each chip with a total budget of $2,000,000. Chip Average Price ($) Number of cases/controls Power Affy500 k 420 2381 0.767 Illu300 k 377 2653 0.821 Illu610 k 452 2212 0.818 Affy6.0 505 1980 0.772 Illu1M 750 1257 0.635 Complete - 2653 0.881 These results were calculated assuming a disease causing allele with a relative risk of 1.5, a minor allele frequency of at least 0.05, that a p-value threshold of 5×10−7 is used to define power and that the study should consist of an equal number of cases and controls. The second column shows the prices that we were able to obtain for these products at the time of submission. The last line of the table shows the power that woud be obtained using the ‘Complete’ chip using the sample size equal to that of the most powerful design. The results show that in this scenario the Illumina 300 k chip produces the most powerful design (82.1%) primarily due to its relatively cheap price compared to the other chips. Using the same sample size (2653 cases and controls) the ‘Complete’ chip has a power of 88.1%. It is also notable that the power of thie Illumina 300 k chip is nearly 17% greater than the power that can be achieved by the Illumina1 M chip (63.5%) which has approximately 3 times the SNP density. These result further illustrate the deficiencies in using coverage as a measure of chip performance as sample size is not factored into the calculation. Although these results are interesting we advise against using them directly in the design of a new study. There were noticeable variations in the quotes we obtained from the service providers and prices are likely to change through time. We encourage new studies to re-calculate power of various designs based on a set of up to date and competitive prices and to take into account the general effect that genotype imputation can have on these power estimates. Discussion Because of the complexity of human LD patterns, many questions of interest cannot be addressed analytically. We have described in detail our simulation method, HAPGEN, for generating large samples of case and control data at every HapMap SNP, which mimic the patterns of diversity and LD present in the HapMap data. The software can simulate case data under a single causal disease SNP model for specified genotypic relative risks. We have used the method here to assess the power of various commercially available genotyping chips for case-control genome-wide association studies, but note that it could be utilised to assess other design questions, in the evaluation of analytical methods, and in considering follow-on studies such as resequencing and fine-mapping. In Caucasian populations the differences in power afforded by current-generation genotyping chips are not large, and the power of these chips is close to that of an optimal chip which always directly genotyped the causal SNP. Listed in order of decreasing power for the CEU population, averaged over all potential disease SNPs with RAF ≥5%, the chips we considered were: Illumina 1M, Illumina 650 k, Illumina 610 k, Affymetrix 6.0, Illumina 300 k, Affymetrix 500 k and Affymetrix 100 k. In line with our previous work we have shown that imputation can boost the power of each chip substantially and that the resulting power will approach that which could be obtained by a hypothetical ‘complete’ chip that types all the SNPs in HapMap. One limitation of the approach we (and others [9],[10],[12],[11]) have used is that the causal SNP is assumed to be one of those SNPs in the HapMap panel and this will not always be true. Other studies [1] have shown that the majority of SNPs not in HapMap will be highly correlated with the SNPs that are in HapMap and this is especially true for the more common SNPs. This means there is a slight bias in our power results for each chip and for the use of imputation but we do not expect it to be large. A consequence of this point is that the power we estimate for the ‘complete’ chip approximates the power we might obtain if we had a chip which typed all the SNPs that exist in the human genome. A main conclusion from our analysis is that study size is a crucial determinant of the power to detect a causal variant. Increasing study size typically has a larger effect on power than increasing the number or coverage of SNPs on the chip, at least amongst chips currently available. Even for effect sizes at the larger end of those estimated to date for common human diseases (RRs of 1.3–1.5) quite large sample sizes, at least 2000 cases and 2000 controls and ideally more, are needed to give good power to detect the causal variant. When case numbers are limited, there are still non-trivial gains in power available from increasing just the number of controls. Care is needed in assessing the appropriateness of a set of controls, but as larger sets of control genotypes are made publicly available this strategy has considerable appeal, whatever the number of available cases. SNPs with smaller effect sizes are unlikely to be detected even in studies of the sizes currently undertaken, but as has been shown empirically for several diseases, these can be found by meta-analyses which combine different GWAs, or by follow-up in large samples of SNPs which look promising in the original GWA but fail to meet the low levels of significance thought appropriate for GWAS. When the causal SNP is rare (MAF 0.8 with at least one SNP on the chip. Although relatively simple to calculate (and even simpler to miscalculate), not least because it does not depend on study size, our results show that coverage can be a poor surrogate for power, and that relatively large differences between chips in coverage do not translate to large differences in power. The sets of SNPs on Illumina chips are chosen in part to maximize particular criteria, such as coverage, for certain populations, typically those in HapMap. One difficulty of analyses such as those in this paper is that these resources are also the natural ones with which to assess properties of the chips. Thus when Illumina chips “tuned” to one population (say the 610 K chip for CEU) are used in other populations, power might be systematically lower than the levels assessed here. In contrast, SNP sets of Affymetrix chips are chosen largely in a non-population specific way. While power is likely to vary in populations other than those we have considered here, there is not the same systematic effect which would lead to a decrease in power. A quantitative assessment of this phenomena will be possible when dense genotype data is available for other populations, such HapMap Phase 3. We have assumed here that accurate genotypes are available for all SNPs on each chip. In practice some SNPs on each chip will fail QC tests and not be available for analyses. As a consequence, our study will overestimate power, though this effect is unlikely to be large. We are only able to use SNPs in HapMap as potential disease SNPs. These may not be systematically representative of all potential disease SNPs. HapMap SNPs have systematically higher MAFs than do arbitrary SNPs [2], but for SNPs within a particular range of MAF, it seems unlikely that their LD properties will differ systematically, so, for example, we would expect our results for common SNPs to extend beyond those in HapMap. We have focussed on the most common GWA design, namely of a single-stage study, and the simplest disease model. The flexibility of the simulation approach allows many other practical aspects of study design to be incorporated into power calculations. These include more complex disease models, two-stage strategies (the starting point for our work was a comparison of power for one- and two-stage designs in the context of the WTCCC study [5]), genotyping errors, QC filters, misidentification of cases as controls and simple types of population structure. The HAPGEN software also provides a useful tool for the development and comparison of more sophisticated multi-marker approaches to detecting disease association (e.g. imputation [14]). We therefore believe that simulations are an essential tool in the design of association studies by allowing a focus on study power and an assessment of the affect on power of following a given study design. We hope that this method will continue to find use and can be extended to new catalogs of genetic variation such as the 1000 Genomes Project http://www.1000genomes.org/. As in other areas of science, power seems a central consideration in study design and choice of genotyping chip. But other issues may also play a role. These include coverage of particular genes, or genomic regions of interest; the utility of GWA data for directing downstream studies such as resequencing and fine mapping; data quality for particular chips; and the extent to which a chip reliably assays other forms of genetic variation such as copy number polymorphisms. Adding data to existing studies is straightforward if the same chip is used, but the success of imputation methods, in particular in meta-analyses [19],[20] means that this is not essential. In general, Affymetrix chips have more redundancy than do Illumina chips, in the sense of containing sets of SNPs which are correlated with each other. The immediate consequence of this is lower coverage and lower power for the same number of SNPs, but there can be advantages to this redundancy: loss of a particular SNP to QC filters may not be as costly; and signals of association are likely to include more SNPs, thus making them easier to distinguish from genotyping artefacts. Ultimately power can only be calculated under an alternative model. Thus on a practical level the optimal choice of assays and sample sizes will actually depend on the researcher's belief regarding the unknown distribution of effect sizes and models relating genotype and phenotype. In particular we show that one might adopt different strategies depending on the expected frequency of disease causing variant, the effect size and even the population from which cases and controls are sampled (Figure 3). In the continuing search to better understand the genetic basis of common human diseases, numerous study designs can be adopted which may involve combining data sets, imputing missing SNPs [14], distilling signals of association over multiple experimental stages, and so on. In this complex setting study power will remain a central criterion in study design, and the kinds of approaches developed here will continue to allow informed decision making by experimenters. Methods HAPGEN We adopt the model introduced by [23] (denoted LS from now on), who described a new model for linkage disequilibrium, which enjoys many of the advantages of coalescent-based methods (e.g. it directly relates LD patterns to the underlying recombination rate) while remaining computationally tractable for huge genomic regions, up to entire chromosomes. Their model relates the distribution of sampled haplotypes to the underlying recombination rate, by exploiting the identity (2) where h 1 ,…,hn denote the n sampled haplotypes, and ρ denotes the recombination parameter (which may be a vector of parameters if the recombination rate is allowed to vary along the region). This identity expresses the unknown probability distribution on the left as a product of conditional distributions on the right. LS substitute an approximation for these conditional distributions into the right hand side of (3), to obtain an approximation to the distribution of the haplotypes h given ρ (3) If h 1 ,…,hn are n sampled haplotypes typed at S bi-allelic loci (SNPs) LS modelled the distribution of the first haplotype as independent of ρ, i.e. all 2 S possible haplotypes are equally likely, so . For the conditional distribution of h k+1 given h 1 ,…,hk , LS modelled h k+1 as an imperfect mosaic of h 1 ,…,hk through the use of a Hidden Markov Model (HMM). That is, at each SNP, h k+1 is a (possibly imperfect) copy of one of h 1 ,…,hk at that position where where the transition rates between the hidden copying states are parameterized in terms of the underlying recombination rate. The transition rates are different for each of the conditional distributions in such a way so as to mimic the property that as we condition on an increasingly larger number of haplotypes we expect to see fewer novel recombinant haplotypes. A parameterisation for the mutation rate (or emission probabilities of the HMM) is used that has similar properties (see [23] for more details). The simulation of a new set of haplotypes for control and case individuals is proceeds using the following algorithm. 1. Pick a locus from the set of markers in the real dataset as the disease locus. The disease locus is chosen at random from all those loci with a minor allele frequency (MAF) within some specified range [l,u]. We use to denote the disease locus, a and A to denote the major and minor alleles at the disease locus and use p denote the sample minor allele frequency at this locus. 2. For a given disease model simulate the alleles at the disease locus of the new individual conditional upon case-control status. At the disease locus we use a general genotype model in which the frequencies of the genotypes aa, Aa and AA in control individuals are given by (1−p)2, 2p(1−p) and p2 respectively. This assumes that the control individuals are so-called population controls (as used by the WTCCC study [5]) rather than individuals who have been selected to specifically not have the disease. For case individuals the genotype frequencies are determined by specification of the two relative risks (4) where denotes the probability that an individual is a case conditional upon having genotype g. Under this model (5) where γ = (1−p)2+2αp(1−p)+βp2 . As an example, if p = 0.1, α = 2 and β = 4 the control and case genotype frequencies are (0.81, 0.18, 0.01) and (0.67, 0.30, 0.03) respectively. Assuming we have a set of k known haplotypes, the generation of a case (control) starts by simulating a genotype g using the case (control) genotype frequencies. This simulated genotype specifies the alleles on the the two haplotypes of the new individual at the disease locus. For example, if g = Aa then h k+1,d = 1 and h k+2,d = 0. 3. This step involves the simulation of two new haplotypes for the individual conditional upon the alleles simulated at the disease locus in Step 2 and conditional upon the fine-scale recombination map across the region. This involves simulating the rest of hk+1 and hk+2 . We only describe the generation of sites right flanking of the disease locus as the generation of the left flanking markers is virtually identical. Also the simulation of h k+2 follows directly from our description of how the rest of h k+1 is simulated. Let Xj be the hidden state of the HMM that denotes which haplotype h k+1 copies at site j (so that ). This state variable is initialized at the disease locus as follows (6) The value of , as with LS, is Watterson's point estimate (Watterson, 1975) (7) Simulation of the hidden state of the HMM then proceeds using the following transition rule (8) where zj is the physical distance between markers j and j+1 (assumed known); and , where is the effective (diploid) population size, and cj is the average rate of crossover per unit physical distance, per meiosis, between sites j and j+1 (so that cjzj is the genetic distance between sites j and j+1). This transition matrix captures the idea that, if sites j and j+1 are a small genetic distance apart (i.e. cjzj is small) then they are highly likely to copy the same chromosome (i.e. Xj +1 = Xj ). To mimic the effects of mutation the copying process may be imperfect: with probability k/(k+θ) the copy is exact, while with probability θ/(k+θ) a mutation will be applied to the copied haplotype. Specifically, 4. Return to step 2 to generate another individual or terminate. Illustrations of the HAPGEN method in practice and details of the testing the method against coalecent simulations are given in Text S1. Details of SNP Sets Used in the Study We used release 21 of the HapMap data for which phased haplotypes are available in NCBI b35 coordinates. The SNPs that occur on each genotyping chip were obtained from the websites of Affymetrix and Illumina respectively. Some of the SNPs in these sets do not occur in the HapMap phased haplotype data due to QC measures applied to the raw genotype data. For the Affymetrix 6.0 and Illumina 1 M chips 90.8% and 88.1% of the SNPs on these chips respectively are in this release in HapMap. This will have the effect of making our estimates of power slight underestimates of the true power. We simulated data for twenty-two one megabase regions chosen at random, one from each autosome. To ensure that the regions used to approximate genome-wide power were representative of the genome at large we their SNP density. Figure S5 plots the distribution of inter-SNP distances within the 22 analysis regions and across the whole genome for three of the genotyping chips analyzed. The close match between the distribution, both on the physical scale and in terms of genetic distance suggests that our results are insensitive to the regions we chose to simulate, and can be used to make comparisons of genotyping chips genome-wide. Calculating Coverage and Testing for Association We used data from the HapMap project Phase II to estimate coverage. Single marker coverage was defined to be the proportion of all variation (with minor allele frequency greater than 5%) in r 2 with a SNP on the genotyping chip above 0.8. Using this definition we achieved very similar estimates to previous studies which used the whole genome (we use twenty two representative megabases). Multi-marker coverage was calculated by an aggressive search of all 2-SNP and 3 SNP haplotypes within 250kb of the SNP being tagged [6]. The SNP was tagged if any of these multi-marker tags had r2 above 0.8, the rule defining the haplotype was also stored and added to the list of multi-marker tests. Single marker tests (Cochran-Armitage test) were performed at each SNP on the genotyping chip where information were simulated from the relevant HapMap panel. Multi-marker tests of association were performed in an identical fashion with the marker being formed by the multi-marker haplotypes known to tag HapMap variation. To avoid over estimation of power, multi-marker tags chosen to tag the current putative disease SNP in the simulations were excluded from the test set. Tests at imputed SNPs took account of the uncertainty in genotypes through a missing data likelihood as described in [14]. Software The HAPGEN software is freely available for academic use from the website http://www.stats.ox.ac.uk/˜marchini/software/gwas/gwas.html. In addition, the results of the power calculations for the 7 commercially available genotyping chips have been included in an R package called GWASpower available from http://www.stats.ox.ac.uk/˜marchini/#software. This package allows the user to determine the most powerful study design for a given budget. As new commercial genotyping chips become available we will update the package to include results of new chips. The package works by fitting a Generalised Linear Model to the results of the simulation study and using the model fit to predict the power for a given number of cases and controls. Supporting Information Figure S1 Top plot : Linkage disequilibrium plots across the region : D′ (top left), r2 (bottom right). Bottom plot : Fine scale recombination map ρ across the region. (0.21 MB TIF) Click here for additional data file. Figure S2 For simulated dataset A (α = 1.3, β = 1.69) the top plot shows D′ and r2 LD measures. The bottom plot shows the χ2 statistic for association across the region. The vertical blue line shows the location of the disease locus. (0.28 MB TIF) Click here for additional data file. Figure S3 For simulated dataset B (α = 1.5, β = 2.25) the top plot shows D′ and r2 LD measures. The bottom plot shows the χ2 statistic for association across the region. The vertical blue line shows the location of the disease locus. (0.29 MB TIF) Click here for additional data file. Figure S4 For simulated dataset (α = 1.7, β = 2.89) the top plot shows D′ and r2 LD measures. The bottom plot shows the χ2 statistic for association across the region. The vertical blue line shows the location of the disease locus. (0.30 MB TIF) Click here for additional data file. Figure S5 Representativeness of the 22 1Mb regions used in the simulation study. Bar plots are shown of the proportion of SNPs which fall into increasing inter-SNP distances for three of the genotyping chips used in this study. These distribution are measured on the physical scale (left column) and genetic map (right column). (1.37 MB EPS) Click here for additional data file. Figure S6 Power of Multi-marker tests. Plots of power (solid lines) and coverage (dotted line) for increasing sample sizes of cases and controls (x-axis). From left to right plots are given for increasing effect sizes (relative risk per allele). Both power and coverage range from 0 to 1 and are given on the y-axis. The results are based on simulations where the risk allele frequency of the causal allele is >0.05. The top row shows power for case-control studies simulated in a Caucasian population based on the CEU HapMap panel. The bottom row relates to case-control studies simulated from the YRI HapMap panel. (1.71 MB EPS) Click here for additional data file. Figure S7 Power of Multi-marker tests for common versus rare alleles. Plots of power (solid lines) and coverage (dotted line) for increasing sample sizes of cases and controls (x-axis). From left to right plots are given for increasing effect sizes (relative risk per allele). Both power and coverage range from 0 to 1 and are given on the y-axis. Results are for single-marker test of association. The top two rows show the power for rare risk alleles (RAF 0.1). Rows 1 and 3 show power for case-control studies simulated in a Caucasian population based on the CEU HapMap panel. Rows 2 and 4 relate to case-control studies simulated from the YRI HapMap panel. (2.90 MB EPS) Click here for additional data file. Figure S8 Plots of power (solid lines) and coverage (dotted line) for increasing sample sizes of controls (x-axis). The number of case individuals is fixed at 1000. From left to right plots are given for increasing effect sizes (relative risk per allele). Both power and coverage range from 0 to 1 and are given on the y-axis. Results are for single-marker test of association and for simulations where the minor allele frequency of the causal allele is >0.05. The top row shows power for case-control studies simulated in a Caucasian population based on the CEU HapMap panel. The bottom row relates to case-control studies simulated from the YRI HapMap panel. (1.73 MB EPS) Click here for additional data file. Text S1 Supplementary text associated with the main article. (0.06 MB PDF) Click here for additional data file.

0 comments Cited 200 times – based on 0 reviews      Review now

Bookmark

All references

Author and article information

Contributors

Wei-Qi Wei: wei-qi.wei@Vanderbilt.Edu

Joshua C Denny: josh.denny@vanderbilt.edu

Journal

Journal ID (nlm-ta): Genome Med

Journal ID (iso-abbrev): Genome Med

Title: Genome Medicine

Publisher: BioMed Central (London )

ISSN (Electronic): 1756-994X

Publication date (Electronic): 30 April 2015

Publication date PMC-release: 30 April 2015

Publication date Collection: 2015

Volume: 7

Issue: 1

Electronic Location Identifier: 41

Affiliations

[ ]Department of Biomedical Informatics, Vanderbilt University, Nashville, TN 37203 USA

[ ]Department of Medicine, Vanderbilt University, Nashville, TN 37203 USA

Article

Publisher ID: 166

DOI: 10.1186/s13073-015-0166-y

PMC ID: 4416392

PubMed ID: 25937834

SO-VID: da20a965-8fb5-46a9-b8a7-a9f32e1d6adb

License:

This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

History

Custom metadata

ScienceOpen disciplines: Molecular medicine

Data availability:

ScienceOpen disciplines: Molecular medicine

Comments

Comment on this article

scite_

Cited by 79

See all cited by

Most referenced authors 3,657

See all reference authors

Extracting research-quality phenotypes from electronic health records to support precision medicine

Read this article at

Abstract

Electronic supplementary material

Related collections

Network and Systems Medicine

Most cited references 106

Measuring diagnoses: ICD code accuracy.

Serving the enterprise and beyond with informatics for integrating biology and the bedside (i2b2).

Designing Genome-Wide Association Studies: Sample Size, Power, Imputation, and the Choice of Genotyping Chip

Author and article information

Contributors

Journal

Affiliations

Article

History

Categories

Custom metadata

Comments

Comment on this article

Similar content 169

Cited by 79

Most referenced authors 3,657