
      Empirical Bayesian LASSO-logistic regression for multiple binary trait locus mapping

      research-article


          Abstract

          Background

Complex binary traits are influenced by many factors, including the main effects of many quantitative trait loci (QTLs), epistatic effects involving more than one QTL, environmental effects and the effects of gene-environment interactions. Although a number of QTL mapping methods for binary traits have been developed, an efficient and powerful method that can handle both main and epistatic effects of a relatively large number of possible QTLs is still lacking.

          Results

In this paper, we use a Bayesian logistic regression model as the QTL model for binary traits that includes both main and epistatic effects. Our logistic regression model employs hierarchical priors for the regression coefficients similar to those used in the Bayesian LASSO linear model for multiple QTL mapping of continuous traits. We develop efficient empirical Bayesian algorithms to infer the logistic regression model. Our simulation study shows that our algorithms can easily handle a QTL model with a large number of main and epistatic effects on a personal computer, and that they outperform the five other methods examined (the LASSO, HyperLasso, BhGLM, RVM and the single-QTL mapping method based on logistic regression) in terms of power of detection and false positive rate. The utility of our algorithms is also demonstrated through analysis of a real data set. A software package implementing the empirical Bayesian algorithms in this paper is freely available upon request.

          Conclusions

The EBLASSO logistic regression method can handle a large number of effects, possibly including main and epistatic QTL effects, environmental effects and the effects of gene-environment interactions. It will be a very useful tool for multiple QTL mapping of complex binary traits.
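The paper's EBLASSO software is available only on request, so it is not shown here. As a rough, hypothetical illustration of the kind of model being fitted, the sketch below builds main-effect and pairwise epistatic covariates from simulated markers and fits a plain L1-penalized (LASSO) logistic regression with scikit-learn; the hierarchical priors and empirical Bayes estimation of the actual method are not implemented, and all names, sizes and data are invented.

```python
# A minimal sketch, assuming simulated -1/1 marker codes and an ordinary L1
# penalty in place of the paper's empirical Bayesian hierarchical priors.
import numpy as np
from itertools import combinations
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_individuals, n_markers = 300, 40

# Simulated marker genotypes coded -1/1 (e.g., a backcross-style design).
X_main = rng.choice([-1.0, 1.0], size=(n_individuals, n_markers))

# Pairwise epistatic terms: products of marker pairs.
pairs = list(combinations(range(n_markers), 2))
X_epi = np.column_stack([X_main[:, i] * X_main[:, j] for i, j in pairs])
X = np.hstack([X_main, X_epi])

# Simulate a binary trait driven by two main effects and one epistatic effect.
logit = 1.2 * X_main[:, 3] - 1.0 * X_main[:, 10] + 1.5 * X_main[:, 3] * X_main[:, 10]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

# L1-penalized logistic regression; C controls the penalty strength.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
nonzero = np.flatnonzero(model.coef_.ravel())
print(f"{len(nonzero)} of {X.shape[1]} candidate effects selected")
```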

          Related collections

          Most cited references (31)


          A simple regression method for mapping quantitative trait loci in line crosses using flanking markers.

The use of flanking marker methods has proved to be a powerful tool for the mapping of quantitative trait loci (QTL) in the segregating generations derived from crosses between inbred lines. Methods to analyse these data, based on maximum-likelihood, have been developed and provide good estimates of QTL effects in some situations. Maximum-likelihood methods are, however, relatively complex and can be computationally slow. In this paper we develop methods for mapping QTL based on multiple regression which can be applied using any general statistical package. We use the example of mapping in an F(2) population and show that these regression methods produce very similar results to those obtained using maximum likelihood. The relative simplicity of the regression methods means that models with more than a single QTL can be explored and we give examples of two linked loci and of two interacting loci. Other models, for example with more than two QTL, with environmental fixed effects, with between family variance or for threshold traits, could be fitted in a similar way. The ease, speed of application and generality of regression methods for flanking marker analysis, and the good estimates they obtain, suggest that they should provide the method of choice for the analysis of QTL mapping data from inbred line crosses.
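As a loose illustration of the regression idea in this reference (not the authors' flanking-marker implementation), the sketch below fits a multiple linear regression of a simulated continuous trait on F2 genotype codes by ordinary least squares; the expected genotype probabilities from flanking markers are not computed, and the data are invented.

```python
# Illustrative only: phenotype ~ intercept + all markers, fitted with a
# general-purpose least-squares routine, assuming direct 0/1/2 genotype codes.
import numpy as np

rng = np.random.default_rng(1)
n, m = 200, 10
G = rng.choice([0, 1, 2], size=(n, m)).astype(float)   # F2 genotype codes
y = 0.8 * G[:, 4] + rng.normal(scale=1.0, size=n)       # one simulated QTL

X = np.column_stack([np.ones(n), G])                     # add intercept column
beta, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print("estimated marker effects:", np.round(beta[1:], 2))
```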

            Genome-wide association analysis by lasso penalized logistic regression.

In ordinary regression, imposition of a lasso penalty makes continuous model selection straightforward. Lasso penalized regression is particularly advantageous when the number of predictors far exceeds the number of observations. The present article evaluates the performance of lasso penalized logistic regression in case-control disease gene mapping with a large number of SNP (single nucleotide polymorphism) predictors. The strength of the lasso penalty can be tuned to select a predetermined number of the most relevant SNPs and other predictors. For a given value of the tuning constant, the penalized likelihood is quickly maximized by cyclic coordinate ascent. Once the most potent marginal predictors are identified, their two-way and higher order interactions can also be examined by lasso penalized logistic regression. This strategy is tested on both simulated and real data. Our findings on coeliac disease replicate the previous SNP results and shed light on possible interactions among the SNPs. The software discussed is available in Mendel 9.0 at the UCLA Human Genetics web site. Supplementary data are available at Bioinformatics online.
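The reference's software is Mendel 9.0; the sketch below is only a generic stand-in for the tuning strategy it describes, loosening the lasso penalty of a logistic regression until roughly a preset number of SNP predictors remain, with scikit-learn's liblinear solver in place of the cyclic coordinate ascent mentioned above. The sample sizes, target count and data are invented.

```python
# A rough sketch, assuming 0/1/2 allele-count predictors and a made-up target
# of 10 selected SNPs.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n_cases, n_snps, target = 400, 2000, 10
X = rng.binomial(2, 0.3, size=(n_cases, n_snps)).astype(float)  # allele counts
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X[:, 5] - X[:, 50]))))

selected = None
for C in np.logspace(-3, 0, 20):           # small C = strong penalty, few SNPs
    fit = LogisticRegression(penalty="l1", solver="liblinear", C=C).fit(X, y)
    if np.count_nonzero(fit.coef_) >= target:
        selected = np.flatnonzero(fit.coef_.ravel())
        break
print("selected SNP indices:", selected)
```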

              Simultaneous Analysis of All SNPs in Genome-Wide and Re-Sequencing Association Studies

Introduction

The ideal analysis of a genome-wide association (GWA) study for a complex disease would involve analysing all the SNP genotypes simultaneously to find a set of SNPs most associated with disease risk. Such an analysis can improve performance over single-SNP tests, since a weak effect may be more apparent when other causal effects are already accounted for, but also because a false signal may be weakened by inclusion in the model of a stronger signal from a true causal association. To date, analysing all SNPs simultaneously has seemed infeasible, since current GWA platforms can type over one million SNPs, and even larger variable sets may not be far away as genome-wide re-sequencing advances. We exploit recent advances in stochastic search algorithms [1],[2] to develop a computationally efficient tool to simultaneously analyse k SNPs typed in n individuals for association with case-control status, where k >> n. We formulate the problem as variable selection in a logistic regression analysis that includes a covariate for each SNP. Our aim is to find a subset of SNPs (a "model") that best explains the case-control status subject to a specified error rate. The number of possible models is 2^k and, since k is typically of the order of 10^6, classical methods such as forward-backward variable selection are computationally expensive and are liable to find sub-optimal modes [3]. Bayesian stochastic search methods have been used to tackle variable selection problems, typically using the "slab and spike" prior formulation [4]. Inference can be made from these models using Markov chain Monte Carlo (MCMC) [5],[6], and this methodology has been extended to the case of more variables than observations [7],[8]. Similar methods have been proposed for the analysis of SNP data [9],[10], which again utilised MCMC. However, despite design of the MCMC algorithms to minimise computational time, these methods are too slow to deal with the size of problem presented by modern SNP chips. Furthermore, these methods have dealt with the computationally easier problem of a continuous outcome.

We assign continuous prior distributions with a sharp mode at zero, often referred to as "shrinkage" priors, to the regression coefficients. Our approach is Bayesian-inspired rather than fully Bayesian, since we seek only the posterior mode(s) rather than the full posterior distribution of the regression coefficients. If the signal of association at a SNP is weak or non-existent, the posterior mode for the corresponding covariate will remain at zero. By using continuous shrinkage priors the resulting posterior density is continuous and can thus be maximised using standard algorithms. Our stochastic search maximisation algorithm seeks the (small) subset of SNPs for which the posterior mode is non-zero, corresponding to a signal of association that is strong enough to overcome the prior preference for zero effect. The algorithm can be set to include only additive effects, or it can also consider dominant and recessive terms; only one of these terms is permitted to be non-zero. We consider two shrinkage prior distributions: the Laplace, or double exponential, distribution (DE) and a generalisation of it, the normal exponential gamma distribution (NEG), which has a sharper peak at zero and heavier tails (Figure 1). The sharp peak of the NEG at zero favours sparse solutions, which is preferable for variable selection when we believe that there are few true causal variables. Further, the heavy tails result in variables being minimally shrunk once included in the model. The NEG is characterised by a shape and a scale parameter. The smaller the shape parameter, the heavier the tails of the distribution and the more peaked it is at zero; conversely, as the shape parameter increases the NEG approaches the DE. For both prior distributions we obtain an explicit expression for the approximate type-I error of our method, so that it can be calibrated without recourse to permutation techniques.

Figure 1. Logarithms of the NEG and DE densities, fixed to have the same density at the origin.

[…] SNPs showing r² > 0.005 with a causal variant are in some sense true associations. However, due to the variable pattern of LD in the human genome, SNPs showing r² […] 0.005 […] 0.05 with any causal SNP […]. Furthermore, a cluster of tightly linked false positives might be considered as essentially just one false positive, and in Table 1 false positives were only counted if they were further than a specified distance (between 0 Kb and 100 Kb) away from any other false positive that had already been recorded.

Table 1. Main simulation study: results summed over the 500 datasets, each with 6 causal variants; a causal variant is "tagged" if ≥1 selected SNP has r² > 0.05 with it.

  Method | SNPs selected | Causal SNPs tagged | False positives at minimum separation (Kb): 0 / 20 / 40 / 100
  NEG    | 2097          | 1576               | 368 / 368 / 368 / 366
  DE     | 2622          | 1501               | 297 / 277 / 276 / 271
  ATT    | 6810          | 1554               | 696 / 536 / 486 / 441

A feature of both the NEG and DE results is the reduction in the number of false positives relative to the ATT, despite the fact that the type-I error rate was set to be the same for all analyses (Table 1). This reflects one of the principal effects of analysing SNPs simultaneously: the signal at a SNP that shows spurious association when analysed singly is often weakened by inclusion in the model of true positives, which may or may not be tightly linked with it. If several SNPs are mutually in high LD, typically at most one of them will be included in the model. Thus the NEG analysis picked 2,097 SNPs over the 500 data sets, many fewer than the 6,810 SNPs selected using the ATT (Table 1). If a causal SNP is detected by the NEG method then typically (87% of the time) it will be tagged by just one selected SNP (Figure 2). In contrast, ATT often picks multiple SNPs for each causal variant and picks one SNP just 31% of the time. The higher number of false positives for ATT can be attributed in part to the way the ATT picks many more SNPs per causal variant than the NEG or DE; some of these are remote from, and in low LD with, the nearest causal variant, and fail to reach our threshold of r² > 0.05 for useful tagging. However, the fact that ATT selects many more SNPs than DE or NEG can spuriously inflate its true power, because some of these additional significant SNPs will by chance be in LD with one or more causal variants. In the case of several causal variants, the ATT may produce what appears to be a single signal.

Figure 2. Main simulation study: histograms of the number of selected SNPs tagging (at r² > 0.05) each causal SNP for (A) NEG and (B) ATT analyses.

Of the 3,000 causal SNPs in the simulations, 1,402 are detected by both the NEG and ATT analyses, 54 are only detected by NEG and 32 are only detected by ATT.
Although this difference in empirical power is small (0.7%), a p-value for the null hypothesis that the NEG and ATT are equally powerful is 0.011 (binomial probability of ≤32 successes, given 86 trials and success probability 0.5). Moreover, the NEG empirical power equals or exceeds that of ATT for all six combinations of MAF and allelic risk ratio (Table 2). Thus we have evidence of improved power of NEG over ATT, in addition to its lower false positive rate.

Table 2. Main simulation study: numbers of causal SNPs tagged, out of 500 for each combination of MAF and allelic risk ratio.

  MAF                | 15%       | 5%        | 2%
  Allelic risk ratio | 1.4 / 1.5 | 1.8 / 2.2 | 2.5 / 3.0
  NEG                | 252 / 360 | 209 / 370 | 146 / 239
  DE                 | 233 / 347 | 194 / 366 | 135 / 227
  ATT                | 244 / 353 | 209 / 370 | 143 / 235

DE shows fewer false positives than NEG, but it tags fewer causal SNPs even though it selects 2,622 SNPs, many more than NEG (Table 1). The lighter tails of the DE distribution in comparison with the NEG result in informative variables being shrunk closer to zero. This can result in other correlated SNPs being brought into the model to explain the full effect of the over-shrunk SNP coefficients. Our preliminary analyses varying λ showed that as λ increased the false positive rate decreased, but the power also decreased and in doing so approached the results obtained with the DE prior. With λ = 10 the results were very similar to the DE results.

Null Simulation

To validate our type-I error rate approximation, and to assess the effect on type-I error of allowing for dominant and recessive effects, we permuted the case-control status in one of the 80 K data sets to generate 1,000 data sets representing samples from the null of no genetic effects. The resulting per-SNP type-I error rates of the NEG, DE and ATT methods are shown in Table 3. False positives were only counted if they were further than 20 Kb away from any other false positive that had already been recorded; the results were the same when the minimum separation was 100 Kb. The type-I error rate was highest for ATT, although the differences are not significant. All three analyses result in noticeably fewer false positives than the nominal rate of α = 10⁻⁵. Because of LD between SNPs it is not easy to decide if this difference is significant, but it suggests that our type-I error approximation is as conservative as that of the ATT, which is based on the approximation.

Table 3. Null simulation: empirical per-SNP type-I error rates from 1,000 permutations of case-control labels of 2 K individuals genotyped at 80 K SNPs (error rates per million SNPs).

  Method | Additive only | Additive, dominant and recessive terms
  NEG    | 6.44          | 12.8
  DE     | 6.39          | 12.7
  ATT    | 6.48          | -

In each case the nominal per-SNP type-I error rate for the additive-only model was 10⁻⁵ ( = 10 per million SNPs). When dominant and recessive effects are considered in addition to additive terms, the false positive rate approximately doubles. Thus parameter settings that control the type-I error at 2.5% for additive effects will approximately control the type-I error at 5% when dominant and recessive effects are also included. Recall that our simulations generated an approximately uniform MAF distribution of the marker SNPs, and this result may vary with different MAF distributions.
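To make the shrinkage-prior idea in this excerpt concrete, here is a toy sketch, not the authors' NEG/DE algorithm: for a single simulated SNP it grid-evaluates the logistic log-likelihood plus a Laplace (DE) log-prior and reports the posterior mode, which typically sits at zero for a weak signal and moves away from zero for a strong one. The penalty scale, sample size and data are invented, and the intercept is omitted.

```python
# A minimal sketch, assuming a standardised -1/1 genotype code and a made-up
# Laplace/DE penalty scale; the mode is found by a simple grid search.
import numpy as np

rng = np.random.default_rng(3)
n, lam = 1000, 40.0                       # sample size and DE penalty scale

def posterior_mode(true_beta):
    x = rng.choice([-1.0, 1.0], size=n)               # genotype covariate
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-true_beta * x)))
    grid = np.linspace(-2, 2, 4001)                    # includes beta = 0
    # logistic log-likelihood plus DE log-prior (up to an additive constant)
    loglik = np.array([np.sum(y * b * x - np.log1p(np.exp(b * x))) for b in grid])
    logpost = loglik - lam * np.abs(grid)
    return grid[np.argmax(logpost)]

print("weak signal   -> posterior mode ≈", round(posterior_mode(0.05), 3))
print("strong signal -> posterior mode ≈", round(posterior_mode(0.8), 3))
```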
Whole Genome Simulation Study

We also generated a data set corresponding to a genome-wide association study consisting of 480 K SNPs, derived from 120 independent 20 Mb chromosomes using the same SNP ascertainment strategy as used in the previous simulation. We chose one 20 Mb chromosome to have ten causal variants, each with MAF of 15% and allelic risk ratio of 2. This disease model is unrealistic but was chosen to permit detection of the majority of the causal variants and thus make general comments on the relative merits of the NEG and ATT in one simulation. We analysed this data set as before using the NEG and ATT but with a significance threshold of α = 5×10⁻⁷ [16]. As before, we chose the highest posterior mode from 100 permutations of the search order. Figure 3(A) shows the locations along the 20 Mb chromosome of the ten causal variants, as well as all the SNPs selected by the NEG and ATT analyses. Both methods detected all ten causal variants; however, the NEG analysis selected just 14 SNPs, whereas the ATT identified 35 significant SNPs. We see in Figure 3B that the NEG analysis has improved localisation: it ignores SNPs remote from causal variants that were significant under the ATT, in particular at 8.3 Mb, and has selected SNPs as close to the causal variant as the ATT has. In Figure 3C the improvement in localisation is less clear; the ATT has selected SNPs closer to the causal variant, but at the expense of selecting many more SNPs. In particular the ATT has selected SNPs at 11.5 Mb, about 400 Kb from the causal variant.

Figure 3. GWA simulation. (A) Locations of the ten causal variants (vertical blue lines) on the 20 Mb chromosome; also shown are the SNPs selected by NEG (red dots) and the SNPs with ATT p-value < 5×10⁻⁷ (black dots), plotted against −log10(p-value). (B) and (C) show zooms of two sub-intervals of (A).

Re-Sequencing Simulation Study

Fine mapping of causal variants currently uses very high density markers, obtained either directly from resequencing, from imputation following limited resequencing, or from high-density SNPs in public databases [17],[16],[18]. To illustrate the utility of our method for the analysis of imputed and/or sequence data we took the simulated 20 Mb sequences of 10 K individuals used in the previous analyses, which had 192 K polymorphic sites, and sampled 10 case-control datasets using the same sample sizes and disease model as used in the main simulation. All polymorphic sites were included in our analyses. The data sets were analysed with the NEG and ATT with a per-SNP false positive rate of α = 10⁻⁵. The ATT and NEG analyses showed similar power over the ten sequence-level datasets. Both methods detected 54 of the 60 causal variants with r² > 0.3; five causal SNPs were missed entirely by both methods and one causal SNP was tagged by both methods at r² ≈ 0.01. However, NEG showed markedly better localisation than ATT. Figure 4 shows the distribution of the highest r² value for each selected SNP with a causal variant using the two methods. The NEG selected just 64 SNPs in comparison with 599 selected by the ATT, and a greater proportion of the selected SNPs were in high LD with a causal variant. Of the 60 causal variants, only nine were tagged twice by NEG; in contrast, it is evident from Figure 4 that the ATT often multiply tags causal SNPs. In no simulation did a SNP selected by NEG tag two causal SNPs.

Figure 4. Re-sequencing simulation: histograms of the maximum r² for each selected SNP with a causal variant for (A) NEG and (B) ATT analyses.

The NEG analysis identified two false positive SNPs at the less stringent r² = 0.01 threshold for tagging a causal SNP. The ATT analysis generated 14 false positives at this threshold; 11 of these SNPs spanned a 230 Kb region including one of the NEG false positives, while two spanned a 103 Kb region including the other NEG false positive. The 14th ATT false positive had r² = 0.009 with a causal variant.

Analysis of Type 2 Diabetes GWA Data Set

From a genome-wide scan on 694 type 2 diabetes cases and 654 controls [19], we reanalyse here genotype data from the Human Hap300 BeadArray, but not the Human Hap100 BeadArray. After removing SNPs with Hardy-Weinberg equilibrium p-value […]

[…]; thus, to control the type-I error at α we assign ξ to equal the right-hand side of (7). For the NEG prior, the value of f′(β) is given in (5). The NEG has two parameters, whereas (7) imposes only one constraint. We considered a range of values for the shape parameter λ from 0.01 to 10 and then assigned γ by substituting (7) into (5) and rearranging. Both the NEG and DE behave similarly when they have been set to have the same type-I error, when their derivative at the origin is the same, and when β = 0. Solutions diverge, however, once SNPs are included in the model, since included SNPs are penalised less by the NEG than by the DE. This results in larger parameter estimates using the NEG, and affects how likely a variable is to be pushed out of the model once it has been included.

Including Dominant and Recessive Effects

So far, the genotype variable xij is the allele count, standardised to have mean zero and variance one. This corresponds to a model that is additive on the logistic scale. To implement a search for dominant or recessive effects, we simply recode this variable accordingly. For example, to seek a recessive effect, we assign xij = −u if individual i is heterozygote or major-allele homozygote at SNP j, and xij = v otherwise, where u and v are chosen to standardise xij. When dominant and recessive effects were included in the model they were considered in the following order: (1) additive, (2) dominant, (3) recessive; terms (2) and (3) are only considered if no preceding term is already included in the model at that SNP.

Simulation Study

The allelic risk ratios were multiplicative within and across loci and the disease prevalence was 12%; the multiplicative disease model is similar to, but not the same as, the logistic regression model on which our analyses are based. Two causal SNPs were chosen with each of the following approximate MAF values: 2%, 5% and 15%. The two allelic risk ratios for each MAF were chosen so that the power to detect an association was around 25% and 75% using the ATT at a significance threshold of 10⁻⁵; see Table 2 for the effect sizes. With this disease model, the background disease risk is typically ≈6% for individuals carrying no causal alleles, and this risk can be attributed either to polygenic or environmental effects. Thus, although we explicitly simulate six causal alleles, this does not exclude multiple weaker causal alleles that are unlikely to be detected. Marker SNPs were sampled randomly from disjoint 5 Kb regions on each chromosome with probability proportional to MAF(1−MAF), resulting in an approximately uniform MAF distribution.

Supporting Information

Text S1. Simultaneous Analysis of all SNPs in a Genome-Wide Association Study.
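The recessive-effect recoding described in this excerpt can be sketched in a few lines: give one value to minor-allele homozygotes and another to heterozygotes and major-allele homozygotes, then choose the two values so that the covariate has mean zero and variance one. This is only an illustration; the 0/1/2 minor-allele-count input convention and the function name are assumptions, not the authors' code.

```python
# A small sketch, assuming the input is a vector of 0/1/2 minor-allele counts.
import numpy as np

def recessive_code(allele_counts):
    """Recode a 0/1/2 minor-allele-count vector for a recessive effect."""
    is_recessive = (np.asarray(allele_counts) == 2).astype(float)
    x = is_recessive - is_recessive.mean()       # centre to mean zero
    sd = x.std()
    return x / sd if sd > 0 else x               # standardise to unit variance

counts = np.array([0, 1, 2, 1, 0, 2, 2, 1, 0, 0])
print(np.round(recessive_code(counts), 3))
```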

                Author and article information

                Journal
                BMC Genet
                BMC Genetics
                BioMed Central
                ISSN: 1471-2156
                Published: 15 February 2013
                Volume: 14
                Article: 5
                Affiliations
                [1] Department of Electrical and Computer Engineering, University of Miami, Coral Gables, FL 33146, USA
                [2] Department of Botany and Plant Sciences, University of California, Riverside, CA 92521, USA
                Article
                Article ID: 1471-2156-14-5
                DOI: 10.1186/1471-2156-14-5
                PMC: 3771412
                PMID: 23410082
                f3fc97f9-a89d-46be-bd76-1ce3ebc8a4f3
                Copyright ©2013 Huang et al; licensee BioMed Central Ltd.

                This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

                History
                Received: 1 October 2012
                Accepted: 6 February 2013
                Categories
                Methodology Article

                Genetics
                Keywords: qtl mapping, binary traits, epistatic effects, bayesian shrinkage, logistic regression
