Both linear mixed models (LMMs) and sparse regression models are widely used in genetics applications, including, recently, polygenic modeling in genome-wide association studies. These two approaches make very different assumptions, and so are expected to perform well in different situations. However, in practice, for a given dataset one typically does not know which set of assumptions will be more accurate. Motivated by this, we consider a hybrid of the two, which we refer to as a "Bayesian sparse linear mixed model" (BSLMM); it includes both of these models as special cases. We address several key computational and statistical issues that arise when applying BSLMM, including appropriate prior specification for the hyper-parameters and a novel Markov chain Monte Carlo algorithm for posterior inference. We apply BSLMM and compare it with other methods on two polygenic modeling applications: estimating the proportion of variance in phenotypes explained (PVE) by available genotypes, and phenotype (or breeding value) prediction. For PVE estimation, we demonstrate that BSLMM combines the advantages of both standard LMMs and sparse regression modeling. For phenotype prediction, it considerably outperforms both of these methods, as well as several other large-scale regression methods previously suggested for this problem. Software implementing our method is freely available from http://stephenslab.uchicago.edu/software.html.
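To make the hybrid concrete, a sketch of the model follows. The abstract does not give the specification, so the notation below (phenotype vector $\mathbf{y}$, genotype matrix $\mathbf{X}$, relatedness matrix $\mathbf{K}$, and the hyper-parameters) is an assumed, generic formulation of a sparse linear mixed model, not necessarily the exact parameterization used in the paper:

```latex
\[
\mathbf{y} = \mathbf{1}_n \mu + \mathbf{X}\tilde{\boldsymbol{\beta}}
           + \mathbf{u} + \boldsymbol{\varepsilon},
\qquad
\mathbf{u} \sim \mathrm{MVN}_n\!\big(\mathbf{0},\, \sigma_b^2 \tau^{-1} \mathbf{K}\big),
\qquad
\boldsymbol{\varepsilon} \sim \mathrm{MVN}_n\!\big(\mathbf{0},\, \tau^{-1} \mathbf{I}_n\big),
\]
\[
\tilde{\beta}_i \;\sim\; \pi\, \mathrm{N}\!\big(0,\, \sigma_a^2 \tau^{-1}\big)
                \;+\; (1 - \pi)\,\delta_0 ,
\]
```

Here the point-normal (spike-and-slab) prior on $\tilde{\boldsymbol{\beta}}$ supplies the sparse-regression component, and the random effect $\mathbf{u}$ supplies the LMM component. Under this formulation the two special cases mentioned above fall out directly: setting $\pi = 0$ removes the sparse effects and leaves a standard LMM, while setting $\sigma_b^2 = 0$ removes the random effect and leaves a sparse regression model.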
The goal of polygenic modeling is to better understand the relationship between genetic variation and variation in observed characteristics, including variation in quantitative traits (e.g. cholesterol level in humans, milk production in cattle) and disease susceptibility. Improvements in polygenic modeling will help improve our understanding of this relationship and could ultimately lead to, for example, changes in clinical practice in humans or better breeding/mating strategies in agricultural programs. Polygenic models present important challenges, both at the modeling/statistical level (which modeling assumptions produce the best results?) and at the computational level (how can these models be fit to data efficiently?). We develop novel approaches to help tackle both of these challenges, and we demonstrate the resulting gains in accuracy on both simulated and real data examples.