
      A general Bayesian treatment for MEG source reconstruction incorporating lead field uncertainty


          Abstract

There is uncertainty introduced when a cortical-surface-based model derived from an anatomical MRI is used to reconstruct neural activity from MEG data. This is a specific case of a more general problem: uncertainty in parameters on which M/EEG lead fields depend non-linearly. Here we present a general mathematical treatment of any such problem, with a particular focus on co-registration. We use a Metropolis search followed by Bayesian Model Averaging over multiple sparse-prior source inversions with different head location/orientation parameters. Based on MEG data alone we can locate the cortex to within 4 mm at empirically realistic signal-to-noise ratios. We also show that this process gives improved posterior distributions on the estimated current distributions, and can be extended to make inferences on the locations of local maxima by providing confidence intervals for each source.
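The abstract describes the approach only at a high level: a Metropolis search over head location/orientation parameters, followed by Bayesian Model Averaging (BMA) of source inversions weighted by their model evidence. The sketch below illustrates that general idea only; log_evidence() is a hypothetical stand-in for the sparse-prior source inversion and free-energy evaluation used in the paper, and in the paper the BMA is over the estimated current distributions rather than over the pose itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_evidence(pose):
    """Hypothetical stand-in: run a sparse-prior source inversion using the lead
    field implied by `pose` (3 translations + 3 rotations) and return its
    approximate log model evidence (e.g. a free-energy value)."""
    true_pose = np.zeros(6)                        # assumed ground-truth co-registration
    return -0.5 * np.sum(((pose - true_pose) / 0.004) ** 2)

def metropolis(n_samples=2000, step=0.002):
    """Random-walk Metropolis over head pose, targeting p(pose | data)."""
    pose = np.full(6, 0.01)                        # initial co-registration guess
    lp = log_evidence(pose)
    poses, logps = [], []
    for _ in range(n_samples):
        cand = pose + step * rng.standard_normal(6)
        lp_cand = log_evidence(cand)
        if np.log(rng.uniform()) < lp_cand - lp:   # Metropolis accept/reject
            pose, lp = cand, lp_cand
        poses.append(pose.copy())
        logps.append(lp)
    return np.array(poses), np.array(logps)

poses, logps = metropolis()
# Bayesian Model Averaging over the visited poses, weighted by (normalised) evidence.
# In the paper the averaging is over the estimated current distributions; the pose
# is averaged here only to keep the sketch short.
w = np.exp(logps - logps.max())
w /= w.sum()
print("BMA pose estimate:", w @ poses)
```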

          Highlights

► We show how to account for lead field uncertainty in the MEG inverse problem.
► Co-registration errors propagate through to the final current density estimate.
► It is possible to recover the location of the head based on MEG data alone.


Most cited references (17)


          The magnetic lead field theorem in the quasi-static approximation and its use for magnetoencephalography forward calculation in realistic volume conductors.

The equation for the magnetic lead field for a given magnetoencephalography (MEG) channel is well known for arbitrary frequencies ω but is not directly applicable to MEG in the quasi-static approximation. In this paper we derive an equation for ω = 0 starting from the very definition of the lead field instead of using Helmholtz's reciprocity theorems. The results are (a) the transpose of the conductivity times the lead field is divergence-free, and (b) the lead field differs from the one in any other volume conductor by the gradient of a scalar function. Consequently, for a piecewise homogeneous and isotropic volume conductor, the lead field is always tangential at the outermost surface. Based on this theoretical result, we formulated a simple and fast method for the MEG forward calculation for one shell of arbitrary shape: we correct the corresponding lead field for a spherical volume conductor by a superposition of basis functions, gradients of harmonic functions constructed here from spherical harmonics, with coefficients fitted to the boundary conditions. The algorithm was tested for a prolate spheroid of realistic shape for which the analytical solution is known. For high orders in the expansion we found the solutions to be essentially exact, and for reasonable accuracies far fewer multiplications are needed than in typical implementations of boundary element methods. The generalization to more shells is straightforward.
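The correction scheme described above amounts, in essence, to a linear least-squares fit: the spherical-model lead field is augmented with gradients of harmonic functions so that the corrected field satisfies the tangential boundary condition at the outer surface. The sketch below illustrates only that fitting step; it uses low-order harmonic polynomials in place of the paper's spherical-harmonic construction, and both the "spherical" lead field and the surface geometry are toy placeholders.

```python
import numpy as np

def leadfield_sphere(r):
    """Toy stand-in for the analytic spherical-model lead field at points r (n x 3)."""
    return np.stack([r[:, 1], -r[:, 0], 0.1 * np.ones(len(r))], axis=1)

def harmonic_gradients(r):
    """Gradients of a few low-order harmonic polynomials (x, y, z, xy, xz, yz, x^2 - y^2)."""
    x, y, z = r.T
    one, zero = np.ones_like(x), np.zeros_like(x)
    grads = [np.stack(g, axis=1) for g in (
        (one, zero, zero),            # grad x
        (zero, one, zero),            # grad y
        (zero, zero, one),            # grad z
        (y, x, zero),                 # grad xy
        (z, zero, x),                 # grad xz
        (zero, z, y),                 # grad yz
        (2 * x, -2 * y, zero))]       # grad (x^2 - y^2)
    return np.stack(grads, axis=2)    # shape (n, 3, K)

# Toy outer surface: points and outward normals of a unit sphere.
rng = np.random.default_rng(1)
pts = rng.standard_normal((200, 3))
pts /= np.linalg.norm(pts, axis=1, keepdims=True)
normals = pts                         # for a sphere, the outward normal is the position vector

L0 = leadfield_sphere(pts)
G = harmonic_gradients(pts)
# Boundary condition: the corrected lead field should be tangential at the surface,
#   (L0 + sum_k c_k grad(phi_k)) . n = 0  at every surface point,
# which is a linear least-squares problem for the coefficients c.
A = np.einsum('ni,nik->nk', normals, G)       # normal components of the basis gradients
b = -np.einsum('ni,ni->n', normals, L0)       # minus the normal component of the sphere field
c, *_ = np.linalg.lstsq(A, b, rcond=None)
L_corrected = L0 + G @ c
print("max |normal component| after correction:",
      np.abs(np.einsum('ni,ni->n', normals, L_corrected)).max())
```

With a richer basis (higher-order harmonic functions), the residual normal component can be driven towards zero, which is the sense in which the abstract reports essentially exact solutions at high expansion orders.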

            Review on solving the inverse problem in EEG source analysis

In this primer, we give a review of the inverse problem for EEG source localization. This is intended for researchers new to the field, to give insight into the state-of-the-art techniques used to find approximate solutions for the brain sources giving rise to a scalp potential recording. Furthermore, a review of the performance results of the different techniques is provided in order to compare these inverse solutions. The authors also include the results of a Monte-Carlo analysis which they performed to compare four non-parametric algorithms, and hence contribute to what is presently recorded in the literature. An extensive list of references to the work of other researchers is also provided. This paper starts off with a mathematical description of the inverse problem and proceeds to discuss the two main categories of methods which were developed to solve the EEG inverse problem, namely the non-parametric and parametric methods. The main difference between the two is whether or not a fixed number of dipoles is assumed a priori. Various techniques falling within these categories are described, including minimum norm estimates and their generalizations, LORETA, sLORETA, VARETA, S-MAP, ST-MAP, Backus-Gilbert, LAURA, Shrinking LORETA FOCUSS (SLF), SSLOFO and ALF for non-parametric methods, and beamforming techniques, BESA, subspace techniques such as MUSIC and methods derived from it, FINES, simulated annealing and computational intelligence algorithms for parametric methods. From a review of the performance of these techniques as documented in the literature, one could conclude that in most cases the LORETA solution gives satisfactory results. In situations involving clusters of dipoles, however, higher-resolution algorithms such as MUSIC or FINES are preferred. Imposing reliable biophysical and psychological constraints, as done by LAURA, has given superior results. The Monte-Carlo analysis performed, comparing WMN, LORETA, sLORETA and SLF for different noise levels and different simulated source depths, has shown that for single source localization, regularized sLORETA gives the best solution in terms of both localization error and ghost sources. Furthermore, the computationally intensive solution given by SLF was not found to give any additional benefits under such simulated conditions.
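As a concrete illustration of the simplest family mentioned above, here is a minimal sketch of a minimum norm estimate: given a lead-field matrix L and sensor data y, the regularised source estimate is ĵ = Lᵀ (L Lᵀ + λI)⁻¹ y. The lead field below is random toy data rather than a real head model, and the regularisation heuristic is only illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sensors, n_sources = 64, 500

L = rng.standard_normal((n_sensors, n_sources))          # toy lead-field matrix
j_true = np.zeros(n_sources)
j_true[rng.choice(n_sources, 3, replace=False)] = 1.0    # three active sources
y = L @ j_true + 0.1 * rng.standard_normal(n_sensors)    # noisy sensor data

lam = 0.1 * np.trace(L @ L.T) / n_sensors                # simple regularisation heuristic
# Minimum norm estimate: j_hat = L' (L L' + lam I)^(-1) y
j_hat = L.T @ np.linalg.solve(L @ L.T + lam * np.eye(n_sensors), y)

print("largest estimated sources:", np.argsort(-np.abs(j_hat))[:5])
print("true active sources:      ", np.flatnonzero(j_true))
```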

              Comparing Dynamic Causal Models using AIC, BIC and Free Energy

              W.D. Penny (2012)
Introduction

Mathematical models of scientific data can be formally compared using the Bayesian model evidence (Bernardo and Smith, 2000; Gelman et al., 1995; Mackay, 2003), an approach that is now widely used in statistics (Hoeting et al., 1999), signal processing (Penny and Roberts, 2002), machine learning (Beal and Ghahramani, 2003), and neuroimaging (Friston et al., 2008; Penny et al., 2003; Trujillo-Barreto et al., 2004). By comparing the evidence or ‘score’ of one model relative to another, a decision can be made as to which is the more veridical. This approach has now been widely adopted for the analysis of brain connectivity data, especially in the context of Dynamic Causal Modelling (DCM) (Friston et al., 2003; Penny et al., 2004).

Originally (Penny et al., 2004), it was proposed to score DCMs using a combination of Akaike's Information Criterion (AIC) and the Bayesian Information Criterion (BIC). Specifically, it was proposed that if both AIC and BIC provided a log Bayes factor (difference in log model evidences) of greater than three in favour of model one versus model two, one could safely conclude that model one was the more veridical. More recently it has been proposed (Stephan et al., 2010), on theoretical grounds, to instead score DCMs using the Free Energy (Friston et al., 2007a). However, until now there has been no empirical comparison of the model comparison abilities of the different approaches. This motivates the work in this paper, which describes a simulation study comparing AIC, BIC and the Free Energy.

Differences in performance are examined in the context of General Linear Models (GLMs) and Dynamic Causal Models (DCMs). Specifically, for each class of model we define a ‘full’ and a ‘nested’ model, where the nested model is a special case of the full model with a subset of parameters set to zero. We examine how model comparison accuracy varies as a function of the number of data points and the signal to noise ratio for the separate cases of data being generated by full or nested models. This allows us to assess the sensitivity and specificity of the different model comparison criteria. The paper uses simulated data generated from models with known parameters, but these parameters are derived from empirical neuroimaging data. We start by briefly reviewing the relevant theoretical background and then go on to present our results.

Methods

We consider Bayesian inference on data y using model m with parameters θ. In the analysis of brain connectivity, the data would comprise, for example, fMRI time series from multiple brain regions, the model would make specific assumptions about connectivity structure, and the parameters would correspond to connection strengths. A generic approach for statistical inference in this context is Bayesian estimation (Bishop, 2006; Gelman et al., 1995), which provides estimates of two quantities. The first is the posterior distribution over model parameters, p(θ|m, y), which can be used to make inferences about the parameters θ. The second is the probability of the data given the model, otherwise known as the model evidence. This can be used for model comparison, in that ratios of model evidences (Bayes factors) allow one to choose between models (Kass and Raftery, 1995; Raftery, 1995). This paper focusses on Dynamic Causal Models and on model inference using AIC, BIC or Free Energy approximations to the model evidence.
We first describe DCM, show how model parameters are estimated, describe Bayesian inference for General Linear Models, and then go on to describe the different model selection criteria. In what follows, N(x; m, S) represents a multivariate Gaussian density over x with mean m and covariance S, and |S| denotes the determinant of the matrix S.

DCM for fMRI

Dynamic Causal Modelling is a framework for fitting differential equation models of neuronal activity to brain imaging data using Bayesian inference. There is now a library of DCMs, and variants differ according to their level of biological realism and the data features which they explain. The DCM approach can be applied to functional Magnetic Resonance Imaging (fMRI), Electroencephalographic (EEG), Magnetoencephalographic (MEG), and Local Field Potential (LFP) data (Daunizeau et al., 2009). The empirical work in this paper uses DCM for fMRI.

Neurodynamics

This paper uses DCM for fMRI with bilinear neurodynamics and an extended Balloon model (Friston, 2002) for the hemodynamics. The neurodynamics are described by the following multivariate differential equation

(1) $\dot{z}_t = \left( A + \sum_{j=1}^{M} u_t(j)\, B^j \right) z_t + C u_t$

where t indexes continuous time and the dot notation denotes a time derivative. The ith entry in z_t corresponds to neuronal activity in the ith brain region, and u_t(j) is the jth experimental input. A DCM is characterised by a set of ‘intrinsic connections’, A, that specify which regions are connected and whether these connections are unidirectional or bidirectional. We also define a set of input connections, C, that specify which inputs are connected to which regions, and a set of modulatory connections, B^j, that specify which intrinsic connections can be changed by which inputs. Usually, the B parameters are of greatest interest, as these describe how connections between brain regions depend on experimental manipulations. The overall specification of input, intrinsic and modulatory connectivity comprises our assumptions about model structure. This in turn represents a scientific hypothesis about the structure of the large-scale neuronal network mediating the underlying cognitive function. These hypotheses, or models, are indexed by m. The simulations in this paper use ‘DCM8’ (available in SPM8 prior to revision 4010) with a deterministic, single-state, bilinear neurodynamical model as described above.
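To make the bilinear form of Eq. (1) concrete, here is a minimal sketch (not the SPM/DCM8 implementation) of Euler integration of these neurodynamics for an arbitrary network; the connectivity matrices and the input are toy values chosen purely for illustration.

```python
import numpy as np

def integrate_bilinear(A, B, C, u, dt=0.01):
    """Euler integration of Eq. (1): dz/dt = (A + sum_j u_j B_j) z + C u.

    A : (n, n) intrinsic connections
    B : (m, n, n) modulatory connections, one matrix per input
    C : (n, m) input connections
    u : (T, m) experimental inputs over time
    """
    n = A.shape[0]
    z = np.zeros(n)
    traj = np.zeros((len(u), n))
    for t, u_t in enumerate(u):
        J = A + np.tensordot(u_t, B, axes=1)   # effective connectivity at time t
        z = z + dt * (J @ z + C @ u_t)
        traj[t] = z
    return traj

# Toy two-region, one-input example (illustrative values only).
A = np.array([[-1.0, 0.0],
              [ 0.3, -1.0]])                   # stable intrinsic dynamics
B = np.array([[[0.0, 0.0],
               [0.5, 0.0]]])                   # input 1 modulates the region 1 -> 2 connection
C = np.array([[1.0],
              [0.0]])                          # the input drives region 1
u = np.zeros((2000, 1)); u[200:400, 0] = 1.0   # a 2 s box-car input at dt = 0.01
z = integrate_bilinear(A, B, C, u)
print("peak activity per region:", z.max(axis=0))
```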
Model predictions

In DCM, neuronal activity gives rise to fMRI signals via an extended Balloon model (Buxton et al., 2004) and a BOLD signal model (Stephan et al., 2007) for each region. This specifies how changes in neuronal activity give rise to changes in blood oxygenation that are measured with fMRI. The equations for these hemodynamics are provided in Appendix A and depend on a set of hemodynamic parameters h. Overall, the DCM parameters are collectively written as the vector θ = {A, B, C, h}. Numerical integration of the neurodynamic (Eq. (1)) and hemodynamic (Appendix A) equations leads to a prediction of fMRI activity in each brain region. These values are concatenated to produce a single model prediction vector g(θ).

Priors

The priors factorise over parameter types

(2) $p(\theta|m) = p(A|m)\, p(B|m)\, p(C|m)\, p(h|m)$

and each parameter prior is Gaussian. The priors used in this paper correspond to those used in ‘DCM8’. The priors over the intrinsic connections are chosen to encourage stable dynamics. This results in prior variances which depend on the number of regions in the model (Friston et al., 2003); in this paper we model activity in three regions. For the intrinsic self-connections we have

(3) $p(A_{ii}|m) = N(A_{ii}; -1, \sigma_{self}^2)$

with σ_self = 0.177. The time constant associated with a self-connection is τ_i = −1/A_ii, and the time at which activity decays to half its initial value (the half-life) is (1/A_ii) log 0.5 (Friston et al., 2003). The prior over self-connections corresponds to a prior over half-lives that can be determined by sampling from p(A_ii|m) and transforming variables to τ_i = −1/A_ii. This produces a mean half-life of approximately 720 ms, with 90% of the distribution between 500 and 1000 ms. For those intrinsic cross connections which are not fixed at zero by the model assumptions m we have

(4) $p(A_{ik}|m) = N(A_{ik}; 1/64, \sigma_{cross}^2)$

where σ_cross = 0.5. Elements of the modulatory and input connectivity matrices (which are not fixed at zero by model assumptions) have shrinkage priors

(5) $p(B^j_{ik}|m) = N(B^j_{ik}; 0, \sigma_s^2)$

(6) $p(C_{ij}|m) = N(C_{ij}; 0, \sigma_s^2)$

with σ_s = 2. In the above, i and k index brain regions and j indexes experimental inputs. The prior variance parameters σ_self², σ_cross² and σ_s², along with the prior variances on the hemodynamic parameters (see Appendix A), determine the overall prior covariance on model parameters, C_θ (see next section). In the Free Energy model comparison criterion (see below) these variances contribute to the penalty paid for each parameter.
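The half-life figures quoted above (a mean of roughly 720 ms, with 90% of the prior mass between about 500 and 1000 ms) can be checked directly by sampling from the self-connection prior of Eq. (3); a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_self = 0.177

# Sample self-connections from the prior p(A_ii | m) = N(-1, sigma_self^2) of Eq. (3)
# and transform to half-lives (1/A_ii) * log(0.5), in seconds.
a_ii = rng.normal(-1.0, sigma_self, size=100_000)
a_ii = a_ii[a_ii < 0]                       # discard the (vanishingly rare) unstable draws
half_life = np.log(0.5) / a_ii

print("mean half-life (ms):", 1000 * half_life.mean())                        # ~720 ms
print("5th-95th percentiles (ms):", 1000 * np.percentile(half_life, [5, 95]))  # ~500-1000 ms
```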
Optimisation

The standard algorithm used to optimise DCMs is the Variational Laplace (VL) method described in (Friston et al., 2007a). The VL algorithm can be used for Bayesian estimation of any nonlinear model of the form

(7) $y = g(\theta) + e$

where g(θ) is some nonlinear function and e is zero-mean additive Gaussian noise with covariance C_y. This covariance depends on hyperparameters λ as shown below. The likelihood of the data is therefore

(8) $p(y|\theta, \lambda, m) = N(y; g(\theta, m), C_y)$

The framework allows for Gaussian priors over model parameters

(9) $p(\theta|m) = N(\theta; \mu_\theta, C_\theta)$

where the prior mean and covariance are assumed known. The error covariances are assumed to decompose into terms of the form

(10) $C_y^{-1} = \sum_i \exp(\lambda_i)\, Q_i$

where the Q_i are known precision basis functions. The hyperparameters that govern these error precisions are collectively written as the vector λ; these will be estimated. Additionally, the hyperparameters are constrained by the prior

(11) $p(\lambda|m) = N(\lambda; \mu_\lambda, C_\lambda)$

The above distributions allow one to write down an expression for the joint likelihood of the data, parameters and hyperparameters

(12) $p(y, \theta, \lambda|m) = p(y|\theta, \lambda, m)\, p(\theta|m)\, p(\lambda|m)$

The VL algorithm then assumes an approximate posterior density of the following factorised form

(13) $q(\theta, \lambda|y, m) = q(\theta|y, m)\, q(\lambda|y, m)$, with $q(\theta|y, m) = N(\theta; m_\theta, S_\theta)$ and $q(\lambda|y, m) = N(\lambda; m_\lambda, S_\lambda)$

The parameters of these approximate posteriors are then iteratively updated so as to minimise the Kullback–Leibler (KL) divergence between the true and approximate posteriors. This algorithm is described fully in (Friston et al., 2007a).

We emphasise here that the Variational Laplace framework assumes that the prior means and covariances (μ_θ, C_θ, μ_λ, C_λ) are known. They are not estimated from data, as is the case for Empirical Bayes methods (Carlin and Louis, 2000). We will return to this issue in the Discussion.

Hyperparameters in DCM for fMRI

In DCM for fMRI the precision basis functions Q_i, defined in Eq. (10), are set to Q_i = I for each brain region. The quantity γ_i = exp(λ_i) therefore corresponds to the noise precision in region i. The overall error covariance matrix C_y has a block structure corresponding to the assumption that observation noise is independent and identically distributed within each region. This is valid as time series data are usually pre-whitened before entering into a DCM analysis (Friston et al., 2003). The prior mean and covariance of the associated latent variables are set to

(14) $\mu_\lambda = 0$, $C_\lambda = 1$

This corresponds to the assumption that the mean prior noise precision is $\bar{\gamma}_i = 1.7$. These values, along with the priors on the neurodynamic parameters, have been set so as to produce data sets with typical signal to noise ratios encountered in fMRI.

Model evidence

The model evidence, also known as the marginal likelihood, is not straightforward to compute, since its computation involves integrating out the dependence on model parameters

(15) $p(y|m) = \int\!\!\int p(y|\theta, \lambda, m)\, p(\theta|m)\, p(\lambda|m)\, d\theta\, d\lambda$

The following sections describe Free Energy, AIC and BIC approximations to the (log) model evidence. Once the evidence has been computed, models m_1 and m_2 can be compared using the Bayes factor

(16) $B_{12} = \frac{p(y|m_1)}{p(y|m_2)}$

with a value of 20 corresponding to a posterior probability of greater than 0.95 in favour of model m_1. The corresponding log Bayes factor is 3. The use of Bayes factors for model comparison is described more fully elsewhere (Kass and Raftery, 1995; Penny et al., 2004). Comparison of a large number of models is best implemented using the full posterior density, p(m|y), as described in (Penny et al., 2010).

Free energy

It is possible to place a lower bound on the log model evidence of the following form (Beal, 2003)

(17) $\log p(y|m) = F(m) + KL[\,q(\theta, \lambda|m)\,\|\,p(\theta, \lambda|y, m)\,]$

where F(m) is known as the negative variational free energy (henceforth ‘Free Energy’) and the last term is the Kullback–Leibler distance between the true posterior density, p(θ, λ|y, m), and an approximate posterior, q(θ, λ|m). Because the KL term is always positive (Mackay, 2003), F(m) provides a lower bound on the model evidence. The Free Energy is defined as

(18) $F(m) = \int\!\!\int q(\theta, \lambda|y, m)\, \log \frac{p(y, \theta, \lambda|m)}{q(\theta, \lambda|y, m)}\, d\theta\, d\lambda$

and can be estimated using a Laplace approximation (Friston et al., 2007a), F_L(m), as derived in Appendix B. As noted in (Wipf and Nagarajan, 2009), because the Laplace approximation is not exactly equal to the Free Energy, the above lower-bound property no longer holds. That is, the Laplace approximation does not lower bound the log model evidence. As we shall see, however, it nevertheless provides a very useful model comparison criterion.
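As a quick worked example of Eq. (16) and the thresholds just quoted, the sketch below converts two log model evidences into a Bayes factor and a posterior model probability, assuming equal prior probabilities for the two models:

```python
import numpy as np

def compare(log_ev_1, log_ev_2):
    """Bayes factor B_12 (Eq. 16) and posterior p(m1 | y), assuming equal prior model probabilities."""
    log_bf = log_ev_1 - log_ev_2
    bf = np.exp(log_bf)
    p_m1 = bf / (1.0 + bf)          # = p(y|m1) / (p(y|m1) + p(y|m2))
    return bf, p_m1

# A log Bayes factor of 3 corresponds to B_12 of about 20 and p(m1|y) of about 0.95.
bf, p_m1 = compare(-100.0, -103.0)
print(f"Bayes factor: {bf:.1f}, posterior probability of model 1: {p_m1:.3f}")
```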
The Laplace approximation to the Free Energy is given in Eq. (57) and can be expressed as a sum of accuracy and complexity terms (Beal, 2003)

(19) $F_L(m) = \mathrm{Accuracy}(m) - \mathrm{Complexity}(m)$

(20) $\mathrm{Accuracy}(m) = -\tfrac{1}{2} e_y^T C_y^{-1} e_y - \tfrac{1}{2}\log|C_y| - \tfrac{N}{2}\log 2\pi$

(21) $\mathrm{Complexity}(m) = \tfrac{1}{2} e_\theta^T C_\theta^{-1} e_\theta + \tfrac{1}{2}\log|C_\theta| - \tfrac{1}{2}\log|S_\theta|$
         $+ \tfrac{1}{2} e_\lambda^T C_\lambda^{-1} e_\lambda + \tfrac{1}{2}\log|C_\lambda| - \tfrac{1}{2}\log|S_\lambda|$

where N is the number of data points and the ‘error terms’ are

(22) $e_y = y - g(m_\theta)$, $\quad e_\theta = m_\theta - \mu_\theta$, $\quad e_\lambda = m_\lambda - \mu_\lambda$

The first line of Eq. (21) is the complexity term for the parameters, and the second line the complexity term for the hyperparameters. If the hyperparameters are known then the second line of Eq. (21) disappears. In this case we can write the complexity as

(23) $\mathrm{Complexity}(m) = \tfrac{1}{2} e_\theta^T C_\theta^{-1} e_\theta + \tfrac{1}{2}\log\frac{|C_\theta|}{|S_\theta|}$

In the limit that the posterior equals the prior ($e_\theta = 0$, $C_\theta = S_\theta$), the complexity term equals zero. The last term in Eq. (23), $\tfrac{1}{2}\log\frac{|C_\theta|}{|S_\theta|}$, is also referred to as an Occam factor (see page 349 in (Mackay, 2003)). Because the determinant of a matrix corresponds to the volume spanned by its eigenvectors, this Occam factor gets larger, and the model evidence smaller, as the posterior volume, |S_θ|, shrinks relative to the prior volume, |C_θ|. Models for which parameters have to be specified precisely (small posterior volume) are brittle; they are not good models (their complexity is high). The above considerations also apply to cases where hyperparameters are estimated: there is an additional complexity term (the second line of Eq. (21)) and therefore an additional Occam factor.

Correlated parameters

Other factors being equal, models with strong correlation in the posterior are not good models. For example, given a model with just two parameters, the determinant of the posterior covariance is given by

(24) $|S_\theta| = (1 - r^2)\,\sigma_{\theta_1}^2\,\sigma_{\theta_2}^2$

where r is the posterior correlation and σ_θ1 and σ_θ2 are the posterior standard deviations of the two parameters. For the case of two parameters having a similar effect on model predictions, the posterior correlation will be high, implying a large complexity penalty. However, there is another factor at play: neither parameter will be estimated accurately (the posterior variances will be high). This second factor can offset the higher complexity due to correlation and can lead to a situation in which additional extraneous parameters do not lead to a significant drop in free energy. One would then appeal to a further Occam's Razor principle (Mackay, 2003), namely, that in the absence of significant free energy differences one should prefer the simpler model (see Discussion). When fitting DCMs to fMRI data it is likely that some parameters will be correlated with each other. This correlation can be examined by looking at the posterior covariance matrix S_θ. A good example of this is provided in Fig. 6 of Stephan et al. (2007), who describe posterior correlations among hemodynamic and connectivity parameters. Importantly, these correlations are accommodated in the Free Energy model comparison criterion (see Eq. (23) and above). This is possible because Variational Laplace does not assume that the parameters are a posteriori independent of each other; rather, it assumes that the parameters are a posteriori independent of the hyperparameters (see Eq. (13)).
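The complexity term of Eq. (23) is easy to compute from the prior and posterior moments, and a toy two-parameter example makes the effect of posterior correlation (Eq. (24)) visible; a minimal sketch:

```python
import numpy as np

def complexity(m_theta, mu_theta, C_theta, S_theta):
    """Complexity term of Eq. (23): 0.5 e' C^-1 e + 0.5 log(|C_theta| / |S_theta|)."""
    e = m_theta - mu_theta
    quad = 0.5 * e @ np.linalg.solve(C_theta, e)
    occam = 0.5 * (np.linalg.slogdet(C_theta)[1] - np.linalg.slogdet(S_theta)[1])
    return quad + occam

# Toy two-parameter example: posterior correlation r shrinks |S_theta| (Eq. 24)
# and so increases the Occam factor and hence the complexity.
mu, C = np.zeros(2), np.eye(2)
m = np.array([0.4, -0.2])                      # posterior mean moved away from the prior mean
for r in (0.0, 0.9):
    S = 0.1 * np.array([[1.0, r], [r, 1.0]])   # posterior covariance with correlation r
    print(f"r = {r}: complexity = {complexity(m, mu, C, S):.2f} nats")
```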
Decompositions

It is instructive to decompose approximations to the model evidence into contributions from specific sets of parameters or predictions. In the context of DCM, one can decompose the accuracy terms into contributions from different brain regions, as described previously (Penny et al., 2004). This enables insight to be gained into why one model is better than another. It may be, for example, that one model predicts activity more accurately in a particular brain region. Similarly, it is possible to decompose the complexity term into contributions from different sets of parameters. If we ignore correlation among different parameter sets then the complexity is approximately

(25) $\mathrm{Complexity}(m) \approx \tfrac{1}{2}\sum_j \left[ e_{\theta_j}^T C_{\theta_j}^{-1} e_{\theta_j} + \log\frac{|C_{\theta_j}|}{|S_{\theta_j}|} \right]$

where j indexes the jth parameter set. In the context of DCM these could index input connections (j = 1), intrinsic connections (j = 2), modulatory connections (j = 3), etc. We will see an example of this in the Results section.

General Linear Models

For General Linear Models (GLMs) the model predictions are given by

(26) $g(\theta) = X\theta$

where X is a design matrix and θ are now regression coefficients. The posterior distribution is analytic and given by (Bishop, 2006)

(27) $S_\theta^{-1} = X^T C_y^{-1} X + C_\theta^{-1}$, $\quad m_\theta = S_\theta \left( X^T C_y^{-1} y + C_\theta^{-1} \mu_\theta \right)$

These parameter values can then be plugged into Eqs. (19) to (22) to compute the Free Energy. If the hyperparameters are assumed known then the Free Energy expression in Eq. (19) is exactly equal to the log model evidence; that is, F_L(m) = log p(y|m). We will revisit this case in the Results section. If the hyperparameters are estimated then the Free Energy provides a very close approximation, as confirmed by sampling methods (Friston et al., 2007a).
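For a GLM with known noise covariance, Eq. (27) and the accuracy and complexity expressions of Eqs. (20)–(23) can be evaluated in a few lines, and the resulting F_L can be checked against the analytic log evidence of the linear-Gaussian model, log N(y; Xμ_θ, X C_θ Xᵀ + C_y). A minimal sketch with toy data (the design matrix and priors below are illustrative, not those used in the paper):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
N, p = 64, 3
X = rng.standard_normal((N, p))                 # toy design matrix
C_theta, mu_theta = np.eye(p), np.zeros(p)      # Gaussian prior, Eq. (9)
sigma2 = 0.5                                    # known noise variance
C_y = sigma2 * np.eye(N)

theta_true = rng.multivariate_normal(mu_theta, C_theta)
y = X @ theta_true + rng.normal(0.0, np.sqrt(sigma2), N)

# Posterior moments, Eq. (27)
S = np.linalg.inv(X.T @ np.linalg.inv(C_y) @ X + np.linalg.inv(C_theta))
m = S @ (X.T @ np.linalg.inv(C_y) @ y + np.linalg.inv(C_theta) @ mu_theta)

# Accuracy (Eq. 20) and complexity (Eq. 23); hyperparameters (noise variance) known
e_y, e_th = y - X @ m, m - mu_theta
accuracy = (-0.5 * e_y @ np.linalg.solve(C_y, e_y)
            - 0.5 * np.linalg.slogdet(C_y)[1] - 0.5 * N * np.log(2 * np.pi))
complexity = (0.5 * e_th @ np.linalg.solve(C_theta, e_th)
              + 0.5 * (np.linalg.slogdet(C_theta)[1] - np.linalg.slogdet(S)[1]))
F_L = accuracy - complexity

# Analytic log evidence of the linear-Gaussian model: y ~ N(X mu_theta, X C_theta X' + C_y)
log_ev = multivariate_normal.logpdf(y, X @ mu_theta, X @ C_theta @ X.T + C_y)
print(f"F_L = {F_L:.4f}, analytic log evidence = {log_ev:.4f}")  # should agree up to numerical error
```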
AIC and BIC

A simple approximation to the log model evidence is given by the Bayesian Information Criterion (Schwarz, 1978)

(28) $\mathrm{BIC} = \mathrm{Accuracy}(m) - \tfrac{p}{2}\log N$

where p is the number of parameters and N is the number of data points. The BIC is a special case of the Free Energy approximation that drops all terms that do not scale with the number of data points (see e.g. Appendix A2 in (Penny et al., 2004) for a derivation). This is equivalent to the statement that BIC is equal to the Free Energy in the infinite data limit, when the priors over parameters are flat and the variational posterior is exact (see section 2.3 in (Attias, 1999) and page 217 in (Bishop, 2006)). In practice, as we shall see, these three requirements are almost never met, and BIC will produce model comparisons that are often very different to those from the Free Energy.

An alternative model selection criterion is Akaike's Information Criterion (or ‘An Information Criterion’) (Akaike, 1973)

(29) $\mathrm{AIC} = \mathrm{Accuracy}(m) - p$

AIC is not a formal approximation to the model evidence but derives from information theoretic considerations. Specifically, AIC model selection will choose the model in the comparison set with minimal expected KL divergence to the true model (Akaike, 1973; Burnham and Anderson, 2002). There are precedents in the literature, however, for using it as a surrogate for the model evidence in order to derive a posterior density over models (Burnham and Anderson, 2004; Penny et al., 2004). The AIC criterion has been reported to perform poorly for small numbers of data points (Brockwell and Davis, 2009; Burnham and Anderson, 2004). This has motivated the inclusion of a correction term

(30) $\mathrm{AICc} = \mathrm{AIC} - \frac{p(p+1)}{N - p - 1}$

known as the ‘corrected’ AIC (AICc) (Hurvich and Tsai, 1989). The AICc criterion thus penalises parameters more than does AIC. The two criteria become approximately equal for N > p² and identical in the limit of very large sample sizes. We note, however, that for N […]

[…] σ_s, the average signal standard deviation when drawing parameters from the prior p(θ). We then produced simulated data sets where the signal to noise ratio

(33) $\mathrm{SNR} = \sigma_s / \sigma_e$

was set to a range of values by choosing an appropriate σ_e. SNR defined in this manner can be related to the proportion of variance explained by the model, as shown in Appendix C. The observed fMRI data have a value of SNR = 1.3. Each simulated data set was then generated by drawing regression coefficients from their prior densities, producing model predictions g = Xθ (for both full and nested models) and adding zero-mean Gaussian noise with variance σ_e². We then fitted both full and nested models to each simulated data set and estimated Bayes factors using AIC, BIC and F_L. These criteria were computed by substituting X, C_y, C_θ, and μ_θ as defined in this section into Eq. (27) to compute the posterior mean and covariance for linear models. The prediction errors, e_y, and parameter errors, e_θ, were then computed from Eqs. (22) and (26). We could then compute the accuracy and complexity terms using Eqs. (20) and (21) (the complexity terms for λ were ignored as the observation noise variance was known for these simulations).
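A minimal sketch of the simulation loop just described: data are generated from the full model at a target SNR, both models are fitted, and log Bayes factors are computed from the exact log evidence (equal to F_L here, since the noise variance is known) and from AIC and BIC. The design matrix is a toy one, not the empirical design used in the paper.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

rng = np.random.default_rng(0)
N = 128
X_full = np.column_stack([np.ones(N), rng.standard_normal((N, 2))])  # toy design matrix
X_nested = X_full[:, :2]                    # nested model: last regressor removed

def log_evidence(X, y, sigma_e):
    """Exact log evidence of a linear-Gaussian model with prior theta ~ N(0, I)."""
    return mvn.logpdf(y, np.zeros(len(y)), X @ X.T + sigma_e**2 * np.eye(len(y)))

def accuracy(X, y, sigma_e):
    """Accuracy term (Eq. 20) evaluated at the posterior mean (Eq. 27)."""
    S = np.linalg.inv(X.T @ X / sigma_e**2 + np.eye(X.shape[1]))
    m = S @ (X.T @ y) / sigma_e**2
    e = y - X @ m
    return -0.5 * e @ e / sigma_e**2 - 0.5 * len(y) * np.log(2 * np.pi * sigma_e**2)

snr = 0.5
log_bf = {"F_L": [], "AIC": [], "BIC": []}
for _ in range(200):
    theta = rng.standard_normal(X_full.shape[1])     # draw parameters from the prior
    signal = X_full @ theta                          # data generated by the FULL model
    sigma_e = signal.std() / snr                     # noise level chosen to hit the target SNR
    y = signal + sigma_e * rng.standard_normal(N)
    log_bf["F_L"].append(log_evidence(X_full, y, sigma_e) - log_evidence(X_nested, y, sigma_e))
    p_full, p_nested = X_full.shape[1], X_nested.shape[1]
    log_bf["AIC"].append((accuracy(X_full, y, sigma_e) - p_full)
                         - (accuracy(X_nested, y, sigma_e) - p_nested))
    log_bf["BIC"].append((accuracy(X_full, y, sigma_e) - 0.5 * p_full * np.log(N))
                         - (accuracy(X_nested, y, sigma_e) - 0.5 * p_nested * np.log(N)))

for name, vals in log_bf.items():
    print(f"{name}: mean log Bayes factor (full vs nested) = {np.mean(vals):.2f}")
```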
Fig. 2 shows results for data drawn from the full model. The figure plots the log Bayes factors (differences in log model evidence) at various values of SNR, where each point in each curve was averaged over 1000 simulated data sets. At low SNRs, experimental effects should be impossible to detect. This is reflected in the Free Energy log Bayes factor, which correctly asymptotes to a value of zero, indicating that neither model is preferred. In this regime, however, BIC and, to a lesser extent, AIC both incorrectly favour the nested model. The error bars on the plots (not shown) are extremely tight in this regime, being ±0.0001, ±0.09 and ±0.35 for SNRs of 0.0025, 0.029 and 0.055 respectively (averaged over the three criteria). This means we can be highly confident that F_L is unbiased but that AIC and BIC are biased towards the nested model.

The above procedure was then repeated, but this time generating data from the nested model. The results are shown in Fig. 3 (note the broader range of SNRs plotted). In the low SNR regime, model comparison should again be impossible. This is correctly reflected in the F_L criterion, with a log Bayes factor approaching zero, but not so in the AIC or BIC criteria.

Finally, we examined the dependence of model comparison on the number of data points, N. We varied N over 20 values between 32 and 512, with 1000 replications at each value, using SNR = 0.5 (results were qualitatively similar for other SNRs). The results are shown in Fig. 4 for data generated from the full model. As expected, Bayes factors increase with the number of data points. The free energy, AIC and AICc show very similar performance, with F_L being slightly better at low N and AIC/AICc at high N. The BIC criterion is biased towards the nested model. Fig. 5 shows the results for data generated from the nested model. The Bayes factors from the free energy and BIC increase with the number of data points, whereas this is not the case for AIC and AICc. We see that AIC and AICc are equivalent for large sample sizes; for small sample sizes AICc pays a larger parameter penalty. This is beneficial when the nested model is true (Fig. 5) but not when the full model is true (Fig. 4). Overall, we do not see a good reason for favouring AICc over AIC and so exclude it from subsequent model comparisons.

Theory (Attias, 1999) tells us that BIC should converge to the Free Energy for large sample sizes. However, this is only the case for flat priors over parameters and if the variational posterior is correct. As we have linear models, the last requirement is met, but the prior over parameters is Gaussian rather than flat. A data set comprising 512 points is about the maximum one could hope to get from a single session of fMRI scanning (approximately 17 min with a TR of 2 s). We therefore conclude that for neuroimaging applications BIC and Free Energy are likely to give different results.

DCM for fMRI

We now compare the model comparison criteria using DCM for fMRI. We generate data using synthetic DCMs with known parameter values. However, to ensure the data are realistic, we use parameter values that were estimated from neuroimaging data. These data derive from a previously published study on the cortical dynamics of intelligible speech (Leff et al., 2008); we used data from a single representative subject. This study applied DCM for fMRI to investigate activity among three key multimodal brain regions: the left posterior and anterior superior temporal sulci (subsequently referred to as regions P and A respectively) and the pars orbitalis of the inferior frontal gyrus (region F). The aim of the study was to see how connections among regions depended on whether the auditory input was intelligible speech or time-reversed speech. Full details of the experimental paradigm and imaging parameters are available in (Leff et al., 2008). The time series which were modelled in this study comprise N = 488 data points in each of three brain regions.

We focus on just two of the models considered by Leff et al. (2008). The first is a ‘nested’ model, which has full intrinsic connectivity, with auditory input, u_aud, entering region P, and a modulatory connection from region P to F (this allows region F to be differentially responsive to intelligible versus time-reversed speech). We also define a ‘full’ model which is identical but has an additional modulatory connection from region P to A (b_AP; see below). The two networks are shown in Fig. 6. The two models differ in only a single connection, and we chose these very similar models to make model comparison as challenging as possible. Mathematically, the neurodynamics evolve according to

(34) $\begin{bmatrix} \dot{z}_P \\ \dot{z}_F \\ \dot{z}_A \end{bmatrix} = \left( \begin{bmatrix} a_{PP} & a_{PF} & a_{PA} \\ a_{FP} & a_{FF} & a_{FA} \\ a_{AP} & a_{AF} & a_{AA} \end{bmatrix} + u_{int} \begin{bmatrix} 0 & 0 & 0 \\ b_{FP} & 0 & 0 \\ b_{AP} & 0 & 0 \end{bmatrix} \right) \begin{bmatrix} z_P \\ z_F \\ z_A \end{bmatrix} + u_{aud} \begin{bmatrix} c_P \\ 0 \\ 0 \end{bmatrix}$

where u_aud is a train of auditory input spikes, u_int indicates whether the input is intelligible (Leff et al., 2008), a_AF denotes the value of the intrinsic connection from region F to A, b_FP and b_AP are the strengths of the two modulatory connections, and c_P is the strength of the input connection. For the nested DCM we have b_AP = 0.
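For concreteness, the full and nested models of Eq. (34) differ only in whether the modulatory entry b_AP is free or fixed at zero. A minimal sketch of the two specifications, using illustrative (not estimated) parameter values:

```python
import numpy as np

regions = ["P", "F", "A"]

# Intrinsic connections A: full intrinsic connectivity in both models (illustrative values).
A = np.array([[-1.0,  0.3,  0.3],
              [ 0.3, -1.0,  0.3],
              [ 0.3,  0.3, -1.0]])

# Modulatory connections B for the intelligibility input u_int (row = target, column = source).
B_full = np.array([[0.0, 0.0, 0.0],
                   [0.4, 0.0, 0.0],    # b_FP: u_int modulates the P -> F connection
                   [0.4, 0.0, 0.0]])   # b_AP: u_int modulates the P -> A connection (full model only)
B_nested = B_full.copy()
B_nested[2, 0] = 0.0                   # nested model: b_AP fixed at zero

# Auditory input u_aud enters region P only.
C = np.array([[1.0], [0.0], [0.0]])

diff = [f"{regions[j]} -> {regions[i]}" for i, j in zip(*np.nonzero(B_full != B_nested))]
print("modulatory connections present only in the full model:", diff)
```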
We first generated data sets from the full model over a range of SNRs, as follows. To best reflect the empirical fMRI data, all parameters other than the modulatory parameters were held constant. For each simulated data set the modulatory parameters were first drawn from their prior densities (see Eq. (5)). Additionally, the modulatory parameters were then constrained to be positive (by taking the absolute value) so that modulatory effects would be facilitating. Synthetic fMRI data were then generated by integrating the neurodynamic and hemodynamic equations and adding observation noise to obtain the target SNR. The SNR was defined in the same way as for the linear models, but with the signal standard deviation, σ_s, averaged over the three predicted time series (one for each brain region). The observed fMRI data have a value of SNR = 0.2. We then fitted both full and nested models to each simulated data set and estimated Bayes factors using AIC, BIC and F_L.

Fig. 7 shows results for data drawn from the full model. The figure plots the log Bayes factors (differences in log model evidence) at various values of SNR, where each point in each curve was averaged over 50 simulated data sets. For these DCM simulations, the averaging was implemented using the median operator (rather than the mean) as the results were more variable than for the GLM case. The curves in Fig. 7 show that only the Free Energy criterion is able to correctly identify the full model.

The above procedure was then repeated, but this time generating data from the nested model. Again, each point in each curve is the median value over 50 simulated data sets. The results are shown in Fig. 8 (note the broader range of SNRs plotted). The results on data from the nested model are very similar to those for the GLM case (compare Figs. 3 and 8). The results for data from the full model, however, are not (compare Figs. 2 and 7), as AIC and BIC are unable to correctly identify the full model even at high SNR.

In order to find out why this is the case, we examined DCMs fitted to data at SNR = 2 and looked at the relative contributions to the model evidence, as described in the Decompositions section. For this high-SNR scenario we found, slightly to our surprise, that the full DCMs were only slightly more accurate than the nested DCMs. Unsurprisingly, this increase in accuracy was realised in region A, which receives modulatory input in the full but not in the nested model (see Fig. 6). However, the main quantity driving the difference in Free Energy between full and nested DCMs was not the accuracy but rather the complexity. It turns out that the nested DCMs are able to produce a reasonable data fit by using a very large value for the intrinsic connection a_AF (from region F to A). This connection value (typically 1.5) was about 5 times bigger than the value for a full DCM (typically 0.3). This makes sense because, in the nested model, the connection from P to F is modulated by intelligibility, and by facilitating the intrinsic connection from F to A this ‘modulatory signal’ is passed on to region A. Since this modulation is of an additive nature, it crudely mimics a direct modulation of the P to A connection. However, such a strong intrinsic connection from F to A is a priori unlikely (the prior is a zero-mean Gaussian with standard deviation σ_cross = 0.5; see Eq. (4)). The nested models are therefore heavily penalised for having such unlikely parameter values (being three standard deviations away from their prior means). Only the Free Energy criterion is sensitive to such subtleties, because AIC and BIC pay the same penalty for each parameter regardless of its magnitude.
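To see why such a value is so heavily penalised, it helps to evaluate the quadratic part of the complexity term (Eq. (21)) for this single parameter. Using the round figures quoted above (a posterior value of about 1.5, a prior mean of essentially zero and a prior standard deviation of 0.5), the contribution is roughly

$\tfrac{1}{2}\,\frac{(1.5 - 0)^2}{0.5^2} = \tfrac{1}{2}\times 3^2 = 4.5\ \text{nats}$

which on its own exceeds the log Bayes factor threshold of three discussed above. AIC and BIC, by contrast, charge a fixed penalty per parameter (one nat for AIC, Eq. (29), and ½ log N nats for BIC, Eq. (28)), independent of how far the estimate lies from its prior mean.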
As mentioned above, the empirical SNR for these data is SNR = 0.2, which is very low. Fitting the full and nested DCMs to these data yielded a Free Energy difference of only 0.11 (in favour of the full DCM). This difference is negligible, and points to the difficulty of model inference for very similar models at low SNR (as exemplified by Figs. 7 and 8). In this regime it may be a better idea to make inferences over families of models (Penny et al., 2010) and to look for consistent differences over a group of subjects (Stephan et al., 2009).

Discussion

We have described a simulation study which compared the relative merits of the AIC, BIC and Free Energy model selection criteria. Differences in performance were examined in the context of GLMs and DCMs; we found that the Free Energy has the best model selection ability and recommend that it be used for the comparison of DCMs. Similar conclusions have been reached in earlier work comparing Free Energy with BIC in the context of non-Gaussian autoregressive modelling (Roberts and Penny, 2002) and Hidden Markov Modelling (Valente and Wellekens, 2004).

The GLM simulation results showed that, at low SNR, AIC and BIC incorrectly selected nested models when data were generated by full models. At higher SNR, however, this bias disappeared and AIC/BIC showed increased sensitivity. We also investigated a corrected AIC criterion, but this showed no benefit over the standard AIC measure.

The DCM simulation results showed that only the Free Energy was able to correctly detect that data had been generated from the full model. By decomposing the Free Energy difference into contributions from different regions and parameters, we found that this ability was mainly due to penalising the nested model for having a very large, and a priori unlikely, intrinsic connection from brain region F to A. Because AIC and BIC use the same complexity penalty for every parameter, and one that is not matched to prior expectations, they lack the sensitivity that is required, in this case, to infer that the data were drawn from the full model. We emphasise that this will not always be the case, and AIC/BIC can in general be sensitive to ‘full model’ effects in DCMs. This is demonstrated, for example, in our previous work (Penny et al., 2004). However, if prior information about parameter values is available then it should be used, and it can be used to good effect in the Free Energy criterion. It may also be argued that, in the application in this paper, AIC and BIC implicitly use prior information, in that the accuracy term is computed at the maximum posterior value; being a posterior estimate, this is naturally constrained by the prior. To avoid this one would have to implement a separate Maximum Likelihood optimisation. Given this, it seems consistent to also use prior information when approximating the evidence.

According to conventions in Bayesian statistics (Kass and Raftery, 1995), and as stated above, models can be considered clearly distinguishable once the log Bayes factor exceeds three. The simulation results for both GLMs and DCMs show smaller Bayes factors when the true model is nested rather than full. This is particularly pronounced for the (challengingly similar) DCMs examined in this paper, for which the Free Energy only achieves a log Bayes factor of three at an SNR of 10.
In such cases, modellers and imaging neuroscientists should appeal to a second Occam principle (Mackay, 2003): not the numerical one embedded in the equation for the Free Energy, but a conceptual one, namely that when two models cannot be clearly distinguished one should prefer the simpler one.

In previous work (Penny et al., 2004) we advocated the combined use of the AIC and BIC criteria for the comparison of DCMs. This was motivated by a concern about how Free Energy model inference depends on the chosen values of the prior means and variances (see the earlier section on priors). Specifically, the values σ_self, σ_cross and σ_s implicitly set the penalty paid for intrinsic, modulatory and input parameters (as governed by Eq. (21) via the overall prior covariance matrix C_θ). This motivates the future application of an Empirical Bayes approach (Carlin and Louis, 2000) which would estimate these variance parameters from data. This would effectively perform a search in the continuous space of prior variances instead of the discrete space (e.g., nested versus full) examined in this paper. Such an approach can be implemented within the new framework of post-hoc model selection (Friston and Penny, 2011).

                Author and article information

Journal: Neuroimage (Academic Press)
ISSN: 1053-8119 (print); 1095-9572 (electronic)
02 April 2012
Volume 60, Issues 2-4, pages 1194-1204
                Affiliations
[a] Mechatronics School, Bl. M8-108 Facultad de Minas, Universidad Nacional de Colombia, Medellín, Colombia
[b] Wellcome Trust Centre for Neuroimaging, University College, London WC1N 3BG, UK
Author notes
[*] Corresponding author. jodlopezhi@unal.edu.co
Article
Publisher ID: YNIMG9142
DOI: 10.1016/j.neuroimage.2012.01.077
PMC: 3334829
PMID: 22289800
176a5f61-377c-48d2-a13e-360a31896f89
                © 2012 Elsevier Inc.

                This document may be redistributed and reused, subject to certain conditions.

History
Received: 15 September 2011
Revised: 6 December 2011
Accepted: 8 January 2012
Categories
Technical Note
Neurosciences
Keywords: Bayesian model averaging, Metropolis algorithm, MEG inverse problem, co-registration
