Model Fitting in the Presence of Nuisance Parameters

A basic problem in any statistical modeling of a scientific dataset is to provide the 'best' fit. Such inference is generally based on the empirical distribution function when the underlying process generating the data is not reasonably known. A computationally intensive resampling method, the bootstrap, is presented for estimating the null distributions of various goodness-of-fit test statistics when the underlying process is only partially known. These results hold not only in the univariate case but also in the multivariate setting.


INTRODUCTION
A vast range of statistical problems arise in modern astronomical and space sciences research, particularly due to the flood of data produced by astronomical surveys at many wavebands. Equally important is the great increase in the complexity of the data sets: some are tabular with many dimensions, some are time series with complex temporal behaviors, and others are linked to highly nonlinear astrophysical models. While the scientific promise is tremendous, it depends critically on the ability to extract useful knowledge from the data. A basic problem in any statistical modeling of a scientific dataset is to provide the most parsimonious 'best' fit. Questions such as the following often arise in astrophysics.
• Is the underlying nature of a quasar spectrum a nonthermal power law or a combination of black bodies?
• Is the topology of a clustered multivariate dataset best modeled as several distinct clusters, as a smooth distribution with voids, or as a stochastic hierarchy of embedded structures?
• Are the fluctuations in the cosmic microwave background best fit by Big Bang models with dark energy or with quintessence?
• Are there interesting correlations among the properties of objects in any given class (e.g. the Fundamental Plane of elliptical galaxies), and what are the optimal analytical expressions of such correlations?
• How do we characterize blips embedded in larger structures?

These issues arise when data are used to repudiate or support astrophysical theories but the underlying processes generating the data are not confidently known.
Standard penalized-likelihood methods such as the Bayesian Information Criterion (BIC) and the Akaike Information Criterion (AIC) can be used, but they do not necessarily answer the problem faced by the physical scientist: the AIC is optimized for the reduction of prediction error, while the BIC is optimized for maximizing the probability of correct model selection. Babu and C. R. Rao have developed nonparametric resampling methods for inference when the data come from an unknown distribution that may or may not belong to a specified family ([2, 3]). The asymptotic null distributions of statistics based on the empirical distribution function, such as the Kolmogorov-Smirnov statistic, are not fixed when nuisance parameters affecting the distribution are present. Computationally intensive bootstrap methods for estimating these null distributions, which hold not only in the univariate case but also in the multivariate setting, are discussed here.

STATISTICS BASED ON THE EMPIRICAL DISTRIBUTION FUNCTION
Nonparametric goodness-of-fit tests are generally based on the empirical distribution function (see Figure 1). The problem of goodness-of-fit testing when parameters are estimated is presented. These results hold not only in the univariate case but also in the multivariate setting. These ideas are taken a step further to develop nonparametric resampling methods for inference when the data come from an unknown distribution that may or may not belong to a specified family of distributions.
Many nonparametric goodness-of-fit tests are based on the empirical distribution function $F_n$. For example, with a hypothesized continuous distribution function $F$,

Kolmogorov-Smirnov: $D_n = \sup_x \, |F_n(x) - F(x)|$,

Cramér-von Mises: $W_n^2 = n \int \big(F_n(x) - F(x)\big)^2 \, dF(x)$.

All of these are distribution free. Goodness-of-fit tests such as Kolmogorov-Smirnov can be used to generate confidence bands for the underlying probability distribution (Figure 2). This figure was generated using the public domain software R.
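As a concrete illustration (not taken from the paper), the Kolmogorov-Smirnov statistic and a distribution-free confidence band for the underlying distribution can be computed in a few lines of Python; the value 1.36/sqrt(n) below is the standard large-sample 5% critical point of the Kolmogorov distribution.

```python
import numpy as np

def edf(sample, t):
    """Empirical distribution function F_n of `sample` evaluated at points t."""
    return np.searchsorted(np.sort(sample), t, side="right") / len(sample)

def ks_statistic(sample, cdf):
    """D_n = sup_x |F_n(x) - F(x)| for a fully specified continuous cdf."""
    n = len(sample)
    u = np.sort(cdf(sample))                 # probability integral transform
    hi = np.arange(1, n + 1) / n - u         # F_n just at/above each point
    lo = u - np.arange(0, n) / n             # F_n just below each point
    return max(hi.max(), lo.max())

# A distribution-free 95% confidence band: F_n(t) +/- d_alpha, where d_alpha
# is the asymptotic 5% critical value of D_n.
rng = np.random.default_rng(0)
x = rng.uniform(size=200)
t = np.linspace(0.0, 1.0, 101)
d_alpha = 1.36 / np.sqrt(len(x))
band_lo = np.clip(edf(x, t) - d_alpha, 0.0, 1.0)
band_hi = np.clip(edf(x, t) + d_alpha, 0.0, 1.0)
```

Inverting the test in this way is exactly how bands like those in Figure 2 are produced.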
In the presence of nuisance parameters, the tests are generally constructed by first estimating the parameters. In such a case the asymptotic null distribution of the test statistic may depend in a complex way on the unknown parameters. Asymptotic distributions of test statistics based on the empirical distribution function, when parameters are estimated, have been extensively studied in [4], [5], [6] and others.
In the multivariate case the Kolmogorov-Smirnov statistic is no longer distribution free. This is the case even when all the parameters are known. A simple example illustrating this, based on a bivariate family indexed by a dependence parameter $a$, was given in [7]: the distributions, and hence the critical values, of the K-S statistic differ for different values of $a$.
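The specific family used in [7] is not reproduced here, but the phenomenon is easy to demonstrate by simulation with any bivariate family indexed by a dependence parameter. The sketch below uses the Farlie-Gumbel-Morgenstern family $F_a(x,y) = xy[1 + a(1-x)(1-y)]$ purely as an assumed illustration (it is not necessarily the example of [7]) and compares the simulated null distribution of the bivariate K-S statistic for two values of $a$.

```python
import numpy as np

def fgm_cdf(x, y, a):
    """Farlie-Gumbel-Morgenstern CDF on the unit square, -1 <= a <= 1."""
    return x * y * (1.0 + a * (1.0 - x) * (1.0 - y))

def fgm_sample(n, a, rng):
    """Draw (X, Y) from the FGM distribution by conditional inversion:
    given X = x, solve u = y + b*y*(1 - y) with b = a*(1 - 2x)."""
    x = rng.uniform(size=n)
    u = rng.uniform(size=n)
    b = a * (1.0 - 2.0 * x)
    safe_b = np.where(np.abs(b) < 1e-12, 1.0, b)   # avoid division by ~0
    y_root = (1.0 + b - np.sqrt((1.0 + b) ** 2 - 4.0 * b * u)) / (2.0 * safe_b)
    return x, np.where(np.abs(b) < 1e-12, u, y_root)

def ks_bivariate(x, y, a, grid=41):
    """Approximate sup |F_n(s,t) - F_a(s,t)| over a grid on the unit square."""
    t = np.linspace(0.0, 1.0, grid)
    s, tt = np.meshgrid(t, t)
    Fn = np.mean((x[None, None, :] <= s[..., None]) &
                 (y[None, None, :] <= tt[..., None]), axis=-1)
    return np.max(np.abs(Fn - fgm_cdf(s, tt, a)))

def null_distribution(a, n=100, B=200, seed=0):
    rng = np.random.default_rng(seed)
    return np.array([ks_bivariate(*fgm_sample(n, a, rng), a) for _ in range(B)])

d_indep = null_distribution(a=0.0)   # independent uniform marginals
d_dep = null_distribution(a=0.9)     # strong positive dependence
# The two simulated null distributions (e.g. their 95% quantiles) differ,
# so no single table of K-S critical values serves both cases.
```

Note that both distributions have uniform marginals, so the dependence structure alone is what shifts the null distribution of the statistic.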

THE BOOTSTRAP
The bootstrap resampling scheme helps in obtaining critical values in the testing context, even when the parameters are estimated. Let $\{F(\cdot;\theta) : \theta \in \Theta\}$ be a family of continuous distribution functions, where $\Theta$ is an open region in a $p$-dimensional Euclidean space. Let $X_1, \ldots, X_n$ be i.i.d. random variables from a distribution function $H$. Statistics based on empirical measures are considered to test $H = F(\cdot;\theta)$ for some $\theta = \theta_0$, or when $\theta$ is only partially specified. Statistics such as Kolmogorov-Smirnov and Cramér-von Mises, when $\theta$ is estimated by $\hat\theta_n = \theta_n(X_1, \ldots, X_n)$, can be viewed as continuous functionals of the process

$Y_n(x) = \sqrt{n}\,\big(F_n(x) - F(x; \hat\theta_n)\big),$

where $F_n$ denotes the empirical distribution function of $X_1, \ldots, X_n$. In the case of the Gaussian family with $\theta = (\mu, \sigma^2)$, we can use $\hat\theta_n = (\bar{X}_n, s_n^2)$, where

$\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i, \qquad s_n^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X}_n)^2.$

Two bootstrap procedures are described here, for the cases where the parameters are partially or completely estimated. The procedure also helps in the computation of power under contiguous alternatives. To describe the bootstrap procedures, let $X_1^*, \ldots, X_n^*$ be i.i.d. random variables from $\hat{F}_n$ and $\theta_n^* = \theta_n(X_1^*, \ldots, X_n^*)$, where $\hat{F}_n$ is an estimator of the distribution function $H$ based on the sample $X_1, \ldots, X_n$. The resampling method is called the nonparametric bootstrap if $\hat{F}_n = F_n$, and the parametric bootstrap if $\hat{F}_n = F(\cdot; \hat\theta_n)$. In the Gaussian example, $\theta_n^* = (\bar{X}_n^*, s_n^{*2})$, where $\bar{X}_n^*$ and $s_n^{*2}$ are the mean and variance of the bootstrap sample. Under some regularity conditions both the parametric and nonparametric procedures lead to correct asymptotic levels. In fact, under very general conditions, the sample process $Y_n$ above and the parametric bootstrap process $Y_n^P$ given by

$Y_n^P(x) = \sqrt{n}\,\big(F_n^*(x) - F(x; \theta_n^*)\big),$

where $F_n^*$ is the empirical distribution function of $X_1^*, \ldots, X_n^*$, have the same limiting distribution. In practice, one takes a bootstrap sample $X_1^*, \ldots, X_n^*$ from $F(\cdot; \hat\theta_n)$ and computes $\sup_x \sqrt{n}\,|F_n^*(x) - F(x; \theta_n^*)|$. This procedure is repeated a large number of times. The histogram of the resulting values approximates the distribution of $\sup_x \sqrt{n}\,|F_n(x) - F(x; \hat\theta_n)|$. Hence these values can be used to obtain the critical values for testing or to get confidence bands.
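A minimal Python sketch of the parametric bootstrap recipe for the Gaussian example follows (the function names are my own). Each bootstrap replication re-estimates the parameters from the bootstrap sample, exactly as the estimated parameters are computed from the data.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def ks_gauss(x):
    """sup_x sqrt(n)|F_n(x) - F(x; theta_hat)| with theta_hat = (mean, sd)."""
    n = len(x)
    xs = np.sort(x)
    mu, s = xs.mean(), xs.std()              # MLE-style estimates (1/n variance)
    F = norm.cdf(xs, loc=mu, scale=s)
    hi = np.arange(1, n + 1) / n - F         # F_n just at/above each point
    lo = F - np.arange(0, n) / n             # F_n just below each point
    return np.sqrt(n) * max(hi.max(), lo.max())

def parametric_bootstrap_crit(x, B=1000, alpha=0.05, rng=rng):
    """Critical value: (1 - alpha) quantile of B parametric bootstrap stats."""
    n = len(x)
    mu, s = x.mean(), x.std()
    # Draw X*_1..X*_n from F(.; theta_hat) and recompute the statistic,
    # re-estimating the parameters from each bootstrap sample.
    stats = np.array([ks_gauss(rng.normal(mu, s, n)) for _ in range(B)])
    return np.quantile(stats, 1.0 - alpha)

x = rng.normal(10.0, 2.0, 200)               # data really are Gaussian here
crit = parametric_bootstrap_crit(x)
reject = ks_gauss(x) > crit                  # level-alpha test of normality
```

With Gaussian data, `reject` should come out `True` in only about a fraction alpha of repetitions, which is the sense in which the procedure attains the correct asymptotic level.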
In the case of the nonparametric bootstrap, a bias correction is needed. Under very general regularity conditions, the process $Y_n$ and the process $Y_n^N$ given by

$Y_n^N(x) = \sqrt{n}\,\big(F_n^*(x) - F(x; \theta_n^*)\big) - \sqrt{n}\,\big(F_n(x) - F(x; \hat\theta_n)\big),$

where now $X_1^*, \ldots, X_n^*$ are drawn from $F_n$, have the same limiting distribution. Thus the bootstrap method consistently estimates the null distributions of various goodness-of-fit tests, so both the parametric and nonparametric procedures lead to correct asymptotic levels. The results also hold in the multivariate setting. This provides an estimate of the distance between the true distribution and the family of distributions under consideration.
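The bias-corrected nonparametric version can be sketched along the same lines (again with hypothetical function names, for the Gaussian family): resample from the empirical distribution, and subtract the observed-sample discrepancy pointwise before taking the supremum.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

def edf(sample, t):
    """Empirical distribution function of `sample` evaluated at points t."""
    return np.searchsorted(np.sort(sample), t, side="right") / len(sample)

def np_bootstrap_stat(x, xstar, t):
    """sup_t |Y_n^N(t)|: nonparametric bootstrap process with bias correction."""
    n = len(x)
    mu, s = x.mean(), x.std()
    mus, ss = xstar.mean(), xstar.std()
    bias = np.sqrt(n) * (edf(x, t) - norm.cdf(t, mu, s))
    ystar = np.sqrt(n) * (edf(xstar, t) - norm.cdf(t, mus, ss))
    return np.max(np.abs(ystar - bias))

x = rng.normal(0.0, 1.0, 150)
t = np.linspace(-4.0, 4.0, 400)                      # evaluation grid
stats = np.array([np_bootstrap_stat(x, rng.choice(x, size=len(x), replace=True), t)
                  for _ in range(500)])
crit = np.quantile(stats, 0.95)                      # bias-corrected critical value
```

Unlike the parametric version, the bootstrap samples here come from $F_n$ itself, which is what makes the subtracted bias term necessary.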
Similar conclusions can be drawn for von Mises-type distances, for example

$W_n^2 = n \int \big(F_n(x) - F(x; \hat\theta_n)\big)^2 \, dF(x; \hat\theta_n).$
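For completeness, a Cramér-von Mises statistic with estimated Gaussian parameters can be evaluated via the standard computing formula (a sketch with an assumed function name; the same bootstrap recipes then apply to it unchanged):

```python
import numpy as np
from scipy.stats import norm

def cvm_gauss(x):
    """W_n^2 = 1/(12n) + sum_i (u_(i) - (2i-1)/(2n))^2,
    with u_i = Phi((x_i - mean)/sd) the probability integral transform."""
    n = len(x)
    u = np.sort(norm.cdf(x, loc=x.mean(), scale=x.std()))
    i = np.arange(1, n + 1)
    return 1.0 / (12.0 * n) + np.sum((u - (2.0 * i - 1.0) / (2.0 * n)) ** 2)

w2 = cvm_gauss(np.random.default_rng(0).normal(5.0, 2.0, 100))
```

The computing formula avoids numerical integration by exploiting that the integrand is piecewise quadratic between the order statistics.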

FIGURE 1: Empirical distribution function, also known as a step function.