      Is Open Access

      Clustering for multivariate continuous and discrete longitudinal data

      Preprint

          Abstract

          Multiple outcomes, both continuous and discrete, are routinely gathered on subjects in longitudinal studies and during routine clinical follow-up in general. To motivate our work, we consider a longitudinal study on patients with primary biliary cirrhosis (PBC) with a continuous bilirubin level, a discrete platelet count and a dichotomous indication of blood vessel malformations as examples of such longitudinal outcomes. An apparent requirement is to use all the outcome values to classify the subjects into groups (e.g., groups of subjects with a similar prognosis in a clinical setting). In recent years, numerous approaches have been suggested for classification based on longitudinal (or otherwise correlated) outcomes, targeting not only traditional areas like biostatistics, but also rapidly evolving bioinformatics and many others. However, most available approaches consider only continuous outcomes as a basis for classification, or, if noncontinuous outcomes are considered, then not in combination with other outcomes of a different nature. Here, we propose a statistical method for clustering (classification) of subjects into a prespecified number of groups with a priori unknown characteristics on the basis of repeated measurements of several longitudinal outcomes of a different nature. This method relies on a multivariate extension of the classical generalized linear mixed model where a mixture distribution is additionally assumed for random effects. We base the inference on a Bayesian specification of the model and simulation-based Markov chain Monte Carlo methodology. To apply the method in practice, we have prepared ready-to-use software for use in R (http://www.R-project.org). We also discuss the evaluation of uncertainty in the classification and the use of a recently proposed methodology for model comparison (in our case, selection of the number of clusters) based on the penalized posterior deviance proposed by Plummer [Biostatistics 9 (2008) 523-539].
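To make the model structure in the abstract more concrete, the following is a minimal generative sketch in Python, not the authors' R software and not their exact specification. It assumes a two-component mixture of independent normal distributions for the subject-specific random intercepts, a common linear time effect, and one outcome of each type (Gaussian with identity link, Poisson with log link, Bernoulli with logit link). All names and numeric values (`WEIGHTS`, `MEANS`, `SDS`, `SLOPES`, `RESID_SD`, `simulate_subject`) are hypothetical and chosen only for illustration.

```python
import math
import random

# Hypothetical two-component mixture for the random-intercept vector
# b_i = (b_bilirubin, b_platelets, b_vessel). Values are illustrative only.
WEIGHTS = [0.6, 0.4]                          # mixture weights
MEANS = [(0.5, 5.5, -2.0), (1.5, 5.0, 0.0)]   # component means
SDS = [(0.3, 0.2, 0.5), (0.4, 0.3, 0.5)]      # assumed diagonal covariance
SLOPES = (0.05, -0.02, 0.10)                  # fixed time effects (assumed common)
RESID_SD = 0.3                                # residual SD of the Gaussian outcome


def sample_poisson(rate, rng):
    """Draw a Poisson variate by CDF inversion (adequate for moderate rates)."""
    u = rng.random()
    k = 0
    p = math.exp(-rate)
    cdf = p
    # Stop when the CDF passes u; the p > 0 guard avoids an endless loop
    # if rounding keeps the accumulated CDF just below u.
    while u > cdf and p > 0.0:
        k += 1
        p *= rate / k
        cdf += p
    return k


def simulate_subject(times, rng):
    """Simulate one subject: latent cluster, random effects, three outcomes."""
    # Latent cluster label drawn from the mixture weights.
    k = rng.choices(range(len(WEIGHTS)), weights=WEIGHTS)[0]
    # Random intercepts drawn from the chosen mixture component.
    b = [rng.gauss(m, s) for m, s in zip(MEANS[k], SDS[k])]
    rows = []
    for t in times:
        # One linear predictor per outcome, sharing the subject's random effects.
        eta = [b_j + slope * t for b_j, slope in zip(b, SLOPES)]
        bilirubin = rng.gauss(eta[0], RESID_SD)              # identity link
        platelets = sample_poisson(math.exp(eta[1]), rng)    # log link
        prob = 1.0 / (1.0 + math.exp(-eta[2]))               # logit link
        vessel = 1 if rng.random() < prob else 0
        rows.append((t, bilirubin, platelets, vessel))
    return k, rows


rng = random.Random(1)
data = [simulate_subject([0, 6, 12, 24], rng) for _ in range(5)]
```

In this sketch the cluster label `k` is known because the data are simulated; in the setting the abstract describes it is unobserved, and the clustering task is to infer it from the three outcome series jointly.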


                Author and article information

                Date: 2013-04-16
                DOI: 10.1214/12-AOAS580
                arXiv: 1304.4448

                http://arxiv.org/licenses/nonexclusive-distrib/1.0/

                History
                Custom metadata
                IMS-AOAS-AOAS580
                Annals of Applied Statistics 2013, Vol. 7, No. 1, 177-200
                Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org) at http://dx.doi.org/10.1214/12-AOAS580
                stat.AP
                vtex

                Applications