65
views
0
recommends
+1 Recommend
0 collections
    0
    shares
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      Cox process representation and inference for stochastic reaction-diffusion processes

      Preprint
      , ,

      Read this article at

      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          Complex behaviour in many systems arises from the stochastic interactions of spatially distributed particles or agents. Stochastic reaction-diffusion processes are widely used to model such behaviour in disciplines ranging from biology to the social sciences, yet they are notoriously difficult to simulate and calibrate to observational data. Here we use ideas from statistical physics and machine learning to provide a solution to the inverse problem of learning a stochastic reaction diffusion process to data. Our solution relies on a novel, non-trivial connection between stochastic reaction-diffusion processes and spatio-temporal Cox processes, a well-studied class of models from computational statistics. We develop an efficient and flexible algorithm which shows excellent accuracy on numeric and real data examples from systems biology and epidemiology. By using ideas from multiple disciplines, our approach provides both new and fundamental insights into spatio-temporal stochastic systems, and a practical solution to a long-standing problem in computational modelling.

          Related collections

          Most cited references25

          • Record: found
          • Abstract: found
          • Article: found
          Is Open Access

          The Inferelator: an algorithm for learning parsimonious regulatory networks from systems-biology data sets de novo

          Background Distilling regulatory networks from large genomic, proteomic and expression data sets is one of the most important mathematical problems in biology today. The development of accurate models of global regulatory networks is key to our understanding of a cell's dynamic behavior and its response to internal and external stimuli. Methods for inferring and modeling regulatory networks must strike a balance between model complexity (a model must be sufficiently complex to describe the system accurately) and the limitations of the available data (in spite of dramatic advances in our ability to measure mRNA and protein levels in cells, nearly all biologic systems are under-determined with respect to the problem of regulatory network inference). A major challenge is to distill, from large genome-wide data sets, a reduced set of factors describing the behavior of the system. The number of potential regulators, restricted here to transcription factors (TFs) and environmental factors, is often on the same order as the number of observations in current genome-wide expression data sets. Statistical methods offer the ability to enforce parsimonious selection of the most influential potential predictors of each gene's state. A further challenge in regulatory network modeling is the complexity of accounting for TF interactions and the interactions of TFs with environmental factors (for example, it is known that many transcription regulators form heterodimers, or are structurally altered by an environmental stimulus such as light, thereby altering their regulatory influence on certain genes). A third challenge and practical consideration in network inference is that biology data sets are often heterogeneous mixes of equilibrium and kinetic (time series) measurements; both types of measurements can provide important supporting evidence for a given regulatory model if they are analyzed simultaneously. Last, but not least, is the challenge resulting from the fact that data-derived network models be predictive and not just descriptive; can one predict the system-wide response in differing genetic backgrounds, or when the system is confronted with novel stimulatory factors or novel combinations of perturbations? A significant body of work has been devoted to the modeling and learning of regulatory networks [1-3]. In these studies regulatory interactions and dynamics are modeled with varying degrees of detail and model flexibility and, accordingly, such models can be separated into general classes based on the level of detail with which they model individual regulatory interactions [1,2]. At the highest level of detail lie differential equations and stochastic models, which provide detailed descriptions of regulatory systems and can be used to simulate systems dynamics, but they are computationally demanding and require accurate measurement of a large number of parameters. Hence, these simulations have primarily been carried out for small-scale systems (relative to the full, genome-wide, regulatory circuit for a given organism); often these studies model systems that have been studied in great detail for decades, such as the galactose utilization pathway in yeast and the early development of sea urchin. At the other end of the model complexity spectrum lie Boolean networks [4], which assume that genes are simply on or off, and include standard logic interactions (AND, OR, XOR, and so on). Despite this simplification of regulatory dynamics and interactions, these approaches have the advantages of simplicity, robustness (they can be learned with significantly fewer data), and ease of interpretation [5]. Recent probabilistic approaches to modeling regulatory network on the genome-wide scale use Bayesian networks to model regulatory structure, de novo, at the Boolean level [6-11]. Additive linear or generalized linear models take an intermediate approach, in terms of model complexity and robustness [12-15]. Such models describe each gene's expression level as a weighted sum of the levels of its putative predictors. Inclusion of functions that modify the linear response produced by these additive methods (sometimes referred to as squashing functions) allows some biologically relevant nonlinear processes (for example, promoter saturation) to be modeled. An advantage of linear and generalized linear models is that they draw upon well developed techniques from the field of statistical learning for choosing among several possible models and efficiently fitting the parameters of those models. Learning and/or modeling of regulatory networks can be greatly aided by reducing the dimensionality of the search space before network inference. Two ways to approach this are limiting the number of regulators under consideration and grouping genes that are co-regulated into clusters. In the former case, candidates can be prioritized based on their functional role (for example, limiting the set of potential predictors to include only TFs, and grouping together regulators that are in some way similar). In the latter case, gene expression clustering, or unsupervised learning of gene expression classes, is commonly applied. It is often incorrectly assumed that co-expressed genes correspond to co-regulated genes. However, for the purposes of learning regulatory networks it is desirable to cluster genes on the basis of co-regulation (shared transcriptional control) as opposed to simple co-expression. Furthermore, standard clustering procedures assume that co-regulated genes are co-expressed across all observed experimental conditions. Because genes are often regulated differently under different conditions, this assumption is likely to break down as the quantity and variety of data grow. Biclustering was developed to address better the full complexity of finding co-regulated genes under multifactor control by grouping genes on the basis of coherence under subsets of observed conditions [10,16-22]. We developed an integrated biclustering algorithm, named cMonkey (Reiss DJ, Baliga NS, Bonneau R, unpublished data), which groups genes and conditions into biclusters on the basis of the following: coherence in expression data across subsets of experimental conditions; co-occurrence of putative cis-acting regulatory motifs in the regulatory regions of bicluster members; and the presence of highly connected subgraphs in metabolic [23] and functional association networks [24-26]. Because cMonkey was designed with the goal of identifying putatively co-regulated gene groupings, we use it to 'pre-cluster' genes before learning regulatory influences in the present study. cMonkey identifies relevant conditions in which the genes within a given bicluster are expected to be co-regulated, and the inferred regulatory influences on the genes in each bicluster pertain to (and are fit using) only those conditions within each bicluster. In principle, the algorithm described in this work can be coupled with other biclustering and clustering algorithms. Here we describe an algorithm, the Inferelator, that infers regulatory influences for genes and/or gene clusters from mRNA and/or protein expression levels. The method uses standard regression and model shrinkage (L1 shrinkage) techniques to select parsimonious, predictive models for the expression of a gene or cluster of genes as a function of the levels of TFs, environmental influences, and interactions between these factors [27]. The procedure can simultaneously model equilibrium and time course expression levels, such that both kinetic and equilibrium expression levels may be predicted by the resulting models. Through the explicit inclusion of time and gene knockout information, the method is capable of learning causal relationships. It also includes a novel solution to the problem of encoding interactions between predictors into the regression. We discuss the results from an initial run of this method on a set of microarray observations from the halophilic archaeon Halobacterium NRC-1. Results and discussion The inferred global regulatory network for Halobacterium NRC-1 We applied our method to the Halophilic archaeon Halobacterium NRC-1. The Halobacterium genome contains 2,404 nonredundant genes, of which 124 are annotated to be known or putative TFs [28,29]. The biclustering and network inference procedure were performed on a recently generated data set containing 268 mRNA microarray measurements of this archaeon under a wide range of genetic and environmental perturbations ('Kaur A, Pan M, Meislin M, El-Geweley R, Baliga NS' and 'Whitehead K, Kish A, Pan M, Kaur A, King N, Hohmann L, Diruggiero J, Baliga NS', personal communications), [30,31]. Several TFs do not change significantly in their expression levels in the data set; of the 124 identified TFs, 100 exhibited a significant change in expression levels across the data set, and the remaining 24 TFs were excluded from the set of potential influences (see Materials and methods, below) [32]. Strongly correlated TFs (those with correlation greater than 0.85) were further grouped, yielding 72 regulators (some representing multiple correlated regulators). To these 72 potential regulators were added 10 environmental factors for a total of 82 possible predictors for the 1,934 genes with significant signal in the data set. In addition to this main data set, 24 new experiments (collected after model fitting) were used for independent error estimation subsequent to the network inference procedure. The cMonkey method (Reiss DJ, Baliga NS, Bonneau R, unpublished data) was applied to this data set (original 268 conditions) to bicluster genes and conditions, on the basis of the gene expression data, a network of functional associations, and the occurrence and detection of cis-acting regulatory motifs in bicluster upstream sequences. Biclustering resulted in 300 biclusters covering 1,775 genes. An additional 159 genes, which exhibited significant change relative to the common reference across the data set, were determined by cMonkey to have unique expression patterns and were thus not included in biclusters; these 159 genes were inferred individually. The regulatory network inference procedure was then performed on these 300 biclusters and 159 individual genes, resulting in a network containing 1,431 regulatory influences (network edges) of varying strength. Of these regulatory influences, 495 represent interactions between two TFs or between a TF and an environmental factor. We selected the null model for 21 biclusters (no influences or only weak regulatory influences found, as described in Materials and methods, below), indicating that we are stringently excluding under-determined genes and biclusters from our network model. The ratio of data points to estimated parameters is approximately 67 (one time constant plus three regulatory influences, on average, from 268 conditions). Our data set is not complete with respect to the full physiologic and environmental repertoire for Halobacterium NRC-1, and several TFs have their activity modulated by unobserved factors (for example, post-translational modifications and the binding of unobserved ligands); the regulatory relations for many genes are therefore not visible, given the current data set. Figure 1 shows the resultant network for Halobacterium NRC-1 in Cytoscape, available as a Cytoscape/Gaggle web start [33,34]. An example of the predicted regulation of a single bicluster, bicluster 76 (containing genes involved in the transport of Fe and Mn; Table 1), is shown in Figure 1b. Among the 82 possible regulators, four were selected as the most likely regulators of this bicluster. The learned function of these TFs allows prediction of the bicluster 76 gene expression levels under novel conditions, including genetic perturbations (for example, to predict the expression levels in a kaiC knockout strain, the influence of kaiC can be removed from the equation by setting its weight to zero). We discuss the predicted regulatory model for bicluster 76 further below. We evaluated the ability of the inferred network model to predict the expression state of Halobacterium NRC-1 on a genome-wide basis. For each experimental condition, we made predictions of each bicluster state, based on the levels of regulators and environmental factors, and compared predicted expression values with the corresponding measured state (using root mean square deviation [RMSD] to evaluate the difference, or error, as described under Materials and methods, below). In this way we evaluated the predictive performance of the inferred network both on experiments in the training data set and on the 24 experiments in the independent test set (which we refer to as the newly collected data set). The expression level of a bicluster is predicted from the level of TFs and environmental factors that influence it in the network, at the prior time point (for time course conditions) or the current condition (for steady state conditions). The error estimates for the 300 biclusters and 159 single genes are shown in Figures 2 and 3. For the biclusters, the mean error of 0.37 is significantly smaller than the range of ratios observed in the data (because all biclusters were normalized to have variances of about 1.0 before model fitting), indicating that the overall global expression state is well predicted. Our predictive power on the new data (Figures 2 and 3, right panels) is similar to that on the training data (the mean RMS over the training set is within 1 standard deviation of the mean RMS over the new data), indicating that our procedure is enforcing reasonable parsimony upon the models (using L1 shrinkage coupled with tenfold cross-validation [CV], as described under Materials and methods, below) and accurately estimating the degree to which we can predict the expression levels of biclusters as a function of TF and environmental factor levels. Although the majority of biclusters have new data RMS values well matched by the training set RMS values, there are also nine biclusters (biclusters 1, 37, 77, 82, 99, 137, 161, 165, and 180) with RMS values significantly higher in the new data than in the training data. We were unable to identify any features of these outlying biclusters (coherence of bicluster, bicluster size, variance in and out of sample for the biclusters, and so on) that distinguish them from other biclusters. We also investigated predictive performance for the 159 genes that were not included in biclusters by cMonkey. We found good predictive performance (over the new data as well as over the training data) for approximately half of these genes - a much lower success rate than seen for genes represented by biclusters. There are a number of possible explanations for this diminished ability to predict genes that also elude biclustering. Averaging expression levels over genes that are co-regulated within biclusters can be thought of as signal averaging, and thus single genes are more prone to both systematic and random error than bicluster expression levels. Another possible explanation is that these elusive genes are under the influence of TFs that interact with unobserved factors, such as metabolites. There are also about five conditions that we fail to predict well relative to the other 264 conditions (large RMS values in training and new data; Figures 2 and 3). Not surprisingly, these five conditions are all situated directly after large perturbations in time series, when the system is fluctuating dramatically as it re-establishes stasis. We also performed several tests to determine how well our model formulation and fitting procedure performed compared with three simplified formulations, as described in detail in Additional data file 1. Briefly, these additional tests show that our current formulation for temporal modeling is essential to the performance of this procedure (mean RMSD with no temporal modeling 0.40; significance of comparison with full model P 15 correspond to false-positive rates of less than about 5%, based on control experiments). Of the 292 conditions, 24 had not been collected at the time that the model was fit and were thus not used in the training set. These 24 conditions were used as independent verification of the predictive power of the learned network model. Availability The regulatory network and all data types used in the inference process can be visualized using the data integration and exploration tools Gaggle and Cytoscape, and can be accessed via a Cytoscape java web-start [33]. Alternate data formats are available upon request. Gaggle [55] and Cytoscape [56] are freely available on the web. Inferelator was written using the R statistical programming language [57] and is freely available upon request. Additional data files The following additional data are included with the online version of this article: A Word document describing additional tests of individual components of our model formulation (Additional data file 1). Authors' contributions R.B. conceived and initiated the project, developed and implemented the method and the resultant computer program, and wrote the manuscript. V.T. conceived and initiated the project, developed and implemented the method and the resultant computer program, and wrote the manuscript. D.J.R. assisted in development and implementation, and assisted with writing of the manuscript. P.S. assisted with visualization of the data/networks via the Gaggle. N.B. conceived and initiated the project, provided feedback on the quality of results, and initiated verification of results with further experimentation. M.F. performed experimental verification. L.H. provided guidance at project inception and assisted in writing the manuscript. Supplementary Material Additional File 1 Additional tests of individual components of our model formulation. Click here for file
            Bookmark
            • Record: found
            • Abstract: found
            • Article: not found

            Inherent noise can facilitate coherence in collective swarm motion.

            Among the most striking aspects of the movement of many animal groups are their sudden coherent changes in direction. Recent observations of locusts and starlings have shown that this directional switching is an intrinsic property of their motion. Similar direction switches are seen in self-propelled particle and other models of group motion. Comprehending the factors that determine such switches is key to understanding the movement of these groups. Here, we adopt a coarse-grained approach to the study of directional switching in a self-propelled particle model assuming an underlying one-dimensional Fokker-Planck equation for the mean velocity of the particles. We continue with this assumption in analyzing experimental data on locusts and use a similar systematic Fokker-Planck equation coefficient estimation approach to extract the relevant information for the assumed Fokker-Planck equation underlying that experimental data. In the experiment itself the motion of groups of 5 to 100 locust nymphs was investigated in a homogeneous laboratory environment, helping us to establish the intrinsic dynamics of locust marching bands. We determine the mean time between direction switches as a function of group density for the experimental data and the self-propelled particle model. This systematic approach allows us to identify key differences between the experimental data and the model, revealing that individual locusts appear to increase the randomness of their movements in response to a loss of alignment by the group. We give a quantitative description of how locusts use noise to maintain swarm alignment. We discuss further how properties of individual animal behavior, inferred by using the Fokker-Planck equation coefficient estimation approach, can be implemented in the self-propelled particle model to replicate qualitatively the group level dynamics seen in the experimental data.
              Bookmark
              • Record: found
              • Abstract: found
              • Article: not found

              Perspective: Stochastic algorithms for chemical kinetics.

              We outline our perspective on stochastic chemical kinetics, paying particular attention to numerical simulation algorithms. We first focus on dilute, well-mixed systems, whose description using ordinary differential equations has served as the basis for traditional chemical kinetics for the past 150 years. For such systems, we review the physical and mathematical rationale for a discrete-stochastic approach, and for the approximations that need to be made in order to regain the traditional continuous-deterministic description. We next take note of some of the more promising strategies for dealing stochastically with stiff systems, rare events, and sensitivity analysis. Finally, we review some recent efforts to adapt and extend the discrete-stochastic approach to systems that are not well-mixed. In that currently developing area, we focus mainly on the strategy of subdividing the system into well-mixed subvolumes, and then simulating diffusional transfers of reactant molecules between adjacent subvolumes together with chemical reactions inside the subvolumes.
                Bookmark

                Author and article information

                Journal
                2016-01-08
                Article
                10.1038/ncomms11729
                1601.01972
                1d87e817-2829-41a6-b743-dfedd51242a9

                http://arxiv.org/licenses/nonexclusive-distrib/1.0/

                History
                Custom metadata
                25 pages, 4 figures
                cond-mat.stat-mech math.ST physics.data-an q-bio.QM stat.ML stat.TH

                Condensed matter,Quantitative & Systems biology,Mathematical & Computational physics,Machine learning,Statistics theory

                Comments

                Comment on this article