63
views
0
recommends
+1 Recommend
0 collections
    0
    shares
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      Towards a Rigorous Assessment of Systems Biology Models: The DREAM3 Challenges

      research-article

      Read this article at

      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          Background

          Systems biology has embraced computational modeling in response to the quantitative nature and increasing scale of contemporary data sets. The onslaught of data is accelerating as molecular profiling technology evolves. The Dialogue for Reverse Engineering Assessments and Methods (DREAM) is a community effort to catalyze discussion about the design, application, and assessment of systems biology models through annual reverse-engineering challenges.

          Methodology and Principal Findings

          We describe our assessments of the four challenges associated with the third DREAM conference which came to be known as the DREAM3 challenges: signaling cascade identification, signaling response prediction, gene expression prediction, and the DREAM3 in silico network challenge. The challenges, based on anonymized data sets, tested participants in network inference and prediction of measurements. Forty teams submitted 413 predicted networks and measurement test sets. Overall, a handful of best-performer teams were identified, while a majority of teams made predictions that were equivalent to random. Counterintuitively, combining the predictions of multiple teams (including the weaker teams) can in some cases improve predictive power beyond that of any single method.

          Conclusions

          DREAM provides valuable feedback to practitioners of systems biology modeling. Lessons learned from the predictions of the community provide much-needed context for interpreting claims of efficacy of algorithms described in the scientific literature.

          Related collections

          Most cited references23

          • Record: found
          • Abstract: found
          • Article: not found

          Transcriptional profiling shows that Gcn4p is a master regulator of gene expression during amino acid starvation in yeast.

          Starvation for amino acids induces Gcn4p, a transcriptional activator of amino acid biosynthetic genes in Saccharomyces cerevisiae. In an effort to identify all genes regulated by Gcn4p during amino acid starvation, we performed cDNA microarray analysis. Data from 21 pairs of hybridization experiments using two different strains derived from S288c revealed that more than 1,000 genes were induced, and a similar number were repressed, by a factor of 2 or more in response to histidine starvation imposed by 3-aminotriazole (3AT). Profiling of a gcn4Delta strain and a constitutively induced mutant showed that Gcn4p is required for the full induction by 3AT of at least 539 genes, termed Gcn4p targets. Genes in every amino acid biosynthetic pathway except cysteine and genes encoding amino acid precursors, vitamin biosynthetic enzymes, peroxisomal components, mitochondrial carrier proteins, and autophagy proteins were all identified as Gcn4p targets. Unexpectedly, genes involved in amino acid biosynthesis represent only a quarter of the Gcn4p target genes. Gcn4p also activates genes involved in glycogen homeostasis, and mutant analysis showed that Gcn4p suppresses glycogen levels in amino acid-starved cells. Numerous genes encoding protein kinases and transcription factors were identified as targets, suggesting that Gcn4p is a master regulator of gene expression. Interestingly, expression profiles for 3AT and the alkylating agent methyl methanesulfonate (MMS) overlapped extensively, and MMS induced GCN4 translation. Thus, the broad transcriptional response evoked by Gcn4p is produced by diverse stress conditions. Finally, profiling of a gcn4Delta mutant uncovered an alternative induction pathway operating at many Gcn4p target genes in histidine-starved cells.
            Bookmark
            • Record: found
            • Abstract: found
            • Article: found
            Is Open Access

            The Inferelator: an algorithm for learning parsimonious regulatory networks from systems-biology data sets de novo

            Background Distilling regulatory networks from large genomic, proteomic and expression data sets is one of the most important mathematical problems in biology today. The development of accurate models of global regulatory networks is key to our understanding of a cell's dynamic behavior and its response to internal and external stimuli. Methods for inferring and modeling regulatory networks must strike a balance between model complexity (a model must be sufficiently complex to describe the system accurately) and the limitations of the available data (in spite of dramatic advances in our ability to measure mRNA and protein levels in cells, nearly all biologic systems are under-determined with respect to the problem of regulatory network inference). A major challenge is to distill, from large genome-wide data sets, a reduced set of factors describing the behavior of the system. The number of potential regulators, restricted here to transcription factors (TFs) and environmental factors, is often on the same order as the number of observations in current genome-wide expression data sets. Statistical methods offer the ability to enforce parsimonious selection of the most influential potential predictors of each gene's state. A further challenge in regulatory network modeling is the complexity of accounting for TF interactions and the interactions of TFs with environmental factors (for example, it is known that many transcription regulators form heterodimers, or are structurally altered by an environmental stimulus such as light, thereby altering their regulatory influence on certain genes). A third challenge and practical consideration in network inference is that biology data sets are often heterogeneous mixes of equilibrium and kinetic (time series) measurements; both types of measurements can provide important supporting evidence for a given regulatory model if they are analyzed simultaneously. Last, but not least, is the challenge resulting from the fact that data-derived network models be predictive and not just descriptive; can one predict the system-wide response in differing genetic backgrounds, or when the system is confronted with novel stimulatory factors or novel combinations of perturbations? A significant body of work has been devoted to the modeling and learning of regulatory networks [1-3]. In these studies regulatory interactions and dynamics are modeled with varying degrees of detail and model flexibility and, accordingly, such models can be separated into general classes based on the level of detail with which they model individual regulatory interactions [1,2]. At the highest level of detail lie differential equations and stochastic models, which provide detailed descriptions of regulatory systems and can be used to simulate systems dynamics, but they are computationally demanding and require accurate measurement of a large number of parameters. Hence, these simulations have primarily been carried out for small-scale systems (relative to the full, genome-wide, regulatory circuit for a given organism); often these studies model systems that have been studied in great detail for decades, such as the galactose utilization pathway in yeast and the early development of sea urchin. At the other end of the model complexity spectrum lie Boolean networks [4], which assume that genes are simply on or off, and include standard logic interactions (AND, OR, XOR, and so on). Despite this simplification of regulatory dynamics and interactions, these approaches have the advantages of simplicity, robustness (they can be learned with significantly fewer data), and ease of interpretation [5]. Recent probabilistic approaches to modeling regulatory network on the genome-wide scale use Bayesian networks to model regulatory structure, de novo, at the Boolean level [6-11]. Additive linear or generalized linear models take an intermediate approach, in terms of model complexity and robustness [12-15]. Such models describe each gene's expression level as a weighted sum of the levels of its putative predictors. Inclusion of functions that modify the linear response produced by these additive methods (sometimes referred to as squashing functions) allows some biologically relevant nonlinear processes (for example, promoter saturation) to be modeled. An advantage of linear and generalized linear models is that they draw upon well developed techniques from the field of statistical learning for choosing among several possible models and efficiently fitting the parameters of those models. Learning and/or modeling of regulatory networks can be greatly aided by reducing the dimensionality of the search space before network inference. Two ways to approach this are limiting the number of regulators under consideration and grouping genes that are co-regulated into clusters. In the former case, candidates can be prioritized based on their functional role (for example, limiting the set of potential predictors to include only TFs, and grouping together regulators that are in some way similar). In the latter case, gene expression clustering, or unsupervised learning of gene expression classes, is commonly applied. It is often incorrectly assumed that co-expressed genes correspond to co-regulated genes. However, for the purposes of learning regulatory networks it is desirable to cluster genes on the basis of co-regulation (shared transcriptional control) as opposed to simple co-expression. Furthermore, standard clustering procedures assume that co-regulated genes are co-expressed across all observed experimental conditions. Because genes are often regulated differently under different conditions, this assumption is likely to break down as the quantity and variety of data grow. Biclustering was developed to address better the full complexity of finding co-regulated genes under multifactor control by grouping genes on the basis of coherence under subsets of observed conditions [10,16-22]. We developed an integrated biclustering algorithm, named cMonkey (Reiss DJ, Baliga NS, Bonneau R, unpublished data), which groups genes and conditions into biclusters on the basis of the following: coherence in expression data across subsets of experimental conditions; co-occurrence of putative cis-acting regulatory motifs in the regulatory regions of bicluster members; and the presence of highly connected subgraphs in metabolic [23] and functional association networks [24-26]. Because cMonkey was designed with the goal of identifying putatively co-regulated gene groupings, we use it to 'pre-cluster' genes before learning regulatory influences in the present study. cMonkey identifies relevant conditions in which the genes within a given bicluster are expected to be co-regulated, and the inferred regulatory influences on the genes in each bicluster pertain to (and are fit using) only those conditions within each bicluster. In principle, the algorithm described in this work can be coupled with other biclustering and clustering algorithms. Here we describe an algorithm, the Inferelator, that infers regulatory influences for genes and/or gene clusters from mRNA and/or protein expression levels. The method uses standard regression and model shrinkage (L1 shrinkage) techniques to select parsimonious, predictive models for the expression of a gene or cluster of genes as a function of the levels of TFs, environmental influences, and interactions between these factors [27]. The procedure can simultaneously model equilibrium and time course expression levels, such that both kinetic and equilibrium expression levels may be predicted by the resulting models. Through the explicit inclusion of time and gene knockout information, the method is capable of learning causal relationships. It also includes a novel solution to the problem of encoding interactions between predictors into the regression. We discuss the results from an initial run of this method on a set of microarray observations from the halophilic archaeon Halobacterium NRC-1. Results and discussion The inferred global regulatory network for Halobacterium NRC-1 We applied our method to the Halophilic archaeon Halobacterium NRC-1. The Halobacterium genome contains 2,404 nonredundant genes, of which 124 are annotated to be known or putative TFs [28,29]. The biclustering and network inference procedure were performed on a recently generated data set containing 268 mRNA microarray measurements of this archaeon under a wide range of genetic and environmental perturbations ('Kaur A, Pan M, Meislin M, El-Geweley R, Baliga NS' and 'Whitehead K, Kish A, Pan M, Kaur A, King N, Hohmann L, Diruggiero J, Baliga NS', personal communications), [30,31]. Several TFs do not change significantly in their expression levels in the data set; of the 124 identified TFs, 100 exhibited a significant change in expression levels across the data set, and the remaining 24 TFs were excluded from the set of potential influences (see Materials and methods, below) [32]. Strongly correlated TFs (those with correlation greater than 0.85) were further grouped, yielding 72 regulators (some representing multiple correlated regulators). To these 72 potential regulators were added 10 environmental factors for a total of 82 possible predictors for the 1,934 genes with significant signal in the data set. In addition to this main data set, 24 new experiments (collected after model fitting) were used for independent error estimation subsequent to the network inference procedure. The cMonkey method (Reiss DJ, Baliga NS, Bonneau R, unpublished data) was applied to this data set (original 268 conditions) to bicluster genes and conditions, on the basis of the gene expression data, a network of functional associations, and the occurrence and detection of cis-acting regulatory motifs in bicluster upstream sequences. Biclustering resulted in 300 biclusters covering 1,775 genes. An additional 159 genes, which exhibited significant change relative to the common reference across the data set, were determined by cMonkey to have unique expression patterns and were thus not included in biclusters; these 159 genes were inferred individually. The regulatory network inference procedure was then performed on these 300 biclusters and 159 individual genes, resulting in a network containing 1,431 regulatory influences (network edges) of varying strength. Of these regulatory influences, 495 represent interactions between two TFs or between a TF and an environmental factor. We selected the null model for 21 biclusters (no influences or only weak regulatory influences found, as described in Materials and methods, below), indicating that we are stringently excluding under-determined genes and biclusters from our network model. The ratio of data points to estimated parameters is approximately 67 (one time constant plus three regulatory influences, on average, from 268 conditions). Our data set is not complete with respect to the full physiologic and environmental repertoire for Halobacterium NRC-1, and several TFs have their activity modulated by unobserved factors (for example, post-translational modifications and the binding of unobserved ligands); the regulatory relations for many genes are therefore not visible, given the current data set. Figure 1 shows the resultant network for Halobacterium NRC-1 in Cytoscape, available as a Cytoscape/Gaggle web start [33,34]. An example of the predicted regulation of a single bicluster, bicluster 76 (containing genes involved in the transport of Fe and Mn; Table 1), is shown in Figure 1b. Among the 82 possible regulators, four were selected as the most likely regulators of this bicluster. The learned function of these TFs allows prediction of the bicluster 76 gene expression levels under novel conditions, including genetic perturbations (for example, to predict the expression levels in a kaiC knockout strain, the influence of kaiC can be removed from the equation by setting its weight to zero). We discuss the predicted regulatory model for bicluster 76 further below. We evaluated the ability of the inferred network model to predict the expression state of Halobacterium NRC-1 on a genome-wide basis. For each experimental condition, we made predictions of each bicluster state, based on the levels of regulators and environmental factors, and compared predicted expression values with the corresponding measured state (using root mean square deviation [RMSD] to evaluate the difference, or error, as described under Materials and methods, below). In this way we evaluated the predictive performance of the inferred network both on experiments in the training data set and on the 24 experiments in the independent test set (which we refer to as the newly collected data set). The expression level of a bicluster is predicted from the level of TFs and environmental factors that influence it in the network, at the prior time point (for time course conditions) or the current condition (for steady state conditions). The error estimates for the 300 biclusters and 159 single genes are shown in Figures 2 and 3. For the biclusters, the mean error of 0.37 is significantly smaller than the range of ratios observed in the data (because all biclusters were normalized to have variances of about 1.0 before model fitting), indicating that the overall global expression state is well predicted. Our predictive power on the new data (Figures 2 and 3, right panels) is similar to that on the training data (the mean RMS over the training set is within 1 standard deviation of the mean RMS over the new data), indicating that our procedure is enforcing reasonable parsimony upon the models (using L1 shrinkage coupled with tenfold cross-validation [CV], as described under Materials and methods, below) and accurately estimating the degree to which we can predict the expression levels of biclusters as a function of TF and environmental factor levels. Although the majority of biclusters have new data RMS values well matched by the training set RMS values, there are also nine biclusters (biclusters 1, 37, 77, 82, 99, 137, 161, 165, and 180) with RMS values significantly higher in the new data than in the training data. We were unable to identify any features of these outlying biclusters (coherence of bicluster, bicluster size, variance in and out of sample for the biclusters, and so on) that distinguish them from other biclusters. We also investigated predictive performance for the 159 genes that were not included in biclusters by cMonkey. We found good predictive performance (over the new data as well as over the training data) for approximately half of these genes - a much lower success rate than seen for genes represented by biclusters. There are a number of possible explanations for this diminished ability to predict genes that also elude biclustering. Averaging expression levels over genes that are co-regulated within biclusters can be thought of as signal averaging, and thus single genes are more prone to both systematic and random error than bicluster expression levels. Another possible explanation is that these elusive genes are under the influence of TFs that interact with unobserved factors, such as metabolites. There are also about five conditions that we fail to predict well relative to the other 264 conditions (large RMS values in training and new data; Figures 2 and 3). Not surprisingly, these five conditions are all situated directly after large perturbations in time series, when the system is fluctuating dramatically as it re-establishes stasis. We also performed several tests to determine how well our model formulation and fitting procedure performed compared with three simplified formulations, as described in detail in Additional data file 1. Briefly, these additional tests show that our current formulation for temporal modeling is essential to the performance of this procedure (mean RMSD with no temporal modeling 0.40; significance of comparison with full model P 15 correspond to false-positive rates of less than about 5%, based on control experiments). Of the 292 conditions, 24 had not been collected at the time that the model was fit and were thus not used in the training set. These 24 conditions were used as independent verification of the predictive power of the learned network model. Availability The regulatory network and all data types used in the inference process can be visualized using the data integration and exploration tools Gaggle and Cytoscape, and can be accessed via a Cytoscape java web-start [33]. Alternate data formats are available upon request. Gaggle [55] and Cytoscape [56] are freely available on the web. Inferelator was written using the R statistical programming language [57] and is freely available upon request. Additional data files The following additional data are included with the online version of this article: A Word document describing additional tests of individual components of our model formulation (Additional data file 1). Authors' contributions R.B. conceived and initiated the project, developed and implemented the method and the resultant computer program, and wrote the manuscript. V.T. conceived and initiated the project, developed and implemented the method and the resultant computer program, and wrote the manuscript. D.J.R. assisted in development and implementation, and assisted with writing of the manuscript. P.S. assisted with visualization of the data/networks via the Gaggle. N.B. conceived and initiated the project, provided feedback on the quality of results, and initiated verification of results with further experimentation. M.F. performed experimental verification. L.H. provided guidance at project inception and assisted in writing the manuscript. Supplementary Material Additional File 1 Additional tests of individual components of our model formulation. Click here for file
              Bookmark
              • Record: found
              • Abstract: found
              • Article: not found

              Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relevance networks.

              In an effort to find gene regulatory networks and clusters of genes that affect cancer susceptibility to anticancer agents, we joined a database with baseline expression levels of 7,245 genes measured by using microarrays in 60 cancer cell lines, to a database with the amounts of 5,084 anticancer agents needed to inhibit growth of those same cell lines. Comprehensive pair-wise correlations were calculated between gene expression and measures of agent susceptibility. Associations weaker than a threshold strength were removed, leaving networks of highly correlated genes and agents called relevance networks. Hypotheses for potential single-gene determinants of anticancer agent susceptibility were constructed. The effect of random chance in the large number of calculations performed was empirically determined by repeated random permutation testing; only associations stronger than those seen in multiply permuted data were used in clustering. We discuss the advantages of this methodology over alternative approaches, such as phylogenetic-type tree clustering and self-organizing maps.
                Bookmark

                Author and article information

                Contributors
                Role: Editor
                Journal
                PLoS One
                plos
                plosone
                PLoS ONE
                Public Library of Science (San Francisco, USA )
                1932-6203
                2010
                23 February 2010
                : 5
                : 2
                : e9202
                Affiliations
                [1 ]IBM T. J. Watson Research Center, Yorktown Heights, New York, United States of America
                [2 ]Laboratory of Intelligent Systems, Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
                [3 ]Department of Systems Biology, Harvard Medical School, Boston, Massachusetts, United States of America
                [4 ]Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America
                [5 ]Department of Mechanical Engineering, National Technical University of Athens, Athens, Greece
                [6 ]Genome Institute of Singapore, Singapore, Singapore
                [7 ]Program in Computational Biology and Immunology, Memorial Sloan-Kettering Cancer Center, New York, New York, United States of America
                Center for Genomic Regulation, Spain
                Author notes
                Article
                09-PONE-OV-14359R1
                10.1371/journal.pone.0009202
                2826397
                20186320
                d2d55cec-9385-4adf-bac1-5bf9d3726fab
                Prill et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
                History
                : 19 November 2009
                : 19 January 2010
                Page count
                Pages: 18
                Categories
                Overview
                Computational Biology
                Biochemistry/Bioinformatics
                Biophysics/Theory and Simulation

                Uncategorized
                Uncategorized

                Comments

                Comment on this article