Precise patterns of spatial and temporal gene expression are central to metazoan complexity and act as a driving force for embryonic development. While there has been substantial progress in dissecting and predicting cis-regulatory activity, how information from multiple enhancer elements converges to regulate a gene's expression remains poorly understood. This is in large part due to the number of different biological processes involved in mediating regulation, as well as the limited availability of experimental measurements for many of them. Here, we used a Bayesian approach to model diverse experimental regulatory data, leading to accurate predictions of both spatial and temporal aspects of gene expression. We integrated whole-embryo information on transcription factor recruitment to multiple cis-regulatory modules, insulator binding and histone modification status in the vicinity of individual gene loci, at a genome-wide scale during Drosophila development. The model uses Bayesian networks to represent the relation between transcription factor occupancy and enhancer activity in specific tissues and stages. All parameters are optimized in an expectation-maximization procedure, yielding a model capable of predicting the tissue- and stage-specific activity of new, previously unassayed genes. Performing the optimization with subsets of the input data demonstrated that neither enhancer occupancy nor chromatin state alone can explain all gene expression patterns, but that together they allow accurate predictions of spatio-temporal activity. Model predictions were validated using the expression patterns of more than 600 genes recently made available by the BDGP consortium, demonstrating an average 15-fold enrichment of genes expressed in the predicted tissue over a naïve model.
We further validated the model by experimentally testing the expression of 20 predicted target genes of unknown expression, resulting in an accuracy of 95% for temporal predictions and 50% for spatial predictions. This is, to our knowledge, the first genome-wide approach to predict tissue-specific gene expression in metazoan development, and our results suggest that integrative models of this type will become more prevalent in the future.
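The fold enrichment reported above can be illustrated with a small calculation. The sketch below is a minimal example under stated assumptions, not the evaluation used in this study: it assumes that the naïve model corresponds to the background fraction of assayed genes expressed in a tissue, and the function name `fold_enrichment` and the toy gene identifiers are hypothetical.

```python
def fold_enrichment(predicted, expressed, background):
    """Ratio of the hit rate among predicted genes to the background rate.

    Assumption (not from the source): the naive baseline is the fraction of
    all assayed genes that are expressed in the tissue of interest.
    """
    background = set(background)
    predicted = set(predicted) & background
    expressed = set(expressed) & background
    if not predicted or not expressed:
        raise ValueError("need at least one predicted and one expressed gene")
    hit_rate = len(predicted & expressed) / len(predicted)
    base_rate = len(expressed) / len(background)
    return hit_rate / base_rate


# Toy data: 100 assayed genes, 10 of them expressed in the tissue.
background = [f"gene{i}" for i in range(100)]
expressed = background[:10]
# Four genes predicted for the tissue, three of which are truly expressed.
predicted = ["gene0", "gene1", "gene2", "gene50"]

enrichment = fold_enrichment(predicted, expressed, background)
```

In this toy case the hit rate is 3/4 and the background rate is 10/100, so the enrichment is 7.5-fold.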
Development is a complex process in which a single cell gives rise to a multi-cellular organism composed of diverse cell types and well-organized tissues. This transformation requires tightly coordinated expression, both spatially and temporally, of hundreds to thousands of genes specific to any given tissue. To orchestrate these patterns, gene expression is regulated at multiple steps, including transcription factor (TF) binding to cis-regulatory modules, recruitment of general transcription factors and RNA polymerase II to promoters, chromatin remodeling, and three-dimensional looping interactions. Despite this level of complexity, the regulation of gene expression is typically modeled in the context of transcription factor binding and a single enhancer's activity, as this is where the majority of experimental data are available. Recent advances in the measurement of chromatin modifications and insulator binding during embryogenesis provide new datasets that can be used for modeling gene expression. Here we use a Bayesian approach to integrate all three levels of information and combine the activity of multiple regulatory elements into a single model of a gene's expression, implementing an expectation-maximization strategy to overcome the problem of missing data. Importantly, while the data for histone modifications and insulator binding represent merged signals from all cells in the embryo, the model can extract cell-type-specific and stage-specific predictions of gene expression for hundreds of genes of unknown expression.
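To illustrate how expectation maximization can cope with missing measurements, the following is a minimal sketch and not the model described in this work. It assumes a much simpler stand-in: one hidden binary variable per regulatory element (active or inactive) and a handful of binary observations, such as whether a TF is bound, that are conditionally independent given the hidden state. Missing observations are encoded as `None` and are skipped in both steps. All names (`e_step`, `m_step`, `fit_em`) and the toy data are hypothetical.

```python
import math
import random


def e_step(rows, prior, theta):
    """Posterior P(active | observed features) per row; None entries are skipped."""
    posteriors = []
    for row in rows:
        log_p = [math.log(1.0 - prior), math.log(prior)]
        for j, x in enumerate(row):
            if x is None:  # missing measurement contributes no evidence
                continue
            for state in (0, 1):
                p = theta[state][j]
                log_p[state] += math.log(p if x == 1 else 1.0 - p)
        top = max(log_p)  # subtract the maximum for numerical stability
        weights = [math.exp(v - top) for v in log_p]
        posteriors.append(weights[1] / (weights[0] + weights[1]))
    return posteriors


def m_step(rows, posteriors, n_features, smooth=1.0):
    """Re-estimate parameters from expected counts over observed entries only."""
    prior = (sum(posteriors) + smooth) / (len(rows) + 2.0 * smooth)
    theta = [[0.0] * n_features for _ in range(2)]
    for state in (0, 1):
        for j in range(n_features):
            num, den = smooth, 2.0 * smooth  # smoothing keeps theta inside (0, 1)
            for row, post in zip(rows, posteriors):
                if row[j] is None:
                    continue
                weight = post if state == 1 else 1.0 - post
                num += weight * row[j]
                den += weight
            theta[state][j] = num / den
    return prior, theta


def fit_em(rows, n_features, n_iter=50, seed=0):
    """Alternate E and M steps for a fixed number of iterations."""
    rng = random.Random(seed)
    prior = 0.5
    theta = [[rng.uniform(0.3, 0.7) for _ in range(n_features)] for _ in range(2)]
    for _ in range(n_iter):
        posteriors = e_step(rows, prior, theta)
        prior, theta = m_step(rows, posteriors, n_features)
    return prior, theta, e_step(rows, prior, theta)


# Toy occupancy data: 1 = bound, 0 = unbound, None = not measured.
rows = [
    [1, 1, None], [1, None, 1], [1, 1, 1], [None, 1, 1],
    [0, 0, 0], [None, 0, 0], [0, None, 0], [0, 0, None],
]
prior, theta, posteriors = fit_em(rows, n_features=3)
```

Because the hidden state is never observed, which of the two states ends up meaning "active" is arbitrary in this sketch. A Bayesian network over several regulatory inputs has a richer structure, but the same alternation between inferring hidden values and re-estimating parameters applies.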