Introduction Understanding the complex responses of the human body to drug treatments is vitally important to address the efficacy and safety-related issues of compounds in later stages of drug development and, thus, to reduce high attrition rates in clinical trials (Kola and Landis, 2004). The fundamental challenge towards this goal lies in the selection and thorough characterization of model systems that can accurately recapitulate the drug response of human physiology for diverse drug-screening projects (Jones and Diamond, 2007; Sharma et al, 2010; Dow and Lowe, 2012). One way of obtaining unbiased, large-scale readouts from model systems is genome-wide expression profiling of the transcriptional response to various drug treatments (Feng et al, 2009; Iskar et al, 2011). This has first been systematically explored in model organisms, such as Saccharomyces cerevisiae, with the aim to elucidate drug mechanism of action (MOA) based on their transcriptional effects (Hughes et al, 2000; Ihmels et al, 2002; di Bernardo et al, 2005). Simultaneously, coexpression analysis and transcriptional modules of the yeast data allowed the inference of functional roles for genes that respond coherently to these perturbations (Hughes et al, 2000; Ihmels et al, 2002; Wu et al, 2002; Segal et al, 2003; Tanay et al, 2004). Recently, the Connectivity Map (CMap) successfully extended the concept of large-scale gene expression profiling of drug response to human cell lines (Lamb et al, 2006; Lamb, 2007). In parallel, drug-induced expression changes have been profiled at a large scale in animal models, such as rat liver (Ganter et al, 2005; Natsoulis et al, 2008). Computational advances in mining these data have improved signature comparison methods leading to novel drug–drug (Subramanian et al, 2005; Lamb et al, 2006; Iorio et al, 2009, 2010) and drug–disease (Hu and Agarwal, 2009; Sirota et al, 2011; Dudley et al, 2011; Pacini et al, 2012) connections based on their (anti)correlated transcriptional effects (Qu and Rajpal, 2012; Iorio et al, 2012). However, these mammalian transcriptional readouts still need to be utilized for uncovering the underlying gene regulatory networks and for predicting gene function and delineating pathway membership. Along those lines, we used a biclustering approach that is well-suited for revealing the modular organization of transcriptional responses to drug perturbation (Ihmels et al, 2002; Prelić et al, 2006), as it can group coregulated genes with the drugs they respond to (technically, each bicluster consists of both a gene and a drug subset). We applied it to large-scale transcriptome resources for three human cell lines and rat liver to generate, for the first time, a large compendium of mammalian drug-induced transcriptional modules. We extensively characterized these modules in terms of functional roles of the genes and the bioactivities of the drugs they contain, in order to gain insights into both, drug MOA and the perturbed cellular systems (Figure 1). Comparing drug-induced transcriptional modules generated independently for each of the three human cell types and rat liver allowed us to assess their conservation across tissue types and organisms. Although it has been noted earlier that particular transcriptional changes along developmental trajectories, in different growth conditions, stress or disease can be conserved between species (Stuart et al, 2003; van Noort et al, 2003), it remains an open question, how well the modular organization of the transcriptome is conserved across tissues and organisms (Miller et al, 2010; Zheng-Bradley et al, 2010; Dowell, 2011). In this context, our results on the conservation of drug-induced transcriptional modules contribute to a better understanding of the degree to which cell line models recapitulate the biological processes and signaling pathways taking place in the human physiological context (Jones and Diamond, 2007; Sharma et al, 2010). Results and discussion Identification of drug-induced modules in human cell lines and rat liver To identify and compare drug-induced transcriptional modules from human cell lines as well as rat liver tissue, we exploited data from the following two public resources: (i) the CMap (Lamb et al, 2006), which contains 6100 expression profiles of several human cancer cell lines treated with 1309 drug-like small molecules (hereafter simply referred to as ‘drugs') (Supplementary Figure 1) and (ii) the DrugMatrix resource, a large data set of 1743 expression profiles from liver tissue of drug-treated rats (Natsoulis et al, 2008). Raw microarray data were subjected to quality control and preprocessing procedures to improve data consistency and reduce batch effects (Iskar et al, 2010). For CMap, this resulted in a usable set of expression measurements of 8964 genes in three human cell lines (HL60, MCF7 and PC3, see Materials and Methods), each treated with the same set of 990 drugs. From the rat data set, only genes with orthologous human genes present in CMap were considered. This yielded expression profiles for 3618 genes in response to treatments with 344 distinct drugs (Supplementary Figure 1). Drug-induced transcriptional modules were detected and tested for statistical significance separately in each of the four matrices of expression data using an unsupervised biclustering approach that has previously been shown to maintain high accuracy even with noisy input data (Iterative Signature Algorithm (ISA); Bergmann et al, 2003; Ihmels et al, 2004; Prelić et al, 2006). Each resulting bicluster (hereafter referred to as ‘module') consists of a subset of genes and a subset of drugs that coherently regulate these genes. To ensure comprehensiveness and robustness of this analysis, we explored a wide range of parameter settings for the ISA workflow (Supplementary Figure 2 and Supplementary Tables 1 and 2), and in addition, applied a size threshold to exclude small, likely spurious modules as motivated by previous studies (Langfelder and Horvath, 2008, and see Materials and Methods). As a result, we identified robust, drug-induced transcriptional modules in each system individually: 25 in MCF7 cells, 28 in PC3 cells, 29 in HL60 cells and 43 in rat liver (Figure 2A and Supplementary Data set 1). On average, these modules contain 70 genes in the human cell lines and 50 genes in the rat liver, induced by an average of 29 drug treatments in both data sets. Within each data set, modules can overlap (on average, 7% of the genes and 34% of the drugs were contained in multiple modules), reflecting the effects of polypharmacological drugs that modulate multiple targets (Hopkins et al, 2006; Keiser et al, 2009), which in turn can lead to perturbation of multiple pathways by the same drug. Conservation of drug-induced modules across cell types and organisms As appropriate animal models that accurately recapitulate the human drug response are crucial in drug discovery and development, we assessed the conservation of drug responses at the transcriptome level by comparing drug-induced transcriptional modules across cell lines and organisms. This resulted in a network of module similarity calculated using a reciprocal best-hit approach, which linked modules from different cell lines and rat liver to each other if their gene members significantly overlapped between data sets (Supplementary Table 3). In addition, we assessed the drug overlap of modules linked across cell lines (see Materials and Methods). For 58 out of 82 modules (71%) from human cell lines, we identified a corresponding module in at least 1 other cell line, which yielded a total of 23 nonredundant, conserved drug-induced transcriptional modules (CODIMs; Figure 2A and Supplementary Figure 3). A conservation level of 71% between cell lines is in line with a previous study that assessed the conservation of constitutive coexpression networks across human brain regions (Oldham et al, 2008). In addition to similarity across cell lines, we found considerable (statistically significant) conservation of human modules in rat liver ranging from 3 to 5 out of 25 to 29 individual modules per cell line (15% average ratio, permutation-based P-value 0.6, see Supplementary Table 2 for robustness analysis with respect to the actual parameters used). Second, we discarded very small, likely spurious modules and retained only biclusters with a minimum of 20 genes and 5 drugs (10 for rat liver). Finally, we applied the standard ISA redundancy removal to reduce the number of very similar biclusters that are the result of many randomly initialized runs converging to highly similar result sets (Csárdi et al, 2010). Before this removal step, all transcriptional modules were sorted by gene and drug thresholds aiming to preferentially retain medium-sized modules (first prioritizing gene threshold values from 4 to 3 over ones from 5 to 3.2 and finally 2.8 to 2 for cell line data, and prioritizing values from 3 to 2 over ones from 5 to 3.2 for rat liver data; second, prioritizing drug threshold values from 3 to 2 over 4 to 3.2 in both cell line and rat liver data followed by values from 1.8 to 1 in cell line data). On the basis of this prioritization, redundant modules were filtered in sequential order following the developers' recommendations (Csárdi et al, 2010), using a correlation threshold of 0.3 to determine redundant biclusters (see Supplementary Table 1 for a parameter robustness analysis). Lastly, in addition to ISA redundancy removal, modules from the same cell line showing significant gene overlap with a module of higher priority (hypergeometric test, P-value 0.4), while excluding methodologically analogous coexpression associations (von Mering et al, 2005). Similarly, 11 988 reliable associations between 1869 rat proteins were retrieved from STRING having a combined evidence score over 0.4 and again excluding analogous coexpression associations. For each module, we calculated the proportion of functionally associated gene pairs (excluding self-relations) within the module and compared this against the distribution obtained for the proportion of functionally associated pairs in 1000 random gene sets of matched sizes. Characterization of drug-induced transcriptional modules We characterized both individual transcriptional modules along with their unified CODIMs in detail by mining information for both gene and drug members using several annotation resources. We used the Database for Annotation, Visualization and Integrated Discovery (DAVID knowledgebase; Huang et al, 2009a, 2009b) to test for enriched gene function and to identify biological themes among these. For each module, the DAVID web service provided a ranked list (enrichment score >1.3) of functionally relevant annotation clusters that represent a summary of several annotation categories, including GO (Ashburner et al, 2000), KEGG (Kanehisa et al, 2012) and BioCarta pathways (Nishimura, 2001). In this analysis, gene reference background was set to all genes for CMap, and rat liver data set provided as an input to the ISA algorithm. The analysis of gene functions was complemented with a second approach, exploiting drug annotation resources for module characterization. For this, we extracted the ATC classification code (Andersen and Hvidberg, 1981; Pahor et al, 1994) for 677 approved drugs from the set of chemicals in CMap and 4331 chemical–protein interactions for 493 CMap compounds from the STITCH database (Kuhn et al, 2011), considering only reliable database and experimental evidence with high confidence (any of the two scores >0.7). In addition, 23 458 associations between 1106 side effects and 373 drugs contained in CMap were obtained from the SIDER database (Kuhn et al, 2010). Lastly, a library of chemical fragments (>6 atoms) generated by exhaustive molecular fragmentation of 892 CMap drugs (93%) ( 1 and FDR-corrected P-value <0.01; Supplementary Table 6). Data availability Supplementary Data such as drug-induced modules and CODIMs are available from http://codim.embl.de in human and machine readable formats. The web resource also provides functionality for drug and gene search. Supplementary Material Supplementary Information Supplementary figures 1–10, Supplementary tables 1, 2, 4, 5 and 6 Supplementary Table 3 Comparison of gene and drug members of drug-induced modules across human cell lines and organisms Supplementary Data set 1 Characterization of gene and drug members of drug-induced modules Review Process File