      Is Open Access

      Helveticoside is a biologically active component of the seed extract of Descurainia sophia and induces reciprocal gene regulation in A549 human lung cancer cells

      research-article
      BMC Genomics
      BioMed Central
      Connectivity map, Descurainia sophia, Helveticoside, Microarray, Reciprocal regulation


          Abstract

          Background

          Although the pharmacological activities of the seed extract of Descurainia sophia have been proven to be useful against cough, asthma, and edema, the biologically active components, particularly at the molecular level, remain elusive. Therefore, we aimed to identify the active component of an ethanol extract of D. sophia seeds (EEDS) by applying a systematic genomic approach.

          Results

          After treatment with EEDS, the dose-dependently expressed genes in A549 cells were used to query the Connectivity map to determine which small molecules could closely mimic EEDS in terms of whole gene expression. Gene ontology and pathway analyses were also performed to identify the functional involvement of the drug responsive genes. In addition, interaction network and enrichment map assays were implemented to measure the functional network structure of the drug-responsive genes. A Connectivity map analysis of differentially expressed genes resulted in the discovery of helveticoside as a candidate drug that induces a similar gene expression pattern to EEDS. We identified the presence of helveticoside in EEDS and determined that helveticoside was responsible for the dose-dependent gene expression induced by EEDS. Gene ontology and pathway analyses revealed that the metabolism and signaling processes in A549 cells were reciprocally regulated by helveticoside and inter-connected as functional modules. Additionally, in an ontological network analysis, diverse cancer type-related genes were found to be associated with the biological functions regulated by helveticoside.

          Conclusions

          Using bioinformatic analyses, we confirmed that helveticoside is a biologically active component of EEDS that induces reciprocal regulation of metabolism and signaling processes. Our approach may provide the herbal research field with novel insights for identifying biologically active components of extracts.

          Electronic supplementary material

          The online version of this article (doi:10.1186/s12864-015-1918-1) contains supplementary material, which is available to authorized users.

          Most cited references (26)

          DAVID: Database for Annotation, Visualization, and Integrated Discovery.

          Functional annotation of differentially expressed genes is a necessary and critical step in the analysis of microarray data. The distributed nature of biological knowledge frequently requires researchers to navigate through numerous web-accessible databases gathering information one gene at a time. A more judicious approach is to provide query-based access to an integrated database that disseminates biologically rich information across large datasets and displays graphic summaries of functional information. Database for Annotation, Visualization, and Integrated Discovery (DAVID; http://www.david.niaid.nih.gov) addresses this need via four web-based analysis modules: 1) Annotation Tool - rapidly appends descriptive data from several public databases to lists of genes; 2) GoCharts - assigns genes to Gene Ontology functional categories based on user selected classifications and term specificity level; 3) KeggCharts - assigns genes to KEGG metabolic processes and enables users to view genes in the context of biochemical pathway maps; and 4) DomainCharts - groups genes according to PFAM conserved protein domains. Analysis results and graphical displays remain dynamically linked to primary data and external data repositories, thereby furnishing in-depth as well as broad-based data coverage. The functionality provided by DAVID accelerates the analysis of genome-scale datasets by facilitating the transition from data collection to biological meaning.

            STEM: a tool for the analysis of short time series gene expression data

            Background

            Time series microarray experiments are widely used to study dynamical biological processes. Due to the cost of microarray experiments, and also in some cases the limited availability of biological material, about 80% of microarray time series experiments are short (3–8 time points). Previously, short time series gene expression data have mainly been analyzed using more general gene expression analysis tools not designed for the unique challenges and opportunities inherent in such data.

            Results

            We introduce the Short Time-series Expression Miner (STEM), the first software program specifically designed for the analysis of short time series microarray gene expression data. STEM implements unique methods to cluster, compare, and visualize such data. STEM also supports efficient and statistically rigorous biological interpretations of short time series data through its integration with the Gene Ontology.

            Conclusion

            The unique algorithms STEM implements to cluster and compare short time series gene expression data, combined with its visualization capabilities and integration with the Gene Ontology, should make STEM useful in the analysis of data from a significant portion of all microarray studies. STEM is available for download for free to academic and non-profit users at .
              Is Open Access

              A human functional protein interaction network and its application to cancer data analysis

              Background High-throughput functional experiments, including genetic linkage/association studies, examinations of copy number variants in somatic and germline cells, and microarray expression experiments, typically generate multiple candidate genes, ranging from a handful to several thousands. These data sets are noisy and contain false positives in addition to genes that are truly involved in the biological process under study. An unsolved challenge is how to understand the functional significance of multi-gene data sets, extract true positive candidate genes, and tease out functional relationships among these genes with confidence for use in further experimental analysis. Using biological pathways to interpret high-throughput data One way to approach the above problem is to analyze the data from the perspective of biological pathways [1,2]. A pathway is a set of biochemical events that drives a cellular process. For example, the transforming growth factor beta (TGFβ) pathway consists of a ligand receptor binding event that initiates a series of protein-protein interaction (PPI), protein degradation, protein phosphorylation, and protein-DNA binding events that transmit a regulatory signal and regulate proliferation, differentiation and migration [3]. In cancer, the TGFβ signaling network functions in complex ways to both suppress early tumor growth and promote late stage progression [4]. Some breast cancers [5-9] are thought to arise in part when components of the TGFβ pathway are deleted, thereby freeing the tissue from growth inhibition. The same type of cancer can arise via several different routes [2]. For example, tumors from two different patients might have deleted different components of the TGFβ pathway. Although the two tumors both share the loss of TGFβ growth inhibition, they may not share defects in a common gene or gene sets. 
However, a pathway-based analysis will resolve this confusing finding and point towards the etiology of the disease. By projecting the list of mutated, amplified or deleted genes onto biological pathways, one will find that a statistically unlikely subset of otherwise unrelated genes are closely clustered in 'reaction space'. Pathway-based analysis can thus provide important insights into the biology underlying disease etiology. One striking example of this approach is the finding of the 'exclusivity principle' in cancer: only one gene is generally mutated in one pathway in any single tumor [1]. Recently, several large-scale genome-wide screening projects have revealed common core signaling pathways in the etiology or progression of several cancer types [10-14], indicating the relevance of pathway-based analysis for the understanding of large scale disease data sets. Pathway-based analysis accomplishes at least two things: it marks the genes associated with the disease or other phenotype and separates them from innocent bystanders caught in the general instability of the malignant genome or other false positive hits [15]; and it identifies the biological pathways affected by the genes [16]. The latter outcome also places the high-throughput analysis results in an intellectual framework that can be more easily comprehended by the researcher. It connects his results to prior work from the literature, and allows him to propose hypotheses that can be tested by further experimental work. Resources for pathway analysis Pathway-based hypothesis generation has been the subject of great interest over the past few years [17]. It is the basis for several popular data analysis systems, including GOMiner [18,19], Gene Set Enrichment Analysis [20], Eu.Gene Analyzer [21], and several commercial tools (for example, Ingenuity Systems [22]). Reactome [23] is an expert-curated, highly reliable knowledgebase of human biological pathways. 
Pathways in Reactome are described as a series of molecular events that transform one or more input physical entities into one or more output entities in catalyzed or regulated ways by other entities. Entities include small molecules, proteins, complexes, post-translationally modified proteins, and nucleic acid sequences. Each physical entity, whether it be a small molecule, a protein or a nucleic acid, is assigned a unique accession number and associated with a stable online database. This connects curated data in Reactome with online repositories of genome-scale data such as UniProt [24] and EntrezGenes [25], and makes it possible to unambiguously associate a position on the genome with a component of a pathway. A computable data model and highly reliable data sets make Reactome an ideal platform for a pathway-based data analysis system. However, since all data in Reactome is expert-curated and peer-reviewed to ensure high quality, the usage of Reactome as a platform for high-throughput data analysis suffers from a low coverage of human proteins. As of release 29 (June 2009), Reactome contains 4,181 human proteins, roughly 20% of total SwissProt proteins. Other curated pathway databases, including KEGG [26], Panther Pathways [27], and INOH [28], offer similarly low coverage of the genome. In contrast to pathway databases, collections of pairwise relationships among proteins and genes offer much higher coverage. These include data sets of PPIs and gene co-expression derived from multiple high-throughput techniques such as yeast two-hybrid techniques, mass spectrometry pull down experiments, and DNA microarrays. These kinds of data sets are readily available from many public databases. 
For example, PPIs can be downloaded from BioGrid [29], the Database of Interacting Proteins [30], the Human Protein Reference Database (HPRD) [31], I2D [32], IntACT [33], and MINT [34], and expression data sets from the Stanford Microarray Database [35] and the Gene Expression Omnibus [36]. Protein or gene networks based on these pairwise relationships have been widely used in cancer and other disease data analysis with promising results [37-42]. Transforming pairwise interactions into probable functional interactions A limitation of pairwise networks is that the presence of an interaction between two genes or proteins does not necessarily indicate a biologically functional relationship; for example, two proteins may physically interact in a yeast two-hybrid experiment without this signifying that such an interaction forms a part of a biologically meaningful pathway in the living organism. In addition, some pairwise interaction data sets may have high false positive rates [43,44], which contribute noise to the system, and interfere with pathway-based analyses. For this reason, groups that make pathway-based inferences on high-throughput functional data sets inevitably draw on curated pathway projects to cleanse their data and to train their predictive models. Our goal is to achieve the best of both worlds by combining high-coverage, unreliable pairwise data sets with low-coverage, highly reliable pathways to create a pathway-informed data analysis system for high-throughput data analysis. As the first step towards achieving this goal, we have created a functional interaction (FI) network that combines curated interactions from Reactome and other pathway databases, with uncurated pairwise relationships gleaned from physical PPIs in human and model organisms, gene co-expression data, protein domain-domain interactions, protein interactions generated from text mining, and GO annotations. 
Our approach uses a naïve Bayes classifier (NBC) to distinguish high-likelihood FIs from non-functional pairwise relationships as well as outright false positives. In this report, we describe the procedures to construct this FI network (Figure 1), and apply this network to the study of glioblastoma multiforme (GBM) and other cancer types by expanding a human curated GBM pathway using our FIs, projecting cancer candidate genes onto the FI network to reveal the patterns of the distribution of these genes in the network, and utilizing network clustering results on cancer samples to search for common mechanisms among many samples with different sequence-altered genes. Finally, we introduce a web-based user interface that gives researchers interactive access to the derived FIs. Figure 1 Overview of procedures used to construct the functional interaction network. See text for details. BP, biological process. Results Data sources used to predict protein functional interactions We used the following six classes of data to predict protein FIs (Table 1): 1, human physical PPIs catalogued in IntAct [45], HPRD [46], and BioGrid [47]; 2, human PPIs projected from fly, worm and yeast in IntAct [45] based on Ensembl Compara [48]; 3, human gene co-expression derived from DNA microarray studies (two data sets [49,50]); 4, shared GO biological process annotations [51]; 5, protein domain-domain interactions from PFam [52]; and 6, PPIs extracted from the biomedical literature by the text-mining engine GeneWays [53]. 
Table 1 Data sources used to predict protein functional interactions

Data source               Proteins  SwissProt proteins (coverage)  Interactions          Reference
Human PPIs                10,287    10,029 (49%)                   53,743                [45-47]
Fly PPIs                  13,383    4,088 (20%)                    939,639 (26,346a)     [45]
Worm PPIs                 5,223     1,477 (7%)                     122,192 (8,161a)      [45]
Yeast PPIs                5,646     1,530 (8%)                     1,900,980 (167,574a)  [45]
Domain interaction        60,569    15,218 (75%)                   NA                    [52]
Lee's Gene Expression     8,250     7,647 (38%)                    206,117               [49]
Prieto's Gene Expression  3,024     2,901 (14%)                    13,441                [50]
GO BP sharing             14,197    14,197 (70%)                   NA                    [51]
PPIs from GeneWays        5,252     5,252 (26%)                    51,048                [53]

To calculate the coverage of SwissProt, we used 20,332, the total identifier number in SwissProt (UniProtKB/Swiss-Prot Release 56.9, 3 March 2009), as the denominator. The numbers of interactions from three model organisms have been mapped to human proteins based on Ensembl Compara [48] (see text for details). a Numbers of PPIs in the original species. BP, biological process.

Table 1 lists these data sources, the numbers of proteins and interactions, and estimated coverage of the human genome expressed as their coverage of the SwissProt protein database. The coverage ranges from 7% (Worm PPIs) to 70% (GO biological process sharing). It is notable that the coverage of human physical PPIs from three public protein interaction databases (IntAct, HPRD, and BioGrid) is close to 50%. Many interactions from IntAct were catalogued from co-immunoprecipitation experiments combined with mass spectrometry, and contain multiple proteins in a single interaction record. An odds ratio analysis showed that human PPIs based on all interaction records are much less correlated to FIs (see below) extracted from Reactome pathways than interactions containing four or fewer interactors: 13.91 ± 0.52 versus 36.98 ± 9.17 (P-value = 2.8 × 10⁻⁵ based on t-test). Therefore, we selected interactions that contain only four or fewer interactors from the IntAct database.
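The coverage figures in Table 1 can be checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes that coverage is the SwissProt protein count divided by the stated denominator of 20,332 and rounded to a whole percentage. The function and variable names are hypothetical and do not come from the article.

```python
# Denominator stated in the Table 1 legend (total SwissProt identifiers).
SWISSPROT_TOTAL = 20_332

# SwissProt protein counts per data source, copied from Table 1.
swissprot_counts = {
    "Human PPIs": 10_029,
    "Fly PPIs": 4_088,
    "Worm PPIs": 1_477,
    "Yeast PPIs": 1_530,
    "Domain interaction": 15_218,
    "Lee's Gene Expression": 7_647,
    "Prieto's Gene Expression": 2_901,
    "GO BP sharing": 14_197,
    "PPIs from GeneWays": 5_252,
}


def coverage_percent(n_proteins, total=SWISSPROT_TOTAL):
    """Return coverage as a whole-number percentage (assumed rounding rule)."""
    return round(100 * n_proteins / total)


for source, n in swissprot_counts.items():
    print(f"{source}: {coverage_percent(n)}%")
```

Under this assumption, each computed value matches the percentage printed in the table.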
We also tried to use GO molecular functional annotations as one of the data sources. The odds ratio of this data set was 2.99 ± 0.02, much smaller than the GO biological process data set (11.85 ± 0.20). Our results show that this data set contributed little to the prediction. One reason for this may be that the GO molecular functional categories are usually broad and the purpose of our NBC is to predict if two proteins may be involved in the same specific reactions (see below). Construction and training of a functional interaction classifier Our goal was to create a network of protein functional relationships that reflect functionally significant molecular events in cellular pathways. The majority of PPIs in interaction databases are catalogued as physical interactions, and there is rarely direct evidence in the interaction databases that these interactions are involved in biochemical events that occur in the living cell. Other protein pairwise relationships have similar issues. To integrate pairwise relationships into a pathway context, we built a scoring system based on the NBC algorithm, a simple machine learning technique [54], to score the probability that a protein pairwise relationship reflects a functional pathway event. 
For our NBC, we used nine features as listed under 'Data source' in Table 1: 1, whether there is a reported PPI between the human proteins; 2, whether there is a reported PPI between the fly (Drosophila melanogaster) orthologs of the two human proteins; 3, whether there is a reported PPI between the worm (Caenorhabditis elegans) orthologs of the two human proteins; 4, whether there is a reported PPI between the yeast (Saccharomyces cerevisiae) orthologs of the two human proteins; 5, whether there is a domain-domain interaction between the human proteins; 6 and 7, whether the genes encoding the two proteins are co-expressed in expression microarrays based on two independent DNA array data sets; 8, whether the GO biological process annotations for human proteins are shared; and 9, whether there is a text-mined interaction between the human proteins. An NBC must be trained using positive and negative training data sets in order to determine the proper weighting of different combinations of features. We developed training sets from the curated information in Reactome, relying in part on an independent analysis that reported Reactome as a highly accurate data set for PPI prediction [55]. An issue in using PPIs and other pairwise relationships in a pathway context is that the data models used by pathway databases are much richer than a simple binary relationship. A pathway database describes pathways in terms of proteins, small molecules and cellular compartments that are related by biochemical reactions that have inputs, outputs, catalysts, cofactors and other regulatory molecules. To develop the training sets from Reactome pathways for NBCs, we established a relationship called 'functional interaction' using the following definition: a functional interaction is one in which two proteins are involved in the same biochemical reaction as an input, catalyst, activator, or inhibitor, or as two members of the same protein complex.
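The functional interaction definition above can be read as a rule for turning one reaction or one complex into a set of unordered protein pairs. The following is a minimal sketch of that reading, not the authors' implementation: the dictionary layout, the role names as keys, and the protein identifiers are all hypothetical.

```python
from itertools import combinations

# Roles named in the definition. Reaction outputs are not listed there,
# so this sketch ignores them (an interpretation, not a confirmed rule).
FI_ROLES = ("input", "catalyst", "activator", "inhibitor")


def pairs_from_reaction(reaction):
    """Pair up all proteins that take part in one reaction in an FI role.

    `reaction` is a hypothetical dict mapping role name -> list of protein IDs.
    """
    proteins = set()
    for role in FI_ROLES:
        proteins.update(reaction.get(role, ()))
    # Sort so each unordered pair has a single canonical representation.
    return set(combinations(sorted(proteins), 2))


def pairs_from_complex(members):
    """Pair up all members of one protein complex."""
    return set(combinations(sorted(set(members)), 2))


# Toy reaction with made-up identifiers.
example = {"input": ["P1", "P2"], "catalyst": ["P3"], "output": ["P4"]}
print(sorted(pairs_from_reaction(example)))
print(sorted(pairs_from_complex(["C1", "C2", "C3"])))
```

Taking the union of these pair sets over all reactions and complexes would yield a candidate positive set of the kind described in the text.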
It is important to note that in Reactome a 'reaction' is a general term used to describe any discrete event in a biological process, including biochemical reactions, binding interactions, macromolecule complex assembly, transport reactions, conformational changes, and post-translational modifications [23]. We treat two members of the same protein complex as functionally interacting with each other because the activity of the complex as a whole is presumably functionally dependent on the presence of all of its subunits. Based on the above definition, we extracted 74,869 FIs from Reactome, and used these FIs to create a positive training set for the NBC. After filtering out FIs that did not have at least one feature derived from the data sources in Table 1, the positive data set comprised 45,079 FIs. Creating a good negative training set is more difficult than creating a positive set due to the incompleteness of our knowledge of protein interactions [56]: just because two proteins are not known to interact does not mean that they do not in fact interact. Research groups have addressed this problem using a variety of approaches, including choosing protein pairs from different disjunct cell compartments [57], or random pairs from all proteins [58]. For our NBC training, we followed the method in Zhang et al. [58] using random pairs selected from proteins in the filtered Reactome FI set. Choosing an appropriate prior probability or ratio between the positive and negative data sets is important for NBC training. We calculated the prior probability based on the total number of proteins in the filtered FIs from Reactome pathways, which was 5.7 × 10⁻³. To check the effect of the ratio between the sizes of the positive and negative data sets, we tested the NBC performance using a ratio of either 10 or 100. NBCs trained with these two ratios yielded similar true and false positive rates, which indicated that our NBC is robust against the size of the negative data set.
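To make the role of the prior concrete, here is a small naïve Bayes sketch over boolean features. It is a generic textbook formulation under stated assumptions, not the classifier from the article: the add-one smoothing, the two-feature toy training set, and all function names are hypothetical. Only the prior value of 5.7 × 10⁻³ is taken from the text.

```python
def train(examples, n_features, pseudo=1.0):
    """Estimate P(feature is True | class) for each feature.

    `examples` is a list of (features, is_fi) with boolean feature tuples.
    Add-one (Laplace) smoothing is an assumption made for this sketch.
    """
    def rates(rows):
        return [
            (sum(r[i] for r in rows) + pseudo) / (len(rows) + 2 * pseudo)
            for i in range(n_features)
        ]

    pos = [f for f, is_fi in examples if is_fi]
    neg = [f for f, is_fi in examples if not is_fi]
    return rates(pos), rates(neg)


def score(features, p_pos, p_neg, prior):
    """Posterior probability that a pair is an FI, assuming independent features."""
    like_pos, like_neg = prior, 1.0 - prior
    for x, pp, pn in zip(features, p_pos, p_neg):
        like_pos *= pp if x else 1.0 - pp
        like_neg *= pn if x else 1.0 - pn
    return like_pos / (like_pos + like_neg)


# Hypothetical toy data: two boolean features per protein pair.
toy = [
    ((True, True), True),
    ((True, False), True),
    ((False, False), False),
    ((False, True), False),
    ((False, False), False),
]
p_pos, p_neg = train(toy, n_features=2)
PRIOR = 5.7e-3  # prior probability reported in the text
print(score((True, True), p_pos, p_neg, PRIOR))
print(score((False, False), p_pos, p_neg, PRIOR))
```

With a prior this small, a pair needs several supporting features before its posterior rises appreciably, which is in line with the article's later remark that a pair must have multiple types of evidence to pass the threshold.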
The performance of machine learning classifier systems can be evaluated by cross-validation, or more stringently by using an independent data set. We used FIs extracted from pathways in other human curated pathway databases as a testing data set to evaluate the performance of our trained NBC. Figure 2 shows a receiver operating characteristic curve that relates true positive rates to false positive rates across a range of thresholds using this testing data set. We chose a threshold score of 0.50, which trades off a high specificity of 99.8% against a low sensitivity of 20%. The low sensitivity may result, in part, from high false negative rates existing in some of the data sets we used for NBC, especially in PPIs [59]. Figure 2 Receiver operating characteristic curve for NBC trained with protein pairs extracted from Reactome pathways as the positive data set, and random pairs as the negative data set. This curve was created using an independent test data set generated from pathways imported from non-Reactome pathway databases. The positions for the cutoff values 0.25, 0.50 and 0.75 are marked from right to left in the inset. The area under the curve (AUC) for this receiver operating characteristic (ROC) curve is 0.93. At the threshold score (0.50), a protein pair must have multiple types of FI evidence in order to be scored as a true FI (Table S1 in Additional file 1). While most (97%) of the predicted FIs have at least one PPI feature (Figure S1 in Additional file 1), there are no predictions supported solely by human PPI data, and fewer than 3% are supported solely by PPIs in human plus other species. This greatly reduces the weight given to raw human PPI features: the 44,819 human PPIs that went in to the classifier as features resulted in fewer than 15,000 predicted FIs, representing the removal of 68% of the raw PPIs. Most (75%) of the predicted FIs are derived from GO biological process term sharing and protein domain interactions in addition to PPIs. 
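The threshold trade-off described above (high specificity against low sensitivity at a cutoff of 0.50) can be illustrated with a short calculation. This is a sketch only: the scored pairs are made-up values, the function name is hypothetical, and treating a score equal to the threshold as a positive call is an assumption.

```python
def rates_at_threshold(scored_pairs, threshold=0.50):
    """Return (sensitivity, specificity) for one score cutoff.

    `scored_pairs` is a list of (score, is_true_fi). Both classes must be
    present, otherwise the divisions below are undefined.
    """
    tp = fn = tn = fp = 0
    for s, is_true_fi in scored_pairs:
        predicted = s >= threshold  # assumption: ties count as positive
        if is_true_fi and predicted:
            tp += 1
        elif is_true_fi:
            fn += 1
        elif predicted:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical scores for a handful of test pairs.
toy_scores = [
    (0.90, True), (0.20, True),
    (0.60, False), (0.10, False), (0.30, False), (0.05, False),
]
for cutoff in (0.25, 0.50, 0.75):
    sens, spec = rates_at_threshold(toy_scores, cutoff)
    print(f"cutoff={cutoff:.2f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

Sweeping the cutoff over many values and plotting sensitivity against one minus specificity gives a receiver operating characteristic curve of the kind shown in Figure 2.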
As a check on the classifier's ability to enrich for FIs, we compared the sharing of GO cellular component annotations (which includes compartments such as 'nucleoplasm') among raw human PPIs to the sharing of these annotations among predicted FIs. Since GO cellular component annotations were not used as a feature during NBC training, we reasoned that this assessment should be independent. Among raw PPIs, 62.9% share GO cellular component terms annotated for both proteins involved in the interaction. In contrast, 96.2% of the predicted FIs share this type of GO term (P-value 70% of altered genes (Figure 7a, b) by adding the minimum number of linker genes to form a fully connected subnetwork. Figure 7 Subnetworks for GBM clusters. (a) The TCGA cluster. (b) The Parsons cluster. Shared GBM candidate genes are shown in yellow, non-shared candidate genes in aqua, and linker genes used to connect cancer genes in red. The node size is proportional to the number of samples bearing displayed altered genes. Other colors and symbols are as in Figure 2. In the TCGA data set, 164 altered genes occurred in two or more GBM samples, 98 (60%, P-value = 3.2 × 10-7) of which were in the FI network. Of these, 71 are in the GBM subnetwork (72%, P-value 70%; Table 6) co-cluster into a small corner of huge FI space. These clusters are highly enriched in classical signaling pathways as well as the cell cycle, in agreement with pathway analyses performed by the original authors of the studies. We are also able to identify extensive crosstalk among the pathways, which indicates the complexities in tumorigenesis. Furthermore, we show how the FI network can reveal overlaps - and possibly common mechanisms - between the two GBM studies. This suggests a scenario in which the two cancer whole-genome screening projects are sampling from a common core cancer pathway that can be revealed by FI network analysis. 
Our result that most cancer candidate genes are clustered together is similar to what was reported by Cui et al. [83] based on a much smaller signaling interaction network generated from BioCarta [84] and CellMap [61], a small subset of our imported pathway databases. The reason why most cancer genes cluster closely together is still under investigation. The connection degree contributes to such clustering. However, the degree alone still cannot interpret the clustering based on our degree-based permutation test. We suspect that the major factor that governs the clustering is from FIs among cancer genes. A subset of cancer genes may form a small graph component via these FIs in the huge FI network. Such a small graph component may be used as a core to pull other cancer genes together to form a bigger cluster. Lin et al. [85] investigated network patterns for breast and colorectal cancers using a similar but smaller data set [86], and predicted that over half of the mutated proteins (59 out of 83) in breast cancers participate in an interaction cluster, but only a very small percentage of mutated proteins in colorectal cancers form an interaction cluster, which contains 12 proteins. We used different network analysis approaches based on a larger and more reliable FI network. Our results uncovered network modules that have been mutated in the majority of cancer samples and show that most recurrently mutated genes form a network cluster that is more interconnected than would be expected by chance in both breast and colorectal cancers. The results from multiple cancer types imply that the patterns revealed in our study might be common in all cancer types. Multiple sources of evidence show that tumorigenesis in human is a multi-step process and that genomes of tumors have sequence alterations at multiple sites [1]. Pathway analysis indicates that many pathways are mutated in cancer samples [2]. 
A striking finding from our study is that, for all cancer types examined so far, most samples have mutations involving a small number of discrete network modules. One of these modules typically corresponds to cell cycle regulation, DNA repair, and other nucleus-based processes, while another corresponds to signal transduction events in the plasma membrane and cytoplasm. This result suggests that the transformation from normal cells to malignant cells requires functional mutations in both nuclear and cytoplasm/plasma membrane-based pathways. However, our work also suggests that different cancer types have different network modules. A detailed network module based comparison analysis is likely to reveal different specific mechanisms in different cancer types. A major motivation for this work was the desire to integrate information from multiple pathway databases in order to reduce the fragmentation of knowledge stored in these useful resources. Even with common data models such as BioPAX [87], this is not easy to accomplish due to different focus of interests among the pathway databases, and different standard operating procedures, which allow the same series of biological reactions to be described quite differently from one database to the next. By reducing the pathway databases into a series of pairwise FIs, however, we have been able to merge five of the major pathway databases into a single uniform data model, although much information about the distinct roles of each protein has been lost during the process. Much of our current and future effort will be devoted to developing methods to map the FIs back to their original pathway contexts in order to find causal and directional relationships among the proteins. Conclusions We have built a FI network that covers close to half of human gene products. 
This functional network, which interconnects with the curated pathways available from Reactome and other human curated pathway databases, forms the foundation for a pathway-based data analysis system for high-throughput data analysis. We have applied this system to the analysis of two genome-wide GBM data sets and data sets from other cancer types and revealed common network patterns in cancer related genes and samples, suggesting that there exists a core network in GBM tumorigenesis. Materials and methods Importing data from non-Reactome pathway databases Data from four non-Reactome human-curated pathway databases were imported into the Reactome database (28 March 2009 release). These four databases are: Panther [27], CellMap [61], NCI Pathway Interaction Database [62], and KEGG [26]. To store these imported data into the Reactome database, the original Reactome schema was extended by adding one new class, Interaction, as a subclass to Event, and a new attribute, dataSource, to the top-most class DatabaseObject. The latter is used to track the original data sources of imported instances. The data formats used for importing were: BioPAX [87] for CellMap (released June 2006) and NCI Pathway Interaction Database (released March 2009), SBML [88] with Cell-Designer additions for Panther (version 2.5, August 2008), and KGML for KEGG (released on March 8 2009). After importing, all data from Reactome and the four external databases were maintained in a database, which was also used to store PPI data (see below). We have also imported human transcription factor and target interactions from the TRED database [64]. There are two types of data in the TRED database: human curated data from the published literature and predictions based on computational methods. Only the human data were imported to ensure high quality. Panther uses protein families generated from hidden Markov models for pathway annotations [60]. 
We only imported human UniProt proteins that could be mapped to pathway components reliably based on evidence codes used in the mapping file.

Importing protein-protein interaction datasets

Human PPIs were extracted from three PPI databases: BioGrid [29] (release 2.0.50, February 2009), HPRD [31] (released August 2007) and IntAct [33] (released March 2009). Data dumps in PSI-MI version 2.5 format from these three databases were processed by an in-house Java PSI-MI parser, converted into the extended Reactome data format, and stored in the extended Reactome database.

An odds ratio analysis was used to check the correlation between a PPI or other pairwise data set and FIs extracted from Reactome pathways. The control groups were generated from random pairs by using proteins from the Reactome FIs. The reported odds ratio values in the results section were based on ten permutations.

PPIs in S. cerevisiae, C. elegans and D. melanogaster

PPI data sets were downloaded from IntAct (released March 2009). Ensembl Compara [48] was used to map non-human proteins onto putative human orthologs.

Other data sets for naïve Bayes classifier

The protein domain-domain interaction data set was downloaded from Pfam [52] (release 23.0, July 2008). Two microarray co-expression data sets were used in the NBC: one downloaded from [89], which was compiled from 60 data sets that contained a total of about 4,000 microarrays [49], and another generated by Prieto et al. [50]. Protein GO annotations were downloaded from [90] (gene_association.goa_human, released March 2009). A PPI data set generated from a text-mining technique was kindly provided by Rzhetsky et al. [53].

Training of naïve Bayes classifier

We used an NBC to score protein-protein pairwise relationships. These pairwise relationships were extracted from nine data sources described above and used as features to train the NBC.
We used a closed-world hypothesis [91] to assign values to a protein pair: if there was a pairwise relationship between the two proteins in the data set, we used true, otherwise false. To integrate protein-protein pairwise relationships into the pathway context, we extracted pairwise relationships from reactions and complexes annotated in pathways by defining a FI as an interaction in which two proteins are involved in the same reaction as input, catalyst, activator and/or inhibitor, or as components in a complex. To train the NBC, we extracted a positive data set from Reactome using this definition, and filtered out pairs that did not have any feature. We generated a negative training set using random pairs from proteins in the positive data set.

Clustering of cancer genes in the functional interaction network

Two human GBM data sets [12,14] were used in our cancer data analysis. For sequence-altered genes in GBM, we chose mutated genes and CNV genes for each sample. Many genes exist in CNV chromosome fragments. For our study, we chose those genes that have been labeled 'Genes with gene expression correlated with copy number' in the TCGA data set (Table S3 in [14]) or 'Candidate target' and 'Other genes' in the Parsons data set (Tables S5 and S6 in [12]). Note that CNV genes in the TCGA data set have been pre-filtered based on gene expression data sets, while CNV genes in the Parsons data set have not: only SAGE expression data are available for that data set, and these were not used for filtering because of their lower sensitivity for under-regulated genes. To get CNV genes for each sample in the TCGA data set, we used a file, TCGA-GBM-RAE-genemap-n216-20080510-dscrt.txt, downloaded from [92], which lists CNVs for individual samples [93]. For edge-betweenness network clustering [75], we lumped all sequence-altered genes from samples together, searched FIs among these genes, and constructed a subnetwork based on these interactions.
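The subnetwork construction step described above (pool the altered genes from all samples, then keep only the FIs whose two partners are both in the pool) can be sketched as follows. This is a minimal illustration and not the authors' Java implementation: the function names, the sample identifiers, and the toy FI list are all hypothetical, and the edges do not represent real interaction data. The edge-betweenness clustering itself is not reproduced here; the connected components are shown only as a quick check of the resulting subnetwork.

```python
from collections import defaultdict


def induced_subnetwork(fi_edges, altered_genes):
    """Keep only the FIs whose two endpoints are both altered genes."""
    genes = set(altered_genes)
    return [(a, b) for a, b in fi_edges if a in genes and b in genes and a != b]


def connected_components(edges):
    """Return the connected components of an undirected edge list."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, components = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


# Hypothetical toy input: altered genes per sample and an invented FI list.
samples = {
    "s1": {"EGFR", "TP53"},
    "s2": {"PTEN", "EGFR", "NF1"},
    "s3": {"RB1", "CDK4"},
}
fi_edges = [
    ("EGFR", "PTEN"), ("EGFR", "GRB2"), ("TP53", "MDM2"),
    ("RB1", "CDK4"), ("GRB2", "NF1"), ("TP53", "CDK4"),
]

pooled = set().union(*samples.values())          # lump genes from all samples
sub = induced_subnetwork(fi_edges, pooled)       # FIs among the pooled genes
comps = connected_components(sub)
```

Genes that have no FI partner inside the pool (NF1 in this toy input) drop out of the subnetwork, since the subnetwork is defined by its interactions.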
A Java graph library, Jung2 [94], was used for edge-betweenness network clustering. Hierarchical clustering of TCGA GBM samples was done using the hclust method in R [95], with complete linkage and the binary distance between binary vectors generated for each GBM sample according to the occurrence of altered genes in network modules identified by the edge-betweenness algorithm. The heat map from hclust was drawn using the R package Heatplus [96].

To search for GBM cancer clusters, we collected sequence-altered genes occurring in two or more samples (GBM candidate genes), calculated pairwise shortest paths among genes in the FI network, hierarchically clustered them based on the average linkage method, and then selected a cluster containing more than 70% altered genes. To estimate the P-value for GBM cancer clusters, we did a 1,000-fold permutation test by randomly choosing the same number of genes as the GBM candidate genes from the biggest connected graph component and subjecting them to the same hierarchical clustering procedure used for the candidate genes. To construct a subnetwork spanning a set of genes, we implemented a search algorithm guided by the hierarchical clustering results based on the shortest path between two clusters in order to keep the number of linking genes to a minimum.

To calculate the P-value for the average shortest distance for cancer clusters, we performed a 1,000-fold permutation test by randomly selecting the same number of genes from the biggest connected graph component. To check whether the connection degree is a confounder for the clustering of cancer genes, we repeated this analysis after dynamically generating gene bins based on the sorted list of degrees in the cancer genes, and then randomly choosing genes from these bins in the same distribution as the cancer genes.
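The permutation test on the average shortest distance can be illustrated with a short sketch. It assumes an unweighted, undirected network, samples from a single connected component, and uses the common (hits + 1) / (permutations + 1) estimate for the P-value; that estimator, the function names, and the toy path-shaped network are assumptions made here for illustration, not details taken from the text. The degree-binned control is not included.

```python
import random
from collections import deque
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map from an edge list."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def bfs_distances(adjacency, source):
    """Unweighted shortest-path lengths from source to every reachable node."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return distances


def average_shortest_path(adjacency, genes):
    """Mean pairwise shortest-path length among genes (all must be connected)."""
    genes = list(genes)
    dist = {g: bfs_distances(adjacency, g) for g in genes}
    lengths = [dist[a][b] for a, b in combinations(genes, 2)]
    return sum(lengths) / len(lengths)


def permutation_p_value(adjacency, genes, component, n_perm=1000, seed=0):
    """Fraction of random gene sets at least as tightly connected as genes."""
    rng = random.Random(seed)
    observed = average_shortest_path(adjacency, genes)
    pool = sorted(component)
    hits = 0
    for _ in range(n_perm):
        random_genes = rng.sample(pool, len(genes))
        if average_shortest_path(adjacency, random_genes) <= observed:
            hits += 1
    # Assumed estimator: add one to avoid reporting a P-value of exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical toy network: a 12-node path g0 - g1 - ... - g11.
edges = [(f"g{i}", f"g{i + 1}") for i in range(11)]
adjacency = build_adjacency(edges)
candidate_genes = ["g4", "g5", "g6"]             # three adjacent nodes
observed, p_value = permutation_p_value(
    adjacency, candidate_genes, component=set(adjacency), n_perm=1000
)
```

In this toy network the three candidate genes are neighbors, so few random triples are as close together and the estimated P-value is small. On a real FI network, a degree-matched control of the kind described above would replace the uniform sampling in `rng.sample`.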
For other cancer types, we used Table S3 from [10] for somatic mutated genes in breast and colorectal cancers, and Tables S3, S4, and S5 from [11] for somatic mutated and CNV genes in pancreatic cancers. All network diagrams were drawn with Cytoscape [97]. The functional enrichment analyses for pathways and GO annotations were based on a binomial test. False discovery rates were calculated based on 1,000 permutations on all genes in the FI network.

Enhanced experimental SkyPainter

SkyPainter is a web application implemented in the Reactome web site for gene or protein over-representation analysis [23]. We augmented the function of the original SkyPainter by adding predicted FI data. The enhanced SkyPainter was implemented using the Google Web Toolkit [98].

Abbreviations

CNV: copy number variation; EGFR: epidermal growth factor receptor; FI: functional interaction; GBM: glioblastoma multiforme; GO: Gene Ontology; HPRD: Human Protein Reference Database; NBC: naïve Bayes classifier; PPI: protein-protein interaction; TCGA: The Cancer Genome Atlas; TGF: transforming growth factor; TNC: tenascin-C.

Authors' contributions

GW designed the study, constructed the FI network, did network analysis for cancer data sets, and drafted the manuscript. XF studied biological properties of the FI network. LS conceived of the study, participated in its design, and edited the manuscript. All authors have read and approved the final manuscript.

Supplementary Material

Additional file 1: Supplementary materials.
Additional file 2: FI database MySQL dump file.
Additional file 3: Six FI files plus one read-me file explaining what these files are.

                Author and article information

                Contributors
                +82-42-868-9352 , nosookim@kiom.re.kr
                Journal
                BMC Genomics
                BioMed Central (London)
                1471-2164
                18 September 2015
                2015
                Volume 16, issue 1, article 713
                Affiliations
                KM-Convergence Research Division, Korea Institute of Oriental Medicine, 1672 Yuseong-daero, Yuseong-gu, Daejeon, 305-811 Republic of Korea
                Department of Korean Medicine Life Science and Technology, Korea University of Science and Technology, Daejeon, Republic of Korea
                Article
                1918
                10.1186/s12864-015-1918-1
                4575430
                23c0c3f8-a444-421c-afb2-2298a822bd48
                © Kim et al. 2015

                Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

                History
                Received: 23 April 2015
                Accepted: 11 September 2015
                Categories
                Research Article
                Custom metadata
                © The Author(s) 2015

                Genetics
                connectivity map, descurainia sophia, helveticoside, microarray, reciprocal regulation
