      Early days: genomics and human responses to infection

      review-article

          Abstract

          DNA microarray-based gene transcript-profiling of the responses of primates to infection has begun to yield new insights into host–pathogen interactions; this approach, however, remains plagued by challenges and complexities that have yet to be adequately addressed. The rapidly changing nature over time of acute infectious diseases in a host, and the genetic diversity of microbial pathogens present unique problems for the design and interpretation of functional-genomic studies in this field. In addition, there are the more common problems related to heterogeneity within clinical samples, the complex, non-standardized confounding variables associated with human subjects and the complexities posed by the analysis and validation of highly parallel data. Whereas various approaches have been developed to address each of these issues, there are significant limitations that remain to be overcome. The resolution of these problems should lead to a better understanding of the dialogue between the host and pathogen.

          Most cited references (41)

          GoMiner: a resource for biological interpretation of genomic and proteomic data

          Rationale

          Gene-expression profiling and other forms of high-throughput genomic and proteomic studies are revolutionizing biology. That much is universally agreed. But the new technologies pose new challenges. The first is the experiment itself, the second is statistical analysis of results, the third is biological interpretation. That third challenge is often the most vexing and time-consuming. In gene-expression microarray studies, for example, one generally obtains a list of dozens or hundreds of genes that differ in expression between samples and then asks: 'What does all of this mean biologically?' The work of the Gene Ontology (GO) Consortium [1] provides a way to address that question. GO organizes genes into hierarchical categories based on biological process, molecular function and subcellular localization. In the past, this GO information was queried one gene at a time. Recently, batch processing has been introduced [2], but with a flat-format output that does not communicate the richness of GO's hierarchical structure. We have developed, and present here, the program package GoMiner as a freely available computer resource that fully incorporates the hierarchical structure of the Gene Ontology to automate the functional categorization of gene lists of any length. GoMiner is downloadable free of charge from [3] or [4]. GoMiner was developed particularly for biological interpretation of microarray data; one can input a list of under- and overexpressed genes and a list of all genes on the array, and then calculate enrichment or depletion of categories with genes that have changed expression. GoMiner thus facilitates analysis and organization of the results for rapid interpretation of 'omic' [5,6] data. For concreteness, the descriptions in this article will focus on applications to microarray data, but the range of uses is obviously much broader.
Overview of GoMiner

GoMiner takes as input two lists of genes: the total set on the array and the subset that the user flags as interesting (for example, altered in expression level). GoMiner displays the genes within the framework of the Gene Ontology hierarchy, both as a directed acyclic graph (DAG) and as the equivalent tree structure. The latter is similar in format to the visualization in the AmiGO browser display [1]. However, each category is annotated to reflect the number of genes from the user's experiment assigned to that category plus the number assigned to its progeny categories (Figure 1a). This computation does not double-count genes that appear more than once along the traversal. The user has the option of designating each gene within the 'interesting gene' list as exhibiting under- or overexpression. If that is done, genes displayed in the tree-like view are tagged with green down-arrows or red up-arrows, respectively. The most important parameter for purposes of interpretation is the enrichment (or depletion) of a category with respect to flagged genes (relative to what would have been expected by chance alone). This parameter will be discussed more extensively and more mathematically in the section on 'Statistical considerations'. In Figure 1a, the relative enrichment is indicated by blue numbers for total flagged genes and by red and green numbers for over- and underexpressed genes, respectively. The last number (blue) for each category is a two-sided p-value from Fisher's exact test. In GoMiner, clicking on a gene of interest in the tree-structure opens a menu that can be used to submit that gene as a query to an external data resource. The number of such links is being expanded rapidly, but currently included are LocusLink [7], PubMed [8], MedMiner [9,10], GeneCards [11], the NCBI's Structure Database [12], and BioCarta and KEGG pathway maps as implemented by the NCI Cancer Genome Anatomy Project (CGAP) [13].
These external databases provide GoMiner with a rich set of resources for bioinformatic integration. For example, the links with CGAP and LocusLink provide interaction with pathway maps, chromosome visualizations, a database of single nucleotide polymorphisms (SNPs), and the Mammalian Gene Collection (MGC). In GoMiner, clicking on a category instead of a gene brings up a second visualization (Figure 1b), a DAG programmed as a scalable vector graphic (SVG) that can be navigated fluently. Any of its nodes can be moused-over to list the flagged genes or clicked to highlight multiple pathways connecting it to the root. Detailed quantitative and statistical results are downloadable in several tab-delimited formats that can be read directly into a text file or a spreadsheet program for further analysis. For example, the spreadsheet data can be sorted by enrichment factor or p-value to focus attention on potentially interesting categories.

Development of GoMiner

GoMiner is based on a variety of open-source Java classes and developer tools, plus substantial in-house custom software engineering (Figure 2). We chose Java to achieve independence of operating system so that more researchers could use the tool. A custom graphical user interface (GUI) provides the user with flexibility and an intuitive view of biological relationships (Figure 1a). A complementary command-line version of GoMiner allows high-throughput applications and fluent integration with other programs. The heart of GoMiner is its processing engine (Figure 2), which parses input gene lists and retrieves database entries for association with GO categories (also called 'terms'). The GO categories and gene associations are stored in a relational database. To enhance the speed of data manipulation, we model the information in memory using a DAG data structure. The root is the topmost node: 'Gene Ontology'. The other nodes represent gene categories, and the connections represent relationships between categories.
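As an illustration of the in-memory DAG and the cumulative counting described above (a category's count includes its progeny, without double-counting genes reached more than once), the following is a minimal Python sketch. It is not GoMiner's Java implementation: the class name `CategoryNode`, its methods, and the gene names are hypothetical, and a set of visited nodes stands in for whatever dereplication mechanism GoMiner actually uses.

```python
class CategoryNode:
    """One GO category in an in-memory DAG (hypothetical class, for illustration)."""

    def __init__(self, name):
        self.name = name
        self.genes = set()      # genes annotated directly to this category
        self.children = []      # child categories; a child may have several parents

    def add_child(self, child):
        self.children.append(child)
        return child

    def genes_with_progeny(self):
        """Return genes in this category or any descendant, each counted once."""
        collected = set()
        visited = set()
        stack = [self]
        while stack:
            node = stack.pop()
            if id(node) in visited:   # already reached through another parent
                continue
            visited.add(id(node))
            collected |= node.genes
            stack.extend(node.children)
        return collected


# Toy diamond: "category C" has two parents, so it is reachable by two paths.
root = CategoryNode("Gene Ontology")
left = root.add_child(CategoryNode("category A"))
right = root.add_child(CategoryNode("category B"))
leaf = CategoryNode("category C")
left.add_child(leaf)
right.add_child(leaf)

left.genes = {"GENE1"}
leaf.genes = {"GENE1", "GENE2"}   # GENE1 also appears higher up the hierarchy

print(len(root.genes_with_progeny()))   # 2: GENE1 and GENE2, each counted once
```

Using sets makes the count independent of how many paths lead to a category, which is the property a DAG (as opposed to a tree) requires.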
Each category-node object contains its associated genes, functionality for counting genes, a flag for dereplication during counting, and results of statistical analyses. The gene-category associations are displayed in the form of a tree (Figure 1a) or, alternatively, in the form of a DAG (Figure 1b). We have developed GoMiner as a client-server application. The client, a Java application, communicates with a server-side database through JDBC. The client can run on platforms with Java run-time environment version 1.3 or higher. The primary client-user GUI, written using the Java Swing API, takes the form of a three-panel window in which the user can inspect GO categories and genes. The left-hand panel lists the genes, the databases from which their identities were derived, and optional up- and down-arrows to indicate under- or over-expression; the middle panel shows a tree visualization of categories in the style of the AmiGO browser [1] and, in addition, provides a visualization of the flagged genes in the particular microarray experiment. The right-hand panel shows all appearances within the GO hierarchy of any gene selected from the left or middle panel. The gene and category names are implemented as links to facilitate navigation of the data structures and access to public resources. A second type of visualization, the DAG (programmed as an SVG) shows in compact form the spanning hierarchy for all flagged genes. Optionally, it can include only nodes below a specified level if the entire DAG would be too large for easy visualization. The client application uses several open source components: the Berkeley Drosophila Genome Project (BDGP) Java Toolkit [14] for utility classes; Browser Launcher [15] for cross-platform web browser integration; Jakarta-ORO [16] for text processing; the Jena Semantic Web Toolkit [17] for manipulating RDF models; MySQL Connector/J [18] for database connectivity; and Xerces [19] for parsing XML. 
The back-end is a relational database server, which stores all gene ontology data. It includes an implementation in MySQL [20] of the GO Consortium database. In addition to the deployed components, we have introduced a number of open-source tools to enhance the development environment. In particular, the Concurrent Versions System (CVS) tool [21] coordinates program development at the Georgia Institute of Technology with that at the NCI, and also coordinates development within each of the groups. jUnit [22] automates unit- and system-level testing of the application.

Statistical considerations

The two-sided Fisher's exact test p-value for a category reflects a test of the null hypothesis that the category is neither enriched in, nor depleted of, flagged genes with respect to what would have been expected by chance alone. That is, it reflects the null hypothesis that, for each category, there is no difference between the proportion of genes in the category that are flagged and the proportion of genes outside the category that are flagged. The two groups of genes are mutually exclusive, as required for Fisher's exact test. Note that the predicate of the null hypothesis does not include 'the flagged genes that fall into the union of the rest of the categories'. That predicate would not ensure mutual exclusivity. The statistical question can be framed in terms of a classical 2 × 2 contingency table (Table 1). The null hypothesis can be formulated as $H_0: p_1 - p_2 = 0$, where $p_1 = n_f/n$ and $p_2 = (N_f - n_f)/(N - n)$. The two-sided p-value for Fisher's exact test is the sum of the probabilities of observing tables that are at least as extreme as the one actually observed, given that the null hypothesis is true [23-25]. The use of Fisher's exact test implies that we are conditioning on fixed marginal totals $(n, N - n, N_f, N - N_f)$ under the null hypothesis. For a discussion of the implications of fixed marginal values, see for example [23-25].
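To make the 2 × 2 formulation concrete, here is a minimal Python sketch of a two-sided Fisher's exact test with fixed marginal totals. It is not the authors' Java code. Two things are assumptions: the symbol meanings are inferred from the formulas above (N genes on the array, N_f of them flagged, n genes in the category, n_f of those flagged), and "at least as extreme" is taken in the common sense of tables whose probability is no greater than that of the observed table. The function name and the example counts are hypothetical.

```python
from math import comb


def fisher_two_sided(n_f, n, N_f, N):
    """Two-sided Fisher's exact p-value for one category.

    n_f: flagged genes in the category      n: all genes in the category
    N_f: flagged genes on the whole array   N: all genes on the array
    """
    def table_probability(k):
        # Hypergeometric probability of k flagged genes in the category,
        # conditional on the fixed margins n, N - n, N_f and N - N_f.
        return comb(N_f, k) * comb(N - N_f, n - k) / comb(N, n)

    lowest = max(0, n - (N - N_f))    # fewest flagged genes the margins allow
    highest = min(n, N_f)             # most flagged genes the margins allow
    observed = table_probability(n_f)
    tolerance = 1e-7                  # guards against floating-point ties
    total = sum(
        p
        for p in (table_probability(k) for k in range(lowest, highest + 1))
        if p <= observed * (1 + tolerance)
    )
    return min(1.0, total)


# Hypothetical category: 20 genes, 6 flagged, on an array of 1,399 genes
# of which 181 are flagged (the array sizes quoted for the benchmark study).
p_value = fisher_two_sided(n_f=6, n=20, N_f=181, N=1399)
print(f"two-sided p = {p_value:.4f}")
```

Only the four counts for the category under test enter the calculation, so no information about the rest of the hierarchy is needed.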
Note that the 2 × 2 table does not require any information about the topology of the hierarchy or about how many genes are included in any category other than the one to which the test is being applied. We used the two-sided version of the test, which detects a significant difference in the proportions in either direction (that is, when the proportion of flagged genes in the category is either higher or lower than would be expected by random chance). Clearly, calculations analogous to the ones used here for all flagged genes can also be applied to test separately the equivalent null hypotheses for under- and overexpressed genes. Unlike the Z-statistic with the hypergeometric distribution, and tests based on it, Fisher's exact test is appropriate even for categories containing a small number of genes. Our Java implementation of the Fisher's exact test is based on JavaScript by Øyvind Langsrud [26]. The following limitations of this statistical formulation should be borne in mind, and the p-values should be interpreted judiciously.

Random experimental and categorization error

Experimental error and any uncertainties in the classification of genes in GO are not included in the statistical model. Perhaps, given enough information (which we essentially never have) about those sources of error, they could be included in the statistical model, for example through a resampling technique.

Gene representation bias

The microarray gene set (or set from some other type of genomic or proteomic experiment) will generally be a biased representation of all genes. Therefore, enrichments and depletions, of necessity defined in terms of the genes studied, may be biased with respect to biological significance as well.
An alternative is to replace the list of the total set of genes on the microarray with a list of the total set of genes in the genome (or a representative sample), but that approach introduces another source of bias: genes not on the microarray are counted in determining N and n but have no chance to be flagged.

GO consortium database bias for human gene associations

The GO Consortium [1] provides a set of flat files that indicate the association between gene names and GO categories for several species [27]. Although the flat files for human are quite comprehensive, we found a low hit rate for GO annotation of human genes using the database created by the GO Consortium's downloaded MySQL script files [28]. The hit rates were low both when the gene names were used in the format of HUGO names and when the gene names were used in the format of 'HUGO_HUMAN.' We tried the latter format because the flat files often contained '_HUMAN' appended to the human gene names. In contrast, when we used a combination of mouse (MGI) and rat (RGD) association files, there were reasonable numbers of hits. Therefore, we now routinely use mouse and rat annotations for human data. We are currently augmenting the human associations in the GO Consortium database to provide a richer annotation of human gene names. This goal will be achieved by using the MatchMiner database to integrate the information in the GO Consortium database [27] and the Swiss-Prot, TrEMBL and TrEMBLnew databases [29], and GoMiner will implement this database for human data in the near term. The MySQL script files will be freely available and should represent an improvement over what is currently available to program developers and end-users.

Non-independence of gene data

Gene-expression values within a category may be correlated for any of several reasons.
They may represent the same gene, close family members with similar functions, genes in the same pathway or genes in alternative pathways for performing a biological function. Gene classifications in GO may be correlated for analogous reasons. How do such relationships affect the statistics? The answer is most easily seen by imagining a category containing nothing but five instances of the same gene (perhaps because five different identifiers were used and not recognized as representing the same gene). That category might appear either to be strikingly enriched (with five out of five genes flagged) or strikingly depleted (with none out of five genes flagged). But the appropriate value of n for determining statistical significance in those cases would be 1, not 5. GoMiner's companion program MatchMiner [30,31] handles this problem by identifying replicates of the same gene, even if they are represented by different identifiers. What about possible sources of correlation other than 'same-gene'? Do we want to dereplicate them as well? Generally, the answer is 'no'. Correlation of genes in the same pathway is precisely the phenomenon we are often trying to identify. We would not want a statistical test to adjust for (and, in effect, null out) the effect of such relationships. Close family members might be considered an intermediate case. The statistical model implemented in GoMiner assumes, as our state of prior knowledge, that we know when two 'genes' are identical but nothing about their relationship if they are not identical. That seems the only available course. However, for each category, GoMiner provides the gene identities and the numbers given in Table 1 – sufficient information for the knowledgeable user to decide to eliminate close family members or pathway partners if desired. 
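The 'five instances of the same gene' example can be sketched in a few lines of Python. This is only an illustration of same-gene dereplication, not MatchMiner's actual method, which is not described here; the function name, the identifiers, and the identifier-to-gene mapping are all hypothetical.

```python
def dereplicate(identifiers, id_to_gene):
    """Collapse identifiers that refer to the same gene into one entry.

    Identifiers missing from the mapping are kept as-is, reflecting the
    assumption that nothing is known about them beyond their own name.
    """
    return {id_to_gene.get(identifier, identifier) for identifier in identifiers}


# Hypothetical mapping: five different identifiers for a single gene.
id_to_gene = {
    "clone_001": "GENE_X",
    "clone_002": "GENE_X",
    "acc_A123": "GENE_X",
    "probe_77": "GENE_X",
    "GENE_X": "GENE_X",
}
category_ids = ["clone_001", "clone_002", "acc_A123", "probe_77", "GENE_X"]

print(len(category_ids))                            # 5 identifiers before dereplication
print(len(dereplicate(category_ids, id_to_gene)))   # 1 gene: the appropriate n
```

Only identity is collapsed here; genes that are merely related (family members, pathway partners) remain separate entries, in line with the reasoning above.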
The multiple comparisons problem

If one has not decided before analysis which particular gene category is to be examined, a correction should be made for the multiple opportunities to obtain a p-value indicating statistically significant enrichment or depletion. For example, with 1,000 categories, we would expect approximately 1,000 × 0.05 = 50 false positives simply by chance if we set the critical value at p = 0.05. The most common way to correct for this problem is that of Bonferroni (see, for example [32]), in which the critical value is divided by the number of trials (in this case, 1,000). However, that approach assumes independence of categories and is so conservative that it becomes extremely hard to detect true positives. A number of less conservative statistical methods have also been developed, but it is beyond the scope of this paper to review them here. An approach based on resampling will be incorporated into GoMiner in the coming months. Overall, the p-values quoted should be considered as heuristic measures, useful as indicators of possible statistical significance, rather than as the results of formal inference. The p-values can be used, for example, to sort categories to identify those of the most potential interest. As another useful measure, we have calculated the relative enrichment factor, $R_e$, defined as $R_e = (n_f/n)/(N_f/N)$ and shown as blue numbers in Figure 1a. The analogous quantities for overexpressed (red numbers) and underexpressed (green numbers) are also shown. Depletion is, of course, represented by an enrichment factor less than unity.

Benchmarking GoMiner on a biological problem

As a test, GoMiner was applied to the results of our cDNA microarray study of the molecular mechanisms by which drug resistance develops [33].
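For reference, the arithmetic of the multiple-comparisons example and the relative enrichment factor defined above can be written out as a short Python sketch. The function names are hypothetical, and the category counts in the last call are invented for illustration; only the array totals (1,399 genes, 181 flagged) come from the benchmark study.

```python
def expected_false_positives(num_categories, alpha):
    """Expected number of categories passing the critical value by chance alone."""
    return num_categories * alpha


def bonferroni_critical_value(num_categories, alpha):
    """Bonferroni correction: divide the critical value by the number of trials."""
    return alpha / num_categories


def relative_enrichment(n_f, n, N_f, N):
    """Relative enrichment factor (n_f / n) / (N_f / N); below 1 means depletion."""
    return (n_f / n) / (N_f / N)


print(f"{expected_false_positives(1000, 0.05):.0f}")    # 50
print(f"{bonferroni_critical_value(1000, 0.05):.5f}")   # 0.00005

# Hypothetical category with 20 genes, 6 of them flagged.
print(f"{relative_enrichment(6, 20, 181, 1399):.2f}")   # 2.32
```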
The DAG shown in Figure 1a was generated from that study, which used quadruplicate 'Oncochip' microarrays (Microarray Facility, Advanced Technology Center, NCI [34]) to compare gene expression profiles in a prostate cancer cell line (DU145) and a subline (RC0.1) selected from it for resistance to the topoisomerase 1-inhibitor 9-nitro-camptothecin. The microarray included 1,399 cancer-interesting genes. 181 of those genes differed in expression according to a threshold criterion (>1.5-fold difference). MatchMiner was used to translate IMAGE clone Ids for the 1,399 genes into HUGO names for input to GoMiner. Figure 1a shows that the category 'apoptosis regulator' was enriched 2.4-fold in genes with altered expression levels. More specifically, it was enriched 3.2-fold with underexpressed genes and 2.0-fold with overexpressed genes. Flow cytometric annexin V and TUNEL assays verified important differences in apoptotic potential between the cell lines, and analysis generated a novel hypothesis (the 'permissive apoptosis-resistance' hypothesis) for the relationship between apoptotic and cell-proliferation pathways in the development of drug resistance. Figure 1a provides more detailed information, indicating that these differences were focused in particular subcategories of apoptosis. Thus, GoMiner can help the user in at least two ways: it identifies categories enriched in, or depleted of, genes of interest; and it generates hypotheses to guide further research. Unfortunately for us, interpretive analysis of the DU145/RC0.1 study was initially done one gene at a time before development of GoMiner (and, in fact, motivated that development). Performing the GO analysis one gene at a time would have taken more than two solid hours at the computer for the 181 genes before getting to the much harder parts of the task: doing the same for the entire array (nominally > 15 hours), then collating and organizing the information for each GO category. 
In contrast, operating on a 266 MHz PC with 250 MB RAM, it took 90 seconds to browse for and load the files, then 30 seconds for GoMiner to process the entire array of 1,399 genes and display the flagged and unflagged genes in their hierarchical context. In another test, running 900 flagged genes and all of HUGO (15,000 genes) took 4 minutes and 40 seconds on the same computer. Overall, the processing time was essentially linear with respect to the total number of genes (time in minutes = 0.0003 × genes + 0.0656; $R^2$ = 0.998).

Comparison of GoMiner with related programs

Several other programs related to GoMiner have recently appeared. These include MAPPFinder [35,36], FatiGO [37], Onto-Express [2,38], and GoSurfer [39]. The following represents our best attempt at comparison, based on review of the available implementations and associated documentation as of January 2003. FatiGO is a web application. The current implementation is very restrictive in that the user must specify ahead of time one particular level of the GO hierarchy that is to be used for analysis of the data. The other available applications, including GoMiner, process data for the entire GO hierarchy and allow the user to select views of the results dynamically. In a trial using FatiGO's recommended search criteria with our standard test gene files, FatiGO did not find any GO categories with clusters of differentially expressed genes. Onto-Express is also implemented as a web application. Although more flexible than FatiGO, it is largely limited to a flat view of the biological world. Whereas GoMiner provides both tree and DAG views of the genes embedded within the GO hierarchy, Onto-Express does not provide any hierarchical structure (the fundamental defining feature of GO). Onto-Express lists enriched and depleted categories, but it does not provide a statistical analysis of the results to aid understanding.
'Version 2,' recently announced (at a price of $1,500 – $5,000), provides a p-value (computed by a method not specified in the announcement). GoSurfer is implemented as a Windows application. As such, it lacks the flexibility of platform-independence that Java confers upon GoMiner. GoSurfer is also rather inflexible in that the input identifiers are required to be specific Affymetrix probe sets. It is not clear whether other identifier types suggested in a figure on the web site have been implemented. In contrast, GoMiner uses HUGO gene names as input. These gene names are more convenient for human interpretation, and GoMiner's companion program MatchMiner [30,31] allows many other types of identifiers (listed at the end of this section) to be converted easily into HUGO gene names. The visual output of GoSurfer is in the form of a DAG. GoMiner uses a text-based tree as its primary visual output because the nodes of the DAG are inherently more difficult to label without creating unacceptable screen clutter. The DAG gives an intuitive feel for the overall complexity of the categorizations, but it is not particularly useful for detailed dynamic navigation or for examination of categorized genes. The tabular output of GoSurfer does not include the HUGO names, which we consider to be the most useful key to gene identity. In contrast to GoMiner, it appears that GoSurfer does not provide complete quantitative and statistical summary data. MAPPFinder is a pioneering project that integrates GO analysis and biological pathway maps. GoMiner also provides the potential for this type of integration, since each gene in the GoMiner tree classification is dynamically linked to the corresponding set of BioCarta and KEGG biological pathway maps. In addition to providing integration with biological pathway maps, GoMiner provides integration with chromosomal information via dynamic linking to LocusLink's chromosome viewer. 
GoMiner also provides dynamic linking to SNPs and MGC databases via LocusLink. MAPPFinder provides the fundamental tree representation of the GO hierarchy, with summary and statistical data in line with each category. However, unlike the tree implementation in GoMiner, it shows only the categories; the genes themselves are shown in an auxiliary table. In GoMiner, both the categories and the genes are seamlessly shown as integral components of the tree. MAPPFinder does not appear to include a DAG representation. In GoMiner, the DAG view provides a qualitative and quantitative picture of the often-complex, multiple parenthood of some categories. In our opinion, this type of visualization is complementary to the tree form and important to an appreciation of the complex, highly nonlinear relationships within biological systems and gene networks. This complexity is not easy for a human to infer from the tree representation. The GO consortium selected the DAG as its fundamental data structure (though not its visualization), in part because it includes the characteristics of a network that are not included in a tree. MAPPFinder is written in Microsoft's Visual Basic and is therefore restricted to running on PCs under Windows. In contrast, GoMiner is written in Java and runs on multiple operating systems. We have tested it on Windows XP, 2000, NT, and 98, as well as on Mac OS X, Solaris, Linux (Red Hat distribution), IRIX (SGI), and FreeBSD. See the GoMiner website for specific operating-system issues. We recently implemented an alternative command-line interface for GoMiner (S.N., M.S., D.W.K. and B.R.Z., unpublished work) to complement the GUI version. The command-line interface allows GoMiner to be integrated with other tools via scripts or pipes. Our website will post updated versions of the documentation and program as soon as comprehensive testing of this interface has been completed. 
In preliminary trials with the new interface we have routinely processed more than 2,000 datasets at a time through GoMiner. This high-throughput capability has made two further developments possible: first, randomization studies are being done to address the multiple-comparisons problem (that is, to estimate the fraction of false positives among the selected categories); second, the output data stream is being coupled with integrated downstream analysis for automated recognition of interesting results buried within a large number of exploratory experiments. The user can explore and visualize these interesting results with GoMiner's graphical user interface. The command-line interface also allows GoMiner to interact flexibly with its companion program MatchMiner. With MatchMiner as a 'preprocessor', GoMiner can take input data organized on the basis of 'omic' identifiers other than the HUGO names central to GO. MatchMiner currently resolves IMAGE clone ids, UniGene clusters, GenBank accession numbers, Affymetrix ids, chromosome locations, gene common names, and FISH clone ids, and greatly facilitates the preparation of microarray data for analysis in GoMiner. In conclusion, GoMiner will continue in development with a view to integration with other bioinformatic resources being generated by the NCI and NIH for use by the biomedical research community. GoMiner is flexible both because it is coded in Java to be platform-independent and because it can accommodate either the default GO hierarchy and gene associations or customized versions. The default is the GO Consortium's database of categories and gene associations as implemented on our server. However, the user can, if desired, edit categories and gene memberships using DAG-Edit, the BDGP Gene Ontology Editor Tool [40]. The edited database can then be accessed by GoMiner from a local server to accommodate domain- and expertise-specific applications. Another important type of flexibility is the wide range of uses. 
In this report, we have presented GoMiner in the context of microarray data, but the variety of applications is clearly much broader; it embraces the full range of genomic and proteomic studies.

            A module map showing conditional activity of expression modules in cancer.

            DNA microarrays are widely used to study changes in gene expression in tumors, but such studies are typically system-specific and do not address the commonalities and variations between different types of tumor. Here we present an integrated analysis of 1,975 published microarrays spanning 22 tumor types. We describe expression profiles in different tumors in terms of the behavior of modules, sets of genes that act in concert to carry out a specific function. Using a simple unified analysis, we extract modules and characterize gene-expression profiles in tumors as a combination of activated and deactivated modules. Activation of some modules is specific to particular types of tumor; for example, a growth-inhibitory module is specifically repressed in acute lymphoblastic leukemias and may underlie the deregulated proliferation in these cancers. Other modules are shared across a diverse set of clinical conditions, suggestive of common tumor progression mechanisms. For example, the bone osteoblastic module spans a variety of tumor types and includes both secreted growth factors and their receptors. Our findings suggest that there is a single mechanism for both primary tumor proliferation and metastasis to bone. Our analysis presents multiple research directions for diagnostic, prognostic and therapeutic studies.

              MAPPFinder: using Gene Ontology and GenMAPP to create a global gene-expression profile from microarray data

              Background

              DNA microarray experiments simultaneously measure the expression levels of thousands of genes, generating huge amounts of data. The analysis of these data presents a tremendous challenge to biologists and new tools are needed to help gain biological insights from these experiments. Although the data are generated for individual genes, examining a dataset on a gene-by-gene basis is time consuming and difficult to carry out across an entire dataset. One way of accelerating the pace of data analysis is to approach the data from a higher level of organization. This can be done using data-driven methods, such as hierarchical clustering and self-organizing maps [1,2], which identify groups of genes with similar expression patterns. A complementary approach is to view the data at the level of known biological processes or pathways. Identifying those groups of biologically related genes that are showing a large number of gene-expression changes will create an informative description of the biology that is occurring in a particular dataset, making it possible to generate new hypotheses and identify those specific areas of biology that warrant more detailed investigation. One tool that assists in the identification of important biological processes is GenMAPP (Gene MicroArray Pathway Profiler) [3], a program for viewing and analyzing microarray data on microarray pathway profiles (MAPPs) representing biological pathways or any other functional grouping of genes. When a MAPP is linked to a gene-expression dataset, GenMAPP automatically and dynamically color codes the genes on the MAPP according to criteria supplied by the user. GenMAPP is a useful starting point for pathway-based analysis of gene-expression data, but there are several critical requirements to be met before this tool can be used to identify correlated gene-expression changes across all biology.
On a practical level, pathway-based analysis of microarray data needs to be automated, so that all possible pathways can be explored. Identifying correlated gene-expression changes in an individual pathway is often interesting, but it is necessary to know if the gene-expression changes seen on a particular pathway are unique to this pathway or are occurring in many other pathways. Equally important to automation is expanding the pathway information that is digitally represented. GenMAPP currently has over 50 MAPP files depicting various biological pathways and gene families, but this is still only a small fraction of all known biology [3]. Several other pathway programs such as KEGG [4], EcoCyc/MetaCyc [5], Pathway Processor (which uses KEGG) [6] and ViMAc [7] are available for integration with microarray data analysis, but these programs focus on well-defined metabolic pathways, and like GenMAPP, would benefit from a broader base of pathway information. To address this issue, we have used information available from the Gene Ontology (GO) Consortium [8]. The GO Consortium is creating a defined vocabulary of terms describing the biological processes, cellular components and molecular functions of all genes. The GO is built in a hierarchical manner, with a parent-child relationship existing between GO terms. Curators at the public gene databases are assigning genes to GO terms to provide annotation and a biological context for individual genes. In addition to providing gene annotation, GO also provides a structure for organizing genes into biologically relevant groupings. These groupings can serve as the basis for identifying those areas of biology showing correlated gene-expression changes in a microarray experiment. While GO has been used to annotate microarray data both by hand and by some software packages [9,10,11], there has been no automated way to use it for pathway-based analysis. 
We have developed a tool called MAPPFinder that dynamically links gene-expression data to the GO hierarchy. For each of the 11,239 ([12]; as of May 6, 2002) GO biological process, cellular component and molecular function terms, MAPPFinder calculates the percentage of the genes measured that meet a user-defined criterion. This is done for each specific GO node, and for the cumulative total of the number of genes meeting the criterion in a parent GO term combined with all of its children, giving a complete picture of the number of genes associated with a particular GO term. Using this percentage and a z score (see Materials and methods), the user can rank the GO terms by their relative amounts of gene-expression changes. MAPPFinder therefore generates a gene-expression profile at the level of biological processes, cellular components and molecular functions, rapidly identifying those areas of biology that warrant further study (Figure 1). MAPPFinder and GenMAPP are both available free-of-charge at [13].

Results and discussion

To demonstrate the utility of MAPPFinder, we used the program to analyze the publicly available mouse microarray dataset, the FVB benchmark set for cardiac development, maturation and aging [14]. This dataset measures gene-expression levels in the hearts of 12.5-day embryos and adult mice. We have used the 12.5-day embryonic time point to identify those biological processes that show differentially expressed genes between embryonic and adult hearts. We ran the MAPPFinder analysis on this dataset using two criteria, either an increase (fold change > 1.2 and p < 0.05) or decrease (fold change < -1.2 and p < 0.05) in gene expression for the 12.5-day embryo. We chose this dataset for demonstration because of the large number of differences in gene expression observed in the 12.5-day embryo compared to the adult mouse heart tissue.
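The two criteria and the cumulative (parent plus children) percentage described above can be illustrated with a minimal Python sketch. This is not MAPPFinder code: the gene names, the toy two-term hierarchy and the helper names (`Measurement`, `nested_genes`, `percent_changed`) are hypothetical, and fold changes are assumed to be signed values, as the thresholds of 1.2 and -1.2 suggest.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    gene: str        # hypothetical gene identifier
    fold: float      # signed fold change (assumption: > 0 increased, < 0 decreased)
    p_value: float


def increased(m, fold_cutoff=1.2, p_cutoff=0.05):
    """Criterion for a significant increase: fold > 1.2 and p < 0.05."""
    return m.fold > fold_cutoff and m.p_value < p_cutoff


def decreased(m, fold_cutoff=1.2, p_cutoff=0.05):
    """Criterion for a significant decrease: fold < -1.2 and p < 0.05."""
    return m.fold < -fold_cutoff and m.p_value < p_cutoff


def nested_genes(term, children, direct_genes):
    """Genes annotated to a term or any of its descendants, each counted once."""
    genes = set(direct_genes.get(term, ()))
    for child in children.get(term, ()):
        genes |= nested_genes(child, children, direct_genes)
    return genes


def percent_changed(term, children, direct_genes, measurements, criterion):
    """Percentage of the measured genes under a term that meet the criterion."""
    measured = {m.gene for m in measurements}
    changed = {m.gene for m in measurements if criterion(m)}
    in_term = nested_genes(term, children, direct_genes) & measured
    if not in_term:
        return 0.0
    return 100.0 * len(in_term & changed) / len(in_term)


# Toy data (entirely made up for illustration).
children = {"parent term": ["child term"]}
direct_genes = {"parent term": ["g1", "g2"], "child term": ["g2", "g3", "g4"]}
measurements = [
    Measurement("g1", 1.5, 0.01),   # meets the increased criterion
    Measurement("g2", 1.1, 0.01),   # fold change too small
    Measurement("g3", -2.0, 0.03),  # meets the decreased criterion
    Measurement("g4", 3.0, 0.20),   # p value too large
]

up = percent_changed("parent term", children, direct_genes, measurements, increased)
down = percent_changed("parent term", children, direct_genes, measurements, decreased)
print(f"parent term: {up:.1f}% increased, {down:.1f}% decreased")
```

Using sets for the measured and changed genes means a gene annotated to both the parent and a child term contributes only once to the cumulative count.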
MAPPFinder linked the 9,946 probe sets measured in this experiment to the 11,239 GO terms [12] in the hierarchy and calculated the percentage of genes meeting the criterion and a z score for each GO term. Table 1 gives an overall summary of the linkages made between the dataset and GO and calculations carried out by MAPPFinder. Nearly half of the 9,946 probe sets measured in the FVB benchmark dataset were connected to a GO term, representing approximately 70% of the mouse genes associated with GO terms [15] and covering a good portion of what is currently known about mouse biology. The proportion of genes in the microarray dataset that link to GO terms will increase as more GO terms and gene associations are added by the Mouse Genome Database (MGD) [16]. After MAPPFinder assigns the genes in the microarray dataset to the GO structure, it calculates for each GO term the percentage and z score (see Materials and methods) for the genes that meet the user's criterion. These two values can be used to identify GO terms with an over- (or under-) represented number of gene-expression changes. The MAPPFinder results are displayed in two forms. The first is a GO browser that graphically displays the MAPPFinder results in the structure of the GO hierarchy (Figures 2a,3a). The second is a text file listing all the GO terms measured, ranked by the z score. The number of genes meeting the criterion, the number of genes measured in the experiment, and the number of genes assigned to each GO term by MGD are given, along with the respective percentages and z score, in the text file and GO browser (Figure 2b). Table 2 shows the list of process, component and function terms with a z score greater than 2 for the significantly increased and decreased criteria at the 12.5-day embryonic time point. GO terms that had fewer than 5 or more than 100 genes changed were removed from the list because these terms were either too specific or too general for our data analysis. 
This filter identified the top 108 (8.0%) GO terms for the significantly increased criterion and the top 63 (4.8%) GO terms for the significantly decreased criterion. The stringency of this filter can be increased or decreased by raising or lowering the z score cutoff, or by including terms with larger or smaller numbers of genes. The filtered list was then pruned by hand for related GO terms to remove any over-represented branches of the GO hierarchy (for the complete results, see Additional data files). When both a parent and a child term were present in the list, the parent term was removed if its presence was due entirely to genes meeting the criterion for the child term. The remaining terms on the list still have a large degree of interrelatedness, but have been retained here for completeness. The MAPPFinder results present a global picture of the biological processes, cellular components and molecular functions that are increased and decreased in the 12.5-day embryo compared with the adult mouse (Table 2). Using the criterion for a significantly increased gene-expression change, MAPPFinder primarily identified GO terms involved in cell division and growth. Notable GO terms include the processes 'mitotic cell cycle' (62.9% of 70 genes, z score of 8.1), 'mRNA splicing' (90.5% of 21 genes, z score of 7.5), and 'protein biosynthesis' (50% of 104 genes, z score of 6.8). The top-ranked component and function terms reflected the same biological processes. For example, the component term 'spliceosome' shows that 17 out of 20 genes (85%, z score of 6.7) were upregulated. The upregulation of these processes is consistent with the fact that cardiomyocytes remain mitotically active throughout embryonic development [17]. 
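The filtering and pruning steps described above (a z score greater than 2, between 5 and 100 genes changed, and removal of a parent term whose presence is explained entirely by a child) could be sketched in Python as follows. The authors pruned the list by hand, so the subset test used here is only one possible reading of their rule; the `TermResult` type, the function names and the toy terms are hypothetical.

```python
from typing import NamedTuple


class TermResult(NamedTuple):
    name: str
    changed_genes: frozenset   # genes under this term that meet the criterion
    z_score: float


def filter_terms(results, z_cutoff=2.0, min_changed=5, max_changed=100):
    """Keep terms with z > cutoff and 5..100 genes changed, ranked by z score."""
    kept = [
        r for r in results
        if r.z_score > z_cutoff
        and min_changed <= len(r.changed_genes) <= max_changed
    ]
    return sorted(kept, key=lambda r: r.z_score, reverse=True)


def prune_parents(results, children):
    """Drop a parent if a listed child already accounts for all its changed genes.

    Assumption: `children` maps a term name to its direct child term names.
    """
    by_name = {r.name: r for r in results}
    pruned = []
    for r in results:
        redundant = any(
            child in by_name and r.changed_genes <= by_name[child].changed_genes
            for child in children.get(r.name, ())
        )
        if not redundant:
            pruned.append(r)
    return pruned


# Toy example with made-up terms and genes.
six = frozenset(f"g{i}" for i in range(1, 7))
results = [
    TermResult("A", six, 3.0),                                    # parent of B, same genes
    TermResult("B", six, 4.0),                                    # child term
    TermResult("C", frozenset({"g1", "g2", "g3"}), 5.0),          # too few genes changed
    TermResult("D", six, 1.5),                                    # z score too low
    TermResult("E", six | {"g7"}, 2.5),                           # parent of B, one extra gene
]
children = {"A": ["B"], "E": ["B"]}

ranked = filter_terms(results)
final = prune_parents(ranked, children)
print([r.name for r in final])
```

Raising or lowering `z_cutoff`, `min_changed` and `max_changed` corresponds to changing the stringency of the filter.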
Apart from processes involved in cell division and growth, the MAPPFinder results indicate that the processes 'transmembrane receptor protein serine/threonine kinase signaling pathway' and 'induction of apoptosis' are upregulated, with a z score of approximately 2. The presence of the term 'transmembrane receptor protein serine/threonine kinase signaling pathway' is due to the upregulation of genes involved in transforming growth factor-β (TGFβ) receptor signaling, which is thought to regulate the induction of apoptosis required for morphogenesis during heart development [18,19]. Genes involved in energy metabolism showed the highest levels of downregulation in the 12.5-day embryo heart versus the adult heart. In particular, the process terms 'fatty acid metabolism' (63.3% of 30 genes, z score of 5.9) and 'main pathways of carbohydrate metabolism' (51.3% of 39 genes, z score 4.8), which is the parent of the terms 'glycolysis' and 'tricarboxylic acid cycle', indicate that metabolic genes as a whole are downregulated in an embryo when compared to an adult mouse. In addition, the component term 'mitochondrion' shows that 88 out of 187 genes (47.1%, z score of 9.1) are downregulated. The downregulation of genes involved in fatty-acid metabolism is consistent with research that has shown that the developing heart, unlike the adult heart, does not derive its energy from fatty acids [20]. Overall, the MAPPFinder results provide a global perspective of the processes that are up- and down-regulated in the 12.5-day embryonic heart compared to an adult heart. The results confirmed what was expected: when compared to the adult heart, the embryonic heart is undergoing increased cell division and growth and has decreased energy metabolism. 
In addition, the global gene-expression profile presented by MAPPFinder allows the gene-expression changes observed for cell division and growth and energy metabolism to be put in the context of other regulatory and developmental processes such as TGFβ signaling and apoptosis.

The MAPPFinder browser

Viewing the MAPPFinder results as a ranked list is informative, but it does not take full advantage of the fact that GO is arranged in a hierarchy. MAPPFinder also presents the results in the context of the GO hierarchy (Figures 2a,3a) showing the entire hierarchy, color-coded by the percentage of genes changed. Users can step through the hierarchy, expanding those branches of the tree that are showing gene-expression changes, moving from broad terms to more specific categories. Often the ranked list of terms will show many interrelated terms, and it is necessary to view the results in the hierarchy to identify the relationships among them. For example, the terms 'RNA metabolism', 'RNA processing', 'mRNA processing', and 'mRNA splicing' appear as upregulated in Table 2. However, the tree view (Figure 2a) clearly shows that mRNA splicing is a child term of both RNA splicing and mRNA processing, which are in turn child terms of RNA metabolism. Similarly, the terms 'main pathways of carbohydrate metabolism', 'catabolic carbohydrate metabolism', and 'glycolysis' also appear as downregulated in Table 2. The MAPPFinder browser (Figure 3a) shows that 'glycolysis' is related to 'main pathways of carbohydrate metabolism' through the hierarchical relationship between these terms. The MAPPFinder browser also provides three search and navigation functions. First, the user can search by a keyword or an exact GO term name. Second, the user can search by a gene identifier to find which GO term(s) the gene is associated with.
For example, searching for the gene alpha-myosin heavy chain using its SWISS-PROT identifier MYH6_MOUSE or its MGD identifier MGI:97255 finds the GO process terms 'striated muscle contraction', 'cytoskeleton organization and biogenesis', 'protein modification', and 'muscle development'. Third, the user can expand the GO tree automatically to show all nodes with a minimum number of genes or minimum percentage of genes meeting the criterion or with a minimum z score. The terms meeting the filter are highlighted in yellow to clearly indicate the results of the search. Once the GO terms of interest have been identified with MAPPFinder, the user will want to know exactly which genes are associated with these terms and exactly which genes are being differentially expressed. This can be accomplished using GenMAPP. Selecting a GO term in the MAPPFinder browser automatically builds a MAPP containing the genes associated with that GO term and all of its children, and opens this MAPP in GenMAPP. Figure 3b shows the MAPP generated by selecting the GO term 'glycolysis' in the MAPPFinder browser. The genes on the MAPP are color-coded with the same criteria used to calculate the MAPPFinder results, significantly increased and decreased at the 12.5-day embryo time point. Clicking on a gene on the MAPP opens a 'back page' containing annotations, gene-expression data and hyperlinks to that gene's page in the public databases. By integrating GenMAPP and MAPPFinder, it is possible to seamlessly move from a global gene-expression profile at the level of all biological processes, components and functions to a detailed description of the gene-expression levels for the specific genes involved. For example, a closer examination of the glycolysis MAPP indicates that hexokinase I is upregulated in the 12.5-day embryo and isoforms II and IV are downregulated, as compared with the adult heart. This is consistent with hexokinase I being the predominant isoform in the embryonic heart [21]. 
Expanding MAPPFinder beyond GO

GO is a good starting point for analyzing microarray data in the context of biological pathways, but this is by no means the only way to group related genes. Instead of representing each GO process as an alphabetical list on a MAPP, it would be more useful to represent the relationships between these genes as a fully delineated pathway. As a start in this direction, GenMAPP.org [13] has created over 50 MAPPs depicting metabolic pathways, signaling pathways and gene families. MAPPFinder can incorporate any MAPP file into its analysis to augment the GO hierarchy. For the FVB benchmark developmental dataset, we have run MAPPFinder on an archive of 54 mouse MAPPs available from [13] (see Additional data files for the complete results). These results for the 12.5-day embryonic time point agree with the GO results, showing that the expression of genes involved in the metabolic pathways 'tricarboxylic acid cycle' (83.3% of 12 genes measured, z score of 5.91) and 'fatty acid degradation' (69.2% of 13 genes measured, z score 4.82) is significantly decreased. In addition, the significantly increased criterion identified genes encoding ribosomal proteins (71.1% of 45 genes, z score 6.75) and genes involved in the cell cycle (53.3% of 15 genes, z score 2.4). The archive of MAPPs provided by GenMAPP is in no way comprehensive. The growth of this archive depends on assistance from the entire biological community. Our hope is that, as MAPPFinder users see the added utility of viewing the GO biological processes as fully delineated pathways, they will use GenMAPP to organize the gene lists into more descriptive biological pathways. Figure 3c gives an example of how the genes from the GO term 'glycolysis' can be rearranged using the tools in GenMAPP to depict the full pathway showing the direction of the enzymatic cascade, metabolic intermediates and cellular compartments. GenMAPP.org is currently accepting submissions of new MAPP files.
MAPPs contributed by the community will be included in the downloadable MAPP archive.

MAPPFinder is a necessary complement to current analysis tools

By approaching large datasets from a higher level of organization, MAPPFinder helps to ease the data analysis and shorten the time necessary to gain a biological understanding of the microarray data. MAPPFinder has greatly expanded current pathway-based tools by using the large amount of annotations available from the GO. This broad analysis will help identify biological processes that have not yet been implicated in a particular experimental condition and begin to make connections between biological processes previously thought to be unrelated. MAPPFinder is available for yeast, mouse and human data. We plan to extend the program to many of the other species that are in GO and updates will be available at [13].

Materials and methods

Gene-expression data

The publicly available mouse microarray dataset, the FVB benchmark set for cardiac development, maturation and aging, was obtained from the CardioGenomics Program for Genomics Applications [14]. These data compare healthy mouse hearts at different time points during development, using male and female FVB/N mice. Specifically, this dataset examines heart tissue from 12.5-day embryos, 1-day neonatal mice, 1-week mice, 4-week mice, and adult mice at 5 months and 1 year. Our analysis focused on the 12.5-day embryonic time point and the control adult mice. Three Affymetrix U74A version 1 arrays were used for each time point. For the embryonic time point, three hearts were pooled for each array because of their small size. To improve the statistical power in our analysis, the 5-month and the 1-year mice were combined into a single group of normal adult mice. Signal intensity values were obtained with Affymetrix MAS 5.0 software. Signal values less than 20 were raised to 20 and the log base 2 was taken.
Log folds were determined from the average of each time point when compared with the average of the combined control group. P values were calculated with a permutation t test. The statistical analysis was done using the multtest package of the R statistical programming language [22]. These data were imported into GenMAPP, and the resulting GenMAPP Expression Dataset file (.gex) was exported to MAPPFinder. MAPPFinder requires a user-defined criterion for a meaningful gene-expression change. In this case we combined a fold change with a statistical filter to determine significance. We are using a fold change of greater than 1.2 with a p value of less than 0.05 to define a significant gene-expression increase, and a fold change of less than -1.2 with a p value of less than 0.05 to define a significant gene-expression decrease. To determine the overall number of gene-expression changes in each GO term, an additional criterion of a fold change greater than 1.2 or less than -1.2 and a p value of less than 0.05 is used (data not shown). It is important to note that while we have used gene-expression data generated from Affymetrix GeneChips, data from other microarray platforms and other techniques such as SAGE (serial analysis of gene expression) can be used equally easily.

Linking the expression data to Gene Ontology

MAPPFinder builds a local copy of the GO hierarchy using the three ontology files (Process, Component and Function) available from GO [12]. The directed acyclic graph (DAG) structure of GO [23] allows a node to be a child of multiple parents. This makes the navigation, visualization and computation of the MAPPFinder results more difficult than if the GO were stored in a classical tree structure. To ease the programming necessary to implement the MAPPFinder algorithm, the DAG structure was converted to a classical tree.
For each node of the DAG that contained multiple parents, multiple copies were inserted into the tree representation of the GO using local identifiers to handle duplicate GO terms. This tree structure maintains the 'true path' rule enforced in the GO DAG structure. MAPPFinder handles this conversion internally, and to the user the GO hierarchy seen in the MAPPFinder browser will be identical to that seen in other GO browsers. The links between the GO terms and the genes in the expression dataset are made with the gene-association files [15]. These associations are taken from the European Bioinformatics Institute [24] for human genes, the Mouse Genome Database (MGD) [16] for mouse genes, and the Saccharomyces Genome Database (SGD) [25] for yeast genes. Currently, the genes in the input data must be identified with GenBank, SWISS-PROT or SGD identifiers. MAPPFinder uses a relational database to link the expression dataset to the gene-association files. The MAPPFinder database relates gene-expression data to the appropriate gene-identification systems for each species (Figure 1). For human data, the gene-association files use SWISS-PROT identifiers, requiring a SWISS-PROT-to-GenBank relational table to link datasets using GenBank accession numbers to the GO annotations. For yeast data, the gene-association files use SGD identifiers. A SWISS-PROT-to-SGD relational table is also included for expression datasets using SWISS-PROT identifiers. For mouse data, the GO gene-association files use MGD identifiers, requiring a GenBank-to-MGD relational table, and a SWISS-PROT-to-MGD relational table. MAPPFinder takes advantage of the fact that MGD is also related to UniGene, allowing additional ESTs that are not in the MGD-GenBank relational table to be used as gene identifiers. With this intermediate step, many more GenBank identifiers can be linked to GO annotations.
Currently, there is no direct relationship between SWISS-PROT and UniGene, so a similar intermediate step was not used for human data.

Calculating the MAPPFinder results

MAPPFinder calculates the percentage of genes measured within each GO term that meet a user-defined criterion, and this measurement is known as the 'percent changed'. MAPPFinder also calculates the percentage of the genes associated with a GO term that are measured in the experiment, and this measurement is known as the 'percent present'. Calculating the percent present is necessary to determine how well represented a GO term is in the dataset. The GO gene-association files [15] are potentially problematic, because they treat each GO term independently, removing the implicit parent-child relationship. As a result, looking at the GO terms individually is often uninformative because the number of genes associated with any one term is smaller than the actual number of genes involved in that process, component, or function. To address this issue, we calculate the nested percentage for a parent term with all its children below it in the hierarchy. By combining the child terms with their parent, the results incorporate genes associated with the entire branch of the hierarchy, providing a much more accurate representation of the number of genes involved in that process, component or function. As more specific branches of the GO are examined, the denominator of the two equations will become smaller and the user can find their desired level of specificity. One complication that arises from this method is that in some cases a gene is associated with both the parent and child terms or multiple child terms. When the percentages are calculated for the sub-tree, we ensure that each gene is only counted once, so that genes with multiple annotations are not weighted more heavily. Another complication that arises while calculating the MAPPFinder results is the issue of multiple probes of the same gene on the array.
In this case, the features or duplicate genes are clustered to one unique gene. If any of the instances of this gene on the array meet the user-defined criterion, then that gene meets the user-defined criterion. The number of unique genes is also used to calculate the z score, meaning that the statistics are based only on a single occurrence of each gene in the dataset. A statistical rating of the relative gene-expression activity in each MAPP and GO term is also provided. It is a standardized difference score (z score) using the expected value and standard deviation of the number of genes meeting the criterion on a GO term under a hypergeometric distribution. The z score is useful for ranking GO terms by their relative amounts of gene-expression changes. Positive z scores indicate GO terms with a greater number of genes meeting the criterion than is expected by chance. Negative z scores indicate GO terms with fewer genes meeting the criterion than expected by chance. A z score near zero indicates that the number of genes meeting the criterion approximates the expected number. Extreme positive scores identify the GO terms for which there is the greatest confidence that the correlation between the expression changes of the genes in the grouping is not occurring by chance alone. P values are not assigned to the GO terms or MAPPs because, while such a standardized difference score could approximate a normal z score for an individual MAPP, the lack of independence between GO terms and the multiple testing occurring among them most certainly makes the normal p value for such a z score unreliable. As a result, p values are not assigned to the GO terms and MAPPs. The z score is calculated by subtracting the expected number of genes in a GO term (or MAPP) meeting the criterion from the observed number of genes meeting the criterion, and dividing by the standard deviation of the observed number of genes.
The equation used is

$$z = \frac{\text{observed} - \text{expected}}{\text{standard deviation of observed}}$$

or

$$z = \frac{r - n\frac{R}{N}}{\sqrt{n\left(\frac{R}{N}\right)\left(1 - \frac{R}{N}\right)\left(1 - \frac{n-1}{N-1}\right)}}$$

where N is the total number of genes measured, R is the total number of genes meeting the criterion, n is the total number of genes in this specific MAPP, and r is the number of genes meeting the criterion in this specific MAPP. Therefore, if two GO terms contain the same number of genes, the term with the greater number of genes meeting the criterion will receive a higher score. Dividing by the standard deviation adjusts for the size of the GO term, ranking a GO term (or MAPP) with a large number of genes meeting the criterion higher than a GO term (or MAPP) with the same percentage of genes changed, but fewer total genes. The MAPPFinder results are generated in the GO browser for analysis in the context of the GO hierarchy and as tab-delimited text files that can be used for sorting and filtering the data in a spreadsheet program.

Additional data files

The following additional data files are available: The FVBN developmental data in the form of a GenMAPP expression dataset file (.gex). It contains the microarray dataset and the criteria used to define increased and decreased gene-expression change. It can be opened for editing in GenMAPP and is the appropriate data type for use with MAPPFinder. The FVBN developmental data as a database file generated by MAPPFinder (.gdb). It contains the relationships between the genes in the dataset and the GO hierarchy. The file can be opened for viewing in Microsoft Access. This file must be present to build GenMAPP MAPPs from existing MAPPFinder results. The MAPPFinder results for the 12.5-day embryos versus the adult mice are contained in the files: 12.5-day Embryo - significantly increased - Gene Ontology results, 12.5-day Embryo - significantly increased - Local results, 12.5-day Embryo - significantly decreased - Gene Ontology results, 12.5-day Embryo - significantly decreased - Local results, 12.5-day Embryo - All Changes - Gene Ontology results, 12.5-day Embryo - All Changes - Local Results.
These text files contain the MAPPFinder results for both criteria and both the GO hierarchy and the GenMAPP.org MAPPs. These files can be loaded into MAPPFinder for viewing in the MAPPFinder GO browser. These files are tab-delimited and can also be viewed as tables in Microsoft Excel. The 'All Changes' files contain the results for a criterion looking for either increased or decreased gene-expression changes. A zip file containing all additional data files is available.

Supplementary Material

Additional data file 1: The FVBN developmental data in the form of a GenMAPP expression dataset file (.gex).
Additional data file 2: The FVBN developmental data as a database file generated by MAPPFinder (.gdb).
Additional data file 3: 12.5-day Embryo - significantly increased - Gene Ontology results
Additional data file 4: 12.5-day Embryo - significantly increased - Local results
Additional data file 5: 12.5-day Embryo - significantly decreased - Gene Ontology results
Additional data file 6: 12.5-day Embryo - significantly decreased - Local results
Additional data file 7: 12.5-day Embryo - All Changes - Gene Ontology results
Additional data file 8: 12.5-day Embryo - All Changes - Local results
Additional data file 9: A zip file containing all additional data files.
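As an illustration of the z score described under 'Calculating the MAPPFinder results', the following Python sketch computes a standardized difference score from the four counts N, R, n and r defined there. It assumes the standard mean and standard deviation of a hypergeometric distribution; the function name `z_score` and the example counts are hypothetical and are not taken from MAPPFinder itself.

```python
import math


def z_score(N, R, n, r):
    """Standardized difference score for one GO term or MAPP.

    N: total number of genes measured
    R: total number of genes meeting the criterion
    n: number of measured genes in this term
    r: number of genes in this term meeting the criterion
    """
    if not (0 < R < N and 0 < n < N):
        # Outside this range the hypergeometric variance is zero or undefined.
        raise ValueError("need 0 < R < N and 0 < n < N")
    if not (0 <= r <= min(n, R)):
        raise ValueError("need 0 <= r <= min(n, R)")
    p = R / N                                   # overall fraction of genes changed
    expected = n * p                            # hypergeometric mean
    variance = n * p * (1 - p) * (1 - (n - 1) / (N - 1))
    return (r - expected) / math.sqrt(variance)


# Made-up counts: 100 genes measured, 20 changed overall,
# a term with 10 measured genes of which 5 changed.
print(round(z_score(N=100, R=20, n=10, r=5), 2))
```

A term whose observed count equals the expected count scores zero, and terms with more changed genes than expected score positive, matching the interpretation given in the text.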

                Author and article information

                Contributors
                Journal
                Curr Opin Microbiol
                Curr. Opin. Microbiol
                Current Opinion in Microbiology
                Elsevier Ltd.
                1369-5274
                1879-0364
                6 May 2006
                June 2006
                6 May 2006
                Volume: 9
                Issue: 3
                Pages: 312-319
                Affiliations
                [1] Division of Infectious Diseases and Geographic Medicine, Department of Medicine, Stanford University School of Medicine, 300 Pasteur Drive, Grant S-169, Stanford, CA 94305, USA
                [2] Department of Microbiology and Immunology, Stanford University School of Medicine, 279 Campus Drive, Beckman B403, Stanford, CA 94305, USA
                [3] Veterans Affairs Palo Alto Health Care System, Palo Alto, CA 94304, USA
                Article
                S1369-5274(06)00055-5
                10.1016/j.mib.2006.04.006
                7108404
                16679048
                de4111bc-5817-4994-b452-190538b43141
                Copyright © 2006 Elsevier Ltd. All rights reserved.

                Since January 2020 Elsevier has created a COVID-19 resource centre with free information in English and Mandarin on the novel coronavirus COVID-19. The COVID-19 resource centre is hosted on Elsevier Connect, the company's public news and information website. Elsevier hereby grants permission to make all its COVID-19-related research that is available on the COVID-19 resource centre - including this research content - immediately available in PubMed Central and other publicly funded repositories, such as the WHO COVID database with rights for unrestricted research re-use and analyses in any form or by any means with acknowledgement of the original source. These permissions are granted for free by Elsevier for as long as the COVID-19 resource centre remains active.

                History
                Categories
                Article

                Microbiology & Virology
