      Is Open Access

      Identification of AGXT2, SHMT1, and ACO2 as important biomarkers of acute kidney injury by WGCNA

      research-article


          Abstract

          Acute kidney injury (AKI) is a serious and frequently observed disease associated with high morbidity and mortality. Weighted gene co-expression network analysis (WGCNA) is a research method that converts the relationship between tens of thousands of genes and phenotypes into the association between several gene sets and phenotypes. We screened potential target genes related to AKI through WGCNA to provide a reference for the diagnosis and treatment of AKI. Key biomolecules of AKI were investigated based on transcriptome analysis. RNA sequencing data from 39 kidney biopsy specimens of AKI patients and 9 normal subjects were downloaded from the GEO database. By WGCNA, the top 20% of mRNAs with the largest variance in the data matrix were used to construct a gene co-expression network with a p-value < 0.01 as a screening condition, showing that the blue module was most closely associated with AKI. Thirty-two candidate biomarker genes were screened according to the threshold values of |MM| ≥ 0.86 (module membership) and |GS| ≥ 0.4 (gene significance), and protein-protein interaction (PPI) and enrichment analyses were performed. The three genes with the most connected nodes in the PPI network, alanine-glyoxylate aminotransferase 2 (AGXT2), serine hydroxymethyltransferase 1 (SHMT1) and aconitase 2 (ACO2), were selected as the central genes. A rat AKI model was constructed, and the mRNA and protein expression levels of the central genes in the model and control groups were verified by PCR and immunohistochemistry experiments. The relative mRNA expression and protein levels of AGXT2, SHMT1 and ACO2 were decreased in the model group. In conclusion, we inferred that there is a close association between the AGXT2, SHMT1 and ACO2 genes and the development of AKI, and that the down-regulation of their expression levels may induce AKI.

          Most cited references (45)

          Is Open Access

          WGCNA: an R package for weighted correlation network analysis

          Background Correlation networks are increasingly being used in bioinformatics applications. For example, weighted gene co-expression network analysis is a systems biology method for describing the correlation patterns among genes across microarray samples. Weighted correlation network analysis (WGCNA) can be used for finding clusters (modules) of highly correlated genes, for summarizing such clusters using the module eigengene or an intramodular hub gene, for relating modules to one another and to external sample traits (using eigengene network methodology), and for calculating module membership measures. Correlation networks facilitate network based gene screening methods that can be used to identify candidate biomarkers or therapeutic targets. These methods have been successfully applied in various biological contexts, e.g. cancer, mouse genetics, yeast genetics, and analysis of brain imaging data. While parts of the correlation network methodology have been described in separate publications, there is a need to provide a user-friendly, comprehensive, and consistent software implementation and an accompanying tutorial. Results The WGCNA R software package is a comprehensive collection of R functions for performing various aspects of weighted correlation network analysis. The package includes functions for network construction, module detection, gene selection, calculations of topological properties, data simulation, visualization, and interfacing with external software. Along with the R package we also present R software tutorials. While the methods development was motivated by gene expression data, the underlying data mining approach can be applied to a variety of different settings. Conclusion The WGCNA package provides R functions for weighted correlation network analysis, e.g. co-expression network analysis of gene expression data. The R package along with its source code and additional material are freely available at .

            Acute kidney injury.

            Acute kidney injury (formerly known as acute renal failure) is a syndrome characterised by the rapid loss of the kidney's excretory function and is typically diagnosed by the accumulation of end products of nitrogen metabolism (urea and creatinine) or decreased urine output, or both. It is the clinical manifestation of several disorders that affect the kidney acutely. Acute kidney injury is common in hospital patients and very common in critically ill patients. In these patients, it is most often secondary to extrarenal events. How such events cause acute kidney injury is controversial. No specific therapies have emerged that can attenuate acute kidney injury or expedite recovery; thus, treatment is supportive. New diagnostic techniques (eg, renal biomarkers) might help with early diagnosis. Patients are given renal replacement therapy if acute kidney injury is severe and biochemical or volume-related, or if uraemic-toxaemia-related complications are of concern. If patients survive their illness and do not have premorbid chronic kidney disease, they typically recover to dialysis independence. However, evidence suggests that patients who have had acute kidney injury are at increased risk of subsequent chronic kidney disease. Copyright © 2012 Elsevier Ltd. All rights reserved.

              Geometric Interpretation of Gene Coexpression Network Analysis

              Introduction Many biological networks share topological properties. Common global properties include modular organization [1],[2], the presence of highly connected hub nodes, and approximate ‘scale free topology’ [3],[4]. Common local topological properties include the presence of recurring patterns of interconnections (‘network motifs’) in regulation networks [5]–[7]. One goal of this article is to describe existing and novel network concepts (also known as network statistics or indices [8]) that can be used to describe local and global network properties. For example, the clustering coefficient [9] is a network concept, which measures the cohesiveness of the neighborhood of a node. We are particularly interested in network concepts that are defined with regard to a ‘gene significance measure’. Gene significance measures are of great practical importance since they allow one to incorporate external gene information into the network analysis. In functional enrichment analysis, a gene significance measure could indicate pathway membership. In gene knock-out experiments, gene significance could indicate knock-out essentiality. We study gene significance measures since a microarray sample trait (e.g., case control status) gives rise to a statistical measure of gene significance. For example, the Student t-test of differential expression leads to a gene significance measure. Many traditional microarray data analysis methods focus on the relationship between the microarray sample trait and the gene expression data. For example, gene filtering methods aim to find a list of (differentially expressed) genes that are significantly associated with the microarray sample trait; another example are microarray-based prediction methods that aim to accurately predict the sample trait on the basis of the gene expression data. 
Gene expression profiles across microarray samples can be highly correlated and it is natural to describe their pairwise relations using network language. Genes with similar expression patterns may form complexes, pathways, or participate in regulatory and signaling circuits [10]–[12]. Gene coexpression networks have been used to describe the transcriptome in many organisms, e.g., yeast, flies, worms, plants, mice, and humans [13]–[23]. Gene coexpression network methods have also been used for typical microarray data analysis tasks such as gene filtering [19], [24]–[26] and outcome prediction [27],[28]. While the utility of network methods for analyzing microarray data has been demonstrated in numerous publications, the utility of microarray data analysis techniques for solving network theoretic problems has not yet been fully appreciated. One goal of this article is to show that simple geometric arguments can be used to derive network theoretic results if the networks are defined on the basis of a correlation matrix.

Definition of Gene Coexpression Networks

Although many of our network concepts will be useful for general networks, we are particularly interested in gene coexpression networks (also known as association-, influence-, relevance-, or correlation networks). Gene coexpression networks are built on the basis of a gene coexpression measure. The network nodes correspond to genes, or more precisely to gene expression profiles. The $i$th gene expression profile $x_i$ is a vector whose components report the gene expression values across $m$ microarrays. We define the coexpression similarity $s_{ij}$ between genes $i$ and $j$ as the absolute value of the correlation coefficient between their expression profiles: $s_{ij} = |\mathrm{cor}(x_i, x_j)|$. Using a thresholding procedure, this coexpression similarity is transformed into a measure of connection strength (adjacency).
An unweighted network adjacency $a_{ij}$ between gene expression profiles $x_i$ and $x_j$ can be defined by hard thresholding the coexpression similarity $s_{ij}$ as follows: $a_{ij} = 1$ if $s_{ij} \geq \tau$ and $a_{ij} = 0$ otherwise (1), where $\tau$ is the “hard” threshold parameter. Thus, two genes are linked ($a_{ij} = 1$) if the absolute correlation between their expression profiles reaches or exceeds the (hard) threshold $\tau$. Hard thresholding of the correlation leads to simple network concepts (e.g., the gene connectivity equals the number of direct neighbors) but it may lead to a loss of information: if $\tau$ has been set to 0.8, there will be no link between two genes if their correlation equals 0.799. To preserve the continuous nature of the coexpression information, one could simply define a weighted adjacency matrix as the absolute value of the gene expression correlation matrix, i.e., $[a_{ij}] = [s_{ij}]$. However, since microarray data can be noisy and the number of samples is often small, we and others have found it useful to emphasize strong correlations and to punish weak correlations. It is natural to define the adjacency between two genes as a power of the absolute value of the correlation coefficient [19],[24]: $a_{ij} = s_{ij}^{\beta} = |\mathrm{cor}(x_i, x_j)|^{\beta}$ (2), with $\beta \geq 1$. This soft thresholding approach leads to a weighted gene coexpression network. We present empirical results for weighted and unweighted networks in the main text, Text S1, Text S2, and Text S3.

Social Network Analogy: Affection Network

Since humans are organized into social networks, social network analogies should be intuitive to many readers. Therefore, we will refer to the following ‘affection network’ throughout this article. Assume that $n$ individuals filled out an interest questionnaire, which was used to define a pairwise similarity score $s_{ij}$. For convenience, we assume that the similarity measure takes on values between 0 and 1. Our definition of the affection network is based on the following assumption: the more similar the interests between two individuals, the more affection they feel for each other.
More specifically, we assume that the affection (adjacency) $a_{ij}$ between two individuals is proportional to their similarity on a logarithmic scale, i.e., $\log(a_{ij}) = \beta \log(s_{ij})$ (3). This is equivalent to our soft thresholding approach $a_{ij} = s_{ij}^{\beta}$ (Equation 2). A soft threshold $\beta = 2$ implies that the affection $a_{ij}$ equals 0.25 if the similarity $s_{ij}$ equals 0.5.

Results

Gene Significance Based on a Microarray Sample Trait

Many network applications use at least one gene significance measure. Abstractly speaking, we define a gene significance measure as a function $GS$ that assigns a nonnegative number to each gene; the higher $GS_i$, the more biologically significant is gene $i$. We assume that the minimum gene significance is 0. For example, if a statistical significance level (p-value) is available for each gene, the gene significance of the $i$th gene can be defined as minus log of the p-value, i.e., $GS_i = -\log(p_i)$. In this article, we are particularly interested in gene significance measures that are based on a microarray sample trait, e.g., a clinical outcome. The microarray sample trait $T = (T_1, \ldots, T_m)$ may be quantitative (e.g., body weight) or binary (e.g., case control status). Since our goal is to provide a simple geometric interpretation of coexpression network analysis, we define the trait-based gene significance measure by raising the absolute value of the correlation between the $i$th gene expression profile $x_i$ and the clinical trait $T$ to a power $\beta$: $GS_i = |\mathrm{cor}(x_i, T)|^{\beta}$ (4). Although any power $\beta$ could be used in Equation 4, we use the same power as in Equation 2 to facilitate a simple geometric interpretation.

Geometric Interpretation Using a Hypersphere

We find it convenient to express network quantities in terms of correlation coefficients since the correlation between two vectors can be interpreted as the cosine of the angle between them (measured in radians) if the vectors are scaled to have a mean of 0.
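The thresholding rules and the trait-based gene significance described above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function names and the toy expression profiles are hypothetical, and the hard threshold is assumed to link two genes when the absolute correlation is at least τ. The `pearson` helper computes the correlation as the cosine of the angle between the mean-centered vectors, which is the geometric reading used in the text.

```python
import math

def pearson(x, y):
    """Pearson correlation, computed as the cosine between mean-centered vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cx = [v - mean_x for v in x]
    cy = [v - mean_y for v in y]
    dot = sum(a * b for a, b in zip(cx, cy))
    return dot / math.sqrt(sum(a * a for a in cx) * sum(b * b for b in cy))

def soft_threshold(s, beta):
    """Soft thresholding (Equation 2): similarity s raised to the power beta."""
    return s ** beta

def hard_adjacency(x, y, tau):
    """Hard thresholding (Equation 1), assuming a link when |cor| >= tau."""
    return 1 if abs(pearson(x, y)) >= tau else 0

def soft_adjacency(x, y, beta):
    """Weighted adjacency: |cor(x, y)| ** beta with beta >= 1."""
    return soft_threshold(abs(pearson(x, y)), beta)

def gene_significance(x, trait, beta):
    """Trait-based gene significance (Equation 4): |cor(x, T)| ** beta."""
    return abs(pearson(x, trait)) ** beta

# Hypothetical expression profiles across five samples and a hypothetical trait.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x3 = [1.0, 3.0, 2.0, 5.0, 4.0]
trait = [10.0, 12.0, 15.0, 19.0, 24.0]

quarter = soft_threshold(0.5, 2)        # the affection example: 0.5 ** 2
link = hard_adjacency(x1, x3, 0.85)     # cor(x1, x3) is 0.8, so no link at tau = 0.85
weight = soft_adjacency(x1, x3, 2)      # weighted alternative keeps the information
gs = gene_significance(x1, trait, 2)
```

With these toy profiles the hard threshold discards the pair entirely, while the soft threshold keeps a weight of about 0.64, which illustrates the loss of information that motivates weighted networks.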
Since the correlation is scale-invariant, i.e., $\mathrm{cor}(a x_i + b, c x_j + d) = \mathrm{cor}(x_i, x_j)$ for positive scaling constants $a$ and $c$, we can assume without loss of generality that the vectors $x_i$ have a mean 0 and are of the same length. In other words, they correspond to points on a hypersphere. The network adjacency $a_{ij}$ is a monotonically decreasing function of the angle $\theta_{ij}$ between the two scaled expression profiles if $0 \leq \theta_{ij} \leq \pi/2$. When the angle $\theta_{ij}$ equals 0 or $\pi/2$, the adjacency equals 1 or 0, respectively. The network adjacency is a monotonically decreasing function of the length of the shortest path (geodesic) between the two points on the hypersphere. Soft thresholding methods (Equation 2) preserve the continuous nature of these distances. The higher the soft threshold $\beta$, the more weight is assigned to short geodesic distances compared to large distances. Since the trait-based gene significance measure $GS_i = |\mathrm{cor}(x_i, T)|^{\beta}$ (Equation 4) is scale-invariant, the sample trait $T$ can also be considered a point on the hypersphere. Analogous to the network adjacency, the smaller the geodesic distance between the $i$th gene expression profile and the trait $T$, the higher the gene significance of the $i$th gene. In other words, the smaller the angle between the sample trait and the expression profile, the more significant is the gene.

A Motivational Example

As a motivational example, we study the pairwise correlations among 498 genes that had previously been found to form a sub-network related to mouse body weight. The microarray data measure the expression levels in multiple tissue samples (liver, adipose, brain, muscle) from male and female mice of an F2 intercross. Approximately 100 tissue samples are available for each gender/tissue combination. The biological significance of this subnetwork is described in [23],[26]. Here we focus on the mathematical and topological properties of the pairwise absolute correlations $a_{ij} = |\mathrm{cor}(x_i, x_j)|$ between the genes.
For each gender and tissue type Figure 1A depicts a hierarchical cluster tree of the genes. Figure 1B shows the corresponding heat maps, which color-code the absolute pairwise correlations aij . As can be seen from the color bar underneath the heat maps, red and green in the heat map indicate high and low absolute correlation, respectively. The genes in the rows and columns of each heat map are sorted by the corresponding cluster tree. 10.1371/journal.pcbi.1000117.g001 Figure 1 This motivational example explores the pairwise absolute correlations aij  = |cor(xi ,xj )| among 498 genes in different mouse tissues. The biological significance of this network is described in [23],[26]. Each figure panel contains 8 subfigures for different genders and tissue types (liver, adipose, brain, muscle). (A) An average linkage hierarchical cluster tree of the genes. (B) The corresponding heat maps, which color-code the absolute pairwise correlations aij : red and green in the heat map indicate high and low absolute correlation, respectively. The genes in the rows and columns of each heat map are sorted by the corresponding cluster tree. (C) The relationship between gene significance GS (y-axis) and connectivity (x-axis). The gene significance of the ith gene was defined as the absolute correlation between the ith gene expression profile and mouse body weight. The hub gene significance HGS (Equation 13) is defined as the slope of the red line, which results from a regression model without an intercept term. It is visually obvious that the heat maps and the cluster trees of different gender/tissue combinations can look quite different. Network theory offers a wealth of intuitive concepts for describing the pairwise relationships among genes that are depicted in cluster trees and heat maps. To illustrate this point, we describe several such concepts in the following. 
By visual inspection of Figure 1B, genes appear to be more highly correlated in liver than in adipose (a lot of red versus green color in the corresponding heat maps). This property can be captured by the concept of network density (defined below). The density of the female liver network is 0.39 while it is only 0.23 for the female adipose network. Another example for the use of network concepts is to quantify the extent of cluster (module) structure. In this example, branches of a cluster tree (Figure 1A) correspond to modules in the corresponding network. The cluster structure is also reflected in the corresponding heat maps: modules correspond to large red squares along the diagonal. Network theory provides a concept for quantifying the extent of module structure in a network: the mean clustering coefficient (defined below). The female liver, male liver and female brain networks have high mean clustering coefficients (mean ClusterCoef = 0.42, 0.43, 0.41, respectively). In contrast, the female adipose, male adipose, and male brain networks have lower mean clustering coefficients (mean ClusterCoef = 0.27, 0.27, 0.25, respectively). Difference in module structure may reflect true biological differences or they may reflect noise (e.g. technical artifacts or tissue contaminations). As another example for the use of network concepts, compare the cluster tree of the female brain network with that of the male brain network. The cluster tree of the female network appears to be comprised of a single large branch, i.e., a highly connected hub gene at the tip of the branch forms the center in this network. In contrast, the cluster tree corresponding to the male brain network appears to split into multiple smaller branches, i.e., no single gene forms the center. To measure whether a highly connected hub gene forms the center in a network, one can use the concept of centralization (defined below). 
The female brain and male brain networks have centralization 0.34 and 0.21, respectively. These examples illustrate that graph theory contains a wealth of network concepts that can be used to describe microarray data. But we will argue that microarray data analysis techniques can also be used to derive network theoretic results. For example, network theorists have long studied the relationship between gene significance and connectivity. Several network articles have pointed out that highly connected hub nodes are central to the network architecture [17], [29]–[32] but hub genes may not always be biologically significant [33]. To define a sample trait based gene significance measure (Equation 4), we define the gene significance of gene i as the absolute correlation between the gene expression profile xi and body weight T, i.e., GSi  = |cor(xi ,T)|. Figure 1C shows the relationship between this gene significance measure and connectivity in the different gender/tissue type networks. We find a strong positive relationship between gene significance and connectivity in the female and the male mouse liver networks. The positive relationship between gene significance and connectivity suggests that both variables could be used to implicate genes related to body weight. For example, we used connectivity as a variable in a systems biologic gene screening method [26]. While most network theorists would agree that connectivity is an important variable for finding important genes in a network [17],[19], the statistical advantages of combining gene significance and connectivity are not clear. Below, we use the geometric interpretation of coexpression network analysis to argue that intramodular connectivity can be interpreted as a fuzzy measure of module membership. Thus, a systems biologic gene screening method that combines a gene significance measure with intramodular connectivity amounts to a pathway based gene screening method. 
Empirical evidence shows that the resulting systems biologic gene screening methods can lead to important biological insights [23]–[26]. Before combining gene significance and connectivity in a systems biologic gene screening approach, it is important to study their relationship. Toward this end, we propose a measure of hub gene significance HGS as the slope of a regression line (through the origin) between gene significance and scaled connectivity. As can be seen from Figure 1C, the hub gene significance is high in liver and adipose tissues but it is low in brain and muscle tissues. Below, we use the geometric interpretation of coexpression networks to characterize coexpression networks that have high hub gene significance if the gene significance measure is based on a microarray sample trait $T$.

Network Concepts

Abstract definition of network concepts

We define network concepts for (weighted) undirected networks that can be represented by a symmetric adjacency matrix $A = [a_{ij}]$, where $1 \leq i, j \leq n$. We assume that the pairwise adjacency (connection strength) $a_{ij}$ takes on values in the unit interval, i.e., $0 \leq a_{ij} \leq 1$. For notational convenience, we set the diagonal elements to 1. In the Methods section, we define a network concept $NCF(A, GS)$ by evaluating a network concept function $NCF(\cdot,\cdot)$ on the adjacency matrix $A$ and/or a corresponding gene significance measure $GS$. This abstract definition will be useful in defining intramodular network concepts (e.g., Equation 17) and eigengene-based analogs of network concepts (e.g., Equation 30). In the following, we describe several network concepts including the connectivity, the maximum adjacency ratio, the density, and the centralization.

Connectivity and related concepts

The connectivity (also known as degree) of the $i$th gene is defined by $k_i = \sum_{j \neq i} a_{ij}$ (5). In unweighted networks, the connectivity $k_i$ equals the number of genes that are directly linked to gene $i$.
In weighted networks, the connectivity equals the sum of connection weights between gene $i$ and the other genes. The maximum connectivity is defined as $k_{\max} = \max_j(k_j)$ (6). The scaled connectivity $K_i$ of the $i$th gene is defined by $K_i = k_i / k_{\max}$ (7). By definition, $0 \leq K_i \leq 1$. Note that we distinguish the scaled from the unscaled connectivity by using an upper case “K” and a lower case “k”, respectively.

Social Network Interpretation of the Connectivity: For the aforementioned affection network (Equation 3), assume that the affection (adjacency) $a_{ij}$ equals 1 if two individuals strongly like each other; it equals 0.5 if they are neutral towards each other, and it equals 0 if they strongly dislike each other. Then the scaled connectivity $K_i$ is a measure of relative popularity: high values of $K_i$ indicate that the $i$th person is well liked by many others.

Potential Uses of the Connectivity: The connectivity is the most widely used concept for distinguishing the nodes of a network. As described in the motivational example and detailed below, intramodular connectivity can be used to define a systems biologic gene screening strategy that keeps track of module membership information [24].

Maximum adjacency ratio

For weighted networks, we define the maximum adjacency ratio of gene $i$ as follows: $MAR_i = \frac{\sum_{j \neq i} a_{ij}^2}{\sum_{j \neq i} a_{ij}}$ (8), which is defined if $k_i = \sum_{j \neq i} a_{ij} > 0$. One can easily verify that $0 \leq a_{ij} \leq 1$ implies $0 \leq MAR_i \leq 1$. Note that $MAR_i = 1$ if all nonzero adjacencies take on their maximum value of 1, which justifies the name “maximum adjacency ratio.” By contrast, if all nonzero adjacencies take on a small (but constant) value $a_{ij} = \varepsilon$, then $MAR_i = \varepsilon$ will be small.

Social Network Interpretation of the Maximum Adjacency Ratio: $MAR_i = 1$ suggests that the $i$th individual does not form neutral relationships; this individual either strongly likes or dislikes others. In contrast, $MAR_i = 0.5$ suggests the $i$th individual forms less intense relationships with others.
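The connectivity concepts above can be illustrated with a small sketch in plain Python. It assumes that the maximum adjacency ratio takes the form Σ a_ij² / Σ a_ij over j ≠ i, which is consistent with the properties stated in the text (it equals 1 when all nonzero adjacencies are 1, and ε when they all equal ε). The helper `star` and both toy networks are hypothetical and exist only for illustration.

```python
def connectivity(adj, i):
    """Connectivity of node i: sum of adjacencies to all other nodes."""
    return sum(a for j, a in enumerate(adj[i]) if j != i)

def scaled_connectivity(adj):
    """Connectivity of every node divided by the maximum connectivity."""
    k = [connectivity(adj, i) for i in range(len(adj))]
    k_max = max(k)
    return [k_i / k_max for k_i in k]

def max_adjacency_ratio(adj, i):
    """Assumed form of MAR: sum of squared adjacencies over sum of adjacencies."""
    off_diagonal = [a for j, a in enumerate(adj[i]) if j != i]
    total = sum(off_diagonal)
    if total == 0:
        raise ValueError("MAR is undefined for a node with zero connectivity")
    return sum(a * a for a in off_diagonal) / total

def star(weights):
    """Hypothetical helper: symmetric network in which node 0 is connected to
    node j + 1 with weights[j]; all other pairs are unconnected; diagonal is 1."""
    n = len(weights) + 1
    adj = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    for j, w in enumerate(weights):
        adj[0][j + 1] = adj[j + 1][0] = w
    return adj

# Node 0 has the same connectivity in both networks but very different MAR.
weak = star([0.25, 0.25, 0.25, 0.25])   # many moderate relationships
strong = star([1.0, 0.0, 0.0, 0.0])     # one very strong relationship

k_weak, mar_weak = connectivity(weak, 0), max_adjacency_ratio(weak, 0)
k_strong, mar_strong = connectivity(strong, 0), max_adjacency_ratio(strong, 0)
```

Both hub nodes have connectivity 1, yet the MAR separates the node with one strong relationship from the node with several moderate ones, which is the distinction the measure is meant to capture.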
Potential Uses of the Maximum Adjacency Ratio: Since MARi  = 1 for all genes in an unweighted network, the maximum adjacency ratio is only useful for weighted networks. The MAR can be used to determine whether a hub gene forms moderate relationships with a lot of genes or very strong relationships with relatively few genes. To illustrate this point, we show in the following simple example that the MAR can be used to distinguish nodes that have the same connectivity. Assume a network (labeled by I) for which the adjacency between node 1 and every other node equals a 1,j (I) = 1/(n−1). Then k 1 (I) = (n−1)/(n−1) = 1 and MAR 1 (I) = 1/(n−1). For a different network (labeled by II) where a 1,2 (II) = 1 and a 1,j (II) = 0 for j≥3, the connectivity k 1 (II) still equals 1 but MAR 1 (II) = 1. In weighted coexpression networks, we find empirically that MARi is often highly correlated with the connectivity Ki (see also Equation 36). As we demonstrate in Figure 2, the MARi is sometimes (but not always) superior to Ki when it comes to identifying biologically important intramodular hub genes. As aside, we mention that a directed network analog of MARi has been used in the analysis of metabolic fluxes [34]. 10.1371/journal.pcbi.1000117.g002 Figure 2 Relationships among maximum adjacency ratio, scaled connectivity, and gene significance. (A) The relationship between MARi (y-axis) and scaled connectivity Ki using the female mouse muscle tissue network described in the motivational example. The genes are colored red or black depending on whether they are significantly (p-value 0.90), typically dozens of candidates result. Although these candidates are statistically equivalent, they may differ in terms of biological plausibility or clinical utility. Network Modules Roughly speaking, we define network modules as groups of highly interconnected genes. 
As detailed in Text S1, Text S2, Text S3, and in our online R tutorials, we use a hierarchical clustering procedure to identify modules (clusters) as branches of the resulting cluster tree. A common but inflexible branch cutting method uses a constant height cutoff value. Alternatively, dynamic branch cutting adaptively chooses cutting values depending on the shape of the branch [40]. Each module is assigned a unique color label (Figure 3B). Our branch cutting algorithm only assigns module colors to branches whose size exceeds a user-specified threshold parameter. In practice, it is advisable to vary the minimum module size and other branch cutting parameters to determine how the results are affected by different parameter choices. An iterative approach for choosing the parameters could be defined by optimizing the module significance. This module detection approach has led to biologically meaningful modules in several applications [1], [8], [23]–[25], [33], [39]–[43] but our theoretical results transcend this particular module detection method. Any module detection method that results in clusters of highly correlated gene expressions could be used. Intramodular Network Concepts In the following, we assume that a module detection method (e.g., a clustering procedure) has found Q modules. We denote the adjacency matrix of the genes inside the qth module by A (q). Thus, A (q) represents a subnetwork comprised of the genes in the qth module. Analogously, we define GS (q) as the gene significance measure restricted to the module genes. Denote by n (q) the number of genes inside the qth module. Throughout the manuscript, we use the superscript (q) to denote quantities associated with the qth module. But for notational convenience, we sometimes omit (q) when the context is clear. 
We define an intramodular network concept $NCF(A^{(q)}, GS^{(q)})$ by evaluating a network concept function $NCF(\cdot,\cdot)$ on the adjacency matrix $A^{(q)}$ and/or a corresponding gene significance measure $GS^{(q)}$. For example, the intramodular connectivity is defined by $k_i^{(q)} = \sum_{j \neq i} a_{ij}^{(q)}$ (17), where $j$ indexes the genes in the $q$th module. Intramodular connectivity has been found to be an important complementary gene screening variable for finding biologically important genes [24],[25],[39]. We refer to the network significance (Equation 14) of a module network simply as the module significance measure, i.e., the module significance is the average gene significance of the module genes: $\mathrm{ModuleSignif}^{(q)} = \frac{1}{n^{(q)}} \sum_{i} GS_i^{(q)}$ (18).

Data Reduction Methods for Microarray Data

The high dimensionality of gene expression data has inspired two broad categories of data reduction techniques. The first category, often used by network theorists, is to reduce the gene coexpression networks into modules. Each module can be represented by a centroid, e.g., an intramodular hub gene. The second category, often used by microarray data analysts, reduces the gene expression data to a small number of components that capture the essential behavior of the expression profiles [27], [44]–[51]. One of our goals is to understand how the two categories of data reduction methods relate to each other. Here we use the singular value decomposition [44],[45],[48] since this will allow us to define a simple measure of factorizability (Equation 24).

Singular value decomposition

For the $q$th module, denote by $X^{(q)}$ the $n^{(q)} \times m$ matrix of $n^{(q)}$ gene expression profiles across $m$ microarrays: $X^{(q)} = \left(x_1^{(q)} \; x_2^{(q)} \; \cdots \; x_{n^{(q)}}^{(q)}\right)^T$ (19), where $x_i^{(q)}$ denotes the gene expression vector of the $i$th gene. The singular value decomposition (SVD) of $X^{(q)}$ is given by $X^{(q)} = U^{(q)} D^{(q)} (V^{(q)})^T$, where $U^{(q)}$ is an $n^{(q)} \times m$ matrix with orthonormal columns, $V^{(q)}$ is an $m \times m$ orthogonal matrix, and $D^{(q)}$ is an $m \times m$ diagonal matrix of the singular values $\{|d_l^{(q)}|\}$.
Specifically, $V^{(q)}$ and $D^{(q)}$ are given by $V^{(q)} = \left(v_1^{(q)} \; v_2^{(q)} \; \cdots \; v_m^{(q)}\right)$ and $D^{(q)} = \mathrm{diag}\left(|d_1^{(q)}|, |d_2^{(q)}|, \ldots, |d_m^{(q)}|\right)$ (20). The singular value decomposition of $X^{(q)}$ is closely related to the principal component analysis of the correlation matrix $COR = [\mathrm{cor}(x_i^{(q)}, x_j^{(q)})]$ whose entries correspond to the pairwise correlations between the rows (genes) of $X^{(q)}$. For example, the eigenvalues of the correlation matrix $COR$ are squares of corresponding singular values $|d_l^{(q)}|$. We assume that the singular values $|d_l^{(q)}|$ are arranged in decreasing order. Adapting terminology from [44], we refer to the first column of $V^{(q)}$ as the Module Eigengene: $E^{(q)} = v_1^{(q)}$ (21). For brevity, we sometimes drop the superscript $(q)$ and simply refer to $E$ as the eigengene. The module eigengene can be used to summarize and represent the expression profiles of the module genes, see Figure 4B. The proportion of variance explained by the module eigengene $E^{(q)}$ is defined as $\mathrm{VarExplained}(E^{(q)}) = \frac{|d_1^{(q)}|^2}{\sum_{l} |d_l^{(q)}|^2}$ (22).

Figure 4 (doi:10.1371/journal.pcbi.1000117.g004). Module eigengenes in the brain cancer gene coexpression network. (A) The pairwise scatter plots among the module eigengenes $E^{(q)}$ of different modules and cancer survival time $T$. Each dot represents a microarray sample. ME.blue denotes the module eigengene $E^{(blue)}$ of the blue module. Numbers below the diagonal are the absolute values of the corresponding correlations. Note that the module eigengenes of different modules can be highly correlated. The brown module eigengene has the highest absolute correlation (r = 0.20) with survival time. Frequency plots (histograms) of the variables are plotted along the diagonal. (B) Upper panel: heat map plot of the brown module gene expression profiles (rows) across the microarray samples (columns). Red corresponds to high and green to low expression values. Since the genes of a module are highly correlated, one observes vertical bands. (B) Lower panel: the values of the components of the module eigengene (y-axis) versus microarray sample number (x-axis).
Note that vertical bands of red (green) in the upper panel correspond to high (low) values of the eigengene in the lower panel. (C) The expression profile of the module eigengene (y-axis) is highly correlated with that of the most highly connected hub gene (x-axis). A linear regression line has been added. High Level View of Gene Coexpression Networks and Eigengene Networks The module eigengenes of different modules can be highly correlated (Figure 4A). Detecting a high correlation between module eigengenes may either be of biological interest (suggesting interactions between pathways) or it may be a methodological artifact (suggesting poorly defined modules that should be merged). The correlations between two eigengenes can be used to define eigengene coexpression networks [52], e.g., a weighted eigengene coexpression network can be defined as follows (23) where E (q) and E (p) represent the eigengenes of two distinct modules. Apart from correlating the module eigengenes of different modules to each other, one can relate the module eigengenes to an external microarray sample trait T to identify trait related modules. Thus, eigengene network analysis can be viewed as a network reduction scheme that reduces a gene coexpression network involving thousands of genes to an orders of magnitude smaller metanetwork involving module representatives (one eigengene per module). Unlike traditional microarray data reduction methods that impose orthogonality (e.g., principal component analysis) or independence (e.g., independent component analysis), gene coexpression network analysis can be considered a pathway-based data reduction method that allows dependencies between the modules. When focusing on the use of module eigengenes, network analysis can be considered a variant of oblique factor analysis. 
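The eigengene construction and the eigengene network described above can be sketched in a few lines of NumPy. This is a minimal illustration on simulated data, not the authors' R code: the toy matrices `X1` and `X2`, the row standardization, the sign alignment of the eigengene, and the choice of $|\operatorname{cor}(E^{(p)}, E^{(q)})|^{\beta}$ as the eigengene adjacency are all assumptions made for the example.

```python
import numpy as np


def standardize_rows(X):
    """Scale each gene (row) to mean 0 and variance 1 across samples."""
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)


def module_eigengene(X):
    """Return the module eigengene (first right singular vector) and the
    proportion of variance it explains for a genes x samples matrix X."""
    Z = standardize_rows(X)
    # Thin SVD: Z = U @ diag(d) @ Vt, singular values d in decreasing order.
    _, d, Vt = np.linalg.svd(Z, full_matrices=False)
    eigengene = Vt[0]
    # The sign of a singular vector is arbitrary; align it with the mean profile.
    if np.dot(eigengene, Z.mean(axis=0)) < 0:
        eigengene = -eigengene
    var_explained = d[0] ** 2 / np.sum(d ** 2)
    return eigengene, var_explained


def eigengene_network(eigengenes, beta=1):
    """Assumed adjacency between modules: |cor(E_p, E_q)| ** beta."""
    return np.abs(np.corrcoef(np.vstack(eigengenes))) ** beta


# Hypothetical toy data: two modules, each driven by its own latent signal.
rng = np.random.default_rng(0)
m = 40  # number of samples
s1, s2 = rng.normal(size=m), rng.normal(size=m)
X1 = np.outer(rng.uniform(0.5, 1.5, 30), s1) + 0.3 * rng.normal(size=(30, m))
X2 = np.outer(rng.uniform(0.5, 1.5, 20), s2) + 0.3 * rng.normal(size=(20, m))

E1, ve1 = module_eigengene(X1)
E2, ve2 = module_eigengene(X2)
A_eigen = eigengene_network([E1, E2])
print("variance explained:", round(ve1, 2), round(ve2, 2))
print("eigengene network:\n", A_eigen.round(2))
```

Each module is reduced to a single vector of length `m`, so a network of thousands of genes shrinks to a small matrix with one row per module.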
Low Level View of a Single Module and Factorizable Networks

While a high level view of modular gene coexpression networks can be viewed as a data reduction technique, many network analyses focus on the pairwise relationships of relatively few (hundreds of) correlated genes, i.e., genes that form a single module in a larger network. For example, the 498 genes of our motivational example were part of a body weight related module, which was found in a large gene coexpression network based on the female mouse liver samples [23]. The low-level analysis of a single network module may help identify key genes that may be used as therapeutic targets or candidate biomarkers. An important question of low level analysis is how to efficiently describe the connection strengths between interacting module genes. We have provided empirical evidence that many module adjacency matrices, i.e., networks comprised of genes of a single module, are approximately factorizable [8]. In such networks, the adjacency between module genes $i$ and $j$ can approximately be factored into gene-specific contributions, i.e., $a_{ij}^{(q)} \approx CF_i^{(q)} CF_j^{(q)}$, with $CF_i^{(q)}$ defined as the conformity of gene $i$. Thus, the adjacency matrix of an approximately factorizable network can be approximated using the rank 1 matrix $[CF_i^{(q)} CF_j^{(q)}]$. The conformity vector $CF^{(q)}$ can be estimated in several ways [8]; it is highly related to a single factor nonnegative matrix decomposition of $A^{(q)}$ [51] and to the connectivity.

Characterizing approximately factorizable coexpression modules

An open theoretical research question is to characterize microarray data that lead to factorizable coexpression networks. Here we solve this problem for the case of modules in a gene coexpression network. Toward this end, we propose the following measure of eigengene factorizability:

$$\operatorname{EF}(X^{(q)}) = \frac{|d_1^{(q)}|^{4}}{\sum_{l=1}^{m} |d_l^{(q)}|^{4}}. \tag{24}$$

Note that $0 \leq \operatorname{EF}(X^{(q)}) \leq 1$ and the close resemblance to the proportion of variance explained by the module eigengene (Equation 22).
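A short numerical sketch may help to see how the factorizability measure behaves. It assumes that the measure is the ratio $|d_1|^4 / \sum_l |d_l|^4$ of fourth powers of the singular values, alongside the variance ratio $|d_1|^2 / \sum_l |d_l|^2$; the simulated matrices `tight` (one shared signal) and `loose` (pure noise) and the row standardization are hypothetical choices for illustration only.

```python
import numpy as np


def eigengene_factorizability(X):
    """Return (EF, VarExplained) for a genes x samples matrix X,
    assuming EF = d1^4 / sum(d^4) and VarExplained = d1^2 / sum(d^2)."""
    # Standardize each gene to mean 0 and variance 1 across samples.
    Z = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
    d = np.linalg.svd(Z, compute_uv=False)  # singular values, decreasing
    ef = d[0] ** 4 / np.sum(d ** 4)
    var_explained = d[0] ** 2 / np.sum(d ** 2)
    return ef, var_explained


# Hypothetical toy data: a tight module versus unstructured noise.
rng = np.random.default_rng(1)
m = 50
signal = rng.normal(size=m)
tight = np.outer(rng.uniform(0.5, 1.5, 100), signal) + 0.5 * rng.normal(size=(100, m))
loose = rng.normal(size=(100, m))

ef_tight, ve_tight = eigengene_factorizability(tight)
ef_loose, ve_loose = eigengene_factorizability(loose)
print("tight module: EF =", round(ef_tight, 3), "VarExplained =", round(ve_tight, 3))
print("pure noise:   EF =", round(ef_loose, 3), "VarExplained =", round(ve_loose, 3))
```

Because the fourth power down-weights the many small singular values, this ratio is never smaller than the variance ratio, which is one way to read the pattern in which a module with a moderate proportion of variance explained can still have a factorizability close to 1.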
In the Methods section, we argue that $\operatorname{EF}(X^{(q)}) \approx 1$ implies that the correlation matrix factors as follows:

$$\operatorname{cor}(x_i^{(q)}, x_j^{(q)}) \approx \operatorname{cor}(x_i^{(q)}, E^{(q)})\, \operatorname{cor}(x_j^{(q)}, E^{(q)}).$$

Further, we derive the following

Observation 1. If the eigengene factorizability $\operatorname{EF}(X^{(q)})$ is close to 1, the adjacencies of the weighted coexpression module network $A^{(q)} = |\operatorname{cor}(X^{(q)})|^{\beta}$ and the trait-based gene significance measure $GS_i^{(q)} = |\operatorname{cor}(x_i^{(q)}, T)|^{\beta}$ can be factored as follows

$$a_{ij}^{(q)} \approx a_{e,i}^{(q)}\, a_{e,j}^{(q)}, \qquad GS_i^{(q)} \approx a_{e,i}^{(q)}\, a_{e,t}^{(q)}, \tag{25}$$

where

$$a_{e,i}^{(q)} = |\operatorname{cor}(x_i^{(q)}, E^{(q)})|^{\beta} \tag{26}$$

is referred to as the eigengene conformity of the $i$th gene, and

$$a_{e,t}^{(q)} = |\operatorname{cor}(E^{(q)}, T)|^{\beta} \tag{27}$$

is referred to as the $q$th module eigengene significance with respect to $T$, also denoted as $\operatorname{EigengeneSignif}^{(q)}$.

As described in Table 1, the eigengene significance and the eigengene conformity are the eigengene-based counterparts of the centroid significance (Equation 15) and centroid conformity (Equation 16), respectively.

Table 1 (doi:10.1371/journal.pcbi.1000117.t001). Dictionary for translating between general network terms and their eigengene-based counterparts.

| Term | General network: adjacency matrix $A^{(q)} = [a_{ij}]$ | Gene expression: microarray data $X^{(q)}$ |
|---|---|---|
| Decomposition | Factor analysis of $A$ | Singular value decomposition of $X = UDV^{T}$ |
| Centroid | Intramodular hub gene | Module eigengene $E$ |
| Conformity($i$) | $CF_i$, defined as 1st factor of $A$ | $a_{e,i} = \lvert\operatorname{cor}(x_i, E)\rvert^{\beta}$ |
| Approximately factorizable means | $a_{ij} \approx CF_i CF_j$ | $x_i \approx u_1(i)\,\lvert d_1\rvert\,E$ |
| Factorizability measure | | |
| CentroidSignif($i$) | $GS_{i,centroid}$ | $a_{e,t} = \lvert\operatorname{cor}(E, T)\rvert^{\beta}$ |
| CentroidConformity($i$) | $a_{i.centroid,i}$ | $a_{e,i} = \lvert\operatorname{cor}(E, x_i)\rvert^{\beta}$ |

Weighted gene coexpression network and its eigengene-based approximation if $\operatorname{EF}(X^{(q)}) \approx 1$:

| Term | Coexpression network | Eigengene-based counterpart |
|---|---|---|
| Network | $A = \lvert\operatorname{cor}(X)\rvert^{\beta}$ | $A_E = a_e a_e^{T}$ |
| Gene significance($i$) | $GS_i = \lvert\operatorname{cor}(x_i, T)\rvert^{\beta}$ | $GS_{E,i} = a_{e,i}\, a_{e,t}$ |
| Connectivity($i$) | $k_i = \sum_{j \neq i} a_{ij}$ | $k_{E,i} = a_{e,i} \sum_{j} a_{e,j}$ |

Network concepts based on a network concept function $\operatorname{NCF}(\cdot,\cdot)$ if $\operatorname{EF}(X^{(q)}) \approx 1$ and $\max_j(a_{e,j}) \approx 1$:

| Concept | Intramodular: $\operatorname{NCF}(A, GS)$ | Eigengene-based: $\operatorname{NCF}(A_E, GS_E)$ |
|---|---|---|
| Scaled connectivity($i$) | | $K_{E,i} \approx a_{e,i}$ |
| Density | | |
| Centralization | | |
| Heterogeneity | | |
| Clustering Coefficient($i$) | | |
| Max. adjacency ratio($i$) | | |
| Hub gene significance | | $a_{e,t}$ |
| Module significance | | |

Here we omit the reference to the $q$th module.

The eigengene-based approximations on the right hand side of Equation 25 motivate us to define the eigengene-based adjacency matrix $A_E^{(q)}$ and gene significance measure $GS_E^{(q)}$ as follows:

$$A_E^{(q)} = a_e^{(q)} \left(a_e^{(q)}\right)^{T}, \tag{28}$$

$$GS_{E,i}^{(q)} = a_{e,i}^{(q)}\, a_{e,t}^{(q)}. \tag{29}$$

For our coexpression modules, we find empirically that the eigengene factorizability is close to 1 (see Table 2, Text S1, Text S2, and Text S3).

Table 2 (doi:10.1371/journal.pcbi.1000117.t002). Values of network concepts in weighted gene coexpression module networks (brain cancer data).

| Module | Blue | Brown | Green | Grey | Red | Turquoise | Yellow |
|---|---|---|---|---|---|---|---|
| Module size ($n^{(q)}$) | 606 | 185 | 136 | 1313 | 105 | 1112 | 143 |
| Eigengene factorizability ($\operatorname{EF}(X^{(q)})$) | 0.97 | 0.99 | 0.99 | 0.66 | 0.98 | 0.98 | 0.99 |
| VarExplained($E^{(q)}$) | 0.59 | 0.66 | 0.70 | 0.28 | 0.68 | 0.57 | 0.71 |
| Max. conformity $\max(a_{e,i})$ | 0.97 | 0.97 | 0.98 | 0.91 | 0.95 | 0.98 | 0.98 |
| Density | 0.58 | 0.65 | 0.69 | 0.29 | 0.67 | 0.55 | 0.70 |
| DensityE | 0.58 | 0.65 | 0.70 | 0.23 | 0.68 | 0.55 | 0.71 |
| Centralization | 0.16 | 0.13 | 0.12 | 0.15 | 0.11 | 0.17 | 0.12 |
| CentralizationE | 0.16 | 0.13 | 0.12 | 0.21 | 0.11 | 0.18 | 0.12 |
| Heterogeneity | 0.14 | 0.10 | 0.11 | 0.17 | 0.091 | 0.17 | 0.11 |
| HeterogeneityE | 0.14 | 0.10 | 0.11 | 0.44 | 0.091 | 0.17 | 0.11 |
| Mean(ClusterCoef) | 0.60 | 0.66 | 0.71 | 0.32 | 0.68 | 0.59 | 0.72 |
| ClusterCoefE | 0.60 | 0.66 | 0.71 | 0.33 | 0.68 | 0.59 | 0.72 |
| ModuleSignif | 0.088 | 0.12 | 0.21 | 0.11 | 0.16 | 0.14 | 0.065 |
| ModuleSignifE | 0.018 | 0.093 | 0.21 | 0.008 | 0.16 | 0.13 | 0.039 |
| HubGeneSignif | 0.11 | 0.14 | 0.25 | 0.15 | 0.18 | 0.19 | 0.074 |
| HubGeneSignifE | 0.023 | 0.11 | 0.25 | 0.014 | 0.18 | 0.17 | 0.045 |
| EigengeneSignif = $a_{e,t}^{(q)}$ | 0.024 | 0.12 | 0.25 | 0.016 | 0.19 | 0.18 | 0.046 |

Here we report the results for soft thresholding with $\beta = 1$ (Equation 2). The results for higher powers $\beta$ and for unweighted networks can be found in Text S1. Grey denotes genes outside the 6 properly defined modules. The table shows that network concepts in the proper modules are close to their eigengene-based analogs.
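The factorization in Observation 1 can be checked numerically on simulated data. The sketch below is an illustration under stated assumptions rather than the authors' analysis: the module `X`, the trait `T`, and the function name `eigengene_approximation` are hypothetical, and the eigengene is taken from the SVD of the row-standardized data. It compares the adjacency $|\operatorname{cor}(x_i, x_j)|^{\beta}$ with the product of eigengene conformities, and the gene significance $|\operatorname{cor}(x_i, T)|^{\beta}$ with the product of conformity and eigengene significance.

```python
import numpy as np


def eigengene_approximation(X, T, beta=1):
    """Return (A, A_E, GS, GS_E) for a genes x samples matrix X and trait T."""
    Z = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
    E = np.linalg.svd(Z, full_matrices=False)[2][0]  # module eigengene
    A = np.abs(np.corrcoef(X)) ** beta               # coexpression adjacency
    GS = np.abs(np.array([np.corrcoef(x, T)[0, 1] for x in X])) ** beta
    # Eigengene conformity of each gene and eigengene significance of the module.
    a_e = np.abs(np.array([np.corrcoef(x, E)[0, 1] for x in X])) ** beta
    a_et = abs(np.corrcoef(E, T)[0, 1]) ** beta
    return A, np.outer(a_e, a_e), GS, a_e * a_et


# Hypothetical toy module: 60 genes sharing one signal, trait related to it.
rng = np.random.default_rng(2)
n, m = 60, 40
signal = rng.normal(size=m)
X = np.outer(rng.uniform(0.5, 1.5, n), signal) + 0.3 * rng.normal(size=(n, m))
T = signal + 0.5 * rng.normal(size=m)

A, A_E, GS, GS_E = eigengene_approximation(X, T)
off_diagonal = ~np.eye(n, dtype=bool)
err_A = np.mean(np.abs(A[off_diagonal] - A_E[off_diagonal]))
err_GS = np.mean(np.abs(GS - GS_E))
print("mean |a_ij - a_e,i a_e,j|:", round(err_A, 3))
print("mean |GS_i - a_e,i a_e,t|:", round(err_GS, 3))
```

Only the off-diagonal entries are compared, since the diagonal of a correlation-based adjacency is 1 by construction while the diagonal of the rank-1 product is the squared conformity.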
Abstractly speaking, Observation 1 allows us to characterize coexpression networks for which the adjacency $a_{ij}$ can be approximated by a product of the centroid conformities (Equation 16): $a_{ij} \approx \operatorname{CentroidConformity}_i \operatorname{CentroidConformity}_j$.

Geometric interpretation of factorizability

In the Methods section, we argue that $\operatorname{EF}(X^{(q)}) \approx 1$ if the module gene expressions $x_i^{(q)}$ are approximately orthogonal to the right singular vectors $v_l^{(q)}$ for $l \geq 2$, i.e., if on average the gene expression profiles point in the direction of the module eigengene $v_1^{(q)} = E^{(q)}$. A rough geometric intuition of $a_{ij}^{(q)} \approx a_{e,i}^{(q)} a_{e,j}^{(q)}$ (Equation 25) is presented in Figure 5A. The angle between the module eigengene $E^{(q)}$ and the $i$th gene expression profile is denoted by $\theta_i$. The angle between gene expression profiles $i$ and $j$ is denoted by $\theta_{ij}$. In the Methods section, we show that $\theta_{ij} \approx |\theta_i \pm \theta_j|$ and $\sin(\theta_i)\sin(\theta_j) \approx 0$ imply approximate factorizability of the correlation matrix.

Figure 5 (doi:10.1371/journal.pcbi.1000117.g005). Using vectors to illustrate results in gene coexpression network analysis. (A) A geometric interpretation of factorizability if the gene expression profiles and the module eigengene lie in a Euclidean plane. Then the angle $\theta_{12}$ between gene expression profiles 1 and 2 can be expressed in terms of angles with the module eigengene, i.e., $\theta_{12} = \theta_1 - \theta_2$. Similarly, $\theta_{23} = \theta_2 + \theta_3$. Under the assumptions stated in the text, we find $\theta_{ij} \approx |\theta_i \pm \theta_j|$. Using a trigonometric formula (Equation 51) this implies that the correlation matrix is approximately factorizable. (B) Illustrating why intramodular hub genes cannot be “intermediate” genes between two distinct coexpression modules. The large angle between module eigengenes E1 and E2 reflects that the corresponding modules are distinct. Since intermediate gene 1 does not have a small angle with either eigengene, it is not an intramodular hub gene. By contrast, intramodular hub gene 2 has a small angle with eigengene E1 but is not close to module eigengene E2. (C,D) Illustrating that the hub gene significance of a module depends on the relationship between the module eigengene and the underlying microarray sample trait (Equation 34). For sample traits T2 and T1 the hub gene significance (and corresponding eigengene significance cor(E,T)) are high and low, respectively. The geometry of (C) implies relationships between the connectivity $k$ of a gene (determined by its angle with the eigengene E) and gene significance measure GS1 (its angle with trait T1) and GS2 (its angle with trait T2). As shown in (D), the gene significance measure GS2 increases with $k$ since the small angle between E and T2 implies that genes with high $k$ (small angle with E) also have a small angle with T2. In contrast, high connectivity $k$ implies a large angle with T1 and thus GS1 decreases as a function of $k$.

Eigengene-Based Analogs of Network Concepts

Here we define eigengene-based network concepts as a step towards a geometric interpretation of network concepts. Analogous to the case of intramodular network concepts, we define eigengene-based network concepts by evaluating the network concept function $\operatorname{NCF}(A_E^{(q)}, GS_E^{(q)})$ on the eigengene-based adjacency matrix $A_E^{(q)}$ (Equation 28) and the eigengene-based gene significance measure $GS_E^{(q)}$ (Equation 29). One can easily derive formulas for the eigengene-based network concepts (Equation 30). Under the assumptions of Observation 1, we find that $A^{(q)} \approx A_E^{(q)}$ and $GS_i \approx GS_{E,i}$. For a continuous network concept function $\operatorname{NCF}(\cdot,\cdot)$ this implies $\operatorname{NCF}(A^{(q)}, GS) \approx \operatorname{NCF}(A_E^{(q)}, GS_E)$. We summarize this observation as follows

Observation 2. If $A^{(q)} = |\operatorname{cor}(X^{(q)})|^{\beta}$ and the eigengene factorizability $\operatorname{EF}(X^{(q)})$ is close to 1, the network concepts can be approximated by their eigengene-based analogs.

This observation is illustrated in Figure 6.
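Observation 2 can be illustrated by evaluating the same network concept function on a simulated module adjacency and on its eigengene-based counterpart. The sketch below is hedged in two ways: the formulas used for density, centralization and heterogeneity are commonly used definitions that are assumed here (the corresponding equations are not reproduced in this part of the text), and the toy module `X` is hypothetical.

```python
import numpy as np


def network_concepts(A):
    """Density, centralization and heterogeneity of an adjacency matrix.
    The diagonal is ignored; the definitions are assumptions of this sketch."""
    n = A.shape[0]
    A = A.copy()
    np.fill_diagonal(A, 0.0)
    k = A.sum(axis=1)                         # connectivity k_i = sum_{j != i} a_ij
    density = k.sum() / (n * (n - 1))         # mean off-diagonal adjacency
    centralization = n / (n - 2) * (k.max() / (n - 1) - density)
    heterogeneity = k.std() / k.mean()        # coefficient of variation of k
    return {"density": density,
            "centralization": centralization,
            "heterogeneity": heterogeneity}


# Hypothetical toy module: 60 genes sharing one latent signal, 40 samples.
rng = np.random.default_rng(3)
n, m, beta = 60, 40, 1
signal = rng.normal(size=m)
X = np.outer(rng.uniform(0.5, 1.5, n), signal) + 0.3 * rng.normal(size=(n, m))

Z = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
E = np.linalg.svd(Z, full_matrices=False)[2][0]           # module eigengene
a_e = np.abs(np.array([np.corrcoef(x, E)[0, 1] for x in X])) ** beta

concepts = network_concepts(np.abs(np.corrcoef(X)) ** beta)  # NCF(A)
concepts_E = network_concepts(np.outer(a_e, a_e))            # NCF(A_E)
for name in concepts:
    print(name, round(concepts[name], 3), round(concepts_E[name], 3))
```

Passing both matrices through one function mirrors the definition of eigengene-based concepts as the same network concept function evaluated on the eigengene-based adjacency.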
Figure 6 (doi:10.1371/journal.pcbi.1000117.g006). Illustrating Observation 2 regarding the relationship between a network concept (y-axis) and its eigengene-based analog (x-axis) in the brain cancer data. Each point corresponds to a module. (A–F) Corresponding to a weighted network constructed with a soft threshold (Equation 2) of $\beta = 1$. (G–L) Analogous plots for $\beta = 6$. (A,G) Centralization (y-axis) versus eigengene-based CentralizationE (x-axis). The following are analogous plots for (B,H): heterogeneity; (C,I) clustering coefficient; (D,J) module significance; and (E,K) hub gene significance. (F,L) Illustrating Equation 13 regarding the relationship between eigengene significance and hub gene significance. The blue line is the regression line through the points representing proper modules (i.e., the grey, nonmodule genes are left out). While the red reference line (slope 1, intercept 0) does not always fit well, we observe high squared correlations $R^2$ between network concepts and their analogs. Since the grey point corresponds to the genes outside properly defined modules, we did not include it in calculations.

Using the eigengene-based heterogeneity to study the effect of soft thresholding

It can be advantageous to replace network concepts by their eigengene-based analogs when studying theoretical properties. To illustrate this point, we briefly describe the effect of soft thresholding $a_{ij} = s_{ij}^{\beta}$ (Equation 2) on the network heterogeneity. Using extensive simulation studies reported on our webpage, we found that for the vast majority of networks, the heterogeneity increases with the soft threshold $\beta$. Thus, for most coexpression networks, increasing $\beta$ makes it easier to discern highly connected genes from less connected genes. However, one can construct networks for which increasing $\beta$ leads to a lower heterogeneity. The situation is much simpler for the eigengene-based heterogeneity $\operatorname{Heterogeneity}_E^{(q)}$ (Equation 30).
In the Methods section, we prove that the eigengene-based heterogeneity is a monotonically increasing function of the soft threshold $\beta$. Thus, the heterogeneity will be an increasing function of $\beta$ if it can be approximated by its eigengene-based analog (Observation 2).

Relationships among eigengene-based network concepts

A major theoretical advantage of eigengene-based network concepts is that they reveal simple relationships amongst each other. For example, it is straightforward to derive

$$\operatorname{ClusterCoef}_E^{(q)} \approx \left(1 + \left(\operatorname{Heterogeneity}_E^{(q)}\right)^{2}\right)^{2} \times \operatorname{Density}_E^{(q)}. \tag{31}$$

To arrive at particularly simple relationships among network concepts, we make use of the following terminology. We denote the maximum eigengene conformity as $a_{e,\max}^{(q)} = \max_j (a_{e,j}^{(q)})$, where $a_{e,j}^{(q)} = |\operatorname{cor}(x_j^{(q)}, E^{(q)})|^{\beta}$ (Equation 26). In most modules, we find genes that have very high correlations ($r \approx 0.99$) with the module eigengene. For a low power $\beta$, this implies that the maximum eigengene conformity is approximately equal to 1:

$$a_{e,\max}^{(q)} \approx 1. \tag{32}$$

We refer to Equation 32 as the maximum conformity assumption. With the results in the Methods section, one can show that the maximum conformity assumption implies the following

Observation 3. If $A^{(q)} = |\operatorname{cor}(X^{(q)})|^{\beta}$, $\operatorname{EF}(X^{(q)}) \approx 1$ and the maximum conformity assumption applies, intramodular network concepts satisfy the relationships given in Equations 33–40, where $\operatorname{mean}(\operatorname{ClusterCoef}^{(q)})$ denotes the mean clustering coefficient, $\operatorname{ClusterCoef}_{\max}^{(q)} = \max_j(\operatorname{ClusterCoef}_j^{(q)})$ and $\operatorname{MAR}_{\max}^{(q)} = \max_j(\operatorname{MAR}_j^{(q)})$.

In practice, we find that the maximum conformity assumption holds well for low values of $\beta$. Below, we study the robustness of our results with respect to higher powers and alternative network construction methods.

Geometric interpretation of network concepts

Observations 2 and 3 allow us to provide a geometric interpretation of intramodular network concepts.
The relationship between the scaled intramodular connectivity $K_i^{(q)}$ and its eigengene-based analog $a_{e,i}^{(q)} = |\operatorname{cor}(x_i^{(q)}, E^{(q)})|^{\beta}$ (Equation 33) facilitates a geometric interpretation of the intramodular connectivity: the smaller the angle $\theta_i$ between the $i$th gene expression profile and the module eigengene, the larger is $|\cos(\theta_i)|^{\beta} = a_{e,i}^{(q)}$, i.e., the larger is the scaled intramodular connectivity. Since the module eigengene summarizes the overall behavior of the module, $a_{e,i}^{(q)}$ measures how well gene $i$ conforms to the overall module. Thus, a tongue-in-cheek social network interpretation of Equation 33 is that group-conforming behavior leads to high popularity.

We provide two geometric interpretations of the density. The first makes use of the relationship $a_{ij}^{(q)} = |\cos(\theta_{ij})|^{\beta}$, where $\theta_{ij}$ denotes the angle between gene expression profiles $i$ and $j$. By definition (Equation 9), the smaller the pairwise angles $\theta_{ij}$ between the gene expression profiles, the higher is the module density. Equation 39 provides another interpretation: the smaller the angles $\theta_i$ between the module gene expression profiles and the module eigengene, the higher is the density. Thus, the density can be interpreted as a measure of average closeness between the gene expression profiles and the module eigengene. By definition, coexpression module networks have a relatively high density (see Table 2, Text S1, Text S2, and Text S3).

The eigengene-based heterogeneity equals the coefficient of variation of the eigengene conformities $a_e^{(q)}$, i.e., it is a measure of variability of the angles $\theta_i$ between the gene expression profiles and the module eigengene. The heterogeneity equals 0 if the angles $\theta_i$ are all equal. The $i$th gene has high eigengene-based significance $GS_{E,i}^{(q)}$ (Equation 29) if the eigengene has a small angle with the sample trait and $\theta_i$ is small.
Similarly, the geometric interpretation of the hub gene significance (Equation 13) is straightforward: the smaller the angle between the module eigengene and the sample trait, the higher is the hub gene significance (Equation 34). We provide two geometric interpretations of the module significance (Equation 14). The first interpretation is based on the definition of the module significance as average gene significance; a module has high module significance if on average the angles between the module expression profiles and the sample trait tend to be small. The second interpretation of the module significance is based on Equation 37: a module has high significance if the module density is high and the angle between the module eigengene and the sample trait is small.

What Can Microarray Data Analysts Learn from the Geometric Interpretation?

Here we illustrate how the geometric interpretation of gene coexpression networks can be used to derive results that may be interesting to microarray data analysts.

Summarizing the expression profiles of a module

Multiple approaches are conceivable for summarizing the expression profiles of the genes inside a single module. One approach (popular with statisticians) applies a singular value decomposition to the expression data and summarizes the module with the module eigengene. Another approach (popular with network theorists) is to construct a module network and to use the most highly connected hub gene as centroid. Since Equation 33 implies that hub genes are highly correlated with the module eigengene, we find that the two seemingly different approaches will lead to very similar results in practice (Figure 4C).

Intramodular connectivity is a measure of module membership

Since module construction is computationally intensive, one often restricts the module detection analysis to a subset of the original genes on the microarray, e.g., the most varying and/or the most connected genes.
To counter this loss of information, generalizing the intramodular connectivity to extramodular genes, i.e., genes outside the module, is an important problem. Our solution is motivated by the relationship between the intramodular connectivity and its eigengene-based analog (Equation 33). Specifically, the $q$th module eigengene gives rise to an eigengene-based scaled intramodular connectivity measure $K_{cor,i}^{(q)}$ (Equation 41). Under the assumptions of Observation 3, Equation 33 implies that $K_i^{(q)} \approx |K_{cor,i}^{(q)}|^{\beta}$ for the subset of genes that are in the $q$th module. The larger $K_{cor,i}^{(q)}$, the more similar gene $i$ is to the summary profile of the $q$th module. Thus, $K_{cor,i}^{(q)}$ can be used to measure module membership. A theoretical advantage of $K_{cor,i}^{(q)}$ over $K_i^{(q)}$ is that its definition can be easily extended to expression profiles $x_i$ outside the $q$th module. Another advantage of $K_{cor,i}^{(q)}$ is that a simple correlation test p-value can be used to assess the statistical significance of the correlation between $x_i$ and $E^{(q)}$.

Fuzzy module annotation of genes

Module detection usually involves certain parameter choices. For some genes, it may be difficult to decide whether they belong to a particular module or whether they belong to more than one module. Instead of reporting a binary indicator of module membership, it can be advantageous to report a fuzzy measure of module membership, which takes on values in the unit interval [0,1]. A natural choice for a fuzzy measure of module membership is the eigengene-based scaled intramodular connectivity measure $K_{cor,i}^{(q)}$ (Equation 41). The fuzzy module membership measures $K_{cor,i}^{(q)}$ specify how close gene $i$ is to modules $q = 1, \ldots, Q$. It is straightforward to use these measures for finding genes that are close to two modules, i.e., intermediate genes. In Figure 7, we show the pairwise relationships among different $K_{cor,i}^{(q)}$ measures where the genes are colored by their original module assignment.
Note that many of the nonmodule (grey) genes lie intermediate between the proper module genes.

Figure 7 (doi:10.1371/journal.pcbi.1000117.g007). Fuzzy module annotation of genes in the brain cancer network. A natural choice for a fuzzy measure of module membership is the generalized scaled connectivity measure $K_{cor,i}^{(q)} = |\operatorname{cor}(x_i, E^{(q)})|$ (Equation 41). (A) Scatterplot of the brown module membership measure (y-axis) versus that of the blue module (x-axis). Note that grey dots corresponding to genes outside of properly defined modules can be intermediate between module genes. (B) The corresponding plot for blue versus turquoise module membership. (C) Brown versus turquoise module membership. (D) The relationship between gene significance based on survival time (y-axis) and brown module membership (x-axis).

What Can Network Theorists Learn from the Geometric Interpretation?

In the following, we provide several examples that illustrate potential uses of the geometric interpretation.

Statistical significance of network concepts

While fundamental network concepts are defined as functions of the network adjacency matrix, their eigengene-based analogs are often simple monotonic functions of correlation coefficients. This insight can be used to attach significance levels (p-values) to several eigengene-based network concepts. For example, the eigengene-based hub gene significance is a monotonic function of the correlation between the eigengene and the sample trait (Equation 34). Thus, one can use a correlation test p-value [53] or a regression-based p-value for assessing the statistical significance of the relationship between $E^{(q)}$ and the sample trait $T$. Analogously, one can attach a significance level to the fuzzy module membership measures $K_{cor,i}^{(q)}$ (Equation 41). Since the gene coexpression network concepts are based on correlations between quantitative variables, one can use permutation test procedures to attach significance levels to network concepts.
By randomly permuting the gene expression values of each gene, it is possible to disrupt the correlation structure inherent in the original data. We find that the resulting permuted data lead to networks with low density and low mean clustering coefficients (reflecting the lack of large modules).

Relationship between centralization and density

The relationship between centralization and density (Equation 40) is surprisingly simple for coexpression networks but it does not hold in general networks. For a general network, one can only derive an upper bound for the centralization in terms of the density [35]. As a caveat, we mention that our empirical studies (described below) show that Equation 40 is not very robust with regard to deviations from our theoretical assumptions.

Intramodular hub genes cannot be intermediate genes in coexpression networks

The geometric interpretation of gene coexpression network analysis can be used to argue that a gene that lies “intermediate” between two distinct modules cannot be a highly connected intramodular hub gene in either module (see Figure 5B). More precisely, we refer to gene $i$ as a hub gene in module 1 if its scaled connectivity $K_i^{(1)}$ is very high (say larger than 0.9). Further, we refer to two modules as distinct if their respective eigengenes have a low correlation $|\operatorname{cor}(E^{(1)}, E^{(2)})|$, and we refer to gene $i$ as intermediate between the two modules if $K_i^{(1)} > 0.5$ and $K_i^{(2)} > 0.5$. Equation 33 allows us to translate statements about the scaled intramodular connectivity into statements about the angles between genes and module eigengenes. A gene is an intermediate gene if it has a moderately small angle with both module eigengenes. If the eigengenes are distinct (i.e., the angle between them is large), the intermediate gene cannot have a very small angle with either module eigengene, i.e., it cannot be an intramodular hub gene in either module. A geometric interpretation of this example can be found in Figure 5B.
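The angle argument can be made concrete with a small calculation. The sketch below assumes the approximation $K_i^{(q)} \approx |\cos(\theta_i)|^{\beta}$ (Equation 33) and uses the triangle inequality for angles: if one gene makes angles $\theta_1$ and $\theta_2$ with two eigengenes, the angle between the eigengenes (up to sign) is at most $\theta_1 + \theta_2$. The function name `min_eigengene_correlation` is hypothetical.

```python
import math


def min_eigengene_correlation(k1, k2, beta=1.0):
    """Lower bound on |cor(E1, E2)| implied by one gene having scaled
    connectivities k1 and k2, assuming K is approximately |cos(theta)| ** beta."""
    theta1 = math.acos(k1 ** (1.0 / beta))
    theta2 = math.acos(k2 ** (1.0 / beta))
    total = theta1 + theta2
    # Beyond a right angle the triangle inequality gives no constraint.
    return math.cos(total) if total < math.pi / 2 else 0.0


# A gene that is a hub (K > 0.9) in both modules forces correlated eigengenes.
bound_hub = min_eigengene_correlation(0.9, 0.9)
# A gene that is merely intermediate (K > 0.5 in both) forces nothing.
bound_intermediate = min_eigengene_correlation(0.5, 0.5)
print(round(bound_hub, 2), bound_intermediate)
```

For $\beta = 1$, a gene with scaled connectivity 0.9 in both modules would require an eigengene correlation of at least $2 \times 0.9^2 - 1 = 0.62$, which is incompatible with two distinct modules, whereas connectivities of 0.5 in both modules impose no lower bound at all.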
As an important caveat, we mention that intermediate network genes may well be highly connected “hub” genes if the factorizability property does not hold, such as in the entire network comprised of multiple distinct modules.

Characterizing module networks where hub genes are significant

For a trait-based gene significance measure, the striking relationship between module significance and hub gene significance (Equation 37) suggests a positive relationship between connectivity and gene significance (high hub gene significance) in modules that are enriched with significant genes (high module significance). Further, Equation 34 shows that the hub gene significance of a module network is determined by the angle between the module eigengene and the sample trait. This allows us to describe situations when a module has high hub gene significance, i.e., when there is a strong positive relationship between gene significance and intramodular connectivity. In the example provided in Figure 5C and 5D, the angle between E and T2 is small, which implies that the hub gene significance with regard to $GS2_i = |\operatorname{cor}(x_i, T2)|$ is high. By contrast, the angle between E and T1 is large, which implies that the hub gene significance with regard to $GS1_i = |\operatorname{cor}(x_i, T1)|$ is low.

Dictionary for Translating between Network Concepts and Their Eigengene-Based Analogs

To facilitate the communication between microarray data analysts and network theorists, we provide a short dictionary for translating between microarray data analysis and network theory terminology. More specifically, for a subset (module) of genes that have high expression factorizability, Table 1 describes the correspondence between general network terms and their eigengene-based counterparts. While our theoretical derivations assume a weighted gene coexpression network, our robustness studies show empirically that many of the findings apply to unweighted networks as well.
The summary of empirical robustness studies is described below. In general, eigengene-based concepts are no substitute for network concepts. It is natural to use network concepts when describing the pairwise relationships between genes and to use eigengene-based network concepts when relating the gene expression profiles to a module eigengene. Since eigengene-based network concepts tend to be relatively simple, they often simplify theoretical derivations. Further, many of them allow one to calculate a statistical significance level (p-value) using a correlation or regression based test statistic.

Real Data Applications

To illustrate the theoretical results we report 4 different microarray data applications. The underlying data sets and R software code can be found on our webpage http://www.genetics.ucla.edu/labs/horvath/ModuleConformity/GeometricInterpretation/.

Brain cancer network application

Here we describe a weighted gene coexpression network that was constructed on the basis of 55 microarray samples of glioblastoma (brain cancer) patients. A detailed description of the data, modules, and biological implications can be found in [24]. We defined 6 modules as branches of an average linkage hierarchical cluster tree (Figure 3B). Module membership in the 6 “proper” modules is color-coded by turquoise, blue, brown, yellow, green and red. Grey denotes the color of genes that were not grouped into any of the 6 proper modules. To allow for a comparison, we also report results for the “improper” module comprised of grey genes. We used the patient survival time as microarray sample trait $T$. We defined a gene significance measure as the absolute value of the correlation between $T$ and the gene expression profiles (Equation 4). The module significance was defined as average gene significance (Equation 14). Figure 3C shows that the brown module had the highest module significance.
This module was previously found to be enriched with genes that are prognostic of patient survival [24]. By relating the gene significance measure GSi to the scaled connectivity Ki , we arrive at a hub gene significance measure (Equation 13). As illustrated in Figure 3D and 3E, the hub gene significance is defined as the slope of a regression model without intercept term. The brown module had the highest hub gene significance, see Table 2. We defined the module eigengene significance (Equation 27) as the absolute value of the correlation between the module eigengene and patient survival time. The brown module eigengene also had the highest eigengene significance: ae,t brown = |cor(E brown,T)| = 0.202. An advantage of the eigengene-based hub gene significance (the eigengene significance) is that it allows one to compute a corresponding p-value. Using a correlation test, we find that the value of the eigengene significance ae,t brown is statistically insignificant (p = 0.30) in this dataset. However, when we combined these data with an additional data set, we found that the brown module eigengene is significantly related to survival time [24]. We visualize the gene expression profiles of module genes with a heat map plot (Figure 4B) where rows correspond to the genes, the columns to the samples, and the gene expression profiles have been standardized to a mean of 0 and a variance of 1. The heat map colors high and low expression values by red and green, respectively. For a given module, the heat map exhibits characteristic vertical bands that reflect the high correlation among module gene expression profiles. For the 6 proper modules of our brain cancer application, the proportion of variance explained by the first eigengene ranges from 0.59 to 0.71 (Table 2). For the improper grey module genes (defined as genes outside of all proper modules) the proportion of variance explained by the first eigengene is only 0.28. 
Similarly, when all network genes are used to define an improper module, the proportion of variance explained by the first eigengene is only 0.32. As expected by module construction, we find that the gene expression data of proper modules have high eigengene factorizabilities EF(X)≥0.97 (Table 2). By contrast, the factorizability of the grey genes (i.e., the genes outside of proper modules) is relatively low (EF(X) = 0.66). For each module, Table 2 reports network properties including network size, density, centralization, heterogeneity, mean clustering coefficient, module significance, hub gene significance, and eigengene significance. For the proper (nongrey) modules, we find that the numerical values of the intramodular network concepts and their eigengene-based analogs support our theoretical derivations. Our empirical results illustrate Observation 2 regarding the relationship between intramodular network concepts and their eigengene-based analogs. Figure 6A–E depict the relationships among centralization, heterogeneity, clustering coefficient, module significance, hub gene significance and their respective eigengene-based analogs when a soft threshold of β = 1 is used for the weighted network construction (Equation 2). The analogous results for β = 6 are depicted in Figure 6G–K. Figure 6F and 6L depicts the relationship between hub gene significance (Equation 13) and module eigengene significance (Equation 27) for β = 1 and β = 6, respectively. For completeness, we also report the results for the grey, nonmodule genes in the figures. But since our theoretical results assume proper modules, we exclude the grey genes from the calculation of the squared correlation coefficient R 2. The summary of a robustness analysis with regard to different soft thresholds β and hard thresholds τ is reported in Table 3 and Text S1. Overall, we find very high squared correlations (R 2>0.85), which confirm our theoretical results. 
Only the R² value for the relationship between the clustering coefficient and its eigengene-based analog decreases when β>3.

10.1371/journal.pcbi.1000117.t003 Table 3 Robustness analysis of the brain cancer gene coexpression network results.

Squared correlation R² across modules
                                            Weighted networks                         Unweighted networks
                                            Soft threshold β                          Hard threshold τ
Relation                                    1      2      3      4      5      6      0.7     0.5
Centralization≈CentralizationE              1.0    1.0    0.97   0.90   0.87   0.88   0.07    0.93
Heterogeneity≈HeterogeneityE                1.0    1.0    0.99   0.98   0.97   0.96   0.89    0.87
ClusterCoef_i≈ClusterCoefE                  0.99   0.96   0.88   0.74   0.58   0.45   0.04    0.32
ModuleSignif≈ModuleSignifE                  0.98   0.91   0.87   0.85   0.85   0.86   0.98    0.98
HubGeneSignif≈HubGeneSignifE                0.96   0.91   0.89   0.90   0.92   0.94   0.93    0.87
EigengeneSignif≈HubGeneSignif               0.96   0.89   0.87   0.88   0.90   0.92   0.93    0.87
ClusterCoef = (1+Heterogeneity²)²×Density   0.99   0.96   0.89   0.76   0.61   0.49   0.006   0.32
[relation formula missing]                  1.0    0.99   0.99   0.98   0.97   0.95   0.85    0.99
[relation formula missing]                  0.90   0.68   0.058  0.016  0.16   0.35   0.20    1.0
[relation formula missing]                  0.94   0.94   0.94   0.94   0.93   0.92   0.95    0.98
Ki≈ae,i (median R²)                         1.0    1.0    1.0    1.0    1.0    0.99   0.95    0.83

The table reports how the relationships among network concepts change as a function of different soft threshold parameters β (Equation 2) or hard thresholds τ (Equation 1) used in the network construction. For each relationship and each network construction method, the table entry reports the squared correlation R² across the proper modules. For within-module comparisons the table reports median R² values. Additional details can be found in Text S1.

Figure 8 illustrates the implications of Observation 3 regarding the relationships among network concepts in the cancer coexpression module networks. Figure 8A shows that the scaled connectivity Ki (q) is highly correlated (R²>0.99) with ae,i (q), which illustrates Equation 33. This relationship is highly robust with regard to high soft thresholds β, as can be seen from Table 3.
10.1371/journal.pcbi.1000117.g008 Figure 8 Using the brain cancer data to illustrate Observation 3 regarding the relationships among network concepts. (A) Illustrating Equation 33 regarding the relationship between scaled intramodular connectivity Ki (q) (y-axis) and eigengene conformity ae ,i (x-axis). Each dot corresponds to a gene colored by its module membership. We find a high squared correlation R 2 even for the grey genes outside properly defined modules. (B) Illustrating Equation 31 regarding the relationship between the clustering coefficient and (1+Heterogeneity 2)2×Density. Again each dot represents a gene. The clustering coefficients of grey genes vary more than those of genes in proper modules. The short horizontal lines correspond to the mean clustering coefficient of each module. (C) Illustrating (Equation 37); here each dot corresponds to a module. Since the grey dot corresponds to genes outside of properly defined modules, we have excluded it from the calculation of the squared correlation R 2. (D) Illustrating (Equation 40); (E) Illustrating (Equation 38). A reference line (red) with intercept 0 and slope 1 has been added to each plot. The blue line is the regression line through the points representing proper modules (i.e., the grey, non-module genes are left out). A robustness analysis with regard to different network construction methods, e.g., β>1, can be found in Text S1. Figure 8B illustrates the relationship between the clustering coefficient (the mean corresponds to the short horizontal line) and (1+Heterogeneity 2)2×Density (Equation 31). This relationship is diminished for soft thresholds β>3 as can be seen from Table 3. Figure 8C illustrates the relation (Equation 37), which is highly robust with regard to different choices of β (Table 3). Figure 8D illustrates (Equation 40). This relationship is not robust with regard to β: the R2 value is only 0.058 for β = 3. 
Figure 8E illustrates (Equation 38), which is highly robust with regard to β (Table 3). Although our theoretical results were derived using relatively restrictive assumptions, we find that most results are robust in the weighted networks, see Figure 9, Table 3, and Text S1. However, in unweighted networks, several relationships have lower R 2 values and show a strong dependence on the hard threshold τ (Table 3). 10.1371/journal.pcbi.1000117.g009 Figure 9 Using three different data (brain cancer, mouse liver, and yeast cell cycle) and three different network construction methods to illustrate Equation 37 regarding the relationship between module significance (y-axis) and (x-axis). Points correspond to modules. The square of the correlation coefficient R 2 was computed without the grey, improper module. (A,D,G) Corresponding to the brain cancer gene coexpression networks. (B,E,H) Corresponding to mouse liver networks. (C,F,I) Corresponding to yeast networks. (A–C) Corresponding to a weighted network (Equation 2) constructed with soft thresholds β = 1. (D–F) Corresponding to β = 6. (G–I) Corresponding to an unweighted network (Equation 1) that results from thresholding the correlation matrix at τ = 0.5. Overall, we find that the reported relationship is quite robust with respect to our theoretical assumptions (e.g., factorizability). The blue line is the regression line through the points representing proper modules (i.e., the grey, nonmodule genes are left out). A reference line with slope 1 and intercept 0 is shown in red. Additional details can be found in Text S1, Text S2, and Text S3. Motivational example: Mouse tissues of an F2 intercross The mouse tissues came from an F2 intercross between two mouse strains C3H/HeJ and C57BL/6J. The data were already described above and in Figure 1. The 498 genes were part of a body weight related module in liver tissue (the Blue module described in reference [23]). 
Table 4 presents network concepts and their eigengene-based analogs in the different tissue networks. As predicted by Observation 2, we find a close relationship between the two types of network concepts if the eigengene factorizability of the corresponding network is close to 1. This example also illustrates that our results apply to coexpression networks comprised of relatively few genes (here 498 genes).

10.1371/journal.pcbi.1000117.t004 Table 4 Values of network concepts in the different mouse gender/tissue networks reported in Figure 1.

                    Female liver        Female adipose      Female brain        Female muscle
Network concept     Network  Eigengene  Network  Eigengene  Network  Eigengene  Network  Eigengene
Factorizability     0.92     0.91       0.72     0.46       0.89     0.82       0.79     0.68
Density             0.39     0.39       0.23     0.14       0.32     0.27       0.24     0.19
Centralization      0.19     0.19       0.11     0.19       0.34     0.23       0.17     0.22
Heterogeneity       0.18     0.19       0.22     0.59       0.36     0.54       0.32     0.57
Mean cluster coef   0.42     0.42       0.27     0.26       0.41     0.46       0.30     0.33

                    Male liver          Male adipose        Male brain          Male muscle
Network concept     Network  Eigengene  Network  Eigengene  Network  Eigengene  Network  Eigengene
Factorizability     0.93     0.92       0.76     0.59       0.73     0.46       0.76     0.48
Density             0.37     0.36       0.23     0.16       0.21     0.13       0.25     0.16
Centralization      0.19     0.21       0.15     0.21       0.21     0.20       0.12     0.21
Heterogeneity       0.28     0.32       0.28     0.65       0.27     0.67       0.30     0.64
Mean cluster coef   0.43     0.44       0.27     0.31       0.25     0.26       0.31     0.31

For each network, the table reports the network factorizability F(A), the eigengene factorizability EF(X), network concepts, and their eigengene-based analogs. Here we use a soft threshold β = 1 (Equation 2).

Mouse gene coexpression network application

Here we focus on the female mouse liver tissues of the above-mentioned F2 mouse cross. Specifically, 135 female mice were used to construct a weighted network comprised of 3,400 highly connected genes. The biological significance and gene ontology enrichment analysis of the 12 modules in this large network are described in [23].
In Text S2, Table 5, and Figure 9, we focus on the relationships among the network concepts. We find that many of our theoretical results hold approximately even if the expression factorizability is low. Table 5 shows how the relationship (R 2 values) between network concepts and their eigengene-based analogs depend on the soft threshold β. Overall, we find that our theoretical results are highly robust in weighted networks. The relationship between the clustering coefficient and its eigengene-based analog is diminished (down to 0.44) for β>3. The relationship between heterogeneity and its eigengene-based analog is diminished (down to 0.54 when β 0.95) and if the maximum conformity assumption (Equation 32) is satisfied. However, even if the maximum conformity assumption does not hold, one can still find simple relationships among the network concepts (Equation 49). The geometric interpretation of gene coexpression networks facilitates the derivation of several results that should be interesting to network theorists. For example, we argue that highly connected intramodular hub genes cannot be intermediate between two distinct coexpression modules (Figure 5B). The geometric interpretation is particularly useful when studying gene significance and module significance measures that are based on a microarray sample trait (Equation 4). To study the relationship between connectivity and gene significance, we propose a novel measure of hub gene significance (Equation 13). We find that the hub gene significance of a module network is determined by the angle between the module eigengene and the microarray sample trait (Equation 34). Our geometric interpretation of coexpression networks allows us to describe situations when a module has low hub gene significance (Figure 5C and 5D). Our theoretical derivations for relating module significance to hub gene significance (Equation 37) assumes a gene significance measure based on a sample trait. 
Although this important assumption is violated for the gene significance measure (knock-out essentiality) in the yeast network, it is striking that the relationship between hub gene significance and module significance can still be observed in this application (Figure 9). We provide a robustness analysis that shows that many of our theoretical results apply even if our underlying assumptions are not satisfied (Figures 6 and 9, Tables 3, 5, and 6, Text S1, Text S2, and Text S3). We find that the correspondence between network concepts and their eigengene-based analogs is often better in weighted networks than in unweighted networks. Further, we find that the results in weighted networks tend to be more robust than those in unweighted networks with regard to changing the network construction thresholds β and τ, respectively. Thus, weighted coexpression networks are preferable over unweighted networks when a geometric interpretation of network concepts is desirable. The correspondence between coexpression module networks and the singular value decomposition (Table 1) can break down when a high soft threshold is used for constructing a weighted network or when dealing with an unweighted network. Thus, eigengene-based concepts do not replace network concepts when describing interaction patterns among genes. While this article has a theoretical bent, we illustrate the results on three different microarray data sets (human, mouse, and yeast) that are described in our online R software tutorials, in Text S1, Text S2, and Text S3. Our theoretical results also apply to networks comprised of genes that are highly correlated with a sample trait. The key assumption underlying our results is high eigengene factorizability EF(X (q)). To illustrate this point, Text S4 describes a brain cancer network comprised of the 500 genes with highest absolute correlation with brain cancer survival time. 
Our results illustrate that the geometric interpretation of gene coexpression networks has important theoretical and practical implications that may guide the development and application of network methods. Materials and Methods Network Concept Functions and Fundamental Network Concepts Analogous to [8], we define a network concept function to be function of a square matrix M = [Mij ] (1≤i,j≤n) and/or a corresponding vector G = (G 1,…,Gn ). For example, M could be the adjacency matrix (with diagonal set to 0) and G could be a corresponding gene significance measure. We make use of the following network concept functions: (42) where the components of matrix BM in the denominator of the clustering coefficient function are given by bij  = 1 if i≠j and bii  = Ind(mii >0). Here the indicator function Ind(·) takes on the value 1 if the condition is satisfied and 0 otherwise. According to our convention, the diagonal elements of the adjacency matrix are set to 1. Therefore, the diagonal elements of A–I (where I denotes the identity matrix) equal 0. Now we are ready to define the (fundamental) network concepts that are studied in this article. Definition of Fundamental Network Concepts: The fundamental network concepts of a network A are defined by evaluating the network functions ( Equation 42 ) on A–I and the gene significance measure GS, i.e., For example, the connectivity is given by (43) We define an intramodular network concept NCF(A (q)−I,GS (q)) by evaluating the network concept function on the restricted adjacency matrix A (q) and the restricted gene significance measure GS (q). We will now define eigengene-based network concepts. Using the eigengene-based adjacency matrix AE (q) = ae (q)(ae (q)) T (Equation 28) and the eigengene-based gene significance measure GSE,i (q) = ae,i (q) ae,t (q) (Equation 29), we define an eigengene-based network concept as NCF(AE (q),GSE (q)). 
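The network concept functions of Equation 42 are not reproduced in this text, so the sketch below relies on the commonly used definitions of connectivity, density, centralization, heterogeneity, and clustering coefficient; these formulas, the function names, and the use of `>=` in the hard threshold are assumptions, not a transcription of the original equations. The example also builds a weighted adjacency |cor|^β and an unweighted adjacency thresholded at τ, both with diagonal 1 as stated above.

```python
import numpy as np

def soft_threshold_adjacency(X, beta=1):
    """Weighted network: a_ij = |cor(x_i, x_j)|**beta, rows of X are genes."""
    A = np.abs(np.corrcoef(X)) ** beta
    np.fill_diagonal(A, 1.0)
    return A

def hard_threshold_adjacency(X, tau=0.5):
    """Unweighted network: a_ij = 1 if |cor(x_i, x_j)| >= tau, else 0 (assumed >=)."""
    A = (np.abs(np.corrcoef(X)) >= tau).astype(float)
    np.fill_diagonal(A, 1.0)
    return A

def network_concepts(A):
    """Evaluate assumed network concept definitions on A - I (zero diagonal)."""
    M = np.array(A, dtype=float)
    np.fill_diagonal(M, 0.0)
    n = M.shape[0]
    k = M.sum(axis=1)                                   # connectivity
    density = k.sum() / (n * (n - 1))                   # mean off-diagonal adjacency
    centralization = n / (n - 2) * (k.max() / (n - 1) - density)
    heterogeneity = k.std() / k.mean()                  # coefficient of variation of k
    # Clustering coefficient: sum_{j,l} m_ij m_jl m_li / ((sum_j m_ij)^2 - sum_j m_ij^2)
    num = ((M @ M) * M.T).sum(axis=1)
    den = k ** 2 - (M ** 2).sum(axis=1)
    clustering = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
    return {
        "connectivity": k,
        "density": float(density),
        "centralization": float(centralization),
        "heterogeneity": float(heterogeneity),
        "mean_clustering": float(clustering.mean()),
    }

# Exactly factorizable network a_ij = a_i * a_j (illustrative conformities).
rng = np.random.default_rng(1)
a = rng.uniform(0.2, 0.9, size=300)
A = np.outer(a, a)
np.fill_diagonal(A, 1.0)
c = network_concepts(A)
rhs = (1 + c["heterogeneity"] ** 2) ** 2 * c["density"]
print("mean clustering coefficient:", round(c["mean_clustering"], 4))
print("(1 + Heterogeneity^2)^2 x Density:", round(rhs, 4))
```

For a factorizable network of this size the two printed values should agree closely, which is the relationship reported for Equation 31; nodes with a zero denominator (possible in sparse unweighted networks) are assigned a clustering coefficient of 0 in this sketch.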
As example, consider the eigengene-based connectivity given by (44) Deriving Observation 1: Expression Data with High Eigengene Factorizability Lead to Approximately Factorizable Networks Here we derive Observation 1, which characterizes approximately factorizable gene coexpression module networks. To simplify the presentation, we omit the superscripts (q) in the following, e.g., we will write EF(X) instead of EF(X (q)). We will argue that if the eigengene factorizability EF(X) is close to 1, the adjacencies of the weighted coexpression module network A = |cor(X)| β and the trait-based gene significance measure GSi  = |cor(xi ,T)| β can be factored as follows (45) where (46) (47) Since our gene coexpression networks are defined with respect to the correlation matrix [cor(xi ,xj )], which is scale-invariant, we can assume that the gene expression profiles have been scaled as follows: where m is the number of microarray samples. Then one can derive the following relationships Note that u 1,i |d 1|2 u 1,j /m = cor(xi ,E)cor(xj ,E). Using the fact that U is an orthogonal matrix, it is straightforward to show that This equation motivates us to propose the following measure of eigengene factorizability: (48) Note that 0≤EF(E)≤1. By definition EF(E)≈1 implies that By raising both sides of this equation to a power β, we find The last step highlights an important theoretical advantage of the soft thresholding method: it preserves the approximate factorizability of the underlying correlation matrix. An alternative, possibly more direct way of motivating the observation is based on the insight that the squared singular values |dl|2 correspond to the eigenvalues of the correlation matrix COR = [cor(xi ,xj )]. For high values of EF(E), the correlation matrix can be factored as follows where u 1 denotes an eigenvector of length 1. 
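The derivation above can be checked numerically. The sketch below assumes that the eigengene factorizability of Equation 48 takes the form |d1|^4 / Σ|dl|^4, where the dl are the singular values of the standardized expression matrix; this form lies between 0 and 1 as stated, but it is an assumption because the equation itself is not reproduced in this text. The example then compares |cor(xi,xj)|^β with the product |cor(xi,E)|^β |cor(xj,E)|^β to illustrate that soft thresholding preserves approximate factorizability. All names and data are illustrative.

```python
import numpy as np

def standardize(X):
    """Scale each gene (row of X) to mean 0 and variance 1 across samples."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

def eigengene_factorizability(X):
    """Assumed form of EF: |d_1|^4 / sum_l |d_l|^4 from the singular values."""
    d = np.linalg.svd(standardize(X), compute_uv=False)
    return float(d[0] ** 4 / np.sum(d ** 4))

def eigengene_conformity(X):
    """Correlation of each gene with the module eigengene, a_e,i = cor(x_i, E)."""
    Z = standardize(X)
    _, _, vt = np.linalg.svd(Z, full_matrices=False)
    E = vt[0]
    return np.array([np.corrcoef(z, E)[0, 1] for z in Z])

def factorization_error(X, beta=1):
    """Largest off-diagonal gap between |cor|**beta and the outer product of |a_e|**beta."""
    A = np.abs(np.corrcoef(X)) ** beta
    a = np.abs(eigengene_conformity(X)) ** beta
    diff = np.abs(A - np.outer(a, a))
    np.fill_diagonal(diff, 0.0)
    return float(diff.max())

# Synthetic module with one dominant signal and weak noise (illustrative only).
rng = np.random.default_rng(0)
signal = rng.normal(size=100)
X = np.outer(rng.uniform(0.8, 1.2, size=40), signal) + 0.1 * rng.normal(size=(40, 100))

print("eigengene factorizability:", round(eigengene_factorizability(X), 4))
for beta in (1, 6):
    print("beta =", beta, "max factorization error:", round(factorization_error(X, beta), 4))
```

When the expression data contain no shared signal, the same function returns a factorizability well below 1, which is the situation described for the grey, non-module genes.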
Relationships among Network Concepts When the Maximum Conformity Assumption Does Not Hold

Here we describe relationships among eigengene-based network concepts if the maximum conformity assumption does not hold (i.e., ae ,max (q) 1 then the following inequality holds: (53)

To prove the Proposition, we will make use of the following Lemma: Let {ui, i = 1,…,n} and {vi, i = 1,…,n} be groups of nonnegative numbers, and θ be a number 0≤θ<1. Then the following inequality holds: (54)

The Lemma can be proved with Hölder's inequality, which is given by (55)

We use the Lemma with θ1 = β/(2β−1), ui = ai, and vi = ai^(2β) to derive a first inequality. Further, we use the Lemma with θ2 = (2β−2)/(2β−1), ui = ai, and vi = ai^(2β) to derive a second inequality. By squaring the first inequality and multiplying it with the second inequality, we arrive at a combined inequality, since 2θ1+θ2 = 2 and 3−(2θ1+θ2) = 1. The last inequality completes the proof since it is equivalent to the inequality in Equation 53.

Supporting Information

Text S1 Robustness Analysis of the Brain Cancer Gene Coexpression Network. This supporting text provides a detailed analysis of the brain cancer gene coexpression network. The robustness analysis illustrates how the results change with regard to different network construction methods. (3.83 MB PDF)

Text S2 Robustness Analysis of the Mouse Gene Coexpression Network. This supporting text provides a detailed analysis of the mouse tissue gene coexpression network. The robustness analysis illustrates how the results change with regard to different network construction methods. (3.76 MB PDF)

Text S3 Robustness Analysis of the Yeast Gene Coexpression Network. This supporting text provides a detailed analysis of the yeast cell cycle gene coexpression network. The robustness analysis illustrates how the results change with regard to different network construction methods. (2.62 MB PDF)
Text S4 Brain Cancer Network Comprised of 500 Prognostic Genes. Here we analyze a brain cancer network comprised of the 500 genes with highest absolute correlation with brain cancer survival time. The results illustrate that our theoretical results also apply to small networks comprised of sample trait related genes. The robustness analysis illustrates how the results change with regard to different network construction methods. (0.38 MB PDF)

                Author and article information

                Contributors
                Role: Writing – original draft; Role: Writing – review & editing
                Role: Writing – original draft; Role: Writing – review & editing
                Role: Conceptualization; Role: Data curation
                Role: Formal analysis; Role: Funding acquisition
                Role: Investigation; Role: Methodology
                Role: Project administration; Role: Resources
                Role: Software; Role: Supervision
                Role: Validation; Role: Visualization
                Role: Writing – review & editing
                Role: Editor
                Journal
                PLoS One
                PLOS ONE
                Public Library of Science (San Francisco, CA USA )
                1932-6203
                3 February 2023
                2023
                Volume: 18
                Issue: 2
                Article number: e0281439
                Affiliations
                [001] The First Affiliated Hospital of Guangxi Medical University, Nanning, Guangxi, China
                University of Louisville, UNITED STATES
                Author notes

                Competing Interests: The authors have declared that no competing interests exist.

                Author information
                https://orcid.org/0000-0002-5463-8048
                Article
                PONE-D-22-24410
                10.1371/journal.pone.0281439
                9897545
                36735737
                e493c2af-9c53-4443-9869-64ea5786259d
                © 2023 Wei et al

                This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

                History
                Received: 7 September 2022
                Accepted: 23 January 2023
                Page count
                Figures: 5, Tables: 4, Pages: 11
                Funding
                The author(s) received no specific funding for this work.
                Categories
                Research Article
                Biology and Life Sciences
                Anatomy
                Renal System
                Kidneys
                Medicine and Health Sciences
                Anatomy
                Renal System
                Kidneys
                Biology and Life Sciences
                Biochemistry
                Biomarkers
                Biology and Life Sciences
                Molecular Biology
                Molecular Biology Techniques
                Molecular Biology Assays and Analysis Techniques
                Amino Acid Analysis
                Research and Analysis Methods
                Molecular Biology Techniques
                Molecular Biology Assays and Analysis Techniques
                Amino Acid Analysis
                Biology and Life Sciences
                Genetics
                Gene Identification and Analysis
                Genetic Screens
                Biology and Life Sciences
                Genetics
                Gene Expression
                Biology and Life Sciences
                Genetics
                Gene Identification and Analysis
                Genetic Networks
                Computer and Information Sciences
                Network Analysis
                Genetic Networks
                Biology and Life Sciences
                Biochemistry
                Enzymology
                Enzyme Chemistry
                Enzyme Metabolism
                Biology and Life Sciences
                Biochemistry
                Metabolism
                Amino Acid Metabolism
                Custom metadata
                The datasets generated and analyzed during the current study are available in the NCBI Gene Expression Omnibus repository ( https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE139061) with accession number GSE139061.
