41
views
0
recommends
+1 Recommend
0 collections
    0
    shares
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      Predictive analytics of environmental adaptability in multi-omic network models

      research-article
      a , 1 , 1
      Scientific Reports
      Nature Publishing Group

      Read this article at

      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          Bacterial phenotypic traits and lifestyles in response to diverse environmental conditions depend on changes in the internal molecular environment. However, predicting bacterial adaptability is still difficult outside of laboratory controlled conditions. Many molecular levels can contribute to the adaptation to a changing environment: pathway structure, codon usage, metabolism. To measure adaptability to changing environmental conditions and over time, we develop a multi-omic model of Escherichia coli that accounts for metabolism, gene expression and codon usage at both transcription and translation levels. After the integration of multiple omics into the model, we propose a multiobjective optimization algorithm to find the allowable and optimal metabolic phenotypes through concurrent maximization or minimization of multiple metabolic markers. In the condition space, we propose Pareto hypervolume and spectral analysis as estimators of short term multi-omic (transcriptomic and metabolic) evolution, thus enabling comparative analysis of metabolic conditions. We therefore compare, evaluate and cluster different experimental conditions, models and bacterial strains according to their metabolic response in a multidimensional objective space, rather than in the original space of microarray data. We finally validate our methods on a phenomics dataset of growth conditions. Our framework, named METRADE, is freely available as a MATLAB toolbox.

          Related collections

          Most cited references46

          • Record: found
          • Abstract: found
          • Article: not found

          Quantitative prediction of cellular metabolism with constraint-based models: the COBRA Toolbox v2.0.

          Over the past decade, a growing community of researchers has emerged around the use of constraint-based reconstruction and analysis (COBRA) methods to simulate, analyze and predict a variety of metabolic phenotypes using genome-scale models. The COBRA Toolbox, a MATLAB package for implementing COBRA methods, was presented earlier. Here we present a substantial update of this in silico toolbox. Version 2.0 of the COBRA Toolbox expands the scope of computations by including in silico analysis methods developed since its original release. New functions include (i) network gap filling, (ii) (13)C analysis, (iii) metabolic engineering, (iv) omics-guided analysis and (v) visualization. As with the first version, the COBRA Toolbox reads and writes systems biology markup language-formatted models. In version 2.0, we improved performance, usability and the level of documentation. A suite of test scripts can now be used to learn the core functionality of the toolbox and validate results. This toolbox lowers the barrier of entry to use powerful COBRA methods.
            Bookmark
            • Record: found
            • Abstract: found
            • Article: found
            Is Open Access

            A comprehensive genome-scale reconstruction of Escherichia coli metabolism—2011

            Introduction A common denominator for systems biology studies of a target organism is a high-quality genome-scale metabolic network reconstruction. A network reconstruction represents a biochemically, genetically, and genomically structured knowledgebase that contains detailed information about an organism in a structured format (Thiele and Palsson, 2010). Metabolic network reconstructions contain information such as exact stoichiometry of metabolic reactions, chemical formulas and charges of metabolites, and the associations between genes, proteins, and reactions. These reconstructions form a basis for the formulation of mechanistic, and thus computable, genome-scale genotype–phenotype relationships (Palsson, 2009). The most detailed and complete metabolic reconstruction of any organism to date is for the common laboratory strain Escherichia coli K-12 MG1655. The first genome-scale reconstruction of E. coli was iJE660 (Edwards and Palsson, 2000). This network was constructed through extensive searches of literature and databases to ensure correct stoichiometry and cofactor usage, and was the most extensive metabolic network reconstruction in existence at that time. An updated version of this reconstruction, iJR904 (Reed et al, 2003), had an expanded scope, including pathways for the consumption of alternate carbon sources and more specific quinone usage in the electron transport system. Hundreds of new genes and reactions were added, gene–protein–reaction associations (GPRs) were included for the first time to connect reactions with genes, and all reactions were elementally and charged balanced through the inclusion of protons. In the next update, iAF1260 (Feist et al, 2007), the scope of the network was expanded again, now including many reactions for the synthesis of cell wall components, and all metabolites were assigned to the cytoplasm, periplasm, or extracellular space. The thermodynamic properties of each reaction were calculated, and this was used to set lower bounds on predicted irreversible reactions. iAF1260 contained 2077 reactions, 1039 metabolites, and 1260 genes. A core model version of iAF1260, useful for testing and debugging new constraint-based algorithms and for educational use, has also been published (Orth et al, 2010a). Here, we present an updated version of the E. coli metabolic network reconstruction. This new version, titled iJO1366, includes many newly characterized genes and reactions. Since the iAF1260 model was a very complete representation of the known metabolism of E. coli, only minor expansions in the scope of the network were made. Still, new discoveries since 2007 have made this model update necessary. Several genes were added based on the results of an experimental screen of E. coli knockout strains in four different media conditions. The gaps in the iAF1260 network were identified and characterized, and new reactions and genes were added to reduce the total number of gaps. The iJO1366 reconstruction can serve as a basis for metabolic network reconstructions of other E. coli strains and closely related organisms. We assembled preliminary reconstructions for many other E. coli strains, and analysis of their completeness provides insights into the metabolic networks of these organisms. iJO1366 is the most complete E. coli metabolic reconstruction to date, and like its predecessors, it will likely aid in many new discoveries (Feist and Palsson, 2008). Results and discussion Process for updating the reconstruction and its content The updated network reconstruction of E. coli K-12 MG1655 began with the iAF1260b network (Feist et al, 2010), a slightly updated version of the iAF1260 network. In order to identify incorrect model predictions in order to improve the E. coli reconstruction, we experimentally determined conditional essentiality for most of the genes in the iAF1260b model. By comparing model predicted growth phenotypes to the measurements, errors in the reconstruction were found and several updates were made (Supplementary Table 1). For a discussion of the updates made to iAF1260b based on this screen, see Experimental phenotypic screens in the Supplementary Information. Next, literature and database searches were used to add newly characterized genes and reactions since 2007. The EcoCyc (Keseler et al, 2009) and KEGG (Kanehisa et al, 2010) databases were used extensively for this purpose. Results from the experimental screen described above also led to several model updates. After this first round of updates, the reconstruction contained 1274 genes. The network gaps (Orth and Palsson, 2010) in this version of the reconstruction were then investigated using a modified version of the GapFind algorithm (Satish Kumar et al, 2007). All orphan reactions (reactions without known associated genes) in the reconstruction were also identified from the model GPRs. Gaps were manually sorted into scope and knowledge gaps. Scope gaps are metabolites that are blocked in a model due to the limited scope of the network reconstruction, but have actual known producing and consuming reactions. Knowledge gaps exist because our knowledge of any metabolic network is incomplete. Targeted literature and database searches were performed for each knowledge gap to try to identify any known metabolic reactions missing from the reconstruction. We continued to add newly published metabolic information to the reconstruction during this gap-filling process, and the reconstruction was ultimately updated to iJO1366. All manual curation followed an established protocol (Thiele and Palsson, 2010). iJO1366 represents a significant expansion of the E. coli reconstruction, as it contains 1366 genes, 2251 metabolic reactions, and 1136 unique metabolites. A comparison of the content of iJO1366 and its predecessor, iAF1260, is presented in Table I. Like iAF1260, iJO1366 contains a wide range of metabolic functions (Figure 1). The complete lists of reactions and metabolites in iJO1366 can be found in Supplementary Tables 2 and 3, with a list of all references used in Supplementary Table 4. iJO1366 accounts for three cellular compartments: the cytoplasm, periplasm, and extracellular space. In total, 107 new genes were added to the reconstruction, while one gene, prpE (b0335), was removed. Most new genes added have been characterized since iAF1260 was published in 2007 (Figure 1D). The fact that some references predate previous versions of the E. coli reconstruction does not necessarily mean that they were previously missed. Rather, as genes and reactions are often added on a pathway basis, complete functional pathways are typically fully elucidated over time from multiple sources. The new genes mostly add new pathways and systems to the network, but a significant number of them fill gaps and orphan reactions in existing systems (Figure 1E). A complete list of all new and removed genes, reactions, and metabolites can be found in Supplementary Table 5. The ‘core' and ‘wild-type' biomass reactions of iAF1260 have also been updated in iJO1366. These are reactions that drain biomass precursor compounds in experimentally determined ratios to simulate growth (Feist and Palsson, 2010). For the complete core and wild-type biomass reactions see Updating the biomass composition and growth requirements in the Supplementary Information and Supplementary Table 6. The knowledge index (number of abstracts in Medline) of the 1366 genes in the network was computed, and indicates that iJO1366 contains most of the best-characterized genes in E. coli (Knowledge index of iJO1366 genes in the Supplementary Information and Supplementary Table 7). iJO1366 was also compared with an automatically generated E. coli reconstruction from the Model SEED (Henry et al, 2010) (Comparison of iJO1366 to the Model SEED E. coli reconstruction in the Supplementary Information and Supplementary Table 8) and to the protein localization database EchoLocation (Horler et al, 2009) (Comparison of iJO1366 to the EchoLocation database in the Supplementary Information and Supplementary Table 9). A significant number of the gaps in the iAF1260 network were filled during the update to iJO1366, and several blocked pathways were unblocked. Several different types of gaps in metabolic networks are possible. Root no-production gaps are metabolites with consuming reactions but no producing reactions. Root no-consumption gaps are metabolites with producing reactions but no consuming reactions. Downstream gaps are metabolites with producing and consuming reactions but which are unable to be produced at steady state because they are downstream of a root no-production gap. Similarly, upstream gaps are upstream of root no-consumption gaps. The final reconstruction contains 48 root no-production gaps, 63 root no-consumption gaps, 52 downstream gaps, and 69 upstream gaps. In total, 11.5% of the metabolites in iJO1366 are blocked under all conditions due to gaps (Gaps and orphan reactions in the iJO1366 reconstruction in the Supplementary Information and Supplementary Table 10). The orphan reactions in models such as iJO1366 can also help to identify the functions of metabolic genes. In the original iJR904 study (Reed et al, 2003), gene homology was used to predict the likely E. coli genes that encode the enzymes for 56 orphan reactions. Since then, 14 of these predictions have been independently confirmed to be correct (Supplementary Table 11), and these genes are now included in iJO1366. Prediction of metabolic phenotypes Flux balance analysis (FBA) (Orth et al, 2010b) can be used with a constraint-based model to predict metabolic flux distributions, growth rates, substrate uptake rates, and product secretion rates. The iAF1260 model and its predecessors were already very accurate at making phenotypic predictions such as growth rates and central metabolic flux distributions, so improved predictive capabilities in these areas were not expected with iJO1366. Instead, the value of the updated model is in its ability to predict phenotypes under a wider range of conditions than its predecessors. To demonstrate the utility of the iJO1366 model in making these phenotypic predictions, we generated two large-scale sets of model phenotype predictions. First, the growth phenotypes of E. coli on all possible carbon, nitrogen, phosphorus, and sulfur sources were predicted. The numbers of growth-supporting substrates are summarized in Table II, and the full results of this screen given in Supplementary Table 12 and discussed in Prediction of all growth-supporting carbon, nitrogen, phosphorus, and sulfur sources in the Supplementary Information. We also performed a screen of model predicted growth phenotypes for all possible single gene knockout strains. Growth phenotypes were predicted on both glucose and glycerol minimal media, and the results were compared with experimental data sets (Table III; Supplementary Table 13). Not unexpectedly, iJO1366 is slightly less accurate at predicting overall gene essentiality than iAF1260. This difference is due to the fact that the 107 new genes added to this model version are from less well-studied systems and pathways than the existing genes in iAF1260, as discussed more thoroughly in Prediction of gene essentiality in the Supplementary Information. Mapping iJO1366 to closely related strains Although iJO1366 is a model of E. coli K-12 MG1655, gene homology mapping can be used to create models of other E. coli and Shigella strains. To date, the metabolic reconstruction of E. coli W (Archer et al, 2011) is the only published reconstruction for E. coli strains other than K-12. The iJO1366 reconstruction should prove useful for the study of other recently sequenced E. coli strains. A previous analysis of multiple E. coli genomes showed that there is a moderate level of variability with respect to metabolic gene content within the species (Vieira et al, 2011). This analysis went one step beyond genetic conservation, also investigating the conservation of network topology, but it stopped short of computing the capacity to carry flux through specific growth-supporting pathways. While it is known that equivalent function is not guaranteed by gene homology, it is still one of the most commonly used and effective methods of genome annotation (Frazer et al, 2003). In addition, the combination of sequence homology with constraint-based analysis of metabolic networks can be used to determine the most likely metabolic gene content of an organism by generating models that match known biology. Using this approach, we predicted with FBA whether metabolic models based on iJO1366 for 38 E. coli and Shigella strains are capable of producing biomass on glucose minimal media at various conservation thresholds (Figure 2A). At each percent identity (PID) cutoff, the set of genes that are absent from iJO1366 was determined using Smith–Waterman alignments. The set of missing genes was then used to impose constraints on metabolic reactions utilizing the iJO1366 GPRs. As expected, the results show that all strains were capable of producing biomass with no conservation requirement, but many strains lost this ability as the requirement was made more stringent. At a PID of 40%, only four strains were incapable of producing biomass. This PID was used for further analysis, and was justified through network-based analysis of auxotrophies (Figure 2B and Mapping iJO1366 to closely related strains in the Supplementary Information). Hundreds of unannotated genes in various E. coli and Shigella strains were identified through comparisons to iJO1366 genes, and are listed in Supplementary Table 14. By analyzing the genetic content of 38 E. coli and Shigella genomes, 1006 metabolic genes in iJO1366 were found to be common to all genomes. The average genome in this analysis contains ∼97% of the genes in iJO1366 (Figure 2C). The mapping procedure described here led to the creation of 38 strain-specific models. It is important to note that these models are not yet on the same level as typical ‘draft' metabolic models, as new metabolic functions (functions beyond those occurring in E. coli K-12 MG1655) have not yet been considered. Because only the 1366 metabolic genes of iJO1366 were considered for this mapping, the common set of genes (74%) appears larger than in the previous study by Vieira et al, which considered a set of 1545 reactions. This difference is also partly due to the fact that Vieira et al added orphan reactions to their E. coli networks and compared metabolic content on a reaction basis, rather than on a gene basis. Conclusions Although the metabolism of E. coli has been the subject of active research for decades, the new content added to the iJO1366 reconstruction is a result of the new discoveries that continue to be made. In fact, most of the newly characterized genes in iJO1366 were actually characterized in 2010 (Figure 1D). Although this is a small sample size, it appears that the pace of new discoveries is not slowing as more of the metabolic network is uncovered. There are still numerous uncharacterized E. coli K-12 MG1655 genes, many of which are predicted to have metabolic functions (Riley et al, 2006). As more discoveries are made in the future, additional updates to the E. coli network reconstruction will need to be made. Future increases in the scope of the E. coli metabolic network reconstruction will likely include the integration of this network with reconstructions of other cellular systems such as transcription and translation (Thiele et al, 2009) or transcriptional regulation (Covert et al, 2004). iJO1366 is the most advanced and comprehensive metabolic reconstruction of any microorganism to date, and can thus continue to serve as a basis for the metabolic reconstruction of other bacteria. Based on the success of its predecessors, we expect that iJO1366 will be an important tool in microbial systems biology for years to come. Materials and methods Experimental phenotypic screens The iAF1260 E. coli K-12 MG1655 metabolic reconstruction was used to make computational predictions. The parent strain of the Keio Collection, BW25113, is derived from K-12 MG1655 and is missing several genes that are present in K-12 MG1655: araBAD, rhaBAD, and lacZ. Therefore, flux through the associated reactions without isozymes (ARAI, RBK_L1, RMPA, LYXI, RMI, RMK, and LACZ) was constrained by setting the upper and lower flux bounds of these reactions to zero. For aerobic growth, oxygen uptake was allowed by setting the lower bound of the oxygen exchange reaction to −18.5 mmol gDW−1 h−1. Anaerobic growth was modeled by setting the lower bound of this reaction to zero. For growth on glucose, the lower bound of the glucose exchange reaction was set to −8 mmol gDW−1 h−1. For growth on L-lactate and succinate, the lower bounds were set to −16 mmol gDW−1 h−1. These are arbitrary bounds based on typical substrate uptake rates. After setting the bounds for each condition, the predicted effect of the deletion of each gene in iAF1260 for each condition was computed using the singleGeneDeletion COBRA Toolbox function (Becker et al, 2007), which uses GPRs to constrain the appropriate reactions to carry zero flux and then predicts maximum growth using FBA. A gene was considered essential for the simulated condition if deletion of the gene reduced optimal growth rate to <5% of the wild-type strain as computed by FBA. When a strict threshold of zero growth was used to determine essentiality instead, only one additional essential gene was predicted: lldD (b3605) on lactate minimal medium. Knockout strains were taken from the Keio Collection (Baba et al, 2006; Yamamoto et al, 2009) (supplied by Open Biosystems), a genome-scale collection of E. coli K-12 gene knockouts. Stocks of knockout mutants were streaked onto LB agar with kanamycin (50 μg ml–1) from odd-numbered Keio Collection plates only. Two single colonies of the Keio knockout strain were inoculated into 96-well plates containing 200 μl of LB media with kanamycin (50 μg ml–1) and incubated overnight at 37 °C without shaking. Plates were then centrifuged and pelleted cells were washed twice with 200 μl of 1 × M9 salts per well. Disposable replicator pins were used to transfer cells from the preculture plate to four new plates, two containing glucose M9 minimal medium, one containing lactate M9 minimal medium, and one containing succinate M9 minimal medium. There was 200 μl of minimal medium per well, and the minimal media also contained 50 μg ml–1 kanamycin. The 96-well plates were covered with Aeraseal breathable film (Sigma) to minimize cross-well contamination. For aerobic conditions, the plates were incubated at 37 °C without shaking in a sterile cabinet. For anaerobic conditions, the plates were incubated at 37 °C in an anaerobic chamber ([O2] <50 p.p.m.). After 48 h, absorbance at 600 nm of each well was determined using a VERSAmax microplate reader (Molecular Devices, Sunnyvale, CA). A well was considered to have no growth or slow growth if its absorbance was less than a cutoff value specific to each of the four conditions. The cutoff value was determined by visual inspection of a histogram of absorbance values for all wells measured from a given condition. ‘Normal' growers are supposed to result in measurements lying in the roughly Gaussian-shaped distribution that makes up most of the data, while slow- and non-growers are supposed to result in measurements falling outside the upper 0.95 area of the Gaussian distribution. The cutoffs were OD600=0.26 for glucose aerobic plates, OD600=0.21 for glucose anaerobic plates, OD600=0.21 for lactate aerobic plates, and OD600=0.20 for succinate aerobic plates. If both colonies of a gene knockout were determined to have normal growth, then the gene was considered a true positive (TP) if the model had predicted growth for the knockout, or a tentative false negative (TFN) if the model indicated the knockout should not have grown or previous experiments in glucose MOPS minimal medium had indicated the gene was essential (Baba et al, 2006). Similarly, if both colonies of a gene knockout were determined to have slow/no growth, then the gene was considered a true negative (TN) if the model predicted the same outcome, or a tentative false positive (TFP) otherwise. A gene was considered inconclusive (INC) if for any condition only one colony showed slow/no growth. A second round of screening was performed for TFN, TFP, and INC gene knockouts. TFP and INC gene knockouts were rescreened with two colonies each; TFN gene knockouts were rescreened with four colonies and their pellets were washed four times before transfer to minimal media instead of twice to ensure that growth was not due to contamination from trace amounts of rich media from the precultures. For TFP and INC gene knockouts, a gene was considered essential for a condition if at least one colony showed slow/no growth in the condition in both the first and second round screens. If an essential gene was predicted by the model to be non-essential in the experimental condition, then the gene was concluded to be a genuine false positive (FP) for that condition. For TFN gene knockouts, a gene was considered a genuine false negative (FN) if two or more colonies demonstrated normal growth in the secondary screen. Metabolic network reconstruction procedure The iJO1366 reconstruction was assembled by updating the iAF1260b E. coli metabolic reconstruction (Feist et al, 2010), an updated version of the iAF1260 reconstruction (Feist et al, 2007). iAF1260b contains six additional reactions that were not in iAF1260 (ALAt2rpp, ASPt2rpp, CITt3pp, DHORDfum, GLYt2rpp, and MALt3pp). A 96-step procedure for metabolic network reconstruction was recently published (Thiele and Palsson, 2010), and the appropriate steps were followed when adding new genes, reactions, and metabolites to form iJO1366. The reconstruction was assembled using the SimPheny (Genomatica Inc., San Diego, CA) software platform. All new metabolites were checked against public databases (KEGG, PubChem) for correct structure and charge at a pH of 7.2. New reactions were mass and charge balanced and reversibility was assigned based on experimental studies, thermodynamic information, or the heuristic rules in the standard reconstruction protocol (Thiele and Palsson, 2010). Reactions were associated with genes and functional proteins to form GPRs. The iJO1366 model was exported from SimPheny as an SBML file and the COBRA Toolbox (Becker et al, 2007), a Matlab (The MathWorks Inc., Natick, MA) Toolbox, was used for additional model testing. The Tomlab (Tomlab Optimization Inc., Seattle, WA) CPLEX linear programming solver was used for all optimization procedures. The GapFind MILP algorithm (Satish Kumar et al, 2007) was encoded in the COBRA Toolbox and used to identify all blocked metabolites in the iAF1260 and iJO1366 models. This algorithm was modified from the published version. Specifically, an option was included to change the mass balance constraint to , allowing for metabolites without consuming reactions to be identified as gaps. For each GapFind run, the lower bounds of all exchange reactions were set to −1000 mmol gDW−1 h−1 and the upper bounds of all model reactions were set to 109 mmol gDW−1 h−1. The GapFind algorithm was then run twice, once with each mass balance constraint option. Root no-production and no-consumption metabolites were identified from the model stoichiometric matrix (S) by searching for rows containing only negative or positive coefficients, respectively. Downstream no-production and upstream no-consumption gaps were identified by removing the root gaps from the GapFind outputs. The root gaps of each downstream gap were identified through targeted computational experiments in which metabolite source reactions were added to the network to restore connectivity. Orphan reactions were identified as all reactions without associated GPRs. The core and wild-type biomass reactions were modified from the iAF1260 biomass reactions. Biotin was added with a coefficient based on a published biotin concentration of 250 molecules per cell (Delli-Bovi et al, 2010), while the related cofactor lipoate was added with a similar number of molecules per cell assumed. Iron–sulfur clusters were added with coefficients based on the predictions that 5% of all E. coli proteins contain these clusters (Fontecave, 2006), and that the majority (90%) of these clusters are of the [4Fe–4S] type. The coefficient of Fe2+ was decreased to account for Fe used in iron–sulfur clusters. Molybdenum cofactors were added with coefficients based on the measurement that the inorganic ion content of E. coli is 0.80% Mo (Cvetkovic et al, 2010), and the fact that the majority of this Mo is in the bis-molybdopterin guanine dinucleotide form (85%). The coefficients of Cu, Mn, Zn, Ni, and Co were also adjusted based on recent in vivo measurements (Cvetkovic et al, 2010). All other biomass components remain the same as in iAF1260. Growth-associated and non-growth-associated maintenances values were recalculated based on recent E. coli K-12 MG1655 chemostat data for growth on glucose minimal media (Taymaz-Nikerel et al, 2010). The slope and intercept of this experimental data were identified by linear regression. For these calculations, the electron transport system NADH dehydrogenase reactions NADH16pp (nuo) and NADH5 (ndh) were constrained to carry identical fluxes by replacing these reactions with an equivalent ‘flux split' reaction, in order to constrain the model to a realistic P/O ratio of 1.375 (Noguchi et al, 2004). The NGAM of 3.15 mmol ATP gDW−1 h−1 was identified by FBA as the maximum amount of ATP produced at a glucose uptake rate of 0.17 mmol gDW−1 h−1, the intercept of the experimental data. The GAM of 53.95 mmol ATP gDW−1 was identified by FBA as the value that would give the correct experimentally determined slope of 10.83 mmol glucose gDW−1 h−1/(μ) h−1 when using the core biomass reaction. The iJO1366 model is available in SBML format at BioModels (accession: MODEL1108160000) and as Supplementary Model 1. Comparison of iJO1366 to the Model SEED E. coli reconstruction The E. coli K-12 MG1655 model Seed83333.1 V20.21 was downloaded from the Model SEED database (Henry et al, 2010) in SBML format. The set of 1139 genes in this model was compared by gene ID (b-number) to the 1366 genes in iJO1366 to identify common genes and the unique genes in each model. The genes in the Model SEED model that were not in iJO1366 were then investigated one at a time. EcoCyc was used to identify gene functions, and several genes with verified metabolic functions were then added to iJO1366. Comparison of iJO1366 to the EchoLocation database Predicted and experimentally determined protein location data were obtained for all E. coli K-12 genes from the EchoLocation database (Horler et al, 2009). The cellular locations of the proteins associated with the 1366 model genes from this database were then compared, one at a time, to the compartments of the metabolites in the reactions associated with each gene in iJO1366. For each location in the EchoLocation database, a Boolean rule was written to determine if the location is consistent with the associated model metabolites. For example, ‘Cytoplasmic' proteins are consistent with genes only associated with cytoplasmic metabolites, and are inconsistent with genes associated with any periplasmic or extracellular metabolites. See Supplementary Table 9 for the full list of Boolean rules. The genes whose locations were inconsistent with model metabolites were then investigated individually, except for ‘Cytoplasmic' and ‘Periplasmic' genes with both cytoplasmic and periplasmic metabolites in the model, because there were too many of these inconsistencies to investigate manually. Literature and database information was used to determine whether the EchoLocation database, the iJO1366 reconstruction, or both were correct. Constraint-based modeling The iJO1366 model, constructed in SimPheny, was exported as an SBML file and used to perform simulations and constraint-based analyses using the COBRA Toolbox and Tomlab CPLEX linear programming solver. The constraint-based model consists of an S matrix with 1805 rows and 2583 columns, where 1805 is the number of distinct metabolites (in all three compartments) and 2583 is the number of reactions including exchange and biomass reactions. Each of the reactions has an upper and lower bound on the flux it can carry. Reversible reactions have an upper bound of 1000 mmol gDW−1 h−1 and a lower bound of −1000 mmol gDW−1 h−1, making them practically unconstrained, while irreversible reactions have a lower bound of zero. By default, the core biomass reaction is set as the objective to be maximized. Certain reactions are by default constrained to carry zero flux to avoid unrealistic behaviors. These reactions are CAT, DHPTDNR, DHPTDNRN, FHL (formate hydrogen lyase), SPODM, SPODMpp, SUCASPtpp, SUCFUMtpp, SUCMALtpp, and SUCTARTtpp. CAT, SPODM, and SPODMpp are hydrogen peroxide producing and consuming reactions that can carry flux in unrealistic energy generating loops. DHPTDNR and DHPTDNRN form a closed loop that can carry an arbitrarily high flux. The succinate antiporters SUCASPtpp, SUCFUMtpp, SUCMALtpp, and SUCTARTtpp can form unrealistic flux loops with other transporters for aspartate, fumarate, malate, and tartrate. The genes encoding FHL are known to be active under anaerobic conditions, but this reaction is constrained to zero to avoid unrealistic aerobic hydrogen production. The NGAM constraint is imposed by a lower bound of 3.15 mmol gDW−1 h−1 on the reaction ATPM. The exchange reactions that allow for extracellular metabolites to pass in and out of the system are defined such that a positive flux indicates flow out. All exchange reactions have a lower bound of zero except for glucose (−10 mmol gDW−1 h−1), the vitamin B12 precursor cob(I)alamin (−0.01 mmol gDW−1 h−1), and oxygen and all inorganic ions required by the biomass reaction (−1000 mmol gDW−1 h−1). The default lower bound on glucose uptake is based on typical glucose uptake rates. Because only a very small amount of B12 is required for growth, the lower bound on cob(I)alamin uptake is arbitrary and never actually constraining in practice. The iJO1366 computational model also includes drain reactions for six cytoplasmic metabolites without known consuming reactions that must be drained from the system to allow simulation of steady-state cell growth. These metabolites are p-cresol, 5′-deoxyribose, aminoacetaldehyde, S-adenosyl-4-methylthio-2-oxobutanoate, (2R,4S)-2-methyl-2,3,3,4-tetrahydroxytetrahydrofuran, and oxamate. Prediction of all growth-supporting carbon, nitrogen, phosphorus, and sulfur sources The possible growth-supporting carbon, nitrogen, phosphorus, and sulfur sources of E. coli were identified using FBA. First, all exchange reactions for extracellular metabolites containing the four elements were identified from the metabolite formulas. Every extracellular compound containing carbon was considered a potential carbon source, for example. Next, to determine possible growth-supporting carbon sources, the lower bound of the glucose exchange reaction was constrained to zero. Then the lower bound of each carbon exchange reaction was set, one at a time, to −10 mmol gDW−1 h−1 (a typical uptake rate for growth-supporting substrates), and growth was maximized by FBA using the core biomass reaction. The target substrate was considered growth supporting if the predicted growth rate was above zero. While identifying carbon sources, the default nitrogen, phosphorus, and sulfur sources were ammonium (nh4), inorganic phosphate (pi), and inorganic sulfate (so4), respectively. Prediction of growth-supporting sources of these other three elements was performed in the same manner as growth on carbon, with glucose as the default carbon source. Prediction of gene essentiality To simulate the effects of gene knockouts, the iJO1366 model with its default constraints and core biomass reaction objective was modified to match the genotype of E. coli BW25113 (see Experimental phenotypic screens). For growth on glucose, the lower bound of the glucose exchange reaction was set to −10 mmol gDW−1 h−1. For growth on glycerol, the lower bound of the glucose exchange reaction was set to zero while the lower bound of the glycerol exchange reaction was set to −10 mmol gDW−1 h−1. All 1366 genes in the model were knocked out one a time and growth was simulated by FBA using the singleGeneDeletion COBRA Toolbox function. Gene knockout strains with a growth rate above zero were considered non-essential. Experimental gene essentiality data for growth on glucose (Baba et al, 2006) and glycerol (Joyce et al, 2006) was then obtained and adjusted based on an updated analysis of the Keio Collection strains (Yamamoto et al, 2009). The newly identified essential genes were added to the lists of essential genes under both conditions, while the genes whose essentiality was identified as uncertain were not changed from their original designations. Mapping iJO1366 to closely related strains The protein sequences of all available E. coli and Shigella strains (38 strains total) were downloaded from KEGG (http://www.genome.jp/kegg/download), and SSEARCH35 of the FASTA suite (Pearson and Lipman, 1988), an open-source implementation of the Smith–Waterman algorithm, was used to determine a PID conservation for each iJO1366 gene. The flags used in SSEARCH35 were ‘–m9 –E 1 –q –H'. Next, the deleteModelGenes COBRA Toolbox function was used to delete all genes in a strain that failed to be conserved at a particular protein sequence identity threshold. The entire cutoff range was scanned (from 0 to 100% identity in increments of 1%) and FBA was performed with an objective function of maximum growth rate using the core biomass reaction, to produce Figure 2A. Subsequently, the situation occurring at 40% identity was investigated to determine why four of the strains were unable to produce biomass at this conservation threshold. A demand reaction for each biomass component was added (excluding the components involved in the ATP hydrolysis reaction) and flux was separately maximized through each of these reactions to determine the biomass components that the strain was unable to produce. To determine the subset of genes responsible for a particular loss of function, the effects of single knockouts were computed and literature was consulted. The preliminary network-level analysis of conservation revealed that many ORFs are not properly automatically annotated in non-model organism genomes. For this reason, this analysis was supplemented with a nucleotide-level search for conservation of each model gene in each of the 38 strains. The analysis described above was repeated, this time with the additional conservation qualification of a genomic hit displaying at least 70% identity and aligned length to the model gene. These cutoffs were determined by manual inspection of a scatter plot of all nucleotide-level results. There was a clear region of conservation in the upper right quadrant defined by these cutoffs. Genes found to be conserved only at the nucleotide level are considered to have 100% protein sequence identity (see Supplementary Table 14 for a complete list of such cases). Supplementary Material Supplementary Information Supplementary Text and Figures Supplementary Tables Supplementary Tables 1–14 Supplementary Model 1 iJO1366 model in SBML format Review Process File
              Bookmark
              • Record: found
              • Abstract: found
              • Article: not found

              Context-Specific Metabolic Networks Are Consistent with Experiments

              Introduction Commonly, genome-scale metabolic networks are reconstructed to contain all known metabolic genes and reactions in a particular organism [1]. These reconstructions are thus a superset of the metabolic reactions that are functioning in the organism at any one time. The processes that determine which enzymes are active in a cell are often overlooked in constraint-based studies. Of particular interest are the transcriptional regulatory processes in cells which choose a subset of possible enzymes for activity at any given time. Knowledge of transcriptional regulation of metabolism comes from different sources. At a low level, from the bottom up, some of the regulatory proteins that control the transcription of sets of metabolic genes are known [2]. At a higher level, from the top down, gene expression data provides a picture of what genes are being transcribed at a particular time, and hence which enzymes are probably active in the cell [3]. Both of these types of knowledge can be used to refine metabolic networks under given conditions. There are three ways to study how regulation tailors gene-expression under a specific condition. First, if a transcriptional regulatory network network (TRN) is available, then the transcription state of the cells can be computed for a given input [4],[5]. However, genome-scale TRNs are not available. Even for Escherichia coli, it has been estimated from dual perturbation experiments that only about one-fourth or one-third of its TRN is currently known [6]. ChIP-chip data and other approaches may soon enable more comprehensive reconstructions. Second, in the absence of a TRN, optimization procedures based on the assumption that the organism picks out the best set of reactions to meet a physiological objective have been used [7]. However, there are multiple solutions to such problems [8] and no real way to determine which internal reactions are used in the absence of flux data. In addition, the statement of an objective introduces a ‘user-bias’ and such objective may not actually be relevant to the true physiological state. The third approach to study regulation relies on the available of expression profiling data. If such data is available for the conditions being examined we can directly examine the expression of the ORFs accounted for in a genome-scale reconstruction. Metabolic network reconstructions can be combined with gene expression data from different states to identify regulatory principles in organisms [9]. Pathway-based analysis methods can be used to predict the usage of entire pathways of reactions based on the expression state of multiple genes [10]. Gene expression data has previously been applied with yeast to predict which reactions may be inactive on a gene-by-gene basis [11]. More recently, gene expression data has been interpreted in terms of elementary modes [12], moving towards a more functional view of analysis. The results of these methods are dependent on the quality of the expression data that is used as input. Expression data is known to be noisy, and the variety of methods for converting the fluorescence intensity of thousands of spots on a chip to semi-quantitative readings of mRNA molecule counts do not produce equivalent results [13],[14]. Importantly, due to the noise, it is impossible to define a comprehensive set of present mRNA transcripts without a large number of false-positives. Practically speaking, one can say either (1) these few mRNAs are almost certainly present in the cell, or (2) some of these many mRNAs are maybe present in the cell. Pathway-based methods, for example [10], attempt to avoid the noise problem by assuming that all mRNAs assigned to a particular pathway should be present or absent together. This is dependent on biased, human-imposed pathway definitions, and reactions that function within a pathway can also function outside of that pathway, limiting the use of such assumptions. Here we use gene expression data in combination with objective functions to create functional models despite potentially noisy data. We describe the use of genome-scale transcriptomic data to constrain reactions in both bacteria and human cells, enabling context-specific metabolic networks to be reconstructed and compared. We quantitatively define the consistency of gene expression data with assumed functional states of a cell, demonstrating agreement with physiological data. Context-specific metabolic networks will be virtually essential to accurately model human metabolism due to the variety of cell types and their corresponding metabolic processes. Results/Discussion GIMME Algorithm The approach to the construction of context-specific metabolic networks is termed Gene Inactivity Moderated by Metabolism and Expression (GIMME) and is illustrated in Figure 1. As inputs, the algorithm requires: (1) a set of gene expression data, (2) the genome-scale reconstruction, and (3) one or more Required Metabolic Functionalities (RMF) that the cell is assumed to achieve. Preliminary tests (not shown) suggest that proteomic data can be substituted for expression profiling data. Given these three inputs the algorithm produces a list of reactions in the network that are predicted to be active and an inconsistency score (IS) that quantitatively classifies the disagreement between the gene expression data and the assumed objective function. This inconsistency score is converted to a normalized consistency score (NCS), allowing for relative comparisons of how well each gene expression data set agrees with a particular metabolic function. 10.1371/journal.pcbi.1000082.g001 Figure 1 A flow chart schematic representation of the GIMME algorithm. The GIMME algorithm takes three inputs: gene expression (or any other data type) mapped to reactions, a metabolic reconstruction, and one or more RMFs. A metabolic reconstruction is mapped through a data set, removing reactions that are not available and creating a reduced model. Reactions are reinserted into the reduced model as needed to achieve RMFs (such as growth and/or ATP production), resulting in a functional, context-specific model that features minimal disagreement with the data. The consistency score quantifies the disagreement with data, showing the minimal sum of fluxes weighted with reaction data deviations from data. Simply speaking, reactions that correspond to mRNA transcript levels below a specified threshold are tentatively declared inactive. If the cell cannot achieve the desired functionality without at least one of these reactions, linear optimization is used to find the most consistent set of reactions to reactivate. Inconsistency scores are calculated based on the product of distance from threshold and necessary flux for each reaction required to be reactivated, as illustrated in Figure 2. A smaller inconsistency score indicates that the data is more consistent with the RMF. The GIMME algorithm produces the network with the minimal inconsistency score through the following two-step procedure: Find the maximum possible flux through each RMF (allowing usage of all reactions). Constrain the RMF's to operate at or above some minimum level (generally a percentage of the maximum found in [A]) and identify the set of available reactions that best fit a quantitative data set. 10.1371/journal.pcbi.1000082.g002 Figure 2 The computation of inconsistency scores. Inconsistency scores for each reaction are computed by multiplying the deviation from a threshold by the required flux through a reaction. In the example here, the green reactions have data above the threshold, set to 12 (this is a parameter; see text). The red reactions have data below the threshold (11.4 and 8.2). The calculation of the inconsistency score corresponding to each reaction is shown numerically as flux multiplied by the deviation from the cutoff. They each increase the inconsistency score, implying that the data are less consistent with the objective of growing on lactate. Greater required fluxes and greater deviation from the threshold both increase the inconsistency scores. The total inconsistency score is the sum of all individual reaction scores. Part A is achieved through flux balance analysis (FBA) [15]. Part B involves the solution of the following linear programming problem: In the formulation, xi is the normalized gene expression data mapped to each reaction. xcutoff is a cutoff value set by the user above which a reaction is definitely present; there is no contribution to the inconsistency score from reactions that are above this threshold. S is the stoichiometric matrix with reactions as columns, metabolites as rows, and stoichiometric coefficients as elements. v is the flux vector, quantitatively describing the flow through each reaction. ai and bi are the lower and upper bounds, respectively, for each reaction and define the minimum and maximum allowable flux. These bounds are set according to the maximal value of the RMF(s) found in step (1), in general by setting the lower bound corresponding to each RMF to some fraction of its maximal value. The great majority of the lower and upper bounds do not correspond to an RMF and are set to the same value as in a standard FBA problem, usually to arbitrarily high and low values, but sometimes to finite values as for input constraints (glucose uptake, for example) and irreversible reactions. The above optimization problem would generally be difficult to solve due to the presence of an absolute value operator, but in this case, a trivial simplification converts the above problem to a standard LP problem. Each reaction defined as possibly reversible (containing a negative lower bound) is converted to two irreversible reactions, thus restricting all fluxes to be positive, and removing the need for the absolute value. In general, some reactions will not have available data. The algorithm takes a conservative approach and designates these reactions as active; hence the term “gene inactivation” is part of the method name. The algorithm treats these reactions as if they had data that surpassed the cutoff; this is a conservative approach to avoid any penalty for absent data. The lack of data does have implications for the interpretation of results. It is entirely possible that given better data, these reactions would be determined to be absent, perhaps necessitating the activation of other reactions. Clearly, with limited data, the results must be considered with caution. In general, this is far more of a concern for human metabolic networks than for E. coli. Context-Specific Networks for E. coli We have used the GIMME algorithm to produce context-specific metabolic networks for E. coli for several different conditions and to compare inconsistency scores for different strains of the bacterium. We show that the inconsistency scores agree with experimental data in nearly all cases. Gene expression data from different conditions of E. coli growth are the input data, and the independent validation data is phenotypic data describing the relative growth and product secretion. Strains adapted to novel substrates Adaptive evolution has been used in the laboratory to improve the growth rate of E. coli on culture media to which it is unaccustomed [16]. Metabolic models of E. coli predict better growth than it initially achieves in the laboratory when it is given a substrate other than glucose to use as a carbon source for growth. However, after selective pressure is applied to maximize growth, the actual growth has been shown to improve to match the prediction, typically after a month or two of serial passage of cells [16]. Using the GIMME algorithm, we constructed context-specific metabolic networks for three types of strains described in [17]: (1) wild-type strains, (2) strains evolved to growth on glycerol, and (3) strains evolved to growth on lactate. The gene expression data used to construct the models consists of CEL files containing the data described in [17], normalized using GCRMA [14]. The data was mapped from genes to reactions using the gene-protein-reaction associations from the reconstruction [18]. The threshold (xcutoff) was set at 12, meaning that reactions assigned a normalized value greater than 12 are assumed to be present; similar results were noted at other thresholds. The RMF was growth on a given carbon source, and the context-specific metabolic networks were forced to grow no less than 90% of optimal growth. Because the evolved strains nearly always grow better than wild-type strains on a variety of carbon sources [19], metabolic networks for optimal growth on nine carbon sources were constructed. The results are shown in Figure 3 (glycerol evolved strains) and Figure 4 (lactate evolved strains). For these figures, the inconsistency scores were used to calculate normalized consistency scores (see Materials and Methods for details); a higher normalized consistency score indicates that the gene expression profile is more consistent with the objective. The figures show that the gene expression state of evolved strains are always more consistent with growth on the nine substrates, paralleling the phenotypic findings from [19] in nearly all cases. These findings demonstrate that the evolved strains have gene expression states that are more consistent (than wild type strains) with usage of the optimal networks for growth on a variety of carbon sources. 10.1371/journal.pcbi.1000082.g003 Figure 3 Glycerol-evolved strain normalized consistency scores. Normalized consistency scores are computed directly from the inconsistency scores, as described in the text. A higher normalized consistency score indicates that the gene expression data is relatively more consistent with the RMF. Thus, here the gene expression data from the glycerol-evolved strains are more consistent with highly efficient growth on each of the carbon sources tested. The p values, determined by permutation testing, are less than 0.01 in all cases here. 10.1371/journal.pcbi.1000082.g004 Figure 4 Lactate-evolved strain consistency scores. This figure demonstrates the same result as Figure 3, but with strains evolved on lactate. The normalized consistency scores for growth on each of the tested carbon sources are higher for evolved strains, indicating that the gene expression data from the evolved strains are more consistent with efficient growth on each carbon source. Metabolic engineering strain Metabolic engineering seeks to optimize bacterial strains to produce a valuable product from a less expensive set of molecules. Rational design of strains for metabolic engineering is possible with genome-scale metabolic models [20]. Adaptive evolution of knock-out strains of E. coli can be used to optimize such strains [21]. We determined the inconsistency scores with GIMME for replicates of a Δpta ΔadhE strain that is designed to produce lactate as a byproduct of anaerobic growth on glucose, as described in [21], as well as wild-type strains. The objective used was growth and lactate production was fixed at a rate consistent with experimental data from [21]. As shown in Figure 5, the designed strain has gene expression data that is more consistent with growth-coupled lactate production, exactly as experimental data indicates. The gene deletions and subsequent evolution have led to a global metabolic gene expression state that is more consistent with growth-coupled lactate production than the wild-type strain. 10.1371/journal.pcbi.1000082.g005 Figure 5 Metabolic engineering strain consistency score. The normalized consistency score for an E. coli strain designed to produce lactate indicate that the Δpta ΔadhE strain has a metabolic gene expression state consistent with the simultaneous production of lactate and growth when compared with the wild-type. This higher normalized consistency score indicates that the gene expression data from the double deletion strain is more consistent with the metabolic engineering objective than the wild-type strain, in accordance with experimental measurements. We used this data set to verify the robustness of the algorithm to two different factors. First, we tested the effect of altering the cutoff by recalculating the results for cutoffs ranging from eight to 14, in increments of 0.1. We found that the consistency scores were significantly different for all cutoffs. For some cutoffs, the p value was not as good as for others, but p<0.01 for all choices of cutoff within the range tested. Second, we verified the robustness of the algorithm with a jackknife test. We randomly removed 5% of the expression values mapped to reactions 100 times and recomputed the context-specific networks and consistency scores. We found that for all repetitions the same conclusion was reached, although in some cases the p values were not quite as low as when using all of the data. In all cases, the conclusion was reached with p<0.02, which demonstrates slightly lower performance with all of the data available. This suggests that the algorithm should be expected to have greater statistical power when as many reactions as possible are assigned data. Terminal electron acceptor effect on network The growth of E. coli varies depending on the availability of terminal electron acceptors, usually oxygen or nitrate. Gene expression data from a total of 21 different strain/electron acceptor conditions was analyzed to construct the most consistent models for growth with oxygen, without oxygen, and with nitrate. The expectation is that the strain data taken from a given condition (for example, aerobic) should be more consistent with growth on that condition (again, aerobic) than strain data from a different condition. Pairwise comparisons were made between all consistency scores, and the results are shown in Figures 6, 7, and 8, for aerobic, anaerobic, and anaerobic nitrate conditions, respectively. A green box indicates that the strain/condition indicated on the y axis is more consistent with growth than the strain on the x-axis; a red box indicates the opposite. The intensity of the green or red color indicates the difference in inconsistency scores, after log2 transformation for visualization scaling. Black boxes indicate that no statistically significant (p<0.05) conclusion could be reached from the data. In all cases where statistically significant conclusions were possible, gene expression in strains grown with oxygen is more consistent with aerobic growth than gene expression from strains grown without (Figure 6). In 99% of cases that are statistically significant, gene expression in strains grown anaerobically is more consistent with anaerobic growth than the data from strains grown with oxygen (Figure 7). The trend holds 90% of the time for anaerobic growth with nitrate (Figure 8). As would be expected, different subsets of reactions are active for each condition. Taken with the previous results, this provides strong support for the inconsistency scores that emerge from the algorithm and provides a positive control. 10.1371/journal.pcbi.1000082.g006 Figure 6 Pairwise comparisons of consistency for aerobic conditions. A graphical representation of the log2 transform of the difference between inconsistency scores. A green box indicates that the sample on the y-axis is more consistent with aerobic growth than the sample on the x-axis. Red boxes indicate the opposite. Differences that do not meet p<0.05 are left blank. The shade of red or green quantifies the log2 of the difference in inconsistency scores. The position of green and red blocks here indicates that in all statistically significant cases, strains grown with oxygen have gene expression more consistent with efficient aerobic growth than strains grown without oxygen. 10.1371/journal.pcbi.1000082.g007 Figure 7 Pairwise comparisons of consistency for anaerobic conditions. A graphical representation of the log2 transform of the difference between inconsistency scores. A green box indicates that the sample on the y-axis is more consistent with anaerobic growth than the sample on the x-axis. Red boxes indicate the opposite. Differences that do not meet p<0.05 are left blank. The shade of red or green quantifies the log2 of the difference in inconsistency scores. The position of green and red blocks shows that in nearly all cases that are statistically significant, gene expression data for strains grown without oxygen is more consistent with efficient anaerobic growth than strains grown with oxygen. 10.1371/journal.pcbi.1000082.g008 Figure 8 Pairwise comparisons of consistency for nitrate conditions. A graphical representation of the log2 transform of the difference between inconsistency scores. A green box indicates that the sample on the y-axis is more consistent with nitrate growth than the sample on the x-axis. Red boxes indicate the opposite. Differences that do not meet p<0.05 are left blank. The shade of red or green quantifies the log2 of the difference in inconsistency scores. The position of green and red blocks indicates that in most cases, gene expression from strains grown with nitrate as the terminal electron acceptor is more consistent with efficient growth under this condition than strains grown under other conditions. Context-Specific Networks for Human Cells The wide variety of human cell types in the body do not share a simple objective such as cellular growth, but rather have a multiplicity of functions necessary for multi-cellular life. Accordingly, understanding the metabolism of any particular cell type requires a model that contains only the reactions present in that cell type, without potentially thousands of extraneous reactions. Human Recon 1 [22] will contain many reactions that are inactive in particular cell types. Accurate models require their removal, and the GIMME algorithm provides a framework for this process. Herein we describe the first functional genome-scale metabolic models for particular human cells, in this case, skeletal muscle cells in different conditions. Data sources We used three publicly available sets of gene expression data for skeletal muscle cells, as depicted in Table 1. 10.1371/journal.pcbi.1000082.t001 Table 1 Datasets used to create context-specific skeletal muscle models. Abbreviation Description Reference GEO Accession Number GB 3 patients before and 1 year after gastric bypass surgery (vastus lateralis). [27] GDS2089 GI 6 subjects before glucose/insulin infusion via clamp and 2 hours after beginning (vastus lateralis) [28] GSE7146 FO 24 subjects divided into 3 groups of eight: morbidly obese (MO), not obese (NO), and obese (O) (rectus abdominus). [27] GDS268 These three datasets were originally gathered for purposes completely distinct from creating context-specific metabolic networks, just as the E. coli datasets described earlier were. Nevertheless, they can be interpreted in the context of a genome-scale metabolic network towards this end. All three datasets were collected using Affymetrix (Santa Clara, CA) gene expression arrays. The GB dataset used U133+ 2.0 arrays, while the GI and FO datasets used U133A arrays. While the arrays are similar, the U133+ 2.0 array is able to provide reliable trancriptomic data for 179 reactions beyond what the U133A array can provide. The coverage of these arrays in terms of model reactions is shown in Figure 9. Each probeset that corresponded to a metabolic gene was mapped to that gene, provided that the annotation information for that particular array type indicated that the probeset sequence was unique to either that gene or a closely related gene. Probesets with sequences that correspond to multiple, unrelated genes were ignored. The values associated with the expression of genes were mapped to reactions through the gene-protein-reaction associations, as described earlier and in Materials and Methods. 10.1371/journal.pcbi.1000082.g009 Figure 9 The mapping of Affymetrix gene chip data to reactions. Reactions in the white area have no usable gene chip data on either platform. Reactions in grey have usable data only on the 133+ 2.0 platform. Reactions in black have usable data for both the 133+ 2.0 and the 133A platform. Importantly, 5% (179) of the reactions are only represented on the 133+ 2.0 chip, potentially increasing scores across chips. The average difference score is 340, so a difference of 179 reactions is greater than a 50% impact. Model creation and comparison Each of the 42 (6+12+24) gene expression datasets was used with the GIMME algorithm to create a model that produces ATP at no less than half the optimal efficiency and matches the data as closely as possible. These models were compared on a pairwise basis by finding the number of reactions that are different in the two models under comparison. On average, two models differ by 340 reactions, which is approximately 10% of the reactions in the global model. The pairwise distances are shown graphically in Figure 10. Darker squares represent pairs of networks that are more similar than lighter squares. Two trends are immediately apparent. First, the metabolic networks that are derived from each dataset are more similar to others derived from that same dataset, as shown by the three large dark squares that surround the diagonal. Secondly, it appears that the GI and FO models are more similar to each other than to the GB models. Initially, we suspected that the gene chip might bias this result, so we recomputed the distance between each pair of models, ignoring the 179 reactions that are not present on the U133A array. This result is graphically depicted in Figure 11, showing that the FO models are similar to both the GB and GI models, but the GI models are not similar to the GB models. Comparing models generated with different gene expression platforms must be done with caution. The bottom line is that there are 179 metabolic reactions that the GB models could elect not to use based on data, but the GI and FO models cannot because no data is present; the GIMME procedure will only turn off a reaction in the presence of some data mapped to that reaction. Better coverage of metabolic reactions on gene expression arrays will lead to fewer extraneous reactions in resulting models. 10.1371/journal.pcbi.1000082.g010 Figure 10 A comparison of skeletal muscle models. This heat map displays the level of difference in each pair of models. Darker squares represent models that are more similar to each other than lighter squares. A black square (as on the diagonal) indicates identical models, and a white square indicates the most different pair of models. The three darker blocks that surround the main diagonal are the comparisons of samples within each dataset to each other. These darker blocks show that the models within each dataset tend to be more similar to each other than to models from other datasets. The models from a particular expression array type also appear to be more similar to each other than to models from a different array types, but the data available do not allow us to show that this is actually true, as is shown in Figure 11. 10.1371/journal.pcbi.1000082.g011 Figure 11 A comparison of skeletal muscle models, using only reactions that have data for both gene chips used. This figure is the same as Figure 10, but the distances that are graphically represented are computed using only reactions that have data on both types of gene chips. The 5% of reactions that are represented on the U133+ 2.0 chip but not the U133A chip are not used for comparison. Here we no longer see any bias based on chip type, but rather we see that the FO datasets appear to be similar to both GI and GB data sets, instead of just the GI data sets. The chip type does not appear to affect the distances as much as the different experiments do. The chip type does appear to affect the reactions that the algorithm defines as active versus inactive, as seen in the differences between Figures 10 and 11. Two significant results In spite of the difficulties of comparing models derived from different sources, two statistically significant differences emerge from the analysis. First, a given patient is more similar to himself before and after either gastric bypass or glucose/insulin infusion than he is to other patients. We took the similarity scores for the GB and GI patients and created two separate groups: (A) all matched patients before and after and (B) all unmatched patients from the same dataset. Permutation testing demonstrated that group A has a smaller mean distance than group B (p<0.01). Secondly, we looked at consistency scores, asking if any group was more consistent with high ATP production than any other. Only one statistically significant (p<0.01) result emerged, that the after-GI patients are more consistent with high ATP production than the before-GI patients. Again, this result is exactly as expected; muscle cells that have been given a substantial dose of glucose and insulin in the bloodstream should be more consistent with high ATP production. Conclusions The work reported herein details the first available method to both produce a guaranteed functional metabolic model specific to a set of gene expression data and quantify the agreement between gene expression data and one or more metabolic objectives. We have demonstrated the functionality of this GIMME method with gene expression data from E. coli and human skeletal muscle cells. We have shown that (1) the computed consistency between gene expression data for different conditions and RMF agrees with physiological data, (2) the most consistent networks depend on the metabolic objective and media conditions, and (3) the most consistent networks for human skeletal muscle cells contain significantly fewer reactions than the global human model. Initially, we expected that the results for human models would be more interesting than those for any other organism reconstructed to date, principally because we expected that human cells would show the most variability across conditions. However, the lack of available data for a substantial number of human metabolic reactions confuses attempts at comparison. We showed that reducing the number of reactions considered by 5% can change the apparent differences between different datasets. In addition, the lack of replicates in human gene expression data sets and the difficulty in obtaining high quality biological controls complicates matters and reduces the statistical power of comparisons. We have higher confidence in the results presented for E. coli because nearly all of the gene-associated reactions have data available, replicates are available, and controls are present. We also found that a substantial number of reactions in E. coli do vary in activity when different input conditions are provided. In the end, we conclude that a tool originally conceived to plug a key gap in the analysis of human cellular metabolism actually provides more immediate use in the analysis of microbial metabolism. With metabolic reconstructions growing in size and becoming available for more and more organisms, tools to filter global reaction lists into context-specific reaction lists will be highly useful. Meaningful analysis of the human metabolic network will require procedures such as GIMME in order to accurately predict phenotypes. Materials and Methods The metabolic networks for E. coli and human cells were imported into Matlab with the COBRA Toolbox [23]. The gene expression data for E. coli has been obtained by our lab and is previously published as cited above. The gene expression data for human skeletal muscle cells was downloaded from NCBI GEO [24] as CEL files. Gene Expression Data Processing The gene expression data was obtained as CEL files and processed using Bioconductor [25]. The data for E. coli was processed using GCRMA as implemented within Bioconductor [14]. The data for human skeletal muscle was processed using the affy package [26] and the mas5calls function. The p values were subtracted from 1 and the resulting value used as a quantitative measure of likelihood that the gene was available. The default parameters were used. For all datasets, the expression level of each reaction was determined by mapping any available data from genes associated with that reaction. If data was not available for any gene associated with a reaction, it was given a score of −1. If data was available for one or more genes, a single score was computed by evaluating the boolean GPR associations; OR's would evaluate to the greater of the two values, AND's to the lesser. The end result was a score for each reaction from each set of data, either −1 or non-negative, with greater numbers implying greater certainty that reaction is present. This is the data that was input into the GIMME algorithm to compute the consistency scores and context-specific metabolic networks. GIMME Implementation The GIMME algorithm is implemented in Matlab, using functions in the COBRA Toolbox. In general, any robust linear programming solver should work; we used Tomlab (Tomlab Optimization, Pullman, WA). Conversion to Normalized Consistency Scores The output from the GIMME algorithm is an inconsistency score, and a higher score means that the gene expression data is less consistent with the model achieving the desired objective. For visualization purposes only, these scores are converted into normalized consistency scores, with a higher score indicating greater consistency between the data and the modal achieving the objective. For a given set of scores, each inconsistency score is subtracted from 1.02 * (maximum inconsistency score) to produce a set of consistency scores. Each consistency score is divided by the maximum consistency score to produce a set of normalized consistency scores. The 1.02 factor assures that the smallest consistency score is slightly greater than zero and easy to visualize on a graph. Statistical Significance Testing Permutation testing with 10,000 randomizations was used to determine the statistical significance of all results with regard to consistency scores. This testing was implemented in Matlab. Visualization Heat-map type representations were produced in Matlab. Other graphs were produced in Excel (Microsoft, Redmond, WA).
                Bookmark

                Author and article information

                Journal
                Sci Rep
                Sci Rep
                Scientific Reports
                Nature Publishing Group
                2045-2322
                20 October 2015
                2015
                : 5
                : 15147
                Affiliations
                [1 ]Computer Laboratory - University of Cambridge , UK
                Author notes
                Article
                srep15147
                10.1038/srep15147
                4611489
                26482106
                39740971-bdd1-4434-9432-9c7d1e524228
                Copyright © 2015, Macmillan Publishers Limited

                This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/

                History
                : 20 March 2015
                : 14 September 2015
                Categories
                Article

                Uncategorized
                Uncategorized

                Comments

                Comment on this article