Introduction Despite advances in screening, diagnosis, and treatment, colorectal cancer (CRC) is the third most common cancer and the fourth-leading cause of cancer death worldwide . Pathological staging is the only prognostic classification used in clinical practice to select patients for adjuvant chemotherapy . However, pathological staging fails to predict recurrence accurately in many patients undergoing curative surgery for localized CRC. In fact, 10%–20% of patients with stage II CRC, and 30%–40% of those with stage III CRC, develop recurrence. Among the molecular markers that have been extensively investigated for colon cancer (CC) characterization and prognosis, microsatellite instability (MSI), caused by defective function of the DNA mismatch repair (MMR) system, is the only marker that was reproducibly found to be a significant prognostic factor in early CRC in both a meta-analysis and a prospective trial ,. Many studies have exploited microarray technology to investigate gene expression profiles (GEPs) in CRC in recent years, but no established signature has been found that is useful for clinical practice, especially for predicting prognosis –. GEP studies on CRC have been only poorly reproducible, possibly because CRC is composed of distinct molecular entities that may develop through multiple pathways on the basis of different molecular features –. As a consequence, there may be several prognostic signatures for CRC, each corresponding to a different entity. Indeed, GEP studies that include unsupervised hierarchical clustering, and integrated genetic/epigenetic analysis—including the more recent classification based on high-throughput methylome data —have identified at least three distinct molecular subtypes of CC ,–. Therefore, CC should no longer be considered as a homogeneous entity. However, the molecular classification of CC currently used, which is based on a few common DNA markers (MSI, CpG island methylator phenotype [CIMP], chromosomal instability [CIN], and BRAF and KRAS mutations) –, needs to be refined, and a standard and reproducible molecular classification is still not available. In this study, we exploited a large, multicenter, and extensively characterized series of CC samples to establish a robust molecular classification based on genome-wide mRNA expression analysis. Then we assessed the associations between molecular subtypes and clinicopathological factors, common DNA alterations, and prognosis. To confirm the robustness of the subtypes obtained, we further validated our molecular classification in a large independent set. Methods Ethics Committee Approval The use of the tumor collection was approved by the following ethics committees and institutional boards: lle de France II (2008-135; AFSSAP 2008-A01058-47), Marseille (PHRC2005, COS-IPC of 27 September 2007), Strasbourg (Comité Consultatif de Protection des Personnes dans la Recherche Biomedicale d'Alsace, 2004-63 and CPP-EST4 [DC-2009-1016 and AC-2008-438]), the Human Research Ethics Committee of Saint-Antoine Hospital (INCa; TUM0203—project 2010-1-RT-02), the Toulouse Hospital board (CRB–Cancer Toulouse, DC-2008-463, AC-2008-820, CPP2), and Nice (PHRC1997, CHUNice-948). The informed consent of the patients was recorded as required by a French law in force until 2007. Since the last inclusion in this study was 2007, the standard hospital blanket consent was considered sufficient. Patients The French national Cartes d'Identité des Tumeurs (CIT) program involves a multicenter cohort of 750 patients with stage I to IV CC who underwent surgery between 1987 and 2007 in seven centers. Fresh-frozen primary tumor tissue samples were retrospectively collected at the Institut Gustave Roussy (Villejuif), the Hôpital Saint Antoine (Paris), the Hôpital Européen Georges Pompidou (Paris), the Hôpital de Hautepierre (Strasbourg), the Hôpital Purpan (Toulouse), and the Institut Paoli-Calmettes (Marseille), and prospectively collected at the Centre Antoine Lacassagne (Nice). Patients who received preoperative chemotherapy and/or radiation therapy and those with primary rectal cancer were excluded from this study. Clinical and pathologic data were extracted from the medical records and centrally reviewed for the purpose of this study. Patients were staged according to the American Joint Committee on Cancer tumor node metastasis (TNM) staging system  and monitored for relapse (distant and/or locoregional recurrence; median follow-up of 51.5 mo). Patient and tumor characteristics are summarized in Table 1 and detailed in Table S1. 10.1371/journal.pmed.1001453.t001 Table 1 Patient and tumor characteristics of the different sets. Characteristics CIT Cohort Patients (n = 750) CIT Discovery Dataset (n = 443) Validation Datasets CIT Cohort p-Value All Cohorts p-Value CIT (n = 123) Public (n = 906) Mean age (sd, range), years 67 (14, 19–97) 67 (14, 22–97) 68 (12, 42–90) 68 (13, 23–95) 0.21 0.25 Sex (male/female) (percent) 429/321 (57/43) 237/206 (53/47) 73/50 (59/41) 347/330 (51/49) 0.24 0.24 TNM stage (percent) I 52 (7) 27 (6) 10 (8) 48 (11) 0.058 A (p.V600E) mutation was assessed by allelic discrimination using TaqMan probes and the same protocol as that for KRAS mutations. TP53 mutations (exons 4–9) were assessed as previously described . MSI was analyzed using a panel of five different microsatellite loci from the Bethesda reference panel . MSI-high tumors were further classified as deficient MMR (dMMR), and both MSI-low and MSS tumors as proficient MMR (pMMR). CIMP status was determined using a panel of five markers (CACNA1G, IGF2, NEUROG1, RUNX3, and SOCS1) as previously described . Experimental procedures are detailed in Text S1. Common DNA alterations are summarized in Table 1 and detailed in Table S1. Gene Expression Analysis The GEP of 566 primary CC samples were determined on Affymetrix U133 Plus 2.0 chips. For 19 patients, adjacent non-tumor tissue (normal tissue [NT]) was also available and was tested. The methods used for RNA purification, quality control, fluorescent probe production, hybridization, and raw data processing were as previously described . Each dataset was normalized independently in batches using the robust multi-array average method implemented in the R package affy . For the CIT dataset, residual technical batch effects were corrected using the ComBat method implemented in the SVA R package . Data are available via the NCBI Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/; accession number GSE39582). Array-Based Comparative Genomic Hybridization Analysis A total of 464 of the 750 primary CC samples from the CIT cohort could be analyzed for array-based comparative genomic hybridization (CGH) on a BAC array containing 4,434 bacterial artificial chromosome clones with a median resolution of 0.6 Mb. DNA labeling, hybridization, and data processing were as previously described . CIN was defined from CGH profiles: samples with at least 20% gain or loss of whole chromosomes or fractions of chromosomes were scored as CIN+ (see Text S1 for details). Unsupervised Subtype Discovery Based on Gene Expression Analysis Unsupervised classification of the discovery set was performed using hierarchical clustering (Ward linkage and 1 − Pearson correlation coefficient distance used) on the most variant class of probe sets (n = 1,459). To obtain a robust classification, we used a consensus unsupervised approach  implemented in the R package ConsensusClusterPlus. The consensus clusters were obtained from 1,000 resampling iterations of the hierarchical clustering, by randomly selecting a fraction of the samples and of the most variant probe sets (90%). The optimal number of clusters was selected according to the approach criteria detailed in Text S1. Validation Set Subtype Assignment Validation datasets were independently assigned to GEP subtypes according to a standard distance-to-centroid approach . A centroid-based predictor was built by a 10-fold cross-validation approach, resulting in the selection of the five top up-regulated and five top down-regulated genes specific to each subtype, yielding 57 genes (three genes were shared by two subtypes). The approach was implemented in the R package citccmst, and is detailed in Text S1. Molecular Subtype Characterization The Chi-squared test and logistic regression were used to study associations between anatomoclinical features, common DNA alterations, and subtypes. Each molecular subtype was further characterized according to (i) GEP of NT counterparts from our dataset; (ii) previously published supervised signatures based on intestinal stem cell phenotype ,, BRAF mutation , and serrated CRC phenotype , as described in Text S1; (iii) cancer-relevant signaling pathways retrieved from the Kyoto Encyclopedia of Genes and Genomes (see Text S1); and (iv) CGH alteration frequencies. Recurrence Risk Group Assignment according to Other Molecular Predictors The ColoPrint and Oncotype DX prognostic classifiers , were adapted and applied to our overall datasets as described in Text S1. Survival Analysis Survival analysis was intentionally restricted to the subgroup of patients with stage II–III tumors because reliable prognostic biomarkers are most needed for these patients. Indeed, most stage I patients will not derive benefit from adjuvant chemotherapy because of their excellent prognosis after curative surgery, and most stage IV patients, already metastatic, will die from their disease and therefore should be analyzed independently for progression-free survival. RFS was defined as the time from surgery to the first recurrence and was censored at 5 y. Survival was analyzed according to the Kaplan-Meier method, and differences between survival distributions were assessed with the log-rank test. Univariate and multivariate models were computed using Cox proportional-hazards regression (R package survival) (see Text S1 for details). Results Unsupervised Analysis of Gene Expression Profiles Revealed Six Subtypes of Colon Cancer Consensus unsupervised analysis of the GEP data from the 443 samples of the discovery set revealed six clusters of samples based on the most variant probe sets (n = 1,459): C1 (n = 95, 21%), C2 (n = 83, 19%), C3 (n = 56, 13%), C4 (n = 46, 10%), C5 (n = 118, 27%), and C6 (n = 45, 10%) (Figure 1; Table S2). The consensus matrix showed that C2, C3, C4, and C6 appeared as well-individualized clusters, whereas there was more classification overlap between C1 and C5 (Figure 1A). Based on cluster expression centroid classification and the gene expression heatmap (Figure 1B and 1C), cluster C4 appeared to be the most distinct. The other clusters subdivided into C2 and C3 on one side of the cluster expression centroid classification (Figure 1B), and C6, C5, and C1 on the other. The GEPs of C1 and C5 showed overlap but displayed slightly distinct gene deregulations. This was confirmed in the supervised selection of the cluster-discriminant probe sets shown in the gene expression heatmap in Figure S2 and detailed in Table S3. 10.1371/journal.pmed.1001453.g001 Figure 1 Unsupervised gene expression analysis of the discovery set of 443 colon cancers. (A) Consensus matrix heatmap defining six clusters of samples for which consensus values range from 0 (in white, samples never clustered together) to 1 (dark blue, samples always clustered together). (B) Distance between clusters according to the hierarchical clustering of the 1,459 probe sets based on the centroids of each cluster. (C) GEP heatmap of the 1,459 probe sets ordered by subtype, with annotations associated with each subtype. Clinical and Molecular Relevance of Colon Cancer Subtypes Associations with anatomoclinical and DNA alterations data are shown in Figures 1C and S3A and in Table S4. Tumors classified as C1, C5, and C6 were more frequently CIN+, CIMP−, TP53-mutant, and distal (p 0.5, yielding 1,108 discriminant probe sets. (PDF) Click here for additional data file. Figure S3 Associations between molecular subtypes and anatomoclinical characteristics, DNA alterations, and supervised signature annotations in the discovery and validation sets. Associations were assessed in (A) the Affymetrix discovery set, (B) the Affymetrix validation set, and (C) the TCGA, non-Affymetrix, validation set. For each subtype and variable, the proportion of each modality is represented (dark grey: “positive/true/yes” proportion; white: “negative/false/no” proportion; grey: “data not available” proportion), and the percent of the main feature (dark grey) within each subtype is indicated. The Chi-squared test p-values are indicated in red. (PDF) Click here for additional data file. Figure S4 Subtype genomic alteration profiles along the genome. CC molecular subtypes present different copy-number-change profiles. The profiles were established using genome-wide array-based CGH available for 356 samples. (A) Frequencies of gains (frequency>0) and losses (frequency 20%/CIN≤20%; F/M, female/male; WT/M, wild-type/mutant. (XLS) Click here for additional data file. Table S2 List of the 1,459 most variant probe sets used to perform unsupervised analysis. The GeneCluster column corresponds to the gene cluster letters in Figure 1; logFC_CjvsCx corresponds to the gene expression log2 fold changes of subtype Cj versus the other subtypes; adjpv.CjvsCx corresponds to the adjusted p-values of the moderated t-test comparing Cj versus the other subtypes. (XLS) Click here for additional data file. Table S3 List of the 1,108 subtype-discriminant probe sets. For each subtype, discriminant probe sets were selected from the discovery set using a moderated t-test, comparing a given subtype to the other subtypes (adjusted p 0.5), yielding 1,108 discriminant probe sets. The GeneCluster column corresponds to the gene cluster letters in Figure S2; logFC_CjvsCx corresponds to the gene expression log2 fold changes of subtype Cj versus the other subtypes; adjpv.CjvsCx corresponds to the adjusted p-values of the moderated t-test comparing Cj versus the other subtypes. (XLS) Click here for additional data file. Table S4 Associations of anatomoclinical characteristics, DNA alterations, and supervised signature annotations with the six subtypes based on logistic regression. Associations were assessed by logistic regression using a multinomial logit model (function mlogit, R package mlogit). (XLS) Click here for additional data file. Table S5 List of the 57 genes used to assign subtypes. The 57 genes selected to build the subtype predictor, given with each subtype's centroid values. (XLS) Click here for additional data file. Table S6 Univariate and multivariate Cox analyses including the classification and clinical annotations. Associations of the classification and clinical annotations with RFS were assessed by Cox proportional-hazards regression analyses on (A) the discovery set, (B) the validation set, and (C) both sets. Univariate Cox analyses were performed on each variable independently. The best multivariate model was determined by using a backward–forward selection approach to restrict the multivariate model to the most informative variables for the subset of samples for which all the variables were available. Value indicates the modality of the annotation associated with the HR. H.R., Cox HR. (XLS) Click here for additional data file. Table S7 Univariate and multivariate Cox analyses including the classification and other molecular annotations. Associations with RFS of the six-subtype classification—including BRAF, KRAS, and TP53 mutations, MMR status, and CIMP status—were assessed by Cox proportional-hazards regression analyses on the discovery set. Univariate Cox analyses were performed on each variable independently. The best multivariate model was determined by using a backward–forward approach to restrict the multivariate model to the most informative variables for the subset of samples for which all the variables were available. The TP53 mutation variable was excluded from the multivariate analysis, as only 201 samples were characterized and as it was not significantly associated to outcome. Value indicates the modality of the annotation associated with the HR. H.R., Cox HR. (XLS) Click here for additional data file. Text S1 Supplementary methods. (PDF) Click here for additional data file.