Introduction Breast cancer is the most common non-skin cancer among American women. The American Cancer Society's estimates indicate approximately 1.3 million new cases of invasive breast cancer were diagnosed globally in 2007; and nearly 500,000 women died from the disease [1]. Currently, there are over 2.5 million breast cancer survivors in the US, and an estimated $8.1 billion dollars is spent each year on treatment of breast cancer [2]. The principal prognostic indicator currently in clinical use for breast cancer is the tumor-node-metastasis (TNM) stage [3], [4]. Morphological attributes of malignant tumors that influence disease prognosis are the size of the primary tumor (T), presence and extent of regional lymph node involvement (N) and presence of distant metastases (M). Molecular attributes of tumors are also considered in clinical decision-making; loss of hormone receptor expression [5] and increased expression of ERBB2 [6] have each been associated with poor prognosis. Although numerous recent studies have demonstrated that alterations of DNA methylation in breast cancers are common and may be important etiologic and prognostic markers [7]-[14], large gaps in our knowledge remain. There is a notable lack of studies examining tumor DNA methylation in relation to breast cancer risk factors such as diet or reproductive factors in conjunction with other important tumor markers. Patient exposures such as alcohol and folate intake have potentially strong mechanistic links to epigenetic dysregulation [15]. In addition, recent work in-vitro and in animal models suggest that long term exposure to estrogen may lead to epigenetic effects and altered profiles of DNA methylation [16], [17]. To explore associations of tumor methylation with important tumor and patient characteristics, we analyzed tumors from breast cancer patients in the Kaiser Permanente Division of Research Pathways Study using a large scale methylation array. Results Unsupervised clustering and locus-by-locus analysis Table 1 shows the patient demographic, hormonal, dietary and tumor characteristics for the 162 women overall (and stratified by menopausal status in Table S1). Results of unsupervised hierarchical clustering of the 750 most variable CpG loci indicate the epigenetic heterogeneity of these tumors (Figure 1). 10.1371/journal.pgen.1001043.g001 Figure 1 Unsupervised clustering heatmap of CpG methylation in breast carcinomas. Unsupervised hierarchical clustering heat map based on Manhattan distance and average linkage of the 750 autosomal CpG loci with the highest variance. Samples are in rows (n = 162), and CpG loci are in columns. Blue indicates methylated and yellow indicates unmethylated. 10.1371/journal.pgen.1001043.t001 Table 1 Patient demographic, hormonal, dietary, and tumor characteristics. All cases Covariate n = 162 missing Age Range (median) 30–91 1 Median 59 Mean (sd) 59.2 (11.6) Race Caucasian, n (%) 117 (72.7) 1 African American 13 (8.1) Hispanic 10 (6.2) Asian 10 (6.2) Other 11 (6.8) Alcohol (g/day) Range (median) 0–83.2 (1.8) 8 Mean (sd) 9.1 (15.4) Dietary folate (ug/day) Range (median) 43.5–1610 (427) 7 Mean (sd) 459 (213) Body mass index Range (median) 18.5–56.1 (27.4) 2 Mean (sd) 29.0 (6.7) Parity Nulliparous, n (%) 30 (18.8) 2 1–2 children 77 (48.1) 3–4 children 45 (28.1) 5+ children 8 (5.0) Histology Ductal, n (%) 94 (59.1) 3 Lobular 56 (35.2) Adenocarcinoma 9 (5.7) Estrogen receptor Positive, n (%) 141 (87.6) 1 Negative 20 (12.4) Progesterone receptor Positive, n (%) 119 (73.4) 1 Negative 42 (26.6) ERBB2 Negative, n (%) 134 (85.9) 3 Positive 22 (14.1) Triple negative No, n (%) 145 (91.2) 3 Yes 14 (8.8) AJCC stage I, n (%) 95 (59.0) 1 II 47 (29.2) III 15 (9.3) IV 4 (2.5) Tumor Grade Well differentiated, n (%) 48 (30.0) Moderately differentiated 79 (49.4) 2 Poorly differentiated 32 (20.0) Undifferentiated 1 (0.6) Lymph node status Positive, n (%) 48 (30.8) 6 Negative 108 (69.2) Tumor Size (mm) Range (median) 0–135 (14.0) 0 Mean (sd) 17.4 (15.0) In array-wide locus-by-locus analysis the strongest associations of methylation of individual loci (Q-values 0.05 (n = 8, 0.5%), were eliminated from analysis. All CpG loci on the X chromosome were excluded from analysis. The final dataset contained 1413 CpG loci associated with 773 genes. Unsupervised clustering Subsequent analyses were carried out using the R software [41]. For exploratory and visualization purposes, hierarchical clustering was performed using R function hclust with Manhattan metric and average linkage. To discern and describe the relationships between CpG methylation and patient and tumor covariates a modified model-based form of unsupervised clustering known as recursively partitioned mixture modeling (RPMM) was used as described in [18] and as used in [42]. Permutation tests (running 10,000 permutations) were used to test for association with methylation class by generating a distribution of the test statistic for the null distribution for comparison to the observed distribution. For continuous variables, the permutation test was run with the Kruskal-Wallis test statistic. For categorical variables we used the standard chi-square statistic for testing association between two categorical variables. Locus-by-locus analysis Associations between covariates and methylation at individual CpG loci were tested with a generalized linear model. The β-distribution of average β values was accounted for with a quasi-binomial logit link with an estimated scale parameter constraining the mean between 0 and 1, in a manner similar to that described by Hsuing et al. [43]. Array-wide scanning for CpG loci associations with sample type or covariate used false discovery rate estimation and Q-values computed by the qvalue package in R [44]. Multinomial logistic regression Multinomial logistic regression was used to model methylation class while controlling for potential confounders. Referent class selection does not affect the underlying interpretation of the model and as class three was neither the largest, nor the smallest class, and was relatively hypomethylated it was chosen as the referent class. Because of the potentially large number of methylation classes, logistic regression coefficients were regularized using a ridge (L2) penalty, with coefficients for a common (non-intercept) covariate across outcome levels shrunk toward zero [34] the tuning parameter was selected by minimizing Bayesian information criterion. Sequenom EpiTYPER methylation and RT–PCR Spearman correlation coefficients and test P-values are reported for correlation between array and Sequenom methylation values. Tests for association between methylation and mRNA expression used relative mRNA expression versus array methylation average β stratified into two groups around 0.5 with the Kruskal-Wallis test statistic. Supporting Information Figure S1 GSTM2 expression is significantly reduced in tumors with GSTM2 methylation. Relative mRNA expression values for GSTM2 are plotted versus array methylation values for two CpGs significantly associated with tumor grade stratified at 0.5. Each box top and bottom edge represents the third and first quartile expression values respectively; box center line represents the median relative expression value. (A) A CpG 153 bases 3′ of the GSTM2 transcription start site has significantly reduced mRNA expression when methylated (P<0.001). (B) A promoter-based CpG 109 bases 5′ of the GSTM2 transcription start site has significantly reduced mRNA expression when methylated (P<0.03). (0.05 MB DOC) Click here for additional data file. Figure S2 Recursively partitioned mixture model of CpG methylation in breast tumors from post menopausal patients. The figure depicts the results of RPMM. Columns represent CpG sites and rows represent methylation classes. The height of each row is proportional to the number of observations residing in the class (total n = 117). Blue indicates methylated and yellow indicates unmethylated. The color of the columns within each class represents the average methylation of the CpG for that class. (0.06 MB DOC) Click here for additional data file. Table S1 Patient demographic, hormonal, dietary, and tumor characteristics stratified by menopausal status. (0.04 MB XLS) Click here for additional data file. Table S2 Summary of results from locus-by-locus CpG methylation versus covariates. (0.03 MB XLS) Click here for additional data file. Table S3 Tumor size, grade, hormone receptor, and ERBB2 status are highly associated with CpG methylation in breast carcinoma (n = 162). (0.12 MB XLS) Click here for additional data file. Table S4 Multinomial logistic regression of methylation class membership (Class 3 is referent) by patient demographic and tumor characteristics. (0.04 MB XLS) Click here for additional data file. Table S5 DNA methylation profiles of post-menopausal patients' breast tumors are significantly associated with estrogen receptor status. (0.03 MB XLS) Click here for additional data file.