Introduction Cancers carry somatic mutations. A small proportion are “drivers” that confer clonal advantage, are causally implicated in oncogenesis, and have been positively selected during the evolution of the cancer (Stratton, 2011; Stratton et al., 2009). Driver mutations occur in the subset of genes known as cancer genes. Through systematic sequencing of cancer genomes, considerable advances have recently been made in the identification of cancer genes, providing insights into mechanisms of neoplastic transformation and targets for therapeutic intervention (Stratton, 2011; Stratton et al., 2009). We have relatively limited understanding, however, of the DNA damage and repair processes that have been operative during the lifetime of the patient and that are responsible for the somatic mutations that underlie the development of all cancers in the first place. Historically, analysis of mutation patterns to investigate underlying DNA damage and repair processes in human cancers has predominantly been restricted to reporter cancer genes, notably TP53. These studies have revealed that mutation patterns can be related to carcinogen exposures and DNA repair processes. For example, G>T/C>A transversions predominate in smoking-associated lung cancer, a pattern compatible with DNA damage induced by tobacco carcinogens such as benzo[a]pyrene diolepoxide (Pfeifer et al., 2002). These mutations are enriched at CpG dinucleotides and exhibit a transcriptional strand bias reflecting past activity of transcription-coupled nucleotide excision repair (TCR) on bulky adducts of guanine caused by tobacco carcinogens (Hainaut and Pfeifer, 2001). Similarly, in UV-light-associated skin cancers, C>T and CC>TT transitions are common. These occur at dipyrimidines, reflecting the formation of pyrimidine dimers following exposure of DNA to UV light (Pfeifer et al., 2005) and also show transcriptional strand bias due to the action of TCR on pyrimidine dimers. Further examples of exogenous exposures leading to distinctive mutational patterns include G>T transversions in aflatoxin B1-associated hepatocellular carcinomas (Macé et al., 1997) and A>T transversions in urothelial tumors from patients exposed to aristolochic acid (Nedelko et al., 2009). Although these studies have been highly informative, they have limitations. Because they depend upon driver mutations, effects of selection have been superimposed upon mutational patterns generated by DNA damage and repair processes. Moreover, only a single mutation from each cancer sample is usually incorporated into each data set. Thus, they have been well placed to report strong exposures and dominant repair processes operative across most cases of a particular tumor type. However, where there is heterogeneity of mutational process in a cancer class, a composite of different processes will be reported. The completion of the human genome sequence and the advent of second-generation sequencing technology have overcome historic limitations of scale, permitting sequencing of whole cancer genomes (Berger et al., 2011; Chapman et al., 2011; Ding et al., 2010; Lee et al., 2010; Ley et al., 2008; Mardis et al., 2009; Shah et al., 2009; Tao et al., 2011), and generation of comprehensive catalogs of somatic mutation (Pleasance et al., 2010a; Pleasance et al., 2010b). Most somatic mutations in cancers are thought to be “passenger” events that do not contribute to cancer development. These bystanders bear the imprints of the DNA damage and repair processes operative during the development of the cancer, unmodified by selection. The several hundreds to tens of thousands of somatic mutations in each cancer, therefore, potentially allow much greater resolution of mutational patterns and insights into underlying mutational processes. Recent analyses of comprehensive mutational catalogs from a malignant melanoma and a lung cancer illustrate the power of this approach (Pleasance et al., 2010a; Pleasance et al., 2010b). They revealed the characteristic mutational patterns of UV light and tobacco carcinogens respectively (see above) and provided strong evidence for the past activity of TCR. In addition, G>T mutations in the lung cancer showed a preference for CpG dinucleotides outside CpG islands, suggesting a role for methylated cytosine because CpG islands are usually unmethylated. Conversely, G>C mutations at CpG dinucleotides were more prevalent within CpG islands suggesting that the mutagen(s) underlying these mutations preferentially acted on unmethylated DNA (Pleasance et al., 2010b). In the melanoma, at least one additional mutational process characterized by G>T changes and independent of UV light exposure was shown to have been operative. In both cancers, mutations were more common in poorly expressed than in highly expressed genes, both on the transcribed and untranscribed strands. Indeed, mutations were also found to be more prevalent at the 3′ ends of genes than at the 5′ ends (Pleasance et al., 2010a). The mechanisms underlying these expression-related phenomena are unknown. Compared to melanoma and lung cancer, the mutational processes underlying other cancer types are poorly understood. Therefore, in this study, we document essentially the full repertoire of somatic mutations of 21 breast cancers to investigate the mutational mechanisms shaping these cancer genomes. Results Sequencing of Breast Cancers We sequenced the complete genomes of 21 primary breast cancers and matched normal DNAs from the same individuals. Cancers were typed for estrogen receptors (ER), progesterone receptors (PR), and HER2, and included nine cases with germline mutations in the breast cancer predisposition genes BRCA1 (five) and BRCA2 (four) (Table S1A, available online). Cancer and normal DNAs were sequenced to > 30-fold coverage and analyzed to identify somatic base substitutions, insertions, and deletions (indels); rearrangements; and copy number changes. PD4120a was sequenced to ∼188-fold depth to investigate temporal and clonal evolution (Nik-Zainal et al., 2012). Using orthogonal sequencing technologies we estimated the specificity of substitution-calling to be ∼92.1% (Table S1A). All substitutions were therefore included in the analyses. For indels and rearrangements only confirmed variants were included (Table S1B). From 17 of the 21 cases mRNA expression data were also obtained. The Catalogue of Somatic Mutations from 21 Breast Cancer Genomes A total of 183,916 somatically acquired base substitutions were identified (see Table S1B for hyperlinks). In protein coding regions, there were 1,372 missense, 117 nonsense, 2 stop-lost, 37 essential splice-site, and 521 silent mutations. Of the 2,869 indels identified, 2,233 were deletions, 544 insertions and 92 complex. There were 21 coding indels, of which 15 were predicted to result in a translational frameshift and six were in-frame. In addition, 1,192 structural variants (rearrangements), 16 homozygous deletions, and 14 regions of increased copy number (amplifications) were identified (Table S1C). Likely driver substitutions and indels in cancer genes were found in TP53, GATA3, PIK3CA, MAP2K4, SMAD4, MLL2, MLL3, and NCOR1 (Table S1C). Amplification was observed over cancer genes previously implicated in breast cancer development including ERBB2, CCND1, MYC, MDM2, ZNF217, and ZNF703 and a homozygous deletion involving MAP2K4 was identified. All tumors derived from BRCA1 or BRCA2 germline mutation carriers showed loss of wild-type haplotypes at 17q21 or 13q12, respectively, as expected of recessive cancer genes (Table S1B). Extracting Mutation Signatures from Catalogues of Somatic Mutation The set of somatic mutations in a cancer genome is the aggregate outcome of one or more mutational processes. Each process leaves a mutation signature on the cancer genome defined by the mechanisms of DNA damage and repair that constitute it. The final catalog of mutations is determined by the strength and duration of exposure to each mutational process. We set out to extract the mutation signatures characterizing the mutational processes operative in the 21 breast cancers studied. There was substantial variation between the cancers in the numbers and relative contributions of each of the six classes of base substitution (C>A, C>G, C>T, T>A, T>C, and T>G) (Figure 1A). To provide greater depth of insight into the operative mutational processes, we incorporated the sequence context in which mutations occurred, by considering the bases immediately 5′ and 3′ to each mutated base. Because there are six classes of base substitution and 16 possible sequence contexts for each mutated base there are 96 possible mutated trinucleotides. We have represented the fraction of mutations at each of the 96 mutated trinucleotides as a heat map for each cancer and normalized it according to the prevalence of each trinucleotide in the genome. The display therefore highlights mutational signatures generated by processes that favor particular classes of mutation and/or particular sequence contexts in which mutations occur (Figure 1C). Visual inspection of the 21 heatmaps provided evidence for multiple independent mutational signatures and indicated that, in most cancers, more than one process had been operative. For example, overrepresentation compared to chance of C>T substitutions at XpCpG triplets (C is the mutated base and X is any base) was observed in all cancers, albeit to different extents. The elevated C>T mutation rate at XpCpG trinucleotides is a well-recognized mutational mechanism probably due to deamination to thymine of methylated cytosines, which are usually at XpCpGs (Waters and Swann, 2000). The role of methylated cytosine is supported in our data by the higher frequency of C>T transitions at XpCpG triplets outside CpG islands (where most XpCpGs are methylated) than inside (where most are unmethylated) (OR 9.95; 95% CI 7.17–13.8; p T mutation rate (for example, see PD3905a). Although the normalized heatmap representation emphasizes the ubiquitous elevation of the C>T mutation rate at XpCpG trinucleotides, the absolute number of these mutations is relatively modest because of the general depletion of XpCpGs from the human genome due to the activity of the same, or a similar, mutational process in the germline over evolutionary time. There was also an overrepresentation of C>T, C>G and C>A mutations at TpCpX trinucleotides in many breast cancers and very pronounced in some (notably, PD4199a and P4120a). In addition to the high proportion of T immediately 5′ to the mutated cytosine, the base immediately 3′ to the mutated C also appears to influence this mutational signature with greater overrepresentation of TpCpA, TpCpT and TpCpG than of TpCpC. We have previously reported this signature in breast and other cancers (Greenman et al., 2007; Stephens et al., 2005, 2012). Application of Mathematical Approaches to Extract Mutation Signatures Although major features of some mutational signatures can be discerned by visual inspection, a formal mathematical approach is required to extract subtler elements and weaker signatures and to assess the contribution of each mutational process to the mutation set in each cancer. We employed a nonnegative matrix factorization (NMF) and model selection approach (Berry et al., 2007) to extract mutational signatures from the 21 cases. NMF extracts interpretable features from complex multidimensional data (Berry et al., 2007; Lee and Seung, 1999). For example, application to images of faces yields familiar components such as eyes, nose, and mouth (Lee and Seung, 1999). Our desire to extract biologically meaningful mutational signatures, as well as the intrinsic nonnegativity of the mutation spectrum data, renders NMF an appropriate choice for factorizing the data from the 21 cases. Evaluation of NMF decompositions (Berry et al., 2007) (Extended Experimental Procedures and Figures S1A–S1C) suggested that a best estimate of five biologically distinct mutational signatures were present in the 21 cancers (named A–E, Figure 2A). Each signature was characterized by a different profile of the 96 potential trinucleotide mutations and contributed to a different extent to each of the 21 cancers. Different combinations of the five signatures account for the variation in the 21 mutational catalogs (Figure 1D). Signature A was characterized by C>T mutations at XpCpG trinucleotides but included other mutation classes making smaller contributions (Figure 2A). Signature B was composed predominantly of C>T, C>G and C>A mutations at TpCpX trinucleotides. The dominant features of these two mutational signatures were previously noted by scrutiny of the heatmaps (see above). However, for each process NMF provided greater insight into the relative contributions of all mutation classes. NMF provided evidence for three additional mutational processes. Signature C and Signature D both exhibited a relatively even distribution of mutations across the 96 trinucleotides. However, there were subtle differences. Signature C was moderately enriched for C>T, C>G and, to a lesser extent, C>A mutations at XpCpG trinucleotides, whereas Signature D was not (Figures 2A and S1D). In hindsight, an enrichment of C>G and C>A mutations at XpCpG trinucleotides can be discerned in some cancers in the heatmap (Figure 1C). Moreover, the strength of this enrichment does not appear to be well correlated with enrichment of C>T mutations at XpCpG trinucleotides, suggesting that they are due to different processes, providing the rationale for NMF to separate Signature C from Signature A (compare, for example, PD4006a and PD3945a in Figure 1C). In Signature E the dominant feature was C>G mutations at TpCpX trinucleotides. Signature E is therefore similar to Signature B, but lacks the C>T mutations at TpCpX trinucleotides characteristic of Signature B. NMF will extract mutational patterns due to sequencing artifacts. We noted a signature characterized by T>G mutations at GpTpX trinucleotides. This was not, however, reproduced in somatic mutations verified by using another sequencing technology and transpired to be due to aberrant sequence phasing at Ts following runs of Gs in the genome. NMF can estimate the contribution of each mutational signature to the mutational catalog of each breast cancer. The results indicate that multiple mutational processes contribute to most cancers, although in some cases one process has been dominant (Figure 1D). No correlation was found between the presence of a particular somatically mutated gene and any of these processes. We then performed unsupervised hierarchical clustering by using the relative contributions of each of the five signatures to each of the 21 mutational catalogs as features. Interestingly, all nine breast cancers with BRCA1 or BRCA2 mutations clustered together (Figure 2B). This was predominantly due to a relatively substantial contribution by Signature D and a relative deficiency of Signature A in these cancers. Mutational Processes Generating Double-Nucleotide Substitutions By performing Monte Carlo simulations we showed that the number of double-nucleotide substitutions (for example, CpC>ApT) in each of the 21 cancer genomes was 75–11,000 fold higher than expected if single-nucleotide substitutions had been randomly distributed (p A substitutions as components of double-nucleotide substitutions (Figure 1B) with the consequent emergence of CpC>ApA as the most common class of double-nucleotide substitution (Table S1E). Double-nucleotide substitutions were distributed throughout the cancer genomes without obvious evidence of further clustering. Regional Hypermutation, Kataegis, Is Common in Breast Cancer We investigated the possibility of regional clustering of substitution mutations by constructing “rainfall plots” in which the intermutation distance, the distance between each somatic substitution, and the substitution immediately before it has been plotted for each mutation (Figures 3A–3C and S2). Mutation clustering was commonly observed in the 21 breast cancers. The mutational patterns within these clusters are outlined below by using two cases as examples. The largest regional cluster of mutations was found in a breast cancer with a germline mutation in BRCA1, PD4107a, which showed a markedly elevated mutation prevalence over a 14 Mb region on chromosome 6 (Figure 3A). The mutations within this cluster exhibited several remarkable features. First, there was evidence of further clustering within the 14 Mb region (the “macrocluster”), with heavily mutated stretches of genome of a few hundred base pairs (“microclusters”) sometimes separated by tens of kilobases without mutations (Figure 4A). Second, substitutions within the region were characterized by a distinctive mutational spectrum and sequence context (Figure 3D). Most were C>T transitions at TpCpX trinucleotides. Third, examination of individual sequence reads, which derive from individual DNA molecules, showed that most mutations within microclusters occurred in cis with respect to each other, i.e., were on the same parental chromosome (Figure 4A). Fourth, mutations were generally of the same type for long genomic distances and then could switch to a different class. For example, in PD4107a mutations were almost exclusively C>T (on the plus chromosomal strand) for several megabases and then switched to G>A (Figure 4B). This suggests that groups of mutations may be generated on just one of the two DNA strands, perhaps simultaneously or in a processive manner over a short time span. Fifth, the cluster of mutations on chromosome 6 colocalizes with a cluster of somatic genomic rearrangements (Figure 4A). Within the hypermutated 14 Mb region, there were 17 genomic rearrangements but only seven in the remaining 157 Mb of chromosome 6. Most of these rearrangements were between different locations within the chromosome 6 14 Mb region. Despite the clear positional correlation between rearrangements and hypermutation, at higher resolution mutation microclusters were usually separated from the nearest rearrangement by many kilobases. Finally, in PD4107a a much smaller mutation cluster with similar mutational characteristics and associated with genomic rearrangements was observed on chromosome 12. An ER-positive breast cancer, PD4103a, also exhibited localized hypermutation, but the pattern of mutation clustering differed slightly from that in PD4107a (Figure 3B). There were multiple mutation clusters involving chromosomes 3, 4, 8, 10, 11, 12, 20, and 21 each of which spanned shorter distances than the major cluster in PD4107a. The clustered substitutions in PD4103a included C>T transitions at TpCpX dinucleotides, similar to PD4107a, but in addition a greater proportion of C>G mutations at TpCpX trinucleotides. In other respects, notably the mutations being in cis and showing a processive pattern, there were many similarities (Figures 5A–5B). Mutation clusters in PD4103a were also closely associated with somatic genomic rearrangements that were all linked together by a web of interchromosomal rearrangements (Figure 5C). We have termed the presence of regional mutation clusters kataegis (from Greek for shower or thunderstorm). Rainfall plots revealed varying extents of kataegis in 13/21 breast cancers (PD4199a, PD4192a, PD4198a, PD4248a, PD4109a, PD4116a, PD3904a, PD3945a, PD4005a, PD4006a, PD4103a, PD4120a, and PD4107a, see Figure S2), encompassing all subclasses of the disease. In each case, the features were similar to those outlined for PD4107a and PD4103a. In some instances, kataegis was associated with rearrangements that had features of chromothripsis (Stephens et al., 2011) (Figure 4D), but it also colocalized with other rearrangement architectures. Previously published mutation catalogs from a malignant melanoma and small cell lung cancer did not show kataegis (Pleasance et al., 2010a, 2010b) (data not shown). The pattern of C>T and C>G mutations at TpCpX trinucleotides in kataegis is similar to that of mutational Signature B, and to a lesser extent, Signature E (Figure 2A). Yet, in many cancers with kataegis, Signatures B and E make only a small contribution to the genome-wide mutation catalog. Conversely, Signature B dominates PD4199a (Figures 1A, 1C, and 1D) despite relatively limited kataegis (Figure S2). Therefore, intriguingly, a globally distributed and a localized form of the mutational processes underlying these signatures may exist, and the two forms may operate independently of each other. The Relationship between Substitution Mutations and Transcription We next examined the relationship between transcription and prevalence of somatic substitutions. First, we searched for differences in the prevalence of mutations on the transcribed and untranscribed strands (transcriptional strand bias) of protein coding genes. A moderate degree of strand bias was detectable for C>A/G>T transitions across the 21 breast cancer genomes (p = 1.75 × 10−15) and is present in almost all cases (Figure 6A). This bias was characterized by fewer G>T mutations on transcribed than untranscribed strands. A strand bias was also observed for T>G/A>C mutations (p = 1.5 × 10−4) with fewer T>G mutations on transcribed than untranscribed strands. No evidence of a transcriptional strand bias was observed for C>G/G>C, C>T/G>A, T>A/A>T or T>C/A>G mutations. The best-recognized cause of transcriptional strand bias is TCR that removes nucleotides with bulky adducts from the transcribed strands of genes. Assuming that TCR is responsible for the observed strand biases, the presence of fewer G>T mutations on transcribed than untranscribed strands would suggest that bulky adduct damage to guanine may be the cause of the observed mutations. Similarly, the presence of fewer T>G mutations on transcribed compared to untranscribed strands would suggest that there may have been bulky adduct damage to thymine. The nature of these ubiquitous mutagenic exposures in breast cancer, which may conceivably be of exogenous or endogenous origins, is unknown. However, the hypothesis that TCR is involved is currently unsubstantiated and it may ultimately transpire that other DNA repair, or indeed damage, processes differentially affect the transcribed and untranscribed strands of genes. We next examined the relationship between levels of gene expression and prevalence of somatic mutation. An inverse correlation between substitution prevalence and gene expression was observed for C>A/G>T (p = 2.47 × 10−9), C>T/G>A (p = 7.5 × 10−3), T>A/A>T (p = 1.09 × 10−6), and T>C/A>G (p = 1.83 × 10−4) mutations for both transcribed and untranscribed strands (Figures 6 and S3A). No correlation was observed for C>G/G>C or T>G/A>C mutations. The results indicate that mutational processes characterized by both transcriptional strand bias and expression-related mutation prevalence are operative in breast cancer, similar to our previous observations in melanoma and lung cancer. However, T>G/A>C mutations exhibited a transcriptional strand bias but not expression-related mutation prevalence. Conversely, C>T/G>A, T>A/A>T, and T>C/A>G mutations showed expression-related mutational prevalence but no transcriptional strand bias (Figure 6B and Figure S3A) suggesting that these two features are independent. Finally, we examined the relationship between distance from the transcriptional start site (TSS) and mutation prevalence in protein coding genes. There was evidence of increasing mutation prevalence at increasing distance from the transcription start site (Figure 6C), suggesting that the suppressive influences of transcription upon mutagenesis described above wane as proximity to the TSS decreases. This effect appears to be particularly pronounced in the first 1 kb from the TSS (Figure 6D). The result confirms the observation previously made on UV light induced C>T mutations in a melanoma. Microhomology-Mediated Deletions in BRCA1 and BRCA2 Mutant Cancers Of the 2,869 validated somatic indels from the 21 breast cancers, single-base pair indels were the most common in each case (Figure 7A). There was substantial variation in number and pattern of indel, however, with more and larger indels observed in BRCA1 and BRCA2 mutant cancers. The sequences flanking each indel were interrogated for the presence of short tandem repeats or short stretches of identical sequence at the breakpoints (termed overlapping microhomology) (Figure 7B). Repeat-mediated indels were small (1–5 bp), present in all breast cancers, and comprised both deletions and insertions. Microhomology-mediated indels were larger (up to 50bp), mainly deletions and considerably more common in cases with BRCA1 or BRCA2 mutations (p = 2.2 × 10−16). Overlapping microhomology is often considered a signature of nonhomologous end-joining (NHEJ) DNA double-strand break (DSB) repair. The segments of microhomology probably mediate alignment of the two DNA ends that are joined. Because BRCA1 and BRCA2 are involved in homologous recombination (HR)-based DSB repair, the elevated frequency of microhomology-mediated indels in BRCA1 or BRCA2 mutant cancers presumably reflects usage of alternative methods of DSB repair in these cancers (Figure 7C). Discussion Catalogues of somatic mutation from 21 breast cancers have yielded several insights into underlying mutational mechanisms. Five independent single-nucleotide substitution processes appear to have been operating, generating the observed variation in mutation numbers and patterns between cancers. The processes appear to have been acting in combination, either contemporaneously or during different phases of evolution of the cancer clone (Nik-Zainal et al., 2012). Additional subtle processes may exist, and sharper definition of currently characterized processes may follow refinements of NMF and inclusion of other mutational features in the models. Signature A is likely mediated by deamination of 5-methyl-cytosine at XpCpG trinucleotides leading to C>T transitions. However, the mechanisms underlying the remainder are currently unknown. Signature B, characterized by C>T, C>G, and C>A substitutions at TpCpX trinucleotides, is responsible for the overwhelming majority of mutations in certain cancer samples and is present in this dominant form in approximately 10% of ER-positive breast cancers (Stephens et al., 2012). The mutational patterns in Signatures C, D, and E have not, to our knowledge, been previously described. A remarkable process generating regional hypermutation, termed kataegis, is frequently operative in breast cancer. Regional clusters of mutations in cancer have occasionally been observed in experimental models, although not at the mutation density observed here (Wang et al., 2007). Mutations within regions of kataegis bear similarities to those in Signature B, notably the preponderance of C>T and C>G substitutions at TpCpX trinucleotides. Furthermore, they are closely associated with regions of rearrangement and occur on the same chromosome and chromosomal strand over long genomic distances, suggesting that they occur simultaneously or in a processive manner over a short time span (Chen et al., 2011). On the basis of similarities to mutational patterns observed in other biological contexts or in experimental systems, we propose that the AID/APOBEC family of proteins may be implicated in kataegis and/or in the mutational process underlying Signature B. Although APOBEC1, the founding member of the AID/APOBEC family, was first identified as an RNA-editing enzyme (Teng et al., 1993) several members of the AID/APOBEC family (including APOBEC1 itself) can deaminate cytosine to uracil within DNA, acting as DNA mutators (Harris et al., 2002). AID functions in antibody diversification, deaminating cytosine residues within the immunoglobulin loci in B lymphocytes thereby triggering somatic hypermutation and class-switch recombination (reviewed in Longerich et al., 2006). There are seven APOBEC3 proteins in humans, with the prototype (APOBEC3G) as well as several other APOBEC3s acting on lentiviral replication intermediates constituting an innate pathway of antiretroviral defense (Hultquist et al., 2011; Sheehy et al., 2002). Although off-target deamination by AID is likely responsible for the mutations and translocations seen in many B cell tumors (reviewed in Nussenzweig and Nussenzweig, 2010), AID is unlikely to account for the mutational signatures described here because it exhibits a strong preference for deaminating C residues flanked by a 5′-purine (Pham et al., 2003). In contrast, the Cs targeted in Signature B and kataegis are preceded by a 5′-T. However, APOBEC1 (when acting on DNA) and the APOBEC3 enzymes (apart from APOBEC3G) favor C residues flanked by a 5′-T (Harris et al., 2002; Hultquist et al., 2011). Furthermore, transgenic overexpression of APOBEC1 is associated with cancer development (Yamanaka et al., 1995) and enforced overexpression of APOBEC3A causes genomic damage and mutation (Landry et al., 2011; Stenglein et al., 2010; Suspène et al., 2011). Thus APOBEC1 and some APOBEC3s are attractive candidates for the mechanisms underlying kataegis and/or mutation Signature B. Thus far, we have not observed a clear correlation between overexpression of a member of the AID/APOBEC family and kataegis or Signature B, although some key samples lack expression data. Signature E also exhibits mutations at TpCpX trinucleotides, but is characterized by a much lower fraction of C>T mutations than Signature B. It is possible that both result from cytidine to uracil deamination by an APOBEC family member, but that the different signatures are sequelae of different repair mechanisms following the deamination step. C>T transitions may simply result from DNA replication across uracil. However, if uracil is excised by uracil-DNA glycosylase (UNG) as part of base excision repair (BER), an abasic site is generated (Wilson and Bohr, 2007). The partiality for C>G transversions in Signature E may reflect preferential insertion of cytosine opposite such an UNG-mediated abasic site. The propensity to introduce cytosine opposite an abasic site is characteristic of REV1 translesion polymerase, which is known to function in BER (Jansen et al., 2006; Ross and Sale, 2006). Thus, Signature B may be caused by a combination of replication and BER, whereas Signature E may be the imprint of the almost exclusive activity of BER on uracil. Further studies are required to explore whether and how AID/APOBEC family members contribute to mutagenesis in cancer. If they are implicated in kataegis, current understanding of the mode of action of AID in immunoglubulin gene somatic hypermutation and class switch recombination would suggest that their primary effect is through deamination of cytidine to uracil, with substitutions and rearrangements both consequent upon this initiating event. If so, an important remaining question is how the activity of the enzymes is targeted to the regions of kataegis. Furthermore, if the same enzymes from the AID/APOBEC family are also involved in mutation Signatures B and possibly E, it remains to be understood how their activities can be unleashed upon the whole genome without apparent relation to the presence of rearrangements, as opposed to being regionally targeted in the vicinity of rearrangements in kataegis. Other mechanisms and enzymatic activities may, however, be responsible for kataegis. If so, the question of which constitutes the primary set of lesions, the rearrangements, or the substitutions observed in kataegis, remains to be addressed. If a stochastic event in a cell nucleus results in a DNA DSB and repair of this break is associated with accumulation of substitutions in the vicinity of the consequent rearrangement, this could provide an explanation for the regional targeting of kataegis. Indeed, such mechanisms have been reported in yeast (Deem et al., 2011; Hicks et al., 2010). In all the breast cancers, double-nucleotide substitutions were much more common than expected by chance adjacency of single-nucleotide substitutions, suggesting the existence of one or more biological processes responsible for their presence. Currently, the best-characterized double-nucleotide substitutions in human cancer are the CpC>TpT mutations found in skin tumors, which are generally attributed to the pyrimidine dimers generated by UV light exposure. In principle, the dinucleotide mutations observed in breast cancer could also be due to exposures with a propensity to damage adjacent DNA bases. However, other mechanisms are also plausible, for example error prone polymerases that have a higher risk of misincorporation at a base adjacent to one that is damaged. BRCA1 and BRCA2 are implicated in HR-based DNA repair processes. The distinctive profile of small deletions with rearrangement breakpoints showing overlapping microhomology in BRCA1 and BRCA2 mutant cancers is therefore compatible with these processes being defective and of NHEJ or other error-prone mechanisms of DSB repair acting in their place. Interestingly, the combinations of base substitution signatures in BRCA1 and BRCA2 mutant cases are also similar. These similarities in mutational signatures contrast strikingly with differences in histology and gene expression profiles between BRCA1 and BRCA2 mutant cancers (Hedenfalk et al., 2001; Palacios et al., 2008; Perou et al., 2000; Sørlie et al., 2001). Mutational patterns, which are probably more closely related to the underlying biological defect, therefore appear to report similarities in disease pathogenesis between BRCA1 and BRCA2 mutant cancers better than cellular phenotype. BRCA1 and BRCA2 cancers are particularly responsive to some DNA damaging agents and inhibitors of DNA repair, notably PARP inhibitors (Fong et al., 2009). Because some breast cancers without mutations in BRCA1 and BRCA2 are reported to respond to these treatments (Forster et al., 2011) it will be interesting to explore whether mutational patterns characteristic of BRCA1 and BRCA2 null cancers are better predictors of response to these therapies than the presence of mutations in the two genes. These 21 genomes have yielded further evidence of complex relationships between mutagenesis and transcription. A transcriptional strand bias was found for C>A/G>T mutations in most of the cancers. If TCR is responsible, DNA damage by bulky adducts may be implicated in breast cancer pathogenesis. In principle, these could result from exogenous exposures. Indeed, many carcinogens cause adducts on guanine. Alternatively, an exposure could be endogenous in origin, for example reactive oxygen species (Hori et al., 2011) or intermediates of oxidative estrogen metabolism (Spencer et al., 2012). Both can cause damage to guanine and, although preferentially repaired by BER, some lesions can be substrates for TCR (Hanawalt and Spivak, 2008). If TCR is not involved, then the data suggest that other uncharacterized forms of transcription coupled DNA damage or repair exist. The relationship between gene expression levels and mutation prevalence, previously reported in a malignant melanoma and a small cell lung cancer (Pleasance et al., 2010a; Pleasance et al., 2010b), has been extended here to primary breast cancers. The relationship is again inverse in nature, with more somatic substitutions in poorly expressed genes. The phenomenon could, in principle, be due to an increased sensitivity to DNA damage and/or less efficient repair in poorly expressed genes. The fact that it applies to the untranscribed strands of genes and does not correlate with the presence of transcriptional strand bias suggests that these have different underlying mechanisms. One possibility is that the genome-wide form of NER is recruited more effectively to highly transcribed genes. This study has started to untangle and characterize the mutational processes that contribute to breast cancer. The data are derived from only 21 genomes and similar analyses of thousands of cancers by the International Cancer Genome Consortium (Hudson et al., 2010) will likely yield evidence of further mutational processes and better definition of those already known. Nevertheless, the analyses have provided a level of characterization of mutational processes in cancer that was previously impossible and illustrate the power of whole cancer genome sequences, yielding essentially complete catalogs of somatic mutations, to further understanding of mechanisms of DNA damage and repair. Experimental Procedures Samples and Massively Parallel Sequencing DNA was extracted from 21 breast cancers and normal tissue from the same individuals. Short insert 500 bp library construction, flowcell preparation and cluster generation were according to the Illumina no-PCR library protocol (Kozarewa et al., 2009). 108 base or 100 base paired-end sequencing were performed on Illumina GAIIx or Hiseq 2000 genome analyzers respectively, as described in the Illumina Genome Analyzer operating manual. Short insert 2∗108 bp or 2∗100 bp paired-end reads were aligned to the reference human genome (NCBI37) by using BWA (Li and Durbin, 2009). Genome sequence data have been deposited at the European Genome-Phenome Archive (http://www.ebi.ac.uk/ega/ at the EBI) with accession number EGAD00001000138. SNP6 array data have been deposited with ArrayExpress Archive (EBI, accession number E-MTAB-1087). Mutation-Calling An in-house bespoke algorithm, CaVEMan was used for calling somatic substitutions. Insertions and deletions in the tumor and normal genomes were called by using a modified Pindel version 0.2.0 on the NCBI37 genome build (Ye et al., 2009). Postprocessing filters were developed to improve the specificity of mutation-calling. Structural variants were called from the short insert data by using MAQ alignments as previously described (Campbell et al., 2008; Stephens et al., 2009). Structural variants in association with copy number segments were sought to improve sensitivity of detection. Tumor DNA samples were analyzed by Affymetrix SNP6 microarrays (Bignell et al., 2010). Copy number and allelic ratio profiles were statistically processed by using the ASCAT algorithm, version 2.0 (Van Loo et al., 2010). Validation of substitutions and indels was performed by Roche 454 pyrosequencing or capillary sequencing. Structural variants were confirmed by custom-designed PCR across the rearrangement breakpoint or by local sequence assembly. All confirmations were performed in both tumor and normal. Gene expression data were derived from the Illumina Human HT12 Expression BeadChip array, and processed as previously described (Pleasance et al., 2010a). Somatic mutation data are available via hyperlinks in Table S1B and are also available via COSMIC at http://www.sanger.ac.uk/genetics/CGP/cosmic/ and have been annotated to Ensembl v58. Statistical Analysis Mutational processes were extracted by using nonnegative matrix factorization. Monte Carlo simulations were performed to assess how randomly distributed mutations differed to the primary cancer genomes. A Kolmogorov-Smirnov test was used to compare the distribution of indels mediated by repeats or microhomology. Extended Experimental Procedures Samples and Massively Parallel Sequencing DNA was extracted from 21 breast cancers as well as matched normal tissue derived from the same individuals. Samples had previously been subjected to pathology review and only samples assessed as being composed of > 70% tumor cells, were accepted for these analyses. Short insert 500bp library construction, flowcell preparation and cluster generation was performed according to the Illumina no-PCR library protocol (Kozarewa et al., 2009). 108 base paired-end sequencing or 100 base paired-end sequencing was performed on Illumina GAIIx genome analysers or Illumina Hiseq 2000 analysers respectively, as described in the Illumina Genome Analyzer operating manual. Short insert 2∗108 bp or 2∗100 bp paired-end reads were aligned to the reference human genome (NCBI37) by using BWA (Li and Durbin, 2009). An average of 30-fold sequence coverage was required for both tumor and normal genomes. One breast cancer, PD4120a, was sequenced to ∼188-fold coverage (Table S1A). Mutation-Calling: Substitutions An in-house bespoke algorithm, CaVEMan (Cancer Variants Through Expectation Maximization) was used for calling somatic substitutions. In brief, CaVEMan is a somatic base substitution caller that utilizes an expectation maximization (EM) algorithm and is designed for calling variants in new sequencing technology reads. Given the reference base, copy number status and fraction of aberrant tumor cells present in each cancer sample, CaVEMan generates a probability score for potential genotypes at each genomic position. A putative ‘somatic’ genotype probability of 95% and above was applied as a cut off. A high specificity was essential for the nature of downstream analyses applied in this study. As such, further postprocessing filters of potential ‘somatic’ genotypes were designed to eliminate false positive calls arising from: genomic features that generate mapping errors e.g., regions of excessively high coverage due to collapsed repeat sequences in the reference genome (http://genome.ucsc.edu/) systematic sequencing artifacts e.g., motifs known to cause errors of phasing during the sequencing reaction or sequencing artifacts arising in at least 5% of at least 2 samples from a panel of normal samples germline insertions/deletions. 1%–5% of putative somatic substitution variants from each cancer genome were sampled for validation in order to make an assessment of the specificity of these cancer genomes. Mutation-Calling: Insertions/Deletions Insertions and deletions in the tumor and normal genomes were called by using a modified Pindel version 0.2.0 on the NCBI37 genome build (Ye et al., 2009). Indels were required to be present in 5 reads or more in the tumor and not present in the matched normal sample. Variants were also screened against a panel of normal samples and were excluded if present in at least 5% of reads in at least 2 samples from this panel. All indels reported in this study have been validated. Mutation-Calling: Copy Number Tumor DNA samples were analyzed by Affymetrix SNP6 microarrays (Bignell et al., 2010). Copy number and allelic ratio profiles were statistically processed by using the ASCAT algorithm, version 2.0 (Van Loo et al., 2010). Mutation-Calling: Structural Variation Structural variants were called from discordantly mapping paired-end reads from short insert data by using MAQ alignments as previously described (Campbell et al., 2008; Stephens et al., 2009). In order to improve sensitivity of detection, additional candidate structural variants were sought from within the proximity of copy number changes in the following way. All nontelomeric and noncentromeric segmentation breakpoints were obtained from SNP6 data processed via ASCAT (Van Loo et al., 2010). Candidate rearrangements close to copy number segmentation breakpoints were considered if rearrangements for which both breakpoints correspond to copy number breakpoints were identified or if one copy number breakpoint could only match one rearrangement. Rearrangements closer to a copy number breakpoint were preferred over rearrangements further away. When both rearrangement breakpoints matched a copy number breakpoint, the sum of the distances between the rearrangement and the copy number breakpoints was below 400kb, and when for only one rearrangement breakpoint a corresponding copy number breakpoint was found, the distance between both breakpoints was below 20kb. Expression Analysis Gene expression data were derived from the Illumina Human HT12 Expression BeadChip array, run in duplicate, normalized and processed as previously described (Pleasance et al., 2010). Validation of Substitutions and Insertions/Deletions Validation of putative somatic substitutions was performed via Roche pyrosequencing in 20 tumor-normal pairs and capillary sequencing in 1 tumor-normal pair (PD3890). All coding substitution variants and a random assortment of noncoding variants were selected for validation to make up to ∼400 PCR products per sample. In addition to the variants validated to determine specificity, validation was also targeted to several hundred substitutions involved in regions of hypermutation and dinucleotides (Table S1B). Primers were designed to generate 275–425 bp fragments suitable for Roche 454 pyrosequencing. The specificity of the calling of substitution variants from the Illumina sequence reads, was determined from the proportion of calls confirmed as somatic when sequenced on this orthogonal platform (Table S1A). For pyrosequencing data, an average coverage of ∼657X was achieved for each validated variant. A minimum of 25 reads of mapping quality of at least 20 and base quality of 25 and above were required to report each variant. To be considered as somatic: variants were required to be present in at least 5% of the reads in the tumor and not in the normal, or if present at a low mutation burden of A, C > G, C > T, T > A, T > C, T > G) at each 5′ and 3′ base context from 21 (M) breast cancer cases. Thus, matrix A has a size of 96 × 21. We decompose this data set into two matrices - W with size 96 × k and H with size k × 21 where k is the number of signatures that we are trying to model and identify. We perform NMF and use a model selection approach for k = 2 ⋅⋅⋅ 20. An optimal decomposition and value of k was chosen based on the cophenetic correlation coefficient (a measure of how faithfully clustering approaches preserve pairwise distances and therefore dendrogram structures) and the average reconstruction error (Brunet et al., 2004). NMF was performed by using a modified version of the publicly-available implementation (Brunet et al., 2004) of the multiplicative updated algorithm (Lee and Seung, 1999) and was repeated 1,000 times for each value of k. The cophenetic correlation coefficient indicated reproducibility and stability for k values between 2 and 6 (Figure S2A). The cophenetic correlation fell sharply for k > 6 (less than 0.95, Figure S2A) indicating a lack of robustness when a decomposition exceeded 6 signatures for this data set. Given a value of k, each sample was reconstructed and compared to the observed data (Figure S2B). Error in reconstruction for each value of k was plotted (Figure S2C), and a dramatic reduction in the slope of the reconstruction error revealed that the model stabilized at five mutational signatures. As such, we selected to decompose the pooled mutation data set into five stable mutational signatures. A typical comparison between the reconstructed and observed mutation profile is given in Figure S2B. The concordance indicated that five signatures were sufficient to describe the general behavior of mutation profiles of the 21 breast cancer samples. In theory, whereas NMF is able to highlight true mutational signatures that underlie mutational profiles in cancers, it should be noted that it is also able to identify systematic sequencing artifacts as a mutational signature. Monte Carlo Simulation of Double Substitutions In order to assess the degree of enrichment and statistical significance of the occurrence of double-nucleotide substitutions, Monte Carlo simulations were performed for each cancer genome. The mutation prevalence of each mutation type (C > A, C > G, C > T, T > A, T > C, T > G) was obtained for each chromosome of each cancer genome. For each genome, 1000 simulations were then performed by generating mutations in silico, at the observed mutation rates. For each simulation, the total number of in silico double substitution was identified and this number was compared to the observed number of double substitutions in the cancer genome. None of the simulations yielded a greater number of double-nucleotide substitutions than were observed in the cancer genomes, hence p T transitions occurring within CpG islands and outside CpG islands (http://genome.ucsc.edu/) were determined. The samples were considered in aggregate apart from PD4120a that was excluded because of the nature of the excessive mutation burden contributed by this sample. The odds ratio of the rate of mutation outside to inside CpG islands was calculated.