1
views
0
recommends
+1 Recommend
0 collections
    0
    shares
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      COSINE: A web server for clonal and subclonal structure inference and evolution in cancer genomics

      research-article

      Read this article at

      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Related collections

          Most cited references11

          • Record: found
          • Abstract: found
          • Article: found

          Clonal Heterogeneity and Tumor Evolution: Past, Present, and the Future

          Intratumor heterogeneity, which fosters tumor evolution, is a key challenge in cancer medicine. Here, we review data and technologies that have revealed intra-tumor heterogeneity across cancer types and the dynamics, constraints, and contingencies inherent to tumor evolution. We emphasize the importance of macro-evolutionary leaps, often involving large-scale chromosomal alterations, in driving tumor evolution and metastasis and consider the role of the tumor microenvironment in engendering heterogeneity and drug resistance. We suggest that bold approaches to drug development, harnessing the adaptive properties of the immune-microenvironment while limiting those of the tumor, combined with advances in clinical trial-design, will improve patient outcome.
            Bookmark
            • Record: found
            • Abstract: found
            • Article: found
            Is Open Access

            The evolutionary history of 2,658 cancers

            Cancer develops through a process of somatic evolution 1,2 . Sequencing data from a single biopsy represent a snapshot of this process that can reveal the timing of specific genomic aberrations and the changing influence of mutational processes 3 . Here, by whole-genome sequencing analysis of 2,658 cancers as part of the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium of the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA) 4 , we reconstruct the life history and evolution of mutational processes and driver mutation sequences of 38 types of cancer. Early oncogenesis is characterized by mutations in a constrained set of driver genes, and specific copy number gains, such as trisomy 7 in glioblastoma and isochromosome 17q in medulloblastoma. The mutational spectrum changes significantly throughout tumour evolution in 40% of samples. A nearly fourfold diversification of driver genes and increased genomic instability are features of later stages. Copy number alterations often occur in mitotic crises, and lead to simultaneous gains of chromosomal segments. Timing analyses suggest that driver mutations often precede diagnosis by many years, if not decades. Together, these results determine the evolutionary trajectories of cancer, and highlight opportunities for early cancer detection.
              Bookmark
              • Record: found
              • Abstract: found
              • Article: not found

              The Life History of 21 Breast Cancers

              Introduction Age-incidence curves of most common epithelial cancers show rapidly increasing rates after the 4th–5th decades of life. Classic mathematical models of tumor development developed by Armitage and Doll (Armitage and Doll, 1954; Hornsby et al., 2007) suggested that 5–8 rate-limiting events are required to generate such incidence patterns. Since these studies were performed in the 1950s, we have learnt much about the biological and genetic basis of cancer. In particular, evolution toward cancer often occurs on a phenotypic spectrum of increasingly disordered premalignant stages, as “hallmark” cellular processes are cumulatively co-opted or ablated in the cancer cells (Hanahan and Weinberg, 2011). Somatic mutation is the fundamental mechanism by which cancer cells suborn these pathways (Stratton et al., 2009), notwithstanding the contributions of epigenetic changes, cues from the local microenvironment and germline genetic variation. Whole-cancer genomes sequenced to date carry thousands to tens of thousands of somatic mutations (Chapman et al., 2011; Ding et al., 2010; Ley et al., 2010; Pleasance et al., 2010a, 2010b; Puente et al., 2011; Shah et al., 2009), the vast majority of which probably have no biological relevance. The accumulation of mutations in cancerous and precancerous cells over time is increasingly recognized as a complex, dynamic process. Carcinogenic exposures and DNA repair defects can lead to sustained elevations in mutation rate; telomere attrition and chromothripsis can drive massive genomic rearrangement in catastrophic bursts (Bignell et al., 2007; Campbell et al., 2010; O'Hagan et al., 2002; Stephens et al., 2011). In the classic view of cancer development, those somatic mutations conferring a selective advantage on the cell drive successive waves of clonal expansion, with the fittest clone coming to dominate the cellular compartment. Increasingly, however, cancers are recognized to be mixtures of competing subclones, based on analyses of cancers sampled within a patient at different times (Ding et al., 2012; Shah et al., 2009), from different sites (Campbell et al., 2010; Ding et al., 2010; Yachida et al., 2010), at hypermutable genomic loci (Campbell et al., 2008), or through single-cell isolation (Anderson et al., 2011; Navin et al., 2011; Notta et al., 2011). Although these studies imply the existence of genetic heterogeneity within a tumor, fundamental questions remain about the dynamics of Darwinian evolution in cancer, the biological relevance of subclonal genetic variation and the relationship between mutational processes and clonal expansion. Here, we use newly developed bioinformatic algorithms (Greenman et al., 2012) to reconstruct the genomic history of 21 breast cancers. Borrowing the concept of a “most-recent common ancestor” from population genetics, we can divide somatic mutations into those acquired before the last complete selective sweep (and thus shared by all cancer cells within the sample) and those subclonal variants that occurred after the emergence of the common ancestor. We study the early genomic evolution of the eventual cancer clone, quantify the extent and dynamics of subclonal variation within the cancer sample sequenced and explore changes in mutation signatures over time. These findings have important implications for our understanding of how breast cancers develop over the decades between breast organogenesis and diagnosis in the adult. Results Inference of Cancer Genome Evolution We sequenced 20 primary breast cancer samples to an average 30–40 coverage across each base in the genome. The sample series includes four cases each of estrogen-receptor (ER)-positive, HER2-positive, and BRCA2-positive breast cancer; three cases of triple negative; and five cases of BRCA1-positive breast cancer. In addition, we sequenced to 188-fold depth one other ER-positive tumor with a distinctive mutator phenotype, consisting of C>A, C>G and C>T mutations specifically in a TpC context. As described in the companion paper to this (Nik-Zainal et al., 2012, this issue of Cell), we identified a high-confidence, validated set of base substitutions, insertions, and deletions (indels); genomic rearrangements; and copy number changes in the 21 cancers. To develop the reasoning that underpins this paper, we start with the tumor sequenced to 188-fold depth, PD4120a. At the chromosomal scale, the cancer genome is hypodiploid, with relatively few copy number changes (Figure 1A). To exploit the considerable sequencing depth available for this tumor, we modified the parameters of our somatic substitution algorithm in order to identify subclonal mutations; those found in only a fraction of tumor cells. In total, we identified 70,690 somatic substitutions genome-wide, including many in which fewer than 5% of the reads across the base reported the variant allele. That these are bona fide mutations is evidenced by the dominance of the C>∗ mutations in a TpC context at all levels of subclonality (Figures S1A and S1B, available online) and a high rate of verification in a subset by targeted PCR and pyrosequencing. The mutations fall into well-circumscribed clusters when displayed by the fraction of reads reporting the variant (Figure 1B and Figure S1C). The major cluster of points occurs at a read depth around 210, with about 35% of reads reporting each variant. Because an estimated 70% of cells in the sample derive from the cancer clone, this cluster represents those point mutations found in all tumor cells on one copy of a diploid chromosome. By comparison, mutations from regions of copy number 1 show a lower overall read depth (because the copy number is lower) but higher variant allele fraction (because there are no reads from the deleted chromosome). The genome has one triploid chromosomal region, 1q, which is most likely to have arisen as a single gain of one whole chromosome arm, although we cannot formally exclude duplication of both alleles with subsequent loss of one. Mutations occurring on the relevant chromosomal arm before duplication would be present on two of three copies (with an expected variant allele fraction of ∼55%), whereas mutations occurring after the duplication would be present on only one copy. In fact, we find only seven mutations on 1q predicted to have occurred before the chromosome arm was duplicated, not one of which has the signature of C>∗ mutations in a TpC context (Figure S1C). In contrast, 1,250 mutations are found on 1q at single copy number, of which 1,130 have this mutation signature, with the spectrum of early (ploidy 2) and late (ploidy 1) mutations being significantly different (p  50% to completely separate subsets. Therefore, the two variants must be collinear on the phylogenetic tree. If this reasoning applies for any two such mutations at > 50%, then it applies to all such mutations en bloc. Furthermore, if one such mutation is found in a strictly greater fraction of cancer cells than another such mutation, then it must have occurred earlier than the other. Applying these deductions to PD4120a, it follows that the mutations found in cluster C are all on the same branch of the phylogenetic tree, together with the subclonal deletion of chromosome 13. Furthermore, because the deletion is found in a larger proportion of cells, it must have occurred earlier during the cancer's evolution than cluster C mutations. We can directly test the veracity of this reasoning. Because we reason that the deletion of 13 occurred before subclonal mutations in cluster C, we predict that those subclonal mutations could only involve the retained copy of chromosome 13. Many somatic mutations will be sufficiently close to heterozygous germline SNPs that individual sequencing read pairs will span both, thus allowing the mutation to be “phased” with the SNP (Figure 2B). Of the 2,171 mutations on chromosome 13, we were able to phase 756 (35%) with a nearby heterozygous SNP, thus unambiguously determining whether the mutation occurred on the parental copy of chromosome 13 that was subclonally deleted or on the retained copy (Figure 2C). We find a cluster of mutations on the retained copy of chromosome 13 at a variant allele frequency of ∼48% (green points, Figure 2C)—this represents fully clonal mutations (cluster D). We also see a cluster of mutations from the deleted copy of chromosome 13 at ∼15% (brown points), denoting ancestral mutations subsequently deleted in 68% of tumor cells. A third distinct cluster of mutations is evident at ∼25% of reads, the equivalent of cluster C. All of these, as predicted, are phased with the retained copy of chromosome 13. This approach is also informative for the other subclonally deleted chromosomes. For mutations on the retained copy of chromosomes 6, 7, 8, and 11, we find clusters A–D as for nonsubclonal diploid chromosomes (Figure 3A and data not shown). For mutations on the parental copies of chromosomes 6 and 7 that are subclonally deleted, cluster B is completely lost, whereas the others remain unchanged. This demonstrates that virtually all mutations in cluster B are on a separate phylogenetic branch from mutations in cluster C. Furthermore, the subclonal deletion of chromosome 6 and 7 must be collinear with the mutations in cluster B. The same patterns and reasoning apply to chromosome 8, 11, 12, 14, 15, 18 and 21. For chromosome 2, we find that cluster B is abolished on both parental copies of the chromosome. This confirms the observation from the logR values that both copies of chromosome 2 are subclonally deleted (Figure S2B) and moreover places these deletions on the same branch as the mutations in cluster B. In summary, these data indicate that the subclonal deletion of chromosome 13 and the mutations in cluster C are on the same branch of the phylogenetic tree, with del13 occurring first. On a separate branch of the tree from this dominant subclone, we find all the mutations in cluster B, together with subsequent subclonal deletion of chromosomes 2, 6, 7, 8, 9, 11, 12, 14, 15, 18, and 21. Phasing Pairs of Subclonal Somatic Mutations We can also attempt to phase any two somatic mutations that are sufficiently close together to be spanned by single read pairs. We recognize two informative scenarios. The two mutations could arise in completely independent subclones, in which case reads could report either variant alone but never the two together (Figure S4A). Alternatively, one mutation could occur as subclonal evolution in a cell that already contains the other mutation, in which case we would see reads that report the earlier variant only as well as reads that report both variants together (Figure S4B). It is only valid to identify mutually exclusive mutation pairs in chromosomes that are haploid in the tumor, and we do indeed find 17 such pairs (Figure 3B and Figure S4C). Genome-wide, we also identify 76 examples of sub-subclonal evolution occurring on the same allele as a pre-existent subclonal mutation (Figure 3C, and Figure S4D). Strikingly, there are no examples of sub-subclonal evolution at 9%–12% variant allele fraction (cluster B) occurring in conjunction with a mutation at > 16% allele fraction (cluster C), confirming that mutations in cluster B fall on a separate phylogenetic branch from those in cluster C. These data also indicate that cluster A, the set of mutations at a variant allele fraction ∼5%, is likely to contain several discrete subclones. Some of these variants are clearly subclonal to cluster C and others subclonal to cluster B, as shown in Figure 3C. However, most are not derived from cluster B, because the peak for cluster A is largely unchanged for the parental copy of chromosome 7 that is subclonally deleted (Figure 3A). Mutations in cluster A frequently fell in mutually exclusive subclones (Figure 3B) and hence on different branches of the phylogenetic tree. It is therefore probable that some or even most of the cluster A mutations represent a third branch from the most-recent common ancestor. In support of this is a pair of cluster A mutations, both of which phase with the subclonally deleted copy of chromosome 13, and hence cannot be placed on the del13 branch, but are mutually exclusive with one another (Figure S4E). A Phylogenetic Tree for PD4120a With these observations, we can integrate the subclonal chromosome-scale losses with the subclonal point mutations to reconstruct how the tumor has evolved (Figure 3D). In one branch of the phylogenetic tree, there has been loss of chromosome 13 and subsequent acquisition of cluster C mutations. The other branch of the phylogenetic tree contains the cluster B mutations and sub-subclonal losses of multiple chromosomes, including both parental copies of chromosome 2. Because homozygous deletion of chromosome 2 in a diploid cell is frankly implausible, the most likely model is that a sub-subclone of the cluster B subclone has become tetraploid, presumably through an endoreduplication event. It has subsequently lost one of the four copies of chromosomes 6, 8, 9, 11, 12, 14, 15, 18, and 21. In addition, both copies of the same parental chromosome 7 have been lost, and one of each parental copy of chromosome 2 has been lost. From this model, we estimate that the subclone with mutations in cluster C represent 65% of tumor cells, cluster B represents 18% of tumor cells, and the tetraploid subclone represents 14% of tumor cells. Mutations in cluster A account for 14% of tumor cells. If many of these do fall on a third branch of the phylogenetic tree, the three branches would neatly account for all descendants of the most-recent common ancestor because the 14% of tumor cells in cluster A, the 68% with deletion 13, and the 18% of tumor cells in cluster B together add up to 100%. This phylogenetic tree explains all data observed for PD4120a. Timing Chromosomal Evolution in 20 Breast Cancer Genomes Chromosomal instability, the gains and losses of whole chromosomes or chromosome arms, is a well-recognized feature of breast cancer cells probably caused by missegregation of chromosomes during cell division (Burrell et al., 2010). As outlined above, for genomic regions that have increased in copy number, we can estimate the timing of the duplication event by comparing the proportion of mutations at ploidy 1 and ploidy 2 (Greenman et al., 2012). Among the 20 breast genomes sequenced to 30- to 40-fold depth, 16 had informative genomes for timing chromosomal gains (Figure 4). Broadly, the data suggest that the onset of large-scale chromosomal gains did not begin across these genomes until after at least 15%–20% of point mutation time had elapsed but thereafter continued steadily in many tumors. The implication is that chromosomal instability is not usually the earliest source of mutation in breast cancer evolution, but is a common and on-going process in later stages. A related phenomenon is that of whole-genome duplication, caused by a single event of cytokinesis failure, endoreduplication, or fusion of two diploid cells. Ten of the tumors studied here show evidence for such an event, inferred from the homogeneity of the distribution of early and late mutations across the genome (Figure 4 and Figure S5A). In general, such endoreduplication was a late event in this series, occurring after more than 50% of point mutation time had elapsed and often following many preceding single chromosome losses and gains. Five informative tumors had genomic amplifications of known cancer genes, three involving ERBB2 and one each involving MYC and CCND1. In four of the five cases, no mutations were present on all copies of the amplified segment, even allowing for the one patient where both parental copies of the locus contributed to the amplification (Figures S5B and S5C). Therefore, the first rearrangement driving the genomic amplification presumably occurred early in the evolution of these cancers. Interestingly, however, all amplifications showed multiple mutations at several discrete stages of ploidy intermediate between all copies and only one copy of the amplified region (Figures S5B and S5C). These mutations must have accumulated after the amplification had begun, because they do not involve all copies, but before the amplification was complete, because their ploidy is more than 1. Coupled with the fact that such mutations were observed at several discrete levels of ploidy, these data suggest that the genomic rearrangements driving the amplifications in these patients were acquired over a relatively protracted period of molecular time. This pattern is different to mechanisms of genomic amplification such as breakage-fusion-bridge or double minute chromosomes, where amplification can occur rapidly and even exponentially over a few cell cycles (Bignell et al., 2007; Campbell et al., 2010; Stephens et al., 2011). Changing Spectrum of Mutations over Time In addition to timing when chromosomal gains occur, we can compare the spectrum of point mutations acquired early, before the copy number gain (ploidy 2), and late, after the gain (ploidy 1) (Pleasance et al., 2010a). Fourteen of the genomes had sufficient numbers of early and late mutations to enable statistical comparison of the mutation spectrum over time (Figure 5A). Of these, 11 had statistically significant differences in spectrum between mutations acquired early and late, with many patients showing a strikingly different profile. The most consistent pattern is that C>T transitions constitute a higher proportion of early mutations than of late mutations, with 10/14 genomes showing a statistically significant decrease in C>T ratios after chromosome gains. C>T transitions are frequently caused by spontaneous deamination of methylated cytosine to thymine. However, we find that, in general, the decreased proportion of C>T mutations over time applies equally to other contexts as to those at CpG dinucleotides (Figure S6). In the companion paper to this one (Nik-Zainal et al., 2012), we show that many breast cancer genomes have distinctive mutation processes, from which a nonnegative matrix factorization algorithm identified five separate signatures. By classifying whether mutations were early clonal (ploidy > 1), late clonal (ploidy = 1) or subclonal (ploidy  T mutations at CpG dinucleotides, termed “signature A,” contributes a large proportion of the early mutations in these cancers, and relatively few late in the evolution of the tumors. In contrast, “signature E,” denoting C>G mutations at TpCpA, TpCpC and TpCpT trinucleotides, is a late onset mutational signature, contributing a large fraction of subclonal mutations in many patients. A similar pattern is seen for “signature B” comprising C>G or C>T mutations in a TpC context. Although not shown in these figures, dinucleotide substitutions were also significantly over-represented among late mutations than early events (odds ratio, 1.9; p  T transitions, both at CpG dinucleotides and in other sequence contexts, play a significant role in the early acquisition of mutations, accounting for up to 40% of mutations acquired before chromosomal gains. To some extent, the profile of base changes seen among many of the early breast cancer variants is a default mutation spectrum, closely mirroring that seen in tumor types such as blood, pancreatic, and brain cancers (Greenman et al., 2007; Jones et al., 2008; Papaemmanuil et al., 2011; Puente et al., 2011) and indeed in germline nucleotide substitutions (Hwang and Green, 2004). The lower proportional contribution of C>T transitions among late mutations is most likely caused by an increase in the rate of other mutation types because tumor-specific signatures account for much of the variation between early and late mutations. Intriguingly, we find that there are several mutation signatures at play in many of these patients, contributing varying proportions of mutations and with onset at different times during cancer evolution. The implication therefore is that in most breast cancers, the mutation rate increases in more advanced stages of tumor development, driven by distinctive, cancer-specific mutational processes. These processes continue past the emergence of the most-recent common ancestor, driving subclonal diversification within the tumor. Timing Kataegis, Localized Clusters of Mutations In the companion paper to this one (Nik-Zainal et al., 2012), we describe localized clusters of C>T and C>G mutations occurring in a TpC context closely associated with genomic rearrangements, which we termed kataegis. The presumption here is that an individual cluster of mutations occurs in a single event because of the close association with rearrangements and the fact that there is a strong strand bias within a cluster. Although the mutations within each cluster might occur simultaneously, however, the relative timing of different clusters of kataegis is not clear. In PD4103a, there are many clusters of kataegis mutations genome-wide. Interestingly, within the amplicon involving regions of chromosomes 10, 11, and 12, we find that these clusters occur at several different levels of ploidy (Figure 5C). For example, on chromosome 12, there are several such events found at variant allele fraction of 0.8 or higher in association with rearrangements that demarcate large copy number changes. These must have occurred early in the genesis of the amplicon and then themselves been amplified by subsequent rearrangements. Interestingly, there is also a cluster at an allele fraction of 0.4 and several at allele fractions  1 are aggregated as “early,” and all ploidy 1 mutations are aggregated as “late.” We compare the proportions of each by using standard chi-square tests for independence. Note that this introduces two simplifications. First, not all chromosomal gains occurred at the same time in these tumors, meaning that aggregation across different regions somewhat dilutes the differences observed. Second, in regions of triploidy as 2+1 parental copies, early mutations on the nonamplified parental copy will remain at ploidy 1 and thus be classified as “late.” However, the effects of both of these simplifications would be to obscure any true differences between early and late mutation signatures, rather than make any artifactually appear. In practice, the effects of the simplifications are rather small, and outweighed by the improvements gained in ease of interpretation. We also applied the nonnegative matrix factorization, as described in the companion paper, to these timed mutations, including a category for subclonal mutations, where the variant allele fraction was  ∗ signature in a TpC context at all levels of allele fraction (Figure S1A). Phasing Somatic Mutations with Heterozygous Germline SNPs To determine the fraction of tumor cells carrying a given mutation, we use the following formula: f = min ( 1 , r R ρ η T + ( 1 − ρ ) η N ρ ) , where ρ is the fraction of cells in the sample that are tumor cells (derived here from the ASCAT package [Van Loo et al., 2010]); r is the number of reads reporting the variant allele out of R total reads across the base in question; and ηT and ηN are the copy number of the genome at that base in the tumor and normal genomes respectively. For chromosomes showing subclonal copy number variation, we identified and phased mutations linked to nearby heterozygous germline SNPs. First, we identified heterozygous germline SNPs that were less than the maximum insert size away from a given mutation on the relevant chromosome (700bp in this sample). We then screened the sequencing file for read pairs where sequence covered both the SNP and the mutation, allowing us to determine which of the heterozygous SNP alleles was linked to the mutation. We determined whether that particular allele was on the deleted or retained copy of the chromosome by evaluating the number of reads from the tumor sample reporting either allele of the heterozygous SNP. If the observed fraction of reads reporting the germline allele linked to the mutation was significantly less than the expected ratio for the baseline state (ie 0.5 in a near-diploid chromosome) by a binomial test, we classified the mutation as being on the subclonally deleted chromosome. Where significantly greater than the expected ratio, we classified it as being on the retained copy. Where the observed fraction was not significantly different from the expected ratio, we identified the allele as part of the wider imputed haplotype (as discussed below) and determined whether the mutation was on the subclonally deleted or retained parental copy accordingly. Phasing Two Nearby Subclonal Somatic Mutations To identify pairs of subclonal mutations that were either mutually exclusive or showed subclonal evolution, we compiled a list of mutations present in < 100% of tumor cells and identified pairs of such mutations within an insert size of one another. In order to characterize a pair of subclonal mutations as showing subclonal evolution, we needed to find at least one example of a read pair reporting both variant alleles simultaneously AND one read pair reporting one variant allele but the wild-type allele for the other locus. In addition, we required that no reads were seen reporting the reverse situation (wild-type for the former locus and variant for the latter locus). To classify a mutation as “mutually exclusive,” we could only examine regions for which the copy number in the tumor is 1. This is because for regions of higher copy number, mutations could be in the same subclone but on different parental copies of the chromosome. Within a region of copy number 1, the requirement for a mutually exclusive pair of mutations was that at least one read pair reported the variant allele at the 5′ locus and the wild-type allele at the 3′ locus AND one read pair reporting the wild-type allele at the 5′ locus and the variant allele at the 3′ locus. In addition, we required that no read pairs reported both variant alleles simultaneously. Modeling Subclonal Structure Using Bayesian Dirichlet Processes We model somatic mutations within the tumor as deriving from an unknown number of subclones, each of which is present at an unknown fraction of tumor cells and contributes an unknown proportion of all somatic mutations, with all the unknown parameters to be jointly estimated. We initially considered conventional mixture models, but found the requirement to specify the number of clusters too restrictive. For this reason, we used a Bayesian Dirichlet process (Dunson, 2010) to model the data. Essentially, for a set of observed somatic mutations, we know the total read-depth across the base and the number of those reads reporting the variant allele for each mutation. We also know the expected fraction of reads that would report a mutation if present at one copy in 100% of tumor cells given the copy number at that locus and the normal cell contamination (through the equation shown above). Then, y i ∼ Bin(N i , ζ i π i ), with π i ∼ D P(α P 0), y i ∼ Bin ( N i , ζ i π i ) , with π i ∼ D P ( α P 0 ) , where yi is the number of reads reporting the ith mutation from Ni reads and ζi is the expected fraction of reads that would report a mutation present in 100% of tumor cells at that locus. Here, πi ∈(0, 1), the fraction of tumor cells carrying the ith mutation, is modeled as coming from a Dirichlet process. We use the stick-breaking representation of the Dirichlet process: P = ∑ h = 1 ∞ ω h δ π h , with π h ∼ P 0 , where ωh is the weight of the hth mutation cluster (that is, effectively the proportion of all somatic mutations specific to that cluster) and δπ is a point mass at π. To capture the stick-breaking formulation, we let κ h = V h ∏ l < h ( 1 − V l ) , with V h ∼ Beta ( 1 , α ) . The complicating factor here is that for subclones representing a lower fraction of tumor cells (lower values of πh ), we have a lower sensitivity for identifying mutations in the original genome sequencing. We therefore correct the κh for the sensitivity Sπ (probability of calling a mutation found in a fraction, θ, of tumor cells): ω h = κ h S π h ∑ i κ i S π i . We assume Sπ is known and calculable for all values of π∈ (0, 1). For the genomes described here, we use a three-parameter logistic curve where parameters are estimated from bootstrapped resamples by nonlinear least-squares (see below). For the prior distributions, we let P 0 ∼ U(0, 1) and α∼Γ(0.01,0.01). We set a practical upper limit on h of 30. We find little difference in the model output if we vary these priors, and in particular that on α, even to the extent of fixing it at 1 (data not shown). We use Gibbs sampling to estimate the posterior distribution of the parameters of interest, implemented in OpenBUGS v3.2.1. The code for this implementation is available with this paper (Supplemental Data). The Markov chain was run for 20,000 iterations, of which the first 13,000 were discarded. The chains converged rapidly to a stable limiting distribution, as assessed by using standard procedures. To generate the marginal distribution of subclonal mutations (such as those in Figure 1C and Figure 6B), both κh and πh were monitored. The median and 95% posterior intervals of the density were estimated from πh , each weighted by the associated value of κh , by using a Gaussian kernel (in R v2.13). We first tested the Bayesian Dirichlet model on simulated data. To ensure the simulations replicated real data, the simulations used parameter estimates and sequence coverage matched to the values generated for PD4109a. The sensitivity for detecting mutations at different levels of subclonality in this sample is shown in Figure S6A—the parameter estimates for the logistic curve were then used to generate the “observed” mutations in the simulation. For each mutation, the read depth used for simulating the 232 mutations for the Dirichlet process was then based on the same read depth as for the 232 mutations confirmed by 454 sequencing in PD4109a. In the first simulation (Figures S1D and S1E), we set the “true” underlying make-up of the cancer to have 20% of the somatic mutations in all tumor cells and 80% mutations derived from three subclones (at 20%, 30%, and 60% of tumor cells; brown bars). Then, allowing for variable sensitivity for detecting mutations at different fractions of tumor cells, we simulated a set of “observed” mutations from the cancer (blue bars). From the simulated set of observed mutations, we run the Dirichlet process model, which returns a distribution (green curves) very close to the “true” underlying distribution. In a second simulation, we assumed a completely different true underlying distribution: 40 subclones, each representing 2.5% of mutations and evenly spread through (0,1). Again, the Dirichlet process accurately modeled the “true” distribution (Figure S1D). Estimating Sensitivity of Sequencing for Finding Subclonal Mutations To model the sensitivity of the original genome sequencing for detecting mutations at different fractions of tumor cells, we undertook bootstrapping. We sampled 10,000 random wild-type positions in the genome for each level of subclonality, θh . At each of these random positions, we introduced a “subclonal mutation” in the following way. First, from the copy number of the locus, the normal cell contamination and the current level of subclonality θh , we calculated ζiθh , the fraction of reads expected to report the variant allele (by using the first formula in Extended Experimental Procedures). We choose the variant allele from (A, C, G, T) \ (Reference allele). Then, for each of the reads from the tumor genome across this base, we draw from the U(0,1) distribution. If the draw is greater than ζiθh , we leave the base call unchanged. If the draw is less than ζiθh , and the base call = Reference allele, we change the base call to the variant allele. If the draw is less than ζiθh and the base call is not the reference allele or the variant allele (i.e., there was a sequencing error), we leave it unchanged. If the draw is less than ζiθh and the base call was the variant allele, we change it to one of the other three (nonvariant allele) bases. The reads from the normal genome we leave unchanged. Thus, we generate a set of simulated mutations at each level of subclonality that exactly match the sequence coverage, sequencing error profiles, normal cell contamination and ploidy variation across the genome as seen in the original sequencing data. On the simulated sets of mutations, we run our substitution caller, Caveman, and count the number of mutations called at each level of subclonality. From these bootstrapping estimates of sensitivity, we fit a three parameter logistic curve by using nonlinear least-squares estimation: S θ = A 1 + exp ( ( χ − log θ ) / σ ) . Identifying Regions of Subclonal Copy Number Variation Major and minor allele frequencies are estimated as the proportion of reads corresponding to each allele and therefore have binomial distributions. The observed distributions of the minor and major allele frequencies are distinct when the underlying allele frequencies are very different and when the read depth is high. In such regions, accurate estimates of the B-Allele Frequency (BAF) may be obtained by assuming that the allele frequencies above and below 0.5 arise from the two different haplotypes. However, as the BAF approaches 0.5 (as occurs in the case of subclonal aberrations) the two distributions increasingly overlap, resulting in wider confidence intervals in the estimated BAF. In order to separate the two distributions, we phased the observed genotypes by using Impute2 (Howie et al., 2009). Impute2 uses a set of polymorphic sites for which a reference panel of known genotypes is available. In this work we used the interim release from phase 1 of the 1000 Genomes Project, released in June 2011, as a reference panel (1000 Genomes Project Consortium, 2010). In the first step, the observed genotype of the matched normal samples at each of the polymorphic sites was identified. Impute2 was then run across these genotypes for the whole genome, in 5Mb segments, yielding phased haplotypes. Using the phased haplotypes, minor and major allele frequencies may be assigned to the two haplotypes. Within haplotype blocks of length ∼300 kb the resulting haplotype frequencies lie in two separate bands for regions with subclonal or fully clonal copy number aberrations (CNAs). However, these blocks are separated by recombination hotspots which lead to “switching” of the blocks and hence the Battenberg pattern illustrated in Figure 2A (chromosome 7). For chromosomes or regions containing no CNA, the two distributions lie on top of each other, as illustrated in Figure 2A (chromosome 3). In regions with a separation of haplotype frequency bands, the “switch-points” between consecutive haplotype blocks are clearly visible and can be straightforwardly detected by segmentation methods. We use segmentation by the Piecewise Constant Fitting (PCF) algorithm (with a breakpoint penalty parameter of 3) (Baumbusch et al., 2008) to determine “switch-points” of haplotype blocks. This approach allows haplotype phasing across large chromosomal distances (up to whole chromosomes, see Figure 2A) in regions of copy number imbalance. Building on these long-range phased haplotype frequencies, we can discern both clonal and subclonal copy-number imbalances. The haplotype frequencies of clonal copy number changes must obey: h f = ½ h f = 1 − ρ + ρ n B 2 ( 1 − ρ ) + ρ ( n A + n B ) where ρ is the fraction of tumor cells within the samples, and nA and nB are the (integer) allele-specific copy numbers of that locus. We obtain initial estimates of ρ (and the tumor ploidy ψ) from the ASCAT package (Van Loo et al., 2010) applied to Affymetrix SNP 6.0 array data, and further fine-tune these two-digit estimates to three-four digit estimates by applying the above equation to a large clonal reference region with constant allele-specific copy number. To obtain allele-specific copy number estimates, we combine LogR data lR (derived from local NGS read depth) with the haplotype frequency equation above, and build upon the model for allele-specific copy number derivation described previously (Van Loo et al., 2010): n A = ρ − 1 + ( 1 − h f ) ψ 2 l R ρ n B = ½. n B = ρ − 1 + h f ψ 2 l R ρ . As our haplotype frequency data show very little systematic bias (Figure 2A), and a significant part of LogR data bias is removed by GC wave correction, it is reasonable to assume that the error margin on these copy number estimates is significantly less than ± 1 for most genomic segments (with perhaps the exception of some highly amplified regions). Hence, under a model of clonality of a genomic segment under study, the haplotype frequencies of the germline heterozygous SNPs within that segment should be distributed around one of the four following values: (1 − ρ + ρ⌊n B ⌋) / (2(1 − ρ) + ρ(⌊n A ⌋ + ⌊n B ⌋)) ( 1 − ρ + ρ ⌊ n B ⌋ ) / ( 2 ( 1 − ρ ) + ρ ( ⌊ n A ⌋ + ⌊ n B ⌋ ) ) , (1 − ρ + ρ⌊n B ⌋) / (2(1 − ρ) + ρ(⌊n A ⌋ + ⌊n B ⌋)) ( 1 − ρ + ρ ⌊ n B ⌋ ) / ( 2 ( 1 − ρ ) + ρ ( ⌊ n A ⌋ + ⌊ n B ⌋ ) ) , (1 − ρ + ρ⌈n B ⌉) / (2(1 − ρ) + ρ(⌊n A ⌋ + ⌈n B ⌉)) ( 1 − ρ + ρ ⌈ n B ⌉ ) / ( 2 ( 1 − ρ ) + ρ ( ⌊ n A ⌋ + ⌈ n B ⌉ ) ) or (1 − ρ + ρ⌈n B ⌉) / (2(1 − ρ) + ρ(⌈n A ⌉ + ⌈n B ⌉)) ( 1 − ρ + ρ ⌈ n B ⌉ ) / ( 2 ( 1 − ρ ) + ρ ( ⌈ n A ⌉ + ⌈ n B ⌉ ) ) . This framework provides the means to determine statistically if a copy number aberration is clonal or subclonal by a simple t test. We employ a two-sided t test with a = 0.05 and require a minimum deviation of 0.01 of segmented haplotype frequencies from their theoretical clonal states to call a segment subclonal. For subclonal copy number changes, we aim to call copy number states in each of the subclones, and to quantify the cell percentages of each of these subclones. We base our approach on the assumption that any genomic aberration occurred only once during tumor evolution. Under this assumption, the haplotype frequency and coverage depth/LogR data of this genomic segment are shaped by three populations of cells: a fraction 1-ρ of admixed normal cells, a fraction ρτ of tumor cells with the subclonal aberration and a fraction ρ(1-τ) of tumor cells without the subclonal aberration (with t ∈ [0,1]). Note that these populations of tumor cells may each consist of one or multiple subclones. To estimate τ, the fraction of tumor cells with the specific subclonal copy number change, we further assume that the subclonal copy number change resulted in a gain or a loss of exactly one copy of one allele. Given an error margin smaller than ± 1 on the allele-specific copy number estimates nA and nB , this assumption restrains the allele-specific copy number states in both subclones to one of four combinations: (i) ⌊n A ⌋ ⌊ n A ⌋  + ⌊n B ⌋ ⌊ n B ⌋ in one subclonal cell population and ⌈n A ⌉ ⌈ n A ⌉  + ⌊n B ⌋ ⌊ n B ⌋ in the other subclonal cell population, (ii)  ⌊n A ⌋ ⌊ n A ⌋  + ⌊n B ⌋ ⌊ n B ⌋ and ⌊n A ⌋ ⌊ n A ⌋  + ⌈n B ⌉ ⌈ n B ⌉ , (iii) ⌈n A ⌉ ⌈ n A ⌉  + ⌊n B ⌋ ⌊ n B ⌋ and ⌈n A ⌉ ⌈ n A ⌉  + ⌈n B ⌉ ⌈ n B ⌉ or (iv) ⌊n A ⌋ ⌊ n A ⌋  + ⌈n B ⌉ ⌈ n B ⌉ and ⌈n A ⌉ ⌈ n A ⌉  + ⌈n B ⌉ ⌈ n B ⌉ . For any specific haplotype frequency hf of a genomic segment (that as detailed above shows very little bias), only two of these combinations are possible (i.e., a line with constant hf drawn in allele-specific copy number space will intersect the square defined by vertices (⌊n A ⌋, ⌊n B ⌋) ( ⌊ n A ⌋ , ⌊ n B ⌋ ) , (⌈n A ⌉, ⌊n B ⌋) ( ⌈ n A ⌉ , ⌊ n B ⌋ ) , (⌊n A ⌋, ⌈n B ⌉) ( ⌊ n A ⌋ , ⌈ n B ⌉ ) and (⌈n A ⌉, ⌈n B ⌉) ( ⌈ n A ⌉ , ⌈ n B ⌉ ) exactly twice in the case of a subclonal aberration). It can be shown that one of these combinations is always a mix of states with total copy number ⌊n A ⌋ ⌊ n A ⌋  + ⌊n B ⌋ ⌊ n B ⌋ and ⌊n A ⌋ ⌊ n A ⌋  + ⌊n B ⌋ ⌊ n B ⌋  + 1, whereas the other is always a mix of states with total copy number ⌊n A ⌋ ⌊ n A ⌋  + ⌊n B ⌋ ⌊ n B ⌋  + 1 and ⌊n A ⌋ ⌊ n A ⌋  + ⌊n B ⌋ ⌊ n B ⌋  + 2 (i.e., the line with constant hf in allele-specific copy number space has a positive slope). Therefore, the total copy number nA + nB, or the value of lR (as n A + n B = (2ρ − 2 + ψ2 l R ) / ρ n A + n B = ( 2 ρ − 2 + ψ 2 l R ) / ρ ) can be used to select the final combination of states. Given that combination of allele-specific copy number states (nA,1 , nB,1 ) in a fraction of tumor cells τ and (nA,2 , nB,2 ) in a fraction of tumor cells 1-τ, the haplotype fraction hf of the segment is given by: h f = 1 − ρ + ρ τ n B , 1 + ρ ( 1 − τ ) n B , 2 2 − 2 ρ + ρ τ ( n A , 1 + n B , 1 ) + ρ ( 1 − τ ) ( n A , 2 + n B , 2 ) Hence, τ can be calculated as: τ = ½ τ = 1 − ρ + ρ n B , 2 − 2 h f ( 1 − ρ ) − h f ρ ( n A , 2 + n B , 2 ) h f ρ ( n A , 1 + n B , 1 ) − h f ρ ( n A , 2 + n B , 2 ) − ρ n B , 1 + ρ n B , 2 Finally, standard deviations and 95% confidence intervals can be calculated for this value by using a bootstrapping approach (for each segment, allele frequencies are resampled 1,000 times, with replacement).
                Bookmark

                Author and article information

                Contributors
                Journal
                Zool Res
                Zool Res
                ZR
                Zoological Research
                Science Press (16 Donghuangchenggen Beijie, Beijing 100717, China )
                2095-8137
                18 January 2022
                : 43
                : 1
                : 75-77
                Affiliations
                [1 ] School of Computer Science and Technology, Xidian University, Xi'an, Shaanxi 710071, China
                [2 ] iFlora Bioinformatics Center, Germplasm Bank of Wild Species, Kunming Institute of Botany, Chinese Academy of Sciences, Kunming, Yunan 650201, China
                [3 ] Yuxi Rongjian Information Technology Co., Ltd., Yuxi, Yunan 653100, China
                [4 ] Center for Molecular Medicine Cologne (CMMC), University of Cologne, Cologne 50931, Germany
                [5 ] Pediatric Research Institute, Ministry of Education Key Laboratory of Child Development and Disorders, National Clinical Research Center for Child Health and Disorders, China International Science and Technology Cooperation Base of Child Development and Critical Disorders, Chongqing Key Laboratory of Translational Medical Research in Cognitive Development and Learning and Memory Disorders, Children’s Hospital of Chongqing Medical University, Chongqing 400014, China
                Author notes
                Article
                zr-43-1-75
                10.24272/j.issn.2095-8137.2021.250
                8743247
                34845880
                1ed3a83a-70c3-405c-8d39-1d4a63a422e5
                Editorial Office of Zoological Research, Kunming Institute of Zoology, Chinese Academy of Sciences

                This is an open-access article distributed under the terms of the Creative Commons Attribution Non-Commercial License ( http://creativecommons.org/licenses/by-nc/4.0/), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

                History
                : 25 September 2021
                : 25 November 2021
                Funding
                This work was supported by the CAS Pioneer Hundred Talents Program and National Natural Science Foundation of China (32070683) to Y.P.C.; the Science and Technology Planning Project of XI'AN (GXYD6.2) and National Natural Science Foundation of China (61771369) to X.G.Y.
                Categories
                Letter to the Editor

                Comments

                Comment on this article