13
views
0
recommends
+1 Recommend
1 collections
    0
    shares
      • Record: found
      • Abstract: found
      • Article: found

      Spatial heterogeneity in medulloblastoma

      research-article
      1 , 2 , 1 , 2 , 1 , 2 , 3 , 4 , 5 , 1 , 2 , 6 , 1 , 2 , 7 , 1 , 2 , 1 , 2 , 7 , 1 , 2 , 1 , 2 , 8 , 9 , 10 , 11 , 11 , 12 , 12 , 12 , 2 , 6 , 7 , 3 , 4 , 5 , 2 , 6 , 1 , 2 , 1 , 2 , 14 , 1 , 2 , 7 , 15 , 16 , 16 , 1 , 2 , 7 , 1 , 2 , 7 , 1 , 2 , 7 , 17 , 18 , 19 , 20 , 21 , 11 , 11 , 11 , 11 , 11 , 11 , 11 , 11 , 11 , 11 , 11 , 11 , 11 , 11 , 11 , 11 , 11 , 2 , 22 , 2 , 22 , 2 , 6 , 23 , 24 , 25 , 26 , 15 , 27 , 16 , 16 , 27 , 28 , 29 , 30 , 31 , 2 , 10 , 2 , 6 , 2 , 6 , 2 , 7 , 10 , 19 , 32 , 33 , 11 , 11 , 11 , 21 , 11 , 34 , 35 , 1 , 2 , 36 , 6 , 21 , 11 , 34 , 1 , 2 , 7
      Nature genetics
      Spatial heterogeneity, medulloblastoma, glioblastoma multiforme, renal cell carcinoma, whole exome sequencing, gene expression profiling

      Read this article at

      ScienceOpenPublisherPMC
      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          Introductory paragraph Spatial heterogeneity of transcriptional and genetic markers between physically isolated biopsies of a single patient’s tumor poses major barriers to the identification of biomarkers, and the development of targeted therapies effective against the entire tumor. We analyzed spatial heterogeneity data from 35 patients with multi-regional biopsies using a combination of transcriptomic and genomic profiles. Medulloblastomas, but not malignant gliomas, demonstrate spatially homogeneous transcriptomes, allowing accurate subgrouping of tumors from a single biopsy. Conversely, somatic mutations that impact genes suitable for targeted therapeutics demonstrate high levels of spatial heterogeneity in medulloblastoma, malignant glioma, and renal cell carcinoma. Actionable targets found in a single medulloblastoma biopsy are seldom clonal across the entire tumor, questioning the efficacy of monotherapy against a single target. Clinical trials of targeted therapies for medulloblastoma should first assure the spatially ubiquitous nature of the target mutation. Letter Many cancer types show considerable intertumoral heterogeneity between patients 1–3 . Molecular biomarkers are intended to (i) tailor treatment intensities 4,5 , (ii) define oncogenic drivers for targeted therapies 5–7 , and (iii) identify diagnostic mutations (i.e. SMARCB1 in ATRT) 8 . Currently, clinical diagnoses are based on single biopsies due to the assumption of spatial homogeneity across tumors, however, spatial heterogeneity could lead to erroneous tumor classification, or selection of therapies against targets only present in a locally-restricted subset of the tumor. These implications were recently highlighted in late-stage renal cell carcinoma (RCC) 9,10 with highly divergent mutational profiles affecting MTOR and TP53, as well as demonstrating good and poor prognostic gene signatures in multi-region biopsies derived from the same tumor 10,11 . To determine the degree and clinical importance of spatial heterogeneity in medulloblastoma, we performed multi-regional biopsies and compared gene expression profiles, DNA copy number aberrations, and somatic mutations. Our cohort includes 9 primary medulloblastomas, 16 high-grade gliomas (HGG) (10 with gene expression only 12 ), and 10 kidney cancers 10 , each with 4–11 spatially distinct biopsies (median=6). An overview of the data types available for each patient is presented in Supplementary Table 1a and Supplementary Figure 1. Glioblastoma 13 and medulloblastoma 14 each comprise four distinct molecular subgroups that are discerned through analysis of transcriptional data. Unsupervised hierarchical clustering (HCL) of expression data demonstrates that medulloblastoma biopsies form tight clusters apart from single samples 15–20 (8/8; Fig. 1a, Supplementary Fig. 2a–b), whereas in HGGs (3/3) and RCCs (8/9), multi-region biopsies from single individuals cluster apart when combined with single samples (Supplementary Fig. 2c–f,). Overall, based on the standard deviation of expression, inter-tumoral differences were greater than intra-tumoral heterogeneity in each tumor type (Fig. 1b). Using Predictive Analysis of Microarrays (PAM), subtype prediction revealed that 21% (13/63) of glioblastoma multi-region samples diverged from the most commonly observed subtype for each patient, compared to only 2% (1/52) of medulloblastoma biopsies (p=0.003; Fig. 1c–e; Supplementary Figs. 3–6). Considering only biopsies with subgroup predictions of 100% confidence, all MB tumors had concordant subgroup calls between multiple biopsies (9/9) compared to only 55% of glioblastomas (6/11; p=0.038; Fig. 1e). We conclude that medulloblastoma can be robustly and reliably sub-grouped from only a single biopsy, but glioblastoma cannot. We identified somatic copy number aberrations (CNA) using a custom pipeline based on the TITAN algorithm 21 , which is robust to high levels of normal contamination (see Methods). Regions of CNA were identified in all three tumor entities (Fig. 2a, Supplementary Fig. 7–8, Supplementary Table 1b–c), and unsupervised HCL of clonal segments showed tight clustering of individual biopsies in the cohort across all tumor entities (Fig. 2b, Supplementary Fig. 9). CNA-derived measurements of spatial heterogeneity reveal variance between individuals within each tumor type (Fig. 2c). Somatic single nucleotide variants and indels recapitulate a similar pattern of spatial heterogeneity across tumors (Fig. 2d; Supplementary Table 1d). Overall, based on mutation and copy number data, none of the three tumor types were solely comprised of only homogeneous or heterogeneous tumors, rather, each have a repertoire of tumors that reside along a continuum of genetic heterogeneity. This genomic complexity results from a process of clonal evolution whereby successive acquisition of mutations and copy number events generates genetically related subpopulations of cells or lineages within each tumor. We integrated CNA and mutational data using the EXPANDS algorithm 22 , in order to infer the cellular lineage composition in each biopsy. EXPANDS detects multiple genetically distinct co-existing subpopulations of cells and allows phylogenetic reconstruction of their evolutionary relationships. A cartoon describing the spatial distribution of genetically distinct subpopulations throughout a tumor (Fig. 3a) illustrates the clonal intermixing detected in many samples of the cohort (Fig. 3b–d; Supplementary Fig. 10; Supplementary Table 1e–f). Many tumor biopsies have a major clone (genotype present in >70% of tumor cells) that is also detected in a minority of cells in other biopsies from the same tumor (i.e. are subclonal), or which is absent in other biopsies (e.g. biopsies 3,5,6 in RCC7 are genetically similar to some cells in biopsy 4 (4a), but not all cells, since 4b clusters separately; Fig. 3c). In some tumors, individual biopsies contain two or more cell lineages that independently accumulate distinct repertoires of mutations not found elsewhere in the tumor (e.g. HGG2 biopsies 1,5; Fig. 3c). The presence of multiple genetically distinct cellular lineages within single biopsies has previously been linked to poor prognosis and treatment response across a variety of cancer types 23 . This surprising but common pattern of major genetic clones in one biopsy that are subclonal or absent in spatially distinct locations in the tumor prompted us to investigate observable mutation clonality across biopsies, since clonality is a key requirement of clinically-actionable therapeutic targets 24 . We classified mutations into clonal and subclonal populations (Supplementary Fig. 11; Supplementary Table 1g), and determined whether the subset of damaging clonal mutations changes status between spatially-separated tumor biopsies. In nearly all tumors we find a predominance of clonal mutations that are subclonal or completely absent in additional biopsies (Fig. 4a; Supplementary Fig. 12; validation set of 7 mutations with 96% validation rate across biopsies; Supplementary Fig. 13 and Supplementary Table 1h). This observation remains true when considering only driver events 25,26,27,28 (Fig 4b; Supplementary Table 1i–k). We predict that monotherapies against a single target identified in a single biopsy are unlikely to show dramatic clinical effects as the targets are not ubiquitous, leaving untargeted clones in unsampled portions of the tumor free to survive and repopulate the tumor. With the goal of improved patient treatment, the clinically relevant question is whether the observed level of genomic spatial heterogeneity affects actionable or driver alterations. As a proof of concept, we focused on a set of genes with known roles in cancer initiation/progression 29 , or which have defined drug interactions 30 . These genes are therefore enriched in relevant or actionable targets in a manner unbiased towards either of the cancer types included (Supplementary Table 1l–m). Considering the spectrum of SNVs, indels, and CNAs affecting these genes (Supplementary Fig. 14–15), we found a remarkable variety of patterns across tumors, including: cases with only a small set of shared alterations across biopsies but with many events present in single biopsies (e.g. HGG4 MET amplification); homogeneous tumors with many shared actionable events (e.g. HGG3); cases without ubiquitous actionable targets, which may require multi-agent targeted therapeutics (e.g. MB6); tumors lacking vulnerability to any of the considered actionable targets in a subset of biopsies (e.g. MB7); and tumors with alterations that may predict resistance (e.g. RCC7 TP53 compound loss and somatic mutation). Considering the full set of identified actionable mutations per tumor across all biopsies, we calculate that in each entity, an average of at least 5 biopsies are required to have an 80% chance of identifying at least 80% of these alterations. Lowering these measures to 50% would require sampling of at least 2 biopsies, with highly heterogeneous tumors needing as many as 4 (Fig. 5a). This is likely an underestimate, as the detection of actionable mutations does not plateau in most patients (Supplementary Fig. 16). Upfront profiling of numerous tumor regions in order to reveal the full repertoire of actionable targets is neither practical nor likely, given the amount of sequencing required, thus we focused on maximizing the information derived from a minimal set of biopsies. Specifically, we wanted to determine how well we could predict the frequency of individual mutations across a tumor, given increasing number of biopsies, noting that prediction accuracy for mutations identified in a single fraction would be high only in very homogeneous tumors. We empirically determined the frequency of each alteration when considering all possible pairs of an increasing number of biopsies, and compared this observed quantity to the known frequency of the alteration in all biopsies; the difference between these values being the inference error of mutation frequency resulting from insufficient biopsies in genetically heterogeneous tumors (Supplementary Fig. 17). Taking a 10% error rate as an acceptable threshold, we calculate for each tumor the number of observed mutation frequencies that fall within this range (i.e. accuracy). As expected, accuracy improves with increasing number of biopsies, and also reveals that brain tumors fall into two patterns. The first comprises more homogeneous tumors, which have fairly high prediction accuracy even with a low number of biopsies, while the second comprises more heterogenous tumors that require multiple biopsies to ensure accurate determination of mutation frequency (Fig. 5b). In our cohort of medulloblastomas and glioblastomas, considering just two biopsies per tumor enables distinction of these high vs low genetic heterogeneity tumors, with high specificity especially for those tumors that are highly heterogeneous (Fig. 5c; Supplementary Fig. 18). While spatial heterogeneity is clearly a barrier to highly effective therapeutics against the entire primary tumor, the extent of heterogeneity between a primary and recurrent medulloblastoma 31 is many fold greater (Fig. 6a, Supplementary Fig. 19). This vast discordance at relapse is therefore unlikely to be secondary solely to inadequate spatial sampling of the therapeutically-naïve primary tumor. In gliomas 32 the recurrent disease more closely resembles the primary and only in rare cases diverges to the extent seen in medulloblastoma, possibly due to less complete resection success in this more diffuse and infiltrating tumor. Medulloblastoma is known to recur from very rare populations of cells 31 , thus, therapeutic approaches that can eradicate such cellular lineages despite their low prevalence in the primary tumor are drastically needed. Immunotherapy relies on the destruction of cancer cells based on the presence of tumor-specific cell-surface antigens as opposed to cell autonomous somatic mutations. We examined expression of the antigens/genes for which chimeric antigen receptor T-cells or monoclonal antibodies already exist 33–43 , and demonstrate a remarkable consistency of expression across multi-regional biopsies, which contrasts highly to the inhomogeneity of somatic mutations across fractions in the same set of tumors. This was the case in all medulloblastomas examined, including those with high levels of genetic heterogeneity and in which targeted therapy would be problematic (Fig. 6b; Supplementary Fig. 20) 33–43 . Homogeneity of the transcriptome versus heterogeneity of somatic mutations in our MB cohort suggests that targeted immunotherapeutic approaches could overcome the hurdle of spatial genetic heterogeneity. The vast majority of brain tumor patients have their tumor classified from a single tumor biopsy, which may be adequate for medulloblastoma, but not for glioblastoma patients. The extent of spatial heterogeneity of somatic mutations observed in our cohort suggests that clinical trials of molecularly targeted therapy should first assess the ubiquitous distribution of the target. The lack of clonal actionable driver mutations ubiquitously present across all regions of a given brain tumor suggest that monotherapies targeting a single gene from a single biopsy are unlikely to have dramatic effects to improve the lives of brain tumor patients. Materials and Methods Patients and samples Multi-region tumor biopsies and clinical data were gathered for 35 tumors (9 primary medulloblastomas, 16 high-grade gliomas (HGG) (10 with gene expression only 12 ), and 10 kidney cancers 10 ); peripheral blood samples were included as germline (GL) controls for all cases with exome sequencing. All multi-region biopsies for unpublished cases were obtained in situ during tumor resection, by mimicking the previous sample preparation conditions of published cases to the best of our knowledge. Medulloblastoma tumors are similar in size to glioblastomas, with an average diameter of 8–12cm; biopsies were taken as far apart as possible by the surgeon. Renal cancers, due to their localization in the abdomen may be larger in size. Detailed information on multi-region tumor samples is provided in Supplementary Table 1a and Supplementary Figure 1. All patient material and clinical information were obtained after receiving informed consent and was approved by the institutional review boards of the contributing institutions. DNA and RNA extractions were performed as previously described 16 . RNA quality was assessed on a 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA). Only high-quality RNA (RNA integrity number ≥7) was included for further studies. Gene expression profiling Expression profiling was conducted on eight MB and three HGG multi-region biopsies totaling 72 biopsies with a median number of 6 multi-region biopsies per primary (range 4–9). Affymetrix HU133 Plus 2.0 microarrays were used for HGG samples and Affymetrix Gene 1.1 ST array (Affymetrix, Santa Clara, CA) was used for MB samples to ensure that these multi-region biopsies could be compared to published data sets 15–17,20 . Microarrays were processed according to the manufacturer’s guidelines. Raw data was normalized using a transcript-level robust multi-array average (RMA) algorithm 44 , and subsequently clustered using unsupervised HCL (Pearson’s Dissimilarity – average linkage) in Partek Genomics Suite. The molecular classification of the multi-region biopsy samples was performed using a class prediction algorithm, Prediction Analysis for Microarrays (PAM) 45 , as implemented in the pamr package (v 1.51). Markers for GBM subtypes were obtained from the Verhaak classifier 13 . We note that classification was performed for the GBM samples only, thus excluding HGG1. Subgroup-specific markers for MB were identified based on one-way analysis of variance (ANOVA) with multiple hypothesis correction by the Bonferroni method in previously published datasets with known subgroup affiliation 46 . Based on the misclassification error values in core GBM 13 and MB 15–17 training datasets (Supplementary Fig. 6), threshold values of 1.75 and 1 were chosen for multi-region samples from published 12 and unpublished GBM and MB patients, respectively. The published GBM dataset 12 was quantile normalized using Partek Genomics Suite. Predicted subtypes or subgroups with confidence probabilities higher than established thresholds 46 were considered bona fide subgroup assignments. Samples with less than 500 ng of remaining RNA were analyzed using NanoString as previously described 46 . MB3 was exclusively analyzed using NanoString since only limited amounts of RNA were available for all multi-region biopsies. NanoString counts were normalized to the three housekeeping genes (GAPDH, ACTB and LDHA). Dot plots and PCA based on normalized NanoString calls were prepared using the R-statistical environment (v2.15.1). Pearson correlation was used to determine the correlation of marker gene expression for each biopsy per patient (intra-tumor comparison) and between each biopsy and all others samples from different patients of the same subgroups (inter-tumor comparison). Wilcoxon rank sum test was used to infer intra- and inter-tumor marker gene expression differences in a subgroup-specific fashion. A previously published dataset of nine multi-region RCC samples 9 profiled using the Affymetrix Human Gene 1.0 ST array was included in the analysis as well as two RCC datasets 18,19 with 53 and 29 single RCC samples, respectively. The RCC expression datasets have been processed together in R (v3.1.1) with the oligo package (rma normalization) and the combat package has been used for batch effect correction. Unsupervised HCL (Pearson’s Dissimilarity – average linkage) has been performed using the Partek Genomics Suite. The Gene Expression Omnibus accession numbers for the previously-unplublished gene expression data are GSE62802 (HGG samples) and GSE62803 (MB samples). Whole-Exome Sequencing DNA libraries (MB1-5) from multi-region samples were exome captured using Agilent SureSelect V5+UTR probes, followed by 8 cycles of PCR and then paired-end 75 base reads were sequenced over 2 lanes on an Illumina HiSeq 2000 instrument per pool of 6 libraries. Reads were aligned to the human reference genome hg19a using the Burrows-Wheeler Aligner (BWA) (version 0.5.7) 47 . Two lanes were merged with duplicates marked using Picard Tools (version 1.71). Additional samples (MB6/7, HGG1-5) were subjected to paired-end library construction using Illumina’s Nextera Rapid Capture Exome (RCE) kit. Captured exome DNA sequences were then sequenced on Illumina HiSeq 2000 (rapid-run mode) for 100-bp paired-end reads. We used FASTX toolkit to remove adaptor sequences and to trim low quality reads. Quality trimmed reads were then aligned to the human reference genome (hg19) using BWA (version 0.5.9) 47 . Genome Analysis Toolkit (GATK) 48 was used for indel realignment. Duplicate reads were then marked using Picard to be able to exclude them further in our analysis. The Toronto datasets are available under accession numbers EGAD00001000723 and EGAS00001001014. Somatic SNV detection and filtering SNVs were called exome-wide using samtools mpileup (v0.1.7), and indels were called using VarScan. Stringent filtering was performed requiring no reads in the germline sample supporting a SNV to ensure conservative selection of somatic events. Variants with sufficient coverage (≥10) were further annotated using Annovar 49 (table_annovar.pl; RefSeq gene annotations, amino acid change annotation, SIFT, PolyPhen, LRT, and MutationTaster scores, PhyloP and GERP++ conservation scores, dbSNP identifiers, 1000 Genomes Project allele frequencies, NHLBI-ESP 6500 exome project allele frequencies). Mutation Validation A subset of somatic mutations were validated using PCR amplification from all tumor biopsies, matched germline, and a healthy control sample. Regions of interest were amplified from genomic DNA using primers flanking each SNV (Supplementary Table 1h,n) using Q5 High-Fidelity DNA polymerase (NEB). PCR specificity was determined by agarose gel electrophoresis followed gel extraction of specific bands using Gel Extraction/PCR Clean-up kit (Qiagen) following manufacturer instructions. Purified amplicons were sequenced using Sanger sequencing, and traces reviewed manually for the expected presence or absence of the mutated base. Droplet digital PCR For the validation and quantification of the frequency of the PIK3CA SNV detected in MB3, we used droplet digital PCR (ddPCR), since Sanger traces were of poor quality in the region of interest. Genomic DNA from 6 spatially distinct biopsies from MB3 as well as matched germline and a healthy donor control were used in the assay. The PIK3CA mutation (chr 3:178936091 G>A) was validated by using the PrimePCR ddPCR mutation assay kit, PIK3CA p.E545K, human (Biorad, dHsaCP2000075 (mutant, FAM) and dHsaCP2000076 (Wt, VIC)), accordingly to manufacturer instructions. Fluorescence measurement using QX100 ddPCR droplet reader (Biorad) was used to detect the presence of mutant and wildtype alleles. QuantaSoft Analysis software (Biorad) was used in the quantification. Copy number analysis TITAN 21 estimates the cellular prevalence of tumor cell populations (lineages) based on a user-defined number of clonal clusters, and user-defined ploidy estimation. Thus, 20 runs of TITAN were performed for each exome, with cluster numbers 1 to 10 (representing one clonal lineage through to 10 co-existing clonal lineages with distinct genotypes), and ploidy set to either 2 or 4. Copy number segments from the 20 parameter combinations were analyzed and merged into larger segments if they were on the same chromosome arm, were <10Mb apart, and had the same state (loss or gain). Merged results from each of the 20 parameter combinations for each biopsy were compared in order to select the optimal parameter combination as the highest scoring considering the following criteria: maximize the largest contig size maximize the median contig size minimize the number of contigs minimize the number of clonal clusters The parameter combination with the largest x value was selected as optimal, where: x = ( ( L ∗ M ) ∗ ( M 2 / 10 9 ) ∗ ( 1 / T ) ∗ ( 1 / ( C + 1 ) 2 ) ) / M / 10 9 L = largest contig size (Gb) M = median contig size (Gb) T = total number of contigs C = clonal clusters We next assessed the prevalence of copy number segments (loss or gain) identified in the best parameter combination of a unique biopsy (i.e. target segments), using either all segments or clonal segments only (logratio ≥ |0.2|). A target segment was considered as found in another biopsy from the same tumor if any of the 20 parameter combinations contained a segment with the same state (loss or gain), and whose span had a minimum reciprocal overlap of at least 70% with the target segment. Concordance of driver regions of loss and gain in the RCC tumor cohort was performed for our calls and the published data 9 . With our computational approach, we achieved a 97% concordance when comparing to the manual curation performed previously 9 , indicating that this method is specific and sensitive despite a high level of normal cell contamination in these tumors. Conversely, compared to our results for the subset of copy number gains and losses identified in Gerlinger et al., 2014, the manual curation shows an 89% concordance to the TITAN pipeline, indicating that our approach is more sensitive, and that the homogeneity of certain copy number driver events may be greater than previously estimated (Supplementary Table 1c). Finally, our approach is applicable genome-wide and across tumor types in a highly parallel fashion. SNV Classification using mclust Variant allele frequencies (VAF) of somatic SNVs were classified into distinct clusters using the R package mclust 50 , which uses finite mixture estimation via iterative Expectation Maximization steps (EM) and the Bayesian Information Criterion (BIC). Each cluster is manually categorized as “homozygous”, “clonal”, or “subclonal”, depending on the cluster VAF and the uncertainty separating it from the next cluster, and taking into account the biopsy tumor cell content value reported by TITAN. Multiple subclonal populations are numbered sequentially, starting with the most highly prevalent population. Clonal and subclonal mutation details per biopsy are summarized in Supplementary Table 1d,g. Phylogenetic reconstruction from combined SNV and CNV data We combined copy number and LOH information derived from TITAN (including the clonal and subclonal events identified in the best parameter combination run for each biopsy), as well as somatic mutations and SNPs in areas of LOH, to infer tumor phylogenies using EXPANDS 22 . EXPANDS v1.7.2 was run with the runExPANdS function. All parameters were set to default with the exception of maxScore, which was lowered to 1.5 in order to reduce the false positive rate of subpopulation detection. Only subpopulations with a minimum size (cellular frequency) of 0.1 were considered. Mutations that could not be assigned to a high confidence subpopulation were discarded, so that no ambiguous assignments were made. In addition, ambiguous subpopulations (i.e. groups of mutations and copy number events) were dropped from the analysis. Mutations are assigned to all nested subpopulations (i.e. if a mutation is found in a subpopulation of cells at a high frequency of 0.8, it will also be assigned to “daughter” subpopulations, for instance of frequency 0.5), to report the assignment of every mutation to all detected subpopulations in all biopsies of the tumor (assuming that the mutation could be assigned unambiguously as mentioned above; Supplementary Table 1f). Phylogenetic relationships between the subpopulations inferred by the EXPANDS algorithm in all biopsies per patient were generated using both SNV and copy number segments. The Manhattan distance metric was used to calculate pairwise distances between all pairs of biopsies based on this data, and a complete linkage hierarchical clustering was performed to generate phylogenies. Germline-rooted trees were generated using the as.phylo R function from the ape package. Error inference of actionable genetic alterations In order to perform an analysis of genetic heterogeneity affecting actionable and putative driver genes in a way that is unbiased to either of the tumor types, we opted to use general lists of known cancer drivers and drugable genes. Sets of genes known to be drivers in GBM, MB, and RCC tumors come from studies of different cohort sizes, with sometimes unknown subgroup affiliation, and thus are not equally comprehensive. To overcome this, we used a list of genes of interest that includes putative driver genes found in the Cancer Gene Census database 29 (n=572) and actionable genes from the Drug-Gene Interaction Database 30 (n=426 genes) (Supplementary Table 1l–m). Oncoprint plots (R package ComplexHeatmap v1.6.0) were built for the combination list of these genes for all tumors, using (a) clonal mutations and indels and (b) clonal mutations and indels plus high-level copy number aberrations (>4 copies gained; homozygous loss). Manual review of results revealed that absence of clonal somatic mutations in subsets of biopsies is not explained by concordant copy number loss. Since not all biopsies had copy number data, further analyses were performed using results from strategy (a) in order to maximize the number of usable biopsies per tumor. Driver event lists The MB CNA driver events in Supplementary Table 1i,j and Figure 4b are mainly taken from Shih et al. 27 plus a subset of the mostly highly recurrent genes listed in Northcott et al. 25 . The HGG chromosome arm and recurrent driver genes events were retrieved from Sturm et al. 26 (Table 1 and 2 of Reference 26). RCC chromosome arm and gene-level driver events were retrieved from the ccRCC TCGA paper 28 (Sup Fig 22 threshold FDR q-value< 10−15 and Table S4, q-value threshold 0.05). Cancer cell fraction values presented in Supplementary Figure 10b for driver mutations were calculated as previously described 51 : CCF = VAF ∗ ( 1 / purity ) ( CN ∗ purity + 2 ( 1 -purity ) ) where CCF=cancer cell fraction, VAF=variant allele frequency, CN=copy number at the mutation, purity=tumor purity as calculated by EXPANDS. Accuracy of mutation frequency detection We calculate the inferred error of the prevalence of each mutation across biopsies using a subsampling approach. In each tumor, given a subset of biopsies from 1 to n (where n = total biopsies per tumor), the frequency of each identified mutation in the biopsies sampled is calculated as fo. This value is subtracted from the “ground truth” expected frequency for that mutation across all n biopsies (fe). When the observed and expected values are identical, then the inferred error (fe – fo) is 0. In the majority of tumors, there is a predominance of genes with mutations in single biopsies, leading to negative values of error for many genes as the frequency of the mutation is often overestimated (Supplemental Fig. 17). In contrast, genes that are present in all but one or two biopsies will often have an error value over 0, as their frequency can be underestimated. The likelihood of being within +/−0.1 of 0 (i.e. close to perfect accuracy given the data from all biopsies) is calculated as the proportion of genes at each sampling of 1:n biopsies where the error rate was within those bounds. For instance, we sample all possible combinations of a certain number of biopsies from the total number of biopsies, and in each case calculate the inferred error of each detected mutation’s prevalence. The proportion of the total set of error values < |0.1| is the likelihood of a correct interpretation of mutation frequency given that number of biopsies (Fig. 5b). Genetic heterogeneity estimation from two biopsies To address the practical issue of estimating genetic heterogeneity from a minimum number of informative biopsies, we implemented a simple metric of the proportion of mutated genes in a set of 2 biopsies that was ubiquitous (i.e. present in 2 of 2 biopsies). The mean value of all pairs of biopsies from a total of n biopsies per tumor showed a strong divergence in HGG and MB tumors, with high- vs low-variability tumors well separated (Fig. 5c). These were the same tumors that scored as high vs low variability based on the accuracy metric described above. We also observed clear separation of these two classes using the R package mclust (Supplementary Fig. 18a), which models univariate mixtures of Gaussian distributions (i.e. corresponding to a mixture of high and low genetic variance brain tumors) via Expectation-Maximization and the Bayesian Information Criterion 50 . Choosing two thresholds from the mclust density peaks (low=0.55, high=0.75), we calculated the accuracy of classifying high-variance vs low-variance tumors based on a single pair of biopsies, and observed that high-variance tumors in particular have a high true positive and low false positive classification rate (Supplementary Figure 18b). I.e. based on this metric, the vast majority of pairs of biopsies from tumors with high genetic heterogeneity have a low percentage of gene mutations found in both biopsies, such that they are always classified as heterogeneous tumors, and almost never as homogeneous tumors. Expression analysis of immunotherapeutic targets in MB tumors Microarray expression data from the Affymetrix Gene 1.1 ST array (Affymetrix, Santa Clara, CA) for the MB samples were analyzed in the R environment (v3.1.1). We used the affy package (v1.44.0) and the custom CDF hugene11sthsensgcdf (v19.0.0) to summarize the expression of 21641 Ensembl (ENSG) genes and process the data. Expression data was normalized using the rma method. Spatial genetic variance vs post-treatment clonal evolution To directly measure the relative contribution of spatial heterogeneity versus clonal evolution induced by treatment, we utilized our previously published cohort of matched pre- and post-therapeutic MB samples 31 . This comparison shows that in MB, the amount of divergence observed between primary and relapse compartments far exceeds spatial genetic variance in the primary tumor. To assess whether the observed divergence between primary and recurrent MB is greater than the observed divergence between intra-tumoral biopsies, we re-analyzed the 14 primary-relapse tumor WGS samples with matched germline using the same pipeline as presented above. Briefly, mutations were called using samtools mpileup, filtered stringently against the germline, and shortlisted to those mutations that have at least 10 reads coverage in both the primary and recurrent samples, and are in areas of normal copy number and LOH. Since the samples in this work are exomes, we restricted the analysis of the primary-relapse samples to the same exonic regions. After removing the major analysis pipeline differences, we addressed differences in depth of coverage. The exome libraries were sequenced to an average of 60X, while the WGS samples were sequenced to 30X coverage. Thus, our ability to assess similarity between regions in the exome libraries is more sensitive to subclonal events present at low levels (and therefore preferentially detectable by exome seq and not by WGS). We addressed this bias by restricting the analysis to clonal events in the exomes, as clonal mutations are detectable both in exomes and genomes. To verify that this was a reasonable assumption, we compared the VAF of mutations found in the exomes to those found in matched WGS data generated from the same samples, but sequenced at 30X coverage. Matched WGS samples are available for biopsy 1 in each MB tumor with multi-regional profiling. In all cases, we found that >75% of mutations with a VAF < 0.18 in the MB exomes were not found in the matched genomes sequenced from the same samples, indicating that subclonal events are typically not well profiled at the shallower depths of a genomic library. Therefore we restricted our analysis to clonal events in both the exomes and genomes. Focusing on the clonal and homozygous events detectable in both exome and genome data, we hypothesized that any differences between primary and relapse samples that are greater than the differences expected from different biopsies in a primary tumor would be largely attributable to clonal evolution as a consequence of therapy. To see if the data support this conclusion, we used the mutations in each biopsy to measure pairwise concordance between all biopsies of individual tumors. Concordance is measured as the number of mutations in common between two biopsies, as a fraction of the total number of mutations present in both. In parallel, we used the mutations in the primary and relapse samples to measure pairwise concordance values between disease compartments. As a positive control, we compared the inter-biopsy and inter-compartmental concordance values of an adult GBM sample with multiple biopsies profiled before and after therapy (Patient 17 from reference 32 ). In MB samples we found a mean pairwise concordance of 0.3903 between biopsies of the same tumor, nearly an order of magnitude higher than the mean concordance (0.03852) observed between disease compartments (Wilcoxon Rank Sum test p-value < 2.2e–16). One sample stood out as an outlier (MB-REC-04), and we note that in this case the tumor was a Group4 local recurrence. This unusual pattern of recurrence for a Group4 tumor may indicate that the primary mass was sub-totally resected rather than grossly resected, explaining the higher similarity of the recurrent compartment to the primary. In the case of the adult GBM patient (Patient17) with multi-regionally sampled primary (3 regions; low-grade glioma) and recurrent disease (4 regions; high-grade glioma), we found the same trend: the primary-relapse mean concordance of 0.01506 was an order of magnitude smaller than the mean intra-biopsy concordance of 0.5036 (Wilcoxon Rank Sum test p-value = 0.0001406). There was no significant difference between the primary-relapse MB concordance and the primary-relapse GBM concordance observed in Patient17 (Wilcoxon Rank Sum test p-value = 0.5458). Similarly, there was no significant difference between the regional biopsies in GBM vs MB (Wilcoxon Rank Sum test p-value = 0.09926). Finally, the primary-relapse divergence calculated using reprocessed data from Patient17 was on par with that initially presented in the glioma paper 41 , thus we included, for visual comparison, all the primary-relapse values for the glioma cohort in Figure 6a (middle panel; Johnson et al, 2014; values directly derived from Supplementary Table 4 32 ). Statistical analysis All statistical analyses were performed in the R statistical environment. Comparisons of categorical variables between entity types were performed using the two-sided Fisher’s exact test. Comparisons of distributions were performed using the Welch two-sample t-test (parametric) or the Wilcoxon Rank Sum test (non-parametric). P-values <0.05 were considered statistically significant. Supplementary Material 1 2 3

          Related collections

          Most cited references51

          • Record: found
          • Abstract: found
          • Article: found
          Is Open Access

          Fast and accurate short read alignment with Burrows–Wheeler transform

          Motivation: The enormous amount of short reads generated by the new DNA sequencing technologies call for the development of fast and accurate read alignment programs. A first generation of hash table-based methods has been developed, including MAQ, which is accurate, feature rich and fast enough to align short reads from a single individual. However, MAQ does not support gapped alignment for single-end reads, which makes it unsuitable for alignment of longer reads where indels may occur frequently. The speed of MAQ is also a concern when the alignment is scaled up to the resequencing of hundreds of individuals. Results: We implemented Burrows-Wheeler Alignment tool (BWA), a new read alignment package that is based on backward search with Burrows–Wheeler Transform (BWT), to efficiently align short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps. BWA supports both base space reads, e.g. from Illumina sequencing machines, and color space reads from AB SOLiD machines. Evaluations on both simulated and real data suggest that BWA is ∼10–20× faster than MAQ, while achieving similar accuracy. In addition, BWA outputs alignment in the new standard SAM (Sequence Alignment/Map) format. Variant calling and other downstream analyses after the alignment can be achieved with the open source SAMtools software package. Availability: http://maq.sourceforge.net Contact: rd@sanger.ac.uk
            Bookmark
            • Record: found
            • Abstract: found
            • Article: not found

            The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data.

            Next-generation DNA sequencing (NGS) projects, such as the 1000 Genomes Project, are already revolutionizing our understanding of genetic variation among individuals. However, the massive data sets generated by NGS--the 1000 Genome pilot alone includes nearly five terabases--make writing feature-rich, efficient, and robust analysis tools difficult for even computationally sophisticated individuals. Indeed, many professionals are limited in the scope and the ease with which they can answer scientific questions by the complexity of accessing and manipulating the data produced by these machines. Here, we discuss our Genome Analysis Toolkit (GATK), a structured programming framework designed to ease the development of efficient and robust analysis tools for next-generation DNA sequencers using the functional programming philosophy of MapReduce. The GATK provides a small but rich set of data access patterns that encompass the majority of analysis tool needs. Separating specific analysis calculations from common data management infrastructure enables us to optimize the GATK framework for correctness, stability, and CPU and memory efficiency and to enable distributed and shared memory parallelization. We highlight the capabilities of the GATK by describing the implementation and application of robust, scale-tolerant tools like coverage calculators and single nucleotide polymorphism (SNP) calling. We conclude that the GATK programming framework enables developers and analysts to quickly and easily write efficient and robust NGS tools, many of which have already been incorporated into large-scale sequencing projects like the 1000 Genomes Project and The Cancer Genome Atlas.
              Bookmark
              • Record: found
              • Abstract: found
              • Article: found
              Is Open Access

              ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data

              High-throughput sequencing platforms are generating massive amounts of genetic variation data for diverse genomes, but it remains a challenge to pinpoint a small subset of functionally important variants. To fill these unmet needs, we developed the ANNOVAR tool to annotate single nucleotide variants (SNVs) and insertions/deletions, such as examining their functional consequence on genes, inferring cytogenetic bands, reporting functional importance scores, finding variants in conserved regions, or identifying variants reported in the 1000 Genomes Project and dbSNP. ANNOVAR can utilize annotation databases from the UCSC Genome Browser or any annotation data set conforming to Generic Feature Format version 3 (GFF3). We also illustrate a ‘variants reduction’ protocol on 4.7 million SNVs and indels from a human genome, including two causal mutations for Miller syndrome, a rare recessive disease. Through a stepwise procedure, we excluded variants that are unlikely to be causal, and identified 20 candidate genes including the causal gene. Using a desktop computer, ANNOVAR requires ∼4 min to perform gene-based annotation and ∼15 min to perform variants reduction on 4.7 million variants, making it practical to handle hundreds of human genomes in a day. ANNOVAR is freely available at http://www.openbioinformatics.org/annovar/ .
                Bookmark

                Author and article information

                Journal
                9216904
                2419
                Nat Genet
                Nat. Genet.
                Nature genetics
                1061-4036
                1546-1718
                20 March 2017
                10 April 2017
                May 2017
                10 October 2017
                : 49
                : 5
                : 780-788
                Affiliations
                [1 ]Developmental & Stem Cell Biology Program, The Hospital for Sick Children, Toronto, Ontario, Canada
                [2 ]The Arthur and Sonia Labatt Brain Tumour Research Centre, The Hospital for Sick Children, Toronto, Ontario, Canada
                [3 ]Department of Pediatric Oncology, Hematology, and Clinical Immunology, Medical Faculty, University Hospital Düsseldorf, Düsseldorf, Germany
                [4 ]Department of Neuropathology, Medical Faculty, University Hospital Düsseldorf, Düsseldorf, Germany
                [5 ]Department of Pediatric Neuro-Oncogenomics, German Cancer Consortium (DKTK) and German Cancer Research Center (DKFZ), Düsseldorf, Germany
                [6 ]Division of Haematology / Oncology, Department of Pediatrics, The Hospital for Sick Children, Toronto, Ontario, Canada
                [7 ]Department of Laboratory Medicine and Pathobiology, University of Toronto, Toronto, Ontario, Canada
                [8 ]Cancer Research Program, McGill University Health Centre Research Institute, Montreal, Quebec, Canada
                [9 ]MacFeeters-Hamilton Brain Tumour Centre, Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada
                [10 ]Division of Neurosurgery, The Hospital for Sick Children, Toronto, Ontario, Canada
                [11 ]Canada’s Michael Smith Genome Sciences Centre, BC Cancer Agency, Vancouver, British Columbia, Canada
                [12 ]Departments of Pediatrics and Human Genetics, McGill University, Montreal, Quebec, Canada
                [13 ]Program in Genetics and Genome Biology, The Hospital for Sick Children, Toronto, Ontario, Canada
                [14 ]Cancer Research UK London Research Institute, London, United Kingdom
                [15 ]Division of Molecular Genetics, German Cancer Research Center (DKFZ), Heidelberg, Germany
                [16 ]Division of Pediatric Neurooncology, German Cancer Research Center (DKFZ), Heidelberg, Germany
                [17 ]Informatics and Biocomputing, Ontario Institute for Cancer Research, Toronto, Ontario, Canada
                [18 ]Department of Medical Biophysics, University of Toronto, Toronto, Ontario, Canada
                [19 ]The Donnelly Centre, University of Toronto, Toronto, Ontario, Canada
                [20 ]Department of Pathology, Montreal Children’s Hospital, McGill University Health Centre, Montreal, QC, Canada
                [21 ]Division of Experimental Medicine, McGill University, Montreal, Quebec, Canada
                [22 ]Division of Pathology, The Hospital for Sick Children, Toronto, Ontario, Canada
                [23 ]Division of Pediatric Oncology, Children’s National Medical Center, Washington DC, United States
                [24 ]Division of Pediatric Neuro-Surgery, Children’s National Medical Center, Washington DC, United States
                [25 ]Department of Neurology, Children’s National Medical Center, Washington DC, United States
                [26 ]Clinical Cooperation Unit Neuropathology, German Cancer Research Center (DKFZ), Heidelberg, Germany
                [27 ]German Cancer Consortium (DKTK), Germany
                [28 ]Department of Pediatric Oncology, Hematology, Immunology and Pulmonology, University Hospital Heidelberg, Heidelberg, Germany
                [29 ]Institute of Neuropathology, University Medical Center, Hamburg-Eppendorf, Germany
                [30 ]Research Institute Children’s Cancer Center, Hamburg, Germany
                [31 ]Pediatric Hematology and Oncology, University Medical Center, Hamburg-Eppendorf, Germany
                [32 ]Translational Cancer Therapeutics Laboratory, The Francis Crick Institute, London, United Kingdom
                [33 ]Cancer Research UK Lung Cancer Centre of Excellence, University College, London, United Kingdom
                [34 ]Department of Medical Genetics, University of British Columbia, Vancouver, British Columbia, Canada
                [35 ]Department of Molecular Biology & Biochemistry, Simon Fraser University, Burnaby, British Columbia, Canada
                [36 ]Division of Neurosurgery, St Michael’s Hospital, University of Toronto, Toronto, Ontario, Canada
                Author notes
                [# ]Corresponding authors: Marco A. Marra, OBC, PhD, FRSC, FCAHS, Genome Sciences Centre. 675 West 10th Ave, Vancouver, BC, Canada, Phone: 604-675-8162; Fax: 604-675-8178, mmarra@ 123456bcgsc.ca . Michael D. Taylor, MD, PhD, FRCSC, 555 University Avenue, Toronto, Ontario, Canada, Phone: 416-813-6427 ; Fax: 416-813-4975, mdtaylor@ 123456sickkids.ca
                [*]

                These authors contributed equally.

                Article
                NIHMS861288
                10.1038/ng.3838
                5553617
                28394352
                15171298-ff77-41bc-b0f9-375998fd05e3

                Users may view, print, copy, and download text and data-mine the content in such documents, for the purposes of academic research, subject always to the full Conditions of use: http://www.nature.com/authors/editorial_policies/license.html#terms

                History
                Categories
                Article

                Genetics
                spatial heterogeneity,medulloblastoma,glioblastoma multiforme,renal cell carcinoma,whole exome sequencing,gene expression profiling

                Comments

                Comment on this article