      Structural variant calling: the long and the short of it



          Recent research into structural variants (SVs) has established their importance to medicine and molecular biology, elucidating their role in various diseases, regulation of gene expression, ethnic diversity, and large-scale chromosome evolution—giving rise to the differences within populations and among species. Nevertheless, characterizing SVs and determining the optimal approach for a given experimental design remains a computational and scientific challenge. Multiple approaches have emerged to target various SV classes, zygosities, and size ranges. Here, we review these approaches with respect to their ability to infer SVs across the full spectrum of large, complex variations and present computational methods for each approach.

          Most cited references (63)

          Diet and the evolution of human amylase gene copy number variation.

          Starch consumption is a prominent characteristic of agricultural societies and hunter-gatherers in arid environments. In contrast, rainforest and circum-arctic hunter-gatherers and some pastoralists consume much less starch. This behavioral variation raises the possibility that different selective pressures have acted on amylase, the enzyme responsible for starch hydrolysis. We found that copy number of the salivary amylase gene (AMY1) is correlated positively with salivary amylase protein level and that individuals from populations with high-starch diets have, on average, more AMY1 copies than those with traditionally low-starch diets. Comparisons with other loci in a subset of these populations suggest that the extent of AMY1 copy number differentiation is highly unusual. This example of positive selection on a copy number-variable gene is, to our knowledge, one of the first discovered in the human genome. Higher AMY1 copy numbers and protein levels probably improve the digestion of starchy foods and may buffer against the fitness-reducing effects of intestinal disease.

            The landscape of kinase fusions in cancer

            Kinases activated by gene fusions represent an important class of oncogenes associated with both hematopoietic malignancies and solid tumours. They are produced by translocations or other chromosomal rearrangements, and their protein products often represent ideal targets for the development of cancer drugs. For example, imatinib induces remission in leukaemia patients who are positive for BCR–ABL1 fusions. More recently, crizotinib and ceritinib have produced significant clinical benefit in patients with lung carcinomas and mesenchymal tumours harbouring anaplastic lymphoma kinase (ALK) fusions1 2. Advances in massively parallel sequencing technologies have enabled the genomic characterization of large panels of tumours through the study of their DNA. While such studies have helped to identify numerous point mutations and small insertion/deletions in genes driving tumorigenesis, our understanding of the landscape of gene fusions in solid tumours is incomplete. There are now several thousand cancer transcriptomes, assessed through RNA-seq, publicly available. Here we describe a computational pipeline for the identification of gene fusions that we applied to the entire RNA-seq data set from The Cancer Genome Atlas (TCGA). We identify several novel and recurrent fusions involving kinases that very likely play a role in cancer. These discoveries not only go beyond augmenting our understanding of the genomic landscape of cancers, but they also have immediate implications for cancer diagnosis and therapy.

Results

Hallmarks of kinase fusions in solid tumours

We first assembled a computational pipeline to collect evidence for all possible gene fusions and then focused our analyses on the fusions involving a kinase. 
Given the very large number of samples, we prioritized the sensitivity of the fusion detection pipeline (Methods) while retaining the ability to exclude false-positive calls by filtering out fusions present in normal samples as well as highly recurrent fusions that were detected by the algorithm at improbable frequencies. We then focused our detailed analysis exclusively on recurrent (n≥2 across all cancer types), putatively functional kinase fusions (Supplementary Fig. 1). We comprehensively surveyed gene fusions across 20 solid tumour types (Supplementary Table 1, Supplementary Data 1) and identified several broad contours within the landscape of kinase fusions (Fig. 1). First, as has been observed for point mutations, the proportion of samples harbouring kinase fusions was markedly different between cancer types, reflecting differences in the aetiology of these tumours. For instance, sarcoma samples showed the highest frequency of kinase fusions (0.57 fusions per sample), consistent with the current understanding that a large fraction of sarcomas harbour specific translocations, but only 12% of those were recurrent kinase fusions (Supplementary Fig. 1). Other tumour types, including thyroid cancer and glioblastoma, showed a lower frequency of kinase fusions on average (0.19 and 0.23 fusions per sample, respectively) but a relatively high proportion of these were recurrent fusions (67% and 36%, respectively), suggesting a more prominent role of kinase fusions in these cancers. Conversely, some cancer types, for example, clear cell and chromophobe renal cell carcinoma, showed a very low frequency of kinase fusions with no instances of recurrence (Supplementary Fig. 1). Overall, we detected recurrent kinase fusions in 3.0% of the samples, and all cancers except clear cell and chromophobe renal cell carcinoma harboured recurrent kinase fusions (0–12.9% of samples per cancer type, median=2.1%). 
Second, our pipeline was able to recapitulate most known translocation events in cancer (ALK, BRAF, EGFR, FGFR1, 2 and 3, NTRK1, 2 and 3, PDGFRA, PRKCA, RAF1, RET, ROS1). Interestingly, we identified new tumour types harbouring such fusions and discovered several novel fusion partners for these kinases. We also detected several low-frequency, pan-cancer kinase fusion events, for example in the neurotrophic tyrosine receptor kinases NTRK1, NTRK2 and NTRK3, that drive tumorigenesis in a small fraction of multiple cancers, regardless of tissue type (Fig. 1). Third, we identified several novel and recurrent kinase fusions that very likely play a role in cancer, such as those involving the MET proto-oncogene and PIK3CA (phosphatidylinositol-4,5-bisphosphate 3-kinase, catalytic subunit alpha). These bona fide oncogenes have not been shown previously to be activated by fusion events. Our analysis also uncovered novel, recurrent fusions in kinases with no known tumorigenic genomic alterations (that is, feline Gardner–Rasheed sarcoma viral oncogene homologue, FGR and protein kinase N1, PKN1), potentially resulting in active and oncogenic fusion proteins. Finally, we discovered a recurrent fusion in sarcoma encoding the non-catalytic portion of TRIO kinase, which resulted in the upregulation of the transcription of telomerase reverse transcriptase (TERT) in those tumours.

Kinases known to be involved in gene fusions

ALK fusions, including EML4–ALK, TFG–ALK and STRN–ALK, have been identified in multiple cancer types, including lung adenocarcinoma3, colorectal4, breast5, renal cell6, renal medullary7 and thyroid cancers8. Consistent with previous studies, we detected EML4–ALK fusions in ~1% (5/513) of lung adenocarcinoma samples, multiple ALK fusions, including a single STRN–ALK fusion, in thyroid cancer (3/498) and one in papillary renal carcinoma. 
We also found several novel ALK fusion events, including a TPM1–ALK fusion in bladder cancer, a SMEK2–ALK fusion in rectal adenocarcinoma and a GTF2IRD1–ALK fusion in thyroid cancer (Fig. 1, Supplementary Fig. 2), adding weight to the emerging notion that known driver events in certain tumours can also play a role in other tumour types, regardless of histology. We also identified multiple c-ros oncogene 1 (ROS1) fusions, including ROS1 fusions in 8/513 lung adenocarcinomas, all of which have been previously described9 10 11 12. In addition, we detected a CEP85L–ROS1 fusion in a glioblastoma tumour sample (Fig. 1, Supplementary Fig. 3). A similar fusion was recently reported in a single angiosarcoma sample13. RET proto-oncogene fusions have been identified previously in both lung adenocarcinoma4 14 and thyroid cancer15. Consistent with these studies, we observed recurrent CCDC6–RET fusions in thyroid cancer15 but also identified several RET fusions with novel partners, including AKAP13, FKBP15, SPECC1L and TBL1XR1 (Fig. 1, Supplementary Fig. 4). Three of these 5′ fusion partners contain dimerization-competent coiled-coil motifs within the coding region, while the fourth, TBL1XR1, contains a LisH (Lis-homology) motif, which is similarly capable of dimerization16 and therefore likely leading to RET activation. In addition, we detected previously identified RET fusions in new tumour indications, including a single CCDC6–RET fusion in colon adenocarcinoma and a single ERC1–RET fusion in breast cancer. In total, we observed RET fusions in four of the 20 cancer types analysed, providing a therapeutic rationale for the use of RET inhibitors in multiple patient subpopulations. BRAF (v-raf murine sarcoma viral oncogene homologue B) fusions have also been described previously in multiple cancer types, including prostate cancer17, melanoma18, radiation-induced thyroid cancer19 and pediatric low-grade gliomas20. 
Consistent with these studies, we identified a broad range of cancer types harbouring BRAF fusions, including prostate, melanoma and thyroid. We also detected a single TRIM24–BRAF fusion in rectal adenocarcinoma. Interestingly, the BRAF fusions in melanoma are exclusive of other known oncogenic events such as BRAF and NRAS mutations. Many of the specific fusions we report here, including AGK–BRAF 19, SND1–BRAF 21, MACF1–BRAF 20, TAX1BP1–BRAF and CDC27–BRAF 18, have been previously identified. A number of BRAF fusions are, however, novel (Fig. 1, Supplementary Fig. 5), including ATG7–BRAF in melanoma, as well as ZC3HAV1–BRAF and FAM114A2–BRAF in thyroid cancer. These three fusions encode 5′ protein partners that contribute coiled-coil (CC) or zinc-finger dimerization motifs, which likely produce constitutively activated BRAF dimers capable of driving tumorigenesis and poorly sensitive to RAF inhibitors, but sensitive to inhibition downstream, through MEK (mitogen-activated protein kinase kinase 1 and 2) inhibition for instance18. Other novel BRAF fusions identified do not encode protein partners with obvious dimerization motifs. However, these fusions all remove at least the first eight exons of BRAF, which has previously been shown to promote BRAF dimerization, independent of activated RAS (rat sarcoma viral oncogene homologues) or other mechanisms of BRAF dimerization22. These fusions similarly seem capable of promoting tumorigenesis. Consistent with previous studies17 23, we also found recurrent RAF1 (also known as CRAF) fusions in various tumour types (Fig. 1, Supplementary Fig. 6). In addition to known tumour occurrences (four fusions in melanoma, two fusions in prostate adenocarcinoma), we identified AGGF1–RAF1 fusions in seven papillary thyroid carcinoma samples (1.4%). 
This is remarkable not only because RAF1 fusions have never been identified in this cancer type before, but also because all seven examples involved a novel partner gene, AGGF1 (angiogenic factor with G patch and FHA domains 1). AGGF1 contains an N-terminal coiled–coil dimerization motif likely to activate RAF1 in a fashion similar to BRAF fusions, by forming constitutively activated, RAF inhibitor resistant, RAF1 dimers. AGGF1–RAF1 fusions appear not to be limited to thyroid cancers, as we also found a single AGGF1–RAF1 fusion in prostate cancer. We observed a broad distribution of the fibroblast growth factor receptors FGFR1, FGFR2 and FGFR3 fusions—in particular FGFR3–TACC3 fusions—across eight of the 20 tumour types analysed (Fig. 1). This is consistent with recent studies, which identified recurrent FGFR family fusions in multiple cancer types24 25. We also detected a single FGFR3–TACC3 fusion in a novel indication, papillary renal carcinoma, and a novel FGFR3–ELAVL3 fusion in low-grade glioma (Supplementary Fig. 7). Similar to RET and NTRK1–3 (see below), fusions involving FGFR1–3 provide a therapeutic opportunity for current and future FGFR inhibitors in multiple patient subpopulations. Recurrent fusions involving members of the NTRK family have been identified previously in congenital fibrosarcoma26, human secretory breast carcinoma27 and papillary thyroid cancer28, which represent clinical indications for which currently available non-kinase-targeted treatment options are usually adequate. However, recurrent NTRK1 and NTRK2 fusions have also been recently identified in diseases which represent significant unmet medical needs, including glioblastoma29, cholangiocarcinoma30 and pediatric high-grade glioma24. 
Consistent with previous studies, we observed recurrent NTRK1 and NTRK3 fusions in papillary thyroid cancer and glioblastoma, but also identified a number of novel NTRK2 fusions in head and neck squamous cell carcinoma (PAN3–NTRK2), low-grade glioma (AFAP1–NTRK2) and lung adenocarcinoma (TRIM24–NTRK2) (Fig. 1, Supplementary Fig. 8). In addition, we observed a known NTRK1 fusion (TPM3–NTRK1) in sarcoma, previously described only in thyroid cancer31. Across all tumour types analysed, NTRK1–3 fusions were observed at low frequency in 9 of the 20 cancer types analysed, providing a therapeutic opportunity for the use of pan-NTRK inhibitors in multiple patient populations. Protein kinase C fusions have recently been described in papillary glioneuronal tumours32 and benign fibrous histiocytoma33. We found two new occurrences of PRKCA (protein kinase C, alpha) fusions in lung squamous cell carcinoma and three PRKCB (protein kinase C, beta) fusions in lung squamous cell carcinoma, lung adenocarcinoma and low-grade glioma (Fig. 1, Supplementary Fig. 9a,c). In one instance, PRKCA was fused with IGF2BP3 (insulin-like growth factor 2 messenger RNA (mRNA) binding protein 3), an mRNA binding protein present in the nucleus and the cytoplasm. The functional domains of IGF2BP3, such as a nucleotide binding/RNA recognition domain, are intact in the fusion; however their contribution to PRKCA activation is unclear. The second fusion, TANC2–PRKCA, encodes only the first two exons of TANC2 (tetratricopeptide repeat, ankyrin repeat and CC containing 2), which contain no annotated structural domain or motif. In both cases, however, N-terminal truncation of PRKCA removes the autoinhibitory pseudosubstrate segment, possibly leading to a constitutively activated kinase in the absence of a functional fusion partner. 
In addition, we noticed a tendency towards overexpression of PRKCA in these two fusion-harbouring samples (PRKCA mRNA expression z-scores: 6.8 and 2.6 in the samples harbouring the TANC2–PRKCA and IGF2BP3–PRKCA fusions, respectively; Supplementary Fig. 9b). These data suggest that PRKCA fusions are potential oncogenic events in lung squamous cell carcinoma, leading to overexpression as well as constitutive activation of PRKCA. In the same fashion, PRKCB fusions truncate the N-terminal part of the protein containing the autoinhibitory domain and are predicted to activate this kinase (Supplementary Fig. 9c). Taken together, these data suggest an emerging critical role for protein kinase C alpha and beta in the tumorigenesis of non-small cell lung cancer.

Novel fusions involving known oncogenes

The MET proto-oncogene is implicated in a variety of cancers, particularly in papillary renal cell carcinoma where a number of somatic mutations have been described34. Anecdotally, a transforming TPR–MET fusion was previously generated in vitro via carcinogen-induced chromosomal rearrangement fusing the dimerization domain of TPR to the kinase domain of the MET receptor tyrosine kinase35. Here, we report for the first time in primary tumour samples recurrent translocation events involving MET. Two in-frame MET fusions in papillary renal carcinoma, BAIAP2L1–MET and C8orf34–MET, were detected with predicted protein products containing amino-terminal dimerization domains fused to the intracellular domain of MET. BAIAP2L1 contains a CC region while C8orf34 contains a regulatory subunit of cAMP-dependent protein kinase, both of which act as dimerization motifs (Fig. 2). Notably, BAIAP2L1 was recently described as a 3′ fusion partner for FGFR3 in bladder cancer36, incorporating the same CC region as the 5′ fusion described here. We also identified single MET fusions in four other cancers: low-grade glioma, hepatocellular carcinoma, lung adenocarcinoma and thyroid carcinoma. 
In at least two out of these four cases (KIF5B–MET in lung adenocarcinoma and TFG–MET in thyroid papillary carcinoma), the predicted chimeric protein follows the classic activation paradigm, fusing dimerization motifs to an intact kinase domain. These results are remarkable because MET is a known oncogene that has not previously been implicated in translocation events. This mechanism could account for a significant fraction of total MET oncogenic activation events and therefore represents druggable intervention opportunities for patients with these tumours. Mutations and, to a lesser extent, increased copy numbers in another prevalent oncogene, PIK3CA, have been characterized in diverse cancers. While activating missense mutations in PIK3CA have been described as frequently as 50% in endometrial cancers, 30% in breast invasive carcinomas and 20% in colorectal as well as head and neck cancers37, this gene has not been implicated in activating fusion events. We found two TBL1XR1–PIK3CA fusions in 1,072 breast cancer samples, and a single occurrence of the same gene fusion in prostate adenocarcinoma (1/335). In addition, one FNDC3B–PIK3CA fusion was found in uterine corpus endometrial carcinoma (1/166) (Fig. 1, Supplementary Figs 10 and 11). The nucleotide sequence of the fusion transcripts suggested that the complete wild-type sequence of PIK3CA was expressed in all four cases, with the partner gene contributing only the 5′UTR (untranslated region), and thereby driving overexpression of PIK3CA (Fig. 3a). Indeed, in all samples where we detected PIK3CA translocations, and where PIK3CA was not amplified, PIK3CA mRNA expression levels were the highest within the respective tumour types (Fig. 3b–d). 
Interestingly, TBL1XR1 is thought to regulate the expression of nuclear hormone receptor co-repressors38, and both tissue types in which TBL1XR1–PIK3CA fusions were found (invasive breast carcinoma and prostate cancer), are hormone driven and ranked among the highest for TBL1XR1 mRNA expression across all normal tissues (Supplementary Fig. 12). These results strongly suggest that PIK3CA overexpression is driven by its fusion partner, and that PIK3CA promoter fusions are an additional oncogenic mechanism to be considered for expanding the use of targeted therapies such as PI3K, AKT or mTOR inhibitors.

Novel recurrent fusions

In addition to fusions involving known oncogenes, we found several novel and recurrent fusions involving kinases that have not been previously directly linked to cancer (Fig. 1, Supplementary Data 2). One of these kinases was FGR, a member of the Src family of protein tyrosine kinases. Here, we show for the first time that genetic events can lead to FGR overexpression in primary tumour samples. We found three WASF2–FGR fusions (in lung squamous carcinoma, ovarian serous cystadenocarcinoma and skin cutaneous melanoma), harbouring the exact same breakpoints in all cases (Supplementary Figs 13 and 14). The WASF2 and FGR genes are located very proximally on the short arm of chromosome 1, and the fusion presumably results from a tandem repeat that puts their coding regions in close proximity (Supplementary Fig. 15a,b). Somewhat similarly to TBL1XR1–PIK3CA, the promoter and 5′UTR of WASF2 are fused with the 5′UTR of FGR, leading to mis- or overexpression of the entire wild-type sequence of the protein. In all three cases, FGR mRNA expression in the samples harbouring a fusion was among the highest compared with all other tumours of that tissue type (Supplementary Fig. 15c–e). Even though FGR has not been genetically linked to cancer to date, it has been hypothesized that its expression could compensate for SRC inhibition39. 
Collectively, these data highlight a previously undocumented mechanism of genetic deregulation of a Src family member. PKN1 has been implicated in androgen-associated prostate carcinomas40 and in Wnt/β-catenin signalling in melanomas41. We detected fusions of PKN1 in samples of squamous cell carcinoma of the lung and hepatocellular carcinoma (Fig. 1, Supplementary Fig. 16a). mRNA expression levels of both fusions are high within the respective tumour types (Supplementary Fig. 16b,c). Interestingly, the protein sequences contributed by the non-kinase fusion partners were very limited in both cases (three and five amino acids in ANXA4–PKN1 and TECR–PKN1, respectively) and resulted in a truncated PKN1 protein product missing the PKN1 N terminus. This is notable because this protein region contains regulatory domains that suppress kinase activity in the absence of binding to Rho-GTP42. PKN1 also has a Caspase-3 cleavage site near the breakpoint of both fusions that normally results in activation of the kinase on cleavage43 (Supplementary Fig. 16a). Therefore, both of these fusions are potentially activating in the absence of any functional or structural contribution from the non-kinase fusion partners. It is attractive to speculate that both fusion events cause increased PKN1 expression and constitutive activation of the kinase, leading to enhanced cell proliferation.

Discussion

We describe here a pan-cancer analysis of the transcriptomes of nearly 7,000 tumours from TCGA that is specifically focused on kinase gene fusion events. Overall, 3.0% of tumour samples contained a likely oncogenic, recurrent kinase fusion (2.1% excluding thyroid cancer). The observed striking differences in the frequencies of kinase fusions across solid tumours are consistent with previous data on the relative contributions of diverse types of genetic aberrations to tumourigenesis. 
Certain tumour types, such as ovarian serous carcinoma, harbour a large number of somatic copy number alterations, but exhibit a relatively simple mutational profile44. Other tumour types such as melanoma carry predominantly somatic point mutations45. Consistent with these observations, our data suggest that certain cancers are heavily driven by kinase rearrangements. Notably, thyroid cancers have the highest frequency of recurrent kinase fusions (63/498, 13%), and all fusion events including ALK, BRAF, MET, NTRK1, NTRK2, RAF1 and RET are mutually exclusive in this cancer type (Fig. 4). These data provide a strong genetic rationale that these alterations are driver events. In stark contrast, clear cell and chromophobe renal cell carcinoma have the lowest frequencies of kinase fusions, none of which were recurrent in our analysis of this data set (Supplementary Fig. 1). Our study primarily aimed to identify recurrent, potentially oncogenic fusions involving kinases. In addition to rediscovering previously known recurrent kinase fusions, we identified new fusion partners for many genes (for example, TPM1–ALK in bladder cancer, TBL1XR1–RET in thyroid cancer, and so on). Our study also revealed new cancer types harbouring known fusions (for example, BRAF fusion in rectal adenocarcinoma, FGFR3 fusion in prostate adenocarcinoma, RET fusions in colon adenocarcinoma and invasive breast carcinoma, EGFR–SEPT14 in low-grade glioma, and so on). These discoveries not only go beyond simply augmenting our understanding of the genomic landscape of cancers, but they also have immediate implications for cancer diagnosis and therapy. First, our new findings justify a rapid reassessment of current protocols for targeted genomic profiling of patients, which are insufficient to detect these aberrancies, to cover therapeutically actionable fusion events across cancers. 
In addition, our findings will hopefully motivate both industrial and academic investigators interested in drug discovery to engage in the development of cancer drugs against targets that have not been considered previously because they might have represented an insufficient fraction of targetable events. Along these lines, our pan-cancer analysis revealed that although certain kinase fusions only occurred once within a tumour type, they are clearly recurrent when multiple tissue types are considered, supporting the emergence of innovative clinical trial designs such as the ‘basket’ trials. For example, we found six cancer types with a single fusion in a gene of the NTRK family, but altogether, we detected a total of 23 NTRK1, NTRK2 and NTRK3 fusions across nine tumour types. These data strongly suggest that gene fusions are one of the most prevalent mechanisms of oncogenic activation of this receptor tyrosine kinase family. NTRK fusions therefore represent a low frequency, pan-cancer event that nevertheless may account for a significant fraction of patients who could benefit from a pan-NTRK inhibitor. Notably, this analysis uncovered several new aberrations that may drive tumourigenesis through constitutive activation of a kinase due to a fusion event. In particular, we found recurrent fusions of MET and PIK3CA that are both bona fide oncogenes commonly activated by gene amplification or point mutations. The activation of MET by fusion with a partner gene is most likely due to constitutive dimerization of the receptor. In contrast, TBL1XR1–PIK3CA fusions likely drive increased PIK3CA mRNA expression by juxtaposing the promoter region of the partner gene to the 5′ end of the intact PIK3CA coding sequence. Another example of such ‘promoter fusions’ is the recurrent WASF2–FGR fusion, which we found in three cancer types. 
We also observed fusions in some Ser/Thr kinases (for example, PRKCA, PKN1), where deletion of their regulatory N-terminal domain putatively leads to constitutive activation by de-repression of the kinase activity. Additional studies, expanding on our analyses, are necessary to uncover the mechanistic details of these newly described fusions and to further validate biologically the hypotheses we have put forward here. Given the number of samples that we analysed (nearly 7,000 samples across 20 tumour types) and the observed frequencies of kinase fusion events across solid tumours, relatively few kinases appear to be recurrently fused in-frame with another gene while conserving an intact kinase domain (Supplementary Fig. 1). Therefore, a pure frequency threshold (that is, recurrence) allowed us to identify several new kinase fusion driver candidates. By excluding all kinases that were involved in only one fusion event across the entire data set, we disregarded several singletons that passed all other filtering criteria and showed the characteristics of functional fusions. With increasing numbers of tumours and additional cancer types being sequenced in the future, it is probable that some of these kinase fusions will appear more frequently and prove to be important drivers. One such example is PRKACA (protein kinase, cAMP-dependent, catalytic, alpha), which was recently shown to be fused in 100% (15/15) of fibrolamellar hepatocellular carcinoma46. Only one sample of fibrolamellar hepatocellular carcinoma was included in the TCGA data set that we analysed, and our study revealed that it indeed harboured the characteristic DNAJB1–PRKACA fusion. However, the majority of singleton gene fusions in this set are predicted to be passenger events, occurring as a consequence of chromothripsis or genomic instability. In addition, focal events such as gene amplification also contribute passenger events. 
This is likely the case for ERBB2, PAK1, PDGFRA and RPS6KB1 (Supplementary Data 2 and 3)47. Finally, we predict that eventually some important, biologically meaningful fusions will be discovered that involve the non-enzymatic portion of kinases as partner genes. Although not the focus of this study, we describe here such an example: a recurrent fusion in two dedifferentiated liposarcoma samples (2/38; 5%) encoding the non-kinase portion of TRIO, which results in upregulation of its fusion partner TERT (Fig. 5a,b). Telomerase activity is a hallmark of many cancers, and two other genetic mechanisms of TERT reactivation have been described recently; both somatic mutations in the promoter of the TERT gene48 and DNA copy number gains of TERT 49 were shown to activate its transcription. We observed that the two liposarcoma samples harbouring TRIO–TERT fusions display a TERT mRNA expression level ~100-fold higher than measured in samples without such a fusion. These findings raise the possibility that TERT fusions might represent an alternative mechanism for telomerase reactivation in cancers.

Methods

Overview

Briefly, fusions between any two genes were identified based on the number of chimeric reads (sequencing paired ends mapping to different genes) and split reads (spanning a fusion breakpoint), concordance between the strands of the reads and the genes involved in the putative fusion, and a number of filtering criteria to flag false positive and non-functional fusions. In addition, recurrent kinase fusions observed in a panel of 600 normal samples from TCGA and 1,800 normal samples from the Genotype–Tissue Expression (GTEx) project were also excluded from further analysis. Finally, all recurrent kinase fusions (n≥2) were manually reviewed to identify putative oncogenic drivers with distinctive characteristics of functional kinase fusions. 
In particular, the following features were required: presence of an intergenic junction between two exons, a predicted in-frame coding sequence and conservation of the complete kinase catalytic domain. Conversely, we excluded false positives from further analysis according to two main criteria: the presence of a homologous or repetitive sequence shared by the two fusion partners causing an alignment artifact, or the very high expression of one or both fusion partners.

Origin of data

To comprehensively identify the landscape of kinase fusions in solid tumours, we analysed RNA-seq data from 20 solid tumour types in TCGA (6,893 samples, Supplementary Table 1 and Supplementary Data 1), including provisional TCGA data from sarcoma tumours (103 samples). All available unaligned RNA-seq data files (fastq files) were obtained from the Cancer Genomics Hub (CGHub) and loaded into our processing pipeline. In addition, the clinical data from all available tumour types were pulled from the TCGA FTP server (https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumour).

Fusion detection algorithm and filtering

Using the STAR v2.3.1q aligner50, RNA-seq data from each tumour sample was aligned to version hg19 of the human genome, while also providing transcriptome and splice junction annotations from the Gencode project v17 (ref. 51). A different genome index was generated for each of the different read lengths encountered in the RNA-seq data. Runtime options passed to STAR to generate genome indexes included: STAR --runMode genomeGenerate, --genomeDir hg19_Gencode17.overhang , --genomeFastaFiles , --sjdbGTFfile gencode.annotation.gtf and --sjdbOverhang . 
STAR was then used to produce alignments and was run with specific options including: STAR --readFilesIn , --readFilesCommand zcat, --genomeDir , --outSAMstrandField intronMotif, --outFilterIntronMotifs RemoveNoncanonicalUnannotated, --outReadsUnmapped None, --chimSegmentMin 15, --chimJunctionOverhangMin 15, --alignMatesGapMax 200000 and --alignIntronMax 200000. Given the programme arguments described above, the output of the STAR aligner consisted of two separate files containing sequencing reads: aligned reads consistent with a normal reference (_Aligned.sam) and aligned reads indicative of a putative rearrangement (_Chimeric.sam).

A fusion detection routine was then used to identify protein fusion candidates: using the Python library HTSeq-0.5.4p1 (ref. 52) and transcriptome annotations from the Gencode project v17, the name-sorted ‘chimeric’ alignments in the output of the STAR aligner were examined to count the number of chimeric pairs (where each sequencing end aligns to a different gene) and reads split between two genes. Specific filters were then applied to improve specificity: the strands of the alignments were compared with the strands of the genes to keep only those consistent with a proper 5′→3′ fusion; putative fusions between homologous genes were discarded; putative fusions between genes and overlapping homologous genes were discarded as well. This procedure then returned the complete list of possible gene fusions in a given sample, modulo the alignment artifacts, along with the number of chimeric reads and split reads supporting each fusion.

Next, fusions were filtered based on the number of chimeric reads and split reads supporting them: five chimeric reads or more were required when two or more split reads were present; 10 chimeric reads or more were required when only 1 split read was present; 20 chimeric reads or more were required when no split reads were detected.
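The read-support thresholds above can be written as a small predicate. The following is a minimal sketch, not the authors' code: the function name `passes_read_support` and its arguments are hypothetical, and it assumes the chimeric-read and split-read counts for a candidate fusion have already been tallied.

```python
def passes_read_support(chimeric_reads: int, split_reads: int) -> bool:
    """Apply the tiered read-support rule described in the text.

    Hypothetical helper: fewer split reads demand more chimeric reads.
    """
    if split_reads >= 2:
        required_chimeric = 5    # two or more split reads present
    elif split_reads == 1:
        required_chimeric = 10   # only one split read present
    else:
        required_chimeric = 20   # no split reads detected
    return chimeric_reads >= required_chimeric


# Example: keep only candidates with enough supporting reads.
# The (chimeric, split) counts below are made up for illustration.
candidates = {"GENE_A--GENE_B": (12, 1), "GENE_C--GENE_D": (12, 0)}
kept = [name for name, (chim, split) in candidates.items()
        if passes_read_support(chim, split)]
```

The tiers trade one kind of evidence against the other: a candidate with no read spanning the breakpoint is retained only when many read pairs bridge the two genes.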
Finally, the output of the fusion detection step above was filtered to further improve specificity. This step relied on the analysis of a large number of samples to filter out highly recurrent fusions that were detected at improbable frequencies within a cancer type: for instance, fusions detected in >95% of samples, or fusions where both gene partners are themselves involved in >1 fusions in >25% of samples, were flagged as putative false positives. In addition, recurrent kinase fusions observed using the same procedure in five samples or more in a panel of 647 normal samples from TCGA (downloaded from CGHub on 2014-03-10) and 1,750 normal samples from the GTEx project (downloaded from dbGaP project phs000424.v4.p1 on 2014-01-17, excluding all transformed and cancer cell lines) were also excluded. This allowed us to exclude a large number of library construction and alignment artifacts.

All recurrent kinase fusion candidates (n≥2 across cancer types) identified by this procedure were then manually reviewed in the Integrative Genomics Viewer (ref. 53) to identify putative drivers with distinctive patterns of functional kinase fusions, and reject passenger and false-positive fusions. In particular, the following features were required for putative functional fusions: (1) presence of an intergenic junction (between two exons or between an exon and a cryptic exon); (2) a predicted in-frame coding sequence; (3) conservation of the full kinase catalytic domain. Manual review was facilitated by the fact that passenger fusions could mainly be linked to the following: (1) absence of an in-frame fusion protein coding sequence; (2) a kinase domain absent or truncated from the predicted protein sequence; or (3) a kinase found to be fused only once in all samples (non-recurrent fusion).
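The review criteria listed above were applied manually, but they can be summarised as a checklist. The sketch below is an illustration only: the `FusionCandidate` record, its field names and both helper functions are hypothetical and are not part of the published pipeline.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FusionCandidate:
    # All fields are hypothetical annotations assumed to be filled in upstream.
    exon_junction: bool          # junction between two exons, or an exon and a cryptic exon
    in_frame: bool               # predicted coding sequence is in-frame
    kinase_domain_intact: bool   # full kinase catalytic domain conserved
    n_samples: int               # number of samples carrying the fusion


def passenger_reasons(c: FusionCandidate) -> List[str]:
    """Return the passenger-like features of a candidate, if any."""
    reasons = []
    if not c.in_frame:
        reasons.append("no in-frame coding sequence")
    if not c.kinase_domain_intact:
        reasons.append("kinase domain absent or truncated")
    if c.n_samples < 2:
        reasons.append("non-recurrent fusion")
    return reasons


def is_putative_functional(c: FusionCandidate) -> bool:
    """True when the junction criterion holds and no passenger feature applies."""
    return c.exon_junction and not passenger_reasons(c)
```

A checklist of this kind can only prioritise candidates for inspection; the text describes a manual review in a genome viewer, which this sketch does not replace.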
Conversely, we flagged false positives according to two main causes: (1) a homologous or repetitive sequence shared by the two fusion partners and causing an alignment artifact, or (2) very high expression of one or both fusion partners in a particular sample, causing the production of non-specific RNA chimeras by trans-splicing. This process, occurring at the step of cDNA preparation through template switching by the reverse transcriptase (ref. 54), produces multiple experimental artifacts that can appear like real fusions but lack a clear exon–exon breakpoint and generally are not supported by split reads.

Functional annotation

Recurrent candidate fusion protein sequences were searched for structural domains against the Pfam database (ref. 55). Particular attention was paid to breakpoints that occurred outside of structural domains of the kinase and the fusion protein partner. Fusion partner sequences were checked for presence of CC domains by the method of Lupas et al. (ref. 56). Other dimerization or multimerization domains were checked within partner protein sequences using InterPro (ref. 57).

RNA-seq expression quantification

mRNA gene expression, measured in fragments per kilobase of mRNA per million mapped reads, was calculated for all CCDS transcripts (ref. 58) in the Gencode v17 database (ref. 51) using Cufflinks v2.1.1 (ref. 59) with options that included the following: cufflinks --multi-read-correct, --GTF , --mask-file .

Copy number data

Copy number data were downloaded from the 2014-04-16 release on the GDAC portal (http://gdac.broadinstitute.org/).

Author contributions

N.S., E.C. and C.L. designed the study. N.S. and E.C. developed the computational infrastructure and the methods, and performed the analysis. N.S., E.C., S.S. and J.L.K. interpreted the results. N.S., E.C., S.S., J.L.K. and C.L. wrote the manuscript.

Additional information

How to cite this article: Stransky, N. et al. The landscape of kinase fusions in cancer. Nat. Commun. 5:4846 doi: 10.1038/ncomms5846 (2014).
Supplementary Material

Supplementary Figures and Supplementary Table: Supplementary Figures 1-16 and Supplementary Table 1
Supplementary Data 1: List of all TCGA samples analyzed
Supplementary Data 2: List of all recurrent kinase fusions and sample ids
Supplementary Data 3: List of all non-recurrent and non-curated kinase fusions and sample ids

              Efficient de novo assembly of large genomes using compressed data structures.

              De novo genome sequence assembly is important both to generate new sequence assemblies for previously uncharacterized genomes and to identify the genome sequence of individuals in a reference-unbiased way. We present memory efficient data structures and algorithms for assembly using the FM-index derived from the compressed Burrows-Wheeler transform, and a new assembler based on these called SGA (String Graph Assembler). We describe algorithms to error-correct, assemble, and scaffold large sets of sequence data. SGA uses the overlap-based string graph model of assembly, unlike most de novo assemblers that rely on de Bruijn graphs, and is simply parallelizable. We demonstrate the error correction and assembly performance of SGA on 1.2 billion sequence reads from a human genome, which we are able to assemble using 54 GB of memory. The resulting contigs are highly accurate and contiguous, while covering 95% of the reference genome (excluding contigs <200 bp in length). Because of the low memory requirements and parallelization without requiring inter-process communication, SGA provides the first practical assembler to our knowledge for a mammalian-sized genome on a low-end computing cluster.

                Author and article information

                Genome Biology
                BioMed Central (London)
                20 November 2019
                Volume: 20
                Article number: 246
                [1] ISNI 0000 0001 2160 926X, GRID grid.39382.33, Human Genome Sequencing Center, Baylor College of Medicine, Houston, USA
                [2] ISNI 0000 0001 2165 4204, GRID grid.9851.5, Center for Integrative Genomics, University of Lausanne, Lausanne, Switzerland
                [3] ISNI 0000 0001 2223 3006, GRID grid.419765.8, Swiss Institute of Bioinformatics, Lausanne, Switzerland
                [4] ISNI 0000 0001 2165 4204, GRID grid.9851.5, Department of Computational Biology, University of Lausanne, Lausanne, Switzerland
                [5] University Center for Primary Care and Public Health, Lausanne, Switzerland
                [6] ISNI 0000000121901201, GRID grid.83440.3b, Centre for Life’s Origins and Evolution, Department of Genetics, Evolution & Environment, University College London, London, UK
                [7] ISNI 0000000121901201, GRID grid.83440.3b, Department of Computer Science, University College London, London, UK
                © The Author(s). 2019

                Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

                Received: 11 April 2019
                Accepted: 19 September 2019
                Funded by: FundRef http://dx.doi.org/10.13039/100000009, Foundation for the National Institutes of Health;
                Award ID: UM1 HG008898
                Funded by: FundRef http://dx.doi.org/10.13039/501100001711, Schweizerischer Nationalfonds zur Förderung der Wissenschaftlichen Forschung;
                Award ID: 31003A_173182
                Award ID: 31003A-143914
                Award ID: 150654
                Funded by: FundRef http://dx.doi.org/10.13039/100010663, H2020 European Research Council;
                Award ID: CAMERA

                structural variant (SV) detection, de novo assembly, short-read, long-read, mapping, hybrid, RNA-seq, gene fusion

