      Is Open Access

      VERSE: a novel approach to detect virus integration in host genomes through reference genome customization.


          Abstract

          Fueled by widespread applications of high-throughput next generation sequencing (NGS) technologies and urgent need to counter threats of pathogenic viruses, large-scale studies were conducted recently to investigate virus integration in host genomes (for example, human tumor genomes) that may cause carcinogenesis or other diseases. A limiting factor in these studies, however, is rapid virus evolution and resulting polymorphisms, which prevent reads from aligning readily to commonly used virus reference genomes, and, accordingly, make virus integration sites difficult to detect. Another confounding factor is host genomic instability as a result of virus insertions. To tackle these challenges and improve our capability to identify cryptic virus-host fusions, we present a new approach that detects Virus intEgration sites through iterative Reference SEquence customization (VERSE). To the best of our knowledge, VERSE is the first approach to improve detection through customizing reference genomes. Using 19 human tumors and cancer cell lines as test data, we demonstrated that VERSE substantially enhanced the sensitivity of virus integration site detection. VERSE is implemented in the open source package VirusFinder 2 that is available at http://bioinfo.mc.vanderbilt.edu/VirusFinder/.
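          The idea of iterative reference customization can be illustrated with a toy example. The sketch below is not the VERSE or VirusFinder 2 implementation; it is a minimal illustration under stated assumptions: reads are placed by a brute-force Hamming-distance scan, a read aligns only if it has at most `max_mismatches` mismatches, and the reference is replaced by the majority base of the aligned reads. The function names (`best_offset`, `customize_reference`) and all parameters are hypothetical.

```python
from collections import Counter


def best_offset(read, reference, max_mismatches):
    """Return the offset with the fewest mismatches, or None if every
    placement exceeds max_mismatches (toy stand-in for a real aligner)."""
    best = None
    best_dist = max_mismatches + 1
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        dist = sum(a != b for a, b in zip(read, window))
        if dist < best_dist:
            best, best_dist = offset, dist
    return best


def customize_reference(reference, reads, max_mismatches=1, max_rounds=10):
    """Repeatedly align reads and rewrite the reference with the majority
    base at each covered position, until the reference stops changing.
    Returns the customized reference and the number of update rounds."""
    rounds = 0
    for _ in range(max_rounds):
        votes = [Counter() for _ in reference]
        for read in reads:
            offset = best_offset(read, reference, max_mismatches)
            if offset is None:
                continue  # read is too divergent for the current reference
            for i, base in enumerate(read):
                votes[offset + i][base] += 1
        updated = "".join(
            v.most_common(1)[0][0] if v else ref_base
            for v, ref_base in zip(votes, reference)
        )
        if updated == reference:
            break
        reference = updated
        rounds += 1
    return reference, rounds


# Hypothetical data: the sampled virus differs from the reference at two sites.
reference = "ACGTTGCAAGCTTAGGCCAT"
reads = ["ACGTTACA", "TTACATGC", "TAGGCCAT"]
customized, rounds = customize_reference(reference, reads)
print(customized, rounds)
```

          In this toy case the second read carries two variants and cannot align to the original reference with one allowed mismatch. It aligns only after the first round has corrected one of the two sites, which is the effect that makes iteration useful when the sampled virus has diverged from the commonly used reference.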

          Most cited references (25)


          Mutation rates among RNA viruses.

          The rate of spontaneous mutation is a key parameter in modeling the genetic structure and evolution of populations. The impact of the accumulated load of mutations and the consequences of increasing the mutation rate are important in assessing the genetic health of populations. Mutation frequencies are among the more directly measurable population parameters, although the information needed to convert them into mutation rates is often lacking. A previous analysis of mutation rates in RNA viruses (specifically in riboviruses rather than retroviruses) was constrained by the quality and quantity of available measurements and by the lack of a specific theoretical framework for converting mutation frequencies into mutation rates in this group of organisms. Here, we describe a simple relation between ribovirus mutation frequencies and mutation rates, apply it to the best (albeit far from satisfactory) available data, and observe a central value for the mutation rate per genome per replication of μg ≈ 0.76. (The rate per round of cell infection is twice this value or about 1.5.) This value is so large, and ribovirus genomes are so informationally dense, that even a modest increase extinguishes the population.
            Is Open Access

            Iterative Correction of Reference Nucleotides (iCORN) using second generation sequencing technology

            Motivation: The accuracy of reference genomes is important for downstream analysis but a low error rate requires expensive manual interrogation of the sequence. Here, we describe a novel algorithm (Iterative Correction of Reference Nucleotides) that iteratively aligns deep coverage of short sequencing reads to correct errors in reference genome sequences and evaluate their accuracy. Results: Using Plasmodium falciparum (81% A + T content) as an extreme example, we show that the algorithm is highly accurate and corrects over 2000 errors in the reference sequence. We give examples of its application to numerous other eukaryotic and prokaryotic genomes and suggest additional applications. Availability: The software is available at http://icorn.sourceforge.net Contact: tdo@sanger.ac.uk; cnewbold@hammer.imm.ox.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online.
              Is Open Access

              The landscape of viral expression and host gene fusion and adaptation in human cancer

              A century of tumour virology has revealed that seven types of viruses cause 10–15% of all human malignancies1. Viruses can cause cellular transformation by expression of viral oncogenes, by genomic integration to alter the activity of cellular proto-oncogenes or tumour suppressors, and by inducing inflammation that promotes oncogenesis. Viral aetiology is particularly evident in cervical carcinoma (CESC), which is almost exclusively caused by high-risk human papillomaviruses (HPV), and in hepatocellular carcinoma (LIHC), where infection with hepatitis B virus (HBV) or hepatitis C virus (HCV) is the predominant cause in some countries2. In addition, several rare cancers have a strong viral component, including Epstein–Barr virus (EBV)/human herpes virus (HHV) 4 in most Burkitt’s lymphomas. Huge advances in the prevention of virus-associated cancer have been made through vaccination programmes against HPV and HBV, second only to smoking cessation in the number of yearly cancer cases prevented worldwide3. Our current knowledge of virus–tumour associations is based largely on data gathered with low-throughput methodologies in the pre-genomic era. However, massively parallel sequencing is now showing promise for efficient unbiased detection of viruses in tumour tissue. This recently led to the discovery of a new polyomavirus as the cause of most Merkel cell carcinomas4, where essential virus–host interactions are currently being targeted in clinical drug trials5. Recent studies describe techniques for detection of viruses using high-throughput RNA or DNA sequencing6 7, and massively parallel sequencing has been used to survey sites of genomic integration of HBV in hepatocellular carcinoma8 9. Similarly, viral integration sites were recently mapped in 17 cervical and 239 head and neck carcinomas by detecting host–virus fusions in transcriptome sequencing (RNA-seq) data from The Cancer Genome Atlas (TCGA)10. 
These studies provided important insights and clearly demonstrate the potential of the methodology, but the scope and the number of tumours has thus far been limited. This motivates a broad unbiased survey of viral expression and integration in human cancer. Here we screen for expressed viruses in a diverse landscape of human cancer, encompassing 19 tumour types and 4,433 tumours, using RNA-seq data generated within the TCGA consortium. The resulting map provides a cross-cancer view of tumour–virus associations on a previously unseen scale and level of detail, and enables several powerful analyses. Our observations fall into six main categories: confirmation of established associations, such as high-risk HPV in cervical and head and neck cancer, which validates our methodology and provides reference viral expression levels and patterns in tumours with known viral aetiology; confirmation or rejection of controversial hypotheses, such as HPV18 in colorectal cancer; rare occurrences of known viruses in novel contexts; new viral isolates, including a novel recombinant enterovirus strain; novel recurrent host—virus fusion events, such as HPV insertions in ERBB2 and RAD51B; and patterns of coadaptation between viral and host gene expression. Results A map of tumour viruses in 19 human cancers We used two complementary approaches to detect and quantify expression of known and novel viruses in tumours (Fig. 1a, Methods). Briefly, RNA-seq libraries were filtered of human content, and remaining sequences were screened for matches to the complete RefSeq collection of viral genomes (n=3,590 excluding bacteriophages). Viral mRNA was quantified by computing the fraction of viral reads (FVR), presented as parts per million (p.p.m.) of total library size. To enable detection of missing strains and novel viruses, we de novo assembled non-human reads into contiguous segments (contigs) that were annotated while allowing for strong sequence divergence. 
On the basis of this, we added additional viral genomes, such as papilloma types missing in RefSeq and two novel assembled genomes (Supplementary Table S1 and Supplementary Fig. S1), to allow quantification as described above. Cases with unnaturally restricted viral genomic read coverage, probably due to traces of recombinant DNA, were excluded (Methods). We applied our pipeline to RNA-seq libraries from 19 cancers, encompassing a total of 4,433 tumours and 404 normal tissue controls that were each sequenced at an average depth of 151 million reads (Fig. 1b; additional library and sample information in Supplementary Table S2). We identified 178 tumours with FVR (viral expression) >2 p.p.m., but found that most positive cases had considerably higher levels (on average 168 and up to 854 p.p.m.; the complete results are available in Supplementary Data 1). Expectedly, CESC and LIHC showed the highest proportion of virus-positive tumours (96.6% and 32.4%, respectively, >2 p.p.m.), followed by head and neck squamous cell carcinoma (HNSC, 14.8%; Fig. 1b). De novo assembly revealed HPV in 15/18 CESC tumours that were originally negative, demonstrating a high sensitivity for detecting missing and novel viruses. Comparison with HPV status as determined by in situ hybridization in HNSC showed that 8/8 positive and 44/44 negative samples were correctly classified by our pipeline. The known tumour viruses HPV and HBV constituted the vast majority of strong signals >10 p.p.m. (90.5%; Fig. 1c). In contrast, matches in the 2–10-p.p.m. range were often because of HHVs that are known to infect and remain latent in lymphocytes (47.6%). Many of these detections could be attributed to cytomegalovirus (CMV/HHV5) and EBV in colon adenocarcinoma (COAD), probably because of lymphocytic infiltration (Fig. 2a). T-lymphocyte infiltration could also probably explain one case of low-FVR HIV1 in rectal adenocarcinoma (READ). 
We conclude that viruses that are actively participating in tumour formation and maintenance often, but not always, show FVR values >10 p.p.m. Importantly, we note an absence of relevant viral expression in several cancers otherwise subject to regular speculation about strong viral aetiology, including EBV in breast invasive carcinoma and CMV in glioblastoma multiforme11 12. The deep sequencing depth in these samples allowed us to safely estimate upper limits on viral expression: in the worst-case tumours, CMV was expressed at 140.000 reads) but more typically in the 100–200-p.p.m. range (Fig. 3a). There has been controversy regarding associations between HPV and colorectal cancer, with prevalence ranging from 0 to 83% in different studies15 16. Contamination has been suggested as a possible cause of false positives16. We observed weak expression (2–6.5 p.p.m.) of HPV18 in 5 cases (1.9%) of COAD/READ, which increased to 12 cases (4.5%) with inclusion of the 1–2-p.p.m. range (Supplementary Data 1). Viral gene expression patterns in these samples were different from known HPV-induced tumours, with consistent expression of E1 more indicative of active replication (Supplementary Fig. S2). We did not detect HPV18 in other tumours apart from CESC, which argues against contamination. HPV18 is one of few HPV types with glandular tropism17, and could conceivably infect colorectal adenocarcinomas. We conclude that earlier reports of HPV18 in colorectal tumours are probably correct. However, prevalence may have been overestimated, and expression patterns and levels speak against a contribution to carcinogenesis. Apart from matched normal liver samples with expected HBV (discussed below), only 2/404 normal tissue controls tested positive in this study, both with papillomavirus (Fig. 2a): one breast biopsy with low levels (3.1 p.p.m.) 
of a wart virus, HPV2, which expressed early as well as late genes indicative of active production of viral particles, and a normal kidney sample with HPV18 (12.9 p.p.m.), with viral gene expression similar to HPV in COAD/READ consistent with productive viral infection (Supplementary Fig. S2) but also with evidence of host–virus fusion (Fig. 2b, fusions are discussed below). These cases suggest novel tropisms for HPV, but more work is needed. Hepatitis virus prevalence As expected, HBV was detected in hepatocellular cancer (Fig. 2a): 11/34 (32.3%) of LIHC tumours expressed HBV at up to 854 p.p.m., but more typically in the 2–100-p.p.m. range (Fig. 3b). In positive cases, we consistently detected HBV in matched normal liver controls (5/5). A single tumour expressed HCV but at low levels (0.8 p.p.m.; Supplementary Data 1), likely explained by the non-polyadenylated nature of the HCV genome18. No other viruses were detected in LIHC. Inflammation/cirrhosis is a major promoter of HBV-induced oncogenesis, but expression of the viral gene X (HBx) also contributes19. Consistently, HBx was the predominantly expressed viral gene (Supplementary Fig. S3). In addition to LIHC, we found a single clear cell renal cell carcinoma (KIRC) primary tumour with moderate expression (28.9 p.p.m.) of the common HBV genotype C (Fig. 2a, Supplementary Table S3). However, although viral genes were expressed similarly to HBV-positive LIHC tumours (Supplementary Fig. S3) and the tumour mRNA profile was similar to other KIRC samples, further analysis revealed weak but consistent induction of LIHC marker genes in this sample (Supplementary Fig. S4). This supports that low-grade contamination with LIHC RNA could explain this detection. Rare occurrences and novel viral sequences BK polyomavirus (BKV) infects kidneys and the urinary tract, and has been implicated as a human tumour virus because of its oncogenic large tumour antigen (TAg) gene. 
There are contrasting reports of BKV in bladder cancer, ranging from high frequency to no association or lack of TAg expression20. We detected abundantly expressed BKV (318 p.p.m.) in 1/96 BLCA tumours, with predominant expression of full-length large TAg (Supplementary Fig. S5) as well as evidence of host–virus fusion (Fig. 2b, fusions are discussed below). This gives additional support for an aetiological role for BKV in rare cases of bladder cancer. HHV1, which normally causes mucoepithelial herpes lesions21, was detected at high FVR (338 p.p.m.) in a single HNSC tumour (Fig. 2a). HHV1 has not been described in tumours, although elevated HHV1 antibody titres have been shown in HNSC patients22. High HHV1 mRNA in this tumour could reflect reactivated virus infecting adjacent epithelium rather than tumour tissue. Enteroviruses cause a range of diseases including gastroenteritis. De novo assembly in COAD detected a novel enterovirus, revealed by detailed analysis as a recombinant of Coxsackievirus strains A19 and A22 (Supplementary Fig. S1). Presence of the virus in tumour tissue is supported by high FVR (67.0 p.p.m.) and the vast tropism of Coxsackieviruses21. Although our analysis involved unbiased matching to 3,065 non-human viral genomes, only a few hits involved viruses unlikely to infect humans (7/4,837 samples, Fig. 2a). One COAD tumour showed strong (456 p.p.m.) expression of murine type C retrovirus, also detected at low levels (3.1 and 3.8 p.p.m.) in another COAD tumour and a normal kidney biopsy. Murine type C retrovirus has strong similarity to XMRV, which was erroneously associated with disease because of contamination from common murine cell lines23. De novo assembly detected a novel mosaic-like virus (Supplementary Fig. S1) in COAD, and traces of tomato mosaic virus (3.6 p.p.m.) were found in one uterine endometroid carcinoma tumour. These viruses, and two other non-human detections (Fig. 
2a), are unlikely to be oncogenic pathogens, suggesting contamination or environmental exposure at the tumour site. Analysis of host–virus fusions HPV genomic integrations are believed to occur as a consequence of HPV oncogene-induced chromosomal instability, and integrations in or near known tumour genes have been described, sometimes in conjunction with local copy-number change and altered expression of targeted genes24 25 26. Integrations associated with altered gene activity are similarly important in HBV-induced oncogenesis8. We employed a stringent procedure for detecting integrations as evidenced by host–virus fusion transcripts in RNA-seq, considering only breakpoints supported by multiple discordant sequencing mate pairs where human reads clustered within a limited region (Methods). We validated our methodology using whole-genome sequencing data from nine HPV-positive HNSC tumours, and found that eight of nine RNA-seq-derived integrations had support from discordant mate pairs in whole-genome sequencing libraries (Supplementary Table S4). Confirming previous data25, we observed a high integration frequency for HPV18 (100%) and a lower frequency for HPV16 (58.5%; Fig. 2b, Supplementary Data 2). Similarly, confirmatory, most HBV-positive tumours and normal tissue controls had viral integration8 (76.5%), and all HHV cases lacked integration (Fig. 2b). Both HPV and HBV integrations were widespread across the genome, with a few hotspots of recurrent integration (Fig. 2b). Further analysis in HNSC revealed the positional distribution to be non-random with a strong preference for integration near DNA copy-number breakpoints. A large fraction of integration clusters (41.8%) colocalized ( 10 p.p.m.). Viral integrations in CESC have not previously been assayed using massively parallel sequencing in comprehensive cohorts. 
Recurrent integrations, as evidenced by host–virus fusions, were typically in known cancer genes, including ERBB2, RAD51B and in the 13q22.1 intergenic region harbouring the LINC00393 lncRNA. Technical limitations of earlier methods may have caused these recurrent sites to be missed, as previously suggested for HBV8. Integrations were typically associated with altered gene expression, and our analysis in HNSC revealed strong association between viral integration and copy-number change, similar to what has been reported in CESC38. Although this is compatible with induction of local genomic instability, integration could alternatively be facilitated in these regions by pre-existing instability, and future studies should aim to better differentiate between these models. Analysis of host transcriptome perturbations caused by tumour virus proteins can facilitate identification and prioritization of cancer-causing genes and pathways39. Abundant RNA-seq data from hundreds of positive tumours here enabled us to analyse host gene expression in relation to viral infection and viral gene expression at a previously intractable scale and level of detail. The E6/E7-expressing subcategory of HPV tumours, which we could associate with a de-differentiated host signature, may be of particular interest. Future work should investigate this subtype in relation to mutational profiles, clinical variables and responsiveness to therapy. De novo assembly of novel viral sequences revealed an enterovirus recombinant and a new mosaic-like virus, and was highly efficient at identifying HPV when relevant strains were missing in our viral database. However, the number of discovered novel viruses was still surprisingly low. An exciting future application is rare cancers, which could lead to more novel viruses or recurrent associations being uncovered. 
The present work provides a reference for expected viral expression levels in virus-induced tumours, and paves the way for future unbiased mapping of tumour-associated viruses in large-scale cancer genomics data sets. Methods Detection of viruses in tumour RNA-seq RNA-seq data in BAM format for 19 cancers encompassing 4,433 tumours and 404 normal tissue controls was obtained from the TCGA CGHub repository (current data as of 25 February 2013). Unaligned (non-human) reads were extracted using bam2fastq (http://www.hudsonalpha.org/gsl/information/software/bam2fastq) and further filtered of human content using Bowtie40. The prinseq-lite utility41 was used to remove low-complexity sequences (using a DUST threshold of 7) and short reads 2 p.p.m. of total library reads are presented in Fig. 2, whereas all detections >0.5 p.p.m. are documented in Supplementary Data 1. A nearest-neighbor approach was applied to RNA-seq profiles to confirm the overall correctness of TCGA tissue annotations (98.2% correctly classified, Supplementary Data 6). We used low viral genomic read coverage (number of unique positions) as an indicator of unnaturally restricted expression. Manual inspection of such cases ( 400 nucleotides. To identify missing strains and novel viruses, contigs were matched to known viruses by BLAST43 using a word size of 7. Simulated contigs from Merkel cell polyomavirus showed that the approach could detect unknown viruses with high sensitivity, also when considerably diverged from the reference genomes. On the basis of this analysis, we added additional viral genomes from other sources, such as papilloma types missing in RefSeq, as well as two novel assembled genomes (Supplementary Fig. S1 and Supplementary Table S1). Post-processing included generation of html reports, describing all BLAST alignments >200 nucleotides. 
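The fraction of viral reads (FVR) used throughout the detection procedure above is a simple ratio expressed in parts per million of total library size. The sketch below shows that calculation together with the two thresholds mentioned in the text (>2 p.p.m. for a positive detection, >10 p.p.m. for a strong signal). The function names and the labels "negative", "weak" and "strong" are hypothetical conveniences for illustration, not terms from the paper's pipeline.

```python
def fraction_viral_reads_ppm(viral_reads, total_reads):
    """Viral reads as parts per million of the total library size."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return viral_reads * 1_000_000 / total_reads


def classify_fvr(fvr_ppm, detect_ppm=2.0, strong_ppm=10.0):
    """Bucket an FVR value using the >2 and >10 p.p.m. cut-offs from the text.
    The bucket names are illustrative assumptions."""
    if fvr_ppm > strong_ppm:
        return "strong"
    if fvr_ppm > detect_ppm:
        return "weak"
    return "negative"


# A library of 151 million reads (the average depth reported in the text)
# with a hypothetical count of 4,364 viral reads.
fvr = fraction_viral_reads_ppm(4_364, 151_000_000)
print(round(fvr, 1), classify_fvr(fvr))
```

Because the cut-offs are strict inequalities in the text, a library at exactly 2 p.p.m. is treated as negative in this sketch.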
Identification of viral integration sites Sites of viral integration were identified using mate information from paired-end sequencing, similar in principle to a previous report10. Reads were subject to quality filtering as described above. Discordant human–viral mate pairs were identified by alignment of non-viral reads to the Hg19 human reference with Bowtie, allowing up to two mismatches and discarding non-uniquely mapped reads. Human mates in discordant pairs were clustered by position using a maximum gap size of 100. To identify single distinct breakpoints supported by multiple reads, we considered clusters with at least 10 reads (unique positions). Integrations into the mitochondrial genome, indicative of false positives, were completely absent at this level of stringency. Breakpoint clusters were finally annotated against the GENCODE (v11) gene annotation44. Recurrent integrations in TMPRSS3 were not considered, as they were due to a single long transcript likely to be a mis-annotation. For nucleotide-resolution mapping of integration breakpoints, we used the Subread aligner45 to identify breakpoint-spanning reads that aligned in part to Hg19 and in part to the viral database. These were filtered based on the pair-end integration results, such that only those that aligned to relevant genes and viruses were considered. Breakpoints with support from at least 10 breakpoint-spanning reads were considered for further analysis. Host gene expression analyses Host gene expression analyses were done using TCGA Level 3 (RNASeqV2) transcription profiles. We tested for differential expression between HPV-positive/-negative tumours, and between subsets of tumours classified by their viral expression patterns, using Student’s t-test based on log2-transformed mRNA levels. We considered genes with detectable expression in at least half of the samples. Expression ratios were computed by comparing median levels in each group. 
P-values were corrected for multiple testing by computing q-values (false discovery rates) as described previously46. Viral genes were manually classified as having high, medium, low or absent relative expression based on read density plots (Supplementary Fig. S8 and Supplementary Data 4). PCA analysis was performed based on 14,714 genes with expression level >500 in at least one sample. Author contributions K-W.T, B.A-M and E.L. analysed data. T.S and M.L. performed additional analyses. E.L. wrote the paper with contributions from K-W.T. K-W.T and E.L. conceived the study. Additional information How to cite this article: Tang, K.-W. et al. The landscape of viral expression and host gene fusion and adaptation in human cancer. Nat. Commun. 4:2513 doi: 10.1038/ncomms3513 (2013). Supplementary Material Supplementary Figures and Tables Supplementary Figures S1-S8 and Supplementary Tables S1-S5 Supplementary Data 1 Complete screening results for 4849 RNA-seq libraries Supplementary Data 2 Integration analysis results Supplementary Data 3 HPV-regulated host genes in HNSC Supplementary Data 4 HPV viral gene expression table Supplementary Data 5 E6/E7-associated host genes in CESC Supplementary Data 6 Nearest-neighbour analysis of TCGA RNA-seq expression profiles Supplementary Data 7 Viral gene expression patterns (coverage plots) for all positive tumours (>2 ppm). Supplementary Note 1 List of The Cancer Genome Atlas Research Network contributors
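The mate-pair clustering step described under "Identification of viral integration sites" above (human mates of discordant pairs clustered by position with a maximum gap of 100, keeping clusters with at least 10 unique positions) can be sketched as follows. This is an illustrative reading of that description, not the authors' code: the input format, the function name `cluster_human_mates`, and the choice to split clusters when the gap is strictly greater than 100 are assumptions.

```python
from itertools import groupby


def cluster_human_mates(mates, max_gap=100, min_unique_positions=10):
    """Cluster human-mate alignments of discordant human-viral pairs.

    mates: iterable of (chromosome, position) tuples (assumed input format).
    Returns a list of (chromosome, start, end, n_unique_positions) for
    clusters that reach the support threshold.
    """
    clusters = []
    unique = sorted(set(mates))  # count each position once, as in the text
    for chrom, group in groupby(unique, key=lambda m: m[0]):
        current = []
        for _, pos in group:
            # Start a new cluster when the gap to the previous mate is too large.
            if current and pos - current[-1] > max_gap:
                if len(current) >= min_unique_positions:
                    clusters.append((chrom, current[0], current[-1], len(current)))
                current = []
            current.append(pos)
        if len(current) >= min_unique_positions:
            clusters.append((chrom, current[0], current[-1], len(current)))
    return clusters


# Hypothetical mates: twelve tightly spaced positions, one duplicate,
# and two isolated mates that should not form a reported cluster.
mates = [("chr17", 1000 + 5 * i) for i in range(12)]
mates += [("chr17", 1000), ("chr17", 5000), ("chr2", 300)]
print(cluster_human_mates(mates))
```

Counting unique positions rather than raw reads keeps PCR duplicates from inflating the support for a breakpoint, which matches the "reads (unique positions)" wording in the Methods.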

                Author and article information

                Journal
                Genome Medicine (Genome Med)
                Springer Science and Business Media LLC
                ISSN: 1756-994X
                2015; 7(1)
                Affiliations
                [1] Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, TN 37203 USA.
                [2] Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, TN 37203 USA; Center for Quantitative Sciences, Vanderbilt University Medical Center, Nashville, TN 37232 USA.
                [3] Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, TN 37203 USA; Center for Quantitative Sciences, Vanderbilt University Medical Center, Nashville, TN 37232 USA; Department of Psychiatry, Vanderbilt University School of Medicine, Nashville, TN 37232 USA; Department of Cancer Biology, Vanderbilt University School of Medicine, Nashville, TN 37232 USA.
                Article
                126
                DOI: 10.1186/s13073-015-0126-6
                PMCID: PMC4333248
                PMID: 25699093
                9ae3e148-0d47-4cf1-9c42-7c8705d47121