      Is Open Access

      viGEN: An Open Source Pipeline for the Detection and Quantification of Viral RNA in Human Tumors

      methods-article


          Abstract

          An estimated 17% of cancers worldwide are associated with infectious causes. The extent and biological significance of viral presence/infection in actual tumor samples is generally unknown but could be measured using human transcriptome (RNA-seq) data from tumor samples. We present viGEN, an open source bioinformatics pipeline that allows not only the detection and quantification of viral RNA but also the detection of variants in the viral transcripts. The pipeline includes 4 major modules: the first module aligns and filters out human RNA sequences; the second module maps and counts the remaining unaligned reads against reference genomes of all known and sequenced human viruses; the third module quantifies read counts at the individual viral-gene level, thus allowing for downstream differential expression analysis of viral genes between case and control groups; the fourth module calls variants in these viruses. To the best of our knowledge, there are no publicly available pipelines or packages that provide this type of complete analysis in one open source package. In this paper, we applied the viGEN pipeline to two case studies. We first demonstrate the working of our pipeline on a large public dataset, the TCGA cervical cancer cohort. In the second case study, we performed an in-depth analysis on a small focused study of TCGA liver cancer patients. In the latter cohort, we performed viral-gene quantification, viral-variant extraction and survival analysis. This allowed us to find differentially expressed viral transcripts and viral variants between the groups of patients, and connect them to clinical outcome. From our analyses, we show that we were able to successfully detect the human papilloma virus among the TCGA cervical cancer patients. We compared the viGEN pipeline with two metagenomics tools and demonstrate similar sensitivity/specificity. We were also able to quantify viral transcripts and extract viral variants using the liver cancer dataset. The results presented corresponded with published literature in terms of rate of detection and the impact of several known variants of the HBV genome. This pipeline is generalizable, and can be used to provide novel biological insights into microbial infections in complex diseases and tumorigenesis. Our viral pipeline could be used in conjunction with additional types of immuno-oncology analysis based on RNA-seq data of host RNA for cancer immunology applications. The source code, with example data and a tutorial, is available at: https://github.com/ICBI/viGEN/.
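          The gene-level counting idea behind the third module can be illustrated with a minimal sketch. This is not viGEN's implementation: the function name, the tuple layout of the inputs, and the coordinates in the example are all hypothetical, and the sketch assumes alignments have already been produced by an upstream aligner.

```python
from collections import Counter


def count_reads_per_gene(alignments, genes):
    """Count aligned reads that overlap each annotated viral gene.

    alignments: iterable of (genome_id, start, end), 0-based half-open.
    genes: iterable of (genome_id, gene_name, start, end), same convention.
    Both layouts are assumptions made for this illustration.
    """
    genes = list(genes)
    # Start every gene at zero so genes without reads still appear in the result.
    counts = Counter({(genome, name): 0 for genome, name, _, _ in genes})
    for genome_id, r_start, r_end in alignments:
        for genome, name, g_start, g_end in genes:
            # Two half-open intervals overlap when each starts before the other ends.
            if genome == genome_id and r_start < g_end and g_start < r_end:
                counts[(genome, name)] += 1
    return dict(counts)


# Made-up coordinates, used only to exercise the function.
example_alignments = [("HBV", 0, 50), ("HBV", 40, 120), ("HPV16", 10, 60)]
example_genes = [("HBV", "X", 100, 200), ("HBV", "S", 0, 45), ("HPV16", "E6", 0, 100)]
example_counts = count_reads_per_gene(example_alignments, example_genes)
```

          A per-gene count table of this shape is the kind of input that downstream differential expression tools expect; a real pipeline would use an indexed interval structure rather than the nested loop shown here.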

          Most cited references (30)


          Comprehensive and Integrative Genomic Characterization of Hepatocellular Carcinoma

          (2017)
          Liver cancer has the second highest worldwide cancer mortality rate and has limited therapeutic options. We analyzed 363 hepatocellular carcinoma (HCC) cases by whole exome sequencing and DNA copy number analyses, and 196 HCC also by DNA methylation, RNA, miRNA, and proteomic expression. DNA sequencing and mutation analysis identified significantly mutated genes including LZTR1, EEF1A1, SF3B1, and SMARCA4. Significant alterations by mutation or down-regulation by hypermethylation in genes likely to result in HCC metabolic reprogramming (ALB, APOB, and CPS1) were observed. Integrative molecular HCC subtyping incorporating unsupervised clustering of five data platforms identified three subtypes, one of which was associated with poorer prognosis in three HCC cohorts. Integrated analyses enabled development of a p53 target gene expression signature correlating with poor survival. Potential therapeutic targets for which inhibitors exist include WNT signaling, MDM4, MET, VEGFA, MCL1, IDH1, TERT, and immune checkpoint proteins CTLA-4, PD-1, and PD-L1. Multiplex molecular profiling of human hepatocellular carcinoma patients provides insight into subtype characteristics and points toward key pathways to target therapeutically.
            Is Open Access

            The landscape of viral expression and host gene fusion and adaptation in human cancer

            A century of tumour virology has revealed that seven types of viruses cause 10–15% of all human malignancies1. Viruses can cause cellular transformation by expression of viral oncogenes, by genomic integration to alter the activity of cellular proto-oncogenes or tumour suppressors, and by inducing inflammation that promotes oncogenesis. Viral aetiology is particularly evident in cervical carcinoma (CESC), which is almost exclusively caused by high-risk human papillomaviruses (HPV), and in hepatocellular carcinoma (LIHC), where infection with hepatitis B virus (HBV) or hepatitis C virus (HCV) is the predominant cause in some countries2. In addition, several rare cancers have a strong viral component, including Epstein–Barr virus (EBV)/human herpes virus (HHV) 4 in most Burkitt’s lymphomas. Huge advances in the prevention of virus-associated cancer have been made through vaccination programmes against HPV and HBV, second only to smoking cessation in the number of yearly cancer cases prevented worldwide3. Our current knowledge of virus–tumour associations is based largely on data gathered with low-throughput methodologies in the pre-genomic era. However, massively parallel sequencing is now showing promise for efficient unbiased detection of viruses in tumour tissue. This recently led to the discovery of a new polyomavirus as the cause of most Merkel cell carcinomas4, where essential virus–host interactions are currently being targeted in clinical drug trials5. Recent studies describe techniques for detection of viruses using high-throughput RNA or DNA sequencing6 7, and massively parallel sequencing has been used to survey sites of genomic integration of HBV in hepatocellular carcinoma8 9. Similarly, viral integration sites were recently mapped in 17 cervical and 239 head and neck carcinomas by detecting host–virus fusions in transcriptome sequencing (RNA-seq) data from The Cancer Genome Atlas (TCGA)10.
These studies provided important insights and clearly demonstrate the potential of the methodology, but the scope and the number of tumours have thus far been limited. This motivates a broad unbiased survey of viral expression and integration in human cancer. Here we screen for expressed viruses in a diverse landscape of human cancer, encompassing 19 tumour types and 4,433 tumours, using RNA-seq data generated within the TCGA consortium. The resulting map provides a cross-cancer view of tumour–virus associations on a previously unseen scale and level of detail, and enables several powerful analyses. Our observations fall into six main categories: confirmation of established associations, such as high-risk HPV in cervical and head and neck cancer, which validates our methodology and provides reference viral expression levels and patterns in tumours with known viral aetiology; confirmation or rejection of controversial hypotheses, such as HPV18 in colorectal cancer; rare occurrences of known viruses in novel contexts; new viral isolates, including a novel recombinant enterovirus strain; novel recurrent host–virus fusion events, such as HPV insertions in ERBB2 and RAD51B; and patterns of coadaptation between viral and host gene expression.

Results

A map of tumour viruses in 19 human cancers

We used two complementary approaches to detect and quantify expression of known and novel viruses in tumours (Fig. 1a, Methods). Briefly, RNA-seq libraries were filtered of human content, and remaining sequences were screened for matches to the complete RefSeq collection of viral genomes (n=3,590 excluding bacteriophages). Viral mRNA was quantified by computing the fraction of viral reads (FVR), presented as parts per million (p.p.m.) of total library size. To enable detection of missing strains and novel viruses, we de novo assembled non-human reads into contiguous segments (contigs) that were annotated while allowing for strong sequence divergence.
On the basis of this, we added additional viral genomes, such as papilloma types missing in RefSeq and two novel assembled genomes (Supplementary Table S1 and Supplementary Fig. S1), to allow quantification as described above. Cases with unnaturally restricted viral genomic read coverage, probably due to traces of recombinant DNA, were excluded (Methods). We applied our pipeline to RNA-seq libraries from 19 cancers, encompassing a total of 4,433 tumours and 404 normal tissue controls that were each sequenced at an average depth of 151 million reads (Fig. 1b; additional library and sample information in Supplementary Table S2). We identified 178 tumours with FVR (viral expression) >2 p.p.m., but found that most positive cases had considerably higher levels (on average 168 and up to 854 p.p.m.; the complete results are available in Supplementary Data 1). Expectedly, CESC and LIHC showed the highest proportion of virus-positive tumours (96.6% and 32.4%, respectively, >2 p.p.m.), followed by head and neck squamous cell carcinoma (HNSC, 14.8%; Fig. 1b). De novo assembly revealed HPV in 15/18 CESC tumours that were originally negative, demonstrating a high sensitivity for detecting missing and novel viruses. Comparison with HPV status as determined by in situ hybridization in HNSC showed that 8/8 positive and 44/44 negative samples were correctly classified by our pipeline. The known tumour viruses HPV and HBV constituted the vast majority of strong signals >10 p.p.m. (90.5%; Fig. 1c). In contrast, matches in the 2–10-p.p.m. range were often because of HHVs that are known to infect and remain latent in lymphocytes (47.6%). Many of these detections could be attributed to cytomegalovirus (CMV/HHV5) and EBV in colon adenocarcinoma (COAD), probably because of lymphocytic infiltration (Fig. 2a). T-lymphocyte infiltration could also probably explain one case of low-FVR HIV1 in rectal adenocarcinoma (READ). 
We conclude that viruses that are actively participating in tumour formation and maintenance often, but not always, show FVR values >10 p.p.m. Importantly, we note an absence of relevant viral expression in several cancers otherwise subject to regular speculation about strong viral aetiology, including EBV in breast invasive carcinoma and CMV in glioblastoma multiforme11 12. The deep sequencing depth in these samples allowed us to safely estimate upper limits on viral expression: in the worst-case tumours, CMV was expressed at 140.000 reads) but more typically in the 100–200-p.p.m. range (Fig. 3a). There has been controversy regarding associations between HPV and colorectal cancer, with prevalence ranging from 0 to 83% in different studies15 16. Contamination has been suggested as a possible cause of false positives16. We observed weak expression (2–6.5 p.p.m.) of HPV18 in 5 cases (1.9%) of COAD/READ, which increased to 12 cases (4.5%) with inclusion of the 1–2-p.p.m. range (Supplementary Data 1). Viral gene expression patterns in these samples were different from known HPV-induced tumours, with consistent expression of E1 more indicative of active replication (Supplementary Fig. S2). We did not detect HPV18 in other tumours apart from CESC, which argues against contamination. HPV18 is one of few HPV types with glandular tropism17, and could conceivably infect colorectal adenocarcinomas. We conclude that earlier reports of HPV18 in colorectal tumours are probably correct. However, prevalence may have been overestimated, and expression patterns and levels speak against a contribution to carcinogenesis. Apart from matched normal liver samples with expected HBV (discussed below), only 2/404 normal tissue controls tested positive in this study, both with papillomavirus (Fig. 2a): one breast biopsy with low levels (3.1 p.p.m.) 
of a wart virus, HPV2, which expressed early as well as late genes indicative of active production of viral particles, and a normal kidney sample with HPV18 (12.9 p.p.m.), with viral gene expression similar to HPV in COAD/READ consistent with productive viral infection (Supplementary Fig. S2) but also with evidence of host–virus fusion (Fig. 2b, fusions are discussed below). These cases suggest novel tropisms for HPV, but more work is needed.

Hepatitis virus prevalence

As expected, HBV was detected in hepatocellular cancer (Fig. 2a): 11/34 (32.3%) of LIHC tumours expressed HBV at up to 854 p.p.m., but more typically in the 2–100-p.p.m. range (Fig. 3b). In positive cases, we consistently detected HBV in matched normal liver controls (5/5). A single tumour expressed HCV but at low levels (0.8 p.p.m.; Supplementary Data 1), likely explained by the non-polyadenylated nature of the HCV genome18. No other viruses were detected in LIHC. Inflammation/cirrhosis is a major promoter of HBV-induced oncogenesis, but expression of the viral gene X (HBx) also contributes19. Consistently, HBx was the predominantly expressed viral gene (Supplementary Fig. S3). In addition to LIHC, we found a single clear cell renal cell carcinoma (KIRC) primary tumour with moderate expression (28.9 p.p.m.) of the common HBV genotype C (Fig. 2a, Supplementary Table S3). However, although viral genes were expressed similarly to HBV-positive LIHC tumours (Supplementary Fig. S3) and the tumour mRNA profile was similar to other KIRC samples, further analysis revealed weak but consistent induction of LIHC marker genes in this sample (Supplementary Fig. S4). This supports that low-grade contamination with LIHC RNA could explain this detection.

Rare occurrences and novel viral sequences

BK polyomavirus (BKV) infects kidneys and the urinary tract, and has been implicated as a human tumour virus because of its oncogenic large tumour antigen (TAg) gene.
There are contrasting reports of BKV in bladder cancer, ranging from high frequency to no association or lack of TAg expression20. We detected abundantly expressed BKV (318 p.p.m.) in 1/96 BLCA tumours, with predominant expression of full-length large TAg (Supplementary Fig. S5) as well as evidence of host–virus fusion (Fig. 2b, fusions are discussed below). This gives additional support for an aetiological role for BKV in rare cases of bladder cancer. HHV1, which normally causes mucoepithelial herpes lesions21, was detected at high FVR (338 p.p.m.) in a single HNSC tumour (Fig. 2a). HHV1 has not been described in tumours, although elevated HHV1 antibody titres have been shown in HNSC patients22. High HHV1 mRNA in this tumour could reflect reactivated virus infecting adjacent epithelium rather than tumour tissue. Enteroviruses cause a range of diseases including gastroenteritis. De novo assembly in COAD detected a novel enterovirus, revealed by detailed analysis as a recombinant of Coxsackievirus strains A19 and A22 (Supplementary Fig. S1). Presence of the virus in tumour tissue is supported by high FVR (67.0 p.p.m.) and the vast tropism of Coxsackieviruses21. Although our analysis involved unbiased matching to 3,065 non-human viral genomes, only a few hits involved viruses unlikely to infect humans (7/4,837 samples, Fig. 2a). One COAD tumour showed strong (456 p.p.m.) expression of murine type C retrovirus, also detected at low levels (3.1 and 3.8 p.p.m.) in another COAD tumour and a normal kidney biopsy. Murine type C retrovirus has strong similarity to XMRV, which was erroneously associated with disease because of contamination from common murine cell lines23. De novo assembly detected a novel mosaic-like virus (Supplementary Fig. S1) in COAD, and traces of tomato mosaic virus (3.6 p.p.m.) were found in one uterine endometroid carcinoma tumour. These viruses, and two other non-human detections (Fig. 
2a), are unlikely to be oncogenic pathogens, suggesting contamination or environmental exposure at the tumour site.

Analysis of host–virus fusions

HPV genomic integrations are believed to occur as a consequence of HPV oncogene-induced chromosomal instability, and integrations in or near known tumour genes have been described, sometimes in conjunction with local copy-number change and altered expression of targeted genes24 25 26. Integrations associated with altered gene activity are similarly important in HBV-induced oncogenesis8. We employed a stringent procedure for detecting integrations as evidenced by host–virus fusion transcripts in RNA-seq, considering only breakpoints supported by multiple discordant sequencing mate pairs where human reads clustered within a limited region (Methods). We validated our methodology using whole-genome sequencing data from nine HPV-positive HNSC tumours, and found that eight of nine RNA-seq-derived integrations had support from discordant mate pairs in whole-genome sequencing libraries (Supplementary Table S4). Confirming previous data25, we observed a high integration frequency for HPV18 (100%) and a lower frequency for HPV16 (58.5%; Fig. 2b, Supplementary Data 2). Similarly, confirmatory, most HBV-positive tumours and normal tissue controls had viral integration8 (76.5%), and all HHV cases lacked integration (Fig. 2b). Both HPV and HBV integrations were widespread across the genome, with a few hotspots of recurrent integration (Fig. 2b). Further analysis in HNSC revealed the positional distribution to be non-random with a strong preference for integration near DNA copy-number breakpoints. A large fraction of integration clusters (41.8%) colocalized ( 10 p.p.m.). Viral integrations in CESC have not previously been assayed using massively parallel sequencing in comprehensive cohorts.
Recurrent integrations, as evidenced by host–virus fusions, were typically in known cancer genes, including ERBB2, RAD51B and in the 13q22.1 intergenic region harbouring the LINC00393 lncRNA. Technical limitations of earlier methods may have caused these recurrent sites to be missed, as previously suggested for HBV8. Integrations were typically associated with altered gene expression, and our analysis in HNSC revealed strong association between viral integration and copy-number change, similar to what has been reported in CESC38. Although this is compatible with induction of local genomic instability, integration could alternatively be facilitated in these regions by pre-existing instability, and future studies should aim to better differentiate between these models. Analysis of host transcriptome perturbations caused by tumour virus proteins can facilitate identification and prioritization of cancer-causing genes and pathways39. Abundant RNA-seq data from hundreds of positive tumours here enabled us to analyse host gene expression in relation to viral infection and viral gene expression at a previously intractable scale and level of detail. The E6/E7-expressing subcategory of HPV tumours, which we could associate with a de-differentiated host signature, may be of particular interest. Future work should investigate this subtype in relation to mutational profiles, clinical variables and responsiveness to therapy. De novo assembly of novel viral sequences revealed an enterovirus recombinant and a new mosaic-like virus, and was highly efficient at identifying HPV when relevant strains were missing in our viral database. However, the number of discovered novel viruses was still surprisingly low. An exciting future application is rare cancers, which could lead to more novel viruses or recurrent associations being uncovered.
The present work provides a reference for expected viral expression levels in virus-induced tumours, and paves the way for future unbiased mapping of tumour-associated viruses in large-scale cancer genomics data sets.

Methods

Detection of viruses in tumour RNA-seq

RNA-seq data in BAM format for 19 cancers encompassing 4,433 tumours and 404 normal tissue controls was obtained from the TCGA CGHub repository (current data as of 25 February 2013). Unaligned (non-human) reads were extracted using bam2fastq (http://www.hudsonalpha.org/gsl/information/software/bam2fastq) and further filtered of human content using Bowtie40. The prinseq-lite utility41 was used to remove low-complexity sequences (using a DUST threshold of 7) and short reads 2 p.p.m. of total library reads are presented in Fig. 2, whereas all detections >0.5 p.p.m. are documented in Supplementary Data 1. A nearest-neighbor approach was applied to RNA-seq profiles to confirm the overall correctness of TCGA tissue annotations (98.2% correctly classified, Supplementary Data 6). We used low viral genomic read coverage (number of unique positions) as an indicator of unnaturally restricted expression. Manual inspection of such cases ( 400 nucleotides. To identify missing strains and novel viruses, contigs were matched to known viruses by BLAST43 using a word size of 7. Simulated contigs from Merkel cell polyomavirus showed that the approach could detect unknown viruses with high sensitivity, also when considerably diverged from the reference genomes. On the basis of this analysis, we added additional viral genomes from other sources, such as papilloma types missing in RefSeq, as well as two novel assembled genomes (Supplementary Fig. S1 and Supplementary Table S1). Post-processing included generation of HTML reports, describing all BLAST alignments >200 nucleotides.
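The fraction of viral reads (FVR) used throughout this article is defined as viral reads expressed in parts per million of total library size. A minimal sketch of that arithmetic follows; the function names are hypothetical, the 2 p.p.m. default is taken from the reporting threshold quoted in the text, and nothing here reproduces the authors' actual code.

```python
def fraction_of_viral_reads_ppm(viral_reads, total_reads):
    """Return viral reads as parts per million (p.p.m.) of total library reads."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if viral_reads < 0:
        raise ValueError("viral_reads must not be negative")
    # Multiply before dividing so round counts give exact results.
    return viral_reads * 1_000_000 / total_reads


def exceeds_reporting_threshold(fvr_ppm, threshold_ppm=2.0):
    """True when FVR is strictly above the threshold (assumed default: 2 p.p.m.)."""
    return fvr_ppm > threshold_ppm


# A library at the quoted average depth of 151 million reads
# needs more than 302 viral reads to exceed 2 p.p.m.
example_fvr = fraction_of_viral_reads_ppm(302, 151_000_000)
```

Normalising by library size in this way makes values comparable across samples sequenced at different depths.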
Identification of viral integration sites

Sites of viral integration were identified using mate information from paired-end sequencing, similar in principle to a previous report10. Reads were subject to quality filtering as described above. Discordant human–viral mate pairs were identified by alignment of non-viral reads to the Hg19 human reference with Bowtie, allowing up to two mismatches and discarding non-uniquely mapped reads. Human mates in discordant pairs were clustered by position using a maximum gap size of 100. To identify single distinct breakpoints supported by multiple reads, we considered clusters with at least 10 reads (unique positions). Integrations into the mitochondrial genome, indicative of false positives, were completely absent at this level of stringency. Breakpoint clusters were finally annotated against the GENCODE (v11) gene annotation44. Recurrent integrations in TMPRSS3 were not considered, as they were due to a single long transcript likely to be a mis-annotation. For nucleotide-resolution mapping of integration breakpoints, we used the Subread aligner45 to identify breakpoint-spanning reads that aligned in part to Hg19 and in part to the viral database. These were filtered based on the paired-end integration results, such that only those that aligned to relevant genes and viruses were considered. Breakpoints with support from at least 10 breakpoint-spanning reads were considered for further analysis.

Host gene expression analyses

Host gene expression analyses were done using TCGA Level 3 (RNASeqV2) transcription profiles. We tested for differential expression between HPV-positive/-negative tumours, and between subsets of tumours classified by their viral expression patterns, using Student’s t-test based on log2-transformed mRNA levels. We considered genes with detectable expression in at least half of the samples. Expression ratios were computed by comparing median levels in each group.
P-values were corrected for multiple testing by computing q-values (false discovery rates) as described previously46. Viral genes were manually classified as having high, medium, low or absent relative expression based on read density plots (Supplementary Fig. S8 and Supplementary Data 4). PCA analysis was performed based on 14,714 genes with expression level >500 in at least one sample.

Author contributions

K-W.T, B.A-M and E.L. analysed data. T.S and M.L. performed additional analyses. E.L. wrote the paper with contributions from K-W.T. K-W.T and E.L. conceived the study.

Additional information

How to cite this article: Tang, K.-W. et al. The landscape of viral expression and host gene fusion and adaptation in human cancer. Nat. Commun. 4:2513 doi: 10.1038/ncomms3513 (2013).

Supplementary Material

Supplementary Figures and Tables: Supplementary Figures S1-S8 and Supplementary Tables S1-S5
Supplementary Data 1: Complete screening results for 4849 RNA-seq libraries
Supplementary Data 2: Integration analysis results
Supplementary Data 3: HPV-regulated host genes in HNSC
Supplementary Data 4: HPV viral gene expression table
Supplementary Data 5: E6/E7-associated host genes in CESC
Supplementary Data 6: Nearest-neighbour analysis of TCGA RNA-seq expression profiles
Supplementary Data 7: Viral gene expression patterns (coverage plots) for all positive tumours (>2 ppm)
Supplementary Note 1: List of The Cancer Genome Atlas Research Network contributors
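The position-based clustering of discordant human mates described in the Methods of this article (maximum gap size of 100, clusters kept when supported by at least 10 unique positions) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function name is hypothetical, positions are assumed to come from a single chromosome, and a gap of exactly 100 is assumed to keep two positions in the same cluster.

```python
def cluster_positions(positions, max_gap=100, min_unique=10):
    """Group mate positions into clusters and keep well-supported ones.

    positions: iterable of integer coordinates on one chromosome (assumption).
    Returns a list of (start, end, unique_position_count) tuples.
    """
    unique = sorted(set(positions))  # support is counted in unique positions
    clusters = []
    current = []
    for pos in unique:
        # Start a new cluster when the distance to the previous position exceeds max_gap.
        if current and pos - current[-1] > max_gap:
            clusters.append(current)
            current = []
        current.append(pos)
    if current:
        clusters.append(current)
    return [(c[0], c[-1], len(c)) for c in clusters if len(c) >= min_unique]


# Ten positions spaced 10 apart form one supported cluster;
# the two distant positions are dropped for lack of support.
example_clusters = cluster_positions(list(range(1000, 1100, 10)) + [5000, 5050])
```

Sorting once and sweeping left to right keeps the procedure linear after the sort, which matters when many discordant pairs are present.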
              Is Open Access

              Applications of Next-Generation Sequencing Technologies to Diagnostic Virology

              Novel DNA sequencing techniques, referred to as “next-generation” sequencing (NGS), provide high speed and throughput that can produce an enormous volume of sequences with many possible applications in research and diagnostic settings. In this article, we provide an overview of the many applications of NGS in diagnostic virology. NGS techniques have been used for high-throughput whole viral genome sequencing, such as sequencing of new influenza viruses, for detection of viral genome variability and evolution within the host, such as investigation of human immunodeficiency virus and human hepatitis C virus quasispecies, and monitoring of low-abundance antiviral drug-resistance mutations. NGS techniques have been applied to metagenomics-based strategies for the detection of unexpected disease-associated viruses and for the discovery of novel human viruses, including cancer-related viruses. Finally, the human virome in healthy and disease conditions has been described by NGS-based metagenomics.

                Author and article information

                Contributors
                Journal
                Front Microbiol
                Front. Microbiol.
                Frontiers in Microbiology
                Frontiers Media S.A.
                1664-302X
                Published: 05 June 2018
                Volume: 9
                Article: 1172
                Affiliations
                Innovation Center for Biomedical Informatics, Georgetown University, Washington, DC, United States
                Author notes

                Edited by: Diana Elizabeth Marco, Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Argentina

                Reviewed by: João Marcelo Pereira Alves, Universidade de São Paulo, Brazil; Hetron Mweemba Munang'andu, Norwegian University of Life Sciences, Norway

                *Correspondence: Krithika Bhuvaneshwar kb472@georgetown.edu

                This article was submitted to Systems Microbiology, a section of the journal Frontiers in Microbiology

                Article
                DOI: 10.3389/fmicb.2018.01172
                PMCID: PMC5996193
                PMID: 29922260
                Copyright © 2018 Bhuvaneshwar, Song, Madhavan and Gusev.

                This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

                History
                Received: 26 January 2018
                Accepted: 15 May 2018
                Page count
                Figures: 2, Tables: 8, Equations: 1, References: 55, Pages: 13, Words: 9472
                Categories
                Microbiology
                Methods

                Microbiology & Virology
                RNA-seq, viral detection, liver cancer, TCGA, variant analysis, next-generation sequencing, cancer immunology
