Introduction Over the past decade, genomic projects have obtained evidence that thousands of non-protein-coding RNAs (ncRNAs) are transcribed from mammalian genomes, and it is becoming increasingly apparent that many long ncRNAs (>200 nucleotides [nt]) are biologically important (1–3). Though some small ncRNAs such as microRNAs (4) have been found to be involved in virus-host interactions, the relevance of long ncRNAs to viral infections has not been systematically studied, in part because these ncRNAs have not been easily accessible with typically available technologies. In this study, we performed whole-transcriptome analysis of severe acute respiratory syndrome coronavirus (SARS-CoV)-infected lung samples collected from four mouse strains using deep-sequencing technology. Our results show that there was a widespread differential regulation of long ncRNAs in response to viral infection, suggesting that these ncRNAs are involved in regulating the host response, including innate immunity. At the same time, virus infection models provide a unique platform for studying the biology and regulation of ncRNAs. RESULTS Whole-transcriptome analysis of SARS-CoV-infected mouse lung samples. To systematically investigate the regulation of long ncRNAs during viral infection, we infected four different strains of mice with a mouse-adapted severe acute respiratory syndrome coronavirus (SARS-CoV) (5). These mice were selected due to their differential range in susceptibility phenotypes following infection with SARS-CoV or influenza virus and the capacity to pursue downstream quantitative trait locus (QTL) mapping of regulation and function in the Collaborative Cross. Weight loss in the animals was monitored over the course of the infection with SARS MA15 or influenza virus A/PR/8/34 as a measure of disease severity (Fig. 1). We then performed a whole-transcriptome analysis of collected lung tissue samples using next-generation sequencing (NGS). Directional cDNA libraries were constructed using the not-so-random (NSR) priming method (6), which enabled the profiling of polyadenylated, nonpolyadenylated, coding, and noncoding transcripts, but not small RNAs (6). FIG 1 Measurement of weight loss in four strains of mice following infection with SARS MA15 or influenza virus A/PR/8/34. (a) Over the course of a 2-day SARS-CoV infection, CAST/EiJ (CAST) mice lost 12% of their starting weight, PWK/PhJ (PWK) mice lost 20% of their starting weight, WSB/EiJ (WSB) mice lost 12% of their starting weight, and 129S1/SvImJ (129/S1) mice lost 10% of their starting weight. (b) Two days after infection with 500 PFU of influenza virus strain A/PR/8/34, 129/S1 mice lost 3% of their starting weight, PWK mice lost 7% of their starting weight, WSB mice lost 6% of their starting weight, and CAST mice lost 9% of their starting weight. See supplementary material for more details. We observed a large number of reads (1.5 to 7 million) that uniquely mapped to viral RNAs (viral genomic RNAs and transcripts) (Fig. 2) (see Table S1 in the supplemental material) in samples from virus-infected animals. From all samples, we obtained on average over 22 million reads that uniquely mapped to host genomic sites, including many that mapped to nonannotated intergenic regions (Fig. 2a; see Table S1 in the supplemental material). We reasoned that the transcriptional activities detected in nonannotated regions were largely from ncRNAs and that some could be differentially expressed in response to viral infection. To evaluate our approach for the identification of differentially expressed genes, we profiled the same samples using microarrays and compared the profiles with the profiles of the protein-coding part of the NGS data set. We observed a very good correlation (Pearson correlation coefficients of 0.73 to 0.8) between two platforms (see Fig. S2 in the supplemental material), and even better agreement between NGS and quantitative PCR (qPCR) (data not shown). FIG 2 Global classification of transcriptional activity in lung samples from SARS-CoV-infected mice. (a) Global classification of transcriptional activity. Short reads were assigned to one of six nonoverlapping categories. The exonic, intronic, and intergenic categories were defined by the genomic coordinates for UCSC known genes and include only reads that map to unique genomic locations. See Fig. S1 in the supplemental material for more details on read mapping. (b) Estimation of relative abundance of viral RNAs in lung samples from SARS-CoV-infected mice. (Top left) The relative abundances of viral RNAs in lung samples from SARS-CoV-infected mice estimated as the ratio of the total number of next-generation sequencing (NGS) reads uniquely mapped to the SARS-CoV genome to the total number of reads uniquely mapped to the mouse genome. The four mouse strains are indicated below the bars. The relative abundances of viral RNAs estimated as the differences between the qPCR threshold cycle (CT ) values obtained with primers designed for SARS protein-coding genes ORF1 (O1R6), SARS Envelope Reverse 1 (SER1), and SARS Spike Reverse 4 (SSR4) and those with primers designed for mouse 18S gene using the same strains as in the top left graph. Differential expression of long ncRNAs during SARS-CoV infection. First, we studied annotated non-protein-coding RNAs (ncRNAs); the compilation of annotated ncRNAs produced 10,986 nonoverlapping ncRNA loci (Materials and Methods). We found that 509 of these loci were differentially expressed during SARS-CoV MA15 infection (Fig. 3), 485 of which had more than 2.5-fold change in at least one of four mouse strains during infection, and 209 of which were all upregulated or all downregulated by at least 1.8-fold in three or more mouse strains (see Tables S2 and S3 in the supplemental material). Nearly all (504 of 509) were long ncRNAs (>200 nt). These results clearly show that there is widespread differential regulation of long ncRNAs in response to SARS-CoV infection. FIG 3 Examples of annotated ncRNA loci (a and b) and nonannotated genomic regions (c and d) differentially expressed during SARS-CoV infection. (a) An overview of short reads from whole-transcriptome analysis of mouse lung samples mapped to a 33-kb region of chromosome 3 displayed by Integrative Genome Viewer (IGV) browser (http://www.broadinstitute.org/igv). Each track represents data collected from a single mouse lung sample, with SARS-CoV-infected sample (+) and mock-infected sample (−) shown by the labeled arrows (+ or −). Infected samples are depicted in red, and mock-infected samples are depicted in blue. The strains of mice used are shown on the left. 129/S1 polyA+ represents short-read data generated from libraries separately created from the same samples with poly(A) selection. For reference, UCSC annotation of nearby protein-coding genes is shown at the bottom in blue. K4-K36_1026 is the entire K4-K36 domain of a large intervening ncRNAs (lincRNA) as identified in reference 7, which is upregulated during SARS infection, but no significant expression was observed in poly(A)-selected samples. The green box indicates that the locus was followed up using qPCR (the same for panels b, c, and d). (b) Overview similar to that in panel a for a 124-kb region of chromosome 6. The underlined UCSC annotation is noncoding. It was upregulated during SARS-CoV infection. (c) Overview similar to that in panel a for a 203-kb region of chromosome 11. The loci shown in orange indicate nonannotated genomic regions identified here as differentially expressed during SARS-CoV infection (as in panel d). These regions were downregulated. (d) Overview similar to that in panel a for a 202-kb region of chromosome 12. The locus in orange indicates an nonannotated genomic region upregulated during SARS-CoV infection. Next we systematically scanned the mouse genome for nonannotated regions that encoded transcripts differentially expressed during viral infection (Materials and Methods). In total, we uncovered 1,406 nonannotated genomic regions that did not overlap any annotated protein-coding genes (UCSC or Ensembl annotations) but that consistently had changes in expression of more than 1.4-fold (all upregulated or all downregulated) in at least 3 mouse strains during infection (Fig. 4; see Table S4 in the supplemental material). For 997 of these regions, we did not find overlap with any annotated loci (UCSC and Ensembl annotations), indicating that many infection-induced changes in RNA transcript abundance are not monitored by conventional microarrays. It also suggests that possibly other infection-related transcripts remain to be discovered under different experimental conditions. FIG 4 Characteristics of genomic regions differentially expressed during SARS-CoV infection. (a) The length distribution of genomic regions differentially expressed during SARS-CoV infection. The genomic regions are indicated below the graph as follows: All ncRNAs, 429 genomic regions that overlapped with one or more annotated ncRNAs; non-redundant ncRNA loci, 285 of 429 genomic regions that overlapped with the nonredundant set of 10,986 annotated ncRNAs as compiled in this study; Annotated ncRNAs overlapping with protein-coding genes, 144 of 429 genomic regions that overlapped with those annotated ncRNAs that we filtered out because of partial overlapping with a protein-coding gene (see Materials and Methods); Unknown, 977 genomic regions without any overlapping annotated genes; Antisense, 249 genomic regions that were antisense to annotated protein-coding genes; Intergenic, 1,157 genomic regions were located between annotated protein-coding genes; All regions, all 1,406 genomic regions identified. (b) Characteristics of genomic regions differentially expressed during SARS-CoV infection. The genome regions are depicted as in panel a. Annotations showing what the identified genomic regions overlap with are shown below the x axis as follows: piRNA, piwi-associated small RNAs; RNAz, conserved RNA secondary structures predicted by RNAz; Retrotransposon, retrotransposons of the SINE, LINE, LTR, and DNA superfamilies; Simple, simple repeats and low complexity; Other, remaining retrotransposons and repeats; None, no overlapping with the annotation categories above; Antisense, antisense to protein-coding genes annotated by UCSC or Ensembl. See text and Materials and Methods for details. Differential expression of long ncRNAs in response to altered innate immunity. We used qPCR to further evaluate the differential expression of a subset of ncRNAs in replicate samples. We selected 39 loci/regions that represented a variety of loci for the follow-up studies, including 19 nonannotated genomic regions, 13 annotated ncRNAs, 5 large intervening ncRNAs (lincRNAs [7]) partially overlapping with annotated protein-coding genes (therefore not included in our nonredundant set of annotated ncRNAs), plus two protein-coding genes (Mx1 and Ifit1) known to be regulated during viral infection. Importantly, we observed a very good agreement (Pearson correlation coefficients of 0.87 to 0.94) between SARS-CoV infection to mock infection expression log ratios obtained using NGS and the corresponding log ratios obtained using qPCR on the set of independent samples with multiple replicates (Fig. 5a; see Fig. S3 in the supplemental material). FIG 5 Comparison of infection to mock infection expression ratios for 37 differentially expressed ncRNAs and genomic regions. (a) Comparison of the log2 infection/mock infection expression ratios for 37 differentially expressed ncRNAs and genomic regions originally obtained by NGS and then by qPCR on independently collected lung samples from mice infected with SARS-CoV or influenza virus. The influenza virus used was A/PR/8/34 (PR8). Each column represents an independent sample collected from one infected mouse. For qPCR, individual replicate samples were compared to the average of 2 or 3 matched mock-infected samples to form infection/mock infection expression ratios. All samples were collected 2 days after infection. The color red on the heat map indicates upregulation during infection, while the color green indicates downregulation during infection. In the colored bar to the left of the heat map, 13 annotated ncRNAs are shown in dark orange, 19 nonannotated genomic regions are shown in blue, and 5 lincRNAs partially overlapping with annotated protein-coding genes are shown in green. The genomic locations of ncRNAs and genomic regions are shown in Fig. S4 in the supplemental material. The mouse strains are shown at the bottom of the panel. (b) Temporal expression profiles of the same selected ncRNAs and genomic regions (in the same order as in panel a) measured by qPCR in wild-type (129/S6), STAT1 knockout (STAT1−/−), and type 1 interferon receptor knockout (IFNAR−/−) mice infected by SARS-CoV. The values shown are the log2 ratios of the expression levels in infected replicate samples to the average expression levels in strain-matched mock-infected samples. The color red on the heat map indicates up regulation during infection, while the color green indicates downregulation during infection. The samples were collected 2, 5, and 9 days after infection (dpi 2, 5, and 9, respectively). The numbered lines to the right of the heat map indicate groups of ncRNAs and genomic regions based on their temporal expression changes during infection to guide visualization: 1, mostly downregulated in STAT1−/− mice; 2, consistently upregulated in all three types of mice; 3, mostly upregulated in wild-type and STAT1−/− mice, but not in IFNAR−/− mice, especially for the later time points; 4, highly upregulated in STAT1−/− mice, but not in wild-type and IFNAR−/− mice; 5, transiently upregulated at early time points in all three types of mice; 6, downregulated in STAT1−/− mice only and, as shown in panel a, upregulated in SARS-CoV-infected mice but downregulated in influenza virus-infected mice; and 7, no significant changes observed in all three types of mice. (c) Temporal expression profiles of the same selected ncRNAs and genomic regions (in the same order as in panel a) measured by qPCR in mouse embryonic fibroblasts (MEFs) from mice of strain 129S1/SvlmJ (129/S1 MEFs) and PWK/PhJ (PWK MEFs) treated with 500 U of mouse beta interferon (IFN) or influenza A virus A/Pr/8/34 (PR8) (MOI of 1 or 10). The values shown are the log2 ratios of the expression levels in samples from virus-infected or IFN-treated MEFs to the average expression levels in time- and strain-matched samples from mock-infected MEFs. The color red on the heat map indicates upregulation during infection, while the color green indicates downregulation during infection. The samples were collected 6 and 24 hours after IFN treatment or virus infection (shown below the heat maps). Mx1 and Ifit1, two protein-coding genes that are known to be upregulated during viral infection, are positive controls for in vitro treatment. To investigate whether the observed differential expression of long ncRNAs was specific to SARS-CoV infection or represented a more general host response to viral infection, we infected the same strains of mice with influenza virus A/PR/8/34 and used qPCR to quantify expression changes of the 37 selected ncRNAs and genomic regions in lung samples from infected animals. Interestingly, we found that most (35 of 37) of the selected ncRNAs and genomic regions were similarly differentially expressed during influenza virus infection (Fig. 5a). Thus, many long ncRNAs are differentially regulated during both SARS-CoV and influenza virus infections, suggesting that the differential regulation of long ncRNAs may be a common host response to respiratory viral infection. To determine the relationship between differential expression of long ncRNAs and innate immune signaling, we performed qPCR on lung samples obtained from a previous study in which mice lacking the type I interferon receptor (IFNAR−/−) or STAT1 (signal transducer and activator of transcription factor 1) (STAT−/−) were infected with SARS-CoV. In that study, we found that SARS-CoV infection resulted in the death of STAT−/− mice, but not IFNAR−/− mice (8). As shown in Fig. 5b, even for the set of 37 ncRNAs examined here, we observed unique patterns of expression changes over time. As expected, most (35 of 37 [95%]) of the selected ncRNAs and genomic regions were differentially expressed (P 200 nt) were compiled from UCSC known genes and three published studies (7, 10, 18). As it is not trivial to differentiate short reads mapped to the regions shared by overlapping transcripts, we clustered the overlapping annotated transcripts into single loci. We then filtered out those loci that overlap with any protein-coding transcripts as annotated or predicted by UCSC or Ensembl to minimize the possibility of the inclusion of protein-coding genes. Obviously, many genuine ncRNAs were excluded from our consideration because of this conservative approach. We obtained 10,986 nonoverlapping ncRNA loci (8,008 of which were larger than 200 nt), in addition to 21,565 protein-coding loci. We estimated the transcript abundance of a locus by counting all reads mapped to the locus, instead of only exonic regions as is typically done when using RNA-Seq for protein-coding loci (19), as the gene structures of many ncRNAs were unknown. We then normalized the raw read counts by the length of the locus and the total uniquely mapped reads for each sample and represented the normalized expression levels similarly as typical RPKMs (reads per kilobase per million reads). An offset of 0.05 was added to all RPKMs before calculating log ratios to avoid taking the log of 0 and to decrease the variability of the log ratios for loci with low read counts. To balance individual strain differences, we used two complementary criteria that differed in stringency to select differentially expressed ncRNAs: the first being a relatively large fold change (>2.5-fold) in normalized expression during infection in at least one mouse strain and the second being consistent up- or downregulation during infection across multiple strains (3 or more strains here) but with a slightly smaller difference in expression (>1.8-fold) within each strain. In both cases, we also required that the locus must have at least 20 uniquely mapped reads in at least one sample when calculating the ratios between samples from virus-infected and mock-infected animals. Expression profiling using oligonucleotide microarray. cRNA probes were generated from each sample using the Agilent one-color Quick-Amp labeling kit. Individual cRNA samples were hybridized to Agilent 4 × 44 mouse whole-genome oligonucleotide microarrays according to the manufacturer’s instructions. Slides were scanned with an Agilent DNA microarray scanner, and the resulting images were analyzed using Agilent Feature Extractor software. Data were warehoused in the Katze LabKey system (LabKey, Inc., Seattle, WA) and preprocessed using Agi4x44PreProcess (version 1.4.0), a Bioconductor package in R (20). Comparison of differential expressions from next-generation sequencing and microarray. We first mapped all probes represented on the array to the mouse reference genomic sequences and selected only those that mapped uniquely and perfectly. We then mapped those selected probes to the assembled loci as described above for both protein-coding and noncoding loci. When more than one probe mapped to the same locus, we selected the probe with the highest normalized intensity averaged over all eight samples. For each mouse strain, for samples from virus-infected and mock-infected animals, we then compared the log2 ratios of mapped probe intensities to the log2 ratios of the RPKMs for the same loci. Identification of novel transcripts by a genome-wide scan. Briefly, we first assigned reads that were mapped uniquely in the genome to their site of origin. To identify regions differentially expressed during viral infection, we employed a sliding window approach to compare the expression levels between a pair of samples (infected versus mock-infected samples in this study): we slid windows, scored each window based on the number of uniquely mapped reads, and selected intervals with fold changes between two samples above a threshold level. Specifically, we did the following. (i) We fixed a window size (w) and slid it across the genome with a moving step (s). For each window, we computed a score, Sw , as the number of reads aligned within the window, normalized by the total number of uniquely mapped reads for each sample. (ii) To identify differentially expressed windows, we created ratios of scores between the pair of samples and selected those windows passing a threshold (fs ) for fold change. (iii) We merged overlapping windows into large intervals if they were differentially expressed in the same direction. (iv) To obtain larger intervals, we joined identified neighboring intervals if there were a low number of reads in between and the larger intervals formed by neighboring ones were also differentially expressed, judged by a threshold fj . (v) To increase the confidence, we then selected only those intervals that were differentially expressed consistently in at least k pairs of samples (here all upregulated or all downregulated in at least 3 out of 4 mouse strains). (vi) We then removed those intervals overlapping protein-coding genes annotated by UCSC or Ensembl, merged remaining overlapping intervals identified from all scans into nonoverlapping genomic regions, and recalculated expression ratios. We searched the identified genomic regions against different annotations, including noncoding RNA annotations from ncRNA.org (http://www.ncrna.org/). Annotation of piwi-associated small RNAs (piRNAs) were obtained from the functional RNA database (21). Conserved RNA secondary structures (P > 0.5) were predicted based on the 30-way multiple alignments downloaded from the UCSC genome browser (http://genome.ucsc.edu) using RNAz (22). The repeat information was downloaded from RepeatMasker Track of the UCSC genome browser. For simplicity, the different classes of repeats were grouped similarly as previously described (23), and denoted as “retrotransposon” for short interspersed nuclear elements (SINE), long interspersed nuclear elements (LINE), long terminal repeat elements (LTR), and DNA repeat elements (DNA) superfamilies; “Simple” for simple repeats and low complexity; and “Others” for the rest. Quantitative real-time PCR. Quantitative real-time PCR was used to validate expression of noncoding RNA. For each sample, total RNA input of approximately 100 ng was used, and cDNA was synthesized by reverse transcription using the QuantiTect reverse transcription kit (Qiagen). Primer sets for SYBR green quantitative reverse transcription-PCR (qRT-PCR) were designed using Primer3 (24). For each locus of interest, we designed two or more pairs of primers, and we selected the one with the best amplification efficiency in samples across all mouse strains for the subsequent quantification. Primer sequences are available in Table S5 in the supplemental material. qPCR was performed using an ABI 7900HT real-time PCR system, and each assay was run in triplicate using Power SYBR green PCR master mix (Applied Biosystems). We chose the 18S rRNA gene for normalization using geNORM (25) and assaying multiple endogenous controls across all samples. These endogenous controls were the 18S rRNA gene, actin beta (Actb), beta-2 microglobulin (B2m), glyceraldehyde-3-phosphate dehydrogenase (Gapdh), hypoxanthine guanine phosphoribosyl transferase (Hprt1), phosphoglycerate kinase 1 (Pgk1), and transferrin receptor (Tfrc). We selected 39 candidates representing a variety of loci for follow-up by qPCR. We required that candidate loci have genomic locations containing unique sequences for designing PCR primers and a reasonable read coverage suggesting efficient amplification.