Introduction As is the case with many natural RNAs, the function of the HIV RNA genome is strongly linked to its ability to form higher-order structure and to interact with protein effectors in each stage of its replication cycle. The 5′ end of the HIV genome alone contains a noncoding, highly structured region that is involved in viral packaging, dimerization, pairing with the cellular tRNA primer for cDNA synthesis, and binding numerous viral proteins [1,2]. Immediately downstream of this regulatory region, the HIV genome contains nine open reading frames. The first protein encoded by the HIV-1 genome is Gag, which is ultimately cleaved into a complex set of proteins, including matrix, capsid, and nucleocapsid . The nucleocapsid protein has two dissimilar RNA binding activities. First, as part of the Gag precursor protein, the nucleocapsid domain specifically recognizes and directs the HIV genome to be packaged into new virions [3–5] in the context of a vast excess of other cellular RNAs. In contrast, the nucleocapsid protein also has a significant duplex destabilizing activity, which plays a key role in facilitating the RNA annealing and rearrangement events essential for viral reverse transcription [6,7]. Because of its importance in regulating replication and for governing interactions with protein cofactors, extensive efforts have been directed towards developing structural models for the HIV-1 genome and for identifying candidate interaction sites for the nucleocapsid protein. Multiple models based on phylogenetic, chemical mapping, and mutagenesis approaches have been proposed [8–17]. In general, agreement between proposed structures is limited to highly structured hairpins; large zones of disagreement between models are common. The available experimental and sequence information is not sufficient to support one model over another. This lack of consensus in understanding HIV genome structure reflects challenges common to the analysis of any large RNA in its native biological context. One of the most successful methods for determining an RNA secondary structure has been phylogenetic covariation analysis [18,19]. However, this method has a narrow range of usefulness in that homologous RNAs must be similar enough to form the same structure, but simultaneously exhibit sufficient polymorphism to be informative. Few RNAs meet this standard. In the case of HIV and most viral RNAs, sequences are typically too similar to each other to provide sufficient constraints over large regions. Alternatively, RNA structural information can be inferred by treatment with chemicals or enzymes that discriminate between paired and unpaired nucleotides [20,21]. This reactivity information is then used to choose among various models, usually predicted with the assistance of thermodynamic folding algorithms. Conventional chemical and enzymatic mapping information has a narrow dynamic range and is usually available only for 25% to 50% of nucleotides in an RNA. As a result, conventional mapping experiments tend to focus on short pieces of RNA in artificial contexts and rely heavily on thermodynamic prediction and extrapolation to relate structures and protein binding sites identified in vitro to the biology of large RNAs in their native contexts. When used against intact viral particles, conventional probes have thus far also been unable to detect protein–RNA interactions inside HIV virions . Therefore, comprehensive and accurate analysis of any cellular or viral RNA in a relevant biological context requires a new technology that may be used for any RNA sequence under a wide variety of biologically relevant conditions. Our approach is based on the observation that electrophiles, such as N-methylisatoic anhydride (NMIA), react selectively with flexible RNA nucleotides at the ribose 2′-hydroxyl group (Figure 1) . Figure 1 Scheme for RNA SHAPE Chemistry The RNA is exposed to NMIA at a concentration that yields approximately one 2′-O-adduct per 300 nucleotides (nts). Adducts are detected by their ability to inhibit primer extension by reverse transcriptase. A control extension reaction omitting NMIA to assess background, along with dideoxy sequencing extensions to assign nucleotide positions, are performed in parallel. These combined steps are called selective 2′-hydroxyl acylation analyzed by primer extension, or SHAPE [22–24]. Because the four canonical RNA nucleotides each contain a 2′-hydroxyl group, local nucleotide flexibility at all sites in an RNA is quantitatively interrogated in a single experiment. In this work, we couple SHAPE chemistry with automated data readout and analysis systems so that hundreds of nucleotides of RNA structure can be analyzed in a single experiment. The overarching impact of this technology begins to realize the goal of making RNA structure analysis roughly as simple as contemporary DNA sequencing. High-throughput SHAPE (hSHAPE) makes it possible to rapidly measure RNA backbone flexibility at greater than 90% of the nucleotides in an RNA under a variety of biologically relevant conditions. To demonstrate the power of hSHAPE while examining one of the most functionally important regions of an HIV genome, we applied hSHAPE to the first 900 nts of the 5′ end of the HIV-1 genome, under four biologically informative states. The data from these experiments allow us to address a comprehensive set of challenges that all require accurate knowledge of the structure of the HIV RNA genome: Does the viral genome have a global, long-range architecture, or is it largely organized as a series of isolated structural elements? Is it possible to identify RNA regulatory motifs based on their underlying RNA structure? Can consensus RNA binding sites be identified for the viral nucleocapsid protein, and if so, are these organized in domains? Can nucleocapsid interaction sites that function in specific RNA binding and genome packaging be distinguished from those that are targets for duplex destabilization by this same protein? If two classes of interaction sites for the nucleocapsid protein can be identified, do the specific binding and duplex destabilization activities operate in the same or in neighboring regions of the HIV genome? Results hSHAPE Strategy To create the technology necessary to analyze long regions of an RNA in biologically relevant environments in a single experiment, we performed each extension using a primer labeled with a color-coded fluorophore. The resulting cDNA products (from the (+) and (−) reagent reactions plus two sequencing traces) are combined and resolved in one multi-fluor run by automated capillary electrophoresis. In a single multiplex hSHAPE read, we typically obtain 250–400 nts of quantitative RNA structural information at single-nucleotide resolution (Figure 2A). Figure 2 Analysis of HIV-1 RNA Genome Structure Using hSHAPE (A) Intensity versus elution time for an hSHAPE analysis resolved by single capillary electrophoresis using the HIV-1 in vitro transcript. The (+) and (−) NMIA reactions (red and blue) are offset from the A and C sequencing lanes (black and green) for clarity. (B) Whole-trace peak integration of a section of data from part (A). (+) and (−) peaks were fit to Gaussian curves (thin black lines) and aligned to the sequencing channels. (C) Signal decay correction for the (+) NMIA trace assuming a constant probability for extension at each nucleotide (black line). (D) Processed SHAPE reactivities as a function of nucleotide position. Highly reactive nucleotides (red and orange bars) report flexible positions in the RNA. (E) Absolute SHAPE reactivities superimposed on a secondary structure model for the TAR and Poly(A) stem-loops. These elution time versus dye amount data resemble DNA sequencing information analyzed by capillary electrophoresis, and require similar preprocessing steps, such as baseline correction and corrections for fluorescent multiplexing . However, whereas DNA sequencing requires detection of only the most intense peak at each position, the amplitudes in the (+) and (−) NMIA reagent channels in an hSHAPE experiment report quantitative RNA structural information. Peaks with little or no reactivity in the (+) NMIA channel correspond to RNA nucleotides constrained by base pairing or other interactions, whereas tall peaks indicate conformationally flexible positions (red and blue traces, Figure 2A). The dynamic range that distinguishes flexible from paired nucleotides is approximately 30-fold. We calculated quantitative SHAPE reactivity information at every nucleotide position by developing hSHAPE-specific processing algorithms that (1) align the (+) and (−) NMIA channels to the RNA sequence (Figure 2A), (2) integrate the (+) and (−) NMIA peaks (Figure 2B), (3) correct signal decay (black line, Figure 2C), and (4) normalize SHAPE reactivities to a universal scale (Figure 2D). In this case, these steps produce a single-nucleotide resolution view of RNA flexibility for HIV-1 nucleotides 8 through 286 (Figure 2D). To verify the accuracy of hSHAPE, we superimposed SHAPE reactivities on the well-characterized TAR and poly(A) stem-loops (nucleotides 1–104), which is the only region in the first 900 nts of the HIV RNA genome longer than approximately 50 nts for which prior analyses have converged on a single structural consensus [9,15,16]. The SHAPE reactivity information is exactly consistent with the consensus secondary structure model for this region (Figure 2E). Nucleotides with normalized SHAPE reactivities greater than 0.5 are almost always single stranded (orange and red columns, Figure 2E), whereas positions with reactivities less than approximately 0.2 (purple bars, Figure 2E) are almost always paired. Reactivities between these values (green) may be paired or participate in other partially constraining interactions. SHAPE reactivities also accurately report fine-scale structural differences. For example, nucleotides in the UCU bulge in the TAR stem show intermediate reactivities (Figure 2E), consistent with nuclear magnetic resonance (NMR) studies  that indicate that these nucleotides are partially stacked. Structural Analysis of Four HIV-1 Genome States Our goal was to determine the structure of and derive biological inferences for the 5′ region of the HIV-1 RNA genome, as it exists inside wild-type viral particles. As will be shown below, hSHAPE provides an extraordinarily detailed and high-resolution view of the HIV-1 genome inside authentic virions that reflects multiple biologically important RNA–RNA and RNA–protein interactions. To identify and characterize virion-specific RNA conformational changes and RNA–protein interactions, we used hSHAPE to analyze the structures of four states in total. In addition to (1) genomic RNA inside virions (the in virio state), we compared the structure of the in virio state with three simpler states involving (2) authentic HIV-1 genomic RNA that had been gently, but completely, deproteinized and extracted from virions (ex virio), (3) genomic RNA inside virions, but in which nucleocapsid-RNA interactions were selectively disrupted in situ by treatment with Aldrithiol-2 (AT-2 treated, described in detail below), and (4) a 976-nt HIV-1 transcript generated in vitro in the absence of any viral component. We used the protein-free ex virio RNA as the reference state for comparison. This RNA state is strongly influenced by the authentic virion environment but simultaneously lacks the complex influence of bound proteins. We constructed quantitative, single-nucleotide resolution profiles for the first 900 nts, or 10% of the HIV-1 genome, for each of these four states by combining the information from overlapping, and highly reproducible (Figure 3A) hSHAPE reads. Two to three independent repetitions were obtained for each primer; standard deviations are small, on the order of 0.1 normalized SHAPE unit or less. By combining the results of 32 individual hSHAPE experiments of approximately 300 nts each, we obtained structural information for 94% of all nucleotides in these four states plus analysis of three structural mutants, for a total analysis of 9,100 nts (Figures 3B and S1). Figure 3 Single Nucleotide Resolution hSHAPE RNA Structure Analysis Is Quantitative and Highly Reproducible (A) SHAPE reactivities of two overlapping extension reactions for the in vitro transcript using primers that anneal approximately 200 nucleotides apart. Solid and open bars indicate normalized SHAPE reactivities for primers that anneal at nucleotides 342–362 and 535–555, respectively. Error bars represent standard deviations calculated from independent experiments using the same primer. (B) Complete SHAPE data for HIV-1 genomic RNA analyzed in virions and in the ex virio, AT-2–treated, and in vitro transcript states (see Dataset S1). Structural Model of the HIV-1 Genome The nucleotide-resolution SHAPE reactivities for the large analyzed region of the HIV-1 genome represent an unprecedented amount of structural information with which to characterize RNA structure, protein binding sites, and conformational changes that differentiate our four genome states. However, comprehensive single-nucleotide resolution data do not, by themselves, yield a secondary structure model for an RNA. We incorporate SHAPE reactivity information as an additional quasi-energetic constraint into an existing thermodynamic model [27,28] for RNA secondary structure prediction. Algorithms used to predict RNA structures from sequence show large increases in accuracy when experimental constraints are included in the prediction [28–30]. Ongoing work from our laboratories shows that using SHAPE information to constrain RNA secondary structure prediction has a dramatic impact on the accuracy of predicted structures. For example, prediction accuracy improves from 52% to 90% for the RNase P specificity domain  and from 38% to approximately 90% for the 1,542-nt Escherichia coli 16S rRNA (unpublished data). These predictions feature overall topologies that closely resemble the correct structure, with errors generally limited to local rearrangements at multi-helix junctions. SHAPE reactivities provide model-free information about the extent of structure at each nucleotide. Therefore, in addition to proposing a complete secondary structure for our ex virio reference state (Figure 4A), we directly assessed the well determinedness of each helix in the secondary structure by increasing the relative weight of the SHAPE information in calculating low free-energy structures. We term this analysis the “pairing persistence” of each helix. Highly persistent helices (black and purple bars indicating base pairing in Figure 4A) form even when SHAPE reactivities were used to impose large pairing penalties for even slightly reactive nucleotides. Less persistent helices (blue and green bars, Figure 4A) form only when the SHAPE contributions to the constraints are decreased. Figure 4 Architecture and Protein Modulation of the Structure of the HIV-1 NL4–3 RNA Genome (A) Secondary structure model generated by SHAPE-constrained folding of the ex virio state. All four states fold to this structure. Nucleotides proposed to form intermolecular base pairs with a second monomer are enclosed within a black box; the AUG start codon for the Gag polyprotein (nucleotide 336) is boldface. Nucleotides are colored according to their SHAPE reactivity; bars indicating base pairing are colored by their pairing persistence. Effects of pretreatment of viral particles with AT-2 are indicated with closed and open arrowheads. Clustered sites that show a strong increase in reactivity with AT-2 treatment (and report specific nucleocapsid binding sites) are emphasized in cyan; proposed primary and secondary sites are identified with double (**) and single asterisks (*), respectively. Sites showing a strong reduction in SHAPE reactivity (and reporting the structure destabilizing activity of nucleocapsid) are emphasized with solid and dashed gray arrows. The bound tRNA(lys3) is shown starting at nucleotide 33. The pseudoknot involving positions 75–84/443–449  was not predicted directly by our algorithm, but by inference because both loop regions were unreactive towards SHAPE chemistry. (B) The 5′ regulatory domain is more highly structured than the 3′ coding region. Median SHAPE reactivities (solid lines) were calculated over rolling windows of 45 and 135 nts. Median reactivities for the entire 5′ regulatory and 3′ coding regions (dashed lines) are 0.13 and 0.40, respectively. (C) Box plot analysis  of distinct reactivity distributions for the 5′ and 3′ domains. Boxes outline middle 50% of each dataset; medians are shown with heavy lines. Open circles indicate values >1.5 times the interquartile range (boxed) and are considered outliers. The fences (small horizontal lines above and below the box) are the largest or smallest non-outlier values. The 5′ Regulatory Domain Is More Highly Structured Than Adjacent mRNA-Like Coding Sequences The 5′ end of the HIV genome contains two functional regions whose boundary is the AUG start codon for the Gag coding sequence (nucleotides 336–338; in bold, Figure 4A). Positions 5′ of the AUG start codon form a 340-nt–long noncoding regulatory domain that plays multiple important roles in the viral replication cycle. In contrast, nucleotides 3′ of the start codon contain the Gag protein coding region, of which we have analyzed approximately 560 nts. An important unresolved issue has been whether regulatory motifs might be identified from single RNA sequences. This is a challenging problem, and it is currently not possible to distinguish coding versus noncoding regions by thermodynamic secondary structure prediction alone [31,32]. hSHAPE analysis directly addresses this issue in two ways. First, the median SHAPE reactivity, a metric for the typical amount of structure in the two regions, is 0.14 for the 5′ regulatory domain and 0.40 for the 3′ mRNA-like region (dashed lines, Figure 4B). Differences in SHAPE reactivities between the two regions are statistically significant (Wilcoxon rank sum test p-value 4. RNAstructure is available for download at http://rd.plos.org/pbio.0060096.2. Statistical analyses, general procedures. To analyze differences in SHAPE reactivities between groups of nucleotides, such as comparing the 5′ regulatory versus 3′ protein coding regions, several tests were applied to determine the statistical characteristics of the subgroups and to determine the appropriate statistical measures of differences. We applied Levene's test to analyze homoscedasticity, which confirmed that the SHAPE reactivity variances between analyzed groups of nucleotides are equivalent. However, quantile-quantile (Q-Q) plots of these data indicated departures from normality for the SHAPE reactivity data, so the standard Student t-test could not be applied. Instead, we applied the Wilcoxon rank sum test to analyze the statistical SHAPE reactivity differences between groups, which eliminates dependence on normality by rank-ordering the reactivity data and deriving statistics from the rankings. In the small number of cases in which reactivity data were missing for a nucleotide in one group, then that nucleotide was removed from the other groups in a given comparison. All analyses were performed using the open source R package (version 2.6.1) . Differences in structure between regulatory and coding domains. SHAPE reactivity data for the ex virio state were divided into two groups corresponding to the 5′ regulatory (nucleotides 1–335) and Gag coding sequences (nucleotides 336–906) and analyzed by the Wilcoxon rank sum test. Differences between the 5′ regulatory and 3′ coding regions were highly significant (p < 0.0001). Absence of differences between TAR and DIS SHAPE reactivities for in vitro transcript, ex virio, and AT-2–treated states. Reactivity data were analyzed for the TAR (nucleotides 1–57) and DIS (nucleotides 236–282) regions. Three groups of reactivities were collected, reflecting each RNA state (in vitro, ex virio, and AT-2 treated) with 141 and 129 nts for the TAR and DIS regions, respectively. We applied a one-way analysis of variance (ANOVA) to the three groups for each of the regions, to test whether they displayed statistically significant differences among groups for either region. The F ratio was calculated to determine whether the means of individual groups, corresponding to the three states, represent significant variation from the overall mean. The initial F ratio values calculated showed no significant difference between groups, with p-values of 0.63 for TAR and 0.98 for DIS, respectively. However, because SHAPE reactivities are not normally distributed, the F ratio can be inaccurate. To assess whether the initial F ratio was representative of the underlying distribution, we performed bootstrap resampling of the SHAPE data from the groups, with replacement . We recalculated the F ratio for each of 15,000 repetitions, then calculated p-values by determining the number of times F ratios with larger (more extreme) values were obtained. The bootstrap–re-estimated p-values were 0.625 for TAR and 0.974 for DIS, in close agreement with the initial calculation. This indicates that the original F ratio, which showed no statistical difference between these groups, is representative of the lack of difference between the TAR and DIS structures in these three HIV-1 RNA genome states. Statistical significance of nucleocapsid binding sites. SHAPE reactivity data were divided into two groups corresponding to the in virio and AT-2–treated states. For each state, the nucleotides were divided into the three nucleocapsid protein binding classes (Figures 7 and 8). This resulted in two groups of 34, 42, and 28 nts for the Class 1, 2, and 3 sites, respectively. Each of these were compared using the Wilcoxon rank sum test. For all three classes of nucleocapsid protein interaction sites, p-values < 0.001, indicating statistically significant differences in the SHAPE reactivity between in virio and AT-2–treated states. We also performed the inverse analysis, using all nucleotides that were not associated with any of the three classes of nucleocapsid sites. The Wilcoxon test yielded a p-value of 0.859, indicating that nucleotides outside of these sites were statistically equivalent. Information content of nucleocapsid protein binding sites was calculated as described [54,55]. Because the number of sequences was relatively small, information content shows some variation depending on the specific alignment. For example, information content was 9.6 bits for the alignment shown in Figure 10; alternative alignments yielded information contents of 8 to 12 bits. Supporting Information Dataset S1 SHAPE Reactivity Data This dataset contains the complete SHAPE reactivity data for the ex virio, in virio, AT-2 treated, and in vitro transcript states. (346 KB XLS) Click here for additional data file. Figure S1 hSHAPE Accurately Detects Small Changes in HIV-1 Genome Structure In this figure, structures are illustrated schematically at the left. Representative nucleotides are colored by their experimental SHAPE reactivity using the scale shown in Figure 4. Reactivity histograms are shown at the right; large upward or downward pointing colored arrows indicate increases or decreases in SHAPE reactivity, respectively. (A) Deletion of the U22CU24 bulge in TAR. Native and ΔU22CU24 histograms are black and purple, respectively. (B) SHAPE analysis of the 79–85/443–339 pseudoknot. Pseudoknot structure was analyzed for the 976 nt in vitro transcript (black histogram) as compared to a 362 nt RNA that lacks sequences required to form the pseudoknot (cyan). (C) Ex virio genomic RNA bound to tRNA(lys) (red) undergoes a local rearrangement to maximize intra-molecular pairs (orange) upon heating and refolding. (294 KB PDF) Click here for additional data file. Accession Numbers The GenBank (http://www.ncbi.nlm.nih.gov/Genbank) accession number for the pNL4–3 molecular clone is AF324493.