Blog
About

417
views
0
recommends
+1 Recommend
1 collections
    24
    shares
      • Record: found
      • Abstract: found
      • Article: not found

      Improving CRISPR-Cas nuclease specificity using truncated guide RNAs

      Read this article at

      ScienceOpenPublisherPMC
      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          Clustered, regularly interspaced, short palindromic repeat (CRISPR) RNA-guided nucleases (RGNs) are highly efficient genome editing tools. 1–3 CRISPR-associated 9 (Cas9) RGNs are directed to genomic loci by guide RNAs (gRNAs) containing 20 nucleotides that are complementary to a target DNA sequence. However, RGNs can induce mutations at sites that differ by as many as five nucleotides from the intended target 4–6 . Here we report that truncated gRNAs, with shorter regions of target complementarity <20 nucleotides in length, can decrease undesired mutagenesis at off-target sites by as much as 5000-fold or more without sacrificing on-target genome editing efficiencies. In addition, combining truncated gRNAs with pairs of Cas9 variants that nick DNA (paired nickases) can lead to further reductions in off-target mutations. Our results delineate a simple, effective strategy to improve the specificities of Cas9 nucleases or paired nickases. The Streptococcus pyogenes Cas9 nuclease (hereafter referred to as Cas9) can robustly induce insertion or deletion mutations (indels) or precise alterations via repair of Cas9-induced double-stranded breaks (DSBs) by non-homologous end-joining (NHEJ) or homology-directed repair (HDR), respectively. However, unwanted indel mutations can also be induced at off-target sites sharing sequence similarity to the on-target site 4–6 . Recently, several approaches to improve the specificity of RNA-guided Cas9 have been recently described including truncation of the 3' end of gRNA (which is derived from the tracrRNA domain that is believed to mediate interaction with Cas9) or addition of two G nucleotides to the 5' end of the gRNA (just before the 20 nt complementarity region); however, RGNs utilizing these altered gRNAs show decreased on-target activities 6, 7 . Alternatively, a “paired nicking” strategy (originally implemented with pairs of closely spaced zinc finger nickases 8 ), in which two gRNAs targeted to adjacent sites on opposite DNA strands each recruit a Cas9 variant (Cas9-D10A) that nicks DNA instead of cutting both strands 7, 9, 10 , can reduce mutation frequencies at known off-target sites of single gRNA-guided Cas9 nuclease in human cells 7, 10 . Nevertheless, indel mutations can still be observed at some off-target sites 10 and the addition of a second gRNA might introduce new off-target mutations because a single gRNA-directed Cas9 nickase can efficiently induce indels at some sites 7, 9, 11 . In addition, the need to express appropriately positioned and oriented paired gRNAs presents technical challenges if implemented with multiplex 12–14 or genome-wide library-based applications of RGNs 15, 16 . Finally, the paired nickase strategy cannot be used to improve the specificities of catalytically inactive Cas9 (dCas9) fused to heterologous effectors, such as transcriptional activation domains 17–19 . Thus, the development of additional methods to improve the specificity of CRISPR-based systems remains an important priority. We hypothesized that off-target effects of RGNs might be minimized by decreasing the length of the gRNA-DNA interface. Such an approach might seem counterintuitive, but we 20 and others 7, 10 have shown that lengthening the 5' end of the complementary region can reduce on-target editing efficiency, with some of these longer gRNAs processed back to standard length in human cells 10 . In contrast, certain gRNAs bearing either truncations or progressively greater numbers of mismatches at the 5' end of their complementarity targeting regions have been shown to retain robust Cas9-mediated on-target cleavage activities 4, 9, 21 . We hypothesized that these 5'-end nucleotides might not be necessary for full gRNA activity and that these nucleotides might normally compensate for mismatches at other positions along the gRNA-target DNA interface; therefore, we reasoned that shorter gRNAs might be more sensitive to mismatches and thus more specific (Supplementary Fig. 1). To test our predictions, we constructed a series of progressively shorter gRNAs for a target site in the EGFP reporter gene containing 15, 17, 19, and 20 complementary nucleotides (Online Methods and Fig. 1a). We measured the abilities of these gRNAs to direct Cas9-induced indels at this target site in human U2OS.EGFP cells by quantifying mutation of a single integrated and constitutively expressed EGFP gene 4, 22 (Online Methods). gRNAs that have 17 or 19 nucleotides of target complementarity showed activities comparable to the full-length gRNA with 20 nucleotides of complementarity, whereas a gRNA containing 15 nucleotides of complementarity failed to show activity (Fig. 1b). To extend the generality of these findings, we assayed full-length gRNAs and matched gRNAs with 18, 17 and/or 16 nucleotides of complementarity to four additional EGFP reporter gene sites (EGFP sites #1, #2, #3 and #4; Fig. 1c). For all four target sites, gRNAs with 17 and/or 18 nucleotides of complementarity functioned as efficiently as (or, in one case, more efficiently than) their matched full-length counterparts (Fig. 1c). gRNAs with only 16 nucleotides of complementarity showed substantially decreased or undetectable activities at the two sites for which they could be made (Fig. 1c). Given these results, in this report we refer to truncated gRNAs with complementarity lengths of 17 or 18 nucleotides as “trugRNAs” and RGNs using these tru-gRNAs as “tru-RGNs”. To determine whether tru-RGNs could efficiently edit endogenous genes, we constructed trugRNAs for seven sites in three human genes (VEGFA, EMX1, and CLTA), including four sites targeted in previous studies 4–6, 10 (Fig. 1d). Five of these seven tru-gRNAs induced Cas9-mediated indel mutations with efficiencies comparable to those mediated by matched standard length RGNs (Fig. 1d) (Online Methods) with the two tru-gRNAs that showed lower activities than their full-length counterparts still exhibiting high absolute rates of mutagenesis (means of 13.3% and 16.6%) (Fig. 1d). Sanger sequencing confirmed that indels introduced by tru-RGNs originate at the expected site of cleavage and are essentially indistinguishable from those induced by their standard counterpart RGNs (Supplementary Fig. 2). We also found that tru-gRNAs containing a mismatched 5' G and 17 nucleotides of complementarity could efficiently direct Cas9-induced indels, whereas those bearing a mismatched 5' G and 16 nucleotides of complementarity showed lower or undetectable activities compared with matched full-length gRNAs (Supplementary Fig. 3), consistent with our results that a minimum of 17 nucleotide of complementarity is required for efficient RGN activity. Finally, we tested whether tru-RGNs could introduce a BamHI restriction site insertion via HDR with ssODNs and found that they did so for two sites in the VEGFA and EMX1 genes with efficiencies comparable to or higher than matched standard RGNs (Fig. 1e). We conclude that tru-RGNs can function efficiently to introduce on-target indels or HDR-mediated genome editing events in human cells. To assess the specificities of tru-RGNs, we first tested whether they possess greater sensitivity to Watson-Crick mismatches at the gRNA-DNA interface by testing variants of the full-length and trugRNAs we had previously made for EGFP sites #1, #2 and #3 (Fig. 1c above). These variant gRNAs harbor single substitutions or adjacent double substitutions at each position within the complementarity region (except the 5' G) (Fig. 2a and 2b). Using the EGFP disruption assay in human U2OS.EGFP cells, we found that tru-RGNs were generally more sensitive to single and adjacent double mismatches than standard RGNs (compare bottom and top panels of Fig. 2a and 2b). Magnitudes of sensitivity to mismatches were site-dependent, with tru-RGNs to EGFP site #1 exhibiting less sensitivity to single and double mismatches than tru-RGNs to EGFP sites #2 and #3. Note that EGFP site #1 tru-gRNAs have 18 complementary nucleotides whereas the others have 17. We next examined whether tru-RGNs have reduced genomic off-target effects in human cells by using matched full-length and tru-gRNAs targeted to VEGFA site 1, VEGFA site 3,and EMX1 site 1 (Fig. 1d). We chose these target sites because previous studies had defined a total of 13 off-target sites for the full-length gRNAs targeted to these sequences 4, 5 . tru-RGNs exhibited substantially reduced mutagenesis activity relative to matched standard RGNs at all 13 previously identified off-target sites in human U2OS.EGFP cells (Table 1), with 11 sites having mutation frequencies below the reliable detection limit (2 – 5%) of the T7 Endonuclease I (T7EI) assay used for these experiments (Table 1; Online Methods). We also observed similar results in a different human cell line (FT-HEK293 cells) (Supplementary Table 1). To enable more sensitive detection of off-target mutations, we used high-throughput sequencing to assess 12 of the 13 off-target sites we had analyzed by T7EI assay (for technical reasons, we were unable to amplify the required shorter amplicon for one of the sites) as well as an additional, previously identified 5 off-target site for EMX1 site 1 in U2OS.EGFP cells (Fig. 2c). These sequencing results showed that tru-RGNs induced substantially decreased mutagenesis frequencies at all 13 off-target sites relative to matched standard RGNs (Fig. 2c and Supplementary Table 2), with some sites showing decreases of ~5000-fold or more (Fig. 2d). No indel mutations were observed with RGNs f or off-target sites (OT1–4 and OT1–11). Therefore, for these two sites, we conservatively estimated a likely upper boundary for the average indel frequencies and calculated a minimum improvement in tru-RGN specificities for these sites of >10,000 or more over standard RGNs (Online Methods; Fig. 2d). We then sought to assess whether tru-RGNs might induce additional off-target mutations in the human genome beyond those previously identified for matched full-length RGNs. Based on the results of our EGFP disruption assay, we reasoned that genomic sites with the fewest mismatches compared to the on-target site would be the most likely to be mutated. Therefore, we computationally identified sites in the human genome with one, two or three mismatches relative to tru-gRNAs targeted to VEGFA site 1, VEFGA site 3 and EMX1 site 1 (Supplementary Table 3a), excluding known off-target sites that had already been examined by deep sequencing above (Supplementary Table 3b). We used the T7EI assay to examine 97 potential off-target sites including all with one mismatch and either all or some with two or three mismatches (Supplementary Table 3c and Supplementary Table 4). Only one of these 97 sites showed detectable levels of indel mutations by T7EI assay in human U2OS.EGFP cells and none showed detectable indels in human FT-HEK293 cells (Supplementary Table 4). The one site for which indels were observed was also mutated by the corresponding standard full-length RGN (Supplementary Results and Supplementary Fig. 4), demonstrating that this off-target site is not unique to the tru-RGN. We also used deep sequencing to examine 30 of the most closely matched potential off-target sites (including all sites with one mismatch for all three RGNs and nearly all sites with two mismatches for the RGNs targeted to VEGFA Site #1 and EMX1 Site #1 (Supplementary Table 4d)) and found undetectable or very low rates of indel mutations (Supplementary Table 5), comparable to those induced by tru-RGNs for other previously known off-target sites (Supplementary Table 2). Of note, the percentage of sites with mutation rates above 0.1% decreases with increasing numbers of mismatches (Supplementary Table 4e), validating our focus on sites with fewest mismatches. tru-RGNs generally appear to induce either very low or undetectable levels of mutations at potential off-target sites that differ by one or two mismatches; by contrast, our previous study using standard RGNs showed high levels of mutations at numerous off-target sites bearing up to four or five mismatches 4 . Because neither tru-gRNAs nor paired nickases completely eliminate off-target effects, we explored whether combining these strategies might further reduce such mutations. First, we used a pair of gRNAs targeted to sites in the human VEGFA gene (VEGFA site 1 and VEGFA site 4; Fig. 2e) previously shown to work with the paired Cas9-D10A nickase approach 10 . Substitution of the full-length gRNA for VEGFA Site #1 with a tru-gRNA did not adversely affect the induction of indels (Fig. 2f) or incorporation of a restriction site sequence by HDR (Fig. 2g). We next used deep sequencing to examine mutation frequencies at four previously validated off-target sites of the VEGFA site 1 gRNA and found that mutation rates dropped below the detection limit of the sequencing assay at all four of these off-target sites when using paired nickases with one full-length gRNA and one tru-gRNA (Supplementary Table 6). By contrast, both a single tru-RGN (Supplementary Table 2) and paired nickases with full-length gRNAs (Supplementary Table 6) still induced off-target mutations at one of these four off-target sites (OT1–3). Although we substituted a tru-gRNA for only one of the two full-length gRNAs in our VEGFA paired nickase experiments, we have shown at another locus (in the EMX1 gene) that use of two tru-gRNAs with Cas9 nickase can robustly induce indels at an endogenous human gene locus (Supplementary Fig. 5). Taken together, we conclude that tru-gRNAs can further reduce the off-target effects of paired Cas9 nickases (and vice versa) without compromising the efficiency of on-target genome editing. Our results show that tru-RGNs can generally introduce mutations via NHEJ or HDR at on-target sites with high efficiencies and show reduced mutagenic effects at closely matched off-target sites. One potential model to explain our overall results is that standard RGNs with full-length gRNAs might possess more affinity for their target sites than is required (perhaps not unsurprisingly because it might be beneficial for naturally occurring CRISPR systems to tolerate the introduction of alterations in the target sequence) and that truncation of the gRNA might poise the tru-RGN/DNA complex to be more sensitive to mismatches, perhaps by reducing binding energy at the gRNA/DNA interface. The concept of excess affinity affecting the specificity of DNA binding domains has previously been suggested by others for engineered zinc finger proteins 23, 24 . The use of tru-gRNAs for improving CRISPR specificity offers important advantages over the paired Cas9 nickase strategy. tru-gRNAs should be technically simpler to implement with applications involving multiplex 12–14 or genome-wide libraries of gRNAs 15, 16 and, unlike the paired nickase strategy, can also be used to improve the specificities of dCas9 25 or dCas9 fusions to heterologous effector domains such as transcriptional regulatory domains 17–19 . In this regard, we note that we have found that tru-gRNAs can efficiently recruit dCas9-VP64 transcriptional activators to an endogenous human gene (Y.F., V.M.C., and J.K.J., unpublished data). tru-RGNs also avoid the need to use a second gRNA that on its own can potentially induce additional unwanted mutations with a Cas9 nickase. For example, consistent with previously published results 7, 9, 11 , we found that a single gRNA to VEGFA site 4 was capable of efficiently inducing indel mutations with Cas9 nickases (red bar, Fig. 2f). Our findings have several important implications for how to choose potential RGN target sites. The use of shorter gRNAs does not decrease the targeting range of the platform because target sites with 17 or 18 nts of complementarity will each occur in random DNA with frequencies equal to those with 20 nts of complementarity. Our results show that tru-gRGNs generally appear to induce very low or undetectable levels of mutagenesis at off-target sites with as few as one or two mismatches and suggest that sites with three or more mismatches will not have mutations at high frequencies, if at all. Thus, a reasonable strategy for choosing target sites to minimize off-target effects might be to choose tru-gRNA target sites that are unique in the genome and that have the smallest possible number of potential off-target sites with 1 or 2 mistmatches. We have modified our web-based ZiFiT Targeter program 26, 27 so that it can identify tru-gRNA sites with 17 or 18 nts of complementarity and provide the user with information about potential off-target sites that differ by 0, 1 or 2 positions in the published genomes of fruit flies, roundworm, zebrafish, mice, rats, humans using the Bowtie program 28 . This modified version of ZiFiT is currently accessible at http://zifit.partners.org. Our results raise several important questions to be addressed in future experiments. Although we found that tru-gRNAs with 17 or 18 nts of complementarity generally function efficiently at the intended target site and have improved specificities, it is possible that certain gRNAs with shorter and longer complementarity lengths might also possess such properties. Testing of greater numbers of truncated gRNAs in future experiments should help to determine this and also whether certain characteristics such as GC content might predict activity levels. In addition, we presume our tru-gRNAs are generally being expressed at the same levels as their full-length counterparts because they show comparable activities and because titration experiments performed for EGFP sites #1, #2, and #3 in which the amounts of gRNA and Cas9 expression vectors are varied demonstrate that shortened gRNAs give activity curves similar to their full-length counterparts (Supplementary Fig. 6); however, more direct, quantitative assessments of tru-gRNA expression levels will be required to definitively establish this and to better understand why and how tru-gRNAs function with high efficiencies and specificities. In summary, tru-gRNAs provide a simple and flexible approach to minimize the off-target effects of individual Cas9 nucleases, paired Cas9 nickases and, potentially, dCas9 fusion proteins in human cells. However, we note that definitive assessment of the relative efficacies of any platform designed to improve specificities will require the development of an unbiased approach for globally assessing off-target effects in human cells. Continued efforts to develop methods for assessing and improving specificity will further accelerate the use of CRISPR-based reagents for research and therapeutic applications. Online Methods Plasmid construction All gRNA expression plasmids were assembled by designing, synthesizing, annealing, and cloning pairs of oligonucleotides (IDT) harboring the complementarity region into plasmid pMLM3636 (available from Addgene; http://www.addgene.org/crispr-cas) as previously described 4 . The resulting gRNA expression vectors encode a ~100 nt gRNA whose expression is driven by a human U6 promoter. The sequences of all oligonucleotides used to construct gRNA expression vectors are shown in Supplementary Table 7. The Cas9 D10A nickase expression plasmid (pJDS271) bearing a mutation in the RuvC endonuclease domain was generated by mutating plasmid pJDS246 using a QuikChange kit (Agilent Technologies) with the following primers: Cas9 D10A sense primer 5'-tggataaaaagtattctattggtttagccatcggcactaattccg-3'; Cas9 D10A antisense primer 5'-cggaattagtgccgatggctaaaccaatagaatactttttatcca-3'. All the targeted gRNA plasmids and the Cas9 nickase plasmids used in this study will be made available through the non-profit plasmid distribution service Addgene. Human cell-based EGFP disruption assay U2OS.EGFP cells harboring a single-copy, integrated EGFP-PEST gene reporter have been previously described 22 . These cells were maintained in Advanced DMEM (Life Technologies) supplemented with 10% FBS, 2 mM GlutaMax (Life Technologies), penicillin/streptomycin and 400 μg/ml G418. To assay for disruption of EGFP expression, 2 × 105 U2OS.EGFP cells were transfected in duplicate with gRNA expression plasmid or an empty U6 promoter plasmid as a negative control, Cas9 expression plasmid (pJDS246) 4 , and 10 ng of td-Tomato expression plasmid (to control for transfection efficiency) using a LONZA 4D-Nucleofector™, with SE solution and DN100 program according to the manufacturer's instructions. We used 25 ng/250 ng, 250 ng/750 ng, 200 ng/750 ng, and 250 ng/750 ng of gRNA expression plasmid/Cas9 expression plasmid for experiments with EGFP site #1, #2, #3, and #4, respectively. Two days following transfection, cells were trypsinized and resuspended in Dulbecco's modified Eagle medium (DMEM, Invitrogen) supplemented with 10% (vol/vol) fetal bovine serum (FBS) and analyzed on a BD LSRII flow cytometer. For each sample, transfections and flow cytometry measurements were performed in duplicate. Transfection of human cells and isolation of genomic DNA To assess the on-target and off-target indel mutations induced by RGNs targeted to endogenous human genes, we transfected plasmids into U2OS.EGFP or FT-HEK293 (Life Technologies) cells using the following conditions: U2OS.EGFP cells were transfected using the same conditions as for the EGFP disruption assay described above. FT-HEK293 cells were transfected by seeding them at a density of 1.65 × 105 cells per well in 24 well plates in Advanced DMEM (Life Technologies) supplemented with 10% FBS and 2 mM GlutaMax (Life Technologies) at 37°C in a CO2 incubator. After 22 – 24 hours of incubation, cells were transfected with 125 ng of gRNA expression plasmid or an empty U6 promoter plasmid (as a negative control), 375 ng of Cas9 expression plasmid (pJDS246) 4 , and 10 ng of a td-Tomato expression plasmid, using Lipofectamine LTX reagent according to the manufacturer's instructions (Life Technologies). Medium was changed 16 hours after transfection. For both types of cells, genomic DNA was harvested two days post-transfection using an Agencourt DNAdvance genomic DNA isolation kit (Beckman) according to the manufacturer's instructions. For each RGN sample to be assayed, we performed 12 individual 4D transfection replicates, isolated genomic DNA from each of these 12 transfections, and then combined these samples to create two “duplicate” pools each consisting of six pooled genomic DNA samples. We then assessed indel mutations at on-target and off-target sites from these duplicate samples by T7EI assay, Sanger sequencing, and/or deep sequencing as described below. To assess frequencies of precise alterations introduced by HDR with ssODN donor templates, 2×105 U2OS.EGFP cells were transfected 250 ng of gRNA expression plasmid or an empty U6 promoter plasmid (as a negative control), 750 ng Cas9 expression plasmid (pJDS246), 50 pmol of ssODN donor (or no ssODN for controls), and 10 ng of td-Tomato expression plasmid (as the transfection control). Genomic DNA was purified three days after transfection using Agencourt DNAdvance and assayed for the introduction of a BamHI site at the locus of interest as described below. All of these transfections were performed in duplicate. For experiments involving Cas9 nickases, 2 × 105 U2OS.EGFP cells were transfected with 125 ng of each gRNA expression plasmid (if using paired gRNAs) or 250 ng of gRNA expression plasmid (if using a single gRNA), 750 ng of Cas9-D10A nickase expression plasmid (pJDS271), 10 ng of td-Tomato plasmid, and (if performing HDR) 50 pmol of ssODN donor template (encoding the BamHI site). All transfections were performed in duplicate. Genomic DNA harvested two days after transfection (if assaying for indel mutations) or three days after transfection (if assaying for HDR/ssODN-mediated alterations) using the Agencourt DNAdvance genomic DNA isolation kit (Beckman). U2OS.EGFP and FT-HEK293 cell lines used in this study were tested for mycoplasma contamination every two weeks. T7EI assays for quantifying frequencies of indel mutations T7EI assays were performed as previously described 4 . In brief, PCR reactions to amplify specific on-target or off-target sites were performed with Phusion high-fidelity DNA polymerase (New England Biolabs) using one of the two following programs: (1) Touchdown PCR program [(98°C, 10 s; 72–62°C, −1 °C/cycle, 15 s; 72°C, 30 s) × 10 cycles, (98°C, 10 s; 62°C, 15 s; 72°C, 30 s) × 25 cycles] or (2) Constant Tm PCR program [(98°C, 10 s; 68°C or 72°C, 15 s; 72°C, 30 s) × 35 cycles], with 3% DMSO or 1 M betaine if necessary. All primers used for these amplifications are listed in Supplementary Table 8. Resulting PCR products ranged in size from 300 to 800 bps and were purified by Ampure XP beads (Agencourt) according to the manufacturer's instructions. 200ng of purified PCR products were hybridized in 1 × NEB buffer 2 in a total volume of 19 μl and denatured to form heteroduplexes using the following conditions: 95 °C, 5 minutes; 95 to 85 °C, −2 °C/s; 85 to 25 °C, −0.1 °C/s; hold at 4 °C. 1 μl of T7 Endonuclease I (New England Biolabs, 10 units/μl) was added to the hybridized PCR products and incubated at 37°C for 15 minutes. The T7EI reaction was stopped by adding 2 μl of 0.25 M EDTA solution and the reaction products were purified using AMPure XP beads (Agencourt) with elution in 20 μl 0.1× EB buffer (QIAgen). Reactions products were then analyzed on a QIAXCEL capillary electrophoresis system and the frequencies of indel mutations were calculated using the same formula as previously described 22 . A more detailed protocol for the T7EI assay and examples of sample capillary electrophoresis traces has been previously described 22 . Sanger sequencing for quantifying frequencies of indel mutations Purified PCR products used for T7EI assay were ligated into a Zero Blunt TOPO vector (Life Technologies) and transformed into chemically competent Top 10 bacterial cells. Plasmid DNAs were isolated and sequenced by the Massachusetts General Hospital (MGH) DNA Automation Core, using an M13 forward primer (5'-GTAAAACGACGGCCAG-3'). Restriction digest assay for quantifying specific alterations induced by HDR with ssODNs PCR reactions of specific on-target sites were performed using Phusion high-fidelity DNA polymerase (New England Biolabs). The VEGFA and EMX1 loci were amplified using a touchdown PCR program ((98 °C, 10 s; 72–62 °C, −1 °C/cycle, 15 s; 72 °C, 30 s) × 10 cycles, (98 °C, 10 s; 62 °C, 15 s; 72 °C, 30 s) × 25 cycles), with 3% DMSO. The primers used for these PCR reactions are listed in Supplementary Table 8. PCR products were purified by Ampure XP beads (Agencourt) according to the manufacturer's instructions. For detection of the BamHI restriction site encoded by the ssODN donor template, 200 ng of purified PCR products were digested with BamHI at 37 °C for 45 minutes. The digested products were purified by Ampure XP beads (Agencourt), eluted in 20ul 0.1×EB buffer and analyzed and quantified using a QIAXCEL capillary electrophoresis system. TruSeq library generation and sequencing data analysis Locus-specific primers were designed to flank on-target and potential and verified off-target sites to produce PCR products ~300bp to 400 bps in length. Genomic DNAs from the pooled duplicate samples described above were used as templates for PCR. All PCR products were purified by Ampure XP beads (Agencourt) per the manufacturer's instructions. Purified PCR products were quantified on a QIAXCEL capillary electrophoresis system. PCR products for each locus were amplified from each of the pooled duplicate samples (described above), purified, quantified, and then pooled together in equal quantities for deep sequencing. Pooled amplicons were ligated with dual-indexed Illumina TruSeq adaptors as previously described 29 . The libraries were purified and run on a QIAXCEL capillary electrophoresis system to verify change in size following adaptor ligation. The adapter-ligated libraries were quantified by qPCR and then sequenced using Illumina MiSeq 250 bp paired-end reads performed by the Dana-Farber Cancer Institute Molecular Biology Core Facilities. We analyzed between 75,000 and 1,270,000 (average ~422,000) reads for each sample. The TruSeq reads were analyzed for rates of indel mutagenesis as previously described 30 . Specificity ratios were calculated as the ratio of observed mutagenesis at an on-target locus to that of a particular off-target locus as determined by deep sequencing. Fold-improvements in specificity with tru-RGNs for individual off-target sites were calculated as the specificity ratio observed with tru-gRNAs to the specificity ratio for that same target with the matched full-length gRNA. As mentioned in the text, for two of the known off-target sites, no indel mutations were detected with tru-gRNAs. For these particular sites, it was difficult to quantify the on-target to off-target ratios for tru-RGNs and, therefore, also the magnitude of improved specificity for tru-RGNs to standard RGNs. For example, we did not observe any tru-RGN-induced indels at sites OT1–4 and OT1–11 and thus the ratio of on-target to off-target rates would calculate to be infinite in these cases. For these sites, we reasoned it was possible that the true rates could be below the detection limit of our assay. Using the following equation, we calculated that we would have a 95% probability of observing 1 or more mutagenesis events if the mean number of mutagenesis events was 3: p ( x ; μ ) = e − μ μ x x ! Therefore, we used 3 as a conservative estimate of the number of events (instead of the 0 observed) as the numerator to estimate a lower bound for the fold-improvement of the tru-gRNA in place of infinite improvement suggested by observing 0 events. All sequencing data has been deposited with National Center for Biotechnology Information Sequence Read Archive (NCBI SRA), accession number SRP033215. Supplementary Material 1 2 3

          Related collections

          Most cited references 32

          • Record: found
          • Abstract: found
          • Article: found

          Ultrafast and memory-efficient alignment of short DNA sequences to the human genome

          Rationale Improvements in the efficiency of DNA sequencing have both broadened the applications for sequencing and dramatically increased the size of sequencing datasets. Technologies from Illumina (San Diego, CA, USA) and Applied Biosystems (Foster City, CA, USA) have been used to profile methylation patterns (MeDIP-Seq) [1], to map DNA-protein interactions (ChIP-Seq) [2], and to identify differentially expressed genes (RNA-Seq) [3] in the human genome and other species. The Illumina instrument was recently used to re-sequence three human genomes, one from a cancer patient and two from previously unsequenced ethnic groups [4-6]. Each of these studies required the alignment of large numbers of short DNA sequences ('short reads') onto the human genome. For example, two of the studies [4,5] used the short read alignment tool Maq [7] to align more than 130 billion bases (about 45× coverage) of short Illumina reads to a human reference genome in order to detect genetic variations. The third human re-sequencing study [6] used the SOAP program [8] to align more than 100 billion bases to the reference genome. In addition to these projects, the 1,000 Genomes project is in the process of using high-throughput sequencing instruments to sequence a total of about six trillion base pairs of human DNA [9]. With existing methods, the computational cost of aligning many short reads to a mammalian genome is very large. For example, extrapolating from the results presented here in Tables 1 and 2, one can see that Maq would require more than 5 central processing unit (CPU)-months and SOAP more than 3 CPU-years to align the 140 billion bases from the study by Ley and coworkers [5]. Although using Maq or SOAP for this purpose has been shown to be feasible by using multiple CPUs, there is a clear need for new tools that consume less time and computational resources. Table 1 Bowtie alignment performance versus SOAP and Maq Platform CPU time Wall clock time Reads mapped per hour (millions) Peak virtual memory footprint (megabytes) Bowtie speed-up Reads aligned (%) Bowtie -v 2 Server 15 m 7 s 15 m 41 s 33.8 1,149 - 67.4 SOAP 91 h 57 m 35 s 91 h 47 m 46 s 0.10 13,619 351× 67.3 Bowtie PC 16 m 41 s 17 m 57 s 29.5 1,353 - 71.9 Maq 17 h 46 m 35 s 17 h 53 m 7 s 0.49 804 59.8× 74.7 Bowtie Server 17 m 58 s 18 m 26 s 28.8 1,353 - 71.9 Maq 32 h 56 m 53 s 32 h 58 m 39 s 0.27 804 107× 74.7 The performance and sensitivity of Bowtie v0.9.6, SOAP v1.10, and Maq v0.6.6 when aligning 8.84 M reads from the 1,000 Genome project (National Center for Biotechnology Information Short Read Archive: SRR001115) trimmed to 35 base pairs. The 'soap.contig' version of the SOAP binary was used. SOAP could not be run on the PC because SOAP's memory footprint exceeds the PC's physical memory. For the SOAP comparison, Bowtie was invoked with '-v 2' to mimic SOAP's default matching policy (which allows up to two mismatches in the alignment and disregards quality values). For the Maq comparison Bowtie is run with its default policy, which mimics Maq's default policy of allowing up to two mismatches during the first 28 bases and enforcing an overall limit of 70 on the sum of the quality values at all mismatched positions. To make Bowtie's memory footprint more comparable to Maq's, Bowtie is invoked with the '-z' option in all experiments to ensure only the forward or mirror index is resident in memory at one time. CPU, central processing unit. Table 2 Bowtie alignment performance versus Maq with filtered read set Platform CPU time Wall clock time Reads mapped per hour (millions) Peak virtual memory footprint (megabytes) Bowtie speed up Reads aligned (%) Bowtie PC 16 m 39 s 17 m 47 s 29.8 1,353 - 74.9 Maq 11 h 15 m 58 s 11 h 22 m 2 s 0.78 804 38.4× 78.0 Bowtie Server 18 m 20 s 18 m 46 s 28.3 1,352 - 74.9 Maq 18 h 49 m 7 s 18 h 50 m 16 s 0.47 804 60.2× 78.0 Performance and sensitivity of Bowtie v0.9.6 and Maq v0.6.6 when the read set is filtered using Maq's 'catfilter' command to eliminate poly-A artifacts. The filter eliminates 438,145 out of 8,839,010 reads. Other experimental parameters are identical to those of the experiments in Table 1. CPU, central processing unit. Maq and SOAP take the same basic algorithmic approach as other recent read mapping tools such as RMAP [10], ZOOM [11], and SHRiMP [12]. Each tool builds a hash table of short oligomers present in either the reads (SHRiMP, Maq, RMAP, and ZOOM) or the reference (SOAP). Some employ recent theoretical advances to align reads quickly without sacrificing sensitivity. For example, ZOOM uses 'spaced seeds' to significantly outperform RMAP, which is based on a simpler algorithm developed by Baeza-Yaetes and Perleberg [13]. Spaced seeds have been shown to yield higher sensitivity than contiguous seeds of the same length [14,15]. SHRiMP employs a combination of spaced seeds and the Smith-Waterman [16] algorithm to align reads with high sensitivity at the expense of speed. Eland is a commercial alignment program available from Illumina that uses a hash-based algorithm to align reads. Bowtie uses a different and novel indexing strategy to create an ultrafast, memory-efficient short read aligner geared toward mammalian re-sequencing. In our experiments using reads from the 1,000 Genomes project, Bowtie aligns 35-base pair (bp) reads at a rate of more than 25 million reads per CPU-hour, which is more than 35 times faster than Maq and 300 times faster than SOAP under the same conditions (see Tables 1 and 2). Bowtie employs a Burrows-Wheeler index based on the full-text minute-space (FM) index, which has a memory footprint of only about 1.3 gigabytes (GB) for the human genome. The small footprint allows Bowtie to run on a typical desktop computer with 2 GB of RAM. The index is small enough to be distributed over the internet and to be stored on disk and re-used. Multiple processor cores can be used simultaneously to achieve even greater alignment speed. We have used Bowtie to align 14.3× coverage worth of human Illumina reads from the 1,000 Genomes project in about 14 hours on a single desktop computer with four processor cores. Bowtie makes a number of compromises to achieve this speed, but these trade-offs are reasonable within the context of mammalian re-sequencing projects. If one or more exact matches exist for a read, then Bowtie is guaranteed to report one, but if the best match is an inexact one then Bowtie is not guaranteed in all cases to find the highest quality alignment. With its highest performance settings, Bowtie may fail to align a small number of reads with valid alignments, if those reads have multiple mismatches. If the stronger guarantees are desired, Bowtie supports options that increase accuracy at the cost of some performance. For instance, the '--best' option will guarantee that all alignments reported are best in terms of minimizing mismatches in the seed portion of the read, although this option incurs additional computational cost. With its default options, Bowtie's sensitivity measured in terms of reads aligned is equal to SOAP's and somewhat less than Maq's. Command line options allow the user to increase sensitivity at the cost of greater running time, and to enable Bowtie to report multiple hits for a read. Bowtie can align reads as short as four bases and as long as 1,024 bases. The input to a single run of Bowtie may comprise a mixture of reads with different lengths. Bowtie description and results Bowtie indexes the reference genome using a scheme based on the Burrows-Wheeler transform (BWT) [17] and the FM index [18,19]. A Bowtie index for the human genome fits in 2.2 GB on disk and has a memory footprint of as little as 1.3 GB at alignment time, allowing it to be queried on a workstation with under 2 GB of RAM. The common method for searching in an FM index is the exact-matching algorithm of Ferragina and Manzini [18]. Bowtie does not simply adopt this algorithm because exact matching does not allow for sequencing errors or genetic variations. We introduce two novel extensions that make the technique applicable to short read alignment: a quality-aware backtracking algorithm that allows mismatches and favors high-quality alignments; and 'double indexing', a strategy to avoid excessive backtracking. The Bowtie aligner follows a policy similar to Maq's, in that it allows a small number of mismatches within the high-quality end of each read, and it places an upper limit on the sum of the quality values at mismatched alignment positions. Burrows-Wheeler indexing The BWT is a reversible permutation of the characters in a text. Although originally developed within the context of data compression, BWT-based indexing allows large texts to be searched efficiently in a small memory footprint. It has been applied to bioinformatics applications, including oligomer counting [20], whole-genome alignment [21], tiling microarray probe design [22], and Smith-Waterman alignment to a human-sized reference [23]. The Burrows-Wheeler transformation of a text T, BWT(T), is constructed as follows. The character $ is appended to T, where $ is not in T and is lexicographically less than all characters in T. The Burrows-Wheeler matrix of T is constructed as the matrix whose rows comprise all cyclic rotations of T$. The rows are then sorted lexicographically. BWT(T) is the sequence of characters in the rightmost column of the Burrows-Wheeler matrix (Figure 1a). BWT(T) has the same length as the original text T. Figure 1 Burrows-Wheeler transform. (a) The Burrows-Wheeler matrix and transformation for 'acaacg'. (b) Steps taken by EXACTMATCH to identify the range of rows, and thus the set of reference suffixes, prefixed by 'aac'. (c) UNPERMUTE repeatedly applies the last first (LF) mapping to recover the original text (in red on the top line) from the Burrows-Wheeler transform (in black in the rightmost column). This matrix has a property called 'last first (LF) mapping'. The ith occurrence of character X in the last column corresponds to the same text character as the ith occurrence of X in the first column. This property underlies algorithms that use the BWT index to navigate or search the text. Figure 1b illustrates UNPERMUTE, an algorithm that applies the LF mapping repeatedly to re-create T from BWT(T). The LF mapping is also used in exact matching. Because the matrix is sorted lexicographically, rows beginning with a given sequence appear consecutively. In a series of steps, the EXACTMATCH algorithm (Figure 1c) calculates the range of matrix rows beginning with successively longer suffixes of the query. At each step, the size of the range either shrinks or remains the same. When the algorithm completes, rows beginning with S0 (the entire query) correspond to exact occurrences of the query in the text. If the range is empty, the text does not contain the query. UNPERMUTE is attributable to Burrows and Wheeler [17] and EXACTMATCH to Ferragina and Manzini [18]. See Additional data file 1 (Supplementary Discussion 1) for details. Searching for inexact alignments EXACTMATCH is insufficient for short read alignment because alignments may contain mismatches, which may be due to sequencing errors, genuine differences between reference and query organisms, or both. We introduce an alignment algorithm that conducts a backtracking search to quickly find alignments that satisfy a specified alignment policy. Each character in a read has a numeric quality value, with lower values indicating a higher likelihood of a sequencing error. Our alignment policy allows a limited number of mismatches and prefers alignments where the sum of the quality values at all mismatched positions is low. The search proceeds similarly to EXACTMATCH, calculating matrix ranges for successively longer query suffixes. If the range becomes empty (a suffix does not occur in the text), then the algorithm may select an already-matched query position and substitute a different base there, introducing a mismatch into the alignment. The EXACTMATCH search resumes from just after the substituted position. The algorithm selects only those substitutions that are consistent with the alignment policy and which yield a modified suffix that occurs at least once in the text. If there are multiple candidate substitution positions, then the algorithm greedily selects a position with a minimal quality value. Backtracking scenarios play out within the context of a stack structure that grows when a new substitution is introduced and shrinks when the aligner rejects all candidate alignments for the substitutions currently on the stack. See Figure 2 for an illustration of how the search might proceed. Figure 2 Exact matching versus inexact alignment. Illustration of how EXACTMATCH (top) and Bowtie's aligner (bottom) proceed when there is no exact match for query 'ggta' but there is a one-mismatch alignment when 'a' is replaced by 'g'. Boxed pairs of numbers denote ranges of matrix rows beginning with the suffix observed up to that point. A red X marks where the algorithm encounters an empty range and either aborts (as in EXACTMATCH) or backtracks (as in the inexact algorithm). A green check marks where the algorithm finds a nonempty range delimiting one or more occurrences of a reportable alignment for the query. In short, Bowtie conducts a quality-aware, greedy, randomized, depth-first search through the space of possible alignments. If a valid alignment exists, then Bowtie will find it (subject to the backtrack ceiling discussed in the following section). Because the search is greedy, the first valid alignment encountered by Bowtie will not necessarily be the 'best' in terms of number of mismatches or in terms of quality. The user may instruct Bowtie to continue searching until it can prove that any alignment it reports is 'best' in terms of number of mismatches (using the option --best). In our experience, this mode is two to three times slower than the default mode. We expect that the faster default mode will be preferred for large re-sequencing projects. The user may also opt for Bowtie to report all alignments up to a specified number (option -k) or all alignments with no limit on the number (option -a) for a given read. If in the course of its search Bowtie finds N possible alignments for a given set of substitutions, but the user has requested only K alignments where K < N, Bowtie will report K of the N alignments selected at random. Note that these modes can be much slower than the default. In our experience, for example, -k 1 is more than twice as fast as -k 2. Excessive backtracking The aligner as described so far can, in some cases, encounter sequences that cause excessive backtracking. This occurs when the aligner spends most of its effort fruitlessly backtracking to positions close to the 3' end of the query. Bowtie mitigates excessive backtracking with the novel technique of 'double indexing'. Two indices of the genome are created: one containing the BWT of the genome, called the 'forward' index, and a second containing the BWT of the genome with its character sequence reversed (not reverse complemented) called the 'mirror' index. To see how this helps, consider a matching policy that allows one mismatch in the alignment. A valid alignment with one mismatch falls into one of two cases according to which half of the read contains the mismatch. Bowtie proceeds in two phases corresponding to those two cases. Phase 1 loads the forward index into memory and invokes the aligner with the constraint that it may not substitute at positions in the query's right half. Phase 2 uses the mirror index and invokes the aligner on the reversed query, with the constraint that the aligner may not substitute at positions in the reversed query's right half (the original query's left half). The constraints on backtracking into the right half prevent excessive backtracking, whereas the use of two phases and two indices maintains full sensitivity. Unfortunately, it is not possible to avoid excessive backtracking fully when alignments are permitted to have two or more mismatches. In our experiments, we have observed that excessive backtracking is significant only when a read has many low-quality positions and does not align or aligns poorly to the reference. These cases can trigger in excess of 200 backtracks per read because there are many legal combinations of low-quality positions to be explored before all possibilities are exhausted. We mitigate this cost by enforcing a limit on the number of backtracks allowed before a search is terminated (default: 125). The limit prevents some legitimate, low-quality alignments from being reported, but we expect that this is a desirable trade-off for most applications. Phased Maq-like search Bowtie allows the user to select the number of mismatches permitted (default: two) in the high-quality end of a read (default: the first 28 bases) as well as the maximum acceptable quality distance of the overall alignment (default: 70). Quality values are assumed to follow the definition in PHRED [24], where p is the probability of error and Q = -10log p. Both the read and its reverse complement are candidates for alignment to the reference. For clarity, this discussion considers only the forward orientation. See Additional data file 1 (Supplementary Discussion 2) for an explanation of how the reverse complement is incorporated. The first 28 bases on the high-quality end of the read are termed the 'seed'. The seed consists of two halves: the 14 bp on the high-quality end (usually the 5' end) and the 14 bp on the low-quality end, termed the 'hi-half' and the 'lo-half', respectively. Assuming the default policy (two mismatches permitted in the seed), a reportable alignment will fall into one of four cases: no mismatches in seed (case 1); no mismatches in hi-half, one or two mismatches in lo-half (case 2); no mismatches in lo-half, one or two mismatches in hi-half (case 3); and one mismatch in hi-half, one mismatch in lo-half (case 4). All cases allow any number of mismatches in the nonseed part of the read and all cases are also subject to the quality distance constraint. The Bowtie algorithm consists of three phases that alternate between using the forward and mirror indices, as illustrated in Figure 3. Phase 1 uses the mirror index and invokes the aligner to find alignments for cases 1 and 2. Phases 2 and 3 cooperate to find alignments for case 3: Phase 2 finds partial alignments with mismatches only in the hi-half and phase 3 attempts to extend those partial alignments into full alignments. Finally, phase 3 invokes the aligner to find alignments for case 4. Figure 3 The three phases of the Bowtie algorithm for the Maq-like policy. A three-phase approach finds alignments for two-mismatch cases 1 to 4 while minimizing backtracking. Phase 1 uses the mirror index and invokes the aligner to find alignments for cases 1 and 2. Phases 2 and 3 cooperate to find alignments for case 3: Phase 2 finds partial alignments with mismatches only in the hi-half, and phase 3 attempts to extend those partial alignments into full alignments. Finally, phase 3 invokes the aligner to find alignments for case 4. Performance results We evaluated the performance of Bowtie using reads from the 1,000 Genomes project pilot (National Center for Biotechnology Information [NCBI] Short Read Archive:SRR001115). A total of 8.84 million reads, about one lane of data from an Illumina instrument, were trimmed to 35 bp and aligned to the human reference genome [NCBI build 36.3]. Unless specified otherwise, read data are not filtered or modified (besides trimming) from how they appear in the archive. This leads to about 70% to 75% of reads aligning somewhere to the genome. In our experience, this is typical for raw data from the archive. More aggressive filtering leads to higher alignment rates and faster alignment. All runs were performed on a single CPU. Bowtie speedups were calculated as a ratio of wall-clock alignment times. Both wall-clock and CPU times are given to demonstrate that input/output load and CPU contention are not significant factors. The time required to build the Bowtie index was not included in the Bowtie running times. Unlike competing tools, Bowtie can reuse a pre-computed index for the reference genome across many alignment runs. We anticipate most users will simply download such indices from a public repository. The Bowtie site [25] provides indices for current builds of the human, chimp, mouse, dog, rat, and Arabidopsis thaliana genomes, as well as many others. Results were obtained on two hardware platforms: a desktop workstation with 2.4 GHz Intel Core 2 processor and 2 GB of RAM; and a large-memory server with a four-core 2.4 GHz AMD Opteron processor and 32 GB of RAM. These are denoted 'PC' and 'server', respectively. Both PC and server run Red Hat Enterprise Linux AS release 4. Comparison to SOAP and Maq Maq is a popular aligner [1,4,5,26,27] that is among the fastest competing open source tools for aligning millions of Illumina reads to the human genome. SOAP is another open source tool that has been reported and used in short-read projects [6,28]. Table 1 presents the performance and sensitivity of Bowtie v0.9.6, SOAP v1.10, and Maq v0.6.6. SOAP could not be run on the PC because SOAP's memory footprint exceeds the PC's physical memory. The 'soap.contig' version of the SOAP binary was used. For comparison with SOAP, Bowtie was invoked with '-v 2' to mimic SOAP's default matching policy (which allows up to two mismatches in the alignment and disregards quality values), and with '--maxns 5' to simulate SOAP's default policy of filtering out reads with five or more no-confidence bases. For the Maq comparison Bowtie is run with its default policy, which mimics Maq's default policy of allowing up to two mismatches in the first 28 bases and enforcing an overall limit of 70 on the sum of the quality values at all mismatched read positions. To make Bowtie's memory footprint more comparable to Maq's, Bowtie is invoked with the '-z' option in all experiments to ensure that only the forward or mirror index is resident in memory at one time. The number of reads aligned indicates that SOAP (67.3%) and Bowtie -v 2 (67.4%) have comparable sensitivity. Of the reads aligned by either SOAP or Bowtie, 99.7% were aligned by both, 0.2% were aligned by Bowtie but not SOAP, and 0.1% were aligned by SOAP but not Bowtie. Maq (74.7%) and Bowtie (71.9%) also have roughly comparable sensitivity, although Bowtie lags by 2.8%. Of the reads aligned by either Maq or Bowtie, 96.0% were aligned by both, 0.1% were aligned by Bowtie but not Maq, and 3.9% were aligned by Maq but not Bowtie. Of the reads mapped by Maq but not Bowtie, almost all are due to a flexibility in Maq's alignment algorithm that allows some alignments to have three mismatches in the seed. The remainder of the reads mapped by Maq but not Bowtie are due to Bowtie's backtracking ceiling. Maq's documentation mentions that reads containing 'poly-A artifacts' can impair Maq's performance. Table 2 presents performance and sensitivity of Bowtie and Maq when the read set is filtered using Maq's 'catfilter' command to eliminate poly-A artifacts. The filter eliminates 438,145 out of 8,839,010 reads. Other experimental parameters are identical to those of the experiments in Table 1, and the same observations about the relative sensitivity of Bowtie and Maq apply here. Read length and performance As sequencing technology improves, read lengths are growing beyond the 30-bp to 50-bp commonly seen in public databases today. Bowtie, Maq, and SOAP support reads of lengths up to 1,024, 63, and 60 bp, respectively, and Maq versions 0.7.0 and later support read lengths up to 127 bp. Table 3 shows performance results when the three tools are each used to align three sets of 2 M untrimmed reads, a 36-bp set, a 50-bp set and a 76-bp set, to the human genome on the server platform. Each set of 2 M is randomly sampled from a larger set (NCBI Short Read Archive: SRR003084 for 36-bp, SRR003092 for 50-bp, SRR003196 for 76-bp). Reads were sampled such that the three sets of 2 M have uniform per-base error rate, as calculated from per-base Phred qualities. All reads pass through Maq's 'catfilter'. Table 3 Varying read length using Bowtie, Maq and SOAP Length Program CPU time Wall clock time Peak virtual memory footprint (megabytes) Bowtie speed-up Reads aligned (%) 36 bp Bowtie 6 m 15 s 6 m 21 s 1,305 - 62.2 Maq 3 h 52 m 26 s 3 h 52 m 54 s 804 36.7× 65.0 Bowtie -v 2 4 m 55 s 5 m 00 s 1,138 - 55.0 SOAP 16 h 44 m 3 s 18 h 1 m 38 s 13,619 216× 55.1 50 bp Bowtie 7 m 11 s 7 m 20 s 1,310 - 67.5 Maq 2 h 39 m 56 s 2 h 40 m 9 s 804 21.8× 67.9 Bowtie -v 2 5 m 32 s 5 m 46 s 1,138 - 56.2 SOAP 48 h 42 m 4 s 66 h 26 m 53 s 13,619 691× 56.2 76 bp Bowtie 18 m 58 s 19 m 6 s 1,323 - 44.5 Maq 0.7.1 4 h 45 m 7 s 4 h 45 m 17 s 1,155 14.9× 44.9 Bowtie -v 2 7 m 35 s 7 m 40 s 1,138 - 31.7 The performance of Bowtie v0.9.6, SOAP v1.10, and Maq versions v0.6.6 and v0.7.1 on the server platform when aligning 2 M untrimmed reads from the 1,000 Genome project (National Center for Biotechnology Information Short Read Archive: SRR003084 for 36 base pairs [bp], SRR003092 for 50 bp, and SRR003196 for 76 bp). For each read length, the 2 M reads were randomly sampled from the FASTQ file downloaded from the Archive such that the average per-base error rate as measured by quality values was uniform across the three sets. All reads pass through Maq's "catfilter". Maq v0.7.1 was used for the 76-bp reads because v0.6.6 does not support reads longer than 63 bp. SOAP is excluded from the 76-bp experiment because it does not support reads longer than 60 bp. Other experimental parameters are identical to those of the experiments in Table 1. CPU, central processing unit. Bowtie is run both in its Maq-like default mode and in its SOAP-like '-v 2' mode. Bowtie is also given the '-z' option to ensure that only the forward or mirror index is resident in memory at one time. Maq v0.7.1 was used instead of Maq v0.6.6 for the 76-bp set because v0.6.6 cannot align reads longer than 63 bp. SOAP was not run on the 76-bp set because it does not support reads longer than 60 bp. The results show that Maq's algorithm scales better overall to longer read lengths than Bowtie or SOAP. However, Bowtie in SOAP-like '-v 2' mode also scales very well. Bowtie in its default Maq-like mode scales well from 36-bp to 50-bp reads but is substantially slower for 76-bp reads, although it is still more than an order of magnitude faster than Maq. Parallel performance Alignment can be parallelized by distributing reads across concurrent search threads. Bowtie allows the user to specify a desired number of threads (option -p); Bowtie then launches the specified number of threads using the pthreads library. Bowtie threads synchronize with each other when fetching reads, outputting results, switching between indices, and performing various forms of global bookkeeping, such as marking a read as 'done'. Otherwise, threads are free to operate in parallel, substantially speeding up alignment on computers with multiple processor cores. The memory image of the index is shared by all threads, and so the footprint does not increase substantially when multiple threads are used. Table 4 shows performance results for running Bowtie v0.9.6 on the four-core server with one, two, and four threads. Table 4 Bowtie parallel alignment performance CPU time Wall clock time Reads mapped per hour (millions) Peak virtual memory footprint (megabytes) Speedup Bowtie, one thread 18 m 19 s 18 m 46 s 28.3 1,353 - Bowtie, two threads 20 m 34 s 10 m 35 s 50.1 1,363 1.77× Bowtie, four threads 23 m 9 s 6 m 1 s 88.1 1,384 3.12× Performance results for running Bowtie v0.9.6 on the four-core server with one, two, and four threads. Other experimental parameters are identical to those of the experiments in Table 1. CPU, central processing unit. Index building Bowtie uses a flexible indexing algorithm [29] that can be configured to trade off between memory usage and running time. Table 5 illustrates this trade-off when indexing the entire human reference genome (NCBI build 36.3, contigs). Runs were performed on the server platform. The indexer was run four times with different upper limits on memory usage. Table 5 Bowtie index building performance Physical memory target (GB) Actual peak memory footprint (GB) Wall clock time 16 14.4 4 h 36 m 8 5.84 5 h 5 m 4 3.39 7 h 40 m 2 1.39 21 h 30 m Performance results and memory footprints of running the Bowtie v0.9.6 indexer on the whole human genome (National Center for Biotechnology Information build 36.3, contigs). Runs were performed on the server platform. The indexer was run four times with different upper limits on memory usage. See Additional data file 1 (Supplementary Discussion 3 and Supplementary Table 1) for details. The reported times compare favorably with alignment times of competing tools that perform indexing during alignment. Less than 5 hours is required for Bowtie to both build and query a whole-human index with 8.84 million reads from the 1,000 Genome project (NCBI Short Read Archive:SRR001115) on a server, more than sixfold faster than the equivalent Maq run. The bottom-most row illustrates that the Bowtie indexer, with appropriate arguments, is memory-efficient enough to run on a typical workstation with 2 GB of RAM. Additional data file 1 (Supplementary discussions 3 and 4) explains the algorithm and the contents of the resulting index. Software Bowtie is written in C++ and uses the SeqAn library [30]. The converter to the Maq mapping format uses code from Maq. Discussion Bowtie exhibits a large performance advantage over both Maq and SOAP when mapping reads to the human genome. Bowtie's sensitivity in terms of reads aligned is comparable to that of SOAP and slightly less than Maq's, although the user may use command-line options to trade slower running time for greater sensitivity. Unlike SOAP, Bowtie's 1.3 GB memory footprint allows it to run on a typical PC with 2 GB of RAM. Bowtie aligns Illumina reads to the human genome at a rate of over 25 million reads per hour. Multiple processor cores can run parallel Bowtie threads to achieve even greater alignment speed; experiments show a speed up of 3.12 for four threads on a typical Opteron server. Unlike many other short-read aligners, Bowtie creates a permanent index of the reference that may be re-used across alignment runs. Building the index is fast - Bowtie outperforms competing tools when aligning lanes of Illumina reads even with index construction time included. At 2.2 GB for the human genome, the on-disk size of a Bowtie index is small enough to distribute over the internet. The Bowtie website hosts pre-built indices for the human genome and several other model organisms including chimp, dog, rat, mouse, and chicken. Bowtie's speed and small memory footprint are due chiefly to its use of the Burrows-Wheeler index in combination with the novel, quality-aware, backtracking algorithm introduced here. Double indexing is used to avoid the performance penalty of excessive backtracking. Bowtie supports standard FASTQ and FASTA input formats, and comes with a conversion program that allows Bowtie output to be used with Maq's consensus generator and single nucleotide polymorphism caller. Bowtie does not yet support paired-end alignment or alignments with insertions or deletions, although both improvements are planned for the future. Paired-end alignment is not difficult to implement in Bowtie's framework, and we expect that Bowtie's performance advantage will be comparable to, though perhaps somewhat less than, that of unpaired alignment mode. Support for insertions and deletions is also a conceptually straightforward addition. Bowtie is free, open source software available from the Bowtie website [25]. Abbreviations bp: base pair; BWT: Burrows-Wheeler transform; CPU: central processing unit; FM: full-text minute-space; GB: gigabytes; LF: last first; NCBI: National Center for Biotechnology Information. Authors' contributions BL developed the algorithms, collected results, and wrote most of the software. CT wrote some of the software. CT and MP contributed to discussions on algorithms. BL, CT, MP, and SLS wrote the manuscript. Additional data files The following additional data are included with the online version of this article: a document containing supplementary discussions, tables, and figures pertaining to algorithms for navigating the Burrows-Wheeler transform, the full four-phase version of the alignment algorithm that incorporates the reverse-complement, index construction, and components of the index (Additional data file 1). Supplementary Material Additional data file 1 Presented are supplementary discussions, tables, and figures pertaining to algorithms for navigating the Burrows-Wheeler transform, the full four-phase version of the alignment algorithm that incorporates the reverse-complement, index construction, and components of the index. Click here for file
            Bookmark
            • Record: found
            • Abstract: found
            • Article: not found

            Multiplex genome engineering using CRISPR/Cas systems.

            Functional elucidation of causal genetic variants and elements requires precise genome editing technologies. The type II prokaryotic CRISPR (clustered regularly interspaced short palindromic repeats)/Cas adaptive immune system has been shown to facilitate RNA-guided site-specific DNA cleavage. We engineered two different type II CRISPR/Cas systems and demonstrate that Cas9 nucleases can be directed by short RNAs to induce precise cleavage at endogenous genomic loci in human and mouse cells. Cas9 can also be converted into a nicking enzyme to facilitate homology-directed repair with minimal mutagenic activity. Lastly, multiple guide sequences can be encoded into a single CRISPR array to enable simultaneous editing of several sites within the mammalian genome, demonstrating easy programmability and wide applicability of the RNA-guided nuclease technology.
              Bookmark
              • Record: found
              • Abstract: found
              • Article: not found

              A programmable dual-RNA-guided DNA endonuclease in adaptive bacterial immunity.

              Clustered regularly interspaced short palindromic repeats (CRISPR)/CRISPR-associated (Cas) systems provide bacteria and archaea with adaptive immunity against viruses and plasmids by using CRISPR RNAs (crRNAs) to guide the silencing of invading nucleic acids. We show here that in a subset of these systems, the mature crRNA that is base-paired to trans-activating crRNA (tracrRNA) forms a two-RNA structure that directs the CRISPR-associated protein Cas9 to introduce double-stranded (ds) breaks in target DNA. At sites complementary to the crRNA-guide sequence, the Cas9 HNH nuclease domain cleaves the complementary strand, whereas the Cas9 RuvC-like domain cleaves the noncomplementary strand. The dual-tracrRNA:crRNA, when engineered as a single RNA chimera, also directs sequence-specific Cas9 dsDNA cleavage. Our study reveals a family of endonucleases that use dual-RNAs for site-specific DNA cleavage and highlights the potential to exploit the system for RNA-programmable genome editing.
                Bookmark

                Author and article information

                Affiliations
                [1 ]Molecular Pathology Unit, Massachusetts General Hospital, Charlestown, MA, USA
                [2 ]Center for Computational and Integrative Biology, Massachusetts General Hospital, Charlestown, MA, USA
                [3 ]Center for Cancer Research, Massachusetts General Hospital, Charlestown, MA USA
                [4 ]Department of Pathology, Harvard Medical School, Boston, MA 02115 USA
                Author notes
                Correspondence should be addressed to J.K.J. ( jjoung@ 123456mgh.harvard.edu ) or J.D.S.( jsander@ 123456alumni.iastate.edu )
                Journal
                9604648
                20305
                Nat Biotechnol
                Nat. Biotechnol.
                Nature biotechnology
                1087-0156
                1546-1696
                7 February 2014
                26 January 2014
                March 2014
                01 September 2014
                : 32
                : 3
                : 279-284
                NIHMS553838
                10.1038/nbt.2808
                3988262
                24463574

                Users may view, print, copy, download and text and data- mine the content in such documents, for the purposes of academic research, subject always to the full Conditions of use: http://www.nature.com/authors/editorial_policies/license.html#terms

                Categories
                Article

                Biotechnology

                Comments

                Comment on this article