
      Annotation-based genome-wide SNP discovery in the large and complex Aegilops tauschii genome using next-generation sequencing without a reference genome sequence

      research-article


          Abstract

          Background

          Many plants have large and complex genomes with an abundance of repeated sequences. Many plants are also polyploid. Both of these attributes typify the genome architecture in the tribe Triticeae, whose members include economically important wheat, rye and barley. Large genome sizes, an abundance of repeated sequences, and polyploidy present challenges to genome-wide SNP discovery using next-generation sequencing (NGS) of total genomic DNA by making alignment and clustering of short reads generated by the NGS platforms difficult, particularly in the absence of a reference genome sequence.

          Results

An annotation-based, genome-wide SNP discovery pipeline is reported that uses NGS data for large and complex genomes without a reference genome sequence. Roche 454 shotgun reads with low genome coverage of one genotype are annotated in order to distinguish single-copy sequences and repeat junctions from repetitive sequences and sequences shared by paralogous genes. Multiple genome equivalents of shotgun reads of another genotype generated with SOLiD or Solexa are then mapped to the annotated Roche 454 reads to identify putative SNPs. A pipeline program package, AGSNP, was developed and used for genome-wide SNP discovery in Aegilops tauschii, the diploid source of the wheat D genome, which has a genome size of 4.02 Gb, of which 90% is repetitive sequence. Genomic DNA of Ae. tauschii accession AL8/78 was sequenced with the Roche 454 NGS platform. Genomic DNA and cDNA of Ae. tauschii accession AS75 were sequenced primarily with SOLiD, although some Solexa and Roche 454 genomic sequences were also generated. A total of 195,631 putative SNPs were discovered in gene sequences, 155,580 in uncharacterized single-copy regions, and another 145,907 in repeat junctions. These SNPs were dispersed across the entire Ae. tauschii genome. To assess the false-positive SNP discovery rate, DNA containing putative SNPs was amplified by PCR from AL8/78 and AS75 and resequenced with the ABI 3730xl. In a sample of 302 randomly selected putative SNPs, 84.0% in gene regions, 88.0% in repeat junctions, and 81.3% in uncharacterized regions were validated.

          Conclusion

An annotation-based genome-wide SNP discovery pipeline for NGS platforms was developed. The pipeline is suitable for SNP discovery in genomic libraries of complex genomes and does not require a reference genome sequence. The pipeline is applicable to all current NGS platforms, provided that at least one such platform generates relatively long reads. The pipeline package, AGSNP, and the discovered 497,118 Ae. tauschii SNPs can be accessed at http://avena.pw.usda.gov/wheatD/agsnp.shtml.
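The abstract describes the pipeline only at a high level. The sketch below illustrates the core idea of the final SNP-calling step: an annotated single-copy long read from one accession serves as a local reference, and aligned short reads from the other accession vote on each position. It is not the AGSNP code; the function names, data structures and thresholds (minimum depth, agreement fraction) are illustrative assumptions.

```python
from collections import Counter

# Illustrative sketch of the annotation-based SNP-calling idea described above.
# NOT the AGSNP implementation; names, thresholds and data structures are
# hypothetical. A single-copy Roche 454 read from accession AL8/78 serves as the
# local reference; short reads from accession AS75 are assumed to be already
# aligned to it, giving a per-position pileup of base calls.

def call_putative_snps(anchor_seq, pileup, min_depth=4, min_agreement=0.9):
    """Report positions where the aligned short reads consistently differ
    from the anchor read base.

    anchor_seq: str, the annotated single-copy 454 read (e.g. from AL8/78)
    pileup: list of lists, pileup[i] holds the base calls covering position i
    """
    snps = []
    for pos, base in enumerate(anchor_seq):
        calls = [b for b in pileup[pos] if b in "ACGT"]
        if len(calls) < min_depth:
            continue  # too little coverage in the second accession
        allele, count = Counter(calls).most_common(1)[0]
        if allele != base and count / len(calls) >= min_agreement:
            snps.append((pos, base, allele, len(calls)))
    return snps

# Toy example: one candidate SNP at position 3 (A -> G), supported by 5/5 reads.
anchor = "TTGACCA"
pileup = [["T"]*5, ["T"]*5, ["G"]*5, ["G"]*5, ["C"]*5, ["C"]*4 + ["T"], ["A"]*5]
print(call_putative_snps(anchor, pileup))   # [(3, 'A', 'G', 5)]
```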

          Related collections

          Most cited references (26)


          Accuracy and quality of massively parallel DNA pyrosequencing

          Background Direct interrogation of microbial genomes based upon comparisons of orthologous gene sequences or metagenomic surveys provides a means to assess the diversity of microbial communities without requiring the cultivation of microbes in the laboratory. Since the cost of cloning DNA templates and capillary-based DNA sequencing constrains the number of sequences included in most of these investigations, the detection of low abundance taxa demands surveys that are many orders of magnitude larger than those reported in the literature. Massively parallel pyrosequencing on the Roche GS20 system developed by 454 Life Sciences offers a means to more extensively sample molecular diversity in microbial populations. It is now possible to generate hundreds of thousands of short (100-200 nucleotide) DNA sequence reads in a few hours without requiring the preparation of sequence templates by conventional cloning. In the near future, technical advances will likely increase the number and length of sequence reads. Pyrosequencing technology relies upon enzyme cascades and CCD luminescence detection capabilities to measure the release of inorganic pyrophosphate with every nucleotide incorporation [1]. The GS20 system takes advantage of DNA capture beads that contain, on average, one single-stranded template, which is amplified to millions of copies in an oil emulsion PCR (emPCR). The beads are then distributed on a solid-phase sequencing substrate (a PicoTiterPlate™) with 1.6 million wells that can each hold a bead and additional reagents, including polymerase, luciferase, and ATP sulfurylase. Microfluidics cycle each of the four nucleotide triphosphates over the PicoTiterPlate™, and incorporation of a nucleotide releases pyrophosphate, the substrate for a luminescence reaction, which is recorded with a cooled CCD camera. The record of intensity of each flow of a nucleotide is a flowgram, analogous to a chromatogram that reports the order of A, C, G and T residues from a DNA sequencing template. Flowgram values correspond to the homopolymer length for that base. The average number of wells with detectable sequencing templates is about 450,000, which produces about 200,000 usable reads. This new methodology brings with it different sources of error to traditional dideoxy capillary sequencing. Since the nucleotide triphosphates are flowed one at a time, substitutions are less likely than with traditional methods. However, it is sometimes difficult to resolve the intensity of luminescence produced when a homopolymer is encountered. The result can be ambiguity of homopolymer length, particularly for longer homopolymers. In addition, insufficient flushing between flows can cause single base insertions (carry forward events) usually near but not adjacent to homopolymers. Insufficient nucleotides within a flow can cause incomplete extension within homopolymers. Generally, an excess of intermediate flowgram values indicates a poor quality read [2]. The GS20 software makes corrections for carry forward and incomplete extensions (CAFIE); it shortens reads from the 3' end until fewer than 3% of the remaining flowgram values are of intermediate value, and it removes reads if the trimming falls below a threshold length. The software identifies as ambiguous flow cycles in which no flowgram value was greater than 0.5. If 5% or more of the flow cycles for a read are ambiguous, the read is removed. The assembly of many overlapping pyrosequencing reads can produce highly accurate consensus sequences [3,4]. 
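As a toy illustration of the flowgram readout described above (each flow value approximates the homopolymer length of the base flowed at that step), the sketch below decodes a made-up flowgram. The flow order and the values are invented for illustration; real GS20 flowgrams are far longer and noisier.

```python
# Minimal sketch of how a pyrosequencing flowgram is read out: each flow value
# is (approximately) the homopolymer length of the base flowed at that step.
# The flow order and example values here are assumptions for illustration.

FLOW_ORDER = "TACG"  # assumed cyclic nucleotide flow order

def flowgram_to_sequence(flow_values, flow_order=FLOW_ORDER):
    seq = []
    for i, value in enumerate(flow_values):
        base = flow_order[i % len(flow_order)]
        length = int(round(value))   # ambiguity here is what causes homopolymer errors
        seq.append(base * length)
    return "".join(seq)

# Flow values near 1.0 call single bases; 2.10 calls a 2-base homopolymer;
# intermediate values (e.g. 0.6) are where miscalls and Ns tend to arise.
print(flowgram_to_sequence([1.02, 0.03, 2.10, 0.98, 0.05, 1.05]))  # -> "TCCGA"
```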
Wicker et al. [5] compared assemblies of the barley genome produced by reads from GS20 pyrosequencing and from ABI dideoxy sequencing. Both methods produced consensus sequences with error rates of approximately 0.07% at each consensus position. Gharizadeh et al. [6] compared pyrosequences with Sanger dideoxy methods for 4,747 templates. Comparisons of the traditional capillary sequences with the 25-30 nucleotide pyrosequence reads demonstrated similar levels of read accuracy. Assemblies of massively parallel pyrosequencing reads of plastid genomes from Nandina and Platanus exhibited overall error rates of 0.043% and 0.031%, respectively, in the consensus sequence [4]. The generation of consensus sequences to improve accuracy, however, is generally not appropriate for studies that seek information about natural variation from every read. For example, in metagenomic [7] or PCR amplicon [8] libraries from environmental DNA samples, each sequence read can theoretically represent DNA from a distinct gene from a complex mixture of microbial genes. A viable but imperfect alternative to building consensus sequences for metagenomic and diversity investigations is to identify and remove pyrosequencing reads that are likely to be incorrect. For example, Gilbert et al. [9], in a study of ancient woolly mammoth mitochondrial DNA, removed pyrosequencing reads that were not 98% identical to previously sequenced mammoth mitochondrial DNA sequences, assuming that they must be poor quality. Dostie et al. [10] sequenced an amplicon library and discarded reads in which the PCR primer was not recognized by BLAST. These studies removed 15% and 7% of their reads, respectively, but it is not clear that these statistics improved the quality of the remaining data. To explore error modalities, we used the GS20 system to generate more than 340,000 reads from a PCR amplicon library that was prepared from a collection of 43 reference templates of known sequence. Each reference template contains a distinct ribosomal RNA gene (rDNA), including the V6 hypervariable region from a collection of 43 divergent bacteria [11]. Differences between pyrosequences and their cognate reference sequences identified signatures of low quality data. Results Read accuracy We obtained 340,150 reads that passed the GS20 quality filters, that is, flowgrams for each read: contained the correct base key at the start (a portion of the 454 primer used to differentiate reads from internal quality control sequences); included at least 84 flows; had fewer than 5% of flow cycles resulting in an ambiguous base call (N); and had fewer than 3% of flowgram values between 0.5 and 0.7 [12]. We aligned each read to its reference sequence using an optimized Needleman-Wunsch algorithm. Our data included 159,981 total errors over 32,801,429 bases. The error rate, defined as the number of errors (miscalled bases plus inserted and deleted bases) divided by the total number of expected bases, was 0.49%. As shown in Table 1, 39% of these errors correspond to homopolymer effects, including extension (insertions), incomplete extensions (deletions) and carry forward errors (insertions and substitutions). Carry forward occurs when an incomplete flush of base flow results in a premature incorporation of a base. The presence of homopolymers tends to increase the likelihood of both carry forward and incomplete extension with the GS20 sequencer [12]. 
Insertions were the most common type of error (36% of errors), followed by deletions (27%), ambiguous bases (Ns, 21%), and substitutions (16%). It should be noted that the V6 region does not contain long or frequent homopolymers. The errors did not correlate significantly with distance along the sequence (R2 […] 60%, also contain Ns).

Conclusion

Our analysis of the GS20 sequencing error rate of the V6 region of bacterial rRNA genes shows a marked improvement over the original error rates published by Margulies et al. [2]. The largest source of errors may be multi-templated beads, and enhancements to both the chemistry protocol for the GS20 and the built-in bioinformatics software may account for the change in error rates. Our results highlight that a small proportion of low-quality reads, presumably from multi-templated beads, are responsible for the majority of sequencing errors. The ability to identify and remove these reads is the best way to improve the accuracy of the entire dataset. It is not a replacement for assigning quality scores to detect the position of miscalled bases. The interpretation of chromatograms by programs such as PHRED [13] employs quality scores that reflect the probability of any type of base-call error. Although it uses the same scale, the GS20 software generates quality values based on the probability of homopolymer extension rather than the probability of a correct base call. Regardless of the ultimate cause of poor reads, the presence of even a single ambiguous base (N) was an effective indicator of low-quality sequence. The removal of all reads containing one or more Ns can drastically improve the overall quality of the remaining dataset, reducing the error rate from about 0.5% to about 0.25% (Table 2). For our data, this strategy eliminated only 6% of the total reads. By excluding approximately 1% of all reads whose lengths lie outside of the main distribution, as well as those with inexact matches to the primer, the error rate for the V6-tag data dropped to less than 0.2%. The pyrosequencing technology provides such a large number of reads that the elimination of even 10% or more of the reads in a dataset should be more than offset by the increase in quality of the remaining reads.

Table 2. Identifying low-quality reads and their contribution to the error rate

Data selection | Percent of reads | Error rate
All reads | 100.0% | 0.49%
Reads with no Ns | 94.4% | 0.24%
Reads with one or more Ns | 5.6% | 4.7%
Reads with length ≥81 and ≤108 | 98.8% | 0.33%
Reads with length <81 or >108 | 1.2% | 18.9%
Reads with no Ns and length ≥81 and ≤108 | 93.3% | 0.20%
Reads with no proximal errors | 97.0% | 0.45%
Reads with fewer than three proximal errors | >99.99% | 0.48%
Reads with more than three proximal errors | <0.01% | 12.2%
Reads with no Ns, length ≥81 and ≤108, and no proximal errors | 90.6% | 0.16%

Removing reads with Ns is the most effective means we found of removing low-quality data and improving the error rates. Read lengths that are either longer or shorter than expected, and are outside the peak of common reads, also correlate strongly with incorrect reads.

Our strategy for detecting low-quality reads circumvents the need to generate consensus sequences for improving data quality in massively parallel pyrosequencing experiments of environmental DNA. Our criteria for detecting reads with errors allow for the acquisition of pyrosequencing data in the context of molecular ecology that can surpass the accuracy of traditional capillary methods.
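A minimal sketch of the read-screening strategy recommended above: keep only reads without Ns and with lengths inside the expected window (81-108 nt, per Table 2). The helper names are illustrative, and the primer-match check mentioned in the text is omitted.

```python
# Sketch of the read-screening strategy described above: discard reads that
# contain any N and reads whose length falls outside the expected window.
# The length bounds follow Table 2 (81-108 nt for the V6 amplicons); function
# and variable names are illustrative, not from the original pipeline.

def passes_quality_screen(read, min_len=81, max_len=108):
    if "N" in read.upper():
        return False               # a single ambiguous base flags a poor read
    return min_len <= len(read) <= max_len

def screen_reads(reads):
    kept = [r for r in reads if passes_quality_screen(r)]
    removed = len(reads) - len(kept)
    return kept, removed

reads = ["ACGT" * 25,            # 100 nt, clean -> kept
         ("ACGT" * 25)[:70],     # too short -> removed
         "ACGTN" + "ACGT" * 24]  # contains an N -> removed
kept, removed = screen_reads(reads)
print(len(kept), removed)        # 1 2
```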
Materials and methods

Generation of 1 kb clone library and selection of clones for pyrosequencing

DNA was extracted according to Huber et al. [14] from diffuse flow hydrothermal vent samples as described in Sogin et al. [8]. PCR primers were designed using ARB software [15] to target the bacterial 16S rDNA. The primers used were 337F (5' CAN CCT ACG GGN GGC NGC) and 1391R (5' GAC GGG CGG TGW GTN CA). The amplification mix contained 5 units Pfu Turbo polymerase (Stratagene, La Jolla, CA, USA), 1× Pfu reaction buffer, 200 μM dNTPs (Pierce Nucleic Acid Technologies, Milwaukee, WI, USA), and 0.2 μM each primer in a volume of 100 μl. Environmental DNA (3-10 ng) was added to 3 separate 30 μl amplification mixes. A positive control (Marinobacter aquaeolei genomic DNA) and two negative controls (no DNA and genomic DNA from the archaeon Methanococcus jannaschii) were also run. An initial denaturation step of 3 min at 94°C was followed by 30 cycles of 94°C for 30 s, 55°C for 45 s, and 72°C for 2 minutes. The final extension step was 72°C for 2 minutes. Following PCR, the three reactions for each sample were combined, purified, and concentrated using the MinElute PCR Purification Kit (Qiagen, Valencia, CA, USA) according to the manufacturer's instructions. PCR product quality was assessed on a 0.8% agarose gel stained with ethidium bromide. Products were ligated into the pCR4-TOPO vector for 20 minutes at room temperature and transformed into TOP10 electrocompetent cells according to the manufacturer's instructions (Invitrogen, Carlsbad, CA, USA). Colonies for each library were randomly selected and grown in SuperBroth with 50 mg/ml kanamycin in 96 deep-well blocks overnight. Alkaline lysis template preparation was carried out on cell pellets using the RevPrep Orbit II (Genomic Solutions, Ann Arbor, MI, USA) or the RoboPrep2500 (MWG Biotech, Ebersberg, Germany). The 1,000 base-pair amplicons were sequenced bidirectionally using primers T3 (5'-ATT AAC CCT CAC TAA AGG GA) and T7 (5'-TAA TAC GAC TCA CTA TAG GG) on an ABI 3730xl genetic analyzer. Sequences were aligned with MUSCLE (with parameters -diags and -maxiters 10) [16] and manually manipulated in the BioEdit 7.0.1 program [17]. Distance matrices were calculated using quickdist [8], and taxonomic identities were determined using the RDP-II Classifier [18]. Sequences were trimmed to include only the V6 region of the gene, the distance matrix was re-calculated, and from this analysis 43 divergent sequences were chosen for further experimentation. The average length of the V6 region for these clones was 101 bases, ranging from 95 to 109, with one longer reference of 183 bases. The 16S rDNA sequences are deposited at GenBank under accession numbers DQ909092, DQ909128, DQ909132, DQ909133, DQ909142, DQ909144, DQ909158, DQ909184, DQ909202, DQ909204, DQ909218, DQ909223, DQ909224, DQ909248, DQ909251, DQ909253, DQ909266, DQ909274, DQ909337, DQ909368, DQ909392, DQ909396, DQ909400, DQ909414, DQ909423, DQ909438, DQ909440, DQ909465, DQ909474, DQ909498, DQ909513, DQ909519, DQ909538, DQ909603, DQ909618, DQ909631, DQ909662, DQ909688, DQ909702, DQ909706, DQ909719, DQ909727, DQ909753.

Generation of known V6 amplicon library

We treated each plasmid with plasmid-safe DNase (Epicentre, Madison, WI, USA) to remove Escherichia coli genomic DNA and confirmed that each plasmid produced an amplification product of the expected size with primers targeting the V6 region of the bacterial rDNA according to Sogin et al. [8].
We then pooled the individual plasmids and amplified with the primers that flank the V6 region of rRNA genes according to Sogin et al. [8]. We assessed the product quality using a BioAnalyzer Agilent DNA 1000 LabChip following the manufacturer's instructions. Three reactions were combined, purified, and concentrated using the MinElute PCR Purification Kit (Qiagen). The final amplicon library was sequenced independently by 454 Life Sciences and our own lab. Both labs used the Roche Genome Sequencer 20 (GS20) according to the manufacturer's specifications [2]. The original GS20 output files as text are available in Additional data files 3-5. Error rate calculations We combined the data from both sequencing runs for a total of 340,150 reads (226,150 and 114,000), with an average read length of 94.5 nucleotides and a total of 32,816,656 bases. These sequences are available in fasta format in Additional data file 2. To determine the reference sequence source of each pyrosequencing read, we ran a separate multiple sequence alignment of each individual read against the 43 reference sequences using MUSCLE [16] (default options plus maxiters 2, diags). We calculated the number of sequence differences between each read and the reference sequences to determine the reference sequence to which each read mapped most closely. All subsequent error calculations are based on comparing reads to their assigned reference sequence. The overall error rate is the number of errors in a read divided by the length of sequence. Specifically, we calculated errors in several ways. In all methods, each base mismatch or N in the test sequence counts as an error, and a terminal gap caused by a GS20 read terminating before the end of the reference does not count as an error. In the first and second methods each base of an insertion or deletion counts as one error. In the third method, insertions or deletions are counted by homopolymer runs. If a TAAA is inserted, it is counted as two insertions, one single-T and one multi-A insertion. The denominator for the first method was the read length. The denominator for the second and third methods was the length of reference sequence minus any discounted terminal gaps. All error rate calculations produced essentially the same results. We report error rates using all base errors (not by homopolymer run) divided by the expected length (reference sequence length minus terminal gaps). The error rates were calculated for each sequence in a pairwise comparison of the pyrosequencing read and the reference sequence to which it was assigned. We used the Needleman-Wunsch algorithm [19] for these pair-wise alignments because it selects for the best possible alignment given its run parameters. Using a set of 100 sequences and a matrix of Needleman-Wunsch run parameter combinations, we found that a gap opening penalty of 5.75 and a gap extension penalty of 2.75 minimized the calculated error rate. Error rates, reference sequences and read sequences were imported into a MySQL database for storage and analysis. Additional data files The following additional data are available with the online version of this paper. Additional data file 1 is a fasta file of the 43 known sequences used. Additional data file 2 is a gzip-compressed fasta file of the sequences output by the GS20. These sequences correspond to those included in Additional data files 3, 4, 5 but include only the final sequence information. 
Additional data files 3, 4, and 5 are three compressed text files representing the text translations of the original GS20 binary output (sff) files for all of the sequencing used in the analysis, including sequence, flowgram and other run information. GS20 data are reported by region of the PicoTiterPlate™; we sequenced three plate regions.

Supplementary Material

Additional data file 1: The 43 known sequences used.
Additional data file 2: These sequences correspond to those included in Additional data files 3-5 but include only the final sequence information in fasta format.
Additional data file 3: Text translation of the original GS20 binary output (sff) file for the first of three PicoTiterPlate™ regions used in the analysis.
Additional data file 4: Text translation of the original GS20 binary output (sff) file for the second of three PicoTiterPlate™ regions used in the analysis.
Additional data file 5: Text translation of the original GS20 binary output (sff) file for the third of three PicoTiterPlate™ regions used in the analysis.
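For readers who want to reproduce the bookkeeping described in the Error rate calculations section above (mismatches, Ns and indel bases each count as one error; terminal gaps are discounted; the denominator is the expected length), here is a small sketch that scores an already-computed pairwise alignment. It assumes gapped, equal-length strings such as a Needleman-Wunsch alignment produces; the names are illustrative, not the authors' code.

```python
# Sketch of the per-read error-rate calculation described in the Methods above:
# mismatches, Ns and inserted/deleted bases each count as one error, a terminal
# gap caused by the read ending before the reference is not counted, and the
# denominator is the expected length (reference length minus terminal gaps).

def read_error_rate(aligned_read, aligned_ref):
    """aligned_read / aligned_ref: equal-length strings with '-' for gaps."""
    assert len(aligned_read) == len(aligned_ref)
    # Trim terminal gap columns caused by the read stopping short of the reference.
    n = len(aligned_read.rstrip("-"))
    errors = 0
    expected = 0
    for r, g in zip(aligned_read[:n], aligned_ref[:n]):
        if g != "-":
            expected += 1        # reference bases contribute to the expected length
        if r == "-" or g == "-":
            errors += 1          # deletion or insertion
        elif r == "N" or r != g:
            errors += 1          # ambiguous base or substitution
    return errors / expected if expected else 0.0

# One deletion and one ambiguous base (N) over 10 reference bases -> 0.2
print(read_error_rate("ACGT-CGTNA", "ACGTTCGTAA"))
```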

            SNP discovery and allele frequency estimation by deep sequencing of reduced representation libraries.

            High-density single-nucleotide polymorphism (SNP) arrays have revolutionized the ability of genome-wide association studies to detect genomic regions harboring sequence variants that affect complex traits. Extensive numbers of validated SNPs with known allele frequencies are essential to construct genotyping assays with broad utility. We describe an economical, efficient, single-step method for SNP discovery, validation and characterization that uses deep sequencing of reduced representation libraries (RRLs) from specified target populations. Using nearly 50 million sequences generated on an Illumina Genome Analyzer from DNA of 66 cattle representing three populations, we identified 62,042 putative SNPs and predicted their allele frequencies. Genotype data for these 66 individuals validated 92% of 23,357 selected genome-wide SNPs, with a genotypic and sequence allele frequency correlation of r = 0.67. This approach for simultaneous de novo discovery of high-quality SNPs and population characterization of allele frequencies may be applied to any species with at least a partially sequenced genome.
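A small sketch of the kind of sequence-based allele-frequency estimate the abstract refers to: pooled read counts per SNP are turned into an alternate-allele frequency, which can then be compared with array genotypes. The names and depth cutoff are assumptions for illustration, not the authors' pipeline.

```python
# Minimal sketch of allele-frequency estimation from pooled sequence reads, in
# the spirit of the reduced-representation approach summarized above. Counts are
# per-SNP tallies of reads carrying each allele; the depth cutoff is an
# illustrative assumption.

def estimate_allele_frequency(ref_count, alt_count, min_depth=10):
    depth = ref_count + alt_count
    if depth < min_depth:
        return None                      # too few reads for a stable estimate
    return alt_count / depth             # sequence-based alternate-allele frequency

# Pooled reads at one SNP: 48 carry the reference allele, 24 the alternate.
freq = estimate_allele_frequency(48, 24)
print(f"estimated alt-allele frequency: {freq:.2f}")   # 0.33
```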

              SHRiMP: Accurate Mapping of Short Color-space Reads

              Introduction Next generation sequencing (NGS) technologies are revolutionizing the study of variation among individuals in a population. The ability of sequencing platforms such as AB SOLiD and Illumina (Solexa) to sequence one billion basepairs (gigabase) or more in a few days has enabled the cheap re-sequencing of human genomes, with the genomes of a Chinese individual [1], a Yoruban individual [2], and matching tumor and healthy samples from a female individual [3] sequenced in the last few months. These resequencing efforts have been enabled by the development of extremely efficient mapping tools, capable of aligning millions of short (25–70 bp) reads to the human genome [4]–[10]. In order to accelerate the computation, most of these methods allow for only a fixed number of mismatches (usually two or three) between the reference genome and the read, and usually do not allow for the matching of reads with insertion/deletion (indel) polymorphisms. These methods are extremely effective for mapping reads to the human genome, most of which has a low polymorphism rate, and so the likelihood that a single read spans multiple SNPs is small. While matching with up to a few differences (allowing for a SNP and 1–2 errors) is sufficient in these regions, these methods fail when the polymorphism level is high. NGS technologies are also opening the door to the study of population genomics of non-model individuals in other species. Various organisms have a wide range of polymorphism rates - from 0.1% in humans to 4.5% in the marine ascidian Ciona savignyi. The polymorphisms present in a particular species can be used to discern its evolutionary history and understand the selective pressures in various genomic loci. For example, the large amount of variation in C. savignyi (two individuals' genomes are as different as Human and Macaque) was found to be due to a large effective population size [11]. The re-sequencing of species like C. savignyi (and regions of the human genome with high variability) requires methods for short read mapping that allow for a combination of several SNPs, indels, and sequencing errors within a single (short) read. Furthermore, due to larger-scale “structural” variation, only a fraction of the read may match to the genome, necessitating the use of local, rather than global, alignment methods. Previous short read mapping tools typically allow for a fixed number of mismatches by separating a read into several sections and requiring some number of these to match perfectly, while others are allowed to vary [4],[6],[8]. An alternative approach generates a set of subsequences from the read (often represented as spaced seeds [7],[10],[12]), again in such a manner that if a read were to match at a particular location with some number of mismatches, at least one of the subsequences would match the genome [5],[9]. While these methods are extremely fast, they were developed for genomes with relatively low levels of polymorphism, and typically cannot handle a highly polymorphic, non-model genome. This becomes especially apparent when working with data from Applied Biosystem's SOLiD sequencing platform (AB SOLiD). AB SOLiD uses a di-base sequencing chemistry that generates one of four possible calls (colors) for each pair of nucleotides. While a sequencing error is a change of one color-call to another, a single SNP will change two adjacent color positions. 
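The following sketch makes the two-base encoding concrete using the standard SOLiD color code, in which each color is the XOR of the 2-bit codes of adjacent bases (A=0, C=1, G=2, T=3). It demonstrates the property just described: a true SNP changes two adjacent colors, while a single miscalled color mistranslates every base after it when decoding. The sequences are invented for illustration.

```python
# Standard SOLiD 2-base (color-space) code: color = XOR of the 2-bit codes of
# two adjacent bases. Illustrates that a SNP changes two adjacent colors while
# a single color error garbles the letter-space translation downstream.

BITS = {"A": 0, "C": 1, "G": 2, "T": 3}

def to_colors(seq):
    return [BITS[a] ^ BITS[b] for a, b in zip(seq, seq[1:])]

def from_colors(first_base, colors):
    # Decoding needs the first base; one wrong color mistranslates everything after it.
    inv = {v: k for k, v in BITS.items()}
    seq = [first_base]
    for c in colors:
        seq.append(inv[BITS[seq[-1]] ^ c])
    return "".join(seq)

ref = "ATGGCATCA"
snp = "ATGGTATCA"       # single SNP: C -> T at position 4
print(to_colors(ref))   # [3, 1, 0, 3, 1, 3, 2, 1]
print(to_colors(snp))   # [3, 1, 0, 1, 3, 3, 2, 1]  <- exactly two adjacent colors differ

err = to_colors(ref)
err[2] ^= 1                     # simulate one sequencing error (one color changes)
print(from_colors("A", err))    # "ATGTACGAC" -> the tail no longer matches the reference
```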
Hence a read with two (non-adjacent) SNPs and a sequencing error will differ from the reference genome in five different positions. Simultaneously, the nature of the di-base sequencing code allows for the identification (and correction) of sequencing errors, so by carefully analyzing the exact sequence of matches and mismatches within a read, it is possible to determine that the read and the genome differ by two SNPs. While efficient mappers for color-space sequences have been developed [5],[13], they translate the genome to color-space, and directly compare to the color-space read. The complexity of the color-space representation makes the identification of complex variation such as adjacent SNPs and short indels challenging or impossible with these tools. In this paper we develop algorithms for the mapping of short reads to highly polymorphic genomes and methods for the analysis of the mappings. We demonstrate an algorithm for mapping short reads in the presence of a large amount of polymorphism. By employing a fast k-mer hashing step and a simple, very efficient implementation of the Smith-Waterman algorithm, our method conducts a full alignment of each read to all areas of the genome that are potentially homologous. Secondly, we introduce a novel, specialized algorithm for mapping di-base (color-space) reads, which allows for an accurate, non-heuristic alignment of AB SOLiD reads to a reference genome. Finally, we introduce methodology for evaluating the accuracy of discovered alignments. Because a read may match the genome in several locations with variable amounts of polymorphism, we develop a statistical method for scoring the hits, allowing for the selection of the most probable variants, and filtering of false positives. Our methods are implemented as part of SHRiMP: the SHort Read Mapping Package. To demonstrate the usefulness of SHRiMP we re-sequenced a Japanese Ciona savignyi genome on the SOLiD platform. Preliminary estimates obtained in the course of sequencing the reference genome indicate that the SNP heterozygosity is 4.5%, whereas indel heterozygosity is 16.6%. This species represents the most challenging known test case for the detection of polymorphisms with short read technologies. We aligned the SOLiD reads of the Japanese individual to the C. savignyi reference genome using both SHRiMP and AB's read mapper. SHRiMP is able to identify 5-fold more SNPs than AB's mapper, while also capturing 70,000 indel variants. Results/Discussion This section is organized as follows: we begin with three methodological sections, in which we first present an overview of the algorithms used in SHRiMP for mapping short reads, explain our specialized algorithm for alignment of di-base sequencing (AB SOLiD) data, and present our framework for computing p-values and other statistics for alignment quality. The data flow for these methods is illustrated in Figure 1. In the last two subsections we will first show the application of SHRiMP to the resequencing of Ciona savignyi using the AB SOLiD sequencing technology and present results on the accuracy of the SHRiMP tool on simulated data. 10.1371/journal.pcbi.1000386.g001 Figure 1 Data flow and processing within the SHRiMP. Candidate mapping locations are first discovered by the seed scanner and then validated by the vectorized Smith-Waterman algorithm, computing only a score. Top scoring hits are then fully aligned by a platform-specific algorithm (i.e. letter-space for Solexa data and color-space for SOLiD data). 
Statistical confidence for the final mappings are then computed using the PROBCALC utility. Read Mapping Algorithm The SHRiMP algorithm draws upon three recent developments in the field of sequence alignment: q-gram filter approaches, introduced by Rasmussen et al [14]; spaced seeds, introduced by Califano and Rigoutsos [15] and popularized by the PatterHunter family of tools [7],[10]; and specialized vector computing hardware to speed up the Smith-Waterman Algorithm [16]–[18] to rapidly find the likely locations for the reads on the genome. Once these locations are identified, we conduct a thorough, Smith-Waterman-based algorithm to rigorously evaluate the alignments. In this section we will provide a brief exposition of the methods used to align short reads in SHRiMP (a more thorough description of each of these steps is in Methods). Spaced seeds Most heuristic methods for local alignment rely on the identification of seeds – short exact matches between the two sequences. The advantage of using exact matches is that they are easy to find using hash tables, suffix arrays, or related techniques. While classically seeds have been contiguous matches, more recently “spaced” seeds, where predetermined positions in the read are allowed not to match, have been shown to be more sensitive. Spaced seeds are often represented as a string of 1 s and 0 s, where 1 s indicate positions that must match, while 0 s indicate positions that may mismatch. We refer to the length or span of the seed as the total length of the string, and the weight of the seed as the number of 1 s in the string. For example, the seed “11100111” requires matches at positions 1–3 and 6–8, and has length 8 and weight 6. Because seeds with such small weight match extremely often, we require multiple seeds to match within a region before it is further considered, using a technique called Q-gram filtering. Q-gram filters While most older local alignment tools, such as BLAST, use a single matching seed to start a thorough comparison of the strings around the seed, more recently Rassmussen et al [14] introduced the use of q-gram filters, where multiple seeds are used to determine if a good match exists. This idea is also used in SHRiMP where we require a pre-determined number of seeds from a read to match within a window of the genome before we conduct a thorough comparison. Vectorized Smith-Waterman If a particular read has the required number of seeds matching to a window of the genome we conduct a rapid alignment of the two regions to verify the similarity. This alignment is done using the classical Smith-Waterman algorithm [19], implemented using specialized “vector” instructions that are part of all modern CPUs. In order to speed up this stage we compute just the score of the optimal alignment, and not the alignment itself. For every read we store the locations of top hits, sorted by their score. The number of top hits to store is a parameter. Final alignment After we finish aligning all of the reads to all of the potential locations, we conduct a final, full alignment of each read to all of the top hits. This final alignment stage differs depending on the specifics of the sequencing technology. Within SHRiMP we have implemented separate final alignment modules for Illumina/Solexa data (this is done with the regular Smith-Waterman algorithm) and for color-space (di-base) data produced by the AB SOLiD instrument (described in the next section). 
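Returning to the filtering stages described earlier in this section, the sketch below illustrates spaced-seed extraction and q-gram filtering: spaced k-mers from a read are indexed, and a genome window is passed on to the (here omitted) Smith-Waterman stage only if it collects enough seed hits. The seed "11100111" is the example from the text; the window size, hit threshold and all names are illustrative assumptions, not SHRiMP's implementation.

```python
from collections import defaultdict

# Illustration of spaced-seed indexing and q-gram filtering as described above,
# not SHRiMP's actual code. Seed "11100111" (span 8, weight 6) is the example
# from the text; the window size and hit threshold are invented.

SEED = "11100111"

def spaced_kmer(s, start, seed=SEED):
    # Keep only the positions where the seed has a '1'.
    return "".join(s[start + i] for i, bit in enumerate(seed) if bit == "1")

def index_read(read, seed=SEED):
    index = defaultdict(list)
    for i in range(len(read) - len(seed) + 1):
        index[spaced_kmer(read, i, seed)].append(i)
    return index

def candidate_windows(genome, read_index, window=40, min_hits=2, seed=SEED):
    """Return start positions of genome windows with >= min_hits spaced-seed matches."""
    hits = defaultdict(int)
    for g in range(len(genome) - len(seed) + 1):
        if spaced_kmer(genome, g, seed) in read_index:
            hits[g // window] += 1
    return [w * window for w, n in hits.items() if n >= min_hits]

genome = "TTTTTTTTTTACGTACGGAACCTTGGATTTTTTTTTT"
read   = "ACGTACGTAACCTTGG"          # differs from the genome region by one base
print(candidate_windows(genome, index_read(read)))   # [0] -> inspect this window further
```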
Additionally we have an experimental module for alignment of two-pass sequencing data, where two reads are generated from every genomic location, which is described elsewhere [20]. Algorithm for Color-space Alignment The AB SOLiD sequencing technology introduced a novel dibase sequencing technique, which reads overlapping pairs of letters and generates one of four colors (typically labelled 0–3) at every stage. Each base is interrogated twice: first as the right nucleotide of a pair, and then as the left one. The exact combinations of letters and the colors they generate are shown in Figure 2A. The sequencing code can be thought of as a finite state automaton (FSA), in which each previous letter is a state and each color code is a transition to the next letter state. This automaton is demonstrated in Figure 2B. It is notable that the sequence of colors is insufficient to reconstruct the DNA sequence, as reconstruction requires knowledge of the first letter of the sequence (or the last letter of the primer, which is fixed for a single run of the technology). 10.1371/journal.pcbi.1000386.g002 Figure 2 Two representations of the color-space (dibase) encoding used by the AB SOLiD sequencing system. A: The standard representation, with the first and second letter of the queried pair along the horizontal and vertical axes, respectively. B: The equivalent Finite State Automaton representation, with edges labelled with the readouts and nodes corresponding to the basepairs of the underlying genome. The AB SOLiD sequencing technology has the remarkable property of differentiating between sequencing errors and biological SNPs (under the assumption that the reference genome has no sequencing errors): a SNP changes two adjacent readouts of the color-space code, while a sequencing error is unlikely to happen at two adjacent positions by chance (the technology does not sequence adjacent letters at adjacent time points). At the same time, however, the color-space code introduces certain complexities. Let us consider a comparison done by first translating the color-space read code into the letter-space sequence. Notice that a single sequencing error would cause every position after the place of error to be mistranslated (Figure 3B). Consequently, most approaches have translated the letter-space genome into the corresponding color code. However, this is problematic: since the color-coding of every dibase pairing is not unique, a string of colors can represent one of several DNA strings, depending on the preceding base pair. For example, a string of zeroes could be translated as a poly-A, poly-C, poly-G or poly-T string. 10.1371/journal.pcbi.1000386.g003 Figure 3 Various mutation and error events, and their effects on the color-code readouts. The reference genome is labeled G and the read R. A: A perfect alignment; B: In case of a sequencing error (the 2 should have been read as a 0) the rest of the read no longer matches the genome in letter-space; C: In case of a SNP two adjacent colors do not match the genome, but all subsequent letters do match. However, D: only 3 of the 9 possible color changes represent valid SNPs; E: the rules for deciding which insertion and deletion events are valid are even more complex, as indels can also change adjacent color readouts. There is an additional drawback to translating the genome into color-space code: a sequence of matches and mismatches in color-space does not map uniquely into letter-space similarity. 
For example, a single SNP results in two sequential color-space mismatches. However, given two consecutive colors, there are 9 possible ways to generate two mismatches. Of these, only 3 correspond to a SNP, while the rest lead to DNA strings that completely differ from the reference. This is illustrated in Figure 3D. We propose an alternate approach. Our key observation is that while a color-space error causes the rest of the sequence to be mistranslated, the genome will match one of the other three possible translations. This is illustrated in Figure 4C. Consequently, we adapt the classical dynamic programming algorithm to simultaneously align the genome to all four possible translations of the read, allowing the algorithm to move from one translation to another by paying a “crossover”, or sequencing error penalty. If one wishes for a probabilistic interpretation of the algorithm, one can consider the FSA in Figure 2B to be a Hidden Markov Model, where the letter is the hidden state, and the color-space sequence is the output of the model. By taking the cross product of this HMM with the standard pair-HMM associated with the Smith-Waterman algorithm, we can allow all of the typical alignment parameters, including the error penalty, to be probabilistically motivated as the log of the probability of the event, and trained using the Expectation-Maximization algorithm. It is notable that our approach handles not only matches, mismatches, and sequencing errors, but also indels. Because the sequences are aligned in letter-space (to be precise, they are aligned and translated simultaneously), indels can be penalized using the standard affine gap penalty with no further modification of the algorithm. 10.1371/journal.pcbi.1000386.g004 Figure 4 Color-space (dibase) sequence alignment. A: The Dynamic Programming (DP) representation, B: recurrences, and C: alignment of a letter space sequence to a color-space read with a sequencing error. Within the DP matrix we simultaneously align all of the four possible translations (vertical) to the reference genome (horizontal); however the alignment can transition between translations by paying the crossover penalty. This is illustrated by the fourth recurrence, where the third index ( ) corresponds to the translation currently being used. In the alignment (C) after the sequencing error, the original translation of the read (starting from a T) no longer matches, but a different one (starting from a C) does. In the SHRiMP algorithm, we only apply the special color-space Smith-Waterman algorithm in the final stage. For the initial stages, we convert the genome from letter-space to color-space, and search for k-mer matches as well as perform vectorized Smith-Waterman strictly in color-space. In order to better incorporate SNPs in color-space data, we use a spaced seed that allows for two adjacent mismatching colors between the read and the reference genome. Computing Statistics for Reads and Mate-pairs Once all of the reads are mapped, for every read and mate-pair we compute mapping confidence statistics. Initially these are computed for each read; however, they are then combined to compute likelihoods of accidental matches for mate-pairs. Computing statistics for single reads While a very thorough statistical theory for local alignments has been established [21], this theory assumes the comparison of infinite length strings, and hence is inappropriate for evaluating alignments of very short reads to a reference genome. 
Instead, we have designed confidence statistics that explicitly model short reads and allow for the computation of confidences in the presence of short insertions and deletions. We estimate the confidence in the possible mappings of each read by using the following statistics (calculated by the PROBCALC program): pchance, the probability that the hit occurred by chance, and pgenome, the probability that the hit was generated by the genome, given the observed rates of the various evolutionary and error events. For example, a good alignment would have a low pchance (close to 0) and a very high pgenome (close to 1). In this section we briefly expand on these two concepts, give them mathematical definitions, and merge them to formulate an overall alignment quality measurement. A detailed description is in Methods (Computing Statistics: pchance and pgenome).

The pchance of a hit is the probability that the read will align with as good a score to a genome that has the same length but random nucleotide composition with equal base frequencies (that is, the read will align as well by chance). In order to compute this, we count all of the possible k-mers with an equal number of changes as observed in the hit, and we call this number z. For example, if we only have substitutions in our alignment (that is, no indels) and an alignment length of l, then z gives the number of unique strings to which the read can align with the specified number of substitutions. A more detailed discussion of the construction of z, especially the more complex count for indels, appears in Computing Statistics: pchance and pgenome. The term z/4^l compares the number of unique strings with the given score (when aligned to the read) to all possible unique reads of length l, and gives us the probability that the read matches by chance at any one location. To compute the pchance statistic over the entire length of the genome, we assume independence of positions and evaluate the likelihood that there is a match at any of the roughly 2Gf candidate positions, where l is the alignment length, G is the genome length (the factor 2 corresponds to the two strands), and f is a correction factor for mappings that are shorter than the length of the read, detailed in Computing Statistics: pchance and pgenome.

Our second computation, pgenome, defines the probability that a hit was generated by the genome via common evolutionary events characteristic of the genome, i.e. substitutions, indels and errors. First, we estimate the rate of each type of event via bootstrapping. Then, we compute the likelihood that the read will differ by as many events from the genome via a binomial probability that uses this estimate and our observations of the events in the current hit. For example, when considering the number of errors, we first estimate the average error rate p̂_err over all hits, and then define the probability that the current read was created via this many errors as

p_err = C(l, k_err) · p̂_err^k_err · (1 − p̂_err)^(l − k_err),

where k_err is the number of observed errors in the current hit and l is the alignment length. We can similarly define p_sub and p_indel for substitution and indel events, respectively. Finally, we can form

pgenome = p_err · p_sub · p_indel.

More specifics about the mathematical formulations are available in Computing Statistics: pchance and pgenome. Finally, we define the quality measurement of this hit as the normalized odds,
that is, a probability odds ratio normalized over all of the hits of this read:

normodds(h) = (pgenome_h / pchance_h) / Σ_h' (pgenome_h' / pchance_h').

This value represents the relative credibility of this hit compared to the others for a given read: a single hit would have a normalized odds score of 1, two equally good hits will both have normodds of 0.5, while for an exact match and a more distant one, the former will have a normodds close to 1 and the latter close to 0.

Computing statistics for mate-pairs

SHRiMP also assigns mate-pair confidence values (akin to the read confidence values predicted by PROBCALC) by combining the confidence values for individual reads with empirically observed distributions of insert sizes in the library. We compute the distribution of the mapped distances (the distance between the mapped positions of the two reads) for all mate-pairs, and save the average distance (see Computing Mate Pairs with Statistics for more details). Then, for each mate-pair mapping, we assign a pchance, pgenome and normodds score, similar in meaning to those used in the previous section.

pchance for mate-pairs: let pchance_i(g) be the pchance of read i computed with g, the length of the genome, as a parameter. The pchance of a mate-pair (read_1, read_2) combines the two single-read values, where G is the length of the genome used in PROBCALC, d̄ is the average mate-pair distance, and d is the distance of the current mate-pair. That is, we ask the question: what is the probability that a read as good as the first read would align anywhere in the genome by chance, and that a second read would align by chance within the observed mate-pair distance?

pgenome for mate-pairs: let pgenome_i be the pgenome of read i. We can compute the pgenome of each mate-pair as

pgenome_mp = pgenome_1 · pgenome_2 · P_tail(d),

where P_tail(d) is the tail probability of the mate-pair distance distribution we computed (both tails, starting at the cutoff). Therefore, for a mate-pair with a distance very close to d̄, the pgenome will be close to pgenome_1 · pgenome_2; otherwise it will be penalized. Thus, following the definition of pgenome, we get a lower probability that the mate-pair was generated from the genome if the mate-pair distance is too big or too small compared to the average. A discussion of the implementation steps is included in the SHRiMP README, and a more detailed discussion of the statistical values is included in Computing Mate Pairs with Statistics.

Validation

In our experiments, we used SHRiMP to compare 135 million 35 bp reads from the tunicate Ciona savignyi to the reference genome [22]. The fragments were sequenced from sheared genomic DNA with an AB SOLiD 1.0 instrument. In the following sections we first describe the running time of SHRiMP at different parameter settings, and then evaluate the quality of our alignments compared to Applied Biosystems' read mapping program.

Running time analysis

One of the advantages of the SHRiMP algorithm is the seamless parallelism provided by the fact that we can simply subdivide the reads into separate computational jobs without affecting the results. This allows us to take full advantage of compute clusters regardless of the amount of memory available at each machine. We took a random subset consisting of 500,000 35 bp C. savignyi reads and mapped them to the genome. The full read dataset and reference genome are available at http://compbio.cs.toronto.edu/shrimp/misc/paper_ciona_reads_35mer.csfasta.tar.bz2 and http://mendel.stanford.edu/sidowlab/CionaData/CionaSavignyi_v.2.1.fa.zip, respectively. The running times at several parameter settings are summarized in Table 1.
Note that from the smallest to the largest seed weight we see a nearly two orders of magnitude difference in total run time, most of which is concentrated in the vectorized Smith-Waterman filter and, to a lesser degree, in the spaced k-mer scan. The final, full color-space Smith-Waterman alignment took approximately constant time across all runs, as the average number of top-scoring hits that reached that stage was nearly constant (24.49±0.5); however, its proportional time increased as the filter stages became more efficient. While SHRiMP is somewhat slower than other short read mapping programs, it allows both for micro-indels in the alignments and for a proper color-space alignment algorithm. SHRiMP is also very configurable in terms of sensitivity and running time trade-offs.

Table 1. Running time of SHRiMP for mapping 500,000 35 bp SOLiD C. savignyi reads to the 180 Mb reference genome on a single Core2 2.66 GHz processor.

K-mer | (7,8) | (8,9) | (9,10) | (10,11) | (11,12) | (12,13)
% K-mer Scan | 10.1% | 16.5% | 18.9% | 13.4% | 9.8% | 7.4%
% Vectorized SW Filter | 88.8% | 75.4% | 49.8% | 30.2% | 20.1% | 14.9%
% Full SW Alignment | 1.1% | 8.0% | 30.7% | 55.5% | 68.8% | 76.2%
Time | 1 d 21 h 34 m | 6 h 18 m | 1 h 36 m | 50 m 28 s | 37 m 52 s | 32 m 32 s

In all cases, two k-mer hits were required within a 41 bp window to invoke the vectorized Smith-Waterman filter.

Ciona savignyi polymorphism analysis

The primary strength of SHRiMP and other mapping methods based on Smith-Waterman alignments is the ability to map reads containing complex patterns of sequence variation, including insertions, deletions and clusters of closely spaced SNPs. Mappers that exclusively produce ungapped alignments can only find SNPs. Furthermore, they are more likely to miss dense clusters of SNPs, since the overlapping reads contain many mismatches, and SNPs adjacent to an indel, since only a small fraction of the overlapping reads contain just the SNP. Finally, since SHRiMP produces local alignments, it can map a read even if either end overlaps a large indel or structural variant. To evaluate the effectiveness of SHRiMP for detecting sequence variation, we used it to find polymorphisms in a resequenced Ciona savignyi individual. C. savignyi is a challenging test case because of its very high polymorphism rate: the SNP heterozygosity is 4.5% and the average per-base indel heterozygosity is 16.6% (an indel rate of 0.0072 events per base) [11]. We therefore expect that even short reads will frequently span multiple variant sites. We used the AB SOLiD sequencing platform to generate 135 million reads of length 35 bp from a single C. savignyi individual. We then aligned those reads to the reference genome [22] with SHRiMP using lenient scoring thresholds so that reads with multiple variant sites could be mapped, and we selected the single highest-scoring alignment for each read (see Methods). We discarded alignments in repetitive sequence by removing reads with multiple similarly scoring alignments ("non-unique" matches). The mapping took 48 hours using 250 2.33 GHz cores. Table 2 summarizes the mapping results.
Table 2. Mapping results for 135 million 35 bp SOLiD reads from Ciona savignyi using SHRiMP and the SOLiD mapper provided by Applied Biosystems.

| SHRiMP | SOLiD Mapper
Uniquely-Mapped Reads | 51,856,904 (38.5%) | 15,268,771 (11.3%)
Non-Uniquely-Mapped Reads | 64,252,692 (47.7%) | 12,602,387 (9.4%)
Unmapped Reads | 18,657,736 (13.8%) | 106,896,174 (79.3%)
Average Coverage (Uniquely-Mapped Reads) | 10.3 | 3.0
Median Coverage (Uniquely-Mapped Reads) | 8 | 1
SNPs | 2,119,720 | 383,099
Deletions (1–5 bp) | 51,592 | 0
Insertions (1–5 bp) | 19,970 | 0

Non-uniquely-mapped reads have at least two alignments, none of which is significantly better than the others (see Methods). SNPs and indels have at least four supporting reads.

The alignment data contain noise due to two types of errors: sequencing errors and chance alignments. Chance alignments are a significant problem for short reads, particularly with the low alignment score thresholds necessary for mapping reads containing significant variation. Reads containing both sequence variation and sequencing errors are even more likely to map to the wrong position in the reference sequence. To combat the high false-positive rate, for the remaining analysis we focused on a high-quality subset of the data consisting of sequence variants supported by at least four independent reads. Across the genome SHRiMP detected 2,119,720 SNPs supported by at least four reads. For comparison, we used the SOLiD aligner provided by Applied Biosystems to map the reads to the genome with up to three mismatches, where each mismatch can be either a single color-space mismatch or a pair of adjacent mismatches consistent with the presence of a SNP. Compared to the SOLiD mapper, SHRiMP mapped 4.2 times as many reads and found 5.5 times as many SNPs. The AB mapper, however, was a lot faster, requiring 255 CPU hours to complete the alignments, roughly 50× faster than SHRiMP. While it is possible to run the mapper with greater sensitivity, allowing for more errors and SNPs, and thus more mapped reads, doing so would surrender much of the runtime advantage and still not overcome its fundamental inability to detect insertion and deletion polymorphisms. SHRiMP, on the other hand, is capable of handling indels, and detected tens of thousands of them: 51,592 deletions and 19,970 insertions of size 1–5 bp. The observed ratio of roughly 2.5× between deletions and insertions in the C. savignyi data is biased by the construction of the reference genome - whenever the two haplomes differed, the reference agreed with the longer one. There is also a smaller inherent bias against detecting insertions (reads containing nucleotides not present in the reference) compared to deletions, because a read spanning a deletion only incurs a gap penalty, whereas an insertion both incurs a gap penalty and has fewer bases that match the reference. For simulated data (see next section) this bias was only ∼5% for single-basepair indels (data not shown). The size distribution of the detected indels (Figure 5A) drops more rapidly with length than expected [11], but this detection bias against longer indels is not surprising, since longer indels have lower alignment scores.

Figure 5. Size distribution of indels (A) and distance between adjacent SNPs (B) detected by SHRiMP. The distance between adjacent SNPs shows a clear 3-periodicity, due to the fact that a significant fraction of the non-repetitive C. savignyi genome is coding.

Mapping C. savignyi sequence is challenging primarily because the population contains so much variation. Figure 5B shows the high frequency of closely spaced SNPs detected by SHRiMP.
Mappers that can only detect nearly exact matches fail to map the reads overlapping these dense SNP clusters. Note that even though the reads are generated from the whole genome, a significant fraction of the non-repetitive C. savignyi genome is coding, making it possible to see the typical three-periodicity of SNPs in coding regions. Furthermore, SHRiMP recovers microindels, which are completely invisible to ungapped aligners and yet account for a significant fraction of sequence variation in C. savignyi.

Analysis of simulated data

In order to further validate the accuracy of the SHRiMP alignments we designed simulated experiments, in which we sampled random locations from the C. savignyi genome, introduced polymorphisms (SNPs and indels) at the rates previously observed in the C. savignyi genome [22], added sequencing errors at the rates observed in our C. savignyi dataset (2–7%, depending on the position in the read), and mapped the reads back to the original genome. Each sampled read could have multiple SNPs and indels, though due to the low indel rate only a small fraction of the reads had multiple indels. We mapped the reads with SHRiMP and postprocessed the mappings with PROBCALC (pchance […] 46% of all reads with either 4 SNPs or 5 bp indels, despite the large number of sequencing errors in our dataset, up to 7% towards the end of the read).

Methods

Details of the SHRiMP Algorithm

The algorithm starts with a rapid k-mer hashing step to localize potential areas of similarity between the reads and the genome. All of the spaced k-mers present in the reads are indexed. Then, for each k-mer in the genome, all of the matches of that particular k-mer among the reads are found. If a particular read has as many or more than a specified number of k-mer matches within a given window of the genome, we execute a vectorized Smith-Waterman step, described in the next section, to score and validate the similarity. The highest-scoring regions are retained, filtered through a full backtracking Smith-Waterman algorithm, and output at the end of the program if their final scores meet a specified threshold. The SHRiMP algorithm is summarized in Figure 6.

Figure 6. SHRiMP hashing technique and vectorized alignment algorithm. A: Overview of the k-mer filtering stage within SHRiMP: a window is moved along the genome, and if a particular read has a preset number of k-mers within the window, the vectorized Smith-Waterman stage is run to align the read to the genome. B: Schematic of the vectorized implementation of the Needleman-Wunsch algorithm. The red cells are the vector being computed, on the basis of the vectors computed in the last step (yellow) and the next-to-last (blue). The match/mismatch vector for the diagonal is determined by comparing one sequence with the other one reversed (indicated by the red arrow below). To obtain the set of match/mismatch positions for the next diagonal, the lower sequence needs to be shifted to the right.

Spaced seed filter

We build an index of all spaced k-mers in the reads and query this index with the genome. Our approach was taken primarily for simplicity: our algorithm can rapidly isolate which reads have several k-mer matches within a small window by maintaining a simple circular buffer of recent positions in the genome that matched the read.
Since our targeted compute platform is a cluster of batch-processing machines, indexing the reads means that we can easily control memory usage and parallelism by varying the read input size and splitting the read set accordingly. Data is loaded only at program invocation; we do not stream new reads from disk as the algorithm runs.

Vectorized Smith-Waterman implementation

The SHRiMP approach relies on a rather liberal initial filtering step, followed by a rigorous but very fast Smith-Waterman alignment step. By maximizing the speed of the Smith-Waterman comparison, we can let the algorithm test a larger number of potential regions. Most contemporary mobile, desktop and server-class processors have special vector execution units, which perform multiple simultaneous data operations in a single instruction; for example, the eight individual 16-bit elements of two 128-bit vectors can be added in one machine instruction. Over the past decade, several methods have been devised to significantly speed up Smith-Waterman-type algorithms by parallelizing the computation of several cells of the dynamic programming matrix. The simplest such implementation computes the dynamic programming matrix along diagonals. Since each cell of the matrix can be computed once the cells immediately above, immediately to the left, and at the upper-left corner have been computed, each successive diagonal can be computed once the two prior diagonals have been completed. In this way, the problem can be parallelized across the length of the supported diagonals (see Figure 6B), in most cases a factor of 4 to 16. The only portion of such a 'Wozniak' approach that cannot be parallelized is the identification of match/mismatch scores for every cell of the matrix, which has to be done sequentially. These operations are expensive, necessitating 24 independent data loads for 8-cell vectors, and become increasingly problematic as vector sizes increase: because memory loads cannot be vectorized, the number of lookups grows with the degree of parallelism. For example, with 16-cell vectors the number of data loads doubles to 48.

We propose an alternative method whose running time is independent of the number of matches and mismatches in the matrix, though it supports only fixed match/mismatch scores rather than full scoring matrices. Our key observation is that the score computation for every diagonal can be completely parallelized. Figure 6B demonstrates the essence of our algorithm: by storing one of the sequences backwards, we can align the two sequences in such a way that a small number of logical instructions obtains the positions of matches and mismatches for a given diagonal. We then construct a vector of match and mismatch scores for every cell of the diagonal without expensive, un-vectorizable load instructions and without pre-computing a 'query profile'. In our tests, this diagonal approach with our scoring scheme surpasses the performance of Wozniak's original algorithm and performs on par with Farrar's method [17]; Table 4 summarizes these results. The advantage of our method over Farrar's is that its speed is independent of the scores used for matches, mismatches and gaps, and it will scale better with larger vector sizes. A disadvantage is that we cannot support full scoring matrices and are restricted to match/mismatch scores, though this is less important for DNA alignment.
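To make the reversed-sequence idea concrete, the following scalar Python sketch (an illustration only, not SHRiMP's SIMD code) computes the per-cell match/mismatch indicators for each anti-diagonal by comparing the read against a contiguous slice of a reversed copy of the reference. With vector instructions, each such element-wise comparison collapses to a single operation per diagonal, which is exactly the saving described above.

```python
# Sketch: match/mismatch indicators per anti-diagonal of the DP matrix,
# obtained by comparing the read against a reversed reference slice.
# With SIMD, each element-wise comparison below is one vector instruction.

def diagonal_match_vectors(read, ref):
    """For each anti-diagonal d, yield the match flags of cells (i, j) with
    i + j = d, where i indexes the read and j indexes the reference."""
    m, n = len(read), len(ref)
    ref_rev = ref[::-1]                      # store the reference backwards
    for d in range(m + n - 1):
        i_lo = max(0, d - n + 1)
        i_hi = min(m - 1, d)
        # Cells on this diagonal: i = i_lo..i_hi, j = d - i.
        # Because ref is reversed, ref[d - i] == ref_rev[n - 1 - d + i], so the
        # comparison runs over two contiguous slices (hence vectorizable).
        read_slice = read[i_lo : i_hi + 1]
        rev_slice = ref_rev[n - 1 - d + i_lo : n - 1 - d + i_hi + 1]
        yield d, [a == b for a, b in zip(read_slice, rev_slice)]

if __name__ == "__main__":
    for d, flags in diagonal_match_vectors("ACGT", "AGGT"):
        print(d, flags)
```

Shifting the reversed reference slice by one position per diagonal corresponds to the shift of the lower sequence shown in Figure 6B.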
Additionally, Farrar's method is much faster for large databases in which most of the sequence is dissimilar to the query. This is never the case for SHRiMP, however, because the seed-scan phase targets only small, similar regions for dynamic programming; in these cases the two algorithms perform similarly.

Table 4. Performance (in millions of cells per second) of the various Smith-Waterman implementations: a regular (unvectorized) implementation, Wozniak's diagonal implementation with memory lookups, Farrar's method, and our diagonal approach without score lookups.

Processor type    Unvectorized    Wozniak    Farrar    SHRiMP
Xeon              97              261        335       338
Core 2            105             285        533       537

Each implementation was inserted into SHRiMP, which was then used to align 50 thousand reads to a reference genome with default parameters. The improvements of the Core 2 architecture for vector instructions lead to a significant speedup for our approach and Farrar's, while Wozniak's algorithm improves only slightly because of its slow match/mismatch lookups.

Final pass

The vectorized Smith-Waterman step described above is used to rapidly determine whether a read has a strong match to the local genomic sequence. The locations of the top hits for each read are stored in a heap data structure, which is updated after every invocation of the vectorized Smith-Waterman algorithm whenever the heap is not full or the attained score is greater than or equal to the lowest-scoring stored hit. Once the whole genome has been processed, the highest-scoring matches are re-aligned using the appropriate full color-space or letter-space Smith-Waterman algorithm. This is necessary because the vectorized Smith-Waterman algorithm described above computes only the maximum score of an alignment, not the traceback, which would require a much more complicated and costly implementation. Instead, at most the top alignments for each read are re-aligned in this final step.

Computing Statistics: pchance and pgenome

In Computing Statistics for Single Reads, we briefly introduced the concepts of pchance, pgenome and the normalized odds of a hit. In this section we expand on the construction of pchance and pgenome. The formulas use the following quantities: the genome length; the alignment length (note that this may differ from the read length, which is constant); the number of substitutions (mismatches) in the alignment; the number of nucleotide insertions in the alignment, where the genome is taken as the "original" sequence (for example, if the genome is AC-G and a read is ACTG, there is an insertion of a T); the number of nucleotide deletions in the alignment (for example, if the genome is ACTG and a read is A-TG, there is a deletion of a C); and the numbers of insertion events and deletion events (for example, a single insertion of length 3 counts as one insertion event comprising three inserted nucleotides). Two derived counts also appear. The first is the number of distinguishable permutations of insertion events: to determine it, we consider the frequency of insertion events of each size; for example, if there are 3 insertions of size 2, the permutations must be divided by 3!. The distinguishable permutations of insertion events are therefore the factorial of the number of insertion events divided by the product of these per-size frequency factorials, and the corresponding count for deletion events is defined analogously. The second derived count describes the number of ways to assign a given number of indistinguishable objects to a given number of indistinguishable bins, and is defined recursively.
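This last quantity is the classical count of integer partitions, i.e. the number of ways to split a total among indistinguishable, non-empty groups. The Python sketch below uses the standard recursion for partitions into at most k parts; it is an assumed reconstruction offered for illustration, not the recursion as printed in the original article.

```python
# Sketch: ways to place n indistinguishable objects into at most k
# indistinguishable bins, i.e. integer partitions of n into at most k parts.
# Standard textbook recursion, assumed here as an illustration of the quantity
# described above (the article's own base cases are not reproduced).

from functools import lru_cache

@lru_cache(maxsize=None)
def partitions(n, k):
    """Number of ways to write n as a sum of at most k positive parts."""
    if n == 0:
        return 1          # the empty partition
    if k == 0:
        return 0          # objects remain but no bins are left
    if k > n:
        k = n             # at most n non-empty parts are possible
    # Either all k parts are non-empty (remove one object from each part),
    # or at most k - 1 parts are used.
    return partitions(n - k, k) + partitions(n, k - 1)

if __name__ == "__main__":
    # e.g. 4 inserted nucleotides distributed over at most 2 insertion events:
    # {4}, {3+1}, {2+2}  ->  3 ways
    print(partitions(4, 2))   # 3
```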
pchance

We begin with the mathematical formulation of pchance, defined above, given in formula (7). Its central term, as described before, is the number of possible unique sequences at the given edit distance from the read, expressed as a fraction of all possible unique reads of the alignment length; this gives the probability that the current read aligns by chance to a random genome the size of a read. To this term we add a correction factor accounting for all the positions at which an alignment of the given size could be found; for example, if the read length is 25 and we have a match of size 22, there are 25 − 22 + 1 = 4 positions at which the match could occur. Finally, extrapolating from a random genome the size of a read to a random genome of the full genome length yields formula (7).

The factor at the core of this calculation is the number of possible unique sequences that would align to the read at the given edit distance. We have already shown the definition that computes this count when there are no indels in the alignment (formula (8)). However, the number of reference sequences to which a read will map with a particular indel count depends on the sequence of that read and is significantly more complicated, so we bound it from below and above. The lower bound (fewest unique sequences) occurs when the current read is one repeated nucleotide, for example [AAAAAA]; the upper bound occurs with the most change in nearby nucleotides, say [ACGTAC]. In the former case, we consider the deletion events from the genome to this read, all combinations of that number of deletion events and deleted nucleotides, and all the places where these combinations may occur, giving formula (9). For the upper bound, the places and combinations of insertions also matter in generating unique sequences, adding two extra terms and giving formulas (10) and (11). To estimate the correct value between these bounds, we estimated the average complexity of the reads in our dataset (i.e., where they fall between the simplest [AAAAA…] and the most complex [ACGTACGT…]) and found that the observed mean could be accurately estimated by formula (12). Finally, the total count is approximated by formula (13).

pgenome

In Computing Statistics for Single Reads, we defined pgenome as a product of per-event-type terms (formula (14)), each depending on the rate of the event type (estimated via bootstrapping) and the number of observed events of that type in the current alignment. We described this as an approximation because each probability contributing to pgenome carries a small correction. First, for the error term, the number of sites that can support errors is the read length minus one, giving formula (15). When considering substitutions, changes can occur at any of the inner nucleotides, excluding erroneous sites, giving formula (16). As before, alignments that involve indels make the formula more complex. In the case of pgenome, we do not have to consider the various placements of insertion or deletion events, but we do have to consider, for fixed placements of events, the number of ways of distributing the total numbers of inserted and deleted nucleotides among the given numbers of events (formula (17)).

Computing mate pairs with statistics

In this section we provide details of the implementation, usage and statistics of the matepair post-processing step introduced in Computing Statistics for Mate-pairs.
We define a good matepair mapping as one in which the distance between the two reads is smaller than some chosen limit and the two read mappings have a consistent orientation and strand (i.e. R+F+ or F−R−); a minimal sketch of this classification is given after the Parameters paragraph below. First, probcalc_mp computes a matepair distance and standard deviation by examining all connected forward and reverse reads (all matepairs) and adding the distance of any matepair with exactly one good mapping to a histogram. Optionally, one can choose to use only unique good mappings, or to use only a certain number of mappings (say, the first 100,000) to speed up the program. Next, we call a matepair concordant if it has at least one good mapping and discordant otherwise. Depending on the task, probcalc_mp can output all concordant matepairs or all discordant matepairs. For each matepair mapping, probcalc_mp computes the pgenome and pchance introduced in Computing Statistics for Mate-pairs.

Parameters

For the C. savignyi polymorphism analysis we ran SHRiMP with the following parameters. We used the spaced seed "11110111" and required two hits per 40-base window to invoke the Smith-Waterman algorithm. The Smith-Waterman scoring parameters were +100 for a matching base, −90 for a mismatch, −250 and −100 to open and extend a gap, respectively, and −300 for a crossover (sequencing error). The minimum Smith-Waterman score thresholds were 1000 for the vectorized first pass and 1275 for the final alignment pass. We discarded alignments with pchance greater than 0.05, and, to remove reads originating from known repetitive sequence, we required normodds to be at least 0.8.
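The following minimal Python sketch illustrates the good-mapping test and the concordant/discordant classification described above. The field names, the distance limit, and the exact reading of the "R+F+ or F−R−" orientation convention are assumptions made for illustration, not probcalc_mp's actual data structures or defaults.

```python
# Sketch of the matepair classification described above; field names, the
# distance limit, and the orientation convention (interpreted from "R+F+ or
# F-R-") are illustrative assumptions, not probcalc_mp's implementation.

from dataclasses import dataclass

@dataclass
class Mapping:
    position: int   # leftmost genome coordinate of the read's alignment
    strand: str     # '+' or '-'

def is_good_mapping(r_read: Mapping, f_read: Mapping, max_distance: int) -> bool:
    """Good mapping: reads within max_distance, on the same strand, ordered
    R then F on the '+' strand (R+F+) or F then R on the '-' strand (F-R-)."""
    if r_read.strand != f_read.strand:
        return False
    if abs(r_read.position - f_read.position) > max_distance:
        return False
    if r_read.strand == "+":
        return r_read.position <= f_read.position
    return f_read.position <= r_read.position

def classify_matepair(r_mappings, f_mappings, max_distance=10_000):
    """Concordant if at least one (R, F) mapping pair is good, discordant otherwise."""
    for r in r_mappings:
        for f in f_mappings:
            if is_good_mapping(r, f, max_distance):
                return "concordant"
    return "discordant"

if __name__ == "__main__":
    r_hits = [Mapping(1_000, "+")]
    f_hits = [Mapping(3_500, "+")]
    print(classify_matepair(r_hits, f_hits))   # concordant: R+F+ within 10 kb
```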

                Author and article information

                Journal
                BMC Genomics
                BMC Genomics
                BioMed Central
                1471-2164
                2011
                25 January 2011
                : 12
                : 59
                Affiliations
                [1 ]Department of Plant Sciences, University of California, Davis, CA, 95616, USA
                [2 ]Genomics and Gene Discovery Research Unit, USDA-ARS, Western Regional Research Center, Albany, CA, 94710, USA
                Article
                1471-2164-12-59
                10.1186/1471-2164-12-59
                3041743
                21266061
                a1cf57e5-0c4b-46d6-bce3-d363088a64dc
                Copyright ©2011 You et al; licensee BioMed Central Ltd.

                This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

                History
                Received: 24 August 2010
                Accepted: 25 January 2011
                Categories
                Methodology Article

                Genetics
