Introduction The last decade has seen dramatic advances in sequencing technology that have relied on highly-parallel optical sensing of polymerisation reactions. These advances have substantially reduced sequencing costs, however further reduction in cost is limited by the dependence of these platforms on photo-receptive sensors and their associated reagents. In 2011, Life technologies began distribution of the Ion Torrent Personal Genome Machine (PGM). The PGM leverages advances in semi-conductor technology and ion-sensitive transistors to sequence DNA using only DNA polymerase and natural nucleotides, with each polymerisation event recognised by pH changes alone . The PGM requires similar library preparation steps to Roche 454 shotgun libraries, where an adapter and key () is ligated to the DNA templates, and under optimal conditions, a single DNA template is affixed to a bead and clonally amplified using emulsion PCR. The beads are then loaded onto the chip, where, on average, each well contains less than a single bead. Deoxynucleotide triphosphates (dNTPs) are flowed over the surface of the bead in a predetermined sequence, with zero or more dNTPs ligating during each flow. While the first generation of the PGM cycled through the four nucleotides in a step-wise fashion (as does the Roche 454 pyrosequencer), this cycle was modified to have a period of 32, with a pattern that repeats some nucleotides in a period shorter than four. This more complex flow cycle, referred to as the Samba, was implemented to improve synchronicity of clonal templates on the bead at the cost of a flow-sequence not optimised for read length. A single proton is released for every nucleotide incorporated during the flow, resulting in a net decrease of the pH in the surrounding solution. This pH change is measured by an ionic sensor and then converted to a flow value using a physical model of the cell and its contents. The PGM base-caller takes these flow-values and corrects for phase and signal loss, and also normalizes the raw flow-values to the key sequence (the key is a known four-base DNA molecule appended to the 5′ region of every read sequence). Finally, quality assurance procedures are applied to the data, filtering polyclonal and noisy reads, and clipping adapters and low quality regions from the 3′ end of the remaining reads. An assessment of the PGM sequencing technology was provided with the publication of the sequencing platform . However, this analysis was conducted using quality-trimmed data (rather than raw reads) and reported accuracy based on a consensus rather than individual read accuracy. In addition, substantial changes have been made to the platform since the initial publication, including the change of flow order (Samba), quality control and base calling, which are likely to influence the error rates and biases in PGM data. Two studies comparing benchtop sequencers, including the PGM, were recently published , , both focusing on the assembly prospects of benchtop platforms on single isolate genomes. Consistent with the scope of these studies, they used the quality-trimming recommended by each distributor and did not generate sufficient replicate data to understand variability in accuracy. No study of PGM data to date has attempted to robustly evaluate error rates or explored the possible causal factors of errors at the flow-level. Here, we characterise the nature of errors produced by the PGM, including sequencing kit, chip type, machine variability, template G+C%, and chip, read and flow positional effects. Our findings contribute to a more comprehensive understanding of the errors and biases inherent in PGM data, and will enable the bioinformatics community to develop appropriate algorithms for this new platform. Results Datasets In this study, we considered the influence of both technological (template preparation kit, chip, machine) and DNA template factors on sequencing output using a full factor statistical design (see Methods ). We evaluated three template preparation kits, listed in chronological release order, Ion OneTouch Template Kit (100 bp OneTouch kit), Ion Xpress Template 200 Kit (200 bp manual kit), Ion OneTouch 200 Template Kit (200 bp OneTouch kit), two chip densities, 314 (1.2M sensors, density 200,000 reads) and 316 (6.1M sensors, density 1 million reads) and two PGM machines (named here a and b). Two species, Sulfolobus tokodaii (33 G+C%) and Bacillus amyloliquefaciens (46% G+C%), both with relatively small, sequenced genomes and no extra-chromosomal DNA, were selected to represent different G+C% templates. A third organism, Deinococcus maricopensis (69% G+C), was selected for its high G+C content, however library preparation consistently failed using the 200 bp kits, and read throughput was extremely low for the 100 bp OneTouch kit, consequently it was excluded from subsequent analyses. Fifteen datasets were generated ( Table 1 ) and used to evaluate the effects and interactions between template preparation kit, chip and machine on the measured error rate (see Methods ). 10.1371/journal.pcbi.1003031.t001 Table 1 Sequencing runs generated for this study. Treatment # Reads % Wells with ISPs % With adaptors #Reads Used % Reads mapping Mean length Mean Length (AT) Mean quality 314-B-a-100 275,058 54% 96% 275,058 96% 146.8 116.1 32.9 314-S-a-100 188,925 63% 94% 188,925 94% 153.6 125.2 32.5 316-S-a-100 2,135,728 55% 97% 300,000 97% 149.9 121.4 33.5 316-B-a-100 2,364,054 67% 98% 300,000 98% 145.2 114.2 33.1 314-B-a-200M 453,539 66% 100% 300,000 95% 267.7 248 33.6 314-S-a-200M 286,106 45% 100% 286,106 94% 269.40 254.5 33.9 316-S-a-200M 1,321,709 51% 100% 300,000 91% 268.40 254.2 30.9 316-S-b-100 1,442,952 51% 98% 300,000 98% 151.3 121.4 32.7 314-B-b-100 441,037 68% 92% 300,000 91% 152.9 123.3 33.7 316-B-a-200M 1,124,128 95% 100% 300,000 92% 267.4 248.2 29.5 314-S-b-200M 197,811 36% 100% 197,811 93% 271.7 256.7 33.4 316-B-b-200 2,345,739 95% 5.7% 300,000 94% 298.9 250.5 33.2 314-S-b-200 426,411 94% 15% 300,000 85% 318.2 278.4 34.0 316-S-a-200 2,586,746 82% 6.7% 300,000 93% 305.0 258.5 32.1 314-B-a-200 373,256 50% 12% 300,000 87% 307.9 263.9 34 The name for each run is comprised of the chip (314, 316), species (B – Bacillus amyloliquefaciens, S – Sulfolobus tokodaii), machine (a, b), and kit (100 - Ion OneTouch Template Kit, 200M - Ion Xpress Template 200 kit, 200 - Ion OneTouch 200 Template kit). Runs are listed in chronological order. ‘% Wells with ISPs’ describes the percentage of wells on the chip which contained a bead. Mean Length AT denotes length after 3′ adapter trimming. For the 100 bp OneTouch kit and 200 bp manual kit, the great majority of reads (94–100%) had adapter sequences detected by the PGM software. The newest kit considered here, the 200 bp OneTouch Kit, had a very low percentage of reads with detected adapters (7–12%), which may be due to library construction differences (longer inserts used). We applied the 3′ clipping of adapter sequences as specified in the Standard Flowgram File (SFF), however we did not apply the recommended 3′ quality clipping – allowing more realistic calculation of raw error rates. Datasets larger than 300,000 reads were randomly subsampled down to 300,000 reads as this provided sufficient information for downstream analyses ( Table 1 ). On average, 93±3.5% of all analysed reads mapped to their respective reference genomes, with the poorest performance obtained with the 200 bp OneTouch kits on 314 chips. Mean read quality across all the datasets was 32.88±1.20 s.d. The distribution of read lengths ( Figure S1 ) shows that both the 100 bp and 200 bp One Touch kits were bi-modal with a smaller secondary peak ∼100 bp greater than the expected length, which was more prominent in runs using the 314 chip. These longer reads did not exhibit any deviation in mean G+C content or homopolymer composition from the dominant read length peak for their respective runs. However the majority of these longer reads did not map to the reference genome, but those that could be mapped had an error rate double that of the mean base-error rate across all datasets. Considering these reads have a substantially higher error rate, we recommend the removal of unexpectedly long reads prior to analysis. High frequency polymorphisms between sequenced reads and reference genomes Prior to assessing PGM error profiles, we determined if there were any genuine polymorphisms between the PGM determined and reference genomes, which may be the result of accumulated mutations in the genome  or sequencing errors in the original genome project. We tested each base difference between the reads and their respective reference genome to identify whether the number of observed differences was significantly higher than the expected error rate (see Methods ). Across all datasets, there were a large number of significant differences, predominantly high-frequency insertion and deletion (indel) polymorphisms ( Table 2 ). While the number of polymorphisms appear to be lower for 100 bp OneTouch kits, this is likely due to lower coverage reducing the sensitivity of our ‘polymorphism’ detection. It has been previously reported that the majority of indel polymorphisms detected in PGM reads are false-positives, even when the ‘putative’ indel was present across a large number of reads , . We would expect that if the indels in our datasets were bona fide polymorphisms they would be observed across all datasets for the same species. Analysis of the 200 bp kits revealed that 87% of high-frequency indels across the B. amyloliquefaciens datasets, and 82% across S. tokodaii were unique to a single run ( Figure S2a and S2b ). As the data were derived from the same DNA template, this strongly indicates that these indels are due to PGM-based error as opposed to genuine polymorphisms. While few high frequency indel (HFI) sites were shared amongst all 200 bp runs for the same species, the size of the intersection between pairs of runs suggests that the HFI (or a subset of them) are not random (i.e. they may be more prevalent in or around a particular sequence motif) ( Figure S2a and S2b ). HFIs have been observed previously in PGM data, with some evidence to suggest the HFI were asymmetrically distributed across reads in the forward versus reverse orientation (or vice versa) . We investigated whether any of the HFIs in our data were asymmetrically distributed across the forward and reverse oriented reads aligning across that site (see Methods ). This test was only performed on the 200 bp kits as the 100 bp kits had insufficient coverage for evaluation. Significant indel-strand asymmetry was detected in 7.4% (897/12,107) of the testable putative indel sites across the 200 bp datasets ( Table 2 ), with some datasets having a substantially higher percentage of asymmetric indels than others (i.e. 314-S-b-200, 314-S-a-200M). As with the HFI superset, these strand-asymmetric HFIs have more sites in common between runs than expected for a random event ( Figure S2c and S2d ). 10.1371/journal.pcbi.1003031.t002 Table 2 Number of genomic locations where a significant proportion of reads disagreed with the reference. Treatment Substitutions Deletion Insertion Asymmetric Deletion (# significant/# tested) Asymmetric Insertion (#significant/#tested) 314-B-a-100 24 107 40 N/A N/A 314-S-a-100 18 45 38 N/A N/A 316-S-a-100 21 162 98 N/A N/A 316-B-a-100 9 110 79 N/A N/A 314-B-a-200M 553 761 1303 12/618 5/644 314-S-a-200M 483 1105 2446 165/365 33/529 316-S-a-200M 568 337 911 35/247 30/561 316-S-b-100 13 131 161 N/A N/A 314-B-b-100 11 92 83 N/A N/A 316-B-a-200M 534 246 546 1/159 0/294 314-S-b-200M 309 409 1014 24/365 13/529 316-B-b-200 164 979 717 32/905 0/315 314-S-b-200 292 1579 2189 398/1533 21/1041 316-S-a-200 198 398 1100 64/350 7/623 314-B-a-200 170 1127 1214 53/1049 4/554 The last two columns show the number of significant strand-specific error instances out of the total testable instances (testable defined here as having reads aligning in both orientations over the site). The majority of HFI errors were single-base insertions or deletions and only 29% of these occurred in homopolymeric regions, in contrast, 82% of strand-asymmetric HFIs occurred in a homopolymer of length two to three. In general, HFI errors manifested most commonly as insertions of A/T or deletions of C/G ( Figure S3 a). Strand-asymmetric HFIs were dominated by deletions of C and G ( Figure S3b ). There was no relationship between the base or flow positions of these errors across the reads with the same HFI. It will require further experimentation to identify whether HFIs are introduced by the library preparation, template preparation or the sequencer itself. The HFI error rate relative to the reference genome was approximately 1 in 1000 to 1 in 2000 bp, and HFIs accounted for 0.03%–0.09% of all bases in each dataset. Given that it is difficult to distinguish HFIs from bona fide polymorphisms without sequencing the same template over multiple chips, PGM sequencing may be compromised in polymorphism detection and amplicon sequencing projects. Here, we mask the HFIs from downstream analyses, as they are unlikely to be the consequence of individual flow-call inaccuracy, and will introduce bias into modeling of base and flow-call accuracy. Coverage and G+C bias Initial studies characterising the PGM reported that there was little correlation between genomic coverage and G+C% content. More recently it was claimed that, based on visual comparisons of the theoretical versus empirical genomic coverage, there was substantially less read coverage in A+T rich regions . Here, we attempt to quantify the magnitude and significance of the relationship between coverage and G+C content. Inspection of the data suggested that the relationship between G+C% and coverage differs for the two species used in this study ( Figure 1a ). While B. amyloliquefaciens has a higher mean G+C (46.1%) compared to S. tokodaii (32.8%), both have a large number of 100 bp windows within the range of 20–50% G+C content, allowing direct comparison ( Figure 1b ). The positive correlation between G+C and coverage in S. tokodaii versus the negative correlation with B. amyloliquefaciens is clear even when restricting to this 20–50% G+C range. This suggests that the relationship between G+C% and coverage is influenced by the DNA template from which the sequences are derived. We fitted a linear model to the normalised coverage to evaluate the significance and magnitude of the relationship between G+C% and coverage, as well as the influence of species (DNA template) on this relationship (see Methods ). All terms in the regression were significant (p T, G A) are the most prevalent form of substitution reported in bacteria , transitions from C to T and G to A were found to be equally likely with transversions from C to A and G to C for both species ( Figure 8 ). As these rates have not been corrected for the G+C% of the parent organisms (both with <50% G+C), the proportion of substitutions which transition/transverse in higher G+C% organisms is likely larger than observed here. The novelty of this observation, in light of accepted substitution mechanisms, suggests that these substitutions are a consequence of the PGM sequencer, which is further reinforced by the prevalence of low confidence flow-calls for these substitutions. emPCR is an unlikely source of these errors, as transitions are the dominant substitution type for 454 GS FLX Titanium sequencing . Considering that the PGM introduces substitution errors at a rate of between 0.04%–0.1%, rare variants (single nucleotide polymorphisms) which occur at a frequency greater than 0.1% may be detected using the PGM platform. 10.1371/journal.pcbi.1003031.g008 Figure 8 Breakdown of substitution type as a proportion of all substitutions for each sequencing kit. A closer look at PGM quality scores Each quality score, q, generated by the PGM base-caller is Phred-based, where q = −10×log10(p error). A quality score is assigned to each base using a pre-computed quality lookup-table distributed with each version of the PGM software. The lookup table uses six predictors of local quality, described elsewhere (Life Sciences Technical Note Version 2.0.1–2.20). For each Ion Torrent quality score, we evaluated the empirical rate of error for bases assigned to that quality score. Consistent with previous reports , , we observe that the PGM quality scores underestimate the base accuracy, but observe that they have become more accurate with sequential sequencing kit releases ( Figure 9 ). The relationship between the empirical quality and the estimated quality is not strictly linear, potentially a consequence of the quality score look-up table method. As with Roche 454 quality scores, the qualities can be only used to detect inserted and substituted bases . We evaluated whether there was any relationship between position within the homopolymer and the assigned confidence. We found that even in correctly called bases within a homopolymer, there is a decrease in assigned quality along the homopolymer length ( Figure S10 ), as was found in Roche 454 pyrosequencing , although, overall the quality scores decrease more rapidly for inserted (error) bases. Counter intuitively, it is possible for the first base in a homopolymer run to have lower quality score than later bases, irrespective of whether the homopolymer is the correct length or not ( Figure S10 ). This could originate from penalties for local or environmental noise (P1 and P6 in the documentation), which allow individual base qualities to be adjusted if immediately upstream or downstream calls are noisy. 10.1371/journal.pcbi.1003031.g009 Figure 9 Ion Torrent quality scores versus empirically estimated quality score for base. The grey cloud surrounding the LOESS smoother function indicates the 95% confidence interval for the conditional mean. Individual observations for each quality are plotted as black points. How effective is Ion Torrent quality assurance? The base-calling software in the Torrent Suite (version 2.0.1) performs two quality assurance steps prior to outputting sequences. The first step evaluates the residual between observed flow values and predicted flow values, based on a model of the flow cell. Reads with residuals that produce a median absolute value greater than a given threshold are filtered from both the SFF and the FASTQ, as they are assumed polyclonal. The second step scans non-polyclonal reads to identify undesirable 3′ regions of the read, which are subsequently trimmed. Undesirable regions are defined as regions containing the adapter sequence (and beyond) as well as low-quality regions. Should both adapter and quality trims be specified, the more stringent trimming is used. The newly released Torrent Suite (2.20) introduced a third trimming approach which clips the read based on high-residual ionogram (HRI) 1-mer and 2-mer flow values, which may indicate ‘noisy’ flows. A high-residual 1-mer is in the range [0.50, 0.59] or [1.40, 1.49], a high residual 2-mer in the range [1.50, 1.59] or [2.41, 2.49] (Torrent User Documentation Version 2.2.0). We consider the influence of PGM quality and HRI trimming methods on error rates and other metrics for each of the sequencing kits ( Methods , Table 4 ). For consistency, we initially apply our trimming, where we only consider the first 100 bp for the 100 bp OneTouch kit, and the first 200 bp for the 200 bp Manual and 200 bp OneTouch kits. We used the quality clip as specified in the SFF file, and calculated the clip points for HRI trimming. Very little improvement was gained by applying the quality clip after our analysis trimming. This suggests that only unmappable reads are removed by PGM quality clipping, and/or quality trimming is occurring after base 100 for the 100 bp OneTouch kit, and 200 for the 200 bp kits. This is to be expected, considering that only a long, consistently noisy region of the read could violate the quality threshold (mean of Q9 or less across 30 bp, i.e. 87.5% mean base accuracy). The addition of HRI trimming results in a substantial improvement over our ‘analysis’ treatment, with an absolute decrease in insertion and deletion rates by 0.23% to 1%, with generally greater improvements in the insertion rate over the deletion rate ( Table 4 ). While there is asymmetry in the insertion versus deletion rate, we did not expect a disproportionate decrease in the insertion rate compared to the deletion rate after trimming. In an attempt to improve the deletion rate, we considered whether the addition of HRI 3-mers to this metric would account for our previous observation that deletions are more common in longer homopolymers, however this did not yield substantial improvements ( Table 4 ). 10.1371/journal.pcbi.1003031.t004 Table 4 Effect of quality and flow trimming on dataset metrics, aggregated by kit used. Treatment AT AT+QT AT+QT+HRI AT+QT+HRI3 Kit 100 200M 200 100 200M 200 100 200M 200 100 200 200M Metric Insertion Rate (%) 0.83 2.68 1.76 0.80 2.64 1.74 0.42 0.97 0.79 0.39 0.89 0.73 Deletion Rate 0.80 2.00 1.08 0.77 1.97 1.07 0.45 0.87 0.59 0.43 0.81 0.56 Comparison homopolymer rate 1.5%  1.78  Read length: 25th percentile 99 199 199 99 199 199 90.0 105.0 167.0 85.0 99.0 160.0 Read length: 50th percentile 990 199 199 99 199 199 98.0 170.0 198.0 98.0 164.0 198.0 Read length: 75th percentile 100 200.0 200 100.0 200 200 99 199 199 99 198.0 199.00 % of data retained 100.0 100.0 100.0 99.8 100.0 100.0 99.4 99.3 99.6 99.3 99.1 99.5 % error-free reads 52.5 8.3 22.2 53.3 8.4 22.2 63.4 24.1 29.6 65.0 27.0 31.0 Errors per read: 50th percentile 0 6 2 0 6 2 0 2 1 0 2 1 Errors per read: 75th percentile 2 13 7 2 13 7 1 4 3 1 3 3 Errors per read: 99th percentile 13 35 36 12 35 35 6 12 13 5 11 12 AT = Analysis trim, QT = Quality trim, HRI = High-residual ionogram trim (1-mers and 2-mers), HRI3 = High-residual ionogram trim (1-mer, 2-mer, 3-mers). The ‘comparison homopolymer rates’ are taken from other literature using the same kit and level of quality assurance (both cases used Torrent Server version 1.5.0). After default HRI clipping, the vast majority of reads in the 200 bp kits contained at least one error ( Table 4 ), with the top 99th percentile as high as 6 for the 100 bp OneTouch kit, and 12–13 for both 200 bp kits (an error-rate of ∼6%). No obvious global metrics could be identified to filter these error-prone reads. Discussion As with any new sequencing technology, exploring the nuances of the data lays an essential foundation for developing platform-specific bioinformatics methods. The goal of this study was to provide a comprehensive evaluation of the types of errors and biases introduced during PGM library preparation and sequencing. Previous studies which used the manufacturers aligner, ‘tmap’, on default settings implicitly removed reads with two or more indels prior to data analysis , . As these filtered reads can only be distinguished with the benefit of a reference genome, such studies are more relevant for re-mapping applications. By using an indel-tolerant mapping approach, we were able to map almost 100% of the reads, which largely explains why the global base error-rates calculated in this study are higher than previously reported. Even after applying the quality trimming used for the initial Ion Torrent studies, we found that our estimate of quality-clipped accuracy for the 100 bp OneTouch kit was substantially lower than first reported. The first Ion Torrent study was conducted prior to the changes in the flow-cycle pattern, a change implemented to improve synchronicity between reads on the same bead. By analysing the flow-values, we are able to show that this change in the flow-pattern results in specific positions within the flow cycle being more susceptible to over-call/under-call error than others. Error-prone cycles tend to be ‘A’ or ‘T’ flows, and unlike Roche 454, over-calls of zero length homopolymers and under-calling homopolymers of length one are not improbable. We find that insertions are more common than deletions overall, however, under-calling (deletions) rapidly becomes the dominant error type with increasing homopolymer length. The variance of flow-values with respect to homopolymer length, cycle number, and position-within a cycle (PIC) will present challenges to studies, such as indel variant detection, which often assume a global base error rate. Consequently, higher read coverage and use of PGM flow-specific error rates will be required for applications to confidently distinguish PGM indel errors from genuine variants. However, the substitution error rate, which occurs at an order of magnitude lower than the indel error rate (0.04%–0.1%), suggests that Ion Torrent could be used to detect SNP variants that occur at frequencies greater than 0.1%. Through the use of replicates and modelling of flow-values, we are able to identify high frequency indel (HFI) errors that could easily be mistaken for polymorphisms in the absence of replicates. Flow-level analyses suggest that these errors do not correlate with factors implicated in over-call/under-call errors. The frequency of HFIs with respect to the reference genome will yield undesirable results for a number of applications. For example, in polymorphism detection, HFI regions will yield false positives . In assembly, these regions yield unresolvable differences  or frame-shifts. Amplicon sequencing will suffer both from run-specific HFI errors and synchronised over-call/under-call errors; it is expected that amplicons from closely related species will be synchronised in their called flows, thus exposing them to much higher error rates as a result of noisy PICs. Given the difficulty of detecting HFIs in isolation, we recommend generating two or more datasets from the same starting DNA to resolve the majority of HFI errors (i.e. run-specific HFIs). Consistent with previous results, we find the PGM introduces coverage bias against low and high G+C% sequences, with evidence suggesting that attempts to sequence high G+C% organisms (greater than 65% G+C%) may even fail during the PGM library preparation. Although the mechanisms behind these failed libraries are not evaluated here, we anticipate they will bias representation of high G+C% organisms in metagenomes and metatranscriptomes. As expected for a new technology, there have been marked improvements in the PGM since its limited release in January, 2011. The newest kit considered in this study, the 200 bp OneTouch, has substantially reduced the error rates observed in the 200 bp Manual kit. Furthermore, the accuracy of PGM quality scores has improved markedly with each kit release. The addition of HRI-based clipping (as of Torrent Server 2.2.0) to complement the relatively lax quality trim has proven extremely effective at removing the error-prone 3′ end of the reads, albeit at a cost of 20–30 bp of read length. For datasets processed with older versions of Torrent Server, we highly recommend re-running the PGM analysis to improve run quality. However, trimming cannot remove all errors. Thus, we recommend that researchers who intend to use or develop methods for analysing PGM data take into account that the PGM has a higher error-rate than both Illumina and 454, and that PGM global base error rates are poor substitutes for flow error rates. Armed with the models developed in this study, bioinformaticians can develop platform-specific approaches for the PGM that adequately account for the majority of errors introduced by this platform. Methods DNA DNA for Bacillus amyloliquefaciens subsp. amyloliquefaciens DSM7, Deinococcus maricopensis DSM 21211 and Sulfolobus tokodaii DSM 16993, was acquired from the Leibniz-Institut DSMZ – German Collection of Microorganisms and Cell Cultures. These organisms were selected as they had small genomes consisting of a single chromosome, no plasmids, and complete reference genomes. They were also selected to span a range of mean genomic G+C% (B. amyloliquefaciens 46.1%, D. maricopensis 69.8%, S. tokodaii 32.8%). The SFF files for all B. amyloliquefaciens and S. tokodaii datasets are available from http://ecogenomic.org/acacia. Library preparation and sequencing For the 100 bp libraries, 100 ng of genomic DNA was sheared by adaptive focused acoustics using a Covaris S2 (Covaris Inc.) and Covaris Micro-tubes and using the method and shearing conditions described in the Ion Fragment Library Kit (publication 4467320 Rev. B) protocol. For 200 bp libraries, 200 ng of genomic DNA was also sheared using a Covaris S2, with modification of the shearing conditions (intensity, 4; time, 27 sec). The remaining steps in the library preparation were performed using the Ion Plus Fragment Library Kit, using the corresponding User Guide (Publication 4471989 Rev. B), with modification. Due to the shearing volume required for the Covaris S2 micro-tubes, the sheared DNA samples were treated as if they were 1 µg samples, in the end repair steps, as this catered for a larger starting volume. From that point forward the library was treated as if the input was 100 ng. Following adapter ligation and nick repair, the library was size selected using a Pippin Prep (Sage Science) instrument, followed by 6 cycles of amplification. Using the Agilent 2100 Bioanalyzer (Agilent Technologies) with the High Sensitivity DNA Kit (Agilent Technologies), the quality, size and concentration of the libraries was determined and the library was diluted, prior to template preparation, so as to keep polyclonal values in the sequencing results to a minimum. For 100 bp libraries, the Ion OneTouch system (composed of the Ion OneTouch Instrument and Ion OneTouch ES) was used to prepare the template, using the methods outlined in the Ion OneTouch Template Kit and associated User Guide (Publication 4468007 Rev. E). 200 bp libraries were prepared either manually, using the Ion Xpress Template 200 Kit and associated User Guide (Publication 4471974 Rev. C), or using the Ion OneTouch system, using the Ion OneTouch 200 Template Kit and associated User Guide (Publication 4472430 Rev C.). 100 bp templates were sequenced using the Ion Sequencing Kit and user guide (Publication 4469714 Rev. C). Manually prepared 200 bp templates were sequenced using the Ion Sequencing 200 Kit, using methods outlined in the corresponding User Guide (Publication 4471999 Rev. B), while 200 bp templates prepared using the OneTouch system were sequenced using the Ion PGM 200 Sequencing Kit and the associated User Guide (Publication 4474246 Rev. B). The Ion Torrent Suite 2.0.1 was used for all analyses and the SFF was subsequently downloaded for analysis. Test runs for D. maricopensis either failed to produce viable libraries, or produced very few reads of low quality, so sequencing this species was no longer pursued. A factorial design was used for estimating variability due to Chip, Species and Kit, consisting initially of 8 sequencing runs ( Table S2 ). A second experiment using a randomised Plackatt-Burman design consisting of 4 runs was conducted to help estimate inter-machine variability ( Table S3 ). However, the Ion Xpress Template 200 kit was phased out during the experiment, preventing the completion of the final experiment. The release of a new kit part-way through the data-generation necessitated the addition of four new datasets, also conducted using a Plackatt-Burman design ( Table S4 ). In total, 15 datasets were generated. Read preparation and alignment The unclipped reads, flows and qualities were extracted from the SFF. The location of the predicted adapter sites for each read was also extracted. Only reads with an adapter site were clipped to the recommended length. All reads retained their flow key (the first four called bases in every read, TACG) to maintain ease of moving between a base and flow-value coordinate system. Flow and cycle coordinate systems reported start from zero, and base positions start from one. The reference sequence for each genome was downloaded from NCBI, and converted to run-length encoded  (RLE) form (S. tokodaii - NC_003106.2, B. amyloliquefaciens NC_014551.1, and D. maricopensis NC_014958.1). Each read was also collapsed to its RLE form (i.e. homopolymers collapsed into a single base). This reduces the influence of numerous homopolymer errors on the alignment thresholds, as well as making it easier to transition between flow and base coordinates. Each read was aligned to its respective reference genome using Segemehl version 0.1.2 , an InDel tolerant short-read aligner. The alignment output was parsed and the aligned regions extracted from both the read and reference RLEs. This was necessary as both substitutions and identical bases are reported as matches (‘M’) in the SAM format generated by this version of Segemehl. Using the SeqOp string and the aligned sequence segments, each position in the alignment was recorded as a match, substitution, insertion or deletion. Using a set of in-house Perl scripts and SQL databases, the alignment positions were mapped to their relevant flow-value, base position/s and quality scores. Read level attributes such as average quality, average G+C% in 100 bp windows and read length were also calculated. To avoid over-inflation of the error rate due to (1) 5′ misalignments and (2) homopolymers which overlapped the last base of the key and the first base/s of the read, we only consider base positions 10 and greater for analysis. Furthermore, analysis was restricted to the first 100 bp for the Ion OneTouch Template Kit (100 bp reads), and 200 bp for the Ion Xpress Template 200 and Ion OneTouch 200 Template kits, as the number of reads longer than this decreased rapidly leading to exaggerated error rates at these base positions. Statistical filtering and analysis Repeats MUMmer 3.22  (program ‘repeat-match’ with parameter –n 30) was used to identify regions of perfect match repeats within the reference genomes. These regions were masked from all downstream analysis. Polymorphisms While the sequenced type strains for each species were used, there is still potential for polymorphic differences between the cultured strains and the reference genome. Classification and consequent masking of these genomic locations in the reference allows more accurate modeling of error rates in downstream analyses. Beginning with substitution differences between the read and reference, we simply modeled the observed number of differences, X, at each reference position as a binomial, with the probability of substitution estimated from the data (mean substitution rate). Due to the large number of comparisons, the p-values were corrected using Holm's method . The significance threshold was taken to be 0.05. A similar approach was adopted for detecting insertion and deletion polymorphisms, however, in the case of single-base insertions (over-calls of zero) the coverage of the site was taken to be the maximum coverage of the bases immediately adjacent to the insert. The probability of an insertion or deletion was also estimated from the data, but parameterised on the homopolymer length, as previous reports have shown that the error rate increases dramatically with homopolymer length . Strand-asymmetry of high frequency indels was only evaluated on the indel polymorphisms found in the 200 bp kits (both Manual and OneTouch), and only for sites that had reads mapping in both orientations. For each site, the data form a 2×2 table, namely strand by presence/absence of indel. We therefore used Fisher's exact test to generate a p-value for every site, testing the null hypothesis that indel frequency at that site was not orientation specific. The p-values were then adjusted to control the family-wise error rate using Holm's method . Only sites with corrected p-values smaller than 5% were considered to demonstrate significant evidence of orientation specific indel frequencies at that site. Using this approach, the orientation specificity may differ by site. (The analysis was repeated using a chi-squared test in place of Fisher's exact test with similar results). Replicate bias To prevent reference end-effects and repeat-effects, we used repeat-masked data and ignored the first and last 120 bp of the reference genome. We visually identified whether the distribution of read starting position (5′ aligned position) on the reference genome was uniformly distributed for each individual run. G+C% bias in read coverage To evaluate whether there was a relationship between G+C content and read depth, we calculated the average coverage of bases within disjoint 100 bp windows across the genome, as well as G+C% also calculated for these windows. Areas expected to have high or low coverage for processing reasons were masked from the analysis, these included the first and last 120b of the reference genome, as well as genomic 100 bp bins that contained repetitive sequences. The coverage was normalised for each run by dividing the coverage in each window by the mean coverage across all windows for that run. A square-root transformation was applied to the run-normalised coverage. After initial inspection, we identified that a number of very large coverage values were the result of an un-masked LSU rRNA in the B. amyloliquefaciens genome. This small region was masked prior to G+C modeling. The relationship between the square-root normalised coverage and G+C% content was evaluated by fitting linear models using the lm function in the R statistical package. Error rates Base-error rates were calculated as the number of errors in the alignment, divided by the length of the alignment. This was to ensure that deletion errors, which are quite common on this platform, would be reflected by the error rate. Flow-error rates were calculated as the number of incorrect flow-calls divided by the total number of flow-calls. For specific break-downs (such as error rate for homopolymer of length X), it was the number of miscalled X-mer flows, divided by the total number of flows which were, or should have been, an X-mer, according to the reference genome. Modeling flow values Given RAM restrictions, a random subset of 18 million observations (flows) were sampled from all datasets as input to model fitting. Note that true zero calls and over-calls of a zero were not included in the model, as zero-flows were unlikely to be well-approximated by a Gaussian. The flow-values were then modeled as normally distributed, using a variety of read attributes (including chip, kit, machine, flow position, well x-coordinate, well y-coordinate, nucleotide, position in cycle (PIC), nucleotide, pyrimidine versus purine). As the flow-values for each homopolymer length did not share a constant variance, these needed to be modeled using a double generalised linear model (DGLM), which simultaneously models the mean and dispersion. In the DGLM used here, the mean was a Gaussian linear model and the dispersion was linear on a log-scale. Only terms with an effect size greater than 0.001 were retained in the model. While the PIC showed the strongest relationship with the flow error-rate, we considered the replacement of PIC with simpler terms, such as the nucleotide flowed or pyrimidine versus purine, however this was detrimental to the model. Based on these choices, a simpler model was created and assessed against the full-model for significance using ANOVA. Some terms removed were statistically significant (x-pos, y-pos, machine), however were practically unimportant contributing only a very small amount to the modeled flow value. We emphasise that the purpose of this statistical model is not to test for significant factors, but to produce usable predictions. Quality score analysis Each run was analysed individually to identify the accuracy of empirical quality scores. For each Ion Torrent quality score, we examined the error rate of bases assigned that score, and calculated the associated Phred score based on the error rate. To analyse the relationship between quality score and position within homopolymer, we sampled 20,000 reads from each run, and calculated the relative change in quality score from the first base to later bases within each homopolymer. We also recorded whether each of these consecutive bases was a correct call or an over-call. Quality trimming The PGM quality clip was extracted from the SFF file produced from Torrent Server 2.0.1. To emulate the PGM HRI trim approach, we calculated the percentage of HRI 1-mers and 2-mers out of the total 1-mer and 2-mer calls in the read, and continued clipping from the 3′ end until this percentage reached 3% or less (Torrent User Documentation 2.20). Our third HRI trim approach included HRI 3-mers (defined as flow-values in [2.50, 2.59] or [3.40, 3.49] in this calculation. As with the Torrent Server software, reads shorter than 4 bp after clipping were ignored. Graphs Graphics used throughout this manuscript were produced either using the R base package or ggplot2 package . Supporting Information Figure S1 Read length density functions for kit and species. Note that both One Touch kits produce a small peak around 100 bp longer than the mode length. (EPS) Click here for additional data file. Figure S2 Intersection of HFI sites across 200 bp runs for each species. Panel a) shows the overlap between HFI sites detected in B. amyloliquefaciens 200 bp runs, and panel b) shows the HFI site overlaps for S. tokodaii 200 bp runs, panel, c) shows the overlap between strand-asymmetric HFI sites detected in B. amyloliquefaciens 200 bp runs, and d) shows the overlap between strand-asymmetric HFI sites detected in S. tokodaii 200 bp runs. Red font highlights the HFI instances unique to a given run. (EPS) Click here for additional data file. Figure S3 Type and frequency of HFI by indel type. Panel a) shows the HFI instances across all 200 bp runs (both species), broken down by error type and nucleotide, and b) shows the strand-asymmetric HFI instances broken down by error type and nucleotide. (EPS) Click here for additional data file. Figure S4 Linear model fit diagnostics plots for G+C% versus coverage. Using a subset of data-points, these plots show the standard linear model diagnostics for the G+C versus coverage linear model. The data is not strictly normal as the response variable (coverage) is based on count data. The small number of zero coverage regions are the outliers in the ‘Residual versus Fitted’ plot, and the deviation from the normal quantiles in the ‘Normal Q-Q’ plot. Unmasked repetitive regions are the likely cause for outliers with high leverage ‘Residuals versus Leverage’ plot. (TIFF) Click here for additional data file. Figure S5 Smoothed normalised coverage versus proportion G+C content in 100 bp windows for three species. (EPS) Click here for additional data file. Figure S6 Number of errors found within reads. These plots exclude reads with more than 30 errors. (EPS) Click here for additional data file. Figure S7 Error-rate for insertions versus deletions by homopolymer length. Rows correspond to restriction on base position considered for error-rate (maximum base 100 versus maximum base 200) Columns correspond to the kit analysed. Observations are the error-rate by flow position for the respective homopolymer length. (EPS) Click here for additional data file. Figure S8 Relationship between flow-call error-rate and flow position. It is clear from these figures that the main effect is increasing with the number of flow cycles. The 316 chip has a higher error rate than the 314, and S. tokodaii is more error-prone than B. amyloliquefaciens. Periodicity is observed in the error rate with a peak and trough occurring regularly in each plot. (EPS) Click here for additional data file. Figure S9 Over-call rates for zero length homopolymer by position-in-cycle (PIC). Consistent with our DGLM based on calls for 1-mers or longer, we find that over-calls of zero occur in the PICs with a positive coefficient in the model. Most notable is position-in-cycle 12. (EPS) Click here for additional data file. Figure S10 Relationship between quality scores and position in homopolymer. The x-axis shows the base number in the homopolymer, the y-axis shows the relative change (qual(base 1) – qual(base x)/qual(base 1)) from the quality of the first base. (EPS) Click here for additional data file. Table S1 Coefficients for main and dispersion for position-in-cycle (PIC) effects in generalised linear model for flow-values. (DOCX) Click here for additional data file. Table S2 Sequencing runs in the full-factor design. (DOCX) Click here for additional data file. Table S3 Randomised Plackatt-Burman design for sequencing runs using OneTouch 100 kit and Ion Express Template kit. (DOCX) Click here for additional data file. Table S4 Randomised Plackatt-Burman design for sequencing runs using OneTouch 200 bp kit. (DOCX) Click here for additional data file.