DNA sequence information underpins genetic research, enabling discoveries of important
biological or medical benefit. Sequencing projects have traditionally employed long
(400-800 bp) reads, but the existence of reference sequences for the human and many
other genomes makes it possible to develop new, fast approaches to re-sequencing,
whereby shorter reads are compared to a reference to identify intraspecies genetic
variation. We report an approach that generates several billion bases of accurate
nucleotide sequence per experiment at low cost. Single molecules of DNA are attached
to a flat surface, amplified in situ and used as templates for synthetic sequencing
with fluorescent reversible terminator deoxyribonucleotides. Images of the surface
are analysed to generate high quality sequence. We demonstrate application of this
approach to human genome sequencing on flow-sorted X chromosomes and then scale the
approach to determine the genome sequence of a male Yoruba from Ibadan, Nigeria. We
build an accurate consensus sequence from >30x average depth of paired 35-base reads.
We characterise four million SNPs and four hundred thousand structural variants, many
of which are previously unknown. Our approach is effective for accurate, rapid and
economical whole genome re-sequencing and many other biomedical applications.
DNA sequencing yields an unrivalled resource of genetic information. We can characterise
individual genomes, transcriptional states and genetic variation in populations and
disease. Until recently, the scope of sequencing projects was limited by the cost
and throughput of Sanger sequencing. The raw data for the 3 billion base (3 gigabase,
Gb) human genome sequence, completed in 20041, was generated over several years for
∼$300 million using several hundred capillary sequencers. More recently an individual
human genome sequence has been determined for ∼$10 million by capillary sequencing2.
Several new approaches at varying stages of development aim to increase sequencing
throughput and reduce cost3-6. They increase parallelisation dramatically by imaging
many DNA molecules simultaneously. One instrument run produces typically thousands
or millions of sequences that are shorter than capillary reads. Another human genome
sequence was recently determined using one of these approaches7. However, much bigger
improvements are necessary to enable routine whole human genome sequencing in genetic
research.
We describe a massively parallel synthetic sequencing approach that transforms our
ability to use DNA and RNA sequence information in biological systems. We demonstrate
utility by re-sequencing an individual human genome to high accuracy. Our approach
delivers data at very high throughput and low cost, and enables extraction of genetic
information of high biological value, including single nucleotide polymorphisms (SNPs)
and structural variants.
DNA sequencing using reversible terminators and clonal single molecule arrays
We generated high density single molecule arrays of genomic DNA fragments attached
to the surface of the reaction chamber (the flowcell) and used isothermal ‘bridging’
amplification to form DNA ‘clusters’ from each fragment. We made the DNA in each cluster
single-stranded and added a universal primer for sequencing. For paired read sequencing,
we then converted the templates to double-stranded DNA and removed the original strands,
leaving the complementary strand as template for the second sequencing reaction (fig
1a-c). To obtain paired reads separated by larger distances, we circularised DNA fragments
of the required length (e.g. 2kb +/− 0.2kb) and obtained short junction fragments
for paired end sequencing (fig 1d).
We sequenced DNA templates by repeated cycles of polymerase-directed single base extension.
To ensure base-by-base nucleotide incorporation in a stepwise manner, we used a set
of four reversible terminators, 3′-O-azidomethyl 2′-deoxynucleoside triphosphates
(A, C, G and T) each labelled with a different removable fluorophore (fig S1a)8. The
use of 3′-modified nucleotides allowed the incorporation to be driven essentially
to completion without risk of over-incorporation. It also enabled addition of all
four nucleotides simultaneously rather than sequentially, minimising risk of mis-incorporation.
We engineered the active site of 9°N DNA polymerase to improve the efficiency of incorporation
of these unnatural nucleotides9. After each cycle of incorporation, we determined
the identity of the inserted base by laser-induced excitation of the fluorophores
and imaging. We added tris(2-carboxyethyl)phosphine (TCEP) to remove the fluorescent
dye and side-arm from a linker attached to the base and simultaneously to regenerate
a 3′ hydroxyl group ready for the next cycle of nucleotide addition (fig S1b). The
Genome Analyzer (GA1) was designed to perform multiple cycles of sequencing chemistry
and imaging to collect the sequence data automatically from each cluster on the surface
of each lane of an 8-lane flowcell (fig S2).
To determine the sequence from each cluster, we quantified the fluorescent signal
from each cycle and applied a base-calling algorithm. We defined a quality (Q) value
for each base call (scaled as by the phred algorithm10) that represents the likelihood
of each call being correct (fig S3). We used the Q-values in subsequent analyses to
weight the contribution of each base to sequence alignment and detection of sequence
variants (e.g. SNP calling). We discarded all reads from mixed clusters and used the
remaining ‘purity filtered’ (PF) reads for analysis. Typically we generated 1-2 billion
bases (gigabases, Gb) of high quality PF sequence per flow cell from ∼60 million single
35-base reads, or 2-4 Gb in a paired read experiment (table S1).
To demonstrate accurate sequencing of human DNA, we sequenced a human bacterial artificial
chromosome (BAC) clone (bCX98J21) that contained 162,752 bp of the major histocompatibility
complex on human chromosome 6 (accession AL662825.4, previously determined using capillary
sequencing by the Wellcome Trust Sanger Institute). We developed a fast global alignment
algorithm ELAND that aligns a read to the reference only if the read can be assigned
a unique position with 0, 1 or 2 differences. We collected 0.17 Gb of aligned data
for the BAC from one lane of a flowcell. Approximately 90% of the 35-base reads matched
perfectly to the reference, demonstrating high raw read accuracy (fig S4). To examine
consensus coverage and accuracy, we used 5 Mb of 35-base PF reads (30-fold average
input depth of the BAC) and obtained 99.96% coverage of the reference. There was one
consensus miscall, at a position of very low coverage (just above our cut-off threshold),
yielding an overall consensus accuracy of >99.999%.
Detecting genetic variation of the human X chromosome
For an initial study of genetic variation, we sequenced flow-sorted X chromosomes
of a Caucasian female (CEPH NA07340). We generated 278 million paired 30-35 bp PF
reads and aligned them to the human genome reference sequence. We carried out separate
analyses of the data using two alignment algorithms, ELAND (see above) or MAQ11. Both
algorithms place each read pair where it best matches the reference and assign a confidence
score to the alignment. In cases where a read has two or more equally likely positions
(i.e. in an exact repeat), MAQ randomly assigns the read pair to one position and
assigns a zero alignment quality score (these reads are excluded from SNP analysis).
ELAND rejects all non-unique alignments, which are mostly in recently inserted retroposons
(see fig S5). MAQ therefore provides an opportunity to assess the properties of a
dataset aligned to the entire reference, whereas ELAND effectively excludes ambiguities
from the short read alignment before further analysis.
We obtained comprehensive coverage of the X chromosome from both analyses. With MAQ,
204 million reads aligned to 99.94% of the X chromosome at an average depth of 43x.
With ELAND, 192 million reads covered 91% of the reference sequence, showing what
can be covered by unique best alignments. These results were obtained after excluding
reads aligning to non-X sequence (impurities of flow sorting) and apparently duplicated
read pairs (table S2). We reasoned that these duplicates (∼10% of the total) arose
during initial sample amplification.
The sampling of sequence fragments from the X chromosome is close to random. This
is evident from the distribution of mapped read depth in the MAQ alignment in regions
where the reference is unique (fig 2a): the variance of this distribution is only
2.26 times that of a Poisson distribution (the theoretical minimum). Half of this
excess variance can be accounted for by a dependence on GC content. However, the average
mapped read depth only falls below 10x in regions with GC content less than 4% or
greater than 76%, comprising in total just 1% of unique chromosome sequence and 3%
of coding sequence (fig 2b).
We identified 92,485 candidate SNPs in the X chromosome using ELAND (fig S6). Most
calls (85%) match previous entries in the public database dbSNP. Heterozygosity (π)
in this dataset is 4.3×10−4 (i.e. 1 substitution per 2.3 kb), close to a previously
published X chromosome estimate (4.7×10−4)12. Using MAQ we obtained 104,567 SNPs,
most of which were common to the results of the ELAND analysis. The differences between
the two sets of SNP calls are largely the consequence of different properties of the
alignments as described earlier. For example, most of the SNPs found only by the MAQ-based
analysis were at positions of low or zero sequence depth in the ELAND alignment (fig
S6c).
We assessed accuracy and completeness of SNP calling by comparison to genotypes obtained
for this individual using the Illumina HumanHap550 BeadChip (HM550). The sequence
data covered >99.8% of the 13,604 genotyped positions and we found excellent agreement
between sequence based SNP calls and genotyping data (99.52% or 99.99% using ELAND
or MAQ, respectively)(table S3). There was complete concordance of all homozygous
calls and a low level of ‘undercalling’ (denoted as ‘GT>Seq’ in table 1) at a small
number of the heterozygous sites, caused by inadequate sampling of one of the two
alleles. The depth of input sequence influences the coverage and accuracy of SNP calling.
We found that reducing the read depth to 15x still gives 97% coverage of genotype
positions and only 1.27% of the heterozygous sites are undercalled. We observed no
other types of disagreement at any input depth (fig S7).
We detected structural variants (defined as any variant other than a single base substitution)
as follows. We found 9,747 short insertions/deletions (‘short indels’, defined here
as less than the length of the read) by performing a gapped alignment of individual
reads (fig S8). We identified larger indels based on read depth and/or anomalous read
pair spacing, similar to previous approaches13-15. We detected 115 indels in total,
77 of which were visible from anomalous read pair spacing (see tables S4 and S5).
We developed Resembl, an extension to the Ensembl browser16, to view all variants
(fig S9; see also fig 4). Inversions can be detected when the orientation of one read
in a pair is reversed (e.g. fig S10). In general, inversions occur as the result of
non-allelic homologous recombination, and are therefore flanked by repetitive sequence
that can compromise alignments. We found partial evidence for other inversion events,
but characterisation of inversions from short read data is complex because of the
repeats and requires further development.
Sequencing and analysis of a whole human genome
Our X chromosome study enabled us to develop an integrated set of methods for rapid
sequencing and analysis of whole human genomes. We sequenced the genome of a male
Yoruba from Ibadan, Nigeria (YRI; sample NA18507). This sample was originally collected
for the HapMap project17,18 through a process of community engagement and informed
consent19 and has also been studied in other projects20,21. We were therefore able
to compare our results with publicly available data from the same sample. We constructed
two libraries: one of short inserts (∼200 bp) with similar properties to the previous
X chromosome library and one with long inserts (∼2 kb) to provide longer-range read
pair information (see fig S11 for size distributions). We generated 135 Gb of sequence
(∼4 billion paired 35-base reads; see table S6) over a period of 8 weeks (Dec'07–Jan'08)
on six GA1 instruments averaging 3.3Gb per production run (see table S1 for example).
The approximate consumables cost (based on full list price of reagents) was $250,000.
We aligned 97% of the reads using MAQ and found 99.9% of the human reference (NCBI
build 36.1) was covered with one or more reads at an average of 40.6-fold depth. Using
ELAND, we aligned 91% of the reads over 93% of the reference sequence at sufficient
depth to call a strong consensus (>three Q30 bases). The distribution of mapped read
depth was close to random, with slight overdispersion as seen for the X chromosome
data. We observed comprehensive representation across a wide range of GC content,
dropping only at the very extreme ends, but with a different pattern of distribution
compared to the X (see fig S12).
We identified ∼4 million SNPs, with 74% matching previous entries in dbSNP (fig 3).
We found excellent agreement of our SNP calls with genotyping results: sequence-based
SNP calls covered almost all of the 552,710 loci of HM550, with >99.5% concordance
of sequencing vs. genotyping calls (tables 1 and S7a). The few disagreements were
mostly undercalls of heterozygous positions (GT>Seq) in areas of low sequence depth,
providing us with a false negative rate of <0.35% from the ELAND analysis - see table
1). The other disagreements (0.09% of all genotypes) included errors in genotyping
plus apparent tri-allelic SNPs (table S7). The main cause of genotype error (0.05%
of all genotypes) is the existence of a second ‘hidden’ SNP close to the assayed locus
that disrupts the genotyping assay, leading to loss of one allele and an erroneous
homozygous genotype (figs S13, S14).
To examine the accuracy of SNP calling in more detail, we compared our sequence-based
SNP calls with 3.7M genotypes (HM-All) generated for this sample during the HapMap
project (table 1 and S7b)20 and found excellent concordance. Disagreements included
sequence-based undercalls of heterozygous positions in regions of low read depth.
The slightly higher level of other disagreements (0.76%) seen in this analysis compared
to that of the HM550 data (0.09%) is in line with the higher level of underlying genotype
error rate of 0.7% for the HapMap data20. To refine this analysis further, we generated
a set of 530,750 very high confidence reference genotypes comprising concordant calls
in both the HM550 and HM-All genotype datasets. Comparing the results of the MAQ analysis
to this high confidence set (see table 1), we found 130 heterozygote undercalls (i.e.
a false negative rate of 0.025%). There were also 130 heterozygote overcalls, but
most of these are likely genotype errors as 82 have a nearby ‘hidden’ SNP and 3 have
a nearby indel. A further 41 are tri-allelic loci, leaving at most 4 potential wrong
calls by sequencing (i.e. false positive rate of 4/529,589 positions). Finally we
selected a subset of novel SNP calls from the sequence data and tested them by genotyping.
We found 96.1% agreement between sequence and genotype calls (table S8). However,
the 47 disagreements included 10 correct sequencing calls (genotyping undercalls due
to hidden SNPs) and 7 sequencing undercalls. On this basis, therefore, the false positive
discovery rate for the 1M novel SNPs is 2.4% (30/1206). For the entire dataset of
4M SNPs detected in this analysis, the false positive and negative rates both average
<1%.
This Yoruba genome contains significantly more polymorphism than a genome of European
descent. The autosomal heterozygosity (π) of NA18507 is 9.94 × 10−4 (1 SNP per 1006
bp), higher than previous values for Caucasians (7.6 × 10−4, ref
12
). Heterozygosity in the pseudoautosomal region 1 (PAR1) was substantially higher
(1.92 × 10−3) than the autosomal value. PAR1 (2.7Mb) at the tip of the short arm of
X and Y undergoes obligatory recombination in male meiosis, which is equivalent to
20x the autosome average. This illustrates clearly the correlation between recombination
and nucleotide diversity. By contrast, the 0.33 Mb PAR2 region has a much lower recombination
rate than PAR1; we observed that heterozygosity in PAR2 is identical to that of the
autosomes in NA18507. Heterozygosity in coding regions is lower (0.54 × 10−3) than
the total autosome average, consistent with the model that some coding changes are
deleterious and are lost as the result of natural selection22. Nevertheless, the 26,140
coding SNPs (fig S15) include 5,361 non-conservative amino acid substitutions plus
153 premature termination codons (table S9), many of which are expected to affect
protein function.
We performed a genome-wide survey of structural variation in this individual and found
excellent correlation with variants that had been reported in previous studies, as
well as detecting many new variants. We found 0.4 million short indels (1-16 bp) (fig
S16), most of which are length polymorphisms in homopolymeric tracts of A or T. Half
of these events are corroborated by entries in dbSNP, and 95 of 100 examined were
present in amplicons sequenced from this individual in ENCODE regions, confirming
the high specificity of this method of short indel detection. For larger structural
variants (detected by anomalously spaced paired ends) we found that some were detected
by both long and short insert datasets (fig S17a), but the majority were unique to
one or other dataset. We observed two reasons for this: first, small events (<400
bp) are within the normal size variance of the long insert data; second, nearby repetitive
structures can prevent unique alignment of read pairs (see fig S17b, c). In some cases,
the high resolution of the short insert data permits detection of additional complexity
in a structural rearrangement that is not revealed by the long insert data. For example
where the long insert data indicates a 1.3kb deletion in NA18507 relative to the reference,
the short insert data reveal an inversion accompanied by deletions at both breakpoints
(fig 4). We carried out de novo assembly of reads in this region and constructed a
single contig that defines the exact structure of the rearrangement (data not shown).
We discovered 5,704 structural variants ranging from 50 bp to >35 kb where there is
sequence absent from NA18507 compared to the reference. We observed a steadily decreasing
number of events of this type with increasing size, except for two peaks (fig S18).
Most of the events represented by the large peak at 300-350 bp contain a sequence
of the AluY family. This is consistent with insertion of SINEs that are present in
the reference but missing from the genome of NA18507. Similarly, the second, smaller
peak at 6-7kb is the consequence of insertion of L1Hs elements in many cases. We found
good correspondence between our results and the data of Kidd et al.23, who reported
148 deletions of <100 kb in this individual on the basis of abnormal fosmid paired
end spacing. We found supporting evidence for 111 of these events. We detected a further
2,345 indels in the range 60-160 bp which are sequences present in NA18507 and absent
from the reference (fig S19). One example is shown in fig S20. The ‘singleton’ reads
on either side of the event, which have partners that do not align to the reference,
form part of a de novo assembly that precisely defines the novel sequence and breakpoint
(fig S21).
Effect of sequence depth on coverage and accuracy
We investigated the impact of varying input read depth (and hence cost) on SNP calling
using chromosome 2 as a model. SNP discovery increases with increasing depth: essentially
all homozygous positions are detected at 15x, whereas heterozygous positions accumulate
more gradually to 33x (fig 5a). This effect is influenced by the stringency of the
SNP caller. To call each allele in this analysis we required the equivalent of two
high quality Q30 bases (as opposed to three used in full depth analyses). Homozygotes
could be detected at read depth of 2x or higher, whereas heterozygote detection required
at least double this depth for sampling of both alleles. Missing calls (not covered
by sequence) and discordances between sequence based SNP calls and genotype loci (mostly
undercalls of heterozygotes due to low depth) progressively reduced with increasing
depth (fig 5b). We observed very few other types of discordance at any depth; and
many of these are genotyping errors as described above.
Concluding remarks
Reversible terminator chemistry is a defining feature of this sequencing approach,
enabling each cycle to be driven to completion while minimising mis-incorporation.
The result is a system that generates accurate data at very high throughput and low
cost. We determined an accurate whole human genome sequence in eight weeks to an average
depth of ∼40x. We built a consensus sequence, optimised methods for analysis, assessed
accuracy and characterised the genetic variation of this individual in detail.
We assessed accuracy relative to genotype data over the entire fraction of the human
sequence where SNP calling was possible (>90%). We established very low false positive
and negative rates for the ∼4M SNPs detected (<1% overcalls and undercalls). This
compares favourably with previous individual genome analyses which reported a 24%
undercalling of heterozygous positions2,7.
Paired reads were very powerful in all areas of the analysis. They provided very accurate
read alignment and thus improved the accuracy and coverage of consensus sequence and
SNP calling. They were essential for developing our short indel caller, and for detecting
structural variants. Our short insert paired read dataset introduced a new level of
resolution in structural variation detection, revealing thousands of variants in a
size range not characterised previously. In some cases we determined the exact sequence
of structural variants by de novo assembly from the same paired read dataset. Interpreting
events that are embedded in repetitive sequence tracts will require further work.
Massively parallel sequencing technology makes it feasible to consider whole human
genome sequencing as a clinical tool in the near future. Characterising multiple individual
genomes will enable us to unravel the complexities of human variation in cancer and
other diseases and will pave the way for the use of personal genome sequences in medicine
and healthcare. Accuracy of personal genetic information from sequence will be critical
for life-changing decisions.
In addition to the large scale genomic projects exemplified by the present study and
others15,24-26, the system described here is being used to explore biological phenomena
in unprecedented detail, including transcriptional activity, mechanisms of gene regulation
and epigenetic modification of DNA and chromatin27-32. In the future, DNA sequencing
will be the central tool for unravelling how genetic information is used in living
processes.
Methods Summary
DNA and sequencing
DNA samples (NA07340 and NA18507) and cell line (GM07340) were obtained from Coriell
Repositories, Camden NJ. DNA samples were genotyped on the HM550 array and the results
compared to publicly available data to confirm their identity before use. Methods
for DNA manipulation, including sample preparation, formation of single molecule arrays,
cluster growth and sequencing were all developed during this study and formed the
basis for the standard protocols now available from Illumina, Inc. All sequencing
was performed on Illumina GA1s equipped with a one-megapixel camera. All PF read data
are available for download from the Short Read Archive at NCBI.
Analysis software
Image analysis software and the ELAND aligner are provided as part of the Genome Analyzer
analysis software. SNP and structural variant detectors will be available as future
upgrades of the analysis pipeline. The Resembl extension to Ensembl is available on
request. The MAQ (Mapping and Assembly with Qualities) aligner is freely available
for download from http://maq.sourceforge.net
Data access
Sequence data are freely available from the short read archive, accession SRA000271:
ftp://ftp.ncbi.nih.gov/pub/TraceDB/ShortRead/SRA000271 Links to Resembl displays for
X and human data, plus information on other available data are provided at http://www.illumina.com/iGenome
A detailed Methods section can be found as part of the Supplementary Information.
Supplementary Material
Experimental Methods
Figures S1-20
Figure S21
Tables S1-S9