4
views
0
recommends
+1 Recommend
0 collections
0
shares
• Record: found
• Abstract: found
• Article: found
Is Open Access

# Divide and conquer: A perspective on biochips for single-cell and rare-molecule analysis by next-generation sequencing

1 , 2 , 1,3 , 1,2,4,5 , b)

APL Bioengineering

AIP Publishing LLC

Bookmark
There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

### Abstract

Recent advances in biochip technologies that connect next-generation sequencing (NGS) to real-world problems have facilitated breakthroughs in science and medicine. Because biochip technologies are themselves used in sequencing technologies, the main strengths of biochips lie in their scalability and throughput. Through the advantages of biochips, NGS has facilitated groundbreaking scientific discoveries and technical breakthroughs in medicine. However, all current NGS platforms require nucleic acids to be prepared in a certain range of concentrations, making it difficult to analyze biological systems of interest. In particular, many of the most interesting questions in biology and medicine, including single-cell and rare-molecule analysis, require strategic preparation of biological samples in order to be answered. Answering these questions is important because each cell is different and exists in a complex biological system. Therefore, biochip platforms for single-cell or rare-molecule analyses by NGS, which allow convenient preparation of nucleic acids from biological systems, have been developed. Utilizing the advantages of miniaturizing reaction volumes of biological samples, biochip technologies have been applied to diverse fields, from single-cell analysis to liquid biopsy. From this perspective, here, we first review current state-of-the-art biochip technologies, divided into two broad categories: microfluidic- and micromanipulation-based methods. Then, we provide insights into how future biochip systems will aid some of the most important biological and medical applications that require NGS. Based on current and future biochip technologies, we envision that NGS will come ever closer to solving more real-world scientific and medical problems.

### Most cited references69

• Record: found
• Abstract: found

### Sequencing technologies - the next generation.

(2010)
Demand has never been greater for revolutionary technologies that deliver fast, inexpensive and accurate genome information. This challenge has catalysed the development of next-generation sequencing (NGS) technologies. The inexpensive production of large volumes of sequence data is the primary advantage over conventional methods. Here, I present a technical review of template preparation, sequencing and imaging, genome alignment and assembly approaches, and recent advances in current and near-term commercially available NGS instruments. I also outline the broad range of applications for NGS technologies, in addition to providing guidelines for platform selection to address biological questions of interest.
Bookmark
• Record: found
• Abstract: found

### Highly Parallel Genome-wide Expression Profiling of Individual Cells Using Nanoliter Droplets.

(2015)
Cells, the basic units of biological structure and function, vary broadly in type and state. Single-cell genomics can characterize cell identity and function, but limitations of ease and scale have prevented its broad application. Here we describe Drop-seq, a strategy for quickly profiling thousands of individual cells by separating them into nanoliter-sized aqueous droplets, associating a different barcode with each cell's RNAs, and sequencing them all together. Drop-seq analyzes mRNA transcripts from thousands of individual cells simultaneously while remembering transcripts' cell of origin. We analyzed transcriptomes from 44,808 mouse retinal cells and identified 39 transcriptionally distinct cell populations, creating a molecular atlas of gene expression for known retinal cell classes and novel candidate cell subtypes. Drop-seq will accelerate biological discovery by enabling routine transcriptional profiling at single-cell resolution. VIDEO ABSTRACT.
Bookmark
• Record: found
• Abstract: found

### Accurate Whole Human Genome Sequencing using Reversible Terminator Chemistry

(2008)
DNA sequence information underpins genetic research, enabling discoveries of important biological or medical benefit. Sequencing projects have traditionally employed long (400-800 bp) reads, but the existence of reference sequences for the human and many other genomes makes it possible to develop new, fast approaches to re-sequencing, whereby shorter reads are compared to a reference to identify intraspecies genetic variation. We report an approach that generates several billion bases of accurate nucleotide sequence per experiment at low cost. Single molecules of DNA are attached to a flat surface, amplified in situ and used as templates for synthetic sequencing with fluorescent reversible terminator deoxyribonucleotides. Images of the surface are analysed to generate high quality sequence. We demonstrate application of this approach to human genome sequencing on flow-sorted X chromosomes and then scale the approach to determine the genome sequence of a male Yoruba from Ibadan, Nigeria. We build an accurate consensus sequence from >30x average depth of paired 35-base reads. We characterise four million SNPs and four hundred thousand structural variants, many of which are previously unknown. Our approach is effective for accurate, rapid and economical whole genome re-sequencing and many other biomedical applications. DNA sequencing yields an unrivalled resource of genetic information. We can characterise individual genomes, transcriptional states and genetic variation in populations and disease. Until recently, the scope of sequencing projects was limited by the cost and throughput of Sanger sequencing. The raw data for the 3 billion base (3 gigabase, Gb) human genome sequence, completed in 20041, was generated over several years for ∼$300 million using several hundred capillary sequencers. More recently an individual human genome sequence has been determined for ∼$10 million by capillary sequencing2. Several new approaches at varying stages of development aim to increase sequencing throughput and reduce cost3-6. They increase parallelisation dramatically by imaging many DNA molecules simultaneously. One instrument run produces typically thousands or millions of sequences that are shorter than capillary reads. Another human genome sequence was recently determined using one of these approaches7. However, much bigger improvements are necessary to enable routine whole human genome sequencing in genetic research. We describe a massively parallel synthetic sequencing approach that transforms our ability to use DNA and RNA sequence information in biological systems. We demonstrate utility by re-sequencing an individual human genome to high accuracy. Our approach delivers data at very high throughput and low cost, and enables extraction of genetic information of high biological value, including single nucleotide polymorphisms (SNPs) and structural variants. DNA sequencing using reversible terminators and clonal single molecule arrays We generated high density single molecule arrays of genomic DNA fragments attached to the surface of the reaction chamber (the flowcell) and used isothermal ‘bridging’ amplification to form DNA ‘clusters’ from each fragment. We made the DNA in each cluster single-stranded and added a universal primer for sequencing. For paired read sequencing, we then converted the templates to double-stranded DNA and removed the original strands, leaving the complementary strand as template for the second sequencing reaction (fig 1a-c). To obtain paired reads separated by larger distances, we circularised DNA fragments of the required length (e.g. 2kb +/− 0.2kb) and obtained short junction fragments for paired end sequencing (fig 1d). We sequenced DNA templates by repeated cycles of polymerase-directed single base extension. To ensure base-by-base nucleotide incorporation in a stepwise manner, we used a set of four reversible terminators, 3′-O-azidomethyl 2′-deoxynucleoside triphosphates (A, C, G and T) each labelled with a different removable fluorophore (fig S1a)8. The use of 3′-modified nucleotides allowed the incorporation to be driven essentially to completion without risk of over-incorporation. It also enabled addition of all four nucleotides simultaneously rather than sequentially, minimising risk of mis-incorporation. We engineered the active site of 9°N DNA polymerase to improve the efficiency of incorporation of these unnatural nucleotides9. After each cycle of incorporation, we determined the identity of the inserted base by laser-induced excitation of the fluorophores and imaging. We added tris(2-carboxyethyl)phosphine (TCEP) to remove the fluorescent dye and side-arm from a linker attached to the base and simultaneously to regenerate a 3′ hydroxyl group ready for the next cycle of nucleotide addition (fig S1b). The Genome Analyzer (GA1) was designed to perform multiple cycles of sequencing chemistry and imaging to collect the sequence data automatically from each cluster on the surface of each lane of an 8-lane flowcell (fig S2). To determine the sequence from each cluster, we quantified the fluorescent signal from each cycle and applied a base-calling algorithm. We defined a quality (Q) value for each base call (scaled as by the phred algorithm10) that represents the likelihood of each call being correct (fig S3). We used the Q-values in subsequent analyses to weight the contribution of each base to sequence alignment and detection of sequence variants (e.g. SNP calling). We discarded all reads from mixed clusters and used the remaining ‘purity filtered’ (PF) reads for analysis. Typically we generated 1-2 billion bases (gigabases, Gb) of high quality PF sequence per flow cell from ∼60 million single 35-base reads, or 2-4 Gb in a paired read experiment (table S1). To demonstrate accurate sequencing of human DNA, we sequenced a human bacterial artificial chromosome (BAC) clone (bCX98J21) that contained 162,752 bp of the major histocompatibility complex on human chromosome 6 (accession AL662825.4, previously determined using capillary sequencing by the Wellcome Trust Sanger Institute). We developed a fast global alignment algorithm ELAND that aligns a read to the reference only if the read can be assigned a unique position with 0, 1 or 2 differences. We collected 0.17 Gb of aligned data for the BAC from one lane of a flowcell. Approximately 90% of the 35-base reads matched perfectly to the reference, demonstrating high raw read accuracy (fig S4). To examine consensus coverage and accuracy, we used 5 Mb of 35-base PF reads (30-fold average input depth of the BAC) and obtained 99.96% coverage of the reference. There was one consensus miscall, at a position of very low coverage (just above our cut-off threshold), yielding an overall consensus accuracy of >99.999%. Detecting genetic variation of the human X chromosome For an initial study of genetic variation, we sequenced flow-sorted X chromosomes of a Caucasian female (CEPH NA07340). We generated 278 million paired 30-35 bp PF reads and aligned them to the human genome reference sequence. We carried out separate analyses of the data using two alignment algorithms, ELAND (see above) or MAQ11. Both algorithms place each read pair where it best matches the reference and assign a confidence score to the alignment. In cases where a read has two or more equally likely positions (i.e. in an exact repeat), MAQ randomly assigns the read pair to one position and assigns a zero alignment quality score (these reads are excluded from SNP analysis). ELAND rejects all non-unique alignments, which are mostly in recently inserted retroposons (see fig S5). MAQ therefore provides an opportunity to assess the properties of a dataset aligned to the entire reference, whereas ELAND effectively excludes ambiguities from the short read alignment before further analysis. We obtained comprehensive coverage of the X chromosome from both analyses. With MAQ, 204 million reads aligned to 99.94% of the X chromosome at an average depth of 43x. With ELAND, 192 million reads covered 91% of the reference sequence, showing what can be covered by unique best alignments. These results were obtained after excluding reads aligning to non-X sequence (impurities of flow sorting) and apparently duplicated read pairs (table S2). We reasoned that these duplicates (∼10% of the total) arose during initial sample amplification. The sampling of sequence fragments from the X chromosome is close to random. This is evident from the distribution of mapped read depth in the MAQ alignment in regions where the reference is unique (fig 2a): the variance of this distribution is only 2.26 times that of a Poisson distribution (the theoretical minimum). Half of this excess variance can be accounted for by a dependence on GC content. However, the average mapped read depth only falls below 10x in regions with GC content less than 4% or greater than 76%, comprising in total just 1% of unique chromosome sequence and 3% of coding sequence (fig 2b). We identified 92,485 candidate SNPs in the X chromosome using ELAND (fig S6). Most calls (85%) match previous entries in the public database dbSNP. Heterozygosity (π) in this dataset is 4.3×10−4 (i.e. 1 substitution per 2.3 kb), close to a previously published X chromosome estimate (4.7×10−4)12. Using MAQ we obtained 104,567 SNPs, most of which were common to the results of the ELAND analysis. The differences between the two sets of SNP calls are largely the consequence of different properties of the alignments as described earlier. For example, most of the SNPs found only by the MAQ-based analysis were at positions of low or zero sequence depth in the ELAND alignment (fig S6c). We assessed accuracy and completeness of SNP calling by comparison to genotypes obtained for this individual using the Illumina HumanHap550 BeadChip (HM550). The sequence data covered >99.8% of the 13,604 genotyped positions and we found excellent agreement between sequence based SNP calls and genotyping data (99.52% or 99.99% using ELAND or MAQ, respectively)(table S3). There was complete concordance of all homozygous calls and a low level of ‘undercalling’ (denoted as ‘GT>Seq’ in table 1) at a small number of the heterozygous sites, caused by inadequate sampling of one of the two alleles. The depth of input sequence influences the coverage and accuracy of SNP calling. We found that reducing the read depth to 15x still gives 97% coverage of genotype positions and only 1.27% of the heterozygous sites are undercalled. We observed no other types of disagreement at any input depth (fig S7). We detected structural variants (defined as any variant other than a single base substitution) as follows. We found 9,747 short insertions/deletions (‘short indels’, defined here as less than the length of the read) by performing a gapped alignment of individual reads (fig S8). We identified larger indels based on read depth and/or anomalous read pair spacing, similar to previous approaches13-15. We detected 115 indels in total, 77 of which were visible from anomalous read pair spacing (see tables S4 and S5). We developed Resembl, an extension to the Ensembl browser16, to view all variants (fig S9; see also fig 4). Inversions can be detected when the orientation of one read in a pair is reversed (e.g. fig S10). In general, inversions occur as the result of non-allelic homologous recombination, and are therefore flanked by repetitive sequence that can compromise alignments. We found partial evidence for other inversion events, but characterisation of inversions from short read data is complex because of the repeats and requires further development. Sequencing and analysis of a whole human genome Our X chromosome study enabled us to develop an integrated set of methods for rapid sequencing and analysis of whole human genomes. We sequenced the genome of a male Yoruba from Ibadan, Nigeria (YRI; sample NA18507). This sample was originally collected for the HapMap project17,18 through a process of community engagement and informed consent19 and has also been studied in other projects20,21. We were therefore able to compare our results with publicly available data from the same sample. We constructed two libraries: one of short inserts (∼200 bp) with similar properties to the previous X chromosome library and one with long inserts (∼2 kb) to provide longer-range read pair information (see fig S11 for size distributions). We generated 135 Gb of sequence (∼4 billion paired 35-base reads; see table S6) over a period of 8 weeks (Dec'07–Jan'08) on six GA1 instruments averaging 3.3Gb per production run (see table S1 for example). The approximate consumables cost (based on full list price of reagents) was \$250,000. We aligned 97% of the reads using MAQ and found 99.9% of the human reference (NCBI build 36.1) was covered with one or more reads at an average of 40.6-fold depth. Using ELAND, we aligned 91% of the reads over 93% of the reference sequence at sufficient depth to call a strong consensus (>three Q30 bases). The distribution of mapped read depth was close to random, with slight overdispersion as seen for the X chromosome data. We observed comprehensive representation across a wide range of GC content, dropping only at the very extreme ends, but with a different pattern of distribution compared to the X (see fig S12). We identified ∼4 million SNPs, with 74% matching previous entries in dbSNP (fig 3). We found excellent agreement of our SNP calls with genotyping results: sequence-based SNP calls covered almost all of the 552,710 loci of HM550, with >99.5% concordance of sequencing vs. genotyping calls (tables 1 and S7a). The few disagreements were mostly undercalls of heterozygous positions (GT>Seq) in areas of low sequence depth, providing us with a false negative rate of 35 kb where there is sequence absent from NA18507 compared to the reference. We observed a steadily decreasing number of events of this type with increasing size, except for two peaks (fig S18). Most of the events represented by the large peak at 300-350 bp contain a sequence of the AluY family. This is consistent with insertion of SINEs that are present in the reference but missing from the genome of NA18507. Similarly, the second, smaller peak at 6-7kb is the consequence of insertion of L1Hs elements in many cases. We found good correspondence between our results and the data of Kidd et al.23, who reported 148 deletions of 90%). We established very low false positive and negative rates for the ∼4M SNPs detected (<1% overcalls and undercalls). This compares favourably with previous individual genome analyses which reported a 24% undercalling of heterozygous positions2,7. Paired reads were very powerful in all areas of the analysis. They provided very accurate read alignment and thus improved the accuracy and coverage of consensus sequence and SNP calling. They were essential for developing our short indel caller, and for detecting structural variants. Our short insert paired read dataset introduced a new level of resolution in structural variation detection, revealing thousands of variants in a size range not characterised previously. In some cases we determined the exact sequence of structural variants by de novo assembly from the same paired read dataset. Interpreting events that are embedded in repetitive sequence tracts will require further work. Massively parallel sequencing technology makes it feasible to consider whole human genome sequencing as a clinical tool in the near future. Characterising multiple individual genomes will enable us to unravel the complexities of human variation in cancer and other diseases and will pave the way for the use of personal genome sequences in medicine and healthcare. Accuracy of personal genetic information from sequence will be critical for life-changing decisions. In addition to the large scale genomic projects exemplified by the present study and others15,24-26, the system described here is being used to explore biological phenomena in unprecedented detail, including transcriptional activity, mechanisms of gene regulation and epigenetic modification of DNA and chromatin27-32. In the future, DNA sequencing will be the central tool for unravelling how genetic information is used in living processes. Methods Summary DNA and sequencing DNA samples (NA07340 and NA18507) and cell line (GM07340) were obtained from Coriell Repositories, Camden NJ. DNA samples were genotyped on the HM550 array and the results compared to publicly available data to confirm their identity before use. Methods for DNA manipulation, including sample preparation, formation of single molecule arrays, cluster growth and sequencing were all developed during this study and formed the basis for the standard protocols now available from Illumina, Inc. All sequencing was performed on Illumina GA1s equipped with a one-megapixel camera. All PF read data are available for download from the Short Read Archive at NCBI. Analysis software Image analysis software and the ELAND aligner are provided as part of the Genome Analyzer analysis software. SNP and structural variant detectors will be available as future upgrades of the analysis pipeline. The Resembl extension to Ensembl is available on request. The MAQ (Mapping and Assembly with Qualities) aligner is freely available for download from http://maq.sourceforge.net Data access Sequence data are freely available from the short read archive, accession SRA000271: ftp://ftp.ncbi.nih.gov/pub/TraceDB/ShortRead/SRA000271 Links to Resembl displays for X and human data, plus information on other available data are provided at http://www.illumina.com/iGenome A detailed Methods section can be found as part of the Supplementary Information. Supplementary Material Experimental Methods Figures S1-20 Figure S21 Tables S1-S9
Bookmark

### Author and article information

###### Journal
APL Bioeng
APL Bioeng
ABPID9
APL Bioengineering
AIP Publishing LLC
2473-2877
June 2019
25 June 2019
25 June 2019
: 3
: 2
###### Affiliations
[1 ]Interdisciplinary Program in Bioengineering, Seoul National University , Seoul 08826, South Korea
[2 ]Department of Electrical and Computer Engineering, Seoul National University , Seoul 08826, South Korea
[3 ]BK21+ Creative Research Engineer Development for IT , Seoul National University, Seoul 08826, South Korea
[4 ]Institutes of Entrepreneurial BioConvergence, Seoul National University , Seoul 08826, South Korea
[5 ]Seoul National University Hospital Biomedical Research Institute, Seoul National University Hospital , Seoul 03080, South Korea
###### Author notes
[a)]

Contributions: A. C. Lee and Y. Lee contributed equally to this work.

[b) ] Author to whom correspondence should be addressed: skwon@ 123456snu.ac.kr
###### Article
1.5095962 005902APB APB19-PS-00032R1
10.1063/1.5095962
6697027

2473-2877/2019/3(2)/020901/8

###### Page count
Pages: 8
###### Funding
Funded by: Brain Korea 21 Plus Project
Funded by: National Research Foundation of Korea http://doi.org/10.13039/501100003725
Award ID: 2015K1A4A3047345
###### Categories
Perspectives