Haplotyping germline and cancer genomes using high-throughput linked-read sequencing

There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

Abstract

Haplotyping of human chromosomes is a prerequisite for cataloguing the full repertoire of genetic variation. We present a microfluidics-based linked-read sequencing technology that can phase and haplotype germline and cancer genomes using nanograms of input DNA. This high-throughput platform prepares barcoded libraries for short read sequencing and computationally reconstructs long range haplotype and structural variant information. We generate haplotype blocks in a nuclear trio that are concordant with expected inheritance patterns and phased a set of structural variants. We also resolve the structure of the EML4/ALK fusion in the NCI-H2228 cancer cell line using phased exome sequencing. Finally, we assign genetic aberrations to specific megabase-scale haplotypes generated from whole genome sequencing of a primary colorectal adenocarcinoma. This approach resolves haplotype information using up to 100 times less genomic DNA than some existing methods and enables the accurate detection of structural variants. The human genome is diploid, with each cell containing a copy of both the maternal and paternal chromosomes. A comprehensive understanding of human genetic variation requires identifying the order, structure, and origin of these sets of alleles and their variants across the genome 1 . Haplotypes, the contiguous phased blocks of genomic variants specific to one homologue or another, are essential to such an analysis. Genome-scale haplotype analysis has many advantages for improving genetic studies. Phasing of germline variants can be used to identify causative mutations in pedigrees, determine the structure of genomic rearrangement events and unravel cis-versus trans-relationships of ostensibly linked variants. In the case of cancer genomes, phasing studies provide insight into the haplotype context of somatic mutations, genomic rearrangements critical for oncogenesis and aneuploidy-state of chromosomes 2 . Traditionally, haplotype analysis has relied on statistical inference using genomic sequences of related individuals 3 . More recently, experimental sequencing approaches have been used 1, 2, 4–9 . Next generation sequencing (NGS) methods for determining haplotype either require sample dilution into a limited number of wells in 96 or 384 format microtiter plates for sequencing library preparation 6–9 or subcloning into fosmids 1, 2 . In dilution haplotyping, the low molarity of molecules per given partition reduces the likelihood that one DNA molecule has overlapping sequence with another and thus haplotypes can be derived. However, existing dilution methods require complicated experimental protocols, high amounts of starting DNA, extensive manipulation of DNA and microwell plates with restricted partitioning capacity 6–9 . Recently, experimentally phased genome assemblies have also been demonstrated with long-read technology 10 , but high input DNA requirements, lower sequence read accuracy, and cost burdens create substantial barriers for its widespread adoption. Here we develop a microfluidic technology that circumvents these complex experimental protocols and generates haplotype-resolved genome sequences using small amounts of input DNA. Approximately 300 genomic equivalents, or 1 ng of high molecular weight (HMW) genomic DNA, is distributed across over100,000 droplet partitions, where it is barcoded and subjected to random priming and polymerase amplification. Barcode-tagged DNA molecules are released from each droplet, after which a modified library preparation process is performed. The resulting libraries then undergo standard Illumina short-read sequencing. A computational algorithm uses the barcodes to link sequencing reads to the originating HMW DNA molecule, enabling the construction of contiguous segments of phased variants. Loading nanogram amounts of DNA into large number of gel beads minimizes the chance of coincidental barcode overlap and improves overall phasing performance. Furthermore, the increased number of barcodes per amount of genomic DNA improves detection of structural variants. Using exome enrichment, we determine gene-level phasing in a nuclear family trio and define the complex structure of an oncogenic rearrangement in a cancer cell line. Finally, we generate a phased genomic analysis of a primary colorectal cancer originating from clinical tissue sample. RESULTS Microfluidics and gel bead partitioning This high-throughput, droplet-based reagent delivery system uses hydrogel beads (gel beads) to deliver barcoded oligonucleotides. Within the confines of a microfluidic chip, monodispersed gel beads are deformable and can be delivered with a single gel bead fill rate of over 85% per droplet 11 . An individual gel bead is functionalized with millions of copies of the same barcoded oligonucleotide. Droplet partitions number more than 100,000 and this feature exceeds what can be achieved with microwell systems or virtual partitioning 9 by orders of magnitude. Reagent delivery and sample partitioning are performed within a plastic microfluidic consumable cartridge, which processes eight samples simultaneously. Cartridge reservoirs are loaded with gel beads, sample and reagent mixture and oil-surfactant solution. Reagents are delivered from the reservoirs via a network of microfluidic channels to a microfluidic “double-cross” junction (Fig. 1a). The first junction combines a close-packed aqueous slurry of gel beads with the sample and reagent mixture, and the second junction delivers the oil-surfactant solution. Droplet generation occurs at a rate of ~1 kHz at the second junction, resulting in more than 100,000 droplets loaded with greater than 85% single occupancy per droplet partition. The droplets flow to a collection reservoir where they are subsequently transferred to a conventional 96-well plate. A chemical reducing agent in the reaction mix dissolves the gel beads, triggering the release of the barcoded oligonucleotides from the gel matrix. As a result, independent solution-phase reactions are conducted without further addition of reagents. In our study, 300 genomic equivalents (1 nanogram) are dispersed into >100,000 droplet partitions with different barcodes; thus, only a small number of genomic equivalents are loaded per partition. As we report later, it is highly unlikely that two distinct high-molecular weight molecules that cover the same loci but of opposing haplotype will share the same barcode (p<0.002). The use of fewer barcodes would require even smaller amounts of input DNA – equivalent to only a few genome equivalents – to maintain low rates of intermolecular genomic overlap within the same barcoded partition. Such low amounts of input DNA would lead to issues with significant allelic dropout and coverage gaps throughout the genome. Barcode sequencing To create barcoded DNA molecules for sequencing, we perform an optimized droplet-based assay that introduces a barcode-containing sequencing adapter into new fragments (Methods). High molecular weight DNA templates, ranging from ten to several hundred kb in size, are randomly distributed in picoliter reaction volumes across greater than 100,000 droplets. Within an individual droplet, gel bead dissolution releases the amplification primer into the partitioned solution. The primer contains the following components: i) an Illumina P5 sequence, ii) a designed 14bp barcode, iii) an Illumina R1 sequence and iv) a 10bp random primer sequence (Supplementary Fig. 1). Amplification is non-processive, and the length of molecules produced range from several hundred to several kilobases. After thermocycling the droplets, the emulsion is broken, the pooled aqueous fractions are recovered and the library preparation is completed with the addition of other adapter components. Additional details of library preparation and metrics of DNA loading are noted in the online Methods. Whole genome sequencing with linked-reads To assess the performance of linked-read sequencing for haplotyping, we relied on three HapMap samples that form a nuclear trio: NA12878 (mother), NA12877 (father) and NA12882 (child). These samples have been experimentally phased and have haplotypes statistically derived from inheritance patterns 9, 12 . We generated barcode sequencing libraries from this trio. Approximately 1 ng of HMW DNA from each sample was processed on a 10× GemCode instrument and sequencing libraries were prepared. The barcode libraries underwent WGS to approximately 30× mean coverage. We used an Illumina Hiseq 2500 with 2×98 paired-end reads (Supplementary Table 1) for all sequencing analysis in this report. Reads were trimmed to remove the first 10 bp of each paired read to remove primer extension artifacts, aligned to the human reference genome (hg19) with BWA 13 , and PCR duplicates were marked utilizing the barcode information and alignment position (Methods, Supplementary Fig. 1). Overall, library quality was exceptionally high; greater than 95% of reads were mapped, more than 90% of the genome was covered at least once, with most gaps shorter than 50 kb (Table 1, Supplementary Table 1, Supplementary Fig. 2). PCR duplication-rates were minimal at less than 1.4% (Table 1). Library insert size averaged around 200 bp, and the number of binding events was estimated to be around 45,000 within each partition (Supplementary Table 2, Methods). Barcodes were uniformly distributed over 100,000 partitions based on an effective number of barcodes adjusted for unevenness of barcode counts (Methods, Table 1, Supplementary Fig. 2). The GC distribution is provided in Supplementary Figure 2d and shows the predictable effect of the non-processive amplification step. Some droplets may coincidentally receive identical barcodes; performance for haplotyping is maintained by loading only a small amount of genomic DNA into each droplet partition. Conversely, significantly increasing DNA input increases the coincidental loading of overlapping DNA molecules to the same droplet and thus, barcode; the overlap among DNA templates obscures haplotype analysis. As a metric to gauge coincidental loading, we calculated the relative genome loading per a droplet partition. From nuclear trio WGS data, the relative genomic loading was less than 0.002 genome equivalents per partition (Table 1). Thus, these loading issues are minimized given the low molarity distribution of DNA per a droplet. A feature of our analysis involves the concept of linked-reads; this term refers to sequences with the same barcode and determined to be in physical proximity based on alignment (Fig. 1b, Methods). This type of data enables the inference of input DNA length, phasing of haplotype blocks, and identification of DNA molecules with a new structural change (for example, genomic fusion) compared to the reference. The mean linked-reads per DNA molecule was about 15 for the trio libraries. With linked-reads, we determined that these libraries had an inferred length-weighted mean DNA molecule length of at least 40 kb, with maximum length reaching up to 200 kb (Fig. 2a). This is consistent with the distribution of input DNA detected by gel electrophoresis (Supplementary Fig. 2b). Using low genomic fractions per barcode achieves a high number of linked reads per an individual DNA molecule. We load approximately 3 Mb of genomic DNA per partition. In comparison, the CPT-Seq phasing method averages about 21 to 62 Mb genomic DNA per partition 9 . Phasing variants becomes substantially more robust; the high number of barcode partitions in combination with the low number of haploid genome equivalents reduces the probability of a partition having two high-molecular weight molecules overlapping the same genomic loci but of opposing haplotype. As a result, we achieve excellent phasing performance, require low amounts of genomic DNA and rely on a standard WGS coverage to achieve larger haplotypes as described later. Further decreasing the amount of genomic DNA per partition will accordingly increase the linked reads per molecule. However, the variance in the amount of genomic DNA loaded per partition substantially increases, resulting in allelic dropout and uneven coverage. We have empirically observed that our approach is robust within two-fold of the optimal DNA loading amount (data not shown). Without using standard WGS and after SNV calling 14 , our method can achieve a 0.93 SNV calling sensitivity (which is a measure of true positive rate) and 0.99 PPV (positive predictive value, which is a measure of precision) at 30× sequencing of NA12878. In contrast, a TruSeq library at 0.99 PPV has a SNV calling sensitivity of 0.98. We compared the coverage distribution between barcode libraries and Illumina TruSeq libraries (Supplementary Fig. 3). Overall, 90% of all genomic bases are covered by both samples. We note that the genome coverage of GemCode libraries is broader and biased against GC-rich regions (Supplementary Figure 2d). However, the metrics observed with standard TruSeq libraries require substantially higher input DNA; Illumina TruSeq libraries made from 1ng of input generally suffers from high proportions of PCR duplicates. Experimentally derived haplotypes from control genomes We utilized linked-reads to phase single nucleotide variants (SNVs) that were previously annotated (Table 1, Supplementary Table 4) across all three samples in the nuclear trio. We used a maximum likelihood approach to find near-optimal local haplotype assignments based on read and barcode support (Methods, Supplementary Note 1). A phasing score (PQ) is assigned to each SNV by comparing the likelihood of the observed data in both phasing states of the variant (Methods). As an indicator of haplotyping accuracy, we determined the N50 value from our linked-read analysis. This phasing metric reflects the haplotype block interval where larger blocks represent 50% of the overall phased genome sequence 1, 9 . We identified N50 phase block lengths ranging from 0.9 Mb to 2.8 Mb, with over 97% of SNVs being phased (Table 1, Fig. 2b). Long switch errors occur when a variant position is incorrectly phased in the context of the adjacent flanking variants 9, 12 . We used this well-established metric as another indicator of the genome-wide accuracy of our haplotyping analysis (Methods). The overall long switch error rate was less than 0.03% (Table 1) for the nuclear trio when compared to phasing information generated via trio sequencing 15 . As an additional indicator of phasing accuracy, we examined the distance between all pairwise SNVs per a given haplotype, and the probability that these pairs were accurately assigned. Using this metric, phasing accuracy remained above 95% out to an interval distance greater than 0.5 Mb (Fig. 2c). We investigated the impact of lower sequencing coverage on phasing performance. We determined the haplotypes of NA12878 using down-sampled proportions of the linked-read data. At approximately 30 Gb of sequence (~10× average coverage), 93% of SNVs were phased, the N50 phase block length was 1.1 Mb, and long switch error rates were less than 0.03% (Supplementary Table 3). Phasing performance decreases in regions with small number of heterozygous variants. As shown in Figure 2c, we demonstrate that the probability of correctly phasing SNVs scales with the pairwise distance between variants. We examined the phasing scores at low-density (5 SNVs per 100kb) heterozygous SNV positions derived from pre-called gold-standard variants of NA12878 15 . We observed only 1,407 variants that fit this criterion, corresponding to less than 1% of all the heterozygous SNVs. From this set of sparsely distributed SNVs, 94% were confidently phased past the threshold score of PQ>23. In addition, over 60% of the low-density variants were phased with a maximal phasing score of 255 – this represents exceptionally high phasing quality even for genomic regions with a sparse distribution of SNVs. As a reference, we have also examined all heterozygous SNVs in NA12878; 99% of them were phased accurately and exceeded the threshold score of PQ>23 and 93% were phased with a maximal, highest quality phasing score of 255. We can potentially increase the phasing performance in low-density heterozygous regions if we increase input molecule length, increase sequencing depth, or decrease the amount of genomes loaded per partition. As another comparison, we analyzed the genome of the Hapmap subject NA20847, an individual of Gujarati Indian descent. This genome had been sequenced and haplotyped using a fosmid approach 1 . Library quality and phasing performance were similar to results obtained on the nuclear trio (Table 1). More than 98% of SNVs were phased, with N50 phase block size of ~2.5 Mb. We calculated a long switch error rate of less than 0.09% when compared to fosmid-based haplotyping study. Discovery of phased germline structural variants (SVs) With traditional short read sequencing approaches, the discovery of SVs is computationally difficult, particularly in highly repetitive regions of the genome. We used linked-read data to call breakpoints in large-scale SVs and assign them to specific haplotypes. We developed an algorithm that efficiently searches all pairs of genomic loci for regions of large numbers of overlapping barcodes. We identify non-adjacent candidate positions of structural variation (Fig. 3a, b, Supplementary Fig. 4). Each breakpoint is assigned a Phred-like quality score, which describes the likelihood of the breakpoint being called by chance (Methods, Supplementary Note 2). Linked-read data provides sequencing information spanning the region around a breakpoint up to 10s of kb or higher. Thus, linked-read analysis can differentiate between a putative breakpoint created by a true SV and a false positive confounded by repetitive sequences (Fig. 3b). As an initial test we examined eight large genomic deletion candidates with sizes greater than 70 kb as previously identified and validated in NA12878 16–18 . Five were highly ranked by our prediction algorithm, while three had lower scores (Fig. 3c). Barcode count analysis showed that the five high-scoring breakpoints were also consistent with a loss-of-heterozygosity (LOH) as would be expected for a deletion (Supplementary Fig. 5). We phased these deletions by overlapping the barcodes supporting a breakpoint with barcodes from neighboring haplotype blocks (Fig. 3b). Counting the haplotype-specific barcodes provides another type of score for additional vetting of the putative deletion. Five of these candidate deletions had flanking haplotype blocks that cover both sides of the deletion (Fig. 3c). In contrast, the breakpoints of two putative candidates with a low score could not be phased. As additional evidence of the accuracy of our haplotype deletion calls, we examined the consistency of haplotype blocks with Mendelian inheritance for NA12878, the mother and NA12882, the child in the nuclear trio. Among the five deletions with the highest ranked score, three were inherited in the child (Fig. 3c, Supplementary Table 5a,b). For the highest ranking cases, when the analysis of the child showed the deletion, it also inherited the related haplotype; when the child did not inherit the deletion, the other maternal haplotype was inherited (Fig. 3c, Supplementary Table 6). We validated these breakpoints with targeted sequencing 19, 20 . For this study, we used targeting probes that tile across the genomic intervals where the two breakpoints are created by a deletion (Methods, Supplementary Table 7, Supplementary Fig. 6). Targeted reads were examined for evidence of a breakpoint junction and deletion. In general, the chimeric junctions of the high scoring deletion breakpoints were validated and showed the appropriate inheritance pattern (Supplementary Fig. 6, Supplementary Table 6). In contrast, two of the lower scoring deletion candidates could not be confirmed, one of which is in a VDJ recombination region (Supplementary Table 6). One of the lower scoring candidates did have targeted sequencing results pointing to a deletion that was seen among all three individuals. Linked-read data reveals other types of structural rearrangements besides deletions. Overall, our SV algorithm called 20 structural variants in NA12878, of which 11 were identified as deletions in a recent de novo assembly 10 , two were reported as inversions 16 , and one was identified as a retro-transposon insertion 21 (Supplementary Fig. 4, Supplementary Table 8, Methods). To assess the sensitivity of our method, we compared our calls to a set of deletions identified from both conventional short-read sequencing 18 and long-read-assisted assembly 10 . There were only 3 such deletions, all of which were detected and validated (Fig. 3c, Methods). Exome-based phasing Generating barcode exome libraries is straightforward and enables the haplotyping of genes with linked-reads (Methods, Fig. 1b). Using approximately 1 ng of DNA from NA12878, NA12877 and NA12882, barcode libraries underwent exome enrichment with Agilent SureSelect kits and were sequenced to a depth of greater than 185× (Table 1). After data pre-processing and alignment, we observed that over 99% of the linked-reads aligned to the human genome reference, greater than 57% of the bases were on-target and the data had a PCR duplication rate up to 18% at 450× (Table 1). Our algorithm phased more than 95% of genes under 100 kb and we observed N50 phase block lengths greater than 103 kb for the nuclear trio (Table 1, Supplementary Fig. 2e, f). In addition, haplotype blocks were consistent with Mendelian inheritance across the trio (Fig. 2d). Detection of an EML4/ALK rearrangement via exome phasing SVs such as cancer rearrangements frequently occur in intronic sequences rather than exons and can lead to chimeric gene products. Exome sequencing does not detect gene fusions for which the breakpoint is more than a few hundred base pairs from an exon without custom targeting assays and extremely high sequencing coverage 22, 23 . To overcome these issues, we used exome linked-reads to detect a clinically actionable cancer rearrangement. The lung cancer cell line NCI-H2228 contains an EML4/ALK fusion 24, 25 in which exons 1–6 of EML4 are fused to exons 20–29 of ALK. This rearrangement leads to constitutive activation of ALK 26 , an oncogenic kinase driver. We prepared a barcode sequencing library from approximately 1ng of genomic DNA, conducted exome enrichment, then sequenced to an average sequencing coverage of 204× after duplication filtering (Table 1). We correctly identified an EML4/ALK fusion (Fig. 4a–d, Supplementary Fig. 7a, b, Supplementary Table 9); our exome linked-read data showed that the rearrangement occurs between exons 20–26 of ALK and exons 2–6 of EML4 (Fig. 4a), consistent with previous reports and our own validation (Supplementary Fig. 7). A simple inversion would predict corresponding overlap between exon 19 of ALK with exon 7 of EML4 (Fig. 4e). Our results showed overlap of exon 1 of ALK and exon 7 of EML4 (Fig. 4b), suggesting a deletion of exons 2–19 of ALK and a more complex structure than a simple inversion. In addition, we identified an additional insertion of ALK exons 10–11 in the gene PTPN3 on chromosome 9 (Fig. 4c, Supplementary Fig. 7c, d, Supplementary Table 9) as has been previously reported 27 . Based on these results for this cell line, we inferred a refined structure of the overall structural rearrangement (Fig. 4e) covering the ALK deletion, EML4/ALK inversion, and insertion of exons 10–11 of ALK into PTPN3. Exons 20–29 of ALK are contained within a 220 kb phase block; only one haplotype overlaps with the EML4/ALK fusion. Similarly, exons 3–4 of PTPN3 are contained with a 40 kb phase block and there is a distinct segregation of the ALK insertion into only one haplotype of the PTPN3 gene (Fig. 4f). The rearrangement structure was separately verified with linked-reads whole genome sequencing (Supplementary Table 1, Supplementary Fig. 7c, d). Analysis of the barcode counts in the WGS data (Fig. 4d, f) revealed a coverage reduction consistent with a deletion in the region covering exons 2–19 of ALK. Haplotype analysis of a primary colon adenocarcinoma Whole genome haplotype analysis has been reported for the HeLa tumor cell line 2 , but the phasing analysis of primary tumor genome has remained difficult. Unlike cancer cell lines that are more homogeneous in their genetic composition, primary tumors derived from clinical samples pose a number of challenges. The amount of DNA from clinical samples is often limited and the cellular composition frequently includes normal stromal tissue. As a first time demonstration of a phasing analysis for a primary cancer genome, we analyzed a primary colon adenocarcinoma and identified three classes of genetic aberrations in the context of Mb-scale haplotypes: i) mutations; ii) copy number variants (CNVs); iii) rearrangements. First, we performed short read WGS on both the tumor and matched normal pair to an average sequencing coverage of 50× followed by variant calling (Supplementary Table 1). For both samples, we used ~2 million heterozygous SNVs determined from the short reads for subsequent phasing (Supplementary Table 4, Methods). For the phased analysis of a primary cancer genome, we used ~1ng genomic DNA from the tumor and normal sample as direct input, without other preparative steps, for generating a barcode sequencing library. This tumor had 70% purity. We sequenced the barcode libraries to an average coverage of ~30×. After linked-read alignment, we determined that the genomic DNA had an inferred DNA length distribution with mean >50 kb (Fig. 5a). Over 94% of SNVs were phased with N50 phase block lengths of ~1.5 Mb for the normal sample and ~0.9 Mb for the tumor (Table 1, Fig. 5b). We generated haplotype blocks for both samples and examined the haplotype intersection intervals. From this intersection, 90% of the bases were shared between the two samples. Seventeen deleterious cancer mutations were identified per CADD scores 28 and assigned to specific haplotype blocks (Supplementary Table 10). A number of the mutations occurred in known colorectal cancer drivers such as TP53 and NRAS 29 . Using the linked-read analysis, we identified five rearrangements. Two are inter-chromosomal translocations (Supplementary Table 11). The short read WGS data provided validation of these events; the breakpoints of the five structural variants were confirmed by BreakDancer 30 predictions and supported by reads covering the breakpoints (Supplementary Table 11). Analysis of the short read WGS for copy number alteration algorithm 31 , identified 26 copy number variant (CNV) intervals of which 24 were validated by counting the average number of barcodes and linked-reads (Supplementary Table 12). There were two CNV intervals (short-read analysis) not validated by barcode counting and occurred in centromeric and telomeric regions that had lower linked-read coverage. We used linked-read analysis and haplotype blocks to explore the context and structural alterations affecting a critical driver mutation in TP53, namely a candidate deleterious non-synonymous R213Q mutation (Supplementary Table 10). Phasing analysis demonstrated that the R213Q mutation is within a 46 kb phase block on the haplotype 2 allele (Fig. 5c). Traditional short read WGS analysis indicated an LOH event represented by a hemizygous deletion in the p-arm of chromosome 17 (Fig. 5d). Barcode count analysis with linked-reads confirmed this observation and once considering the mutation allelic fraction, the corrected copy number at that region is 1. The LOH results from an extensive genomic deletion overlapping the TP53 mutation (Fig. 5e). The phased SNV frequencies in the haplotype 1 allele are reduced in the tumor compared to the normal, indicating that LOH in the tumor sample is associated with the loss of the haplotype 1 allele (Fig. 5f). Thus, the TP53 R213Q mutation is in trans with the deleted allele haplotype. As a result, the tumor contains only a single, inactivated copy of TP53. Taken together, this result shows the unambiguous biallelic inactivation of TP53 32, 33 . DISCUSSION We demonstrate the use of a high-throughput microfluidics technology to construct phased sequencing libraries from nanogram inputs of high molecular weight DNA. By using gel beads as the barcode delivery reagent, we demonstrate the robust loading of a single barcode to a microdroplet partition. This technology addresses issues that affect current experimental phasing approaches such as problems of Poisson-distributed barcode loading and limited partitioning with microwells. This is the first study that demonstrates a droplet-based system for whole genome phasing and structural variant analysis. In addition to phasing and structural variant calling, linked-reads can potentially also be applied to de novo genome assembly, remapping of difficult regions of the genome, detection of rare alleles, and elucidating complex structural rearrangements. Several studies have recently demonstrated high-throughput barcoding of droplet partitions 34–36 for single-cell RNA-Seq and analysis of short bacterial 16S sequences. These other approaches use individual barcodes ranging up into the millions that are introduced into a specific partition. However, none of these droplet applications generate megabase-scale haplotypes from whole genome sequencing. As noted previously, there are a number of other genome sequencing approaches used for phasing 1, 2, 5–9, 37, 38 and an overview is listed in Supplementary Table 13. Only one of these approaches uses droplets, and this method does not involve sequencing but rather relies on digital PCR counting methods to assess a single-plex candidate locus 38 . To assess performance, we conducted a phased genome analysis on several well-defined genomes. With this technology, we phased over 95% of SNVs in all samples with N50 phase block sizes ranging from 0.8 Mb to 2.8 Mb, at a low switch error rate of less than 0.001. This phasing performance was achieved using existing variant datasets. We show that linked-read data can be used to phase de novo variants, although more coverage will be required to achieve parity with standard library preparation methods due to coverage biases against GC-rich regions (Supplementary Fig. 2d). Statistical inference of haplotypes from genomic intervals dominated by similar heterozygous variants among family members is an issue that experimental phasing overcomes. For example, in the NA12878 nuclear trio, about 10% of the total number of SNVs in the child are inherited from such regions with common genotypes 39 . Our technology is compatible with standard downstream NGS assays, such as exome enrichment, as barcode information is introduced as the first step in the library preparation process. With the nuclear trio samples, over 95% of genes less than 100 kb were phased using this phased exome sequencing approach, which enables the economical use of phased analysis on large numbers of samples. We used phasing and read barcode counts to identify structural variation such as large genomic deletions and rearrangements that were independently validated by multiple methods. Using exome linked-reads, we delineated the complex rearrangements such as the EML4/ALK inversion in the case of the NCI-H2228 cell line. In addition, we showed that linked-read phasing of structural variants distinguish true SVs from false predictions. We also used this approach to phase a cancer genome derived from a primary tumor. The combination of somatic mutations, haplotype blocks and barcode counting identified the trans-relationship between a mutation in TP53 and a chromosome 17 p-arm loss in colon adenocarcinoma. We additionally generated haplotypes incorporating other critical genetic aberrations such as copy number alterations and rearrangements. We anticipate that phased cancer genomes will provide new insight into the underlying genomic structural alterations underlying tumor development and maintenance. The identification of potentially pathogenic mutations and structural variants remains a challenge and linked-read sequencing provides a unique opportunity to improve our understanding of diseases such as cancer. METHODS Genomic DNA samples The Institutional Review Board (IRB) at Stanford University School of Medicine approved the study. Informed consent was obtained and the samples were made available from the Stanford Cancer Center Tissue Bank. This study used a primary colorectal adenocarcinoma and matched normal tissue that were collected at time of surgical resection and flash frozen. Both samples had genomic DNA extracted with the E.Z.N.A. SQ DNA/RNA Protein Kit (Omega Bio-Tek). The genomic DNA did not require further size selection or processing. We quantified the DNA with Life Technologies Qubit. For the commercially acquired genomic DNA, we size selected DNA molecules 20 kb or higher using the BluePippin (Sage Science) (NA12877 and NA12882 from Coriell, and NCI-H2228 from ATCC). In addition, we harvested immortalized human lymphocyte cells (GM12878 and GM20847 from Coriell) and genomic DNA was extracted using the Gentra Puregene Cell kit (Qiagen). Sequencing library construction using the GemCode platform A GemCode Instrument (10× Genomics) was used for sample preparation. The high-throughput nature of the platform allows construction of 8 sequencing libraries by a single person in a day. Sample indexing and partition barcoded libraries were prepared using a beta version of the GemCode Gel Bead and Library Kit (10× Genomics, Pleasanton, CA). One nanogram of sample DNA was used for GEM reactions where DNA molecules were partitioned into droplets to amplify the DNA and introduce 14-bp partition barcodes. With 1ng genomic DNA of 50 kb molecule length, there are ~100 molecules per droplet. GEM reactions were thermal cycled (95°C for 5 min; cycled 18×: 4°C for 30 sec, 45°C for 1 sec, 70°C for 20 sec, and 98°C for 30 sec; held at 4°C). After amplification, the droplets were fractured and the library intermediate DNA purified using the 10× Genomics protocol. The DNA was subsequently sheared to either 250bp or 500bp using a Covaris M220 system (Supplementary Table 2) to construct sample-indexed libraries using 10× Genomics adaptors. The barcode sequencing libraries were quantified by qPCR (KAPA Biosystems Library Quantification Kit for Illumina platforms). Sequencing was conducted with an Illumina Hiseq2500 with 2×98 paired-end reads based on the manufacturer’s protocols. To compare barcode libraries against standard short read libraries, we prepared a TruSeq library (Illumina) following manufacturer’s protocols, using 100ng of DNA. Both barcode and TruSeq libraries used GM12878 genomic DNA. Each library was sequenced to ~30× coverage. At 30× coverage, the coverage of molecule in each droplet is 0.1×, and the number of linked-reads per molecule is around 15. Five micrograms of each barcode library was used for exome capture (Agilent SureSelect Human All Exon V5+UTRs) with the Agilent SureSelect Target Enrichment System (Agilent Technologies, Santa Clara, CA) supplemented with modified blocking oligonucleotides for Illumina Dual Indexing (TS HT i5 and TS HT i7) from IDT. Captured libraries were quantified by qPCR (KAPA Biosystems). Again, sequencing was conducted with an Illumina Hiseq2500 with 2×98 paired-end reads based on the manufacturer’s protocols. Alignment, barcode assignment and calculation of sequencing metrics The GemCode analysis software was used for processing the sequenced data from barcode libraries. Fastq files from Illumina sequencing reads were trimmed (removing the first 10nt of all reads) and aligned to the human genome (hg19) using bwa (mem algorithm, version 0.7.10-r789). Barcodes were incorporated into the read information in the bam file and only reads associated with valid barcodes were considered for alignment and downstream analysis. For visualization and some analysis, the barcode counts were calculated using non-overlapping window size of 100 kb, over all positions. Only uniquely mapped, non-duplicated reads with mapping quality (MAPQ) of 60 are considered. Reads were sorted by position using samtools (Version 0.1.19-96b5f2294a). PCR duplicates were marked if two sets of read-pairs shared both identical aligned genomic position and an identical associated barcode sequence. Linked-reads were inferred by clustering reads from the same barcode on the genome, and their boundaries were set by two nearest reads more than 50 kb apart. The phrase, “barcodes correctly assigned” is the fraction of barcodes matching a known barcode. “Relative genomic loading per partition” was calculated as the fraction of the amount of DNA in a partition relative to the size of the human genome. The number of binding events is estimated as the product of binding density and genome loaded per partition. For a uniform distribution of barcode frequencies, the probability of drawing two identical barcodes is p = Σ i frequency(BCi ) * frequency(BCi ) = N/N 2 = 1/N where N is the number of unique barcodes. Thus, effective barcode diversity, which accounts for a non-uniform distribution of barcode frequencies, can be calculated as: Effective barcode diversity = 1 ∑ i frequency ( B C i ) ∗ frequency ( B C i ) . where BCi = i-th barcode. To perform the variant calling analysis, we used Freebayes to call variants on 10× and Truseq libraries, down-sampling each library to 10×, 20× and 30× coverage. Then sensitivity and PPV of SNVs were evaluated against ground-truth variants published by Cleary et. al 15 . Phasing linked-reads See Supplementary Note 1. Structural variant calling from linked-read data See Supplementary Note 2. Phasing of structural variants Phasing of large-scale variants used the final probabilistic assignment of barcodes to haplotype blocks calculated as part of the phasing code. For each haplotype block within a 30 kb window of each of the two breakpoints defining a structural variant, barcodes supporting the structural variant call were assigned to one of the two haplotypes for that haplotype block. For each haplotype block, the counts of barcodes assigned to each of the two haplotypes were used to calculate a p-value under the two-tailed binomial test. Phase calls were made on a structural variant when the p-value was < 0.01. Validation of genomic deletions with targeted sequencing We validated a series of genomic deletions using targeted sequencing 20 . The methods are fully described by Hopmans et al. 19 . For this validation study, we relied on targeting assays that uses target-specific primer probes that hybridize to the target DNA molecule 20 . Afterwards, a polymerase extension captures the specific genomic target sequence. Previously, we demonstrated the utility of this method for confirming SVs, even in the context of genomic mixtures where a candidate rearrangement is present in only a fraction of the sample 19 . As a result of random fragmentation of genomic DNA in the library preparation, breakpoints of structural variants will be randomly distributed within a subset of the sequencing reads. For this assay, we designed multiple primer probe sequences flanking each putative breakpoint associated with a structural variant candidate. This targeting method is generally successful at selecting sequences up to 1 kb if not further away from the primer probe. The primer probe sequences chosen were on both the forward and reverse strands surrounding both sides of a target putative breakpoint within a distance of 0.75 kb (Supplementary Fig. 6). Reads captured by primer probes upstream from the breakpoints should cross on the reverse strand; reads captured by primer probes downstream from the breakpoints should cross on the forward strand. For the eight candidate deletions that were validated, we designed and synthesized 163 primer-probe oligonucleotides (Supplementary Table 7). Generally, all of these oligonucleotides were unique in terms of their representation in the genome. The only exception was for 15 probes intended to validate a deletion in chromosome 5 (position 99,400,335 – 99,713,992). This deletion occurs in an area of the genome that is highly repetitive, so only two of these 15 primer probes contain a 20mer that aligns uniquely to the human genome with no single-mismatch alignments. Single-end alignment using bwa (mem algorithm, version 0.7.10) was performed on the individual reads from the mate-pairs. The targeting primer sequence is included in read 2 and used as an index for a given target segment. The captured sequence is in read 1 and were indexed based on the read 2 targeting primer. The read 1 sequences that completely aligned to the human genome were excluded. The remaining read 1 sequences were evaluated for evidence of breakpoint and counted. The reads that had breakpoints were concatenated to create a breakpoint sequence. Reads crossing breakpoints were generated by finding reads that included a soft-clipped section such that the aligning portion preceded or followed the breakpoint; soft-clipped reads that also included soft-clipping on the non-breakpoint side were excluded. Using this read set we counted 20mers that contained a chimeric junction containing sequence on both sides of the breakpoint candidate. Evaluation of structural variant calls in NA12878 To assess the false discovery rate of our SV calling algorithm, we compared our structural variant calls in NA12878 against a recent de-novo assembly using genomic DNA from this individual 10 . We obtained a list of assembly-based deletion and insertion calls in NA12878 from the Genome in a Bottle website (ftp://ftp-trace.ncbi.nih.gov/giab/ftp/technical/NA12878_PacBio_MtSinai). We then constructed two deletion datasets: (a) a “confident” set containing deletions that were marked as “passing” by the study. These were deletions that were called by 3 or more out of the 7 methods used in that paper; (b) a “relaxed” set containing all deletions detected by at least one computational method in the de-novo assembly data. We focused on deletion calls in the following comparison because 1) deletion calls are much easier to compare across datasets; 2) we omitted insertions from the above two sets because our algorithm is not designed to detect gaps in the reference genome. Out of the 20 calls that were made in NA12878 via linked-reads, 40% and 55% respectively matched those from the confident and relaxed Pendleton datasets to within 20 kb. Besides deletion calls, our SV algorithm can detect other types of structural rearrangement. Indeed, two of our calls matched inversions reported in the literature 16 . One additional call is a retro-transposon insertion that has been found in Caucasian individuals 21 . Although the three calls were not explicitly called by Pendleton et. al., they were supported by long sequence reads (i.e. Pacific Biosciences sequencer) obtained from the same assembly work 10 . Altogether, this increases the percentage of the validated calls against the de-novo assembly to 70%. We have included a comparison of the calls in Supplementary Table 8. RT-PCR validation of EML4-ALK fusion RT-PCR was used to confirm the EML4-ALK and ALK-PTPN3 fusions in NCI-H2228 cancer cell line. We used the Cells-to-CT 1-Step Power SYBR Green Kit (Life Technologies) according to the manufacturer’s recommendations. The gene ACTB was assayed using SYBR Green Kit Control Kit (Life Technologies). As a negative control, NA12878 cells were assayed in parallel. Briefly, ~7500 cells were lysed and treated with DNase I in a total of 55 ul. Two ul of lysate was used for a 20 ul PCR reaction. The PCR products were visualized using the BioAnalyzer High Sensitivity DNA Kit (Agilent), with respectively the amplicons ACTB diluted 1:50, EML4/ALK 1:20, and ALK/PTPN3 diluted 1:3. The primers for the EML4/ALK amplicon are (F) 5′-GCATAAAGATGTCATCATCAACCAAG; (R) 5′-CGGAGCTTGCTCAGCTTGTA. The PCR primers for ALK/PTPN3 are: (F) 5′-TGGCTGCAGATGGTCGCATGG; (R) 5′-AGTCCACGGAGTCGTCATCAT. Cancer whole genome sequencing with short reads and data processing Whole genome libraries were made per the manufacturer’s protocol (Illumina). Sequencing libraries underwent cluster-generation on an Illumina cBot using paired end flowcells and Illumina TruSeq chemistry and sequenced at Illumina with the HiSeq 2500 for 2×100 cycle reads with indexing. Sequence reads were aligned to the human genome version hg19 using bwa 13 . The Genome Analysis Toolkit (GATK) 14 was used to determine overall sequencing coverage and variant calls. Cancer genome somatic mutation calling for coding mutations The whole genome sequence data was aligned using bwa 0.7.5 13 aln and sampe with default parameters against NCBI human genome build 37. Data was sorted and duplicate marked using Picard’s AddOrReplaceReadGroups and MarkDuplicates functions respectively. Picard version 1.63 was used in all steps. The files were merged in the GATK 14 RealignerTargetCreator step. This step and the IndelRealigner step were used to realign locally; IndelRealigner referred to dbSNP version 135. The BaseRecalibrator function used CycleCovariate and ContextCovariate as covariates and referred to dbSNP 135. At this point the realigned bam file of Patient 1532’s data was split up to allow for easier processing. GATK PrintReads was run on realigned bam files with the appropriate recalibration data table to produce recalibrated bam files. The GATK UnifiedGenotyper was then run with the parameters --dbsnp dbsnp_135.b37.vcf --max_alternate_alleles 11. These raw calls were then recombined. The GATK VariantRecalibrator was run on the raw VCF data, using the hapmap, omni, and dbsnp resources with standard priors and using HaplotypeScore, MQRankSum, ReadPosRankSum, FS, MQ and DP as filter elements. Finally, the ApplyRecalibration step was used to determine whether calls received a PASS value or not. Variants were called using GATK version 2.6–4. After variants were called, all SNV positions where the tumor and normal calls differed were submitted to CADD annotation 28 . SNVs were then filtered to require a somatic variant (positions where the normal tissue shows no variant and the tumor does or the normal tissue is heterozygous and the tumor has a homozygous variant) in a coding region with coverage depth >= 10 in both samples and a CADD phred score greater than or equal to 25 (Supplementary Table 10). Sequencing coverage was assessed with the GATK DepthOfCoverage tool at depths of 10, 20 and 30 (Supplementary Table 1). SNVs were then extracted from the phased VCF files and their phasing status was assessed. The tumor haplotype is based on the haplotype of the first SNV in the local normal phase block: that haplotype is always arbitrarily assumed to be 1. The normal and tumor haplotypes are then set to be congruent to one another by comparing positions heterozygote in both samples. In case the normal region is not phased, the tumor haplotype is assumed to be 2. If the tumor SNV is a homozygote while the normal is a heterozygote, the haplotype is assigned to the wild-type haplotype of the normal. Cancer genome allelic imbalance analysis For assessment of loss-of-heterozygosity (LOH) events, our analysis relied on minor allelic frequency (MAF) data. The MAF is a ratio comparison of allelic read depths from heterozygous SNVs identified from the normal genome compared to the same position from the tumor. The input file is a VCF containing the normal and tumor reads. The calls are filtered to require a genotype quality (GQ) of 30 or greater in both normal and tumor at that position, an overall read depth of 10 or greater, and a minor allele depth of at least 3 in the normal genome. The allele depth ratio is calculated as the minor allele count divided by the major allele count. The MAF value is determined as follows: we divide the tumor allele depth ratio by the normal depth ratio and taking the log2 of the quotient. For graphic display, we used a smoothed MAF value based on a window average of 100 contiguous SNVs from each genome. Cancer genome copy number and structural variant analysis To determine somatic copy number alterations and the affected genomic intervals from whole genome sequencing data, we used the SeqCBS method 31 . The software implementation is available as an open-source R package named SeqCBS (http://cran.r-project.org). The CNV analysis used an R script that reads a configuration file listing the sequence data sets to be compared, namely the case (tumor) versus the control (normal). The algorithm then performs the segmentation on these two files, compares them, and produces both local and whole-chromosome CNV plots. For any such region, there is a general test statistic and a relative gain or loss copy number value. Generally, we required a test statistic > 1,000 as a basic cutoff and a copy number value of greater than 2.5 or less than 1.6 as our thresholds for marking an event as a significant amplification or loss. We validated these calls with linked reads by counting the average number of barcodes-annotated reads over 50 kb window spanning across the length of each candidate. To validate SV calls made by the GemCode software analysis of linked-reads we examined sequence data from the short read WGS dataset. We used BreakDancer 30 with default setting to generate a set of SV candidates and then identified putative locations as predicted by the phased SV call set and associated quality score. In addition, we identified soft-clipped reads in the vicinity of the breakpoints, which are indicative of a structural variant breakpoint. Afterwards, we tabulated the number of reads directly supporting the breakpoint. Soft-clipped reads were manually curated in IGV to verify base quality, and were individually aligned in BLAT to verify the breakpoint locations. Supplementary Material 1 2 3

Related collections

Most cited references 23

Record: found
Abstract: not found
Article: not found

A genetic model for colorectal tumorigenesis.

E Fearon, B Vogelstein (1990)

0 comments Cited 1016 times – based on 0 reviews      Review now

Bookmark

Record: found
Abstract: found
Article: not found

Mapping and sequencing of structural variation from eight human genomes.

Jeffrey M. Kidd, Gregory M. Cooper, William F Donahue … (2008)

Genetic variation among individual humans occurs on many different scales, ranging from gross alterations in the human karyotype to single nucleotide changes. Here we explore variation on an intermediate scale--particularly insertions, deletions and inversions affecting from a few thousand to a few million base pairs. We employed a clone-based method to interrogate this intermediate structural variation in eight individuals of diverse geographic ancestry. Our analysis provides a comprehensive overview of the normal pattern of structural variation present in these genomes, refining the location of 1,695 structural variants. We find that 50% were seen in more than one individual and that nearly half lay outside regions of the genome previously described as structurally variant. We discover 525 new insertion sequences that are not present in the human reference genome and show that many of these are variable in copy number between individuals. Complete sequencing of 261 structural variants reveals considerable locus complexity and provides insights into the different mutational processes that have shaped the human genome. These data provide the first high-resolution sequence map of human structural variation--a standard for genotyping platforms and a prelude to future individual genome sequencing projects.

0 comments Cited 365 times – based on 0 reviews      Review now

Bookmark

Record: found
Abstract: found
Article: not found

EML4-ALK fusion gene and efficacy of an ALK kinase inhibitor in lung cancer.

Jussi P. Koivunen, Craig H. Mermel, Kreshnik Zejnullahu … (2008)

The EML4-ALK fusion gene has been detected in approximately 7% of Japanese non-small cell lung cancers (NSCLC). We determined the frequency of EML4-ALK in Caucasian NSCLC and in NSCLC cell lines. We also determined whether TAE684, a specific ALK kinase inhibitor, would inhibit the growth of EML4-ALK-containing cell lines in vitro and in vivo. We screened 305 primary NSCLC [both U.S. (n = 138) and Korean (n = 167) patients] and 83 NSCLC cell lines using reverse transcription-PCR and by exon array analyses. We evaluated the efficacy of TAE684 against NSCLC cell lines in vitro and in vivo. We detected four different variants, including two novel variants, of EML4-ALK using reverse transcription-PCR in 8 of 305 tumors (3%) and 3 of 83 (3.6%) NSCLC cell lines. All EML4-ALK-containing tumors and cell lines were adenocarcinomas. EML4-ALK was detected more frequently in NSCLC patients who were never or light (<10 pack-years) cigarette smokers compared with current/former smokers (6% versus 1%; P = 0.049). TAE684 inhibited the growth of one of three (H3122) EML4-ALK-containing cell lines in vitro and in vivo, inhibited Akt phosphorylation, and caused apoptosis. In another EML4-ALK cell line, DFCI032, TAE684 was ineffective due to coactivation of epidermal growth factor receptor and ERBB2. The combination of TAE684 and CL-387,785 (epidermal growth factor receptor/ERBB2 kinase inhibitor) inhibited growth and Akt phosphorylation and led to apoptosis in the DFCI032 cell line. EML4-ALK is found in the minority of NSCLC. ALK kinase inhibitors alone or in combination may nevertheless be clinically effective treatments for NSCLC patients whose tumors contain EML4-ALK.

0 comments Cited 273 times – based on 0 reviews      Review now

Bookmark

All references

Author and article information

Journal

Journal ID (nlm-journal-id): 9604648

Journal ID (pubmed-jr-id): 20305

Journal ID (nlm-ta): Nat Biotechnol

Journal ID (iso-abbrev): Nat. Biotechnol.

Title: Nature biotechnology

ISSN (Print): 1087-0156

ISSN (Electronic): 1546-1696

Publication date Nihms-submitted: 18 November 2015

Publication date (Electronic): 01 February 2016

Publication date (Print): March 2016

Publication date PMC-release: 01 August 2016

Volume: 34

Issue: 3

Pages: 303-311

Affiliations

[1 ]10× Genomics, Pleasanton CA, United States

[2 ]Stanford Genome Technology Center, Stanford University, Palo Alto, CA, United States

[3 ]Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA, United States

Author notes

Corresponding Authors: Hanlee P. Ji, genomics_ji@ 123456stanford.edu . Benjamin J. Hindson, ben@ 12345610xgenomics.com

[†]

These authors contributed equally to this work.

Article

Manuscript ID: NIHMS738063

DOI: 10.1038/nbt.3432

PMC ID: 4786454

PubMed ID: 26829319

SO-VID: 99d66f9c-0809-4adc-9cc8-5961949b5393

License:

Users may view, print, copy, and download text and data-mine the content in such documents, for the purposes of academic research, subject always to the full Conditions of use: http://www.nature.com/authors/editorial_policies/license.html#terms

Haplotyping germline and cancer genomes using high-throughput linked-read sequencing

Read this article at

Abstract

Related collections

Genome Engineering using CRISPR

Most cited references 23

A genetic model for colorectal tumorigenesis.

Mapping and sequencing of structural variation from eight human genomes.

EML4-ALK fusion gene and efficacy of an ALK kinase inhibitor in lung cancer.

Author and article information

Journal

Affiliations

Author notes

Article

History

Categories

Comments

Comment on this article

Similar content 39

Cited by 311

Most referenced authors 1,883