39
views
0
recommends
+1 Recommend
0 collections
    0
    shares
      • Record: found
      • Abstract: not found
      • Article: not found

      Resolving the complexity of the human genome using single-molecule sequencing

      Read this article at

      ScienceOpenPublisherPMC
      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          The human genome is arguably the most complete mammalian reference assembly, yet more than 160 euchromatic gaps remain and aspects of its structural variation remain poorly understood ten years after its completion. To identify missing sequence and genetic variation, here we sequence and analyse a haploid human genome (CHM1) using single-molecule, real-time DNA sequencing. We close or extend 55% of the remaining interstitial gaps in the human GRCh37 reference genome--78% of which carried long runs of degenerate short tandem repeats, often several kilobases in length, embedded within (G+C)-rich genomic regions. We resolve the complete sequence of 26,079 euchromatic structural variants at the base-pair level, including inversions, complex insertions and long tracts of tandem repeats. Most have not been previously reported, with the greatest increases in sensitivity occurring for events less than 5 kilobases in size. Compared to the human reference, we find a significant insertional bias (3:1) in regions corresponding to complex insertions and long short tandem repeats. Our results suggest a greater complexity of the human genome in the form of variation of longer and more complex repetitive DNA that can now be largely resolved with the application of this longer-read sequencing technology.

          Related collections

          Most cited references 11

          • Record: found
          • Abstract: found
          • Article: not found

          A high-resolution recombination map of the human genome.

          Determination of recombination rates across the human genome has been constrained by the limited resolution and accuracy of existing genetic maps and the draft genome sequence. We have genotyped 5,136 microsatellite markers for 146 families, with a total of 1,257 meiotic events, to build a high-resolution genetic map meant to: (i) improve the genetic order of polymorphic markers; (ii) improve the precision of estimates of genetic distances; (iii) correct portions of the sequence assembly and SNP map of the human genome; and (iv) build a map of recombination rates. Recombination rates are significantly correlated with both cytogenetic structures (staining intensity of G bands) and sequence (GC content, CpG motifs and poly(A)/poly(T) stretches). Maternal and paternal chromosomes show many differences in locations of recombination maxima. We detected systematic differences in recombination rates between mothers and between gametes from the same mother, suggesting that there is some underlying component determined by both genetic and environmental factors that affects maternal recombination rates.
            Bookmark
            • Record: found
            • Abstract: found
            • Article: found
            Is Open Access

            Rapid, low-input, low-bias construction of shotgun fragment libraries by high-density in vitro transposition

            We characterize and extend a highly efficient method for constructing shotgun fragment libraries in which transposase catalyzes in vitro DNA fragmentation and adaptor incorporation simultaneously. We apply this method to sequencing a human genome and find that coverage biases are comparable to those of conventional protocols. We also extend its capabilities by developing protocols for sub-nanogram library construction, exome capture from 50 ng of input DNA, PCR-free and colony PCR library construction, and 96-plex sample indexing.
              Bookmark
              • Record: found
              • Abstract: found
              • Article: found
              Is Open Access

              A Comprehensive Map of Mobile Element Insertion Polymorphisms in Humans

              Introduction Mobile elements: significance and current catalogs Retrotransposons are endogenous genomic sequences that copy and paste into locations throughout host genomes [1]–[3]. Most mobile elements annotated in the human reference genome are remnants of ancient retrotransposition events and are no longer capable of active retrotransposition. However, a fraction of mobile elements remain active and contribute to variation between individuals in the human population. These active elements belong almost exclusively to the Alu, L1, and SVA families of non-LTR retrotransposons [4]. The Alu family is the most common mobile element in primate genomes, with more than 1.1 million copies in Homo sapiens [5]–[7]. The sequence of a full-length Alu element is 300 bp long. Alu elements are classified into a range of sub-families which have different propensities for retrotransposition, and are identified according to sequence alterations. Several AluY sub-families are currently active and are responsible for the bulk of mobile element insertion variation in Homo sapiens. The human reference genome contains over 140,000 annotated AluY elements. After Alus, L1 insertions are the next most prevalent family of mobile elements. There are over 500,000 L1 elements annotated in Homo sapiens. A full-length L1 is a sequence of roughly 6 kb in length and the most active L1 sub-family in the human lineage is L1HS [8], [9]. There are a little more than 1,500 L1HS annotated elements in the human reference. A third family of mobile element are SVA retrotransposons [10]. SVAs are hybrid elements of SINE, VNTR and Alu components that range in size up to several Kb, with more than 3,600 annotated SVA elements in the human reference genome. SVA elements are thought to be the youngest family of retrotransposons in primates [11]. Other less common classes of mobile elements, such as DNA transposons, and endogenous retroviruses are not the focus in this study. Mobile element insertions (MEI) are known to generate significant structural variation within Homo sapiens [12], [13] and have diverse functional impacts [14]–[16]. In vitro experiments identified key features of Alu [17] and L1 [18] elements responsible for retrotransposon activity. The identification of MEI variant loci in humans initially began with disease-causing insertion events (e.g. hemophilia [19], breast cancer [20]). Experimental approaches were based upon library screening and small-scale PCR based display assays [21]. These approaches have been augmented by comparisons of the NCBI and the HuRef genomes [22], [23], large scale fosmid-end sequences [24], and targeted sequencing of element-specific PCR products [25]–[28]. The dbRIP database of MEI polymorphisms [29] currently contains 2,691 polymorphic loci, enabling early estimates for the total number of segregating events [25] and per-generation mutation rates [23]. MEI polymorphisms can be detected either as insertions or as deletions in samples relative to the reference genome. Mechanistically, however, both types of observations are due to retrotransposon insertion; precise excisions of mobile elements are essentially non-existent [1]. Therefore MEI detected as deletions are, in fact, retrotransposon insertions in the reference DNA and can be verified as such by comparison with ancestral genomes. Detection and genotyping properties of MEI detected as insertions (“non-reference MEI”) and as deletions (“reference MEI”) are substantially different. We present their respective properties separately before combining the two detection modes into a unified MEI analysis. The deletion detection methods and properties of the full set of 1000GP deletions have been extensively described in the 1000GP CNV companion paper [30]. This allows us to focus on specific properties of the reference MEI subset of those deletions. Effective computational algorithms using second-generation sequencing data exist for identifying deletions [27], [31], [32], and have been used to find MEI in particular [33]. Detecting non-reference MEI directly as insertions from whole genome shotgun sequence data poses a more challenging problem, owing to the inherent difficulties associated with accurate mapping of sequenced reads derived from highly repetitive regions of the genome. Only recently have methods been developed for the purpose of non-reference MEI detection from second-generation whole genome shotgun data including published studies of L1 element insertions [34] and of Alu insertions [35]. These studies adopted similar computational approaches to one of our insertion detection methods (the read pair method, see Materials and Methods) and have different detection properties (Text S2 Comparisons, Figures S8, S9, S10). Relative to previous studies, we present a broad analysis of MEI variation in the human population; with more variant loci detected, from the three major mobile element families, using multiple detection methods, each with comprehensive experimental validation (Table 1). The present study represents the combined efforts of the MEI sub-group of the 1000 Genomes Project and has been prepared as a companion to previous 1000GP pilot publications [30], [36]. The MEI analyzed in this study were included the 1000GP variant call release of July 2010 (ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/release/2010_07), also provided as Table S1). The specific purpose here is to provide a more detailed description of the methods, validation experiments, and properties of the 1000GP catalog of MEI events, and to extend the analysis by adding genotype information, population allele frequencies, and population specific mutation rates. 10.1371/journal.pgen.1002236.t001 Table 1 1000 Genomes Project pilot data used for mobile element insertion discovery. Non-reference MEI Reference MEI Detection method Illumina RP 454 SR RP+SR Combined deletion detection algorithms Dataset Low Cov Trio Low Cov Trio Total Low Cov Trio Total Number of samples 138 6 22 2 156 169 6 175 Coverage per sample 2.2x 16.4x 2.0x 7.6x 3.0x 3x 25x 3.9x Alu insertions 2882 1786 2420 1284 4500 1689 1420 1730 L1 insertions 345 192 396 172 792 193 170 206 SVA insertions 49 35 17 7 79 70 65 74 Loci PCR tested 193 186 182 185 746 - - - Loci validated 183 182 173 174 712 1873 1615 1927 FDR (%) 5.2±1.6 2.2±1.1 4.4±1.6 5.5±1.1 4.5±0.8 - - - Number of samples, average read coverage, detected loci, and validation results are shown. Non-reference MEI false detection rates (FDR) were based on validation results at randomly selected loci. In addition to PCR validation, reference MEI were also tested for validation as deletions by local assembly. The FDR for reference MEI, including the additional MEI selection criteria, is estimated to be and averaged over the samples in the group. Values of μMEI for each population and each element family are compared to μMEI derived from θ fitting (Figure 7c) and are listed in Table 4 with 95% statistical confidence intervals. Confidence intervals from the allele frequency fits (error bars in Figure 7c) are larger than statistical errors from the averaged heterozygosities over samples (error bars within the circles on Figure 7c) because each sample provided independent observations for the average heterozygosity, whereas in the allele frequency spectra fits all samples were combined. Both estimates are subject to systematic errors that may arise from the detection, genotyping, and correction procedures. To test for systematic biases in μMEI we re-processed both allele frequency spectra and heterozygosity estimates over a range of genotype selection thresholds (Text S2: Stability, Figure S18) and found consistent trends in μMEI among the population groups and element families, although the overall scales of the mutation rates are uncertain to 20%. Values of the element specific mutation rates in Table 4 and Figure 7c are consistent with previous reports [23], [25], [68]. In summary, careful error analysis led us to believe that the differences in the mutation rate observed between the different population sample groups are likely to result from biological processes, rather than measurement or analytical artifacts. 10.1371/journal.pgen.1002236.g007 Figure 7 MEI and SNP heterozygosity in low coverage samples. a) MEI vs. SNP heterozygosity scatterplot (πMEI vs. πSNP): The dashed line is a linear model constrained to pass through the centroid of the YRI (red) samples and the origin. The gray region represents an extrapolation from human-chimpanzee (H-C) MEI and SNP differences between the respective genome assemblies. b) Averaged population μMEI vs. coalescent time scaled to thousands of years, assuming that SNP mutation rate is a steady clock (μSNP∼1.8×10−8 mutations per site per generation). c) MEI mutation rates based on heterozygosity (solid circles) and based on allele frequency fits (vertical error bars) for population groups (CEU: blue, YRI: red, CHBJPT: green, all three: black) and estimated separately for element families (all families combined: MEI, Alu, L1, and SVA). Error bars are statistical only. 10.1371/journal.pgen.1002236.t004 Table 4 MEI population properties. population element Θ [95% CI] μ(θ) [95% CI] χ2 d. f. Π [95% CI] μ(π) [95% CI] all MEI 1860 [1540–2170] 0.0464 [0.0384–0.0543] 75.4 78 2160 [2130–2200] 0.0499 [0.0490–0.0507] CEU MEI 1700 [1360–2040] 0.0425 [0.0339–0.0510] 52.3 39 2040 [2020–2070] 0.0493 [0.0487–0.0499] YRI MEI 2240 [1690–2790] 0.0559 [0.0421–0.0697] 39.9 39 2480 [2430–2530] 0.0488 [0.0478–0.0499] CHBJPT MEI 1550 [1220–1870] 0.0387 [0.0306–0.0468] 70.5 39 2030 [2000–2060] 0.0533 [0.0525–0.0541] all ALU 1570 [1310–1830] 0.0392 [0.0326–0.0457] 83.9 78 1880 [1840–1910] 0.0432 [0.0424–0.0439] CEU ALU 1440 [1150–1720] 0.0359 [0.0289–0.0430] 55.4 39 1770 [1750–1800] 0.0428 [0.0422–0.0434] YRI ALU 1830 [1390–2270] 0.0458 [0.0348–0.0569] 43.4 39 2150 [2100–2200] 0.0423 [0.0414–0.0433] CHBJPT ALU 1300 [1020–1570] 0.0324 [0.0256–0.0391] 86.5 39 1750 [1720–1780] 0.046 [0.0453–0.0468] all L1 224 [120–329] 0.0056 [0.0030–0.0082] 51.9 71 264 [257–270] 0.0061 [0.0059–0.0062] CEU L1 223 [100–346] 0.0056 [0.0025–0.0086] 49.6 38 243 [234–252] 0.0059 [0.0057–0.0061] YRI L1 326 [118–535] 0.0082 [0.0029–0.0134] 59.6 39 303 [292–314] 0.006 [0.0057–0.0062] CHBJPT L1 166 [70–262] 0.0041 [0.0018–0.0066] 49.7 39 251 [243–258] 0.0066 [0.0064–0.0068] all SVA 80 [48–113] 0.002 [0.0012–0.0028] 15.4 39 55 [53]–[58] 0.0013 [0.0012–0.0014] CEU SVA 38 [18]–[58] 0.001 [0.0004–0.0014] 10.4 27 51 [48]–[54] 0.0012 [0.0011–0.0013] YRI SVA 64 [26–101] 0.0016 [0.0006–0.0025] 11.2 24 61 [56]–[65] 0.0012 [0.0011–0.0013] CHBJPT SVA 46 [21]–[72] 0.0012 [0.0005–0.0018] 12.5 27 55 [51]–[59] 0.0014 [0.0013–0.0015] MEI diversity parameter θ was fit from the allele frequency spectra for the listed populations and element families. “all” is the full dataset of all three population groups. Insertion rates μ(θ) were derived from the θ values, assuming an effective population size of 10,000. The MEI population heterozygosities π were averaged over samples in the given population group. MEI insertion rates μ(π) were derived from Eq (2) relative to the SNP mutation rate. All insertion rates are listed in units of insertions per genome per generation. Discussion Common MEI polymorphisms in the human population MEI alleles propagate within population groups much like other predominantly neutral polymorphisms. MEI allele frequency spectra from the low coverage samples are in general agreement with expectations from the standard neutral model for allele drift in a population. The major differences in allele frequency spectra between non-reference and reference MEI (Figure 5a–5c) are explained by the ascertainment condition that the derived MEI allele occurs in a given sample (the reference) and are in agreement with expectations based on a coalescent simulation of MEI population drift (Figure S17). MEI allele frequency spectra among the three population groups exhibits a similar trend to SNPs (Figure 6b), although the MEI spectrum in the Asian samples is a poor fit to the θ/i form (χ2/d.f.∼2 from Table 4) with an excess of high frequency alleles and a deficit at low frequency (Figure 5e). MEI allele frequencies were based on MEI detected and genotyped across three element families (Alu, L1, and SVA), from both non-reference and reference MEI, and multiple detection methods (RP and SR), each with characteristic detection sensitivities and false detection rates. Corrections for these effects, as well as genotyping efficiencies, were included in the allele frequency spectra. Measurements of MEI heterozygosity offer a more direct method to estimate MEI insertion rates. Like the allele frequency spectrum, heterozygosity is dependent on accurate genotyping and includes corrections for efficiency losses, but in this case the corrections were made on a per sample basis, which is more specific since sample coverage is the dominant limitation for detection and genotyping power (Figure S6). The heterozygosity measurement also has an advantage in that each sample is an independent estimate of the population average and . The heterozygosity measurements revealed evidence for differential MEI mutation rates among the three population groups. The probability that the Asian population samples have the same MEI mutation rate as the other two population groups is very low (paired t-test p-value 10% is 4500, and 9000 for frequency>1%, with 20% uncertainty arising from parameter estimates. Counting only those sites with a sufficient number of genotypes to measure allele frequency, our dataset contains more than half of the segregating human MEI sites with frequency>10%. Significance This study of the 1000GP pilot datasets is a sizable step toward a complete population-based catalog of common human MEI polymorphisms, made possible by targeting both non-reference and reference MEI events in the human genome. We identified 7,380 polymorphic mobile element insertions from the Alu, L1, and SVA families. Based on experimental validation of random subsets of loci we estimate that the false discovery rate in this study is less than 5%. Detection power for common alleles (allele frequency>10%) varies between non-reference MEI (70%–80%) and reference MEI (>90%). We were also able to assemble the inserted sequence for more than 1,000 non-reference Alu MEI and found consistent proportions of Alu sub-families in comparison to MEI identified in HuRef. This comprehensive variant discovery and genotyping effort allowed us to directly compare the segregation properties of different variant types from the same dataset. Our analysis revealed that, to a first approximation, the evolution of MEI variants is similar to SNPs and consistent with neutral models [52], [53], except in exonic regions where they are subject to negative selection on the scale that acts against SNP variants resulting in stop codon loss. An intriguing finding from our data, however, is the detection of signals suggesting a recent increase in MEI rates in humans. Materials and Methods Non-reference MEI detection Both the SR and the RP methods were based on identification of non-reference MEI as clusters of mapped DNA fragments in which one end mapped to the consensus sequence of a mobile element while the other end was uniquely mapped to the reference genome in a location inconsistent with a known mobile element location in the reference (Figure 1a–1b). The RP method required at least two MEI supporting fragments across both the 5′ and 3′ insertion breakpoints for each candidate MEI from the pooled datasets (the low coverage and trio pilot data were pooled separately). The SR method required only one MEI supporting fragment across either the 5′ or 3′ breakpoints for candidate events. We used 52 consensus element sequences from Repbase [69] (www.girinst.org, version 14.03, Table S11) to identify reads mapping to mobile elements. The RP method used Mosaik [70] (bioinformatics.bc.edu/marthlab/Mosaik, version 0.9.1176) for read pair mapping of Illumina paired-end data to the NCBI36 human reference (build 36.3) and the Spanner [40] program to identify non-reference MEI by clustering supporting fragments [40], [71], [72]. The SR method also used Mosaik to align 454 data, for full read mapping as well as for split-read mapping. We used extensive simulation experiments [73] to optimize detection methods, algorithm parameters, and post-process MEI candidate event selection filters (further details are provided in Text S2). Reference MEI selection The 2,010 reference MEI events are a subset of the 1000GP pilot release of 22,025 deletions [30]. 95% of the MEI sites detected as deletions were found by more than one algorithm but the dominant mapping algorithms were Mosaik, and Maq [74], with detection algorithms Spanner, Pindel [41], BreakDancer [38], and GenomeSTRiP [39]. Two selection criteria ensure that a given deletion corresponds to a true variant MEI inser a given deletion corresponds to a true variant MEI inserted in the reference genome: The deletion coordinates match to an annotated Alu, L1, or SVA element [6] in the hg18 reference, defined as >50% reciprocal overlap and the start and end coordinates both match within a window of 20 bp for Alus, or 200 bp for L1s and SVAs. At least 75% of the deleted region corresponds to a gap in the chimpanzee genome assembly [37]. MEI event matching between algorithms and studies Non-reference MEI detected by the SR and RP methods were merged according to a 100 bp matching window around the leftmost insertion coordinates. To assess call set intersections between this study and other published lists of non-reference MEI, we used a matching window of 200 bp around each insertion position. We adopted the ‘leftmost’ coordinate convention (Figure S1), in keeping with 1000GP call sets, whereas other studies used rightmost or unclear coordinate conventions. The respective scales of the matching windows were dictated by the characteristic position resolutions of the algorithms (Figure S7, Figure S10), which varied considerably from study to study. Redundant loci from recent publications were not counted multiple times in Figure 2e. To identify matching reference MEI to other studies we required at least 50% reciprocal overlap between the starting and ending NCBI36 deletion coordinates. Calculations of sample sequence coverage For SR detection the relevant coverage statistic is 454 base coverage, counts of aligned reads covering a given base, averaged across the accessible genome. For RP detection the driving coverage statistic is Illumina read-pair spanning coverage, counts of fragments in which the non-sequenced segment of the fragment between the reads cover a given base, averaged across the genome (Table S2). Validation methods The four non-reference MEI event lists (Table 1) were submitted to the 1000 Genomes Structural Variation subgroup for validation experiments to assess false detection rates. 200 loci from each list were randomly selected for primer design and subsequent PCR validation. Primers were designed as described previously [32], [36] to span across the insertion breakpoint and to guarantee unique mapping to build 36.3. In addition to the estimation of the false detection rates, validation genotypes were derived from gel-band size comparison for each sample and site tested by PCR. We also used the validation data to estimate detection sensitivity based on the overlap of events called between the two independent sequence data platforms and algorithms. For loci with ambiguous PCR results, no amplification, or amplification of only the empty insertions site, a second primer pair was designed. For the primer design, 600 bp of flanking sequence on either side of the insertion site was retrieved from genome.ucsc.edu using Galaxy. Alu elements within the flanking sequence were masked to “N” using RepeatMasker (repeatmasker.org). Primers were designed with BatchPrimer3 v2.0 in the flanking sequence, leaving at least 100 bp before and after the predicted insertion site. Next, all primers were tested with BLAT to determine the number of matches in the human genome. If one primer of a primer pair matched several times and the other primer was unique, a virtual PCR was performed. Primer combinations with one predicted PCR product were tested on our panel. Otherwise primers were designed manually (if possible) after repeat-masking the flanking sequence with the complete repeat library. In addition, for L1 and SVA loci without unambiguous PCR amplification, primers were designed, placing one primer within the 3′ end of the mobile element sequence [75]. The primers were designed to match the consensus sequences of the youngest L1 and SVA sub-families. All PCR primers were ordered from Sigma Aldrich, Inc. (St. Louis, MO). All LSU-designed PCR primer sequences used in this project can be found at http://batzerlab.lsu.edu. DNA samples for PCR verification A subset of 25 DNA samples from the low coverage pilot samples and all six trio samples were used in PCR validations (Table S4). Each DNA panel also included a population out-group sample, an individual of South American origin (NA17310, Coriell) for low coverage pilot, and an individual of Asian origin (NA17081, Coriell) for the trio pilot. Additional control DNA samples on both panels included human cell line DNA, (HeLa; ATCC CCL-2) as well as “Pop80”, a locally pooled DNA sample from different individuals of diverse geographic origins (Asia, Africa, South American, and European). This sample serves as a diagnostic tool because amplification of an empty site alone in all samples (including Pop80) strongly indicates that the putative insertion is absent (false positive). In contrast, the presence of an MEI in a single study subject, while absent in Pop80, points toward a potential de novo retrotransposon insertion, or at least an insertion with a low allele frequency (AF). Chimpanzee DNA (NS06006, Coriell) was also included on each panel, representing the presumptive pre-insertion site for each event (empty site) as another PCR control. In addition to the subset of 25 individuals used for the low coverage pilot validations, four more DNA samples from the low coverage pilot dataset were obtained for subsequent experiments. DNA samples NA12872, NA12814, NA12815 and NA12044 (CEPH/Utah USA) were purchased from the Coriell Institute for Medical Research. All 35 samples (25+6+4) were used for PCR validations associated with MEI events detected specifically in exons. PCR details (LSU) PCR amplifications were performed in 25 µl reactions in a 96-well format using either a Perkin Elmer GeneAmp 9700 or a BioRad i-cycler thermo-cycler. Each reaction contained 15–50 ng of template DNA; 200 nM of each oligonucleotide primer; 1.5 mM MgCl2, 1× PCR buffer (50 mM KCl; 10 mM TrisHCl, pH 8.3); 0.2 mM dNTPs; and 1–2 U Taq DNA polymerase. Full-length L1 and SVA elements typically exceed the limitations of standard DNA Taq polymerase in PCR. For L1 insertions, LA-Taq DNA polymerase (Takara Bio USA, Clontech Laboratories, Inc. Mountain View, CA) was used in the PCR reactions according to the manufacturer's instructions to enhance the yield of long PCR templates (2–10 kb). SVA elements are particularly GC-rich and difficult to amplify in PCR if full-length even with special long-template polymerases. In order to evaluate presence/absence of these insertions using PCR, we performed a PCR using one primer residing within the SVA insertion in conjunction with an external primer (forward or reverse, depending on the orientation of the predicted insertion). To determine the genotype and presence of the insertion, two separate PCR reactions were required in these instances. A PCR using primers flanking the MEI amplified a PCR product if the MEI was “absent.” A separate PCR with internal primers detected the MEI “present” site. In addition, this approach was also used for some L1 loci to confirm the presence/absence of the insertion and to minimize the chance of false non-detection. PCR experiments were carried out in three different laboratories yielding similar success rates. At EMBL, PCRs were preformed using 10 ng of NA12878 genomic DNA (Coriell) in 20 µl volumes in a C1000 thermocycler (BioRad). Two different enzymes, iProof High Fidelity DNA Polymerase (Biorad) and Hotstart Taq (Qiagen) were used, with comparable results. PCR conditions for iProof were: 98°C for 1 min, followed by 5 cycles of 98°C for 10 s, 68°C for 20 s and 72°C for 4 min and 30 cycles of 98°C for 10 s, 66°C for 20 s and 72°C for 4.5 min, followed by a final cycle of 72°C for 5 min. PCR conditions for HotStart Taq were: 94°C for 15 min, followed by 5 cycles of 94°C for 30 s, 60°C for 30 s and 72°C for 3 min and 30 cycles of 94°C for 30 s, 56°C for 30 s and 72°C for 3.5 min, followed by a final cycle of 72°C for 5 min. PCR products were analyzed on a 1% agarose gel stained with Sybr Safe Dye (Invitrogen) and a 100 bp ladder and 1 kb ladder (NEB). PCR reactions at Louisiana State University were performed under the following conditions: initial denaturation at 94°C for 90 sec, followed by 32 cycles of denaturation at 94°C for 20 sec, annealing at 61°C for primers designed by pipeline or 57°C for other primer design for 20 sec, and extension at 72°C for 30 to 90 sec depending on the predicted PCR amplicon size. PCRs were terminated with a final extension at 72°C for 3 min. When LA-Taq DNA polymerase was used to amplify L1 insertions, the extension step of each cycle was carried out at 68° for 8 min, 30 sec, followed by a final extension step at 68° for 10 minutes at the end of the run. 20 µl of each PCR product were size-fractionated in a horizontal gel chamber on a 2% (Alu and SVA) or 1% (L1) agarose gel containing 0.1 µg/ml ethidium bromide for 60 minutes at 175 V or 1 hour/45 min at 150 V, respectively. UV-fluorescence was used to visualize the DNA fragments and images were saved using a BioRad ChemiDoc XRS imaging system (Hercules, CA). An outcome from the validation experiments on the 86 gene-interupting MEI was a high false detection rate for candidate Alu insertions in close proximity to 7SLRNA annotations. Subsequently we reclassified all 22 Alu insertion candidates within 200 bp of a 7SLRNA as invalidated (Table S1). Detection sensitivity The two non-reference MEI detection methods use independent DNA libraries. So the overlap between the RP and SR are governed by the respective detection sensitivities, statistically akin to the Lincoln-Peterson method [76] used in ecological studies to estimate the size of a population based on two random capture and recapture samplings. This estimate assumes that the two algorithms are sensitive to the same type of events and that the difference between the event lists is a sampling issue. The expression for the detection respective detection sensitivities (εRP and εSR ) depends on the false detection rates (fRP and fSR ) provided by the validation experiments, the counts of loci detected by each method (nRP and nSR ), and the count of loci detected by both methods (nRP.SR ): (3) Given detection sensitivities εRP and εSR from independent datasets and methods, the combined detection sensitivity (RP+SR) becomes: (4) for samples in which both types of data were available (e.g. trio samples NA12878 and NA19240). Genotyping methods For reference MEI we used available genotypes calculated by GenomeSTRiP [39] for the 1000GP deletion call set. GenomeSTRiP results were not readily available for non-reference MEI so we developed a simple Bayesian framework to estimate the posterior probability for each possible genotype. The posterior genotype probability is: (5) where NALT and NREF are the counts of fragments supporting the alternate and reference alleles respectively; g is the genotype (i.e. homozygous reference allele, heterozygous, homozygous insertion allele); P(g) is the prior probability for the genotype g (a flat prior was used, P(g) = 1/3); pg is the expected fraction of insertion fragments given a genotype g (i.e. pg = 0.5 for heterozygous insertions, pg∼0 for homozygous reference, and pg∼1 for homozygous insertions); Pbin(NALT,NALT+NREF,pg) is the binomial probability that NALT+NREF fragments will fluctuate to NALT , given an expected fraction pg . The called genotype for a given site is the genotype with the maximum posterior probability. The Bayesian framework also provides genotype likelihoods, which are used to construct genotyping quality metric (GQ) for each site and sample. The GQ value adopts the “phred” quality convention: (6) Where P(g|NALT,NREF) is the posterior probability for the called genotype from Eq. (5). GQ is highly dependent on the total number of supporting fragments (reference plus insertion). A selection of sites at GQ = 7 should correspond to roughly to 80% genotyping accuracy and corresponds to sites with 2 or more supporting fragments. Allele frequency spectra MEI loci with at least 25 genotyped samples per population (50 samples for the combined population spectra) were included in allele frequency spectra. Sites of GQ≥7 non-reference MEI and of GQ≥10 reference MEI were included. For loci with more than 25 genotyped samples, a random subset of 25 was used for the allele count spectra (Figure 5). For the allele frequency spectra (Figure 6a–6b) we projected down to 25 samples according to the hypergeometric distribution [56], [57] which smooths the spectrum while retaining all available information from loci with more than 25 genotyped samples. Hypergeometric projection was not used to build the allele count spectra used for fitting purposes because it introduces correlation among allele count bins. We constructed the allele count spectra for MEI events detected as insertions and those detected as deletions separately to account for the distinct ascertainment conditions before combining them into the aggregate spectrum. The combined spectrum includes corrections for respective detection and genotyping efficiencies: (7) where nREF(i) and nNREF(i) are the counts of genotyped loci for reference (e.g. Figure 5b) and non-reference MEI (Figure 5a) at allele count i, KREF and KNREF are scaling factors for each detection mode (not dependent on i), and nMEI(i) is the net count of MEI variant loci at a given allele count i (Figure 5c–d). The correction factors depend on the detection sensitivity (εDET) and genotyping efficiency (εGEN) as K = (εDET·εGEN)−1. Genotyping efficiency is simply the fraction of detected sites with 25 more genotyped samples (Table S9). Detection efficiency is described above (Detection specificity and sensitivity, Figure 3d). SNP allele frequency spectra (Figure 6b) were based on the 1000GP release VCF files (ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/release/2010_07/) with no corrections. SNP allele frequency spectra were projected down to 50 samples using the hypergeometric distribution [56], [57]. Functional calculation of suppression factor Only non-reference MEI with insertion position confidence intervals entirely within annotated regions (Gene, UTR, CDS) were counted. No MEI that were subsequently invalidated were counted. Relative to random placement across the genome the MEI suppression or boost factor is defined as: (8) where Ntot is the total number of MEI loci, Ltot  = 2.85×109 bp is the length of the accessible genome, Lobs is the size of the region (1 MB or the sum of coding regions) where the number of observed MEI is Nobs . The null model for MEI placement results in a binomially distributed Nobs , which is generally not far from what we observe, except in the case of functional regions (suppressed) and HLA (hotspot). For the calculation of MEI inserted in CDS regions, only non-reference MEI were considered, since an embedded reference MEI precludes annotation as a coding sequence. Heterozygosity MEI and SNP heterozygosity for each sample were calculated from the counts of genotyped heterozygous sites. For MEI, the total numbers of genomic heterozygous sites were estimated with corrections for genotyping efficiency and detection sensitivity. The genotyping efficiency for a given sample is the fraction of detected loci with high quality (GQ≥7 non-reference, of GQ≥10 reference MEI) genotypes. There is also a sample specific correction for genotyping bias against heterozygotes at sites with limited fragment coverage: (9) where the sum is over genotyped loci passing the GQ threshold for the given sample, Nloci is the count of such sites, NF is the count of supporting fragments (both reference and insertion allele) at the site and binopdf is the binomial probability density function that a heterozgygous site will randomly produce only reference supporting fragments. The KHET correction was applied only to the non-reference MEI component because, for reference MEI detected as deletions, GenomeSTRiP used not just supporting fragment information for genotype likelihoods, but also used Beagle to impute missing data from linkage with local SNP haplotypes to identify heterozygous deletions. For each sample (s) the number of heterozygous MEI in the genome is estimated as: (10) where πMEI(s) is the heterozygosity for sample s, πREF(s) and πNREF(s) are the raw counts of heterozygous sites for reference and non-reference MEI, ε DET is the detection sensitivity, and ε GEN(s) is the fraction of detected sites genotyped in the given sample (Figure S13, Table S9). SNP heterozygosity is derived from the raw counts of heterozygous sites. All values of heterozygosity are in units of heterozygous sites per genome, and the length of the genome is considered to be the accessible genome (2.85 Gb) [36]. The SNP heterozygosity values are transformed to rough estimates of the corresponding coalescent time (Figure 7b) [77]: (11) where, μSNP = 1.8×10−8 mutations per site per generation, and TGEN∼25 y is the average time between generations. Supporting Information Figure S1 Insertion coordinate convention. (EPS) Click here for additional data file. Figure S2 Number of deletion call sets supporting reference MEI locus. The average number of deletions call sets supporting MEI events is about eight (blue) while for all deletions in the 1000GP release (gray dashed line) the average number of calls was about three. The peak at the call sets for Alu MEI deletions corresponds to the eight Illumina RP based call sets (BC, Wash U, WTSI, for both pilots, Broad for pilot 1 and U.Wash for pilot 2) and two SR call sets (Pindel for both pilots). (EPS) Click here for additional data file. Figure S3 UCSC browser display of reference MEI. (top) The deletion (red track with 1000GP deletion id's P1_M_061510_12_213 for low coverage pilot and P2_M_061510_12_22 from the trio pilot) matches to the annotated AluYg6 element at chr12:8516855–8517156, present in the NCBI36 reference sequence but missing in the sequenced sample. The black RepeatMasker track shows that the AluYg6 element matches the deletion start and end coordinates. The green tracks indicate the extent of the chimpanzee assembly, which does not include the AluYg6 element. The blue DGV tracks show that this particular deletion has been previously identified by several experiments with various degrees of position resolution. (bottom) Example of questionable reference MEI. The blue track at the top marks a detected deletion (id P2_M_061510_3_301) at chromosome 3, 60,660,331 bp that overlaps >50% with a short annotated L1HS element, but the start and end coordinates do not match precisely. The chimpanzee genome (in yellow) has a gap in the region, but the edges do not align precisely. This deletion was included in the count of 2,010 reference MEI, but adds to the level of uncertainty. (TIFF) Click here for additional data file. Figure S4 1000 Genome Project pilot sample breakdown. a) Venn diagram of pilot samples by sequencing platform (Illumina and 454 only). The bulk of the samples were sequenced by Illumina. The circle areas are only roughly proportional to the number of samples contained. b) Venn diagram of samples used for MEI detection (left) and genotyping (right). MEI detected as insertions (red) and deletions (blue) have different signatures and algorithms resulting in the difference between the samples used. (TIFF) Click here for additional data file. Figure S5 Illumina paired end fragment length distributions. Left) Low coverage pilot fragment length distributions for a random selection of 20 lanes of Illumina read pair data. Most libraries have a median fragment length from 100 to 300 bp with a wide variety of shapes. Right) Trio pilot fragment length distributions for 130 lanes of Illumina read pair data for NA12878. Five libraries are shown in different colors with different characteristic shapes. The small peak visible in orange at 550 bp is shifted by 300 bp from the main peak. This small peak arises from reference Alu insertions of length 300 bp. This small Alu peak occurs for all libraries in both pilots. (EPS) Click here for additional data file. Figure S6 MEI insertion sensitivity vs. coverage for the two methods. Coverage for the RP method is quantified as “span” coverage on the blue scale. Span coverage is calculated based on the fragment gap between the reads at the end of the fragment where RP detection is sensitive to large structural variations. The SR algorithm sensitivity depends on read coverage (red scale at the top) because the insertion can be detected anywhere within a given read (except within 20 bp of the ends). The detection sensitivity at maximum coverage is determined by the trio overlap calculations from Table S6. Sensitivity at reduced coverage values is calculated by down sampling the number of supporting reads and counting the fraction of insertions that survive the selection criteria. (EPS) Click here for additional data file. Figure S7 Non-reference MEI insertion breakpoint resolution. (top) the position residual between matched RP to SR insertions. (bottom) 1000GP loci vs. dbRIP. The dbRIP hg18 coordinates were shifted by TSD such that both lists adopt the ‘leftmost’ coordinate convention. (EPS) Click here for additional data file. Figure S8 Venn diagrams of MEI insertion overlap with recent studies. (top) L1 overlap with Ewing and Kazazian [34]. (bottom) Alu overlap with Hormozdiari et. al. [35]. (EPS) Click here for additional data file. Figure S9 Genomic distance to nearest element of the same family. (top) Non-reference MEI. 1000GP and HuRef distributions are plotted as well as L1 distances for Ewing and Kazazian [34] and Alu distance for Hormozdiari et. al. [35]. Distances <1 indicate insertions within annotated elements. (EPS) Click here for additional data file. Figure S10 Insertion position resolution comparison. Non-reference MEI were matched to dbRIP using a 200 bp window. (EPS) Click here for additional data file. Figure S11 Number of MEI per 1 MB binned regions across genome. (top) Dotted gray line is a simple Poisson model for MEI distributed uniformly across the accessible genome (2.85 Gb). The red arrow points to a significant hotspot in chromosome 6, position 33 Mb in the HLA region where 19 MEI were detected in a 1 MB region. (bottom) MEI density profile across chromosome 6 showing spike in region of HLA at 33 Mb. (EPS) Click here for additional data file. Figure S12 MEI insertion length. a) Comparison of insertion lengths with 617 dbRIP assembled MEI insertions that match 1000 Genomes MEI using a 200 bp window around insertion position. b) MEI insertion length residual distribution. c) The insertion length from MEI deletions (red) is the number of reference nucleotides in the deleted region (the annotated mobile element plus one copy of the TSD and any carry-over sequence). Sharp peaks at 300 bp and 6000 bp are the Alu and L1 insertions respectively. The insertion length for MEI detected as insertions (blue) is estimated from the span of the mapping coordinates within the mobile element. This estimate does not take into account any inserted sequence that is not part of the mobile element such as the TSD, poly-A tail, or carry-over sequence. (EPS) Click here for additional data file. Figure S13 Genotyping efficiency. top) Fraction of MEI sites surviving genotype quality thresholds in low coverage data for non-reference MEI (blue steps, GQ≥7) and for reference MEI (red, GQ≥10). Also shown is genotype accuracy based on validation experiments for non-reference MEI (dashed with grey 95% confidence interval). bottom) Sample-by-sample fraction of MEI sites surviving genotype quality threshold for vs. coverage in low coverage samples. Non-reference MEI (crosses) show a genotyping efficiency approaching 60% at 4 fragments/base spanning coverage, while reference MEI (circles) genotyping efficiency is nearly flat at 80%. Samples from the three population groups show the same trends. Coverage here is calculated as spanning coverage, most relevant for RP detection. (EPS) Click here for additional data file. Figure S14 Hardy-Weinberg Equilibrium test. Proportions of each genotype as a function of allele frequency for each population group (blue: CEU, red YRI, and green CHBJPT). Also plotted in gray dashed lines for comparison is the proportion expected from HWE. (EPS) Click here for additional data file. Figure S15 Genotype Matrix of low coverage samples. Each element in the matrix corresponds to a sample and a locus at which the genotype is color coded. Sample populations are labeled across the top, separated by green lines. The chromosome order for the MEI loci is labeled on the right side, with non-reference MEI (“insertions”) and reference MEI (“deletions”) grouped separately. This matrix was input to Principal Component Analysis for plotted in the main text Figure 6c (Figure S16d). (EPS) Click here for additional data file. Figure S16 Principal Component Analysis population clustering for PCR genotypes, MEI ins, MEI del, combined. A matrix of genotypes for each site and sample was input to a PCA and the resulting first two components are plotted against each other. The sum of insertion alleles is the value in the matrix elements. For elements corresponding to sites and samples without genotypes, the global average genotype value was used. a) Genotypes from PCR validation for the low coverage pilot. b) Genotypes from low coverage non-reference MEI only. c) Genotypes from reference MEI only. d) Genotypes from samples with both non-reference and reference MEI. Population clusters become tighter as more MEI insertion information is added to PCA. (EPS) Click here for additional data file. Figure S17 Coalescent simulation allele frequency spectra for the combined CEU, YRI, CHB and JPT population groups. AF is binned in units of 0.1. The lowest bin (0–0.1) is not plotted to allow the spectra at higher AF to be compared. The normalizations for MEI detected as insertions (red) and deletions (green) are set to that the two components sum to the total unbiased MEI AFS (blue). (EPS) Click here for additional data file. Figure S18 MEI insertion rate vs. coalescent time for increasing MEI site selection thresholds. The estimated MEI insertion rates (main text Eq.2) for each sample is plotted vs. the coalescent time derived from SNP heterozygosity. Panel a) is the same as Fig. 7b from the main text and corresponds to genotyped sites with GQ≥7, which also corresponds to sites with at least two supporting fragments. As more supporting fragments are required b) NF≥3, c) NF≥5, d) NF≥7, the numbers of genotyped sites decrease, but the trend between populations in the MEI insertion rates remains. (EPS) Click here for additional data file. Table S1 Combined MEI event list (external Excel file). Genomic coordinates with confidence intervals are listed for each of the 7380 MEI loci. Each event is characterized by an element type (ELEMENT = Alu, L1, or SVA), element STRAND (+ or −), detection (DET = DEL or INS for non-reference and reference MEI respectively), event ID, estimated insertion length (LEN), detection algorithm (ALG), validation status (VAL), validation method (VALMETH = PCR, ASM for assembly, 7SLRNA should be discarded due to proximity to annotated 7SLRNA element), population (POP = CEU, YRI, CHB, or JPT), allele frequency in three major groups (AF), number of genotyped samples in the three groups, number of insertion alleles in the three groups, previous study ID's (DBVARID, DBRIPID, PUBID), TSD length, number of insertion-supporting fragments from the 5′ side (NALT5), from the 3′ side (NALT3), the 1000 Genomes CALL SET name, quality value (Q), gene/exon/UTR/CDS interrupted (GENE), sub-family, and inserted sequence when available, and a list of all samples in which the alternate allele was detected (ALTSAMPLES). Note: 71 events identified by the VAL field as invalidated or in close proximity to a 7SLRNA loci are marked in yellow and were not included in the counts of interrupted genes, exons, UTRs, or CDS regions. (XLSX) Click here for additional data file. Table S2 Samples with corresponding sequence coverage (external Excel file) Sequence coverage for each of the 185 samples calculated in terms of Illumina span-coverage for RP detection, 454 base coverage for SR detection and Illumina base-coverage (including single-end read data) for deletion detection. (XLSX) Click here for additional data file. Table S3 Reference MEI detection method breakdown. (external Excel file) Thirteen different algorithms contributed to the detection of MEI present in the reference but not in a sample. a) Breakdown by pilot. b) Breakdown by algorithm. The bulk of MEI deletions were found by Illumina RP and SR methods. (XLSX) Click here for additional data file. Table S4 Validation genotypes for non-reference MEI datasets (external Excel file). Complete genotyping information for all samples tested at the 746 sites used for false detection rate estimates and for genotyping assessment. a) Additional validation results for non-reference MEI loci (external Excel file) Genome coordinates for 267 additional validation PCR experiments carried out at Yale, EMBL, and LSU. These experiments were done as preliminary tests (EMBL, Yale, LSU-PRELIM) and for testing specific loci (SVA, de novo, exon interrupting). (XLSX) Click here for additional data file. Table S5 MEI sensitivity based on comparison to gold standard events. (external Excel file) The fraction of HuRef MEI [23] found by this study is a lower limit to the detection sensitivity to common MEI alleles. a) MEI insertion detection sensitivity. b) MEI deletion sensitivity. b) MEI deletion sensitivity based on loci detected in the same samples from Mills et al. [47]. (XLSX) Click here for additional data file. Table S6 Trios (external Excel file). a) Overlap between RP and SR in the same trio samples (NA12878 and NA19240) can be used to estimate detection sensitivity. Columns RP and SR are the counts of all loci for the two samples broken down by element type. RP-only and SR-only count loci where only one method found the insertion. RP+SR is the count of loci deleted by both methods. The detection sensitivity estimates (εRP, εSR, and ε) with corresponding statistical 1-sigma errors are derived from the overlaps. The combined detected efficiency is based on the union of the two independent methods. b) Counts of MEI site differences between two individuals. The trio samples were used for this because of the relatively high coverage and corresponding sensitivity to low frequency alleles. Corrections to the counts compensate for less-than-perfect detection sensitivity and false detections. The trio children from two populations (CEU and YRI) have the most differences (2034±120) while the CEU parents have the fewest (663±120). The YRI parents' count of sites is between the other pairs. These differences are plotted vs. the corresponding coalescent time in Figure 6d (main text). c) De novo insertion hunt. Any MEI appearing in the children of the family trios but not in the parent would be a de novo MEI insertion. Six candidates from NA12878 (a) and 15 from NA19240 (b). All but one de novo candidate occurred at a site not found in any of the other samples. This site was PCR tested and identified in NA12892 (mother). (XLSX) Click here for additional data file. Table S7 Sub-family breakdown (external Excel file). Fragments from 1,105 of the Alu insertions were assembled into contigs spanning the Alu element to allow subfamily identification. The subfamilies are compared with those from the reference MEI detected as deletions and to the Venter MEI. (XLSX) Click here for additional data file. Table S8 Non-reference MEI genotyping validation (external Excel file). Genotype contingency table for non-reference MEI vs. genotypes from PCR validation experiments. “0/0” are homozygous reference, “0/1” are heterozygous insertions, and “1/1” are homozygous insertions (VCF file genotype label convention). Counts in each box are the numbers of sites and samples with the corresponding combination of genotype from sequencing and PCR. The overall genotyping accuracy is the fraction of counts on the diagonal while the genotyping efficiency is the fraction of all genotyped sites & samples divided by sites×samples for the given pilot dataset. Only genotypes with Q≥7 are included. The low coverage (a) accuracy is 87% and the efficiency is 57%. The trio pilot (b) accuracy is 95.7% and the genotyping efficiency is 89.9%. The improved genotyping performance for the trio pilot is a consequence of higher coverage. (XLSX) Click here for additional data file. Table S9 MEI genotyping corrections. (external Excel file). a) Detection sensitivity. b) Genotyping efficiency with correction factors used in constructing the allele frequency spectra for each population and element type. c) Heterozygosity counts and correction factors for each sample and element family. (XLSX) Click here for additional data file. Table S10 Loss of Function variants (external Excel file). Counts of insertions occurring within genes, UTR, and CDS regions annotated from Gencode version 3b. This table is partially shown as Table 1 in the main text. Only insertions with breakpoint confidence intervals entirely within the annotation region are counted. Any insertion candidate subsequently invalidated is not counted. A random placement model is used to estimate the number of expected insertions in the absence of selection. a) MEI counts. b) The corresponding counts of SNPs from the low coverage pilots are also listed along with the expected numbers of SNPs based on random placement. The suppression factor for MEI (∼46×) is similar to that of a SNP changing a stop codon (∼42×). (XLSX) Click here for additional data file. Table S11 Mobile element consensus sequences (external Excel file). Repbase element names and sequences for each of the element added to the reference genome for MEI insertion detection. (XLSX) Click here for additional data file. Text S1 The 1000 Genomes Project Consortium. (DOC) Click here for additional data file. Text S2 Supporting Methods. (DOCX) Click here for additional data file.
                Bookmark

                Author and article information

                Journal
                Nature
                Nature
                Springer Science and Business Media LLC
                0028-0836
                1476-4687
                January 2015
                November 10 2014
                January 2015
                : 517
                : 7536
                : 608-611
                Article
                10.1038/nature13907
                4317254
                25383537
                39d3305e-703c-4be8-9fa4-707324a8a97b
                © 2015

                Comments

                Comment on this article