Analysis of protein-coding genetic variation in 60,706 humans.

Abstract

Large-scale reference data sets of human genetic variation are critical for the medical and functional interpretation of DNA sequence changes. Here we describe the aggregation and analysis of high-quality exome (protein-coding region) DNA sequence data for 60,706 individuals of diverse ancestries generated as part of the Exome Aggregation Consortium (ExAC). This catalogue of human genetic diversity contains an average of one variant every eight bases of the exome, and provides direct evidence for the presence of widespread mutational recurrence. We have used this catalogue to calculate objective metrics of pathogenicity for sequence variants, and to identify genes subject to strong selection against various classes of mutation; identifying 3,230 genes with near-complete depletion of predicted protein-truncating variants, with 72% of these genes having no currently established human disease phenotype. Finally, we demonstrate that these data can be used for the efficient filtering of candidate disease-causing variants, and for the discovery of human 'knockout' variants in protein-coding genes.

Most cited references 13

Evolution on the X chromosome: unusual patterns and processes.

Beatriz Vicoso, Brian Charlesworth (2006)

Although the X chromosome is usually similar to the autosomes in size and cytogenetic appearance, theoretical models predict that its hemizygosity in males may cause unusual patterns of evolution. The sequencing of several genomes has indeed revealed differences between the X chromosome and the autosomes in the rates of gene divergence, patterns of gene expression and rates of gene movement between chromosomes. A better understanding of these patterns should provide valuable information on the evolution of genes located on the X chromosome. It could also suggest solutions to more general problems in molecular evolution, such as detecting selection and estimating mutational effects on fitness.

Distribution and Medical Impact of Loss-of-Function Variants in the Finnish Founder Population

Elaine T. Lim, Peter Würtz, Aki Havulinna … (2014)

Introduction

After widespread success with genome-wide association studies (GWAS) of common variants, several studies have recently begun to identify rare (with 36,000 Finns (Hardy Weinberg Equilibrium (HWE) P = 0.0077). This suggests that complete loss of TSFM might result in embryonic lethality or severe childhood disease in humans, or that such individuals might not have been ascertained by the studies employed, for example if they are too sick to be included. A lookup of this variant in another 25,237 Finnish samples in exome chip genotyping data from the GoT2D studies confirmed that the variant is present at 1.2% in Finns, but again with no homozygotes observed (combined HWE P = 1.6×10−4). Recessive missense variants in TSFM have been reported to result in mitochondrial translation deficiency [14], [15], and Finnish mitochondrial disease patients from two families have been identified with compound heterozygosity of this nonsense variant (each with a different second hit in TSFM) (personal communication), lending strong support to the hypothesis that complete loss of this gene is not tolerated in humans. We also did not observe strong associations for the TSFM Q246X heterozygotes across major diseases (Table S5).

Several other LoF variants occur in genes where recessive mutations have been noted to cause severe Mendelian diseases in the Online Mendelian Inheritance in Man database (OMIM) [16]. For instance, the Fanconi anemia complementation group M gene (FANCM) was initially discovered in one family with Fanconi anemia [17], but we did not observe any deficit of homozygous LoFs in FANCM in our dataset (expected = 5, observed = 7), as would typically be observed for a disease-causing recessive variant. Furthermore, examination of the hospital discharge records did not provide any evidence for blood diseases, increased cancer events or any other chronic diseases in these individuals with homozygous LoFs in FANCM.
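The expected-versus-observed homozygote comparisons above (for TSFM and FANCM) rest on the Hardy-Weinberg expectation that a variant with allele frequency q appears as a homozygote in a fraction q² of individuals. A minimal sketch of that arithmetic follows. It treats the 1.2% figure as an allele frequency, which is an assumption, and the Poisson approximation shown is a rough illustration, not the HWE test used in the study, so its output is not comparable to the reported P values. The function names are hypothetical.

```python
import math


def expected_homozygotes(n_individuals, allele_freq):
    """Expected number of homozygotes under Hardy-Weinberg equilibrium: N * q^2."""
    return n_individuals * allele_freq ** 2


def prob_zero_homozygotes(n_individuals, allele_freq):
    """Poisson approximation to the chance of seeing no homozygotes: exp(-N * q^2)."""
    return math.exp(-expected_homozygotes(n_individuals, allele_freq))


# Figures quoted in the text: 25,237 Finnish samples, variant present at 1.2%
# (assumed here to be the allele frequency).
n_samples = 25_237
q = 0.012

print(f"expected homozygotes: {expected_homozygotes(n_samples, q):.2f}")
print(f"approx. P(zero homozygotes): {prob_zero_homozygotes(n_samples, q):.4f}")
```

Under these assumptions roughly 3.6 homozygotes would be expected in the 25,237 samples, which is why observing none is notable.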
We also had blood counts for two homozygote individuals. Both of them had normal hemoglobin, erythrocyte size and counts as well as leukocyte and thrombocyte counts. Singh et al. reported that the initial case that led to the association of FANCM with Fanconi anemia also harbored biallelic, functional mutations in FANCA, a well-established Fanconi anemia gene [18]. Our findings in this study, combined with the findings by Singh et al., do not support the hypothesis that FANCM is a Fanconi anemia gene but rather suggest that the initial FANCM association was not causative. In addition to FANCM, we further evaluated evidence for two other genes, COL9A2 and DPYD, that were previously implicated in other Mendelian diseases (Supplementary Methods).

The FINRISK cohort had collected 60 biochemical and physiological quantitative measurements of cardiovascular or immunologic relevance (Table S6), some of which are highly correlated. We tested the 80 variants across the 60 traits and report from this initial screen all associations with p G and c.4289+1G>A) in the Lipoprotein(a) gene (LPA) with lipoprotein(a) measurements in plasma (Pdiscovery = 2.17×10−81, Pdiscovery+replication = 1.53×10−117, combined β = −0.64 or −8.77 mg/dL per allele, Table S7), the W154X variant in Fucosyltransferase 2 (FUT2) with increased Vitamin B12 levels [19] (β = 0.2, P = 3.7×10−26 or 43 pg/mL per allele, Table S8) and the R225X variant in the Citrate Lyase Beta Like gene (CLYBL) with decreased Vitamin B12 levels [20] (β = −0.2, P = 1.8×10−5 or −43 pg/mL per allele, Table S9) [21]. The boxplots for these associations are shown in Fig. S5. In addition to a strong correlation between circulating lipoprotein(a) levels and cardiovascular disease, it has been previously reported that genetic variants that elevate circulating lipoprotein(a) levels are cardiovascular risk factors [22], [23].
The converse, that lowering lipoprotein(a) levels can confer cardiovascular protection, is critical for evaluating the therapeutic hypothesis of inhibition but has not yet been evaluated. With access to National Health Records, we utilized the strong lipoprotein(a) lowering variants discovered here to evaluate the impact of lipoprotein(a) lowering via Mendelian randomization. Using a Cox proportional hazards model for incident cardiovascular disease in these cohorts (adjusted for age, gender and therapies), the composite LPA variant was found to protect against coronary heart disease (Hazard Ratio HR = 0.79, P = 6.7×10−3), demonstrating that lowering lipoprotein(a) levels is likely to confer protection against cardiovascular diseases. We adjusted the association for the composite LPA variant with a previously published risk variant (rs3798220) [22], but observed a similarly protective effect (N = 18,270, HR = 0.79, P = 0.014), suggesting that the splice variants are independent of the previously reported risk variants in LPA. We confirmed this finding using three independent non-Finnish datasets: an early onset myocardial infarction dataset of 18,000 individuals and two studies from the Estonian Biobank (4,600 and 7,953 individuals respectively), which collectively replicated the observation that the LPA variants confer a cardioprotective effect (OR = 0.87, P = 0.016). After meta-analyzing all the datasets, the final odds ratio was found to be 0.84 (P = 3×10−4, Fig. 3). We found 227 individuals who are homozygous or compound heterozygous for the two LPA splice variants with no evidence for increased morbidity or mortality based on National Health Records. This suggests that reduction of lipoprotein(a) is well-tolerated and might constitute a potential drug target for cardiovascular diseases. A survey across other diseases showed potential association between the LPA variants and acute coronary disease and myocardial infarction but not Type 2 Diabetes (Table S10).
In addition, we surveyed the LPA variants across other cardiovascular risk factors and observed that the LPA variants were associated with mildly increased glucose levels but not high-density lipoproteins (HDL), low-density lipoproteins (LDL) or triglycerides (Table S11).

Figure 3 Forest plot for the LPA splice variants with cardiovascular diseases. The cardiovascular diseases were defined as coronary heart disease (CHD), ischemic heart disease (IHD), heart failure (HF) or myocardial infarction (MI) from the various cohorts. (doi: 10.1371/journal.pgen.1004494.g003)

In addition, we observed novel associations for the FGL1, MS4A2 and ATP2C2 variants. The 1-bp c.545_546insA frameshift in the Fibrinogen-like 1 gene (FGL1) was associated with increased D-dimer levels (β = 0.21, P = 6.1×10−6 or 52.23 ng/mL per allele, Table S12). D-dimers are products of fibrin degradation and their concentration in the blood flow is clinically used to monitor thrombotic activity. The role of FGL1 in clot formation remains unclear: FGL1 is homologous with fibrinogen but lacks the essential structures for fibrin formation, although one study suggests its presence in fibrin clots [24]. In addition, given prior links between variants associated with D-dimer levels and stroke, we utilized the same Mendelian randomization approach as for LPA above and found a nominally significant association between FGL1 c.545_546insA and increased risk of ischemic stroke (OR = 1.32, P = 0.024). If replicated, this would be consistent with the modest increase in stroke risk reported for other variants associated with circulating D-dimer levels, such as variants in coagulation Factor V, Factor III and FGA [25]. We found suggestive associations for the c.637-1G>A splice variant in the membrane-spanning 4-domains, subfamily A, member 2 gene (MS4A2) with triglycerides (Pdiscovery = 7.80×10−5, Pdiscovery+replication = 1.31×10−6, β = 0.14 or 0.14 mmol/L per allele, Table S13).
This observation is consistent with our previously published study of 631 individuals in the DILGOM subset of FINRISK showing that whole blood expression of MS4A2 was strongly negatively associated with total triglycerides (β = −1.62, P = 2.1×10−27, Fig. S6) [26] and a wide range of systemic metabolic traits [27]. A similar but non-significant trend was observed in 15,696 individuals from the D2D2007, DPS, FUSION, METSIM and DRSEXTRA cohorts (β = 0.04, P = 0.32). The MS4A2 gene encodes the β-subunit of the high affinity IgE receptor, a key mediator of the acute phase inflammatory response. The c.2482-2A>C splice variant in the ATPase Ca++ Transporting Type 2C Member 2 gene (ATP2C2) was associated with increased systolic blood pressure (Pdiscovery = 1.25×10−5, Pdiscovery+replication = 1.3×10−6, β = 0.12 or 2.13 mmHg per allele, Table S14), an association that is undisturbed by correction for lipid-lowering medication (β = 0.12, P = 1.75×10−5) or blood-pressure-lowering medication (β = 0.13, P = 1.3×10−5). Based on its structure, ATP2C2 is predicted to catalyze the hydrolysis of ATP coupled with calcium transport. Interestingly, the ATP2C2 c.2482-2A>C variant is also significantly associated with several highly correlated immune markers, such as granulocyte colony-stimulating factor (β = 0.26, P = 6.98×10−7), interleukin-4 (β = 0.27, P = 2.48×10−6), interferon-γ (β = 0.26, P = 3.24×10−6) and interleukin-6 (β = 0.25, P = 4.58×10−6).

Discussion

The empirical data of this study shed light on an active debate in population genetics theory over whether or not bottlenecked populations have an excess burden of deleterious alleles. Lohmueller et al. first observed that there were proportionally more deleterious variants in European American individuals compared to African American individuals [28].
They performed a series of forward simulations to demonstrate that such an observation is consistent with an Out-of-Africa bottleneck experienced by the European populations from which the European-American individuals descend, and illustrated that bottlenecked populations are likely to accumulate a higher proportion of deleterious alleles. A recent study by Simons et al. showed conflicting results suggesting that there are similar burdens of deleterious alleles in Europeans and West Africans and that demography is unlikely to contribute to the proportions of deleterious alleles in human populations [29]. The comparison of Finns, with a well-documented bottleneck, with non-Finnish Europeans here provides strong empirical data on these questions. While the distribution of common alleles, both synonymous and non-synonymous, is as expected unchanged by the bottleneck, when exploring the rare and low-frequency allelic spectrum where the Finns and NFEs demonstrate distinct distributions, we indeed observe a significant excess of deleterious variants in the Finns – despite the considerable deficit in variable sites in the population overall. This suggests that negative selection has had insufficient time to suppress the frequency of deleterious alleles dramatically elevated in frequency through the founding bottleneck, an observation that generalizes the intuitive understanding of the existence of characteristic and unusually common Mendelian recessive disorders in Finland. However, we note that while we observe a strong influence of the founding bottleneck, the observed results, particularly the proportional enrichment of rare deleterious variants, are also influenced by other elements in the unique history of the Finnish population and will not necessarily apply to all populations influenced by a bottleneck. 
This excess of presumably deleterious variants motivated the subsequent association study and indeed, the absence of homozygotes at TSFM (contemporaneously identified as an early-onset mitochondrial disease gene) suggests that low-frequency variants in Finns, beyond those already identified in Mendelian disease, do include more unusually strong acting alleles than in non-founder populations. In this study, both replicated results and novel associations demonstrate the association of low-frequency LoF variants with various complex traits and diseases. In addition, we discovered a novel cardiovascular protective effect from splice variants in the LPA gene, suggesting that knocking down levels of circulating lipoprotein(a), or Lp(a), can confer a protection from cardiovascular diseases. Given that we detected numerous individuals in these adult population cohorts, healthy and in the expected Hardy-Weinberg proportions, carrying a complete knockout of LPA (homozygous or compound heterozygous for the 2 splice variants), this suggests that knocking out the gene in humans does not result in severe medical consequences. As such, this study provides data suggesting that LPA may be an effective target for therapeutic purposes. As more Finnish samples are being sequenced, these enriched variants can also be imputed with high precision to the large number of existing samples with array-based GWAS genotypes. This advantage is likely to be more pronounced for the much larger pool of missense variation – while one can presume all LoF variants in a gene might have a comparable effect on phenotype (and thereby burden tests of LoF variants in an out-bred sample is not at a great disadvantage compared to isolated populations), it is evident that many rare missense variants within the same gene will not all have the same impact on gene function. 
Thus the ability to assess single low-frequency variants conclusively, especially since they will include an excess of damaging variants enriched through a bottleneck, rather than perform burden tests on heterogeneous sets of extremely rare variants, will offer substantial ongoing advantage to isolated population studies as indicated by these and other recent findings.

Materials and Methods

All research involving human participants has been approved by the Hospital District of Helsinki and Uusimaa Coordinating Ethical Committee, and all clinical investigation was conducted according to the principles expressed in the Declaration of Helsinki.

Exome sequencing quality control, annotation and filtering

Raw Binary Sequence Alignment/Map (BAM) files from the various projects were jointly processed at the Broad Institute and joint variant calling was performed on all exomes to minimize batch differences. Functional annotation was performed using the Variant Effect Predictor (VEP v2.5) tool from Ensembl (http://useast.ensembl.org/info/docs/tools/vep/). We modified it to produce custom annotation tags and additional loss-of-function annotations. The additional annotations were applied to variants that were annotated as STOP_GAINED, SPLICE_DONOR_VARIANT, SPLICE_ACCEPTOR_VARIANT, and FRAME_SHIFT and the variants were flagged if any filters failed. A loss-of-function variant was predicted as high confidence if there is one transcript that passes all filters, otherwise it is predicted as low confidence. In our genotyping study, we had used loss-of-function variants that were predicted to be high confidence. For quality control, we required all variants to pass the basic GATK filters and required all genotypes to have a quality score of ≥30, read depth of ≥10 and allele balance of between 0.3 and 0.7 for heterozygous calls and 1×10−5).
The high concordance between the allele frequencies in two independent NFE datasets ensures that the variants are unlikely to arise from alignment or sequencing artifacts and that these variants are unlikely to reside in a region of the exome that is difficult to sequence or genotype, which can result in highly variable allele frequencies from different experiments.

Sequenom genotyping

Genotyping was performed using the iPLEX Gold Assay (Sequenom Inc.). Assays for all SNPs were designed using the eXTEND suite and MassARRAY Assay Design software version 3.1 (Sequenom Inc.). Amplification was performed in a total volume of 5 µL containing ∼10 ng genomic DNA, 100 nM of each PCR primer, 500 µM of each dNTP, 1.25× PCR buffer (Qiagen), 1.625 mM MgCl2 and 1 U HotStar Taq (Qiagen). Reactions were heated to 94°C for 15 min followed by 45 cycles at 94°C for 20 s, 56°C for 30 s and 72°C for 1 min, then a final extension at 72°C for 3 min. Unincorporated dNTPs were SAP digested prior to iPLEX Gold allele specific extension with mass-modified ddNTPs using an iPLEX Gold reagent kit (Sequenom Inc.). SAP digestion and extension were performed according to the manufacturer's instructions with reaction extension primer concentrations adjusted to between 0.7–1.8 µM, dependent upon primer mass. Extension products were desalted and dispensed onto a SpectroCHIP using a MassARRAY Nanodispenser prior to MALDI-TOF analysis with a MassARRAY Analyzer Compact mass spectrometer. Genotypes were automatically assigned and manually confirmed using MassARRAY TyperAnalyzer software version 4.0 (Sequenom Inc.). The genotyped variants were then checked for concordance in allele frequencies with the exome sequencing data.
Phenotyping

Data on disease status from National Health registers (Hospital Discharge Registers maintained by THL (Institute for Health and Welfare, Finland), Cause of Death Register, Statistics Finland and Prescription Medication Register, THL) for FINRISK, Health2000 and the Young Finns Study participants of this study were collected and curated. A description of each cohort is provided in the Supplement.

Analyses of RNA sequencing data

To analyze the effects of the LoF variants on gene expression, we used RNA sequencing data from two major studies: the GEUVADIS project [30] (with RNA sequencing data from lymphoblastoid cell lines of 462 participants from the 1000 Genomes Project [31]), and the GTEx project with RNA-sequencing data from a total of 175 individuals with 1–30 tissues each (http://www.broadinstitute.org/gtex/) [32]. The processing of the GEUVADIS data and the methods for allele-specific expression analysis are described in Lappalainen et al. [30] and the GTEx data were analyzed using similar methods. Allele-specific expression analysis was used primarily to capture nonsense-mediated decay. Additionally, to assess whether LoF variants lead to decreased exon expression levels overall or for individual exons, we calculated an empirical p-value for each exon of all the LoF genes with respect to all other exons genome-wide, denoting the proportion of all exons where carriers of the LoF variants are more extreme than in each studied exon in LoF variant genes. The analyses were performed separately in each studied tissue: lymphoblastoid cell lines from the GEUVADIS data and nine tissues from the GTEx data. The significance threshold after correcting for the total number of tested exons across all tissues is 0.05/1070 = 4.67×10−5.

Statistical analyses and methods

Inverse rank-based normalization was performed on the quantitative measurements in males and females separately, with linear regression residuals using age and age² as covariates.
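The inverse rank-based normalization step described above can be sketched as follows. This is a minimal illustration only: the source does not state which rank offset or tie-handling rule was used, so the offset of 0.5 and the averaging of tied ranks are assumptions, as is the function name. The regression on age and age² and the split by sex are left out for brevity.

```python
from statistics import NormalDist


def inverse_rank_normalize(values, offset=0.5):
    """Map values to standard-normal Z-scores via their ranks.

    Assumptions (not from the source): tied values share their average rank,
    and rank r out of n is mapped to the quantile (r - offset) / n.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied values starting at sorted position i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based average rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    standard_normal = NormalDist()
    return [standard_normal.inv_cdf((r - offset) / n) for r in ranks]


# Hypothetical measurements for one sex; each sex would be transformed separately.
z_scores = inverse_rank_normalize([3.1, 0.2, 9.9, 4.4])
print(z_scores)
```

Because only ranks enter the transform, outliers in the raw measurement cannot dominate the downstream linear regression.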
Linear regression was then performed on the normalized Z-scores using R to obtain the statistics for the associations. We tested the correlations between the quantitative measurements and disease outcomes using two one-tailed t-tests to assess the significance of observing higher levels of the quantitative measurements in cases (individuals with the disease outcomes) versus controls (individuals without the disease outcomes), as well as lower levels of the quantitative measurements in cases versus controls. To test the association of the variants with the prevalent disease outcomes, we performed a logistic regression in R to obtain the reported statistics. In addition, a Fisher's Exact Test on the homozygous counts in cases and controls was performed to test for association with the homozygotes. The results for the LPA with cardiovascular disease association from MIGen ExA and the Estonian Biobank were meta-analyzed using METAL [33] and the combined results with FINRISK were obtained using the Fisher's Combined P method with 4 degrees of freedom.

Associations between MS4A2 c.637-1G>A, gene expression and triglycerides

We fit a linear model in which the log2-normalised gene probe expression of individual i was regressed on the LoF genotype, which was encoded as Xi = 0, 1 or 2 for the LoF genotypes −/−, +/− or +/+ respectively. Association analysis of MS4A2 gene expression and triglycerides was performed as previously reported [26]. Briefly, we used a multivariate linear regression adjusted for age, gender, and use of cholesterol or blood pressure lowering medication. We further tested for association between MS4A2 c.637-1G>A and triglycerides using a 2-sided t-test.

Supporting Information

Figure S1 Ratio of the number of missense variants predicted by PolyPhen2 found in Finns versus NFEs. (A) The ratios for probably damaging missense variants highlighted in red text and the ratios for benign missense variants in black.
The p-values represent the binomial probabilities of the variants being enriched in Finns and similarly, the p-values in red represent the probabilities for the probably damaging missense variants and the p-values in black represent the probabilities for the benign missense variants. (B) Percentage of variants that are missense variants across the allele frequency spectrum. (DOCX)

Figure S2 Allele frequency distribution in 3,000 Finns compared to 3,000 Swedes. The ratios for LoF variants highlighted in red text and the ratios for synonymous variants in black. (DOCX)

Figure S3 Distribution of LoF variants per individual. (A) Number of LoF variants in an average Finn vs NFE individual. (B) Number of homozygous LoF variants in Finns vs NFEs per individual. (DOCX)

Figure S4 Simulations for a set of variants (ranging from 1% to 5% allele frequencies) with complete recessive lethality. The red line indicates the expected allele frequencies in present-day Finns (where the Finnish bottleneck occurred ∼100 generations ago) and the blue line indicates the expected allele frequencies in Finns 1,000 generations after the Finnish bottleneck, similar to the out-of-Africa bottleneck which occurred >1,000 generations ago. (DOCX)

Figure S5 Boxplots for the known and novel associations. (DOCX)

Figure S6 Correlation between triglycerides and MS4A2 gene expression. (DOCX)

Table S1 Exomes collected from ongoing studies. All the Finnish and NFE exome sequences were captured using the Agilent SureSelect v2 kit. The replication data for the LPA variants from the different studies was performed on the exome chip genotyping platform. (XLSX)

Table S2 The number of variants in each category in Finns and NFEs. (XLSX)

Table S3 Allele frequencies of variants discovered from the FinDis database. (XLSX)

Table S4 Final list of variants from Sequenom genotyping in 36,262 Finns. The cohorts used in this study are from FINRISK 1992, FINRISK 1997, FINRISK 2002, FINRISK 2007, Health 2000 and Young Finns studies (83 variants + 3 composite variants). (XLSX)

Table S5 Associations between TSFM Q246X heterozygotes and various disease states, as well as various neurological and muscular diseases from the medical record system (ICD 9/10) with >30 cases. (XLSX)

Table S6 List of 60 blood pressure measures and biochemical assays from plasma/serum of fasting subjects. (XLSX)

Table S7 Correlations between the combined LPA variant and various disease states. The rows with significant correlation between the levels of the biomarker and disease status (P A and various disease states. The rows with significant correlation between the levels of the biomarker and disease status (P C and various disease states. The rows with significant correlation between the levels of the biomarker and disease status (P<1×10−3) are shaded in blue and the rows with significant association (P≤0.05) between the variant and disease status (allelic or homozygous tests) are highlighted in red text. (XLSX)
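The statistical methods above mention Fisher's Combined P method with 4 degrees of freedom. Fisher's method sums −2·ln(p) over k independent p-values and compares the total to a chi-square distribution with 2k degrees of freedom, so 4 degrees of freedom corresponds to combining two p-values. A minimal sketch follows. The function name and the example inputs are hypothetical, and it uses the closed-form chi-square tail for even degrees of freedom rather than any particular statistics package.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution with
    2k degrees of freedom under the null. For even degrees of freedom the upper
    tail has the closed form exp(-X/2) * sum_{i<k} (X/2)^i / i!.
    """
    k = len(p_values)
    half_statistic = -sum(math.log(p) for p in p_values)  # equals X / 2
    tail_sum = sum(half_statistic ** i / math.factorial(i) for i in range(k))
    return math.exp(-half_statistic) * tail_sum


# Two hypothetical p-values (k = 2, hence 4 degrees of freedom).
print(fisher_combined_p([0.05, 0.05]))
```

For two inputs the formula reduces to p1·p2·(1 − ln(p1·p2)), which makes it easy to check by hand.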

Carrier testing for severe childhood recessive diseases by next-generation sequencing.

Callum Bell, Darrell Dinwiddie, Neil A Miller … (2011)

Of 7028 disorders with suspected Mendelian inheritance, 1139 are recessive and have an established molecular basis. Although individually uncommon, Mendelian diseases collectively account for ~20% of infant mortality and ~10% of pediatric hospitalizations. Preconception screening, together with genetic counseling of carriers, has resulted in remarkable declines in the incidence of several severe recessive diseases including Tay-Sachs disease and cystic fibrosis. However, extension of preconception screening to most severe disease genes has hitherto been impractical. Here, we report a preconception carrier screen for 448 severe recessive childhood diseases. Rather than costly, complete sequencing of the human genome, 7717 regions from 437 target genes were enriched by hybrid capture or microdroplet polymerase chain reaction, sequenced by next-generation sequencing (NGS) to a depth of up to 2.7 gigabases, and assessed with stringent bioinformatic filters. At a resultant 160x average target coverage, 93% of nucleotides had at least 20x coverage, and mutation detection/genotyping had ~95% sensitivity and ~100% specificity for substitution, insertion/deletion, splicing, and gross deletion mutations and single-nucleotide polymorphisms. In 104 unrelated DNA samples, the average genomic carrier burden for severe pediatric recessive mutations was 2.8 and ranged from 0 to 7. The distribution of mutations among sequenced samples appeared random. Twenty-seven percent of mutations cited in the literature were found to be common polymorphisms or misannotated, underscoring the need for better mutation databases as part of a comprehensive carrier testing strategy. Given the magnitude of carrier burden and the lower cost of testing compared to treating these conditions, carrier screening by NGS made available to the general population may be an economical way to reduce the incidence of and ameliorate suffering associated with severe recessive childhood disorders.