Annotating functional effects of non-coding variants in neuropsychiatric cell types by deep transfer learning

Abstract

Genome-wide association studies (GWAS) have identified a large number of loci associated with neuropsychiatric traits; however, understanding the molecular mechanisms underlying these loci remains difficult. To help prioritize causal variants and interpret their functions, computational methods have been developed to predict the regulatory effects of non-coding variants. An emerging approach to variant annotation is to use deep learning models that predict regulatory functions from DNA sequences alone. While such models have been trained on large publicly available datasets such as ENCODE, cell types related to neuropsychiatric traits are under-represented in these datasets, so there is an urgent need for better tools and resources to annotate variant functions in such cellular contexts. To fill this gap, we assembled data from a large collection of neurodevelopment-related cell and tissue types and trained deep convolutional neural networks (ResNet) on these data. Furthermore, our model, called MetaChrom, borrows information from public epigenomic consortium data to improve accuracy via transfer learning. We show that MetaChrom is substantially better at predicting experimentally determined chromatin accessibility variants than popular variant annotation tools such as CADD and delta-SVM. By combining GWAS data with MetaChrom predictions, we prioritized 31 SNPs for schizophrenia, suggesting potential risk genes and the biological contexts in which they act. In summary, MetaChrom provides functional annotations of any DNA variant in the neurodevelopmental context, and the general method of MetaChrom can also be extended to other disease-related cell or tissue types.
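To illustrate what predicting the regulatory effect of a variant from sequence alone can look like, the following minimal sketch scores a reference and an alternate allele with the same model and reports the difference. The model is a hypothetical toy (a single motif scanner with made-up weights), not MetaChrom's ResNet; the names `toy_accessibility_model`, `variant_effect`, and `MOTIF_WEIGHTS` are assumptions introduced only for illustration.

```python
import math

BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a list of 4-element 0/1 rows (A, C, G, T)."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def toy_accessibility_model(encoded, weights, bias=0.0):
    """Hypothetical stand-in for a trained sequence model.

    Applies one position weight matrix at every offset, keeps the best
    match, and squashes it to a value in (0, 1).
    """
    width = len(weights)
    best = -math.inf
    for start in range(len(encoded) - width + 1):
        score = sum(
            encoded[start + i][j] * weights[i][j]
            for i in range(width)
            for j in range(4)
        )
        best = max(best, score)
    return 1.0 / (1.0 + math.exp(-(best + bias)))


def variant_effect(ref_seq, pos, alt_base, model):
    """Return model(alt) - model(ref) for a single-base substitution at pos."""
    alt_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    return model(one_hot(alt_seq)) - model(one_hot(ref_seq))


# Made-up weights that reward the motif "GATA"; purely illustrative.
MOTIF_WEIGHTS = [
    [-1.0, -1.0, 2.0, -1.0],  # prefers G
    [2.0, -1.0, -1.0, -1.0],  # prefers A
    [-1.0, -1.0, -1.0, 2.0],  # prefers T
    [2.0, -1.0, -1.0, -1.0],  # prefers A
]


def model(encoded):
    return toy_accessibility_model(encoded, MOTIF_WEIGHTS, bias=-4.0)


reference = "CCGATACC"
disrupting = variant_effect(reference, 3, "C", model)  # GATA -> GCTA
neutral = variant_effect(reference, 0, "T", model)     # outside the motif
print(f"motif-disrupting variant: {disrupting:+.3f}")
print(f"variant outside motif:    {neutral:+.3f}")
```

In this toy setup the substitution inside the motif lowers the predicted score, while the one outside leaves it unchanged. With a trained sequence model in place of the toy scorer, the same reference-versus-alternate comparison yields a per-variant effect score.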

Author summary

A large number of genetic variants have been statistically associated with the risk of common diseases. However, whether such variants are actual risk variants, and when and where they function, is often unknown. To address this challenge, machine learning methods have been developed to predict functional variants in specific cellular contexts. These methods correlate DNA sequences with their biological functions, e.g. enhancer activities, and can predict the effects of single-base mutations. Nevertheless, the training data used by existing methods often lack neurodevelopment-related cell types, so annotating variant effects in neuropsychiatric genetics remains difficult. In this work, we fill this gap by collecting a large set of regulatory genomic datasets from fetal and adult brain, iPSC-based cellular models, and brain organoids. We trained deep learning models on these data and further improved their performance by borrowing information from large external datasets, a strategy known as transfer learning. Our tool, MetaChrom, is substantially better at predicting experimentally determined regulatory variants than current methods, and helps us identify candidate risk variants for schizophrenia. We believe MetaChrom provides a valuable tool for the neuropsychiatric genetics community, and the software can be of interest to researchers in other fields as well.
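The transfer learning idea mentioned above can be sketched in a few lines. This is one common form of transfer learning (keep previously learned features fixed and fit only a new output layer on a small target dataset); the summary does not specify MetaChrom's exact scheme, so treat it as an assumption. The function `frozen_features` is a hypothetical hand-written stand-in for layers pretrained on a large reference dataset, and the sequences and labels are invented toy data.

```python
import math


def frozen_features(seq):
    """Hypothetical 'pretrained' feature extractor, never updated below.

    Stands in for layers learned on a large external dataset; here it is
    simply the GC fraction and the CpG dinucleotide frequency.
    """
    seq = seq.upper()
    gc = sum(base in "GC" for base in seq) / len(seq)
    cpg = sum(seq[i:i + 2] == "CG" for i in range(len(seq) - 1)) / (len(seq) - 1)
    return [gc, cpg]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_head(sequences, labels, epochs=2000, lr=0.5):
    """Fit only a new logistic output layer on top of the frozen features."""
    feats = [frozen_features(s) for s in sequences]
    weights = [0.0] * len(feats[0])
    bias = 0.0
    n = len(feats)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(feats, labels):
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xj in enumerate(x):
                grad_w[j] += err * xj / n
            grad_b += err / n
        # Batch gradient descent step on the head parameters only.
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias


def predict(seq, weights, bias):
    x = frozen_features(seq)
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Invented toy target dataset: 1 = accessible, 0 = not accessible.
small_target_seqs = ["GCGCGCGC", "CCGGCGCG", "ATATATAT", "TTAATTAA"]
small_target_labels = [1, 1, 0, 0]
w, b = train_head(small_target_seqs, small_target_labels)
print(predict("GGCCGGCC", w, b), predict("ATTATAAT", w, b))
```

Only the small output layer is trained on the target data, which is why this approach can work when the cell type of interest has few labeled examples. In a deep learning framework, the same pattern amounts to reusing pretrained layers and optionally fine-tuning them.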

Author and article information

Contributors

Boqiao Lai:

ORCID: https://orcid.org/0000-0002-4201-6786

Roles: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

Sheng Qian:

ORCID: https://orcid.org/0000-0003-4337-8532

Roles: Data curation, Formal analysis, Investigation, Methodology, Validation, Visualization, Writing – original draft, Writing – review & editing

Hanwei Zhang:

ORCID: https://orcid.org/0000-0001-8490-918X

Roles: Data curation

Siwei Zhang:

ORCID: https://orcid.org/0000-0002-1646-063X

Roles: Data curation

Alena Kozlova:

ORCID: https://orcid.org/0000-0002-7298-7460

Roles: Data curation

Jubao Duan:

ORCID: https://orcid.org/0000-0002-7215-3220

Roles: Data curation

Jinbo Xu:

ORCID: https://orcid.org/0000-0001-7111-4839

Roles: Conceptualization, Funding acquisition, Methodology, Project administration, Resources, Supervision, Writing – original draft

Xin He:

Roles: Conceptualization, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Supervision, Validation, Writing – original draft, Writing – review & editing

Tony Capra:

Roles: Editor

Journal

Journal ID (nlm-ta): PLoS Comput Biol

Journal ID (iso-abbrev): PLoS Comput Biol

Journal ID (publisher-id): plos

Title: PLoS Computational Biology

Publisher: Public Library of Science (San Francisco, CA, USA)

ISSN (Print): 1553-734X

ISSN (Electronic): 1553-7358

Publication date Collection: May 2022

Publication date (Electronic): 16 May 2022

Volume: 18

Issue: 5

Electronic Location Identifier: e1010011

Affiliations

[1] Toyota Technological Institute at Chicago, Chicago, Illinois, United States of America

[2] Department of Human Genetics, University of Chicago, Chicago, Illinois, United States of America

[3] Center for Psychiatric Genetics, NorthShore University HealthSystem, Evanston, Illinois, United States of America

[4] Department of Psychiatry and Behavioral Neuroscience, University of Chicago, Chicago, Illinois, United States of America

University of California San Francisco, UNITED STATES

Author notes

The authors have declared that no competing interests exist.

* E-mail: xinhe@uchicago.edu (XH); j3xu@ttic.edu (JX)

Article

Publisher ID: PCOMPBIOL-D-21-01311

DOI: 10.1371/journal.pcbi.1010011

PMC ID: 9135341

PubMed ID: 35576194

SO-VID: 3bdc9b2c-e169-4914-a0ee-b2b35bce438c

License:

This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

History

Date received: 16 July 2021

Date accepted: 11 March 2022

Page count

Figures: 7, Tables: 0, Pages: 22

Funding

Funded by: National Institute of Mental Health (funder-id http://dx.doi.org/10.13039/100000025)

Award ID: R01MH116281

Award Recipient: Xin He

Funded by: National Institute of Mental Health (funder-id http://dx.doi.org/10.13039/100000025)

Award ID: R01MH110531

Award Recipient: Xin He

Funded by: National Institute of General Medical Sciences (funder-id http://dx.doi.org/10.13039/100000057)

Award ID: R01GM089753

Award Recipient: Jinbo Xu (ORCID: https://orcid.org/0000-0001-7111-4839)

Funded by: University of Chicago Biological Sciences Division

Award ID: BSD 2021-22

Award Recipient: Sheng Qian (ORCID: https://orcid.org/0000-0003-4337-8532)

This research was supported by the National Institutes of Health (https://www.nih.gov/) (R01MH116281 and R01MH110531 to X.H. and R01GM089753 to J.X.) and the University of Chicago Biological Sciences Division (https://biologicalsciences.uchicago.edu/) (BSD 2021-22 to S.Q.). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Custom metadata

PLOS Publication Stage: vor-update-to-uncorrected-proof

Publication Update: 2022-05-26

Data Availability: The ATAC-seq data of the iPSC-derived neurons are publicly available in the Gene Expression Omnibus (GSE129017). The reference epigenomic dataset is available at http://deepsea.princeton.edu. Detailed data sources for the neurodevelopmental model used in our experiments can be found in the supporting files (Table A in S1 Table).
