Evaluating genotype imputation pipeline for ultra-low coverage ancient genomes


Abstract

Although ancient DNA data have become increasingly important in studies of past populations, it is often not feasible or practical to obtain high-coverage genomes from poorly preserved samples. While accurate genotype imputation from >1× coverage data has recently become routine, a large proportion of ancient samples remain unusable for downstream analyses because of their low coverage. Here, we evaluate a two-step pipeline for the imputation of common variants in ancient genomes at 0.05–1× coverage. We first use the genotype likelihood input mode in Beagle, then filter for confident genotypes and use these as the input to impute the missing genotypes. This procedure, when tested on ancient genomes, outperforms single-step imputation from genotype likelihoods, suggesting that current genotype callers do not fully account for errors in ancient sequences and that additional quality controls can be beneficial. We compare the effects of various genotype likelihood calling methods; post-calling, pre-imputation and post-imputation filters; different reference panels; and different imputation tools. In a Neolithic Hungarian genome, we obtain ~90% imputation accuracy for heterozygous common variants at 0.05× coverage and >97% accuracy at 0.5× coverage. We show that imputation can mitigate, though not eliminate, reference bias in ultra-low coverage ancient genomes.
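The filtering step between the two imputation passes can be illustrated with a minimal Python sketch. It is not the authors' implementation: the 0.99 probability threshold, the function names, and the representation of each site as a triple of genotype probabilities (hom-ref, het, hom-alt) are all assumptions made for illustration, and the concordance function shows only one possible way to score heterozygous sites, not necessarily the accuracy measure used in the article.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical cutoff: the abstract does not state which threshold was used.
GP_THRESHOLD = 0.99

# One site = posterior probabilities for (hom-ref, het, hom-alt).
GenotypeProbs = Tuple[float, float, float]


def filter_confident(
    gps: Sequence[GenotypeProbs], threshold: float = GP_THRESHOLD
) -> List[Optional[int]]:
    """Keep the most probable genotype (0, 1 or 2) at each site only if its
    probability reaches the threshold; otherwise mark the site as missing
    (None) so that a later imputation step can fill it in."""
    calls: List[Optional[int]] = []
    for gp in gps:
        best = max(range(3), key=lambda g: gp[g])
        calls.append(best if gp[best] >= threshold else None)
    return calls


def het_concordance(
    imputed: Sequence[Optional[int]], truth: Sequence[Optional[int]]
) -> float:
    """Among sites that are heterozygous in the truth set and have an imputed
    call, return the fraction imputed as heterozygous (assumed definition)."""
    compared = 0
    matched = 0
    for imp, tru in zip(imputed, truth):
        if tru != 1 or imp is None:
            continue  # only score truly heterozygous sites that were called
        compared += 1
        if imp == 1:
            matched += 1
    if compared == 0:
        raise ValueError("no heterozygous sites to compare")
    return matched / compared


if __name__ == "__main__":
    sites = [(0.999, 0.001, 0.0), (0.40, 0.55, 0.05), (0.0, 0.005, 0.995)]
    print(filter_confident(sites))
    print(het_concordance([1, 1, 0, 2], [1, 1, 1, 0]))
```

In this sketch the second site is set to missing because no genotype reaches the cutoff; in a two-step design, such sites are the ones left for the second imputation pass to resolve from the reference panel.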


Author and article information

Contributors

Toomas Kivisild: toomas.kivisild@kuleuven.be

Journal

Journal ID (nlm-ta): Sci Rep

Journal ID (iso-abbrev): Sci Rep

Title: Scientific Reports

Publisher: Nature Publishing Group UK (London)

ISSN (Electronic): 2045-2322

Publication date (Electronic): 29 October 2020

Publication date PMC-release: 29 October 2020

Publication date Collection: 2020

Volume: 10

Electronic Location Identifier: 18542

Affiliations

[1] GRID grid.5335.0, ISNI 0000 0001 2188 5934, McDonald Institute for Archaeological Research, University of Cambridge, Cambridge, UK

[2] GRID grid.5596.f, ISNI 0000 0001 0668 7884, Department of Human Genetics, Katholieke Universiteit Leuven, Herestraat 49 - box 602, 3000 Leuven, Belgium

[3] GRID grid.5326.2, ISNI 0000 0001 1940 4177, Istituto di Biologia e Patologia Molecolari, Consiglio Nazionale delle Ricerche, Rome, Italy

[4] GRID grid.8217.c, ISNI 0000 0004 1936 9705, Smurfit Institute of Genetics, Trinity College Dublin, Dublin, Ireland

[5] GRID grid.10939.32, ISNI 0000 0001 0943 7661, Estonian Biocentre, Institute of Genomics, University of Tartu, Tartu, Estonia

[6] GRID grid.5335.0, ISNI 0000 0001 2188 5934, St John’s College, St John’s Street, Cambridge, CB2 1TP, UK

Article

Publisher ID: 75387

DOI: 10.1038/s41598-020-75387-w

PMC ID: 7596702

PubMed ID: 33122697

SO-VID: 30597677-76aa-4ef8-bc3d-fb3e4e648012

License:

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

History

Date received : 13 May 2020

Date accepted : 12 October 2020

Funding

Funded by: Wellcome Trust

Award ID: 2000368/Z/15/Z

Funded by: Sapienza Università di Roma

Custom metadata

ScienceOpen disciplines: Uncategorized

Keywords: anthropology, archaeology, evolutionary genetics, population genetics

