
      Evaluating genotype imputation pipeline for ultra-low coverage ancient genomes

      research-article


          Abstract

          Although ancient DNA data have become increasingly important in studies of past populations, it is often not feasible or practical to obtain high-coverage genomes from poorly preserved samples. While accurate genotype imputation from >1× coverage data has recently become routine, a large proportion of ancient samples remain unusable for downstream analyses because of their low coverage. Here, we evaluate a two-step pipeline for the imputation of common variants in ancient genomes at 0.05–1× coverage. We use the genotype likelihood input mode in Beagle and filter for confident genotypes as the input to impute missing genotypes. This procedure, when tested on ancient genomes, outperforms a single-step imputation from genotype likelihoods, suggesting that current genotype callers do not fully account for errors in ancient sequences and that additional quality controls can be beneficial. We compared the effects of various genotype likelihood calling methods, post-calling, pre-imputation and post-imputation filters, different reference panels, and different imputation tools. In a Neolithic Hungarian genome, we obtain ~90% imputation accuracy for heterozygous common variants at coverage 0.05× and >97% accuracy at coverage 0.5×. We show that imputation can mitigate, though not eliminate, reference bias in ultra-low coverage ancient genomes.
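          The filtering step between the two imputation passes, and the heterozygous-site accuracy quoted above, can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes that the first Beagle pass yields, for each site, three genotype probabilities (homozygous reference, heterozygous, homozygous alternate), that a genotype counts as "confident" when its largest probability reaches a threshold, and that accuracy is the fraction of truly heterozygous sites whose imputed genotype is also heterozygous. The function names, the 0.99 threshold and the accuracy definition are all assumptions made for illustration; the abstract does not specify them.

```python
from typing import List, Optional, Sequence, Tuple

# Genotype probabilities for one site: P(hom-ref), P(het), P(hom-alt).
GP = Tuple[float, float, float]


def call_confident(gps: Sequence[Optional[GP]],
                   min_gp: float = 0.99) -> List[Optional[int]]:
    """Keep only confident genotype calls; everything else becomes missing.

    Genotypes are coded as alternate-allele counts (0, 1, 2). Sites with no
    probabilities, or whose best probability is below `min_gp`, are returned
    as None so that a second imputation pass can fill them in.
    `min_gp` is a hypothetical threshold, not a value taken from the paper.
    """
    calls: List[Optional[int]] = []
    for gp in gps:
        if gp is None:
            calls.append(None)
            continue
        best = max(range(3), key=lambda g: gp[g])
        calls.append(best if gp[best] >= min_gp else None)
    return calls


def het_concordance(imputed: Sequence[Optional[int]],
                    truth: Sequence[int]) -> Optional[float]:
    """Fraction of truly heterozygous sites that were imputed as heterozygous.

    `truth` would come from a high-coverage version of the same genome.
    Sites left missing after imputation are skipped. Returns None when no
    heterozygous site can be compared.
    """
    pairs = [(i, t) for i, t in zip(imputed, truth)
             if t == 1 and i is not None]
    if not pairs:
        return None
    return sum(i == t for i, t in pairs) / len(pairs)


if __name__ == "__main__":
    first_pass = [(0.995, 0.004, 0.001), (0.60, 0.39, 0.01),
                  (0.0, 0.999, 0.001), None]
    print(call_confident(first_pass))
    print(het_concordance([1, 0, None, 1], [1, 1, 1, 0]))
```

          In a real workflow the probabilities would be read from the first-pass VCF and the filtered calls written back as hard genotypes for the second pass; the sketch only shows the decision rule on in-memory values.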


                Author and article information

                Contributors
                toomas.kivisild@kuleuven.be
                Journal
                Scientific Reports (Sci Rep)
                Publisher: Nature Publishing Group UK (London)
                ISSN: 2045-2322
                Published: 29 October 2020
                Volume: 10
                Article number: 18542
                Affiliations
                [1] McDonald Institute for Archaeological Research, University of Cambridge, Cambridge, UK (GRID grid.5335.0; ISNI 0000 0001 2188 5934)
                [2] Department of Human Genetics, Katholieke Universiteit Leuven, Herestraat 49 - box 602, 3000 Leuven, Belgium (GRID grid.5596.f; ISNI 0000 0001 0668 7884)
                [3] Istituto di Biologia e Patologia Molecolari, Consiglio Nazionale delle Ricerche, Rome, Italy (GRID grid.5326.2; ISNI 0000 0001 1940 4177)
                [4] Smurfit Institute of Genetics, Trinity College Dublin, Dublin, Ireland (GRID grid.8217.c; ISNI 0000 0004 1936 9705)
                [5] Estonian Biocentre, Institute of Genomics, University of Tartu, Tartu, Estonia (GRID grid.10939.32; ISNI 0000 0001 0943 7661)
                [6] St John’s College, St John’s Street, Cambridge, CB2 1TP, UK (GRID grid.5335.0; ISNI 0000 0001 2188 5934)
                Article
                Publisher ID: 75387
                DOI: 10.1038/s41598-020-75387-w
                PMCID: PMC7596702
                PMID: 33122697
                30597677-76aa-4ef8-bc3d-fb3e4e648012
                © The Author(s) 2020

                Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

                History
                Received: 13 May 2020
                Accepted: 12 October 2020
                Funding
                Funded by: Wellcome Trust
                Award ID: 2000368/Z/15/Z
                Funded by: Sapienza Università di Roma
                Categories
                Article
                Uncategorized
                anthropology, archaeology, evolutionary genetics, population genetics
