      Comparison of multiple imputation algorithms and verification using whole-genome sequencing in the CMUH genetic biobank

          Abstract

          A genome-wide association study (GWAS) can be conducted to systematically analyze the contributions of genetic factors to a wide variety of complex diseases. Nevertheless, existing GWASs have provided highly ethnicity-specific data. Accordingly, to provide data specific to Taiwan, we established a large-scale genetic database in a single medical institution, the China Medical University Hospital (CMUH). Given current technological limitations, microarray analysis can detect only a limited number of single-nucleotide polymorphisms (SNPs) with a minor allele frequency of >1%. Imputation, however, is a useful means of expanding such data. In this study, we compared four imputation algorithms in terms of various metrics. Among the compared algorithms, Beagle5.2 achieved the fastest calculation speed, the smallest storage space, the highest specificity, and the highest number of high-quality variants. Using Beagle5.2, we obtained 15,277,414 high-quality variants in 175,871 people. In our internal verification process, Beagle5.2 exhibited an accuracy rate of up to 98.75%. In our external verification, the imputed variants had a 79.91% mapping rate and 90.41% accuracy. These results will be combined with clinical data in future research. We have made the results available for researchers to use in formulating imputation algorithms, in addition to establishing a complete SNP database for GWAS and polygenic risk score (PRS) researchers. We believe that these data can help improve overall medical capabilities, particularly precision medicine, in Taiwan.
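          The abstract reports a mapping rate and an accuracy for the external verification but does not define them here. The following is a minimal sketch of one plausible reading, not the authors' pipeline. It assumes that the mapping rate is the share of imputed variants that also appear in the whole-genome sequencing (WGS) call set, and that accuracy is the share of those shared variants whose unphased genotypes agree. The function name, the variant key layout, and the toy data are all hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical variant key: (chromosome, position, reference allele, alternate allele).
Variant = Tuple[str, int, str, str]
# Hypothetical genotype encoding: allele indices, e.g. (0, 1) for a heterozygote.
Genotype = Tuple[int, int]


def mapping_rate_and_accuracy(
    imputed: Dict[Variant, Genotype],
    sequenced: Dict[Variant, Genotype],
) -> Tuple[float, float]:
    """Return (mapping_rate, accuracy) for one sample.

    Assumed definitions (not taken from the article):
      mapping_rate = shared variants / imputed variants
      accuracy     = shared variants with matching genotype / shared variants
    """
    if not imputed:
        raise ValueError("no imputed variants to evaluate")

    # Variants present in both the imputed set and the WGS call set.
    shared = [v for v in imputed if v in sequenced]
    mapping_rate = len(shared) / len(imputed)
    if not shared:
        return mapping_rate, float("nan")

    # Compare genotypes without regard to phase: (0, 1) equals (1, 0).
    matches = sum(
        sorted(imputed[v]) == sorted(sequenced[v]) for v in shared
    )
    return mapping_rate, matches / len(shared)


# Toy data for illustration only.
imputed_calls = {
    ("1", 1000, "A", "G"): (0, 1),
    ("1", 2000, "C", "T"): (1, 1),
    ("2", 3000, "G", "A"): (0, 0),
    ("2", 4000, "T", "C"): (0, 1),  # absent from the WGS calls
}
wgs_calls = {
    ("1", 1000, "A", "G"): (1, 0),  # same genotype, different phase
    ("1", 2000, "C", "T"): (0, 1),  # discordant
    ("2", 3000, "G", "A"): (0, 0),
}

rate, acc = mapping_rate_and_accuracy(imputed_calls, wgs_calls)
print(f"mapping rate = {rate:.2%}, accuracy = {acc:.2%}")
```

          In a real comparison the genotypes would be read from the imputed and WGS VCF files, and both sets would first need the same reference build and allele normalization. Those steps are omitted from this sketch.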

                Author and article information

                Journal
                Biomedicine (Taipei)
                BioMedicine
                Publisher: China Medical University
                ISSN: 2211-8020, 2211-8039
                Published: 01 December 2021
                Volume: 11
                Issue: 4
                Pages: 57-65
                Affiliations
                [a] Center for Precision Medicine, China Medical University Hospital, Taichung, 40447, Taiwan
                [b] Artificial Intelligence Center for Medical Diagnosis, China Medical University Hospital, Taichung, 40447, Taiwan
                [c] Department of Medicine, China Medical University, Taichung, Taiwan
                [d] Department of Neurology, China Medical University Hospital, Taichung, Taiwan
                [e] Epigenome Research Center, China Medical University Hospital, Taichung, 40447, Taiwan
                [f] Million-person Precision Medicine Initiative, China Medical University Hospital, Taichung, 40447, Taiwan
                [g] Department of Medical Research, China Medical University Hospital, Taichung, 40402, Taiwan
                [h] School of Chinese Medicine, China Medical University, Taichung, 40402, Taiwan
                [i] Division of Pediatric Genetics, Children’s Hospital of China Medical University, Taichung, 40447, Taiwan
                [j] Department of Biotechnology and Bioinformatics, Asia University, Taichung, 41354, Taiwan
                Author notes
                [*] Corresponding author at: Department of Medical Research, No. 2, Yude Road, North District, Taichung City, 40447, Taiwan, ROC.
                [**] Corresponding author at: Artificial Intelligence Center for Medical Diagnosis, No. 2, Yude Road, North District, Taichung City, 40447, Taiwan. E-mail addresses: D35842@mail.cmuh.org.tw (K.-C. Hsu), d0704@mail.cmuh.org.tw (F.-J. Tsai).
                Article
                Publisher ID: bmed-11-04-057
                DOI: 10.37796/2211-8039.1302
                PMCID: PMC8823485
                PMID: 35223420
                7cc051d6-8962-495f-bf7b-61f991e9e8e5
                © the Author(s)

                This is an open access article under the CC BY license ( http://creativecommons.org/licenses/by/4.0/).

                History
                : 06 March 2021
                : 09 March 2021
                : 08 April 2021
                Categories
                Original Article

                Keywords: imputation, SNP array, whole-genome sequencing, CMUH genetic biobank
