      MitoScape: A big-data, machine-learning platform for obtaining mitochondrial DNA from next-generation sequencing data

      research-article

          Abstract

          The growing volume of next-generation sequencing (NGS) data presents a unique opportunity to study the combined impact of mitochondrial and nuclear-encoded genetic variation in complex disease. Mitochondrial DNA (mtDNA) variants, and heteroplasmic variants in particular, are critical for determining human disease severity. While there are approaches for obtaining mitochondrial DNA variants from NGS data, these tools do not account for the unique characteristics of mitochondrial genetics and can be inaccurate even for homoplasmic variants. We introduce MitoScape, a novel big-data software platform for extracting mitochondrial DNA sequences from NGS data. MitoScape departs from other algorithms by using machine learning to model the unique characteristics of mitochondrial genetics. We also employ a novel approach of using rho-zero (mitochondrial DNA-depleted) data to model nuclear-encoded mitochondrial sequences. Using gold-standard mitochondrial DNA data, we show that MitoScape produces accurate heteroplasmy estimates. We provide a comprehensive comparison of the most common tools for obtaining mtDNA variants from NGS data and show that MitoScape outperformed the compared tools in every statistical category we examined, including false positives and false negatives. By applying MitoScape to common disease examples, we illustrate how it facilitates important heteroplasmy-disease association discoveries by expanding upon a reported association between hypertrophic cardiomyopathy and mitochondrial haplogroup T in men (adjusted p-value = 0.003). The improved accuracy of mitochondrial DNA variants produced by MitoScape will be instrumental in diagnosing disease in the context of personalized medicine and clinical diagnostics.
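For readers unfamiliar with the term, a heteroplasmy estimate at a single mtDNA position can be thought of as the fraction of reads carrying each non-reference allele. The sketch below illustrates only that arithmetic; it is not MitoScape's implementation. The function names and the 3% and 97% cutoffs are hypothetical choices made for the example.

```python
from collections import Counter

VALID_BASES = {"A", "C", "G", "T"}

def heteroplasmy(base_calls, reference_base):
    """Return {allele: fraction of reads} for each non-reference allele at one position."""
    # Ignore ambiguous calls such as "N" so they do not inflate the depth.
    counts = Counter(b.upper() for b in base_calls if b.upper() in VALID_BASES)
    depth = sum(counts.values())
    if depth == 0:
        return {}
    ref = reference_base.upper()
    return {allele: n / depth for allele, n in counts.items() if allele != ref}

def classify(fraction, min_fraction=0.03, max_fraction=0.97):
    """Label an allele fraction; both cutoffs are illustrative assumptions."""
    if fraction < min_fraction:
        return "below threshold"
    if fraction > max_fraction:
        return "homoplasmic"
    return "heteroplasmic"

# Example pileup: 90 reads support the reference base A, 10 support G.
calls = ["A"] * 90 + ["G"] * 10
for allele, fraction in heteroplasmy(calls, "A").items():
    print(allele, fraction, classify(fraction))
```

In practice the reads entering such a pileup must first be separated from nuclear look-alike sequences, which is the problem the abstract says MitoScape addresses with machine learning.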

          Author summary

          Recent studies have highlighted the importance of mitochondrial DNA variation in both primary mitochondrial disease and complex human pathology, including COVID-19 and space-flight stress. The vast amount of existing next-generation sequencing (NGS) data can be leveraged to interrogate both nuclear and mitochondrial DNA (mtDNA) sequence simultaneously, allowing for analysis of the interplay between mitochondrial and nuclear-encoded genes in mitochondrial function. Identifying mtDNA sequence accurately is complicated by the presence of nuclear-encoded mitochondrial sequences (NUMTs), which are homologous to mtDNA. Current software for analyzing mtDNA from NGS data does not accurately model the unique characteristics of mitochondrial genetics. We introduce MitoScape, a novel big-data software platform that models mitochondrial genetics through machine learning to accurately identify mtDNA sequence from NGS data. MitoScape takes advantage of rho-zero cell data to model the characteristics of NUMTs. We show that MitoScape produces more accurate heteroplasmy estimates than published software. We provide an example of applying MitoScape in replicating an association between hypertrophic cardiomyopathy and mitochondrial haplogroup T in men. MitoScape is an important contribution to mitochondrial genomics, providing accurate mtDNA variants and the ability to tailor mtDNA analysis to different population and disease contexts, which is not available in other software.
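The idea of using rho-zero data as labeled NUMT examples can be illustrated with a toy classifier. Because rho-zero cells are depleted of mtDNA, reads from them that resemble mtDNA can be treated as NUMT-derived training examples. The sketch below uses a nearest-centroid rule, which is a deliberately simple stand-in and not the model used by MitoScape. The two per-read features (mismatch counts against the mtDNA and nuclear references) and all numeric values are hypothetical.

```python
from math import dist
from statistics import fmean

def centroid(rows):
    """Mean feature vector of a list of equal-length feature tuples."""
    return tuple(fmean(column) for column in zip(*rows))

def train(mtdna_reads, numt_reads):
    """Build a nearest-centroid model from labeled read features."""
    return {"mtDNA": centroid(mtdna_reads), "NUMT": centroid(numt_reads)}

def predict(model, read):
    """Assign the label whose centroid is closest in Euclidean distance."""
    return min(model, key=lambda label: dist(model[label], read))

# Hypothetical features per read:
# (mismatches vs. mtDNA reference, mismatches vs. nuclear reference)
mtdna_reads = [(0, 9), (1, 8), (0, 10)]
numt_reads = [(7, 0), (8, 1), (9, 0)]  # stand-in for reads from rho-zero cells

model = train(mtdna_reads, numt_reads)
print(predict(model, (1, 7)))
print(predict(model, (8, 1)))
```

A real classifier would need richer features and a held-out evaluation; the point here is only that rho-zero data supplies the negative class.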

                Author and article information

                Contributors
                Roles: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing
                Role: Software
                Role: Investigation
                Role: Formal analysis
                Role: Investigation
                Role: Investigation
                Role: Investigation
                Role: Resources
                Roles: Investigation, Supervision
                Role: Software
                Role: Supervision
                Role: Resources
                Role: Resources
                Roles: Resources, Supervision
                Roles: Formal analysis, Supervision
                Role: Resources
                Roles: Funding acquisition, Resources, Supervision
                Roles: Funding acquisition, Resources
                Role: Editor
                Journal
                Title: PLoS Computational Biology (PLoS Comput Biol)
                Publisher: Public Library of Science (San Francisco, CA USA)
                ISSN: 1553-734X, 1553-7358
                Published: 11 November 2021
                Issue date: November 2021
                Volume: 17
                Issue: 11
                Article: e1009594
                Affiliations
                [1] Center for Mitochondrial and Epigenomic Medicine, Division of Human Genetics, The Children’s Hospital of Philadelphia, Philadelphia, Pennsylvania, United States of America
                [2] Center for Data-Driven Discovery in Biomedicine (D3b), The Children’s Hospital of Philadelphia, Philadelphia, Pennsylvania, United States of America
                [3] Center for Eye Research Australia, Ophthalmology, Department of Surgery, University of Melbourne, Melbourne, Australia
                [4] Department of Surgery, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
                [5] Department of Psychiatry, The Children’s Hospital of Philadelphia and the University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
                [6] 22q and You Center, Division of Human Genetics, The Children’s Hospital of Philadelphia and the University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
                [7] Regeneron Genetics Center, Tarrytown, New York, United States of America
                [8] Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
                [9] Cardiovascular Institute, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
                bioinformatics, GERMANY
                Author notes

                The authors have declared that no competing interests exist.

                Author information
                https://orcid.org/0000-0002-2478-5864
                https://orcid.org/0000-0002-2653-5009
                https://orcid.org/0000-0002-7861-0197
                https://orcid.org/0000-0002-5743-0795
                https://orcid.org/0000-0002-1528-5964
                https://orcid.org/0000-0002-2455-9525
                https://orcid.org/0000-0002-9207-6955
                https://orcid.org/0000-0002-9245-9876
                https://orcid.org/0000-0003-1368-2453
                https://orcid.org/0000-0001-8009-1632
                https://orcid.org/0000-0003-0436-4189
                https://orcid.org/0000-0002-7480-8278
                Article
                Manuscript ID: PCOMPBIOL-D-21-00835
                DOI: 10.1371/journal.pcbi.1009594
                PMCID: PMC8610268
                PMID: 34762648
                166743cd-4dfe-42c1-8913-a6d5348b2bed
                © 2021 Singh et al

                This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

                History
                Received: 5 May 2021
                Accepted: 27 October 2021
                Page count
                Figures: 5, Tables: 2, Pages: 20
                Funding
                Funded by: National Institute of Mental Health (funder-id http://dx.doi.org/10.13039/100000025); Award ID: MH110185
                Funded by: National Institute of Neurological Disorders and Stroke (funder-id http://dx.doi.org/10.13039/100000065); Award ID: NS021328
                Funded by: National Institute of Mental Health (funder-id http://dx.doi.org/10.13039/100000025); Award ID: MH108592
                Funded by: NIH Office of the Director (funder-id http://dx.doi.org/10.13039/100000052); Award ID: OD010944
                This work was supported by grants awarded to SA (National Institute of Mental Health: MH110185) and DCW (National Institute of Neurological Disorders and Stroke: NS021328; National Institute of Mental Health: MH108592; and NIH Office of the Director: OD010944). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
                Categories
                Research Article
                Biology and life sciences > Genetics > DNA > Forms of DNA > Mitochondrial DNA
                Biology and life sciences > Biochemistry > Nucleic acids > DNA > Forms of DNA > Mitochondrial DNA
                Biology and Life Sciences > Genetics > Heredity > Heteroplasmy
                Research and Analysis Methods > Database and Informatics Methods > Bioinformatics > Sequence Analysis > Sequence Alignment
                Biology and Life Sciences > Biochemistry > Bioenergetics > Energy-Producing Organelles > Mitochondria
                Biology and Life Sciences > Cell Biology > Cellular Structures and Organelles > Energy-Producing Organelles > Mitochondria
                Biology and life sciences > Molecular biology > Molecular biology techniques > Sequencing techniques > DNA sequencing > Next-Generation Sequencing
                Research and analysis methods > Molecular biology techniques > Sequencing techniques > DNA sequencing > Next-Generation Sequencing
                Biology and Life Sciences > Computational Biology > Genome Analysis > Transcriptome Analysis > Next-Generation Sequencing
                Biology and Life Sciences > Genetics > Genomics > Genome Analysis > Transcriptome Analysis > Next-Generation Sequencing
                Computer and Information Sciences > Software Engineering > Computer Software
                Engineering and Technology > Software Engineering > Computer Software
                Computer and Information Sciences > Artificial Intelligence > Machine Learning
                Biology and Life Sciences > Evolutionary Biology > Population Genetics > Haplogroups
                Biology and Life Sciences > Genetics > Population Genetics > Haplogroups
                Biology and Life Sciences > Population Biology > Population Genetics > Haplogroups
                Custom metadata
                vor-update-to-uncorrected-proof: 2021-11-23
                Data availability: Data specific to HCM analysis are available from the Penn Medicine Biobank (https://pmbb.med.upenn.edu). All other data, including benchmark data, are available via authorized access from https://cavatica.sbgenomics.com/u/cavatica/22q11-deletion-syndrome-project/.

                Quantitative & Systems biology