
      Interpretable genotype-to-phenotype classifiers with performance guarantees

      research-article


          Abstract

          Understanding the relationship between the genome of a cell and its phenotype is a central problem in precision medicine. Nonetheless, genotype-to-phenotype prediction comes with great challenges for machine learning algorithms that limit their use in this setting. The high dimensionality of the data tends to hinder generalization and challenges the scalability of most learning algorithms. Additionally, most algorithms produce models that are complex and difficult to interpret. We alleviate these limitations by proposing strong performance guarantees, based on sample compression theory, for rule-based learning algorithms that produce highly interpretable models. We show that these guarantees can be leveraged to accelerate learning and improve model interpretability. Our approach is validated through an application to the genomic prediction of antimicrobial resistance, an important public health concern. Highly accurate models were obtained for 12 species and 56 antibiotics, and their interpretation revealed known resistance mechanisms, as well as some potentially new ones. An open-source disk-based implementation that is both memory and computationally efficient is provided with this work. The implementation is turnkey, requires no prior knowledge of machine learning, and is complemented by comprehensive tutorials.
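          As a rough illustration of the kind of guarantee that sample compression theory can provide, the sketch below evaluates the classical realizable-case compression bound for a fixed compression-set size. It is not the bound derived in the article. The function name `compression_bound` and the example values of `m`, `d`, and `delta` are assumptions chosen for illustration only.

```python
import math


def compression_bound(m: int, d: int, delta: float = 0.05) -> float:
    """Classical realizable-case sample compression bound (illustrative only).

    For a fixed compression-set size d, with probability at least 1 - delta
    over the draw of m i.i.d. examples, every classifier that can be
    reconstructed from d of the examples alone and that makes no error on the
    remaining m - d examples has true error at most
        (ln C(m, d) + ln(1 / delta)) / (m - d).
    """
    if not 0 <= d < m:
        raise ValueError("need 0 <= d < m")
    if not 0.0 < delta < 1.0:
        raise ValueError("need 0 < delta < 1")
    bound = (math.log(math.comb(m, d)) + math.log(1.0 / delta)) / (m - d)
    # A risk bound above 1 carries no information, so clip it.
    return min(1.0, bound)


if __name__ == "__main__":
    # Hypothetical setting: 500 training genomes, models built from few examples.
    for d in (1, 3, 10):
        print(d, round(compression_bound(m=500, d=d), 4))
```

          In this simple bound, a smaller compression set gives a tighter guarantee for the same number of training examples, which is one way to see why sparse, rule-based models and generalization guarantees can go together.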


          Most cited references (30)


          Chapter 11: Genome-Wide Association Studies

          Genome-wide association studies (GWAS) have evolved over the last ten years into a powerful tool for investigating the genetic architecture of human disease. In this work, we review the key concepts underlying GWAS, including the architecture of common diseases, the structure of common human genetic variation, technologies for capturing genetic information, study designs, and the statistical methods used for data analysis. We also look forward to the future beyond GWAS.

            Methods of integrating data to uncover genotype-phenotype interactions.

            Recent technological advances have expanded the breadth of available omic data, from whole-genome sequencing data, to extensive transcriptomic, methylomic and metabolomic data. A key goal of analyses of these data is the identification of effective models that predict phenotypic traits and outcomes, elucidating important biomarkers and generating important insights into the genetic underpinnings of the heritability of complex traits. There is still a need for powerful and advanced analysis strategies to fully harness the utility of these comprehensive high-throughput data, identifying true associations and reducing the number of false associations. In this Review, we explore the emerging approaches for data integration - including meta-dimensional and multi-staged analyses - which aim to deepen our understanding of the role of genetics and genomics in complex outcomes. With the use and further development of these approaches, an improved understanding of the relationship between genomic variation and human phenotypes may be revealed.

              Alignment-free sequence comparison: a review.

              Genetic recombination and, in particular, genetic shuffling are at odds with sequence comparison by alignment, which assumes conservation of contiguity between homologous segments. A variety of theoretical foundations are being used to derive alignment-free methods that overcome this limitation. The formulation of alternative metrics for dissimilarity between sequences and their algorithmic implementations are reviewed. The overwhelming majority of work on alignment-free sequence comparison has taken place in the past two decades, with most reports published in the past 5 years. Two main categories of methods have been proposed: methods based on word (oligomer) frequency, and methods that do not require resolving the sequence with fixed word length segments. The first category is based on the statistics of word frequency, on the distances defined in a Cartesian space defined by the frequency vectors, and on the information content of frequency distribution. The second category includes the use of Kolmogorov complexity and Chaos Theory. Despite their low visibility, alignment-free metrics are in fact already widely used as pre-selection filters for alignment-based querying of large applications. Recent work is furthering their usage as a scale-independent methodology that is capable of recognizing homology when loss of contiguity is beyond the possibility of alignment. Most of the alignment-free algorithms reviewed were implemented in MATLAB code and are available at http://bioinformatics.musc.edu/resources.html
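              The word-frequency category described above can be sketched in a few lines: count every overlapping word of length k, normalize the counts into a frequency vector, and compare two sequences by the Euclidean distance between their vectors. This is a minimal sketch in Python, not the MATLAB code the review refers to. The function names, the default k = 2, and the DNA alphabet are assumptions made for illustration.

```python
from collections import Counter
from itertools import product
import math


def word_frequencies(seq: str, k: int = 2, alphabet: str = "ACGT") -> list[float]:
    """Relative frequency of every length-k word, counted over overlapping windows."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Fixed word order so that vectors from different sequences are comparable.
    words = ["".join(p) for p in product(alphabet, repeat=k)]
    # Windows containing characters outside the alphabet are ignored.
    total = sum(counts[w] for w in words)
    if total == 0:
        return [0.0] * len(words)
    return [counts[w] / total for w in words]


def euclidean_distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two frequency vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


if __name__ == "__main__":
    s1 = "ACGTACGTTTGACCA"
    s2 = "TTGACCAACGTACGT"  # same two segments as s1, in swapped order
    print(euclidean_distance(word_frequencies(s1), word_frequencies(s2)))
```

              Because only word counts enter the comparison, rearranging segments changes the vector only through the few words that span the segment junctions, which is why measures of this kind tolerate shuffling better than alignment does.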

                Author and article information

                Contributors
                alexandre.drouin.8@ulaval.ca
                Journal
                Sci Rep (Scientific Reports)
                Nature Publishing Group UK (London)
                ISSN: 2045-2322
                Published: 11 March 2019
                Volume: 9
                Article number: 4071
                Affiliations
                [1] ISNI 0000 0004 1936 8390, GRID grid.23856.3a, Department of Computer Science and Software Engineering, Université Laval, Quebec, Canada
                [2] ISNI 0000 0004 1936 8390, GRID grid.23856.3a, Big Data Research Centre, Université Laval, Quebec, Canada
                [3] ISNI 0000 0004 1936 8390, GRID grid.23856.3a, School of Nutrition, Université Laval, Quebec, Canada
                [4] ISNI 0000 0004 1936 8390, GRID grid.23856.3a, Institute of Nutrition and Functional Foods, Université Laval, Quebec, Canada
                [5] ISNI 0000 0004 1936 8390, GRID grid.23856.3a, Infectious Disease Research Centre, Université Laval, Quebec, Canada
                Author information
                http://orcid.org/0000-0001-7718-0319
                http://orcid.org/0000-0002-3606-4060
                http://orcid.org/0000-0002-9973-2740
                Article
                40561
                DOI: 10.1038/s41598-019-40561-2
                PMCID: PMC6411721
                PMID: 30858411
                e92e5f2d-24fa-4887-a191-5531e8b7e187
                © The Author(s) 2019

                Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

                History
                Received: 5 October 2018
                Accepted: 19 February 2019
                Funding
                Funded by: FundRef https://doi.org/10.13039/501100000038, Gouvernement du Canada | Natural Sciences and Engineering Research Council of Canada (Conseil de Recherches en Sciences Naturelles et en Génie du Canada);
                Award ID: 556037
                Award ID: RGPIN-2016-05942
                Award ID: 262067
                Funded by: FundRef https://doi.org/10.13039/501100001804, Canada Research Chairs (Chaires de recherche du Canada);
                Award ID: Canada Research Excellence Chair in the Microbiome-Endocannabinoidome Axis in Metabolic Health
                Award ID: Canada Research Chair in Medical Genomics - Tier 1