
      Defining Disease Phenotypes in Primary Care Electronic Health Records by a Machine Learning Approach: A Case Study in Identifying Rheumatoid Arthritis

      research-article


          Abstract

          Objectives

          1) To use a data-driven method to examine which clinical codes (risk factors) for a medical condition in primary care electronic health records (EHRs) can accurately predict a diagnosis of that condition in secondary care EHRs. 2) To develop and validate a disease phenotyping algorithm for rheumatoid arthritis (RA) using primary care EHRs.

          Methods

          This study linked routine primary and secondary care EHRs in Wales, UK. A machine learning based scheme was used to identify patients with rheumatoid arthritis from primary care EHRs via the following steps: i) selection of variables by comparing the relative frequencies of Read codes in the primary care dataset between disease cases and non-disease controls (disease status being based on the secondary care diagnosis); ii) reduction of the predictors/associated variables using a Random Forest method; iii) induction of decision rules from a decision tree model. The proposed method was then extensively validated on an independent dataset, and its performance was compared with that of two existing deterministic algorithms for RA which had been developed using expert clinical knowledge.
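Step i) above can be illustrated with a minimal sketch. It assumes each patient record is simply a set of code strings, and it selects codes whose relative frequency among cases exceeds that among controls by a chosen ratio. The code labels (`RA_DX`, `DMARD`, and so on), the function names, and both thresholds are hypothetical placeholders for illustration; they are not Read codes or parameters taken from the study.

```python
from collections import Counter


def code_frequencies(records):
    """Return the fraction of patients whose record contains each code."""
    counts = Counter()
    for codes in records:
        counts.update(set(codes))  # count each code at most once per patient
    n = len(records)
    return {code: c / n for code, c in counts.items()}


def select_candidate_codes(case_records, control_records,
                           min_ratio=2.0, min_case_freq=0.05):
    """Keep codes that are relatively more frequent in cases than controls.

    min_ratio and min_case_freq are illustrative thresholds (assumptions).
    """
    case_freq = code_frequencies(case_records)
    control_freq = code_frequencies(control_records)
    # Floor avoids division by zero for codes never seen in controls.
    floor = 1.0 / (len(control_records) + 1)
    selected = {}
    for code, f_case in case_freq.items():
        if f_case < min_case_freq:
            continue  # too rare among cases to be informative
        ratio = f_case / max(control_freq.get(code, 0.0), floor)
        if ratio >= min_ratio:
            selected[code] = ratio
    return selected


# Toy data: hypothetical code labels, not real Read codes.
cases = [{"RA_DX", "DMARD"}, {"RA_DX", "NSAID"},
         {"DMARD", "NSAID"}, {"RA_DX", "DMARD"}]
controls = [{"NSAID"}, {"NSAID", "STATIN"}, {"STATIN"}, {"NSAID"}]

selected = select_candidate_codes(cases, controls)
print(sorted(selected))
```

In this toy example the codes that appear mostly in cases are retained, while a code that is common in both groups is dropped. The surviving codes would then feed the variable-reduction and rule-induction steps ii) and iii), which are not shown here.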

          Results

          Primary care EHRs were available for 2,238,360 patients over the age of 16, of whom 20,667 were also linked in the secondary care rheumatology clinical system. In the linked dataset, 900 predictors (out of a total of 43,100 variables) in the primary care record were found more frequently in those with RA than in those without. These variables were reduced to 37 groups of related clinical codes, which were used to develop a decision tree model. The final algorithm identified 8 predictors related to diagnostic codes for RA, medication codes (such as those for disease-modifying anti-rheumatic drugs), and the absence of alternative diagnoses such as psoriatic arthritis. The proposed data-driven method performed as well as the methods based on expert clinical knowledge.
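A comparison of this kind can be sketched by scoring each algorithm's predictions against the secondary care diagnosis as the reference label. The abstract does not state which performance measures were used; sensitivity, specificity and positive predictive value are assumed here as common choices, and the labels below are made-up toy values, not study data.

```python
def confusion_counts(predicted, reference):
    """Count true/false positives and negatives for boolean labels."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    tp = fp = tn = fn = 0
    for p, r in zip(predicted, reference):
        if p and r:
            tp += 1
        elif p and not r:
            fp += 1
        elif not p and r:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def performance(predicted, reference):
    """Return sensitivity, specificity and positive predictive value."""
    tp, fp, tn, fn = confusion_counts(predicted, reference)

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
    }


# Toy labels (assumptions): True means the patient is classified as having RA.
reference = [True, True, True, False, False, False, False, False]
data_driven = [True, True, False, False, False, False, False, True]
expert_rule = [True, True, True, True, False, False, False, True]

data_driven_perf = performance(data_driven, reference)
expert_rule_perf = performance(expert_rule, reference)
print(data_driven_perf)
print(expert_rule_perf)
```

Reporting several measures side by side matters because one algorithm can trade sensitivity for specificity, as the two toy classifiers here do.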

          Conclusion

          Data-driven schemes, such as ensemble machine learning methods, have the potential to identify the most informative predictors in a cost-effective and rapid way, and so to classify rheumatoid arthritis or other complex medical conditions accurately and reliably in primary care EHRs.


                Author and article information

                Contributors
                Role: Editor
                Journal
                PLoS ONE
                Publisher: Public Library of Science (San Francisco, CA, USA)
                ISSN: 1932-6203
                Published: 2 May 2016
                Volume 11, Issue 5, e0154515
                Affiliations
                [1] Institute of Life Science, College of Medicine, Swansea University, Swansea, United Kingdom
                [2] UCL Institute of Health Informatics and Farr Institute of Health Informatics Research, London, United Kingdom
                [3] Institute of Infection, Immunity and Inflammation, University of Glasgow, Glasgow, United Kingdom
                [4] Arthritis Research UK Centre for Epidemiology, Institute of Inflammation and Repair, Faculty of Medical and Human Sciences, Manchester Academic Health Science Centre, University of Manchester, Manchester, United Kingdom
                [5] Arthritis Research UK CREATE Centre and Welsh Arthritis Research Network, School of Medicine, Cardiff University, Cardiff, United Kingdom
                [6] Centre for Clinical Brain Sciences, University of Edinburgh, Edinburgh, United Kingdom
                [7] The UK Biobank, Stockport, United Kingdom
                University of Catania, ITALY
                Author notes

                Competing Interests: The authors have declared that no competing interests exist.

                Conceived and designed the experiments: SMZ FF JK SB. Performed the experiments: FF JK SMZ. Analyzed the data: SB SMZ FF JK SD SS WGD TWO EC CS. Contributed reagents/materials/analysis tools: FF JK SMZ SB. Wrote the paper: SB SMZ FF JK SD SS WGD TWO EC CS. Conceived this study: SB, SMZ. Conceived and developed algorithms: SMZ FF JK. Coordinated data acquisition and management: RC MA. Critically reviewed and revised the manuscript: SB SMZ FF JK RC MA SD SS WGD TWO EC CS. Providing advances and consultations related to this study: UK Biobank.

                ¶ The complete membership of the author group (UK Biobank Follow-up and Outcomes Group) can be found in the Acknowledgments

                Article
                Manuscript ID: PONE-D-16-00457
                DOI: 10.1371/journal.pone.0154515
                PMCID: PMC4852928
                PMID: 27135409
                a321af45-dd9b-4d4d-a312-04988f166cc8
                © 2016 Zhou et al

                This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

                History
                Received: 5 January 2016
                Accepted: 14 April 2016
                Page count
                Figures: 2, Tables: 4, Pages: 14
                Funding
                Funded by: Health and Care Research Wales
                Award ID: CA02
                Funded by: Medical Research Council (funder ID: http://dx.doi.org/10.13039/501100000265)
                Award ID: MR/K006525/1
                Funded by: Medical Research Council (funder ID: http://dx.doi.org/10.13039/501100000265)
                Award ID: G0902272
                The work was supported by the UK Biobank, and undertaken with the support of the National Centre for Population Health and Wellbeing Research (NCPHWR) and the Farr Institute of Health Informatics Research. The NCPHWR is funded by Health and Care Research Wales (grant ref.: CA02). The Farr Institute is funded by a consortium of ten UK research organisations (grant ref.: MR/K006525/1): Arthritis Research UK, the British Heart Foundation, Cancer Research UK, the Economic and Social Research Council, the Engineering and Physical Sciences Research Council, the Medical Research Council, the National Institute of Health Research, the National Institute for Social Care and Health Research (Welsh Government) and the Chief Scientist Office (Scottish Government Health Directorates). WGD was supported by an MRC Clinician Scientist Fellowship (G0902272). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
                Categories
                Research Article
                Medicine and Health Sciences > Rheumatology > Arthritis > Rheumatoid Arthritis
                Medicine and Health Sciences > Clinical Medicine > Clinical Immunology > Autoimmune Diseases > Rheumatoid Arthritis
                Biology and Life Sciences > Immunology > Clinical Immunology > Autoimmune Diseases > Rheumatoid Arthritis
                Medicine and Health Sciences > Immunology > Clinical Immunology > Autoimmune Diseases > Rheumatoid Arthritis
                Medicine and Health Sciences > Health Care > Primary Care
                Physical Sciences > Mathematics > Applied Mathematics > Algorithms
                Research and Analysis Methods > Simulation and Modeling > Algorithms
                Physical Sciences > Mathematics > Applied Mathematics > Algorithms > Machine Learning Algorithms
                Research and Analysis Methods > Simulation and Modeling > Algorithms > Machine Learning Algorithms
                Biology and Life Sciences > Neuroscience > Cognitive Science > Artificial Intelligence > Machine Learning > Machine Learning Algorithms
                Computer and Information Sciences > Artificial Intelligence > Machine Learning > Machine Learning Algorithms
                Medicine and Health Sciences > Rheumatology
                Medicine and Health Sciences > Diagnostic Medicine
                Engineering and Technology > Management Engineering > Decision Analysis > Decision Trees
                Research and Analysis Methods > Decision Analysis > Decision Trees
                Research and Analysis Methods > Database and Informatics Methods > Health Informatics > Electronic Medical Records
                Custom metadata
                To maintain patient confidentiality, patient data are available on request from the SAIL databank management team via SAILDatabank@Swansea.ac.uk. All other relevant data are within the paper and its Supporting Information files.
