A review of approaches to identifying patient phenotype cohorts using electronic health records

Abstract

Objective

To summarize literature describing approaches aimed at automatically identifying patients with a common phenotype.

Materials and methods

We performed a review of studies describing systems or reporting techniques developed for identifying cohorts of patients with specific phenotypes. Every full-text article published in (1) Journal of the American Medical Informatics Association, (2) Journal of Biomedical Informatics, (3) Proceedings of the Annual American Medical Informatics Association Symposium, and (4) Proceedings of the Clinical Research Informatics Conference within the past 3 years was assessed for inclusion in the review. Only articles using automated techniques were included.

Results

Ninety-seven articles met our inclusion criteria. Forty-six used natural language processing (NLP)-based techniques, 24 described rule-based systems, 41 used statistical analyses, data mining, or machine learning techniques, while 22 described hybrid systems. Nine articles described the architecture of large-scale systems developed for determining cohort eligibility of patients.

Discussion

We observe a rise in the number of studies on cohort identification using electronic medical records. Statistical analyses and machine learning, followed by NLP techniques, have gained popularity over the years relative to rule-based systems.

Conclusions

A variety of approaches exist for classifying patients into a particular phenotype. Different techniques and data sources are used, and good performance is reported on datasets at the respective institutions. However, no system makes comprehensive use of electronic medical records while addressing all of their known weaknesses.

Most cited references 92

A simple algorithm for identifying negated findings and diseases in discharge summaries.

Wendy W. Chapman, Will Bridewell, Paul Hanbury … (2001)

Narrative reports in medical records contain a wealth of information that may augment structured data for managing patient information and predicting trends in diseases. Pertinent negatives are evident in text but are not usually indexed in structured databases. The objective of the study reported here was to test a simple algorithm for determining whether a finding or disease mentioned within narrative medical reports is present or absent. We developed a simple regular expression algorithm called NegEx that implements several phrases indicating negation, filters out sentences containing phrases that falsely appear to be negation phrases, and limits the scope of the negation phrases. We compared NegEx against a baseline algorithm that has a limited set of negation phrases and a simpler notion of scope. In a test of 1235 findings and diseases in 1000 sentences taken from discharge summaries indexed by physicians, NegEx had a specificity of 94.5% (versus 85.3% for the baseline), a positive predictive value of 84.5% (versus 68.4% for the baseline) while maintaining a reasonable sensitivity of 77.8% (versus 88.3% for the baseline). We conclude that with little implementation effort a simple regular expression algorithm for determining whether a finding or disease is absent can identify a large portion of the pertinent negatives from discharge summaries.
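
The three ingredients named in this abstract (negation phrases, a filter for phrases that only look like negations, and a limited scope) can be sketched in a few lines. This is a minimal illustration and not the published NegEx implementation: the phrase lists, the `SCOPE_TOKENS` window size, and the function name `is_negated` are all assumptions chosen for the example.

```python
import re

# Illustrative phrase lists only; assumptions, not the published NegEx lexicon.
NEGATION_TRIGGERS = ["no evidence of", "denies", "without", "no"]
PSEUDO_NEGATIONS = ["no increase", "not only", "no change"]
SCOPE_TOKENS = 5  # assumed window: number of tokens after a trigger that are negated


def _mask(match):
    # Replace a pseudo-negation phrase with placeholders of equal token count.
    return " ".join("_" for _ in match.group(0).split())


def is_negated(sentence: str, finding: str) -> bool:
    """Return True if `finding` falls inside the scope of a negation trigger."""
    text = sentence.lower()
    # Step 1: mask phrases that falsely appear to be negation phrases.
    for phrase in PSEUDO_NEGATIONS:
        text = re.sub(r"\b" + re.escape(phrase) + r"\b", _mask, text)
    # Step 2: for each trigger, inspect only a limited window to its right.
    target = r"\b" + re.escape(finding.lower()) + r"\b"
    for trigger in NEGATION_TRIGGERS:
        for m in re.finditer(r"\b" + re.escape(trigger) + r"\b", text):
            window = " ".join(text[m.end():].split()[:SCOPE_TOKENS])
            if re.search(target, window):
                return True
    return False


print(is_negated("The patient denies chest pain.", "chest pain"))
print(is_negated("No increase in size of the mass.", "mass"))
```

In this sketch the first call treats "chest pain" as negated, while the second does not, because "no increase" is masked before the triggers are applied. A finding that sits further than `SCOPE_TOKENS` tokens from a trigger is left as present, which is one simple way to limit scope.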

Accuracy of ICD-9-CM codes for identifying cardiovascular and stroke risk factors.

Yan Yan, Brian F Gage, Yan Yan … (2005)

We sought to determine which ICD-9-CM codes in Medicare Part A data identify cardiovascular and stroke risk factors. This was a cross-sectional study comparing ICD-9-CM data to structured medical record review from 23,657 Medicare beneficiaries aged 20 to 105 years who had atrial fibrillation. Quality improvement organizations used standardized abstraction instruments to determine the presence of 9 cardiovascular and stroke risk factors. Using the chart abstractions as the gold standard, we assessed the accuracy of ICD-9-CM codes to identify these risk factors. ICD-9-CM codes for all risk factors had high specificity (>0.95) and low sensitivity ([…] or =0.98) but moderate positive predictive values (range, 0.54-0.77) in this population. Using ICD-9-CM codes alone, heart failure, coronary artery disease, diabetes, hypertension, and stroke can be ruled in but not necessarily ruled out. Where feasible, review of additional data (eg, physician notes or imaging studies) should be used to confirm the diagnosis of valvular disease, arterial peripheral embolus, intracranial hemorrhage, and deep venous thrombosis.
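
The accuracy measures mentioned in this abstract follow directly from a 2x2 comparison of code-based labels against the chart-review gold standard. The sketch below shows the standard definitions; the counts in `example` are hypothetical numbers invented for illustration and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    tp: int  # code present, chart review positive
    fp: int  # code present, chart review negative
    fn: int  # code absent, chart review positive
    tn: int  # code absent, chart review negative


def accuracy_measures(c: ConfusionCounts) -> dict:
    """Standard diagnostic accuracy measures from a 2x2 table."""
    return {
        "sensitivity": c.tp / (c.tp + c.fn),
        "specificity": c.tn / (c.tn + c.fp),
        "ppv": c.tp / (c.tp + c.fp),
        "npv": c.tn / (c.tn + c.fn),
    }


# Hypothetical counts (an assumption for illustration, not study data).
example = ConfusionCounts(tp=60, fp=5, fn=40, tn=895)
for name, value in accuracy_measures(example).items():
    print(f"{name}: {value:.3f}")
```

With counts shaped like these, specificity and positive predictive value are high while sensitivity is low, which is the pattern behind the statement that a condition can be "ruled in but not necessarily ruled out" from codes alone.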

Use of diverse electronic medical record systems to identify genetic risk for type 2 diabetes within a genome-wide association study.

Abel Kho, M. Geoffrey Hayes, Laura J Rasmussen-Torvik … (2015)

Genome-wide association studies (GWAS) require high specificity and large numbers of subjects to identify genotype-phenotype correlations accurately. The aim of this study was to identify type 2 diabetes (T2D) cases and controls for a GWAS, using data captured through routine clinical care across five institutions using different electronic medical record (EMR) systems. An algorithm was developed to identify T2D cases and controls based on a combination of diagnoses, medications, and laboratory results. The performance of the algorithm was validated at three of the five participating institutions compared against clinician review. A GWAS was subsequently performed using cases and controls identified by the algorithm, with samples pooled across all five institutions. The algorithm achieved 98% and 100% positive predictive values for the identification of diabetic cases and controls, respectively, as compared against clinician review. By standardizing and applying the algorithm across institutions, 3353 cases and 3352 controls were identified. Subsequent GWAS using data from five institutions replicated the TCF7L2 gene variant (rs7903146) previously associated with T2D. By applying stringent criteria to EMR data collected through routine clinical care, cases and controls for a GWAS were identified that subsequently replicated a known genetic variant. The use of standard terminologies to define data elements enabled pooling of subjects and data across five different institutions to achieve the robust numbers required for GWAS. An algorithm using commonly available data from five different EMR can accurately identify T2D cases and controls for genetic study across multiple institutions.
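
A rule-based phenotype definition of the kind described here, combining diagnoses, medications, and laboratory results, might be structured as in the sketch below. It is not the algorithm from the study: the code sets, medication names, HbA1c cut-off, and the rule order in `classify` are placeholder assumptions chosen only to show the shape of such a definition.

```python
from dataclasses import dataclass, field

# Placeholder criteria; assumptions for illustration, not the study's definitions.
T2D_CODES = {"250.00", "250.02"}
T1D_CODES = {"250.01", "250.03"}
T2D_MEDICATIONS = {"metformin", "glipizide"}
HBA1C_THRESHOLD = 6.5  # percent


@dataclass
class PatientRecord:
    diagnosis_codes: set = field(default_factory=set)
    medications: set = field(default_factory=set)
    hba1c_values: list = field(default_factory=list)


def classify(p: PatientRecord) -> str:
    """Assign 'case', 'control', 'excluded', or 'indeterminate'."""
    has_t2d_dx = bool(p.diagnosis_codes & T2D_CODES)
    has_t1d_dx = bool(p.diagnosis_codes & T1D_CODES)
    on_t2d_med = bool(p.medications & T2D_MEDICATIONS)
    abnormal_lab = any(v >= HBA1C_THRESHOLD for v in p.hba1c_values)

    if has_t1d_dx:
        return "excluded"  # conflicting diagnosis
    if has_t2d_dx and (on_t2d_med or abnormal_lab):
        return "case"  # diagnosis supported by a second data source
    if p.hba1c_values and not (has_t2d_dx or on_t2d_med or abnormal_lab):
        return "control"  # tested, with no evidence of diabetes
    return "indeterminate"


print(classify(PatientRecord({"250.00"}, {"metformin"}, [7.1])))
print(classify(PatientRecord(set(), set(), [5.2])))
```

Requiring two independent sources of evidence for a case, and a documented normal test for a control, trades sample size for specificity, which matches the abstract's emphasis on stringent criteria.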

Author and article information

Journal

Journal ID (nlm-ta): J Am Med Inform Assoc

Journal ID (iso-abbrev): J Am Med Inform Assoc

Journal ID (hwp): amiajnl

Journal ID (publisher-id): jamia

Title: Journal of the American Medical Informatics Association : JAMIA

Publisher: BMJ Publishing Group (BMA House, Tavistock Square, London, WC1H 9JR)

ISSN (Print): 1067-5027

ISSN (Electronic): 1527-974X

Publication date (Print): March 2014

Publication date (Electronic): 7 November 2013

Publication date PMC-release: 7 November 2013

Volume: 21

Issue: 2

Pages: 221-230

Affiliations

[1] Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio, USA

[2] Department of Biomedical Informatics, The Ohio State University, Columbus, Ohio, USA

[3] Department of Biomedical Informatics, Columbia University, New York, New York, USA

[4] Center for Healthcare Informatics and Policy, Weill Cornell Medical College, New York, New York, USA

Author notes

Correspondence to: Chaitanya Shivade, Department of Computer Science and Engineering, The Ohio State University, 395 Dreese Laboratories, 2015 Neil Avenue, Columbus, OH 43210, USA; shivade@cse.ohio-state.edu

Author information

Chaitanya Shivade http://orcid.org/0000-0001-6604-1129

Article

Publisher ID: amiajnl-2013-001935

DOI: 10.1136/amiajnl-2013-001935

PMC ID: 3932460

PubMed ID: 24201027

License:

This is an Open Access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 3.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/3.0/

History

Date received: 15 April 2013

Date revision received: 18 October 2013

Date accepted: 25 October 2013

ScienceOpen disciplines: Bioinformatics & Computational biology

Keywords: review, electronic health records, cohort identification, phenotyping
