
      A review of approaches to identifying patient phenotype cohorts using electronic health records




          Objective

          To summarize literature describing approaches aimed at automatically identifying patients with a common phenotype.

          Materials and methods

          We performed a review of studies describing systems or reporting techniques developed for identifying cohorts of patients with specific phenotypes. Every full text article published in (1) Journal of American Medical Informatics Association, (2) Journal of Biomedical Informatics, (3) Proceedings of the Annual American Medical Informatics Association Symposium, and (4) Proceedings of Clinical Research Informatics Conference within the past 3 years was assessed for inclusion in the review. Only articles using automated techniques were included.


          Results

          Ninety-seven articles met our inclusion criteria. Forty-six used natural language processing (NLP)-based techniques, 24 described rule-based systems, 41 used statistical analyses, data mining, or machine learning techniques, while 22 described hybrid systems. Nine articles described the architecture of large-scale systems developed for determining cohort eligibility of patients.


          Discussion

          We observe a rise in the number of studies on cohort identification using electronic medical records. Statistical analyses or machine learning, followed by NLP techniques, have gained popularity over the years relative to rule-based systems.


          Conclusions

          There are a variety of approaches for classifying patients into a particular phenotype. Different techniques and data sources are used, and good performance is reported on datasets at the respective institutions. However, no system makes comprehensive use of electronic medical records while addressing all of their known weaknesses.


          Most cited references 92


          A simple algorithm for identifying negated findings and diseases in discharge summaries.

          Narrative reports in medical records contain a wealth of information that may augment structured data for managing patient information and predicting trends in diseases. Pertinent negatives are evident in text but are not usually indexed in structured databases. The objective of the study reported here was to test a simple algorithm for determining whether a finding or disease mentioned within narrative medical reports is present or absent. We developed a simple regular expression algorithm called NegEx that implements several phrases indicating negation, filters out sentences containing phrases that falsely appear to be negation phrases, and limits the scope of the negation phrases. We compared NegEx against a baseline algorithm that has a limited set of negation phrases and a simpler notion of scope. In a test of 1235 findings and diseases in 1000 sentences taken from discharge summaries indexed by physicians, NegEx had a specificity of 94.5% (versus 85.3% for the baseline), a positive predictive value of 84.5% (versus 68.4% for the baseline) while maintaining a reasonable sensitivity of 77.8% (versus 88.3% for the baseline). We conclude that with little implementation effort a simple regular expression algorithm for determining whether a finding or disease is absent can identify a large portion of the pertinent negatives from discharge summaries.
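The three ingredients named in this abstract (negation phrases, a filter for phrases that only look like negation, and a limited scope) can be sketched roughly as follows. This is a minimal illustration and not the published NegEx implementation: the phrase lists, the five-token window, the scope terminators, and the name `is_negated` are all assumptions chosen for the example, and only negation phrases that precede the finding are handled.

```python
import re

# Illustrative phrase lists only; these are assumptions, not the NegEx lists.
NEGATION_TRIGGERS = ["no", "denies", "without", "no evidence of"]
PSEUDO_NEGATIONS = ["no increase", "not only"]   # look like negation but are not
SCOPE_TERMINATORS = {"but", "however", "although"}
MAX_SCOPE_TOKENS = 5                             # assumed scope window


def _tokens(text):
    """Lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def _find(seq, sub):
    """Start indices at which token list `sub` occurs inside `seq`."""
    n = len(sub)
    return [i for i in range(len(seq) - n + 1) if seq[i:i + n] == sub]


def is_negated(sentence, finding):
    """Return True if `finding` falls inside the scope of a negation trigger."""
    text = sentence.lower()
    # Mask pseudo-negation phrases so they cannot act as triggers.
    for phrase in PSEUDO_NEGATIONS:
        text = re.sub(r"\b" + re.escape(phrase) + r"\b", " pseudoneg ", text)
    tokens = _tokens(text)
    target = _tokens(finding)
    if not target:
        return False
    starts = _find(tokens, target)
    for trigger in NEGATION_TRIGGERS:
        trig = _tokens(trigger)
        for pos in _find(tokens, trig):
            begin = pos + len(trig)
            end = min(begin + MAX_SCOPE_TOKENS, len(tokens))
            # A terminator word ends the scope early.
            for i in range(begin, end):
                if tokens[i] in SCOPE_TERMINATORS:
                    end = i
                    break
            if any(begin <= s < end for s in starts):
                return True
    return False
```

For instance, `is_negated("No fever but reports cough.", "cough")` is false because "but" closes the scope opened by "no", which is the kind of scope limit the abstract describes.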

            Accuracy of ICD-9-CM codes for identifying cardiovascular and stroke risk factors.

            We sought to determine which ICD-9-CM codes in Medicare Part A data identify cardiovascular and stroke risk factors. This was a cross-sectional study comparing ICD-9-CM data to structured medical record review from 23,657 Medicare beneficiaries aged 20 to 105 years who had atrial fibrillation. Quality improvement organizations used standardized abstraction instruments to determine the presence of 9 cardiovascular and stroke risk factors. Using the chart abstractions as the gold standard, we assessed the accuracy of ICD-9-CM codes to identify these risk factors. ICD-9-CM codes for all risk factors had high specificity (>0.95) and low sensitivity ([...] ≥0.98) but moderate positive predictive values (range, 0.54-0.77) in this population. Using ICD-9-CM codes alone, heart failure, coronary artery disease, diabetes, hypertension, and stroke can be ruled in but not necessarily ruled out. Where feasible, review of additional data (eg, physician notes or imaging studies) should be used to confirm the diagnosis of valvular disease, arterial peripheral embolus, intracranial hemorrhage, and deep venous thrombosis.
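The accuracy measures reported here all derive from a 2x2 comparison of code-based labels against the chart-review gold standard. The sketch below shows those calculations; the class name `ConfusionCounts` and the example counts are invented for illustration and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Code-based labels compared with a chart-review gold standard."""
    tp: int  # code present, risk factor confirmed on chart review
    fp: int  # code present, risk factor not confirmed
    fn: int  # code absent, risk factor confirmed
    tn: int  # code absent, risk factor not confirmed


def sensitivity(c: ConfusionCounts) -> float:
    return c.tp / (c.tp + c.fn)


def specificity(c: ConfusionCounts) -> float:
    return c.tn / (c.tn + c.fp)


def positive_predictive_value(c: ConfusionCounts) -> float:
    return c.tp / (c.tp + c.fp)


def negative_predictive_value(c: ConfusionCounts) -> float:
    return c.tn / (c.tn + c.fn)


# Made-up counts: few false positives, many missed cases.
example = ConfusionCounts(tp=60, fp=2, fn=40, tn=98)
```

With counts like these, specificity and positive predictive value are high while sensitivity is low, which is the pattern behind the statement that a condition "can be ruled in but not necessarily ruled out" from codes alone.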

              Use of diverse electronic medical record systems to identify genetic risk for type 2 diabetes within a genome-wide association study.

              Genome-wide association studies (GWAS) require high specificity and large numbers of subjects to identify genotype-phenotype correlations accurately. The aim of this study was to identify type 2 diabetes (T2D) cases and controls for a GWAS, using data captured through routine clinical care across five institutions using different electronic medical record (EMR) systems. An algorithm was developed to identify T2D cases and controls based on a combination of diagnoses, medications, and laboratory results. The performance of the algorithm was validated at three of the five participating institutions compared against clinician review. A GWAS was subsequently performed using cases and controls identified by the algorithm, with samples pooled across all five institutions. The algorithm achieved 98% and 100% positive predictive values for the identification of diabetic cases and controls, respectively, as compared against clinician review. By standardizing and applying the algorithm across institutions, 3353 cases and 3352 controls were identified. Subsequent GWAS using data from five institutions replicated the TCF7L2 gene variant (rs7903146) previously associated with T2D. By applying stringent criteria to EMR data collected through routine clinical care, cases and controls for a GWAS were identified that subsequently replicated a known genetic variant. The use of standard terminologies to define data elements enabled pooling of subjects and data across five different institutions to achieve the robust numbers required for GWAS. An algorithm using commonly available data from five different EMR can accurately identify T2D cases and controls for genetic study across multiple institutions.
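A rule-based algorithm that combines diagnoses, medications, and laboratory results, as described above, might look roughly like the following sketch. It is not the study's algorithm: the code sets, the medication list, the HbA1c cutoff, and the names `Patient` and `classify` are placeholders assumed for illustration. The point is only that requiring agreement between several data types trades recall for a higher positive predictive value.

```python
from dataclasses import dataclass, field
from typing import List

# Placeholder values for illustration; these are not the study's criteria.
T2D_DIAGNOSES = {"T2D_DX"}
T1D_DIAGNOSES = {"T1D_DX"}
DIABETES_MEDICATIONS = {"metformin", "insulin"}
HBA1C_ABNORMAL = 6.5  # assumed cutoff, in percent


@dataclass
class Patient:
    diagnoses: List[str] = field(default_factory=list)
    medications: List[str] = field(default_factory=list)
    hba1c: List[float] = field(default_factory=list)


def classify(p: Patient) -> str:
    """Return 'case', 'control', or 'excluded' under the placeholder rules."""
    dx = set(p.diagnoses)
    has_t2d_dx = bool(dx & T2D_DIAGNOSES)
    on_meds = bool(DIABETES_MEDICATIONS & {m.lower() for m in p.medications})
    abnormal_lab = any(v >= HBA1C_ABNORMAL for v in p.hba1c)

    if dx & T1D_DIAGNOSES:
        return "excluded"          # conflicting evidence
    if has_t2d_dx and (on_meds or abnormal_lab):
        return "case"              # diagnosis plus supporting evidence
    if not has_t2d_dx and not on_meds and p.hba1c and not abnormal_lab:
        return "control"           # tested, and nothing points to diabetes
    return "excluded"              # insufficient evidence either way
```

A patient with only a diagnosis code and no supporting medication or laboratory value is excluded rather than counted as a case, reflecting the stringent criteria the abstract emphasizes.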

                Author and article information

                J Am Med Inform Assoc
                Journal of the American Medical Informatics Association: JAMIA
                BMJ Publishing Group (BMA House, Tavistock Square, London, WC1H 9JR)
                March 2014
                7 November 2013
                Volume: 21
                Issue: 2
                Pages: 221-230
                [1] Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio, USA
                [2] Department of Biomedical Informatics, The Ohio State University, Columbus, Ohio, USA
                [3] Department of Biomedical Informatics, Columbia University, New York, New York, USA
                [4] Center for Healthcare Informatics and Policy, Weill Cornell Medical College, New York, New York, USA
                Author notes
                Correspondence to Chaitanya Shivade, Department of Computer Science and Engineering, The Ohio State University, 395 Dreese Laboratories, 2015 Neil Avenue, Columbus, OH 43210, USA; shivade@
                Published by the BMJ Publishing Group Limited. For permission to use (where not already granted under a licence) please go to

                This is an Open Access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 3.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See:
