
      Extracting a stroke phenotype risk factor from Veteran Health Administration clinical reports: an information content analysis





          In the United States, 795,000 people suffer strokes each year; 10–15% of these strokes can be attributed to stenosis caused by plaque in the carotid artery, a major stroke phenotype risk factor. Studies comparing treatments for the management of asymptomatic carotid stenosis are challenging for at least two reasons: 1) administrative billing codes (i.e., Current Procedural Terminology (CPT) codes) that identify carotid images do not denote which neurovascular arteries are affected, and 2) the majority of the image reports are negative for carotid stenosis. Studies that rely on manual chart abstraction can be labor-intensive, expensive, and time-consuming. Natural Language Processing (NLP) can expedite manual chart abstraction by automatically filtering out reports with no/insignificant carotid stenosis findings and flagging reports with significant carotid stenosis findings, thereby potentially reducing effort, costs, and time.


          In this pilot study, we conducted an information content analysis of carotid stenosis mentions in Veteran Health Administration free-text reports in terms of their report location (sections), report format (structures), and linguistic description (expressions). We assessed the ability of an NLP algorithm, pyConText, to discern reports with significant carotid stenosis findings from reports with no/insignificant carotid stenosis findings, given these three document composition factors, for two report types: radiology (RAD) reports and text integration utility (TIU) notes.
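          To make the filtering task concrete, the following is a toy sketch of sentence-level flagging. It is not pyConText and does not use its API; the term lists, the `flag_report` name, and the choice of which severity words count as "significant" are all assumptions made purely for illustration.

```python
import re

# Hypothetical vocabularies for illustration only; they are NOT taken from
# the paper or from pyConText's knowledge base.
SEVERITY_TERMS = re.compile(r"\b(moderate|severe|critical|high[- ]grade)\b", re.I)
NEGATION_BEFORE_STENOSIS = re.compile(
    r"\b(no|without|negative for)\b[^.]*\bstenosis\b", re.I
)


def flag_report(text: str) -> bool:
    """Return True if any sentence asserts a (hypothetically) significant stenosis."""
    # Naive sentence split on terminal punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        if "stenosis" not in sentence.lower():
            continue  # sentence does not mention the target finding
        if NEGATION_BEFORE_STENOSIS.search(sentence):
            continue  # finding is negated in this sentence
        if SEVERITY_TERMS.search(sentence):
            return True  # asserted finding with a severity term
    return False


reports = [
    "There is severe stenosis of the right internal carotid artery.",
    "No significant stenosis of the carotid arteries.",
    "Mild stenosis of the left carotid bulb.",
]
flags = [flag_report(r) for r in reports]
```

          A real system would also need to handle section boundaries, numeric ranges, and the tabular structures the study describes, none of which this sketch attempts.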


          We observed that most carotid mentions are recorded in prose using categorical expressions, within the Findings and Impression sections for RAD reports and within neither of these designated sections for TIU notes. For RAD reports, pyConText performed with high sensitivity (88%), specificity (84%), and negative predictive value (95%) and reasonable positive predictive value (70%). For TIU notes, pyConText performed with high specificity (87%) and negative predictive value (92%), reasonable sensitivity (73%), and moderate positive predictive value (58%). pyConText performed with the highest sensitivity when processing the full report rather than the Findings or Impression sections independently.
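          For readers less familiar with the four measures reported above, here is a minimal sketch of how they follow from a confusion matrix of report-level classifications. The counts are hypothetical and are not the study's data; `ConfusionCounts` and `diagnostic_metrics` are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    tp: int  # significant-stenosis reports correctly flagged
    fp: int  # no/insignificant reports incorrectly flagged
    tn: int  # no/insignificant reports correctly filtered out
    fn: int  # significant-stenosis reports missed


def _ratio(numerator: int, denominator: int) -> float:
    # Return NaN rather than raising when a denominator is zero.
    return numerator / denominator if denominator else float("nan")


def diagnostic_metrics(c: ConfusionCounts) -> dict:
    return {
        "sensitivity": _ratio(c.tp, c.tp + c.fn),  # recall on positive reports
        "specificity": _ratio(c.tn, c.tn + c.fp),  # recall on negative reports
        "ppv": _ratio(c.tp, c.tp + c.fp),          # precision of the flags
        "npv": _ratio(c.tn, c.tn + c.fn),          # precision of the filter
    }


# Hypothetical counts chosen only to keep the arithmetic easy to follow.
counts = ConfusionCounts(tp=8, fp=2, tn=18, fn=2)
metrics = diagnostic_metrics(counts)
```

          For a filtering use case, negative predictive value is the measure that indicates how safely the filtered-out reports can be skipped during chart review, while sensitivity indicates how many truly positive reports are caught.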


          We conclude that pyConText can reduce chart review efforts by filtering reports with no/insignificant carotid stenosis findings and flagging reports with significant carotid stenosis findings from the Veteran Health Administration electronic health record, and hence has utility for expediting a comparative effectiveness study of treatment strategies for stroke prevention.


                Author and article information

                J Biomed Semantics
                Journal of Biomedical Semantics
                BioMed Central (London)
                10 May 2016
                Volume: 7
                Department of Biomedical Informatics, University of Utah, Salt Lake City, UT USA
                IDEAS Center, Veteran Affair Health Care System, Salt Lake City, UT USA
                San Francisco Veteran Affair Health Care System, San Francisco, CA USA
                © Mowery et al. 2016

                Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

                Funded by: FundRef http://dx.doi.org/10.13039/100000057, National Institute of General Medical Sciences;
                Award ID: NIGMS R01GM090187
                Funded by: FundRef http://dx.doi.org/10.13039/100000092, U.S. National Library of Medicine;
                Award ID: R01 LM010964
                Funded by: FundRef http://dx.doi.org/10.13039/100000050, National Heart, Lung, and Blood Institute;
                Award ID: 1R01HL114563-01A1
                Funded by: FundRef http://dx.doi.org/10.13039/100007181, Quality Enhancement Research Initiative;
                Award ID: HSR&D Stroke QUERI RRP 12-185

                Bioinformatics & Computational biology
                natural language processing, stroke, phenotype, information extraction

