      Extracting social determinants of health from electronic health records using natural language processing: a systematic review

      review-article

          Abstract

          Objective

          Social determinants of health (SDoH) are nonclinical dispositions that impact patient health risks and clinical outcomes. Leveraging SDoH in clinical decision-making can potentially improve diagnosis, treatment planning, and patient outcomes. Despite increased interest in capturing SDoH in electronic health records (EHRs), such information is typically locked in unstructured clinical notes. Natural language processing (NLP) is the key technology to extract SDoH information from clinical text and expand its utility in patient care and research. This article presents a systematic review of the state-of-the-art NLP approaches and tools that focus on identifying and extracting SDoH data from unstructured clinical text in EHRs.

          Materials and Methods

          A broad literature search was conducted in February 2021 using 3 scholarly databases (ACL Anthology, PubMed, and Scopus) following Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. A total of 6402 publications were initially identified, and after applying the study inclusion criteria, 82 publications were selected for the final review.

          Results

          Smoking status (n = 27), substance use (n = 21), homelessness (n = 20), and alcohol use (n = 15) are the most frequently studied SDoH categories. Homelessness (n = 7) and other less-studied SDoH (eg, education, financial problems, social isolation and support, family problems) are mostly identified using rule-based approaches. In contrast, machine learning approaches are popular for identifying smoking status (n = 13), substance use (n = 9), and alcohol use (n = 9).
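As a minimal sketch of what a rule-based approach of the kind mentioned above might look like, the following matches keyword patterns in a clinical note and applies a simple negation check. The patterns, category names, and the `extract_sdoh` function are hypothetical and chosen for illustration only; they are not taken from the review or from any of the studies it covers.

```python
import re

# Hypothetical keyword patterns, for illustration only; they are not
# taken from the review or from any of the studies it covers.
SDOH_PATTERNS = {
    "homelessness": re.compile(
        r"\b(homeless(ness)?|living in (a )?shelter|no fixed address)\b", re.I
    ),
    "smoking": re.compile(r"\b(smok(es|er|ing|ed)|tobacco|cigarettes?)\b", re.I),
    "alcohol_use": re.compile(r"\b(alcohol|etoh|drinks? daily)\b", re.I),
}

# A negation cue counts only if it appears within the same sentence,
# at most 30 characters before the matched term.
NEGATION = re.compile(r"\b(no|denies|denied|never|without)\b[^.;]{0,30}$", re.I)


def extract_sdoh(note):
    """Return {category: 'present' | 'negated'} for each category mentioned."""
    found = {}
    for category, pattern in SDOH_PATTERNS.items():
        for match in pattern.finditer(note):
            preceding = note[:match.start()]
            status = "negated" if NEGATION.search(preceding) else "present"
            # A positive mention anywhere takes precedence over a negated one.
            if found.get(category) != "present":
                found[category] = status
    return found


if __name__ == "__main__":
    example = "Patient is currently homeless. Denies tobacco use. Drinks alcohol socially."
    print(extract_sdoh(example))
```

Real clinical notes need far richer lexicons and context handling (temporality, family history, templated text), which is one reason machine learning approaches are often preferred for the more frequently studied categories.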

          Conclusion

          NLP offers significant potential to extract SDoH data from narrative clinical notes, which in turn can aid in the development of screening tools, risk prediction models, and clinical decision support systems.


                Author and article information

                Journal
                Journal of the American Medical Informatics Association : JAMIA (J Am Med Inform Assoc)
                Publisher: Oxford University Press
                ISSN: 1067-5027, 1527-974X
                December 2021
                06 October 2021
                Volume: 28
                Issue: 12
                Pages: 2716-2727
                Affiliations
                [1] Department of Population Health Sciences, Weill Cornell Medicine, New York, New York, USA
                [2] Information Technologies and Services, Weill Cornell Medicine, New York, New York, USA
                [3] Department of Internal Medicine, Division of Epidemiology, University of Utah, Salt Lake City, Utah, USA
                [4] US Department of Veterans Affairs, Salt Lake City, Utah, USA
                [5] Icahn School of Medicine at Mount Sinai, New York, New York, USA
                [6] Department of Quantitative Health Sciences, Mayo Clinic, Rochester, Minnesota, USA
                [7] Northwestern University, Chicago, Illinois, USA
                [8] Department of Health Outcomes and Biomedical Informatics, University of Florida, Gainesville, Florida, USA
                [9] Division of Hematology & Oncology, Department of Medicine, College of Medicine, University of Florida, Gainesville, Florida, USA
                [10] Vagelos College of Physicians and Surgeons, Columbia University, New York, New York, USA
                Author notes
                Corresponding Author: Jyotishman Pathak, PhD, Department of Population Health Sciences, Weill Cornell Medicine, 425 E 61st St, Suite 301, New York, NY 10065, USA (jyp2001@med.cornell.edu)
                Author information
                https://orcid.org/0000-0003-2997-5314
                https://orcid.org/0000-0003-0091-5510
                https://orcid.org/0000-0001-9801-2250
                https://orcid.org/0000-0002-8717-5975
                https://orcid.org/0000-0003-4515-8090
                https://orcid.org/0000-0002-6249-9180
                https://orcid.org/0000-0002-9881-1017
                https://orcid.org/0000-0002-2238-5429
                https://orcid.org/0000-0001-7624-769X
                https://orcid.org/0000-0001-5586-9940
                https://orcid.org/0000-0002-4856-410X
                Article
                ocab170
                10.1093/jamia/ocab170
                8633615
                34613399
                3d7d43f4-6d53-4acc-a3a5-c4746fa2ad50
                © The Author(s) 2021. Published by Oxford University Press on behalf of the American Medical Informatics Association.

                This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License ( https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com

                History
                : 29 April 2021
                : 09 July 2021
                : 04 August 2021
                : 25 November 2021
                Page count
                Pages: 12
                Funding
                Funded by: NIH, DOI 10.13039/100000002;
                Award ID: R01MH119177
                Award ID: R01MH121907
                Award ID: R01MH121922
                Categories
                Reviews
                AcademicSubjects/MED00580
                AcademicSubjects/SCI01060
                AcademicSubjects/SCI01530

                Bioinformatics & Computational biology
                social determinants of health, population health outcomes, electronic health records, natural language processing, information extraction, machine learning
