A clinical text classification paradigm using weak supervision and deep representation


Abstract

Background

Automatic clinical text classification is a natural language processing (NLP) technology that unlocks information embedded in clinical narratives. Machine learning approaches have been shown to be effective for clinical text classification tasks. However, a successful machine learning model usually requires extensive human effort to create labeled training data and to conduct feature engineering. In this study, we propose a clinical text classification paradigm that uses weak supervision and deep representation to reduce this effort.

Methods

We develop a rule-based NLP algorithm to automatically generate labels for the training data, and then use pre-trained word embeddings as deep representation features for training machine learning models. Because the machine learning models are trained on labels generated by the automatic NLP algorithm rather than by human annotators, this training process is called weak supervision. We evaluate the paradigm's effectiveness on two institutional case studies at Mayo Clinic, smoking status classification and proximal femur (hip) fracture classification, and on one case study using a public dataset, the i2b2 2006 smoking status classification shared task. We test four widely used machine learning models with this paradigm: Support Vector Machine (SVM), Random Forest (RF), Multilayer Perceptron Neural Networks (MLPNN), and Convolutional Neural Networks (CNN). Precision, recall, and F1 score are used as metrics to evaluate performance.
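The weak supervision workflow described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the study: the keyword rules, the two-dimensional `TOY_EMBEDDINGS` table (standing in for pre-trained word embeddings), the example notes, and the nearest-centroid classifier, which is swapped in for the SVM, RF, MLPNN, and CNN models so that the example runs with the standard library only.

```python
import re

# Hypothetical keyword rules standing in for the rule-based NLP algorithm.
SMOKER_RULE = re.compile(r"\b(smokes|smoker|smoking|tobacco|cigarettes?)\b", re.I)
NEGATION_RULE = re.compile(r"\b(no|denies|never|non)\b", re.I)


def rule_label(text):
    """Return a weak label: 1 (smoker) or 0 (non-smoker)."""
    if SMOKER_RULE.search(text) and not NEGATION_RULE.search(text):
        return 1
    return 0


# Made-up 2-D vectors; real pre-trained embeddings have hundreds of dimensions.
TOY_EMBEDDINGS = {
    "smokes": [0.9, 0.1], "smoker": [0.8, 0.2], "cigarettes": [0.9, 0.0],
    "tobacco": [0.85, 0.1], "pack": [0.7, 0.2], "daily": [0.6, 0.3],
    "denies": [0.1, 0.9], "never": [0.0, 0.9], "no": [0.1, 0.8],
}


def embed(text):
    """Average the embeddings of known tokens (zero vector if none are known)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    vectors = [TOY_EMBEDDINGS[t] for t in tokens if t in TOY_EMBEDDINGS]
    if not vectors:
        return [0.0, 0.0]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def train_centroids(texts, labels):
    """Fit a nearest-centroid classifier: one mean vector per weak label."""
    sums, counts = {}, {}
    for text, label in zip(texts, labels):
        vec = embed(text)
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(centroids, text):
    """Assign the label whose centroid is closest in squared Euclidean distance."""
    vec = embed(text)

    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(vec, centroid))

    return min(centroids, key=lambda label: sq_dist(centroids[label]))


# Step 1: the rules label otherwise unlabeled notes (no human annotation).
unlabeled_notes = [
    "Patient smokes one pack daily.",
    "Current smoker, tobacco use daily.",
    "Denies tobacco use.",
    "Never smoker.",
]
weak_labels = [rule_label(note) for note in unlabeled_notes]

# Step 2: a model is trained on embedding features using the weak labels.
centroids = train_centroids(unlabeled_notes, weak_labels)
```

In this toy setup, the note "One pack daily" contains none of the rule keywords, so `rule_label` assigns it 0, whereas `predict(centroids, "One pack daily")` places it nearer the smoker centroid. This is only meant to show how a model trained on embedding features can generalize beyond the rules that produced its labels; it says nothing about the behavior of the models evaluated in the study.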

Results

CNN achieves the best performance in both institutional tasks (F1 score: 0.92 for Mayo Clinic smoking status classification and 0.97 for fracture classification). We show that word embeddings significantly outperform tf-idf and topic modeling features in the paradigm, and that CNN captures additional patterns from the weak supervision compared to the rule-based NLP algorithms. We also observe two drawbacks of the proposed paradigm: CNN is more sensitive to the size of the training data, and the paradigm might not be effective for complex multiclass classification tasks.
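For readers less familiar with the reported metrics, the following sketch shows how precision, recall, and F1 score are computed for a binary task. The function name `precision_recall_f1` and the example labels are hypothetical and do not come from the study's data.

```python
def precision_recall_f1(gold, predicted, positive=1):
    """Compute precision, recall, and F1 for one positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up gold-standard and predicted labels for five notes.
gold_labels = [1, 1, 1, 0, 0]
predicted_labels = [1, 1, 0, 1, 0]
precision, recall, f1 = precision_recall_f1(gold_labels, predicted_labels)
```

With these made-up labels there are two true positives, one false positive, and one false negative, so precision, recall, and F1 all equal 2/3.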

Conclusion

By leveraging weak supervision and deep representation, the proposed clinical text classification paradigm could reduce the human effort needed to create labeled training data and conduct feature engineering when applying machine learning to clinical text classification. The experiments have validated the effectiveness of the paradigm on two institutional and one shared clinical text classification tasks.


Author and article information

Contributors

Yanshan Wang:

ORCID: http://orcid.org/0000-0003-4433-7839

wang.yanshan@mayo.edu

Sunghwan Sohn: sohn.sunghwan@mayo.edu

Sijia Liu: liu.sijia@mayo.edu

Feichen Shen: shen.feichen@mayo.edu

Liwei Wang: wang.liwei@mayo.edu

Elizabeth J. Atkinson: atkinson@mayo.edu

Shreyasee Amin: amin.shreyasee@mayo.edu

Hongfang Liu: liu.hongfang@mayo.edu

Journal

Journal ID (nlm-ta): BMC Med Inform Decis Mak

Journal ID (iso-abbrev): BMC Med Inform Decis Mak

Title: BMC Medical Informatics and Decision Making

Publisher: BioMed Central (London)

ISSN (Electronic): 1472-6947

Publication date (Electronic): 7 January 2019

Publication date PMC-release: 7 January 2019

Publication date Collection: 2019

Volume: 19

Electronic Location Identifier: 1

Affiliations

[1] ISNI 0000 0004 0459 167X, GRID grid.66875.3a, Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, 200 1st ST SW, Rochester, MN 55905 USA

[2] ISNI 0000 0004 0459 167X, GRID grid.66875.3a, Division of Rheumatology, Department of Medicine, Mayo Clinic, 200 1st ST SW, Rochester, MN 55905 USA

[3] ISNI 0000 0004 0459 167X, GRID grid.66875.3a, Division of Epidemiology, Department of Health Sciences Research, Mayo Clinic, 200 1st ST SW, Rochester, MN 55905 USA

Author information

Yanshan Wang http://orcid.org/0000-0003-4433-7839

Article

Publisher ID: 723

DOI: 10.1186/s12911-018-0723-6

PMC ID: 6322223

PubMed ID: 30616584

SO-VID: a4edfc04-84f1-417a-85e2-3d9c846f0218

License:

Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

History

Date received: 23 May 2018

Date accepted: 10 December 2018

Funding

Funded by: FundRef http://dx.doi.org/10.13039/100000049, National Institute on Aging;

Award ID: P01AG04875

Funded by: FundRef http://dx.doi.org/10.13039/100000057, National Institute of General Medical Sciences;

Award ID: R01GM102282

Funded by: FundRef http://dx.doi.org/10.13039/100006108, National Center for Advancing Translational Sciences;

Award ID: U01TR002062

Funded by: FundRef http://dx.doi.org/10.13039/100000092, U.S. National Library of Medicine;

Award ID: R01LM11934

Custom metadata

ScienceOpen disciplines: Bioinformatics & Computational biology

Keywords: clinical text classification, natural language processing, electronic health records, machine learning, weak supervision

