Evaluating the effects of machine pre-annotation and an interactive annotation interface on manual de-identification of clinical text

Abstract

The Health Insurance Portability and Accountability Act (HIPAA) Safe Harbor method requires removal of 18 types of protected health information (PHI) from clinical documents before they can be considered “de-identified” and used for research purposes. Human review of PHI elements in a large corpus of clinical documents can be tedious and error-prone, and multiple annotators may be required to consistently redact the information that represents each PHI class. Automated de-identification has the potential to improve annotation quality and reduce annotation time. One approach is machine-assisted annotation, which combines de-identification system outputs used as pre-annotations with an interactive annotation interface, so that annotators receive PHI annotations to “curate” rather than annotating raw clinical documents from “scratch”. To assess whether machine-assisted annotation improves the reliability and accuracy of the reference standard and reduces annotation effort, we conducted an annotation experiment. In this annotation study, we assessed the generalizability of the VA Consortium for Healthcare Informatics Research (CHIR) annotation schema and guidelines when applied to MTSamples, a corpus of publicly available clinical documents. Specifically, our goals were to (1) characterize a heterogeneous corpus of clinical documents manually annotated for risk-ranked PHI and other annotation types (clinical eponyms and person relations), (2) evaluate how well annotators apply the CHIR schema to the heterogeneous corpus, (3) determine whether machine-assisted annotation (experiment) improves annotation quality and reduces annotation time compared to manual annotation (control), and (4) assess how reference standard coverage changes with each added annotator’s annotations.
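
Goal (4) can be illustrated with a small sketch. It is not the study's evaluation code: the span representation, the function names, and the toy data are all assumptions made for illustration. It pools annotators' PHI spans one annotator at a time and reports the fraction of reference standard spans covered after each addition.

```python
from typing import List, Set, Tuple

# Hypothetical span representation: (start offset, end offset, PHI type).
Span = Tuple[int, int, str]


def recall(found: Set[Span], reference: Set[Span]) -> float:
    """Fraction of reference spans that appear exactly in `found`."""
    if not reference:
        return 1.0
    return len(found & reference) / len(reference)


def cumulative_coverage(annotators: List[Set[Span]],
                        reference: Set[Span]) -> List[float]:
    """Coverage of the reference after pooling annotators 1..k, for each k."""
    pooled: Set[Span] = set()
    coverage = []
    for spans in annotators:
        pooled |= spans  # add this annotator's spans to the pool
        coverage.append(recall(pooled, reference))
    return coverage


# Toy data (invented): four reference spans, three annotators.
reference = {(0, 10, "NAME"), (25, 35, "DATE"), (50, 62, "PHONE"), (80, 84, "AGE")}
annotators = [
    {(0, 10, "NAME"), (25, 35, "DATE")},
    {(0, 10, "NAME"), (50, 62, "PHONE")},
    {(25, 35, "DATE"), (80, 84, "AGE")},
]
coverage = cumulative_coverage(annotators, reference)
print(coverage)  # [0.5, 0.75, 1.0]
```

Exact span matching is the strictest choice; a partial-overlap criterion would give higher coverage values on the same data.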

Related collections

Most cited references 34

ABNER: an open source tool for automatically tagging genes, proteins and other entity names in text.

Burr Settles (2005)

ABNER (A Biomedical Named Entity Recognizer) is an open source software tool for molecular biology text mining. At its core is a machine learning system using conditional random fields with a variety of orthographic and contextual features. The latest version is 1.5, which has an intuitive graphical interface and includes two modules for tagging entities (e.g. protein and cell line) trained on standard corpora, for which performance is roughly state of the art. It also includes a Java application programming interface allowing users to incorporate ABNER into their own systems and train models on new corpora.

Evaluating the state-of-the-art in automatic de-identification.

Ozlem Uzuner, Yuan Luo, Peter Szolovits (2007)

To facilitate and survey studies in automatic de-identification, as a part of the i2b2 (Informatics for Integrating Biology to the Bedside) project, authors organized a Natural Language Processing (NLP) challenge on automatically removing private health information (PHI) from medical discharge records. This manuscript provides an overview of this de-identification challenge, describes the data and the annotation process, explains the evaluation metrics, discusses the nature of the systems that addressed the challenge, analyzes the results of received system runs, and identifies directions for future research. The de-identification challenge data consisted of discharge summaries drawn from the Partners Healthcare system. Authors prepared this data for the challenge by replacing authentic PHI with synthesized surrogates. To focus the challenge on non-dictionary-based de-identification methods, the data was enriched with out-of-vocabulary PHI surrogates, i.e., made-up names. The data also included some PHI surrogates that were ambiguous with medical non-PHI terms. A total of seven teams participated in the challenge. Each team submitted up to three system runs, for a total of sixteen submissions. The authors used precision, recall, and F-measure to evaluate the submitted system runs based on their token-level and instance-level performance on the ground truth. The systems with the best performance scored above 98% in F-measure for all categories of PHI. Most out-of-vocabulary PHI could be identified accurately. However, identifying ambiguous PHI proved challenging. The performance of systems on the test data set is encouraging. Future evaluations of these systems will involve larger data sets from more heterogeneous sources.
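
The three metrics named above can be sketched at the token level as follows. This is a minimal illustration, not the challenge's scoring script: the function name and the toy token indices are assumptions, and the balanced F-measure (F1) is assumed.

```python
from typing import Set, Tuple


def precision_recall_f1(predicted: Set[int], gold: Set[int]) -> Tuple[float, float, float]:
    """Token-level precision, recall, and balanced F-measure (F1).

    `predicted` and `gold` hold the indices of tokens marked as PHI by the
    system and by the ground truth, respectively.
    """
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example (invented): 5 gold PHI tokens, 5 predicted, 4 in common.
gold_tokens = {3, 4, 10, 11, 12}
system_tokens = {3, 4, 10, 11, 20}
p, r, f = precision_recall_f1(system_tokens, gold_tokens)
print(p, r)  # 0.8 0.8
```

Instance-level scoring would apply the same formulas to whole PHI spans rather than to individual tokens.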

Is Open Access

Automated de-identification of free-text medical records

Ishna Neamatullah, Margaret Douglass, Li-wei H Lehman … (2008)

Background: Text-based patient medical records are a vital resource in medical research. In order to preserve patient confidentiality, however, the U.S. Health Insurance Portability and Accountability Act (HIPAA) requires that protected health information (PHI) be removed from medical records before they can be disseminated. Manual de-identification of large medical record databases is prohibitively expensive, time-consuming and prone to error, necessitating automatic methods for large-scale, automated de-identification.

Methods: We describe an automated Perl-based de-identification software package that is generally usable on most free-text medical records, e.g., nursing notes, discharge summaries, X-ray reports, etc. The software uses lexical look-up tables, regular expressions, and simple heuristics to locate both HIPAA PHI, and an extended PHI set that includes doctors' names and years of dates. To develop the de-identification approach, we assembled a gold standard corpus of re-identified nursing notes with real PHI replaced by realistic surrogate information. This corpus consists of 2,434 nursing notes containing 334,000 words and a total of 1,779 instances of PHI taken from 163 randomly selected patient records. This gold standard corpus was used to refine the algorithm and measure its sensitivity. To test the algorithm on data not used in its development, we constructed a second test corpus of 1,836 nursing notes containing 296,400 words. The algorithm's false negative rate was evaluated using this test corpus.

Results: Performance evaluation of the de-identification software on the development corpus yielded an overall recall of 0.967, precision value of 0.749, and fallout value of approximately 0.002. On the test corpus, a total of 90 instances of false negatives were found, or 27 per 100,000 word count, with an estimated recall of 0.943. Only one full date and one age over 89 were missed. No patient names were missed in either corpus.

Conclusion: We have developed a pattern-matching de-identification system based on dictionary look-ups, regular expressions, and heuristics. Evaluation based on two different sets of nursing notes collected from a U.S. hospital suggests that, in terms of recall, the software out-performs a single human de-identifier (0.81) and performs at least as well as a consensus of two human de-identifiers (0.94). The system is currently tuned to de-identify PHI in nursing notes and discharge summaries but is sufficiently generalized and can be customized to handle text files of any format. Although the accuracy of the algorithm is high, it is probably insufficient to be used to publicly disseminate medical data. The open-source de-identification software and the gold standard re-identified corpus of medical records have therefore been made available to researchers via the PhysioNet website to encourage improvements in the algorithm.
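
The general idea of combining look-up tables with regular expressions can be sketched in a few lines. This is a toy Python illustration, not the Perl package described above: the name list, the two patterns, the label names, and the sample note are all invented for the example, and a real system would need far broader dictionaries, patterns, and heuristics.

```python
import re
from typing import List, Tuple

# Hypothetical look-up table of known surnames (lower-cased).
KNOWN_NAMES = {"smith", "garcia"}

# Hypothetical patterns for two PHI types; real systems need many more.
PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def find_phi(text: str) -> List[Tuple[int, int, str]]:
    """Return sorted (start, end, label) spans found by patterns and look-up."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    # Dictionary look-up: flag any alphabetic word that is a known name.
    for match in re.finditer(r"[A-Za-z]+", text):
        if match.group().lower() in KNOWN_NAMES:
            spans.append((match.start(), match.end(), "NAME"))
    return sorted(spans)


def redact(text: str) -> str:
    """Replace each detected span with a bracketed label, right to left."""
    redacted_text = text
    for start, end, label in sorted(find_phi(text), reverse=True):
        redacted_text = redacted_text[:start] + f"[{label}]" + redacted_text[end:]
    return redacted_text


note = "Pt seen by Dr. Smith on 03/14/2008; call 555-123-4567 with results."
redacted = redact(note)
print(redacted)  # Pt seen by Dr. [NAME] on [DATE]; call [PHONE] with results.
```

Replacing spans from right to left keeps earlier character offsets valid while the string changes length. The sketch assumes detected spans do not overlap.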

Author and article information

Journal

Journal ID (nlm-journal-id): 100970413

Journal ID (pubmed-jr-id): 22289

Journal ID (nlm-ta): J Biomed Inform

Journal ID (iso-abbrev): J Biomed Inform

Title: Journal of biomedical informatics

ISSN (Print): 1532-0464

ISSN (Electronic): 1532-0480

Publication date Nihms-submitted: 12 March 2016

Publication date (Electronic): 20 May 2014

Publication date (Print): August 2014

Publication date PMC-release: 04 October 2017

Volume: 50

Pages: 162-172

Affiliations

[a] Department of Biomedical Informatics, University of Utah, Salt Lake City, UT, USA

[b] VA Salt Lake City Health Care System, Salt Lake City, UT, USA

[c] Nuance Communications Inc., Burlington, MA, USA

[d] Department of Internal Medicine, University of Utah, Salt Lake City, UT, USA

[e] Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA

[f] VA Health Care System, San Diego, CA, USA

[g] Division of Biomedical Informatics, University of California San Diego, La Jolla, CA, USA

Author notes

[*] Corresponding author at: University of Utah, Department of Biomedical Informatics, 421 Wakara Way, Suite 140, Salt Lake City, UT 84112, USA. brett.south@hsc.utah.edu (B.R. South)

Article

Manuscript ID: NIHMS656469

DOI: 10.1016/j.jbi.2014.05.002

PMC ID: 5627768

PubMed ID: 24859155

SO-VID: 13763f1c-0cde-45f3-811a-71c8ce93fdc9

License:

This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/3.0/).
