      A De-identification Method for Bilingual Clinical Texts of Various Note Types

      research-article


          Abstract

          De-identification of personal health information is essential in order not to require written patient informed consent. Previous de-identification methods were proposed using natural language processing technology in order to remove the identifiers in clinical narrative text, although these methods only focused on narrative text written in English. In this study, we propose a regular expression-based de-identification method used to address bilingual clinical records written in Korean and English. To develop and validate regular expression rules, we obtained training and validation datasets composed of 6,039 clinical notes of 20 types and 5,000 notes of 33 types, respectively. Fifteen regular expression rules were constructed using the development dataset and those rules achieved 99.87% precision and 96.25% recall for the validation dataset. Our de-identification method successfully removed the identifiers in diverse types of bilingual clinical narrative texts. This method will thus assist physicians to more easily perform retrospective research.
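
          The rule-based approach described in the abstract can be sketched roughly as follows. This is a minimal illustration only: the three identifier categories, the patterns, and the names RULES, deidentify, and precision_recall are assumptions made for this example and are not the fifteen rules developed in the study. Digit lookarounds are used instead of \b because Korean particles often follow a number directly, and Python treats Hangul as word characters.

```python
import re

# Hypothetical rules for illustration; NOT the fifteen rules from the study.
# Each rule is (placeholder label, compiled pattern), applied in list order.
RULES = [
    # Korean resident registration number: 6 digits, hyphen, 7 digits.
    ("RRN", re.compile(r"(?<!\d)\d{6}-\d{7}(?!\d)")),
    # Korean mobile phone number such as 010-1234-5678.
    ("PHONE", re.compile(r"(?<!\d)01[016789]-\d{3,4}-\d{4}(?!\d)")),
    # Numeric dates such as 2014-02-26, 2014.02.26, or 2014/02/26.
    ("DATE", re.compile(r"(?<!\d)\d{4}[-./]\d{1,2}[-./]\d{1,2}(?!\d)")),
    # Korean-style dates such as 2014년 3월 5일.
    ("DATE", re.compile(r"(?<!\d)\d{4}년\s*\d{1,2}월\s*\d{1,2}일")),
]


def deidentify(text):
    """Replace every match of every rule with a bracketed placeholder."""
    for label, pattern in RULES:
        text = pattern.sub(f"[{label}]", text)
    return text


def precision_recall(true_positives, false_positives, false_negatives):
    """Compute precision and recall from counts of identifier instances."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall


if __name__ == "__main__":
    note = "주민번호 800101-1234567, Tel 010-1234-5678, admitted 2014.02.26, f/u 2014년 3월 5일"
    print(deidentify(note))
```

          In a validation setting, the counts passed to precision_recall would come from comparing the placeholders against manually annotated identifiers; how the study performed that comparison is not stated in the abstract.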


          Most cited references (43)


          Evaluating the state-of-the-art in automatic de-identification.

          To facilitate and survey studies in automatic de-identification, as a part of the i2b2 (Informatics for Integrating Biology to the Bedside) project, authors organized a Natural Language Processing (NLP) challenge on automatically removing private health information (PHI) from medical discharge records. This manuscript provides an overview of this de-identification challenge, describes the data and the annotation process, explains the evaluation metrics, discusses the nature of the systems that addressed the challenge, analyzes the results of received system runs, and identifies directions for future research. The de-identification challenge data consisted of discharge summaries drawn from the Partners Healthcare system. Authors prepared this data for the challenge by replacing authentic PHI with synthesized surrogates. To focus the challenge on non-dictionary-based de-identification methods, the data was enriched with out-of-vocabulary PHI surrogates, i.e., made-up names. The data also included some PHI surrogates that were ambiguous with medical non-PHI terms. A total of seven teams participated in the challenge. Each team submitted up to three system runs, for a total of sixteen submissions. The authors used precision, recall, and F-measure to evaluate the submitted system runs based on their token-level and instance-level performance on the ground truth. The systems with the best performance scored above 98% in F-measure for all categories of PHI. Most out-of-vocabulary PHI could be identified accurately. However, identifying ambiguous PHI proved challenging. The performance of systems on the test data set is encouraging. Future evaluations of these systems will involve larger data sets from more heterogeneous sources.

            Automated de-identification of free-text medical records

            Background Text-based patient medical records are a vital resource in medical research. In order to preserve patient confidentiality, however, the U.S. Health Insurance Portability and Accountability Act (HIPAA) requires that protected health information (PHI) be removed from medical records before they can be disseminated. Manual de-identification of large medical record databases is prohibitively expensive, time-consuming and prone to error, necessitating automatic methods for large-scale, automated de-identification. Methods We describe an automated Perl-based de-identification software package that is generally usable on most free-text medical records, e.g., nursing notes, discharge summaries, X-ray reports, etc. The software uses lexical look-up tables, regular expressions, and simple heuristics to locate both HIPAA PHI, and an extended PHI set that includes doctors' names and years of dates. To develop the de-identification approach, we assembled a gold standard corpus of re-identified nursing notes with real PHI replaced by realistic surrogate information. This corpus consists of 2,434 nursing notes containing 334,000 words and a total of 1,779 instances of PHI taken from 163 randomly selected patient records. This gold standard corpus was used to refine the algorithm and measure its sensitivity. To test the algorithm on data not used in its development, we constructed a second test corpus of 1,836 nursing notes containing 296,400 words. The algorithm's false negative rate was evaluated using this test corpus. Results Performance evaluation of the de-identification software on the development corpus yielded an overall recall of 0.967, precision value of 0.749, and fallout value of approximately 0.002. On the test corpus, a total of 90 instances of false negatives were found, or 27 per 100,000 word count, with an estimated recall of 0.943. Only one full date and one age over 89 were missed. No patient names were missed in either corpus. 
Conclusion We have developed a pattern-matching de-identification system based on dictionary look-ups, regular expressions, and heuristics. Evaluation based on two different sets of nursing notes collected from a U.S. hospital suggests that, in terms of recall, the software out-performs a single human de-identifier (0.81) and performs at least as well as a consensus of two human de-identifiers (0.94). The system is currently tuned to de-identify PHI in nursing notes and discharge summaries but is sufficiently generalized and can be customized to handle text files of any format. Although the accuracy of the algorithm is high, it is probably insufficient to be used to publicly disseminate medical data. The open-source de-identification software and the gold standard re-identified corpus of medical records have therefore been made available to researchers via the PhysioNet website to encourage improvements in the algorithm.

              Automatic de-identification of textual documents in the electronic health record: a review of recent research

              Background In the United States, the Health Insurance Portability and Accountability Act (HIPAA) protects the confidentiality of patient data and requires the informed consent of the patient and approval of the Internal Review Board to use data for research purposes, but these requirements can be waived if data is de-identified. For clinical data to be considered de-identified, the HIPAA "Safe Harbor" technique requires 18 data elements (called PHI: Protected Health Information) to be removed. The de-identification of narrative text documents is often realized manually, and requires significant resources. Well aware of these issues, several authors have investigated automated de-identification of narrative text documents from the electronic health record, and a review of recent research in this domain is presented here. Methods This review focuses on recently published research (after 1995), and includes relevant publications from bibliographic queries in PubMed, conference proceedings, the ACM Digital Library, and interesting publications referenced in already included papers. Results The literature search returned more than 200 publications. The majority focused only on structured data de-identification instead of narrative text, on image de-identification, or described manual de-identification, and were therefore excluded. Finally, 18 publications describing automated text de-identification were selected for detailed analysis of the architecture and methods used, the types of PHI detected and removed, the external resources used, and the types of clinical documents targeted. All text de-identification systems aimed to identify and remove person names, and many included other types of PHI. Most systems used only one or two specific clinical document types, and were mostly based on two different groups of methodologies: pattern matching and machine learning. 
Many systems combined both approaches for different types of PHI, but the majority relied only on pattern matching, rules, and dictionaries. Conclusions In general, methods based on dictionaries performed better with PHI that is rarely mentioned in clinical text, but are more difficult to generalize. Methods based on machine learning tend to perform better, especially with PHI that is not mentioned in the dictionaries used. Finally, the issues of anonymization, sufficient performance, and "over-scrubbing" are discussed in this publication.

                Author and article information

                Journal
                Journal of Korean Medical Science (J Korean Med Sci; JKMS)
                The Korean Academy of Medical Sciences
                ISSN: 1011-8934, 1598-6357
                January 2015
                23 December 2014
                Volume: 30
                Issue: 1
                Pages: 7-15
                Affiliations
                [1 ]Department of Biomedical Informatics, Asan Medical Center, Seoul, Korea.
                [2 ]Office of Clinical Research Information, Asan Medical Center, Seoul, Korea.
                [3 ]Department of Clinical Epidemiology and Biostatistics, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Korea.
                [4 ]Department of Pulmonary and Critical Care Medicine, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Korea.
                [5 ]Department of Oncology, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Korea.
                [6 ]Department of Emergency Medicine, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Korea.
                [7 ]Division of General Internal Medicine, Brigham and Women's Hospital, Boston, MA, USA.
                Author notes
                Address for Correspondence: Jae Ho Lee, MD. Department of Biomedical Informatics, Asan Medical Center, 88 Olympic-ro 43-gil, Songpa-gu, Seoul 138-736, Korea. Tel: +82.2-3010-5875, Fax: +82.2-3010-8126, rufiji@gmail.com

                *Soo-Yong Shin and Yu Rang Park contributed equally to this work.

                Author information
                http://orcid.org/0000-0002-2410-6120
                http://orcid.org/0000-0002-4210-2094
                http://orcid.org/0000-0002-1144-9458
                http://orcid.org/0000-0003-4715-2901
                http://orcid.org/0000-0002-1872-8708
                http://orcid.org/0000-0002-9363-252X
                http://orcid.org/0000-0003-1085-9073
                http://orcid.org/0000-0002-2881-4669
                http://orcid.org/0000-0002-1254-1264
                http://orcid.org/0000-0003-2619-1231
                Article
                DOI: 10.3346/jkms.2015.30.1.7
                PMCID: PMC4278030
                PMID: 25552878
                © 2015 The Korean Academy of Medical Sciences.

                This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License ( http://creativecommons.org/licenses/by-nc/4.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

                History
                Received: 26 February 2014
                Accepted: 29 August 2014
                Funding
                Funded by: Asan Institute for Life Sciences
                Award ID: 2013-7205
                Categories
                Original Article
                Medical Informatics

                Medicine
                de-identification, anonymization, clinical text, bilingual text, patient privacy, medical informatics, text mining
