      Privacy Preserving Probabilistic Record Linkage (P3RL): a novel method for linking existing health-related data and maintaining participant confidentiality




          Record linkage of existing individual health care data is an efficient way to answer important epidemiological research questions. Reuse of individual health-related data faces several problems: either a unique personal identifier, such as a social security number, is not available, or non-unique person identifiable information, such as names, is privacy protected and cannot be accessed. One way to protect privacy in probabilistic record linkage is to encrypt this sensitive information. Unfortunately, the encrypted hash codes of two names differ completely even if the plain names differ by only a single character. Therefore, standard encryption methods cannot be applied. To overcome these challenges, we developed the Privacy Preserving Probabilistic Record Linkage (P3RL) method.
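The problem with standard hashing described above can be illustrated with a minimal sketch. It assumes unkeyed SHA-256 as a stand-in for a "standard encryption method" (the abstract does not name a specific algorithm), and the two example names and the helper names `sha256_hex` and `differing_bits` are hypothetical.

```python
import hashlib


def sha256_hex(name: str) -> str:
    """Return the SHA-256 digest of a name as a hex string."""
    return hashlib.sha256(name.encode("utf-8")).hexdigest()


def differing_bits(hex_a: str, hex_b: str) -> int:
    """Count the bit positions in which two equal-length hex digests differ."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


# Two hypothetical spellings that differ by a single character.
name_a, name_b = "schmidlin", "schmidlyn"
digest_a, digest_b = sha256_hex(name_a), sha256_hex(name_b)

print(digest_a)
print(digest_b)
# A good hash function flips roughly half of the 256 output bits,
# so the digests carry no usable information about name similarity.
print(differing_bits(digest_a, digest_b), "of 256 bits differ")
```

Because the two digests are unrelated, a linkage center holding only such hashes could perform exact matching but could not compute a similarity score for near-identical names.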


          In the Privacy Preserving Probabilistic Record Linkage method we apply a three-party protocol, with two sites collecting individual data and an independent trusted linkage center as the third partner. Our method consists of three main steps: pre-processing, encryption and probabilistic record linkage. Data pre-processing and encryption are done at the sites by local personnel. To guarantee similar quality and format of variables and an identical encryption procedure at each site, the linkage center generates semi-automated pre-processing and encryption templates. To retrieve the information (i.e. data structure) needed to create the templates without ever accessing plain person identifiable information, we introduced a novel method of data masking. Sensitive string variables are encrypted using Bloom filters, which enable the calculation of similarity coefficients. For date variables, we developed special encryption procedures to handle the most common date errors. The linkage center performs probabilistic record linkage with encrypted person identifiable information and plain non-sensitive variables.
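The Bloom filter step can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the bigram tokenisation, the keyed double-hashing scheme (HMAC-SHA256 and HMAC-SHA512), the filter length of 1000 bits, the 20 hash functions, the Dice coefficient as the similarity measure, and all function names are assumptions chosen for the example.

```python
import hashlib
import hmac


def bigrams(name: str) -> set[str]:
    """Split a padded, lower-cased name into its set of two-character substrings."""
    padded = f" {name.strip().lower()} "
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(name: str, secret: bytes, size: int = 1000,
                 num_hashes: int = 20) -> frozenset[int]:
    """Return the set bit positions of a Bloom filter built from the name's bigrams.

    Both sites must use the same secret, size and num_hashes (assumed values),
    otherwise their filters are not comparable.
    """
    positions = set()
    for gram in bigrams(name):
        msg = gram.encode("utf-8")
        h1 = int.from_bytes(hmac.new(secret, msg, hashlib.sha256).digest(), "big")
        h2 = int.from_bytes(hmac.new(secret, msg, hashlib.sha512).digest(), "big")
        # Double hashing: derive num_hashes bit positions from two keyed hashes.
        for i in range(num_hashes):
            positions.add((h1 + i * h2) % size)
    return frozenset(positions)


def dice(a: frozenset[int], b: frozenset[int]) -> float:
    """Dice coefficient of two Bloom filters given as sets of set-bit positions."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


secret = b"shared-site-key"  # hypothetical key shared by the two sites only
enc = {n: bloom_encode(n, secret) for n in ("schmidlin", "schmidlyn", "meier")}

print(round(dice(enc["schmidlin"], enc["schmidlyn"]), 2))  # near-identical names
print(round(dice(enc["schmidlin"], enc["meier"]), 2))      # unrelated names
```

Unlike a plain hash of the whole name, names that share most of their bigrams also share most of their set bits, so the linkage center can score similarity without seeing the plain names.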


          In this paper we describe step by step how to link existing health-related data using encryption methods to preserve privacy of persons in the study.


          Privacy Preserving Probabilistic Record Linkage expands record linkage facilities in settings where a unique identifier is unavailable and/or regulations restrict access to the non-unique person identifiable information needed to link existing health-related data sets. Automated pre-processing and encryption fully protect sensitive information, ensuring participant confidentiality. This method is suitable not just for epidemiological research but also for any setting with similar challenges.



                Author and article information

                BMC Med Res Methodol
                BMC Medical Research Methodology
                BioMed Central (London)
                30 May 2015
                Volume: 15
                Institute of Social and Preventive Medicine (ISPM), University of Bern, Finkenhubelweg 11, CH-3012 Bern, Switzerland
                Section of Geriatrics, Boston University Medical Center, 88 East Newton St., Boston, MA 02118 USA
                © Schmidlin et al. 2015

                This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated.

                Research Article


                Keywords: Bloom filters, record linkage, probabilistic record linkage, privacy, patient confidentiality

