The effect of data cleaning on record linkage quality


Abstract

Background

Within the field of record linkage, numerous data cleaning and standardisation techniques are employed to ensure the highest quality of links. While these facilities are common in record linkage software packages and are regularly deployed across record linkage units, little work has been published demonstrating the impact of data cleaning on linkage quality.

Methods

A range of cleaning techniques was applied to both a synthetically generated dataset and a large administrative dataset previously linked to a high standard. The effect of these changes on linkage quality was measured using the pairwise F-measure.
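As a rough illustration of the quality measure named above, the sketch below computes a pairwise F-measure by comparing the record pairs implied by a linkage result with the pairs implied by a reference ("truth") grouping. The function names, record identifiers, and groupings are hypothetical and chosen only for illustration; the abstract does not describe how the authors implemented the calculation.

```python
from itertools import combinations


def linked_pairs(groups):
    """Return the set of unordered record pairs placed in the same group."""
    pairs = set()
    for members in groups:
        # Sorting gives each pair a canonical order, so (a, b) == (b, a).
        for a, b in combinations(sorted(members), 2):
            pairs.add((a, b))
    return pairs


def pairwise_f_measure(true_groups, predicted_groups):
    """Return (precision, recall, F-measure) computed over record pairs."""
    true_pairs = linked_pairs(true_groups)
    pred_pairs = linked_pairs(predicted_groups)
    correct = len(true_pairs & pred_pairs)
    precision = correct / len(pred_pairs) if pred_pairs else 0.0
    recall = correct / len(true_pairs) if true_pairs else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical example: five records belonging to two true entities.
truth = [{"r1", "r2", "r3"}, {"r4", "r5"}]
predicted = [{"r1", "r2"}, {"r3", "r4", "r5"}]

precision, recall, f_measure = pairwise_f_measure(truth, predicted)
print(precision, recall, f_measure)
```

In this toy case two of the four predicted pairs are correct and two of the four true pairs are found, so precision, recall, and the F-measure all equal 0.5.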

Results

Data cleaning made little difference to the overall linkage quality, with heavy cleaning leading to a decrease in quality. Further examination showed that the decreases in linkage quality arose because cleaning techniques typically reduce the variability in the data: although correct records were now more likely to match, incorrect records were also more likely to match, and these incorrect matches outweighed the correct matches, reducing quality overall.
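The mechanism described above can be shown with a deliberately contrived example. The sketch assumes exact-match comparison on a single hypothetical surname field and two made-up cleaning functions (`light_clean` and `heavy_clean`); neither is taken from the study, and the counts illustrate the direction of the effect only, not its size.

```python
from itertools import combinations

VOWELS = set("AEIOU")


def light_clean(name):
    """Hypothetical light cleaning: upper-case and keep letters only."""
    return "".join(ch for ch in name.upper() if ch.isalpha())


def heavy_clean(name):
    """Hypothetical heavy cleaning: light cleaning, then drop every vowel
    after the first character, which removes much of the variability."""
    cleaned = light_clean(name)
    return cleaned[:1] + "".join(ch for ch in cleaned[1:] if ch not in VOWELS)


def count_matches(records, clean):
    """Count correct and incorrect exact matches on the cleaned name."""
    correct = incorrect = 0
    for (entity_a, name_a), (entity_b, name_b) in combinations(records, 2):
        if clean(name_a) == clean(name_b):
            if entity_a == entity_b:
                correct += 1
            else:
                incorrect += 1
    return correct, incorrect


# Made-up records: (true entity label, surname as recorded).
records = [
    ("A", "O'Brien"),
    ("A", "OBRIEN"),
    ("A", "O'Brein"),  # transposition typo for entity A
    ("B", "Obrian"),
    ("C", "Oborne"),
    ("C", "O'BORNE"),
]

light_counts = count_matches(records, light_clean)
heavy_counts = count_matches(records, heavy_clean)
print("light:", light_counts)
print("heavy:", heavy_counts)
```

Here heavy cleaning recovers the pairs involving the typo, raising correct matches from 2 to 4, but it also collapses all three surnames to the same value, raising incorrect matches from 0 to 11.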

Conclusions

Data cleaning techniques have minimal effect on linkage quality. Care should be taken during the data cleaning process.

Most cited references 11

Record linkage software in the public domain: a comparison of Link Plus, The Link King, and a 'basic' deterministic algorithm.

Dennis Deck, Antoinette Krupski, K. Campbell (2008)

The study objective was to compare the accuracy of a deterministic record linkage algorithm and two public domain software applications for record linkage (The Link King and Link Plus). The three algorithms were used to unduplicate an administrative database containing personal identifiers for over 500,000 clients. Subsequently, a random sample of linked records was submitted to four research staff for blinded clerical review. Using reviewers' decisions as the 'gold standard', sensitivity and positive predictive values (PPVs) were estimated. Optimally, sensitivity and PPVs in the mid 90s could be obtained from both The Link King and Link Plus. Sensitivity and PPVs using a basic deterministic algorithm were 79 and 98 per cent respectively. Thus the full feature set of The Link King makes it an attractive option for SAS users. Link Plus is a good choice for non-SAS users as long as necessary programming resources are available for processing record pairs identified by Link Plus.

Results from simulated data sets: probabilistic record linkage outperforms deterministic record linkage.

Johannes Reitsma, Miranda Tromp, Gouke J. Bonsel … (2011)

To gain insight into the performance of deterministic record linkage (DRL) vs. probabilistic record linkage (PRL) strategies under different conditions by varying the frequency of registration errors and the amount of discriminating power. A simulation study in which data characteristics were varied to create a range of realistic linkage scenarios. For each scenario, we compared the number of misclassifications (number of false nonlinks and false links) made by the different linking strategies: deterministic full, deterministic N-1, and probabilistic. The full deterministic strategy produced the lowest number of false positive links but at the expense of missing considerable numbers of matches dependent on the error rate of the linking variables. The probabilistic strategy outperformed the deterministic strategy (full or N-1) across all scenarios. A deterministic strategy can match the performance of a probabilistic approach providing that the decision about which disagreements should be tolerated is made correctly. This requires a priori knowledge about the quality of all linking variables, whereas this information is inherently generated by a probabilistic strategy. PRL is more flexible and provides data about the quality of the linkage process that in turn can minimize the degree of linking errors, given the data provided. Copyright © 2011 Elsevier Inc. All rights reserved.

Data linkage infrastructure for cross-jurisdictional health-related research in Australia

James Boyd, Anna M Ferrante, Christine O'Keefe … (2012)

Background: The Centre for Data Linkage (CDL) has been established to enable national and cross-jurisdictional health-related research in Australia. It has been funded through the Population Health Research Network (PHRN), a national initiative established under the National Collaborative Research Infrastructure Strategy (NCRIS). This paper describes the development of the processes and methodology required to create cross-jurisdictional research infrastructure and enable aggregation of State and Territory linkages into a single linkage “map”. Methods: The CDL has implemented a linkage model which incorporates best practice in data linkage and adheres to data integration principles set down by the Australian Government. Working closely with data custodians and State-based data linkage facilities, the CDL has designed and implemented a linkage system to enable research at national or cross-jurisdictional level. A secure operational environment has also been established with strong governance arrangements to maximise privacy and the confidentiality of data. Results: The development and implementation of a cross-jurisdictional linkage model overcomes a number of challenges associated with the federated nature of health data collections in Australia. The infrastructure expands Australia’s data linkage capability and provides opportunities for population-level research. The CDL linkage model, infrastructure architecture and governance arrangements are presented. The quality and capability of the new infrastructure is demonstrated through the conduct of data linkage for the first PHRN Proof of Concept Collaboration project, where more than 25 million records were successfully linked to a very high quality. Conclusions: This infrastructure provides researchers and policy-makers with the ability to undertake linkage-based research that extends across jurisdictional boundaries. It represents an advance in Australia’s national data linkage capabilities and sets the scene for stronger government-research collaboration.

Author and article information

Contributors

Sean M Randall

Anna M Ferrante

James H Boyd

James B Semmens

Journal

Journal ID (nlm-ta): BMC Med Inform Decis Mak

Journal ID (iso-abbrev): BMC Med Inform Decis Mak

Title: BMC Medical Informatics and Decision Making

Publisher: BioMed Central

ISSN (Electronic): 1472-6947

Publication date Collection: 2013

Publication date (Electronic): 5 June 2013

Volume: 13

Page: 64

Affiliations

[1] Centre for Data Linkage, Curtin Health Innovation Research Institute, Curtin University, Perth, WA GPO U1987, Australia

Article

Publisher ID: 1472-6947-13-64

DOI: 10.1186/1472-6947-13-64

PMC ID: 3688507

PubMed ID: 23739011

SO-VID: 3eb3ee6d-c3d8-48e6-8c87-4bf2cd47c9eb

License:

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
