      Is Open Access

      The effect of data cleaning on record linkage quality

      research-article


          Abstract

          Background

          Within the field of record linkage, numerous data cleaning and standardisation techniques are employed to ensure the highest quality of links. While these facilities are common in record linkage software packages and are regularly deployed across record linkage units, little work has been published demonstrating the impact of data cleaning on linkage quality.

          Methods

          A range of cleaning techniques was applied to both a synthetically generated dataset and a large administrative dataset previously linked to a high standard. The effect of these changes on linkage quality was measured using the pairwise F-measure.
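
As an illustration of the quality measure named above, the following is a minimal sketch of a pairwise F-measure, assuming the usual definition as the harmonic mean of pairwise precision and pairwise recall. The function names, record IDs, and groupings are hypothetical and are not taken from the article.

```python
from itertools import combinations


def linked_pairs(groups):
    """Return the set of unordered record pairs implied by a grouping.

    `groups` maps a record ID to the ID of the group (entity) it belongs to.
    """
    by_group = {}
    for record_id, group_id in groups.items():
        by_group.setdefault(group_id, []).append(record_id)
    pairs = set()
    for members in by_group.values():
        pairs.update(frozenset(p) for p in combinations(sorted(members), 2))
    return pairs


def pairwise_f_measure(truth, predicted):
    """Return (precision, recall, F-measure) computed over record pairs."""
    true_pairs = linked_pairs(truth)
    found_pairs = linked_pairs(predicted)
    correct = len(true_pairs & found_pairs)
    precision = correct / len(found_pairs) if found_pairs else 0.0
    recall = correct / len(true_pairs) if true_pairs else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical example: five records, two true entities (A and B).
truth = {"r1": "A", "r2": "A", "r3": "A", "r4": "B", "r5": "B"}
# A hypothetical linkage result that splits entity A and wrongly joins r3 to r4.
predicted = {"r1": "x", "r2": "x", "r3": "y", "r4": "y", "r5": "z"}

precision, recall, f_measure = pairwise_f_measure(truth, predicted)
print(precision, recall, round(f_measure, 3))
```

In this toy case one of the two predicted pairs is correct and one of the four true pairs is found, so the measure penalises both the false link and the missed links.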

          Results

          Data cleaning made little difference to overall linkage quality, and heavy cleaning decreased it. Further examination showed that the decreases arose because cleaning techniques typically reduce variability in the data: correct records became more likely to match, but so did incorrect records, and the additional incorrect matches outweighed the additional correct matches, reducing quality overall.
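
The mechanism described in the Results can be sketched with a toy example. The records, the exact-match rule, and the two cleaning functions below are invented for illustration and are not the techniques or data used in the article. The sketch only shows how collapsing values can create more incorrect matches than correct ones.

```python
from itertools import combinations

# Invented toy records: (record_id, true_entity, first_name).
RECORDS = [
    (1, "A", "Katherine"),
    (2, "A", "Kathy"),
    (3, "B", "Kathleen"),
    (4, "C", "Jon"),
    (5, "C", "JON."),
    (6, "D", "John"),
    (7, "E", "Kate"),
]


def no_cleaning(name):
    """Leave the value as recorded."""
    return name


def light_cleaning(name):
    """Lowercase the value and drop non-alphabetic characters."""
    return "".join(ch for ch in name.lower() if ch.isalpha())


def heavy_cleaning(name):
    """Hypothetical aggressive step: keep only the first three cleaned letters."""
    return light_cleaning(name)[:3]


def count_matches(clean):
    """Count (correct, incorrect) record pairs whose cleaned names are equal."""
    correct = incorrect = 0
    for (_, ent_a, name_a), (_, ent_b, name_b) in combinations(RECORDS, 2):
        if clean(name_a) == clean(name_b):
            if ent_a == ent_b:
                correct += 1
            else:
                incorrect += 1
    return correct, incorrect


results = {
    fn.__name__: count_matches(fn)
    for fn in (no_cleaning, light_cleaning, heavy_cleaning)
}
for name, (correct, incorrect) in results.items():
    print(f"{name}: {correct} correct, {incorrect} incorrect")
```

Here the aggressive step recovers one extra correct match but introduces five incorrect ones, because four different people now share the value "kat".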

          Conclusions

          Data cleaning techniques have minimal effect on linkage quality. Care should be taken during the data cleaning process.

          Most cited references (11)


          Record linkage software in the public domain: a comparison of Link Plus, The Link King, and a 'basic' deterministic algorithm.

          The study objective was to compare the accuracy of a deterministic record linkage algorithm and two public domain software applications for record linkage (The Link King and Link Plus). The three algorithms were used to unduplicate an administrative database containing personal identifiers for over 500,000 clients. Subsequently, a random sample of linked records was submitted to four research staff for blinded clerical review. Using reviewers' decisions as the 'gold standard', sensitivity and positive predictive values (PPVs) were estimated. Optimally, sensitivity and PPVs in the mid 90s could be obtained from both The Link King and Link Plus. Sensitivity and PPVs using a basic deterministic algorithm were 79 and 98 per cent respectively. Thus the full feature set of The Link King makes it an attractive option for SAS users. Link Plus is a good choice for non-SAS users as long as necessary programming resources are available for processing record pairs identified by Link Plus.

            Results from simulated data sets: probabilistic record linkage outperforms deterministic record linkage.

            To gain insight into the performance of deterministic record linkage (DRL) vs. probabilistic record linkage (PRL) strategies under different conditions by varying the frequency of registration errors and the amount of discriminating power. A simulation study in which data characteristics were varied to create a range of realistic linkage scenarios. For each scenario, we compared the number of misclassifications (number of false nonlinks and false links) made by the different linking strategies: deterministic full, deterministic N-1, and probabilistic. The full deterministic strategy produced the lowest number of false positive links but at the expense of missing considerable numbers of matches dependent on the error rate of the linking variables. The probabilistic strategy outperformed the deterministic strategy (full or N-1) across all scenarios. A deterministic strategy can match the performance of a probabilistic approach providing that the decision about which disagreements should be tolerated is made correctly. This requires a priori knowledge about the quality of all linking variables, whereas this information is inherently generated by a probabilistic strategy. PRL is more flexible and provides data about the quality of the linkage process that in turn can minimize the degree of linking errors, given the data provided. Copyright © 2011 Elsevier Inc. All rights reserved.
              Is Open Access

              Data linkage infrastructure for cross-jurisdictional health-related research in Australia

              Background: The Centre for Data Linkage (CDL) has been established to enable national and cross-jurisdictional health-related research in Australia. It has been funded through the Population Health Research Network (PHRN), a national initiative established under the National Collaborative Research Infrastructure Strategy (NCRIS). This paper describes the development of the processes and methodology required to create cross-jurisdictional research infrastructure and enable aggregation of State and Territory linkages into a single linkage “map”. Methods: The CDL has implemented a linkage model which incorporates best practice in data linkage and adheres to data integration principles set down by the Australian Government. Working closely with data custodians and State-based data linkage facilities, the CDL has designed and implemented a linkage system to enable research at national or cross-jurisdictional level. A secure operational environment has also been established with strong governance arrangements to maximise privacy and the confidentiality of data. Results: The development and implementation of a cross-jurisdictional linkage model overcomes a number of challenges associated with the federated nature of health data collections in Australia. The infrastructure expands Australia’s data linkage capability and provides opportunities for population-level research. The CDL linkage model, infrastructure architecture and governance arrangements are presented. The quality and capability of the new infrastructure is demonstrated through the conduct of data linkage for the first PHRN Proof of Concept Collaboration project, where more than 25 million records were successfully linked to a very high quality. Conclusions: This infrastructure provides researchers and policy-makers with the ability to undertake linkage-based research that extends across jurisdictional boundaries. It represents an advance in Australia’s national data linkage capabilities and sets the scene for stronger government-research collaboration.

                Author and article information

                Contributors
                Journal
                BMC Med Inform Decis Mak
                BMC Medical Informatics and Decision Making
                BioMed Central
                ISSN: 1472-6947
                Published: 5 June 2013
                Volume: 13
                Article number: 64
                Affiliations
                [1] Centre for Data Linkage, Curtin Health Innovation Research Institute, Curtin University, Perth, WA GPO U1987, Australia
                Article
                1472-6947-13-64
                DOI: 10.1186/1472-6947-13-64
                PMCID: PMC3688507
                PMID: 23739011
                3eb3ee6d-c3d8-48e6-8c87-4bf2cd47c9eb
                Copyright ©2013 Randall et al.; licensee BioMed Central Ltd.

                This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

                History
                Received: 17 March 2013
                Accepted: 29 May 2013
                Categories
                Research Article

                Bioinformatics & Computational biology
                data cleaning, data quality, medical record linkage
