Task Group 2 – Data Quality Tests and Assertions

Abstract

Motivation

Other than data availability, ‘Data Quality’ is probably the most significant issue for users of biodiversity data, and this is especially so for the research community. The Data Quality Tests and Assertions Task Group (TG-2) of the Biodiversity Information Standards (TDWG) Biodiversity Data Quality Interest Group is reviewing practical aspects of ‘data quality’, with the goal of providing current best practice at the key interface between data users and data providers: tests and assertions. If an internationally agreed standard suite of core tests and resulting assertions can be used by all data providers and aggregators, and hopefully by data collectors, then greater and more appropriate use could be made of biodiversity data. By adopting this suite of core tests, data providers, and particularly aggregators such as the Global Biodiversity Information Facility (GBIF) and its nodes, would gain credibility with the user communities and could provide more effective information for evaluating ‘fitness for use’.

Goals, Outputs and Outcomes

A standard core (fundamental) set of tests and associated assertions based around Darwin Core terms
A standard suite of descriptive fields for each test
Broad deployment of the tests, from collector to aggregator
A set of basic principles for the creation of tests/assertions
Software that provides an example implementation of each test
Data that can be used to validate an implementation of the tests
A publication that captures the knowledge built during the creation of the tests/assertions

Strategy

The tests and rules that generate assertions at the record level are more fundamental than the tools or workflows that will be based on them. The priority is to create a fully documented suite of core tests that defines a framework for ready extension across terms and domains.

Status 2019-2020

The core tests have proven to be far more complex than any of the team had anticipated. Several times over the past three years, we believed we had finalized the tests, only to find new issues that required a fresh understanding and subsequent edits. The most recent example is the dropping of the two tests related to dwc:identificationQualifier: TG2-VALIDATION_IDENTIFICATIONQUALIFIER_DETECTED and TG2-AMENDMENT_IDENTIFICATIONQUALIFIER_FROM_TAXON. This decision resulted from a review of dwc:identificationQualifier values in GBIF records and an evaluation of expected values based on the Darwin Core definition of the term. Aside from there being many values, the term expects the qualifier in relation to a given taxonomic name, and the rules of open nomenclature are too unevenly adopted across data records for dwc:identificationQualifier to be reliably parsed and detected, so these tests could not be effective. A similar situation occurs for dwc:scientificName, where we have resorted to the term “polynomial” to refer to the non-authorship part of dwc:scientificName.

What has occurred during the past year?

Months of work on discussions and edits to the GitHub issues (= mainly the tests), conducted mainly via Zoom and email.
We had hoped to have a face-to-face meeting in Bariloche, Argentina early in 2020, but the coronavirus pandemic prevented it. This was unfortunate, as we needed this meeting to discuss the remaining complex issues noted above. Attempting to address such issues by Zoom has been far less efficient.
We are occasionally re-visiting decisions made years earlier, an indication that we have been doing this work for (too) many years.
We have now standardized all the test parameters for the 99 CORE tests. Much work has gone into standardizing the phrasing and terminology within the 'Expected response' field of the tests, the parameter that most clearly defines each test.
Two of the test fields that have taken most of our time to resolve have been ‘Parameters’ and what we now call ‘bdq:sourceAuthority’ (Chapman et al. 2020a). These are now complete. The work on ‘Parameters’ has fed into Task Group 4 on Vocabularies of Values (see Vocabularies needed for Darwin Core terms prepared by TG4).
We have published the work from the Data Quality Interest and Task Groups: Chapman et al. 2020b.
We have extended the vocabulary that has been used for the Tests and Assertions.
Development of the datasets that validate the implementation of the tests continues.
We recognize the dependence on the work of the Annotations Interest Group for the results from the tests to have maximal impact. It is important that test results stay with the records.

We will provide details of the challenges, the breakdown of the tests and the advances of the project.
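
As an illustration of what a record-level test and its resulting assertion might look like, the following is a minimal sketch only, not one of the TG-2 core tests or their reference implementation. The test name, the `Assertion` fields, and the status and result values are all hypothetical, chosen for this example. The check accepts only what Python's `date.fromisoformat` accepts, which is a deliberate simplification of real dwc:eventDate values.

```python
from dataclasses import dataclass
from datetime import date
from typing import Mapping, Optional


@dataclass(frozen=True)
class Assertion:
    """Outcome of one test on one record (field names are illustrative)."""
    test: str
    status: str            # whether the test could be evaluated at all
    result: Optional[str]  # outcome, or None when the test could not run
    comment: str


def validate_eventdate_format(record: Mapping[str, str]) -> Assertion:
    """Hypothetical validation: is dwc:eventDate a parseable calendar date?"""
    test = "EXAMPLE_VALIDATION_EVENTDATE_FORMAT"  # hypothetical name
    value = (record.get("dwc:eventDate") or "").strip()
    if not value:
        # Nothing to evaluate, so report that rather than a failure.
        return Assertion(test, "PREREQUISITES_NOT_MET", None,
                         "dwc:eventDate is empty")
    try:
        date.fromisoformat(value)
    except ValueError:
        return Assertion(test, "HAS_RESULT", "NOT_COMPLIANT",
                         f"dwc:eventDate '{value}' is not a parseable date")
    return Assertion(test, "HAS_RESULT", "COMPLIANT",
                     "dwc:eventDate is a parseable date")


# The assertion is a separate object, so it can be stored alongside the record.
record = {"dwc:scientificName": "Puma concolor", "dwc:eventDate": "2019-08-14"}
print(validate_eventdate_format(record))
```

Separating "could the test run" from "did the record pass" is one possible design; it lets a consumer distinguish missing data from non-compliant data when judging fitness for use.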

Most cited references

Is Open Access

Developing Standards for Improved Data Quality and for Selecting Fit for Use Biodiversity Data

Arthur Harry Chapman, Lee Belbin, Paula F. Zermoglio … (2020)

The quality of biodiversity data publicly accessible via aggregators such as GBIF (Global Biodiversity Information Facility), the ALA (Atlas of Living Australia), iDigBio (Integrated Digitized Biocollections), and OBIS (Ocean Biogeographic Information System) is often questioned, especially by the research community.

The Data Quality Interest Group, established by Biodiversity Information Standards (TDWG) and GBIF, has been engaged in four main activities: developing a framework for the assessment and management of data quality using a fitness for use approach; defining a core set of standardised tests and associated assertions based on Darwin Core terms; gathering and classifying user stories to form contextual-themed use cases, such as species distribution modelling, agrobiodiversity, and invasive species; and developing a standardised format for building and managing controlled vocabularies of values.

Using the developed framework, data quality profiles have been built from use cases to represent user needs. Quality assertions can then be used to filter data suitable for a purpose. The assertions can also be used to provide feedback to data providers and custodians to assist in improving data quality at the source. In a case study, two different implementations of tests and assertions based around the Darwin Core "Event Date" terms were tested against GBIF data to demonstrate that the tests are implementation agnostic, can be run on large aggregated datasets, and can make biodiversity data more fit for typical research uses.