Delivering Fit-for-Use Data: Quality control

Abstract

Automatic data mining is not an easy task, and its success in the biodiversity world is deeply tied to the standardization and consistency of scientific journals' layout structure. The various formatting styles found in the over 500 million pages of published biodiversity information (Kalfatovich 2010) pose a remarkable challenge to the goal of automating the liberation of data currently trapped on the printed page. Regular expressions and other pattern-recognition strategies invariably fail to cope with this diverse landscape of academic publishing. Challenges such as incomplete data and taxonomic uncertainty add further layers of complexity. However, in the era of big data, the liberation of all the different facts contained in biodiversity literature is of crucial importance. Plazi tackles this daunting task by providing workflows and technology to automatically process biodiversity publications and annotate the information therein, all within the principles of FAIR (findable, accessible, interoperable, and reusable) data usage (Agosti and Egloff 2009). It uses the concept of taxonomic treatments (Catapano 2019) as the most fundamental unit in biodiversity literature, providing a framework that reflects the reality of taxonomic data and links the different pieces of information contained in these treatments. Treatment citations, composed of a taxonomic name and a bibliographic reference, and material citations, carrying all specimen-related information, are additional conceptual cornerstones of this framework. The resulting enhanced data are added to TreatmentBank. Figures and treatments are made FAIR by depositing them, together with specific metadata, in the Biodiversity Literature Repository (BLR) community on Zenodo, the repository hosted by the European Organization for Nuclear Research (CERN), and are pushed to GBIF.
The automation, however, is error-prone due to the constraints explained above. To cope with this remarkable task without compromising data quality, Plazi has established a quality control process based on logical rules that check the components of the extracted document and raise errors at four different levels of severity. These errors are also used in a data transit control mechanism, “the gatekeeper”, which blocks certain data transits, such as the creation of deposits (e.g., BLR) or the reuse of data (e.g., GBIF), in the presence of specific errors. Finally, a set of automatic notifications was included in the plazi/community GitHub repository to provide a channel that empowers external users to report data issues directly to a dedicated team of data miners, who will in turn fix these issues in a timely manner, improving data quality on demand. In this talk, we aim to explain Plazi’s internal quality control process and its phases, the data transits that are potentially affected, and statistics on the most common issues raised by this automated endeavor, as well as how we use the generated data to continuously improve this important step in Plazi's workflow.
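The rule-based checks and the gatekeeper described above can be illustrated with a minimal sketch. It is not Plazi's implementation: the abstract names neither the four severity levels nor the individual rules, so the level names, the `Treatment` fields, the three rules, and the mapping of blocking rules to transits are all hypothetical and chosen only to show the general shape of such a mechanism.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Callable, List, Optional


class Severity(IntEnum):
    # The abstract mentions four severity levels; these names are hypothetical.
    MINOR = 1
    MEDIUM = 2
    CRITICAL = 3
    BLOCKER = 4


@dataclass
class Treatment:
    # Hypothetical, heavily simplified view of an extracted taxonomic treatment.
    taxon_name: str
    bibliographic_reference: str
    material_citations: List[str] = field(default_factory=list)


@dataclass
class QcError:
    rule: str
    severity: Severity
    message: str


# A rule inspects a treatment and returns an error, or None if the check passes.
Rule = Callable[[Treatment], Optional[QcError]]


def rule_taxon_name_present(t: Treatment) -> Optional[QcError]:
    if not t.taxon_name.strip():
        return QcError("taxon_name_present", Severity.BLOCKER, "no taxon name")
    return None


def rule_reference_present(t: Treatment) -> Optional[QcError]:
    if not t.bibliographic_reference.strip():
        return QcError("reference_present", Severity.CRITICAL, "no reference")
    return None


def rule_material_citations_present(t: Treatment) -> Optional[QcError]:
    if not t.material_citations:
        return QcError("material_citations_present", Severity.MINOR,
                       "no material citations")
    return None


RULES: List[Rule] = [
    rule_taxon_name_present,
    rule_reference_present,
    rule_material_citations_present,
]

# Assumed mapping: which rule failures block which data transit.
BLOCKING_RULES = {
    "BLR": {"taxon_name_present"},
    "GBIF": {"taxon_name_present", "reference_present"},
}


def run_quality_control(t: Treatment) -> List[QcError]:
    """Apply every rule and collect the errors that were raised."""
    errors = []
    for rule in RULES:
        error = rule(t)
        if error is not None:
            errors.append(error)
    return errors


def gatekeeper(errors: List[QcError], transit: str) -> bool:
    """Return True if the transit may proceed, False if an error blocks it."""
    blocking = BLOCKING_RULES[transit]
    return not any(e.rule in blocking for e in errors)
```

In this sketch, a treatment with a taxon name but no bibliographic reference would still pass the gatekeeper for the hypothetical "BLR" transit and be held back from "GBIF", which mirrors the idea that only specific errors block specific transits.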

Most cited references

Taxonomic information exchange and copyright: the Plazi approach

Donat Agosti, Willi Egloff (2009)

Background: A large part of our knowledge on the world's species is recorded in the corpus of biodiversity literature, with well over a hundred million pages, and is represented in natural history collections estimated at 2–3 billion specimens. But this body of knowledge is almost entirely in paper-print form and is not directly accessible through the Internet. For the digitization of this literature, new territories have to be charted in the fields of technical, legal and social issues that presently impede its advance. The taxonomic literature seems especially destined for such a transformation.

Discussion: Plazi was founded as an association with the primary goal of transforming both the printed and, more recently, "born-digital" taxonomic literature into semantically enabled, enhanced documents. This includes the creation of a test body of literature, an XML schema modeling its logic content (TaxonX), the development of a mark-up editor (GoldenGATE) that also allows the enhancement of documents with links to external resources via Life Science Identifiers (LSID), a repository for publications and issuance of bibliographic identifiers, a dedicated server to serve the marked-up content (the Plazi Search and Retrieval Server, SRS) and semantic tools to mine information. Plazi's workflow is designed to respect copyright protection and achieves extraction by observing exceptions and limitations existent in international copyright law.

Conclusion: The information found in Plazi's databases (taxonomic treatments as well as the metadata of the publications) is in the public domain and can therefore be used for further scientific research without any restriction, whether or not contained in copyrighted publications.