      Is Open Access

      Data Auditing, Cleaning and Quality Assurance Workflows from the Experience of a Scholarly Publisher

      Biodiversity Information Science and Standards

      Pensoft Publishers


          Abstract

          Data publishing has become an important task on the agenda of many scholarly publishers in the last decade, but far less attention has been paid to the actual reviewing and quality checking of the published data. Quality checks are often delegated to the reviewers of the article narrative, many of whom may not be qualified to provide a professional data review. The talk presents the workflows developed and used by Pensoft journals to provide data auditing, cleaning and quality assurance. These are:

          • Data auditing/cleaning workflow for datasets published as data papers (Fig. 1, see also this blog). All datasets undergo an audit for compliance with a data quality checklist prior to peer review. The author is provided with an audit report and is asked to correct the data flaws and consider the other recommendations in the report. This check is conducted regardless of whether the datasets are provided as supplementary material within the data paper manuscript or linked from the Global Biodiversity Information Facility (GBIF) or another repository. The manuscript is not forwarded to peer review until the author corrects the data associated with it. This workflow is applied in all journals of the publisher’s portfolio, including Biodiversity Data Journal, ZooKeys, PhytoKeys, MycoKeys and others.

          • Automated check and validation of data within the article narrative in the Biodiversity Data Journal, provided during the authoring process in the ARPHA Writing Tool (AWT) and, subsequently, during the peer review process in the journal. Among others, the automated validation tool checks for compliance with the biological Codes (for example, a new species description cannot be submitted without designation of a holotype and the respective specimen record).

          • Check for consistency and validation of the full-text JATS XML against the TaxPub XML schema. This quality check ensures a successful full-text submission and display on PubMedCentral, extraction of taxon treatments and their visualisation on Plazi's TreatmentBank and GBIF, indexing in various data aggregators, and so on.

          • Human-provided quality check of the mass automated extraction of taxon treatments from legacy literature via the GoldenGate-Imagine workflow developed by Plazi and implemented for the purposes of the Arcadia project in collaboration with Pensoft.

          • Putting high-quality data in valid XML formats (Pensoft's JATS and Plazi's TaxonX) into a machine-readable semantic format (RDF) to guarantee efficient extraction and transformation. Data from Pensoft's journals and Plazi's treatments are uploaded as semantic triples into the OpenBiodiv Biodiversity Knowledge Graph, where they are modeled according to the OpenBiodiv-O ontology (Senderov et al. 2018). Semantic technologies facilitate the mapping of scientific names in OpenBiodiv to GBIF's taxonomic backbone and the addressing of complex biodiversity questions.

          We have realised over many years' experience in data publishing that data quality checking and assurance testing require specific knowledge and competencies, which also vary between the various methods of data handling and management, such as relational databases, semantic XML tagging, Linked Open Data, and others. This process cannot be entrusted to peer reviewers alone and requires the participation of dedicated data scientists and information specialists in the routine publishing process. This is the only way to make the published biodiversity data, such as taxon descriptions, occurrence records, biological observations and specimen characteristics, truly FAIR (Findable, Accessible, Interoperable, Reusable), so that they can be merged, reformatted and incorporated into novel and visionary projects, regardless of whether they are accessed by a human researcher or a data-mining process.
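          To make the idea of a pre-review dataset audit more concrete, the following is a minimal sketch of how a few checklist rules might be run over occurrence records. The rules, the function name audit_records and the sample data are illustrative assumptions and do not reproduce the checklist or tooling described in the abstract. The column names are Darwin Core terms, and the date rule is deliberately simplified to plain calendar dates.

```python
import csv
import io
from datetime import date

# Hypothetical checklist: these rules are illustrative assumptions,
# not the checklist used by the publisher.
REQUIRED = ("occurrenceID", "scientificName", "eventDate")


def audit_records(rows):
    """Return a list of (row_number, field, message) findings."""
    findings = []
    seen_ids = set()
    for n, row in enumerate(rows, start=1):
        # Rule 1: required fields must not be empty.
        for field in REQUIRED:
            if not (row.get(field) or "").strip():
                findings.append((n, field, "missing value"))
        # Rule 2: record identifiers must be unique within the dataset.
        occ_id = (row.get("occurrenceID") or "").strip()
        if occ_id:
            if occ_id in seen_ids:
                findings.append((n, "occurrenceID", "duplicate identifier"))
            seen_ids.add(occ_id)
        # Rule 3: coordinates, when present, must be numeric and in range.
        for field, limit in (("decimalLatitude", 90.0), ("decimalLongitude", 180.0)):
            value = (row.get(field) or "").strip()
            if not value:
                continue
            try:
                number = float(value)
            except ValueError:
                findings.append((n, field, "not a number"))
                continue
            if abs(number) > limit:
                findings.append((n, field, "out of range"))
        # Rule 4 (simplified): dates must parse as ISO 8601 calendar dates.
        event_date = (row.get("eventDate") or "").strip()
        if event_date:
            try:
                date.fromisoformat(event_date)
            except ValueError:
                findings.append((n, "eventDate", "not an ISO 8601 date (YYYY-MM-DD)"))
    return findings


# Made-up sample data with deliberate flaws in the second and third records.
SAMPLE = """occurrenceID,scientificName,decimalLatitude,decimalLongitude,eventDate
occ-1,Aus bus,42.7,23.3,2018-05-14
occ-1,Aus bus,95.0,23.3,14/05/2018
occ-3,,abc,200,2018-05-16
"""

if __name__ == "__main__":
    for finding in audit_records(csv.DictReader(io.StringIO(SAMPLE))):
        print(finding)
```

          Each finding names the record, the field and the problem, which is roughly the information an audit report would need to pass back to the author. A real audit would cover far more than these four rules and would still need review by a data specialist.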

          Most cited references 1

          Is Open Access

          OpenBiodiv-O: ontology of the OpenBiodiv knowledge management system

          Background: The biodiversity domain, and in particular biological taxonomy, is moving in the direction of semantization of its research outputs. The present work introduces OpenBiodiv-O, the ontology that serves as the basis of the OpenBiodiv Knowledge Management System. Our intent is to provide an ontology that fills the gaps between ontologies for biodiversity resources, such as DarwinCore-based ontologies, and semantic publishing ontologies, such as the SPAR Ontologies. We bridge this gap by providing an ontology focusing on biological taxonomy.

          Results: OpenBiodiv-O introduces classes, properties, and axioms in the domains of scholarly biodiversity publishing and biological taxonomy and aligns them with several important domain ontologies (FaBiO, DoCO, DwC, Darwin-SW, NOMEN, ENVO). By doing so, it bridges the ontological gap across scholarly biodiversity publishing and biological taxonomy and allows for the creation of a Linked Open Dataset (LOD) of biodiversity information (a biodiversity knowledge graph) and enables the creation of the OpenBiodiv Knowledge Management System. A key feature of the ontology is that it is an ontology of the scientific process of biological taxonomy and not of any particular state of knowledge. This feature allows it to express a multiplicity of scientific opinions. The resulting OpenBiodiv knowledge system may gain a high level of trust in the scientific community as it does not force a scientific opinion on its users (e.g. practicing taxonomists, library researchers, etc.), but rather provides the tools for experts to encode different views as science progresses.

          Conclusions: OpenBiodiv-O provides a conceptual model of the structure of a biodiversity publication and the development of related taxonomic concepts. It also serves as the basis for the OpenBiodiv Knowledge Management System.

          Electronic supplementary material: The online version of this article (doi:10.1186/s13326-017-0174-5) contains supplementary material, which is available to authorized users.
            Author and article information

            Journal: Biodiversity Information Science and Standards (BISS)
            Publisher: Pensoft Publishers
            ISSN: 2535-0897
            Published: June 13 2019
            Volume: 3
            Type: Article
            DOI: 10.3897/biss.3.35019
            © 2019
