13
views
0
recommends
+1 Recommend
1 collections
    0
    shares

      Publish your biodiversity research with us!

      Submit your article here.

      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      Argo as a platform for integrating distinct biodiversity analytics tools into workflows for building graph databases

      , , , ,
      Proceedings of TDWG
      Pensoft Publishers

      Read this article at

      ScienceOpenPublisher
      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          Together with the increasingly growing amount of available data on biodiversity comes the proliferation of various informatics tools aimed at the collection, management and analysis of biodiversity-relevant knowledge. Consequently, we have seen how several data formats and programming languages or environments have come into use, giving rise to a problem in interoperability should anyone wish to combine the outputs of distinct tools, or to integrate them into one solution. Argo (Rak et al. 2012), an online text mining workbench based on the Unstructured Information Management Architecture (UIMA) interoperability standard, offers a means for seamlessly unifying various tools and resources into customisable text processing workflows. Among many other features, Argo provides: (1) a library of diverse tools, i.e., UIMA components, each of which is dedicated to a specific task such as loading datasets or gazetteers of interest (e.g., the Biodiversity Term Inventory), recognition of species names and their semantically related terms (Nguyen et al. 2017); (2) a graphical interface for designing workflows using components as building blocks; (3) an environment for executing and monitoring the progress of workflows; and (4) a user-interactive annotation editor for manually revising or validating results of automated processing. Recently, Argo has been extended to provide support for incorporating into workflows external web services conforming with the Representational State Transfer (REST) protocol. Taking advantage of these features, we demonstrate how we combine in-house tools and resources for named entity recognition (Batista-Navarro et al. 2017) with externally developed ones, e.g., EXTRACT (Pafilis et al. 2016), in order to build text mining workflows for populating neo4j graph databases with biodiversity-relevant knowledge. To provide a few exemplars, we focus on use cases that seek to leverage various sources of literature to capture fine-grained information on the habitat and reproductive conditions of: (1) a subset of plants catalogued in World Flora Online (Jackson and Miller 2015), and (2) tropical trees belonging to the Dipterocarpaceae family.

          Related collections

          Most cited references5

          • Record: found
          • Abstract: found
          • Article: found
          Is Open Access

          Argo: an integrative, interactive, text mining-based workbench supporting curation

          Curation of biomedical literature is often supported by the automatic analysis of textual content that generally involves a sequence of individual processing components. Text mining (TM) has been used to enhance the process of manual biocuration, but has been focused on specific databases and tasks rather than an environment integrating TM tools into the curation pipeline, catering for a variety of tasks, types of information and applications. Processing components usually come from different sources and often lack interoperability. The well established Unstructured Information Management Architecture is a framework that addresses interoperability by defining common data structures and interfaces. However, most of the efforts are targeted towards software developers and are not suitable for curators, or are otherwise inconvenient to use on a higher level of abstraction. To overcome these issues we introduce Argo, an interoperable, integrative, interactive and collaborative system for text analysis with a convenient graphic user interface to ease the development of processing workflows and boost productivity in labour-intensive manual curation. Robust, scalable text analytics follow a modular approach, adopting component modules for distinct levels of text analysis. The user interface is available entirely through a web browser that saves the user from going through often complicated and platform-dependent installation procedures. Argo comes with a predefined set of processing components commonly used in text analysis, while giving the users the ability to deposit their own components. The system accommodates various areas and levels of user expertise, from TM and computational linguistics to ontology-based curation. One of the key functionalities of Argo is its ability to seamlessly incorporate user-interactive components, such as manual annotation editors, into otherwise completely automatic pipelines. As a use case, we demonstrate the functionality of an in-built manual annotation editor that is well suited for in-text corpus annotation tasks. Database URL: http://www.nactem.ac.uk/Argo
            Bookmark
            • Record: found
            • Abstract: found
            • Article: found
            Is Open Access

            EXTRACT: interactive extraction of environment metadata and term suggestion for metagenomic sample annotation

            The microbial and molecular ecology research communities have made substantial progress on developing standards for annotating samples with environment metadata. However, sample manual annotation is a highly labor intensive process and requires familiarity with the terminologies used. We have therefore developed an interactive annotation tool, EXTRACT, which helps curators identify and extract standard-compliant terms for annotation of metagenomic records and other samples. Behind its web-based user interface, the system combines published methods for named entity recognition of environment, organism, tissue and disease terms. The evaluators in the BioCreative V Interactive Annotation Task found the system to be intuitive, useful, well documented and sufficiently accurate to be helpful in spotting relevant text passages and extracting organism and environment terms. Comparison of fully manual and text-mining-assisted curation revealed that EXTRACT speeds up annotation by 15–25% and helps curators to detect terms that would otherwise have been missed. Database URL: https://extract.hcmr.gr/
              Bookmark
              • Record: found
              • Abstract: found
              • Article: found
              Is Open Access

              Constructing a biodiversity terminological inventory

              The increasing growth of literature in biodiversity presents challenges to users who need to discover pertinent information in an efficient and timely manner. In response, text mining techniques offer solutions by facilitating the automated discovery of knowledge from large textual data. An important step in text mining is the recognition of concepts via their linguistic realisation, i.e., terms. However, a given concept may be referred to in text using various synonyms or term variants, making search systems likely to overlook documents mentioning less known variants, which are albeit relevant to a query term. Domain-specific terminological resources, which include term variants, synonyms and related terms, are thus important in supporting semantic search over large textual archives. This article describes the use of text mining methods for the automatic construction of a large-scale biodiversity term inventory. The inventory consists of names of species, amongst which naming variations are prevalent. We apply a number of distributional semantic techniques on all of the titles in the Biodiversity Heritage Library, to compute semantic similarity between species names and support the automated construction of the resource. With the construction of our biodiversity term inventory, we demonstrate that distributional semantic models are able to identify semantically similar names that are not yet recorded in existing taxonomies. Such methods can thus be used to update existing taxonomies semi-automatically by deriving semantically related taxonomic names from a text corpus and allowing expert curators to validate them. We also evaluate our inventory as a means to improve search by facilitating automatic query expansion. Specifically, we developed a visual search interface that suggests semantically related species names, which are available in our inventory but not always in other repositories, to incorporate into the search query. An assessment of the interface by domain experts reveals that our query expansion based on related names is useful for increasing the number of relevant documents retrieved. Its exploitation can benefit both users and developers of search engines and text mining applications.
                Bookmark

                Author and article information

                Journal
                Proceedings of TDWG
                TDWGProc
                Pensoft Publishers
                2535-0897
                August 07 2017
                August 07 2017
                : 1
                : e20067
                Article
                10.3897/tdwgproceedings.1.20067
                16077ca4-44c6-4873-be72-69b8c87e3e9d
                © 2017

                http://creativecommons.org/licenses/by/4.0/

                History

                Comments

                Comment on this article