
      Introducing Explorer of Taxon Concepts with a case study on spider measurement matrix building

      research-article


          Abstract

          Background

          Taxonomic descriptions are traditionally composed in natural language and published in a format that cannot be directly used by computers. The Exploring Taxon Concepts (ETC) project has been developing a set of web-based software tools that convert morphological descriptions published in telegraphic style to character data that can be reused and repurposed. This paper introduces the first semi-automated pipeline, to our knowledge, that converts morphological descriptions into taxon-character matrices to support systematics and evolutionary biology research. We then demonstrate and evaluate the use of the ETC Input Creation - Text Capture - Matrix Generation pipeline to generate body part measurement matrices from a set of 188 spider morphological descriptions and report the findings.
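As a rough illustration of what converting telegraphic descriptions into a taxon-character matrix involves, the sketch below pulls millimetre measurements out of short descriptions and arranges them as one row per taxon and one column per character. It is not the ETC implementation: the sample descriptions, the regular expression, and the function names `extract_measurements` and `build_matrix` are all assumptions made for illustration.

```python
import re

# Hypothetical telegraphic-style descriptions (invented, not from the paper's data set).
DESCRIPTIONS = {
    "Taxon A": "Total length 4.50 mm. Carapace length 2.10 mm, width 1.60 mm.",
    "Taxon B": "Total length 5.20 mm. Carapace length 2.40 mm.",
}

# One clause: a character name followed by a value in millimetres.
CLAUSE = re.compile(r"^(?P<name>[A-Za-z ]+?)\s+(?P<value>\d+(?:\.\d+)?)\s*mm$")


def extract_measurements(text):
    """Return {character: value_in_mm} for a single description."""
    characters = {}
    for sentence in re.split(r"\.\s+", text.strip()):
        structure = None  # body part named at the start of the sentence
        for clause in sentence.rstrip(".").split(","):
            match = CLAUSE.match(clause.strip())
            if match is None:
                continue
            words = match.group("name").lower().split()
            if len(words) > 1:
                # "carapace length": everything before the last word is the structure.
                structure = " ".join(words[:-1])
            elif structure is None:
                continue  # a bare property with no structure to attach it to
            characters[f"{structure} {words[-1]}"] = float(match.group("value"))
    return characters


def build_matrix(descriptions):
    """Return (sorted character names, {taxon: row}) with None for missing cells."""
    rows = {taxon: extract_measurements(text) for taxon, text in descriptions.items()}
    characters = sorted({name for row in rows.values() for name in row})
    matrix = {taxon: [row.get(name) for name in characters] for taxon, row in rows.items()}
    return characters, matrix


characters, matrix = build_matrix(DESCRIPTIONS)
for taxon, row in matrix.items():
    print(taxon, dict(zip(characters, row)))
```

Note that "width" in the first description inherits "carapace" from the start of its sentence; telegraphic style relies on this kind of carry-over, which is one reason plain pattern matching alone is fragile.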

          Results

          From the given set of spider taxonomic publications, two versions of input (original and normalized) were generated and used by the ETC Text Capture and ETC Matrix Generation tools. The tools produced two corresponding spider body part measurement matrices, and the matrix from the normalized input was found to be much more similar to a gold standard matrix hand-curated by the scientist co-authors. The lower performance on the original input was attributed to special conventions used in the original descriptions (e.g., the omission of measurement units). The results show that simple normalization of the description text greatly increased the quality of the machine-generated matrix and reduced edit effort. The machine-generated matrix also helped identify issues in the gold standard matrix.
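The kind of normalization mentioned above, restoring omitted measurement units, can be sketched in a few lines, together with a simple way to compare a machine-generated matrix against a gold standard. Both parts are assumptions for illustration: the rule "every bare number is a millimetre value" is far cruder than real normalization needs to be, cell-level agreement is not necessarily the measure used in the paper, and the matrices are toy values, not results from the study.

```python
import re

# A number, optionally already followed by the unit "mm".
NUMBER = re.compile(r"(\d+(?:\.\d+)?)(\s*mm\b)?")


def add_missing_units(text, unit="mm"):
    """Append the unit to every number that lacks one.

    Assumption: every bare number in the text is a measurement. A real
    normalizer would need context to skip counts, figure numbers, and so on.
    """
    def repl(match):
        if match.group(2):  # unit already present, keep the text unchanged
            return match.group(0)
        return f"{match.group(1)} {unit}"
    return NUMBER.sub(repl, text)


def cell_agreement(machine, gold):
    """Fraction of gold-standard cells whose value the machine matrix reproduces."""
    total = matched = 0
    for taxon, gold_row in gold.items():
        machine_row = machine.get(taxon, {})
        for character, gold_value in gold_row.items():
            total += 1
            if machine_row.get(character) == gold_value:
                matched += 1
    return matched / total if total else 0.0


# Invented description that omits units on two of three values.
original = "Total length 4.50. Carapace length 2.10, width 1.60 mm."
normalized = add_missing_units(original)

# Toy matrices ({taxon: {character: value}}), for illustration only.
gold = {"Taxon A": {"total length": 4.5, "carapace length": 2.1}}
from_original = {"Taxon A": {}}
from_normalized = {"Taxon A": {"total length": 4.5, "carapace length": 2.1}}

print(normalized)
print(cell_agreement(from_original, gold), cell_agreement(from_normalized, gold))
```

Storing each matrix as a dictionary of dictionaries keeps the comparison independent of column order, so two matrices can be compared even when they were generated with different character orderings.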

          Conclusions

          ETC Text Capture and ETC Matrix Generation are low-barrier and effective tools for extracting measurement values from spider taxonomic descriptions and are more effective when the descriptions are self-contained. Special conventions that make the description text less self-contained challenge automated extraction of data from biodiversity descriptions and hinder the automated reuse of the published knowledge. The tools will be updated to support new requirements revealed in this case study.

          Electronic supplementary material

          The online version of this article (doi:10.1186/s12859-016-1352-7) contains supplementary material, which is available to authorized users.


                Author and article information

                Contributors
                hongcui@email.arizona.edu
                dongfangxu9@email.arizona.edu
                stevenchong@email.arizona.edu
                ramirez@macn.gov.ar
                rodenhausen@email.arizona.edu
                James.Macklin@agr.gc.ca
                ludaesch@illinois.edu
                ram@cs.umb.edu
                edumsoto@gmail.com
                nicolas.mongiardinokoch@yale.edu
                Journal
                BMC Bioinformatics
                BioMed Central (London)
                1471-2105
                17 November 2016
                2016
                Volume: 17
                Article number: 471
                Affiliations
                [1] University of Arizona, Tucson, AZ, USA
                [2] Museo Argentino de Ciencias Naturales, CONICET, Buenos Aires, Argentina
                [3] Agriculture and Agri-Food Canada, Ottawa, Canada
                [4] University of Illinois at Urbana-Champaign, Champaign, USA
                [5] University of Massachusetts at Boston and Harvard University Herbaria, Massachusetts, USA
                [6] Department of Geology & Geophysics, Yale University, New Haven, Connecticut, USA
                Author information
                http://orcid.org/0000-0003-0828-1102
                Article
                Publisher ID: 1352
                DOI: 10.1186/s12859-016-1352-7
                PMC ID: 5114841
                PubMed ID: 27855645
                c66ddeaf-e5eb-4896-9a55-1447262c8295
                © The Author(s). 2016

                Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

                History
                Received: 1 April 2016
                Accepted: 11 November 2016
                Funding
                Funded by: National Science Foundation (FundRef: http://dx.doi.org/10.13039/100000001)
                Award ID: No. DBI-1147266
                Funded by: CONICET
                Award ID: No. PIP-2012-0943
                Categories
                Research Article

                Bioinformatics & Computational biology
                Keywords: information extraction, text mining, natural language processing, taxonomic morphological descriptions, phenotypic characters, phenotypic traits, evaluation, spiders, ETC, Explorer of Taxon Concepts
