      Interoperability of text corpus annotations with the semantic web

      BMC Proceedings

      BioMed Central

      Biomedical Linked Annotation Hackathon 2015

      23-27 February 2015

          Abstract

          Summary

          This paper explores the adaptation of the PubAnnotation model to recent, more general proposals for the representation of annotations in the Semantic Web, referred to here as the Open Annotation model and the focus of the W3C Web Annotation Working Group. We argue that interoperability with standards under development for text annotation on the web, and with recent proposals related to nanopublications, will have benefits for the use and consistency of linguistically annotated text corpora.

          Introduction

          Formal annotation of language data is an activity that dates back at least to the classic work of Kucera and Francis on the Brown Corpus [1]. More broadly, annotation is a general scholarly activity by which scholars organize existing knowledge and facilitate the creation and sharing of new knowledge. Annotation is also becoming increasingly pervasive in the context of social media. Recognition of the widespread importance of annotation has resulted in recent efforts to develop standard data models for annotation [2-4], specifically targeting Web formalisms in order to take advantage of increasing efforts to expose information on the Web, such as through Linked Data initiatives (http://linkeddata.org). The World Wide Web Consortium (W3C) has formed the Web Annotation Working Group (http://www.w3.org/annotation/) to develop specifications for a Web annotation architecture.

          In this paper, we propose the adoption of general semantic web-oriented annotation proposals for text annotation in the context of text corpora intended for use in developing Biomedical Natural Language Processing (BioNLP) solutions. We specifically look at adapting the current PubAnnotation format [5] for compatibility with those proposals. We propose a representation of an annotated corpus in terms of the data models under development in the broader scholarly annotation community, and develop a translator from the existing PubAnnotation JSON format to the Open Annotation model.
          This generalization of the model is particularly pertinent to collaborative annotation scenarios; exposing linguistic annotations in the de facto language of the Semantic Web, the W3C's Resource Description Framework (RDF), provides several advantages that we have previously described [6]. We further demonstrate that the model can be integrated with the nanopublications model [7,8], facilitating their use in a growing set of data publication tools [9].

          Annotation models

          PubAnnotation

          PubAnnotation is an annotation repository that also provides a web services interface exposing the underlying texts and associated annotations [5]. This interface makes use of a simple JSON format that directly associates a span of text with a particular concept string.

          Open Annotation model

          The W3C Web Annotation Working Group will base its proposals on the prior Annotation Ontology [2] and Open Annotation Collaboration [3] models. Each of these models in turn incorporates elements from the earlier Annotea model [10]. We refer to this model as the Open Annotation model (OpenAnn) [4], and adopt it for our target representation.

          High-level model for scholarly annotation

          The basic high-level data model of the two primary Open Annotation models defines an Annotation as an association created between two elements, a Body or content resource and (one or more) Target resources. The annotation provides some information about the target through the connection to the body. For instance, an annotation may relate the token "apple" in a text (the target of the annotation) to the concept of an apple, perhaps represented as WordNet [11] synset "apple#1" (the body of the annotation). Figure 1 shows the base model defined in the OpenAnn model. The model, following linked data principles, assumes that each element of an annotation is a web-addressable entity that can be referenced with a URI.

          Figure 1. Base model for OpenAnn.

          Annotations can be augmented with meta-data, e.g. the author or creation time of the annotation. The model allows for each element of the annotation (the annotation itself, the target, and the body) to have different associated meta-data, such as different authors.

          Graph Annotations

          The initial use cases for Open Annotation focused on single target-concept relationships, formalized as an expectation that the body of an Annotation be a single web resource, represented as a URI. However, to accommodate more complex bodies, a set of RDF statements can be captured in a construct known as a named graph [12]. The named graph as a whole has a URI. We propose to bundle all Body content into a named graph, so that both simple (e.g., entity) annotations and more complex (e.g., event) annotations can be captured in a consistent representation. This extension enables complex semantics to be associated with a resource, as well as supporting fine-grained tracking of the provenance of compositional annotations. These developments make possible the integration of linguistic annotation with the scholarly annotation models [13].

          Representing PubAnn in OpenAnn

          As an example of the use of OpenAnn for PubAnn, we transform a PubAnn JSON statement for an entity into OpenAnn. The PubAnn statement

              {"id":"T13","span":{"begin":1304,"end":1309},"obj":"Protein"}

          is represented in OpenAnn as the following set of RDF statements:

              # The basic annotation structure
              :provenance1 {
                  <PubMed-1134658-Ann1> a oa:Annotation .
                  <PubMed-1134658-Ann1> oa:serializedBy <http://pubannotation.org> .
                  <PubMed-1134658-Ann1> oa:hasTarget <PubMed-1134658-0-SR1> .
                  <PubMed-1134658-Ann1> oa:hasBody <PubMed-1134658-0-T13> .
                  <PubMed-1134658-Ann1> oa:motivatedBy oa:tagging .
                  <PubMed-1134658-Ann1> prov:generatedOn "20141111" .
                  <PubMed-1134658-0-T13> prov:derivedFrom <PubMed-1134658-0-SR1> .
              }

              # The body (content) of the annotation.
              # A named graph.
              <PubMed-1134658-0-T13> {
                  <PubMed-1134658-0-SR1> sio:refers-to genia:Protein .
              }

              # The target of the annotation.
              <PubMed-1134658-0-SR1> a oa:SpecificResource ;
                  oa:hasSource <http://pubannotation.org/docs/sourcedb/PMC/sourceid/1134658/divs/1.txt> ;
                  oa:hasSelector <PubMed-1134658-0-S1304-1309> .

              # A selector for a location within the text resource.
              <PubMed-1134658-0-S1304-1309> a oa:TextPositionSelector ;
                  oa:start 1304 ;
                  oa:end 1309 .

          We also extend our representation to be compatible with nanopublications (http://www.nanopub.org/guidelines) [7,8], a community standard for encapsulating assertions with their provenance into a portable digital object, by defining the annotation body to be the assertion of a nanopublication.

              :np {
                  np:has-assertion <PubMed-1134658-0-T13> ;
                  a np:Nanopublication .
              }

          The above approach can be similarly applied for capturing relational or event semantics as the body of an annotation, by encapsulating a set of triples representing the event within a named graph. We leave such examples for a more in-depth paper.

          Discussion

          The adoption of the Open Annotation formalism for representing annotations over textual corpora brings those annotations into the realm of the semantic web, enabling consistent specification of annotation content, provenance, and meta-data in terms of resolvable and reusable ontology concepts. It will allow annotations generated by different systems or individuals over the same documents to be more easily integrated, compared and contrasted. It further ensures interoperability of corpus annotations with components for authoring, sharing, and displaying annotations in browsers and other technical systems that will be developed through the broader efforts of the W3C, including digital publishing tools (cf. the Domeo annotation toolkit for the precursor of the Open Annotation model [14]).
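To make the point about integrating and comparing annotations more concrete, the following Python sketch groups annotations from two systems by their shared target (source document plus character offsets) and reports where the bodies agree or differ. It is only an illustration under stated assumptions: the annotator names `systemA` and `systemB`, every span other than 1304-1309, and the `genia:DNA` body are hypothetical and do not come from the paper.

```python
from collections import defaultdict
from typing import NamedTuple


class Target(NamedTuple):
    """Where an annotation points: a source document plus a character span."""
    source: str
    start: int
    end: int


def group_by_target(annotations):
    """Map each target to {annotator: concept} across all annotations."""
    grouped = defaultdict(dict)
    for annotator, target, concept in annotations:
        grouped[target][annotator] = concept
    return grouped


def compare(annotations):
    """Split targets annotated by several systems into agreements and disagreements."""
    grouped = group_by_target(annotations)
    agreements, disagreements = [], []
    for target in sorted(grouped):
        by_annotator = grouped[target]
        if len(by_annotator) < 2:
            continue  # only one system annotated this span
        if len(set(by_annotator.values())) == 1:
            agreements.append(target)
        else:
            disagreements.append(target)
    return agreements, disagreements


DOC = "http://pubannotation.org/docs/sourcedb/PMC/sourceid/1134658/divs/1.txt"

# Hypothetical output of two annotation systems. Only the first span and its
# genia:Protein body are taken from the example in the text.
ANNOTATIONS = [
    ("systemA", Target(DOC, 1304, 1309), "genia:Protein"),
    ("systemB", Target(DOC, 1304, 1309), "genia:Protein"),
    ("systemA", Target(DOC, 1400, 1410), "genia:Protein"),
    ("systemB", Target(DOC, 1400, 1410), "genia:DNA"),
    ("systemA", Target(DOC, 1500, 1507), "genia:Protein"),
]

agreements, disagreements = compare(ANNOTATIONS)
print("agree:", [(t.start, t.end) for t in agreements])
print("disagree:", [(t.start, t.end) for t in disagreements])
```

Because both systems identify the target with the same source URI and offsets, the comparison reduces to a dictionary lookup. In an RDF store the same grouping would be expressed as a query over `oa:hasTarget` and `oa:hasBody`.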
          Nanopublications seem to be a particularly apt choice for structuring OpenAnn text annotations in the biomedical domain. Using nanopublications, the assertion, provenance, and metadata for a PubAnnotation are clearly demarcated into named graphs, which can be retrieved, validated, and viewed by a growing set of data publication tools [9]. Furthermore, nanopublications are being used in an increasing number of biomedical resources to represent factual assertions and their provenance, and a number of tools are being developed specifically to work with nanopublications (e.g., the NanoBrowser, http://nanobrowser.inn.ac/). They have been used for incentivizing the publication of human variation data [15], capturing claims [16] and scientific discourse [17], and publishing text-mined associations [18]. Bringing together Open Annotation with nanopublications offers substantial opportunities for access to and reuse of text annotations in combination with information derived from structured databases.

          Conclusions

          We have introduced a proposal for the representation of text annotations in terms of the Open Annotation model, and demonstrated how it could be applied to the current PubAnnotation JSON format. We structured our model to also be compatible with nanopublications, in order to enable integration of text annotations with information derived from curated databases. The result is a representation for text annotation on the web that is interoperable with the framework of two increasingly relevant semantic web models.
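As a rough illustration of the translation from the PubAnnotation JSON format described above, the Python sketch below renders a single entity statement as TriG-style text. It is not the authors' translator. The function name `denotation_to_trig`, the identifier scheme (`PubMed-<doc>-<div>-...`, imitated from the example), the `div` and `index` parameters, and the mapping of the `obj` string onto a `genia:` prefix are all assumptions made for this example. Prefix declarations and the provenance statements are omitted for brevity.

```python
import json

SOURCE = "http://pubannotation.org/docs/sourcedb/PMC/sourceid/1134658/divs/1.txt"


def denotation_to_trig(denotation, doc_id, source, div=0, index=1):
    """Render one PubAnnotation entity statement as TriG-style text.

    The identifier scheme imitates the example in the text; it is an
    assumption, not a documented PubAnnotation convention.
    """
    begin = denotation["span"]["begin"]
    end = denotation["span"]["end"]
    ann_id = denotation["id"]
    stem = f"PubMed-{doc_id}"

    annotation = f"<{stem}-Ann{index}>"
    target = f"<{stem}-{div}-SR{index}>"
    body = f"<{stem}-{div}-{ann_id}>"
    selector = f"<{stem}-{div}-S{begin}-{end}>"
    concept = f"genia:{denotation['obj']}"  # assumed prefix for the concept string

    return "\n".join([
        "# The basic annotation structure",
        f"{annotation} a oa:Annotation ;",
        f"    oa:hasTarget {target} ;",
        f"    oa:hasBody {body} ;",
        "    oa:motivatedBy oa:tagging .",
        "",
        "# The body of the annotation, as a named graph",
        f"{body} {{ {target} sio:refers-to {concept} . }}",
        "",
        "# The target of the annotation and its selector",
        f"{target} a oa:SpecificResource ;",
        f"    oa:hasSource <{source}> ;",
        f"    oa:hasSelector {selector} .",
        f"{selector} a oa:TextPositionSelector ;",
        f"    oa:start {begin} ;",
        f"    oa:end {end} .",
    ])


statement = '{"id":"T13","span":{"begin":1304,"end":1309},"obj":"Protein"}'
trig = denotation_to_trig(json.loads(statement), doc_id="1134658", source=SOURCE)
print(trig)
```

Deriving the selector identifier from the span offsets keeps the reference in `oa:hasSelector` and the selector's own subject in step, since both come from the same two numbers.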

          Most cited references 6


          Microattribution and nanopublication as means to incentivize the placement of human genome variation data into the public domain.

          The advances in bioinformatics required to annotate human genomic variants and to place them in public data repositories have not kept pace with their discovery. Moreover, a law of diminishing returns has begun to operate both in terms of data publication and submission. Although the continued deposition of such data in the public domain is essential to maximize both their scientific and clinical utility, rewards for data sharing are few, representing a serious practical impediment to data submission. To date, two main strategies have been adopted as a means to encourage the submission of human genomic variant data: (1) database journal linkups involving the affiliation of a scientific journal with a publicly available database and (2) microattribution, involving the unambiguous linkage of data to their contributors via a unique identifier. The latter could in principle lead to the establishment of a microcitation-tracking system that acknowledges individual endeavor and achievement. Both approaches could incentivize potential data contributors, thereby encouraging them to share their data with the scientific community. Here, we summarize and critically evaluate approaches that have been proposed to address current deficiencies in data attribution and discuss ways in which they could become more widely adopted as novel scientific publication modalities. © 2012 Wiley Periodicals, Inc.

            An open annotation ontology for science on web 3.0

            Background: There is currently a gap between the rich and expressive collection of published biomedical ontologies, and the natural language expression of biomedical papers consumed on a daily basis by scientific researchers. The purpose of this paper is to provide an open, shareable structure for dynamic integration of biomedical domain ontologies with the scientific document, in the form of an Annotation Ontology (AO), thus closing this gap and enabling application of formal biomedical ontologies directly to the literature as it emerges.

            Methods: Initial requirements for AO were elicited by analysis of integration needs between biomedical web communities, and of needs for representing and integrating results of biomedical text mining. Analysis of strengths and weaknesses of previous efforts in this area was also performed. A series of increasingly refined annotation tools were then developed along with a metadata model in OWL, and deployed for feedback and additional requirements the ontology to users at a major pharmaceutical company and a major academic center. Further requirements and critiques of the model were also elicited through discussions with many colleagues and incorporated into the work.

            Results: This paper presents Annotation Ontology (AO), an open ontology in OWL-DL for annotating scientific documents on the web. AO supports both human and algorithmic content annotation. It enables “stand-off” or independent metadata anchored to specific positions in a web document by any one of several methods. In AO, the document may be annotated but is not required to be under update control of the annotator. AO contains a provenance model to support versioning, and a set model for specifying groups and containers of annotation. AO is freely available under open source license at http://purl.org/ao/, and extensive documentation including screencasts is available on AO’s Google Code page: http://code.google.com/p/annotation-ontology/.

            Conclusions: The Annotation Ontology meets critical requirements for an open, freely shareable model in OWL, of annotation metadata created against scientific documents on the Web. We believe AO can become a very useful common model for annotation metadata on Web documents, and will enable biomedical domain ontologies to be used quite widely to annotate the scientific literature. Potential collaborators and those with new relevant use cases are invited to contact the authors.

              Micropublications: a semantic model for claims, evidence, arguments and annotations in biomedical communications

              Background: Scientific publications are documentary representations of defeasible arguments, supported by data and repeatable methods. They are the essential mediating artifacts in the ecosystem of scientific communications. The institutional “goal” of science is publishing results. The linear document publication format, dating from 1665, has survived transition to the Web. Intractable publication volumes; the difficulty of verifying evidence; and observed problems in evidence and citation chains suggest a need for a web-friendly and machine-tractable model of scientific publications. This model should support: digital summarization, evidence examination, challenge, verification and remix, and incremental adoption. Such a model must be capable of expressing a broad spectrum of representational complexity, ranging from minimal to maximal forms.

              Results: The micropublications semantic model of scientific argument and evidence provides these features. Micropublications support natural language statements; data; methods and materials specifications; discussion and commentary; challenge and disagreement; as well as allowing many kinds of statement formalization. The minimal form of a micropublication is a statement with its attribution. The maximal form is a statement with its complete supporting argument, consisting of all relevant evidence, interpretations, discussion and challenges brought forward in support of or opposition to it. Micropublications may be formalized and serialized in multiple ways, including in RDF. They may be added to publications as stand-off metadata. An OWL 2 vocabulary for micropublications is available at http://purl.org/mp. A discussion of this vocabulary, along with RDF examples from the case studies, appears as OWL Vocabulary and RDF Examples in Additional file 1.

              Conclusion: Micropublications, because they model evidence and allow qualified, nuanced assertions, can play essential roles in the scientific communications ecosystem in places where simpler, formalized and purely statement-based models, such as the nanopublications model, will not be sufficient. At the same time they will add significant value to, and are intentionally compatible with, statement-based formalizations. We suggest that micropublications, generated by useful software tools supporting such activities as writing, editing, reviewing, and discussion, will be of great value in improving the quality and tractability of biomedical communications.

                Author and article information

                Journal: BMC Proceedings (BMC Proc)
                Publisher: BioMed Central
                ISSN: 1753-6561
                Published: 6 August 2015
                Volume 9, Suppl 5, A2
                Affiliations
                [1] Dept of Computing & Information Systems, The University of Melbourne, Melbourne, Australia
                [2] Database Center for Life Science, Research Organization of Information and Systems, Kashiwa, Japan
                [3] Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
                Article
                Article ID: 1753-6561-9-S5-A2
                DOI: 10.1186/1753-6561-9-S5-A2
                PMC ID: 4582753
                Copyright © 2015 Verspoor et al.

                This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

                Biomedical Linked Annotation Hackathon 2015
                Kashiwa, Japan
                23-27 February 2015
                Categories
                Meeting Abstract

                Medicine
