Training and evaluation corpora for the extraction of causal relationships encoded in biological expression language (BEL)

Abstract

Success in extracting biological relationships depends mainly on the complexity of the task and on the availability of high-quality training data. Here, we describe the new corpora that we prepared for the BioCreative V BEL track, encoded in the systems biology modeling language BEL, for training and testing biological relationship extraction systems. BEL was designed to capture relationships not only between proteins or chemicals, but also relationships involving complex events such as biological processes or disease states. A BEL nanopub is the smallest unit of information and represents a biological relationship together with its provenance. In BEL relationships (called BEL statements), the entities are normalized to defined namespaces mainly derived from public repositories, such as sequence databases, MeSH or publicly available ontologies. In the BEL nanopubs, the BEL statements are associated with citation information and supportive evidence such as a text excerpt. To enable the training of extraction tools, we prepared BEL resources and made them available to the community. We selected a subset of these resources focusing on a reduced set of namespaces, namely, human and mouse genes, ChEBI chemicals, MeSH diseases and GO biological processes, as well as the relationship types ‘increases’ and ‘decreases’. The published training corpus contains 11 000 BEL statements from over 6000 supportive text excerpts. For method evaluation, we selected and re-annotated two smaller subcorpora containing 100 text excerpts. For this re-annotation, the inter-annotator agreement was measured by the BEL track evaluation environment and resulted in a maximal F-score of 91.18% for full statement agreement. In addition, for a set of 100 BEL statements, we provide not only the gold standard expert annotations but also text excerpts pre-selected by two automated systems. Those text excerpts were evaluated and manually annotated as supportive or non-supportive in the course of the BioCreative V BEL track task.

Database URL: http://wiki.openbel.org/display/BIOC/Datasets
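The structure described in the abstract can be illustrated with a minimal Python sketch: a nanopub pairs one BEL statement with its citation and evidence text, statements are filtered to the reduced set of namespaces and relationship types, and two annotation sets are compared with an F-score on exact full-statement matches. This is an assumption-laden illustration, not the BEL track evaluation environment: the class and function names, the namespace prefixes (`HGNC`, `MGI`, `CHEBI`, `MESHD`, `GOBP`), the example statement, and the placeholder citation and evidence are all hypothetical and not taken from the corpora.

```python
import re
from dataclasses import dataclass

# Assumed namespace prefixes for the reduced set named in the abstract
# (human genes, mouse genes, ChEBI chemicals, MeSH diseases, GO processes).
ALLOWED_NAMESPACES = {"HGNC", "MGI", "CHEBI", "MESHD", "GOBP"}
ALLOWED_RELATIONS = {"increases", "decreases"}


@dataclass(frozen=True)
class BelStatement:
    subject: str   # BEL term, e.g. 'p(HGNC:TNF)'
    relation: str  # relationship type, e.g. 'increases'
    obj: str       # BEL term, e.g. 'bp(GOBP:"inflammatory response")'


@dataclass(frozen=True)
class Nanopub:
    statement: BelStatement
    citation: str  # provenance, e.g. a PubMed ID
    evidence: str  # supporting text excerpt


def namespaces(term: str) -> set:
    """Collect namespace prefixes (the token before each ':') in a BEL term.

    Simplification: a colon inside a quoted entity name would be miscounted.
    """
    return set(re.findall(r"([A-Za-z]+):", term))


def in_reduced_set(s: BelStatement) -> bool:
    """True if the relation and every namespace used are in the reduced set."""
    used = namespaces(s.subject) | namespaces(s.obj)
    return s.relation in ALLOWED_RELATIONS and bool(used) and used <= ALLOWED_NAMESPACES


def f_score(predicted: set, gold: set) -> float:
    """Harmonic mean of precision and recall over exact matches."""
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical nanopub; citation and evidence are placeholders.
example = Nanopub(
    statement=BelStatement("p(HGNC:TNF)", "increases", 'bp(GOBP:"inflammatory response")'),
    citation="PMID placeholder",
    evidence="Placeholder sentence standing in for a supporting text excerpt.",
)
```

Because the dataclasses are frozen, statements are hashable, so two annotators' outputs for the same excerpt can be held in sets and passed directly to `f_score`. The sketch scores only exact full-statement matches and does not model any partial-credit scoring.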

Author and article information

Journal

Journal ID (nlm-ta): Database (Oxford)

Journal ID (iso-abbrev): Database (Oxford)

Journal ID (publisher-id): databa

Journal ID (hwp): databa

Title: Database: The Journal of Biological Databases and Curation

Publisher: Oxford University Press

ISSN (Electronic): 1758-0463

Publication date Collection: 2016

Publication date (Electronic): 20 August 2016

Publication date PMC-release: 20 August 2016

Volume: 2016

Electronic Location Identifier: baw113

Affiliations

[1] Fraunhofer Institute for Algorithms and Scientific Computing, Schloss Birlinghoven, Sankt Augustin, Germany

[2] Philip Morris International R&D, Philip Morris Products S.A, Quai Jeanrenaud 5, Neuchâtel, 2000, Switzerland

[3] Department of Health Science Research, Mayo Clinic, Rochester, MN, USA

[4] Selventa, One Alewife Center, Cambridge, MA 02140, USA

Author notes

* Corresponding author: Email: juliane.fluck@scai.fraunhofer.de

Citation details: Fluck, J., Madan, S., Ansari, S. et al. Training and evaluation corpora for the extraction of causal relationships encoded in biological expression language (BEL). Database (2016) Vol. 2016: article ID baw113; doi:10.1093/database/baw113

Article

Publisher ID: baw113

DOI: 10.1093/database/baw113

PMC ID: 4995071

PubMed ID: 27554092

SO-VID: bb0b9a7c-2bb8-4334-817f-8a8e5b6203b1

License:

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

History

Date received: 23 December 2015

Date revision received: 07 July 2016

Date accepted: 07 July 2016

Page count

Pages: 20
