Extraction of relations between genes and diseases from text and large-scale data analysis: implications for translational research

Abstract

Background

Current biomedical research needs to exploit the large amount of information reported in scientific publications. Automated text mining approaches, in particular those aimed at finding relationships between entities, are key to identifying actionable knowledge in free-text repositories. We present BeFree, a system for identifying relationships between biomedical entities, with a special focus on genes and their associated diseases.

Results

By exploiting morpho-syntactic information of the text, BeFree is able to identify gene-disease, drug-disease and drug-target associations with state-of-the-art performance. The application of BeFree to real-case scenarios shows its effectiveness in extracting information relevant for translational research. We show the value of the gene-disease associations extracted by BeFree through a number of analyses and integration with other data sources. BeFree succeeds in identifying genes associated with depression, a major cause of morbidity worldwide, that are not present in other public resources. Moreover, large-scale extraction and analysis of gene-disease associations, and their integration with current biomedical knowledge, provided insights into the kind of information that can be found in the literature, and raised challenges regarding data prioritization and curation. We found that only a small proportion of the gene-disease associations discovered using BeFree is collected in expert-curated databases. Thus, there is a pressing need for alternatives to manual curation in order to review, prioritize and curate text-mining data and incorporate it into domain-specific databases. We present our strategy for data prioritization and discuss its implications for supporting biomedical research and applications.

Conclusions

BeFree is a novel text mining system that performs competitively in the identification of gene-disease, drug-disease and drug-target associations. Our analyses show that mining only a small fraction of MEDLINE yields a large dataset of gene-disease associations, and that only a small proportion of this dataset (2%) is actually recorded in curated resources, raising several issues on data prioritization and curation. We propose that joint analysis of text-mined data with expert-curated data is a suitable approach both to assess data quality and to highlight novel and interesting information.
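
The joint analysis of text-mined and curated data can be illustrated with a minimal sketch. It is not BeFree's implementation: the function name `split_by_curation` and the gene and disease identifiers are hypothetical, and associations are assumed to be representable as simple (gene, disease) pairs. The sketch only shows how one might measure the share of text-mined associations already present in a curated resource and set aside the remainder as candidates for review.

```python
def split_by_curation(text_mined, curated):
    """Partition text-mined (gene, disease) pairs by presence in a curated set.

    Hypothetical helper for illustration; not part of BeFree.
    """
    mined = set(text_mined)              # de-duplicate the text-mined pairs
    known = mined & set(curated)         # pairs also found in the curated resource
    novel = mined - known                # pairs seen only in the text-mined data
    fraction_known = len(known) / len(mined) if mined else 0.0
    return known, novel, fraction_known


# Toy, made-up identifiers (assumption: not real data from the article)
text_mined = [
    ("GENE_A", "disease_1"),
    ("GENE_B", "disease_1"),
    ("GENE_C", "disease_2"),
    ("GENE_A", "disease_2"),
]
curated = [("GENE_A", "disease_1"), ("GENE_D", "disease_3")]

known, novel, fraction_known = split_by_curation(text_mined, curated)
print(f"Share of text-mined pairs already curated: {fraction_known:.0%}")
print("Candidates for review:", sorted(novel))
```

In this toy input one of the four text-mined pairs is curated, so the remaining three would be the candidates to prioritize. A real analysis would first need to normalize gene and disease identifiers to a shared vocabulary before comparing the two sets.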

Electronic supplementary material

The online version of this article (doi:10.1186/s12859-015-0472-9) contains supplementary material, which is available to authorized users.


Author and article information

Contributors

Àlex Bravo: abravo@imim.es

Janet Piñero: jpinero@imim.es

Núria Queralt-Rosinach: nqueralt@imim.es

Michael Rautschka: rautschy@gmail.com

Laura I Furlong: lfurlong@imim.es

Journal

Journal ID (nlm-ta): BMC Bioinformatics

Journal ID (iso-abbrev): BMC Bioinformatics

Title: BMC Bioinformatics

Publisher: BioMed Central (London)

ISSN (Electronic): 1471-2105

Publication date (Electronic): 21 February 2015

Publication date PMC-release: 21 February 2015

Publication date Collection: 2015

Volume: 16

Issue: 1

Electronic Location Identifier: 55

Affiliations

Research Programme on Biomedical Informatics (GRIB), IMIM, DCEXS, Universitat Pompeu Fabra, Barcelona, Spain

Article

Publisher ID: 472

DOI: 10.1186/s12859-015-0472-9

PMC ID: 4466840

PubMed ID: 25886734

SO-VID: c3b25584-41e1-4cec-bc28-fd186507e106

License:

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

History

Date received: 24 July 2014

Date accepted: 19 January 2015

Custom metadata

ScienceOpen disciplines: Bioinformatics & Computational biology

Keywords: text mining, information extraction, big data, translational bioinformatics, biocuration, disease, machine learning, corpus development

