Identification of transcription factor contexts in literature using machine learning approaches


Abstract

Background

The availability of information about transcription factors (TFs) is crucial for genome biology, as TFs play a central role in the regulation of gene expression. While manual literature curation is expensive and labour intensive, the development of semi-automated text mining support is hindered by the unavailability of training data. There have been no studies on how existing data sources (e.g. TF-related data from the MeSH thesaurus and the GO ontology) or potentially noisy example data (e.g. protein-protein interaction, PPI) could be used to provide training data for the identification of TF contexts in literature.

Results

In this paper we describe a text-classification system designed to automatically recognise contexts related to transcription factors in literature. The learning model is based on a set of biological features (e.g. protein and gene names, interaction words, other biological terms) that are deemed relevant for the task. We have exploited background knowledge from existing biological resources (MeSH and GO) to engineer these features. Weak and noisy training datasets have been collected from descriptions of TF-related concepts in MeSH and GO, from PPI data, and from data representing non-protein-function descriptions. Three machine-learning methods are investigated, along with vote-based merging of individual approaches and/or different training datasets. The system achieved highly encouraging results, with most classifiers achieving an F-measure above 90%.
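The vote-based merging and the F-measure mentioned above can be illustrated with a minimal sketch. The abstract does not say how votes are combined or which labels are used, so this assumes simple majority voting over an odd number of classifiers and a hypothetical positive label "TF"; the function names are illustrative and not taken from the paper.

```python
from collections import Counter


def majority_vote(predictions):
    """Merge per-classifier label lists into one label per sentence.

    `predictions` is a list of equal-length label lists, one per classifier.
    Assumes an odd number of classifiers so that binary votes cannot tie.
    """
    merged = []
    for labels in zip(*predictions):
        # Pick the label that most classifiers agreed on for this sentence.
        merged.append(Counter(labels).most_common(1)[0][0])
    return merged


def f_measure(gold, predicted, positive="TF"):
    """Balanced F-measure (harmonic mean of precision and recall)."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Given the predictions of three classifiers as label lists, `f_measure(gold, majority_vote([a, b, c]))` scores the merged output against the gold labels in the same way as any single classifier.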

Conclusions

The experimental results show that the proposed model can identify TF-related contexts (i.e. sentences) with high accuracy, using a significantly smaller set of features than a traditional bag-of-words approach. The results obtained with existing PPI data suggest that TF and PPI contexts are less similar than we had expected. We have also shown that existing knowledge sources are useful both for feature engineering and for obtaining noisy positive training data.
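To make the contrast between an engineered feature set and a bag-of-words representation concrete, here is a small sketch. The lexicon entries, feature names and tokenizer are hypothetical stand-ins: the paper derives its features from MeSH and GO, and the abstract does not list them.

```python
import re

# Hypothetical mini-lexicons, for illustration only. They stand in for the
# feature categories named in the abstract, not for the paper's real resources.
LEXICONS = {
    "protein_or_gene_name": {"p53", "gal4", "nf-kb"},
    "interaction_word": {"binds", "activates", "represses", "regulates"},
    "biological_term": {"promoter", "transcription", "dna"},
}


def tokenize(sentence):
    """Lowercase the sentence and split it into alphanumeric/hyphen tokens."""
    return re.findall(r"[a-z0-9\-]+", sentence.lower())


def lexicon_features(sentence):
    """One count per feature category: a fixed, small feature space."""
    tokens = tokenize(sentence)
    return {
        name: sum(1 for token in tokens if token in words)
        for name, words in LEXICONS.items()
    }


def bag_of_words_features(sentence):
    """One count per distinct token: the feature space grows with vocabulary."""
    counts = {}
    for token in tokenize(sentence):
        counts[token] = counts.get(token, 0) + 1
    return counts
```

For a sentence such as "Gal4 binds the promoter and activates transcription.", the lexicon view yields three features regardless of vocabulary size, while the bag-of-words view yields one feature per distinct token.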


Author and article information

Conference

Journal ID (nlm-ta): BMC Bioinformatics

Title: BMC Bioinformatics

Publisher: BioMed Central

ISSN (Electronic): 1471-2105

Publication date Collection: 2008

Publication date (Electronic): 11 April 2008

Volume: 9

Issue: Suppl 3

Page: S11

Affiliations

[1] School of Computer Science, University of Manchester, Manchester, UK

Article

Publisher ID: 1471-2105-9-S3-S11

DOI: 10.1186/1471-2105-9-S3-S11

PMC ID: 2352869

PubMed ID: 18426546

SO-VID: 179e96a2-b0ab-4c29-aec1-61dbe4cbf9ad

License:

This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Conference name: The Second International Symposium on Languages in Biology and Medicine (LBM) 2007
