Discovering semantic features in the literature: a foundation for building functional associations

Abstract

Background

Experimental techniques such as DNA microarray, serial analysis of gene expression (SAGE) and mass spectrometry proteomics, among others, are generating large amounts of data related to genes and proteins at different levels. As in any other experimental approach, it is necessary to analyze these data in the context of previously known information about the biological entities under study. The literature is a particularly valuable source of information for experiment validation and interpretation. Therefore, the development of automated text mining tools to assist in such interpretation is one of the main challenges in current bioinformatics research.

Results

We present a method to create literature profiles for large sets of genes or proteins based on common semantic features extracted from a corpus of relevant documents. These profiles can be used to establish pair-wise similarities among genes, to support gene/protein classification, or in combination with experimental measurements. Semantic features can also help researchers understand the commonalities indicated by experimental results. Our approach is based on non-negative matrix factorization (NMF), a machine-learning algorithm for data analysis that is capable of identifying local patterns that characterize a subset of the data. The literature is thus used to establish putative relationships among subsets of genes or proteins and to provide a coherent justification for this clustering into subsets. We demonstrate the utility of the method by applying it to two independent and vastly different sets of genes.
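The idea of representing genes through non-negative semantic features can be illustrated with a minimal sketch. The abstract does not state which NMF variant, term weighting, or similarity measure the authors used, so everything below is an assumption made for illustration: the multiplicative update rules for a squared-error objective, the cosine similarity between profiles, the function names `nmf` and `cosine_similarity`, and the toy term-by-gene matrix with its term and gene labels.

```python
import numpy as np


def nmf(V, k, n_iter=500, seed=0, eps=1e-9):
    """Approximate a non-negative matrix V (terms x genes) as W @ H.

    W (terms x k) holds k semantic features as non-negative term weights.
    H (k x genes) holds each gene's non-negative weights over those
    features, i.e. an illustrative "literature profile" per gene.
    """
    rng = np.random.default_rng(seed)
    n_terms, n_genes = V.shape
    W = rng.random((n_terms, k)) + eps
    H = rng.random((k, n_genes)) + eps
    for _ in range(n_iter):
        # Multiplicative updates: factors stay non-negative because
        # every quantity in the ratio is non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


def cosine_similarity(H):
    """Pair-wise cosine similarity between gene profiles (columns of H)."""
    norms = np.linalg.norm(H, axis=0, keepdims=True)
    unit = H / np.maximum(norms, 1e-12)
    return unit.T @ unit


# Hypothetical toy data: rows are terms, columns are genes.
terms = ["kinase", "phosphorylation", "ribosome", "translation"]
genes = ["geneA", "geneB", "geneC", "geneD"]
V = np.array([
    [5.0, 4.0, 0.0, 0.0],
    [4.0, 5.0, 0.0, 1.0],
    [0.0, 0.0, 6.0, 5.0],
    [0.0, 1.0, 5.0, 6.0],
])

W, H = nmf(V, k=2)
S = cosine_similarity(H)

for name, profile in zip(genes, H.T):
    print(name, np.round(profile, 2))
print(np.round(S, 2))
```

In this toy setting, each column of `H` expresses a gene as an additive combination of the two learned features, and `S` gives one possible pair-wise similarity between genes. A real application would start from a weighted term-by-gene matrix built from the literature corpus rather than hand-written counts.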

Conclusion

The presented method can create literature profiles from documents relevant to sets of genes. The representation of genes as additive linear combinations of semantic features allows for the exploration of functional associations as well as for clustering, suggesting a valuable methodology for the validation and interpretation of high-throughput experimental data.

Author and article information

Journal

Journal ID (nlm-ta): BMC Bioinformatics

Title: BMC Bioinformatics

Publisher: BioMed Central (London)

ISSN (Electronic): 1471-2105

Publication date Collection: 2006

Publication date (Electronic): 26 January 2006

Volume: 7

Page: 41

Affiliations

[1] Biocomputing Unit, Centro Nacional de Biotecnologia – CSIC, Madrid, Spain

[2] School of Computing, Queen's University, Kingston, Ontario, Canada

[3] Dpto. Arquitectura de Computadores, Universidad Complutense de Madrid, Madrid, Spain

Article

Publisher ID: 1471-2105-7-41

DOI: 10.1186/1471-2105-7-41

PMC ID: 1386711

PubMed ID: 16438716

SO-VID: ff40a672-17be-4be9-bc12-ead4c1f7dd0e

License:

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
