Automatic vs. manual curation of a multi-source chemical dictionary: the impact on text mining


Abstract

Background

Previously, we developed a combined dictionary, dubbed Chemlist, for the identification of small molecules and drugs in text. It was based on a number of publicly available databases and was tested on an annotated corpus. To achieve acceptable recall and precision, we used a number of automatic and semi-automatic processing steps together with disambiguation rules. However, the impact that extensive manual curation of a multi-source chemical dictionary would have on chemical term identification in text remained to be investigated. ChemSpider is a chemical database that has undergone extensive manual curation aimed at establishing valid chemical name-to-structure relationships.

Results

We acquired the component of ChemSpider containing only manually curated names and synonyms. Rule-based term filtering, semi-automatic manual curation, and disambiguation rules were applied. We tested the dictionary from ChemSpider on an annotated corpus and compared the results with those for the Chemlist dictionary. The ChemSpider dictionary, at ca. 80 k names, was only one quarter to one third the size of Chemlist, at around 300 k names. Before filtering and disambiguation, the ChemSpider dictionary had a precision of 0.43 and a recall of 0.19; afterwards, it had a precision of 0.87 and a recall of 0.19. Before filtering and disambiguation, the Chemlist dictionary had a precision of 0.20 and a recall of 0.47; afterwards, it had a precision of 0.67 and a recall of 0.40.

Conclusions

We conclude the following: (1) the ChemSpider dictionary achieved the best precision, but the Chemlist dictionary had a higher recall and the best F-score; (2) rule-based filtering and disambiguation are necessary to achieve high precision for both the automatically generated and the manually curated dictionary. ChemSpider is available as a web service at http://www.chemspider.com/ and the Chemlist dictionary is freely available as an XML file in Simple Knowledge Organization System format on the web at http://www.biosemantics.org/chemlist.
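
The F-score comparison in conclusion (1) can be checked roughly from the precision and recall figures given in the Results. The abstract does not state which F-score variant was used, so the sketch below assumes the balanced F1 score (the harmonic mean of precision and recall). The names `f_score`, `REPORTED`, and `f_scores` are illustrative and do not come from the article.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta=1.0 gives the balanced F1 (an assumption here)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# (precision, recall) pairs as reported in the abstract,
# before and after filtering and disambiguation.
REPORTED = {
    ("ChemSpider", "before"): (0.43, 0.19),
    ("ChemSpider", "after"): (0.87, 0.19),
    ("Chemlist", "before"): (0.20, 0.47),
    ("Chemlist", "after"): (0.67, 0.40),
}


def f_scores() -> dict:
    """Return the F1 score for each dictionary and stage, rounded to 2 places."""
    return {key: round(f_score(p, r), 2) for key, (p, r) in REPORTED.items()}


if __name__ == "__main__":
    for (name, stage), score in f_scores().items():
        print(f"{name:10s} {stage:6s} F1 = {score:.2f}")
```

Under this assumption, Chemlist after filtering and disambiguation reaches an F1 of about 0.50, against about 0.31 for ChemSpider, which is consistent with conclusion (1).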



Author and article information

Journal

Journal ID (nlm-ta): J Cheminform

Title: Journal of Cheminformatics

Publisher: BioMed Central

ISSN (Electronic): 1758-2946

Publication date (Collection): 2010

Publication date (Electronic): 23 March 2010

Volume: 2

Page: 3

Affiliations

[1] Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands

[2] Department of Health Risk Analysis and Toxicology, Maastricht University, Maastricht, The Netherlands

[3] Royal Society of Chemistry, 904 Tamaras Circle, Wake Forest, NC-27587, USA

Article

Publisher ID: 1758-2946-2-3

DOI: 10.1186/1758-2946-2-3

PMC ID: 2848622

PubMed ID: 20331846

SO-VID: 5e55a6f4-c8a9-4318-a5df-9adae5b87b5c

License:

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

History

Date received: 23 November 2009

Date accepted: 23 March 2010
