      Automatic vs. manual curation of a multi-source chemical dictionary: the impact on text mining


          Abstract

          Correction

          In 'Automatic vs. manual curation of a multi-source chemical dictionary: the impact on text mining' (Hettne et al. Journal of Cheminformatics 2010, 2:3) [1], the name of the automatically curated dictionary is identified as 'Chemlist'. CHEMLIST is a trademark that the American Chemical Society has used for many years to identify its Regulated Chemicals Listing (CAS) database. To avoid future confusion, the 'Chemlist' dictionary mentioned in this article has been renamed to 'Jochem'.

          Most cited references 1

          Automatic vs. manual curation of a multi-source chemical dictionary: the impact on text mining

          Background

          Previously, we developed a combined dictionary dubbed Chemlist for the identification of small molecules and drugs in text, based on a number of publicly available databases, and tested it on an annotated corpus. To achieve acceptable recall and precision, we used a number of automatic and semi-automatic processing steps together with disambiguation rules. However, it remained to be investigated what impact extensive manual curation of a multi-source chemical dictionary would have on chemical term identification in text. ChemSpider is a chemical database that has undergone extensive manual curation aimed at establishing valid chemical name-to-structure relationships.

          Results

          We acquired the component of ChemSpider containing only manually curated names and synonyms. Rule-based term filtering, semi-automatic manual curation, and disambiguation rules were applied. We tested the dictionary from ChemSpider on an annotated corpus and compared the results with those for the Chemlist dictionary. The ChemSpider dictionary, at ca. 80 k names, was only one-third to one-quarter the size of Chemlist, at around 300 k. The ChemSpider dictionary had a precision of 0.43 and a recall of 0.19 before the application of filtering and disambiguation, and a precision of 0.87 and a recall of 0.19 afterwards. The Chemlist dictionary had a precision of 0.20 and a recall of 0.47 before the application of filtering and disambiguation, and a precision of 0.67 and a recall of 0.40 afterwards.

          Conclusions

          We conclude the following: (1) the ChemSpider dictionary achieved the best precision, but the Chemlist dictionary had a higher recall and the best F-score; (2) rule-based filtering and disambiguation are necessary to achieve high precision for both the automatically generated and the manually curated dictionary.

          ChemSpider is available as a web service at http://www.chemspider.com/ and the Chemlist dictionary is freely available as an XML file in Simple Knowledge Organization System format on the web at http://www.biosemantics.org/chemlist.
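The abstract reports precision and recall for each dictionary but not the F-score values behind its first conclusion. The sketch below recomputes them from the quoted figures, assuming the F-score meant here is the balanced F1 (the harmonic mean of precision and recall); the abstract does not say which variant was used, and the function and variable names are illustrative only.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta=1 gives the harmonic mean of precision and recall.

    Assumption: the abstract's "F-score" is the balanced F1 (beta = 1).
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# (precision, recall) pairs exactly as quoted in the abstract, before and
# after rule-based filtering and disambiguation.
reported = {
    ("ChemSpider", "before"): (0.43, 0.19),
    ("ChemSpider", "after"): (0.87, 0.19),
    ("Chemlist", "before"): (0.20, 0.47),
    ("Chemlist", "after"): (0.67, 0.40),
}

f1 = {key: f_score(p, r) for key, (p, r) in reported.items()}

for (name, stage), value in f1.items():
    print(f"{name:10s} {stage:6s} F1 = {value:.2f}")
```

Under that assumption, the filtered Chemlist dictionary comes out at roughly 0.50 and the filtered ChemSpider dictionary at roughly 0.31, which is consistent with the statement that Chemlist had the best F-score despite its lower precision.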

            Author and article information

            Journal
            J Cheminform
            Journal of Cheminformatics
            BioMed Central
            1758-2946
            2010, 2:4
            3 June 2010
            Affiliations
            [1] Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands
            [2] Department of Health Risk Analysis and Toxicology, Maastricht University, Maastricht, The Netherlands
            [3] Royal Society of Chemistry, 904 Tamaras Circle, Wake Forest, NC-27587, USA
            Article
            1758-2946-2-4
            10.1186/1758-2946-2-4
            2890529
            20525267
            Copyright ©2010 Hettne et al; licensee BioMed Central Ltd.

            This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

            Categories
            Correction

            Chemoinformatics
