Large scale study of multiple-molecule queries


Abstract

Background

In ligand-based screening, as well as in other chemoinformatics applications, one seeks to search large repositories of molecules effectively in order to retrieve molecules that are similar to a query, typically a single lead molecule. However, in some cases, multiple molecules from the same family are available to seed the query and search for other members of the same family.

Multiple-molecule query methods have been less studied than single-molecule query methods. Furthermore, previous studies have relied on proprietary data and sometimes have not used proper cross-validation methods to assess the results. In contrast, here we develop and compare multiple-molecule query methods using several large publicly available data sets and backgrounds. We also create a framework based on a strict cross-validation protocol to allow unbiased benchmarking and direct comparison across several performance metrics in future studies.

Results

Fourteen different multiple-molecule query methods were defined and benchmarked using: (1) 41 publicly available data sets of related molecules with similar biological activity; and (2) publicly available background data sets consisting of up to 175,000 molecules randomly extracted from the ChemDB database and other sources. Eight of the fourteen methods were parameter free, and six of them fit one or two free parameters to the data using a careful cross-validation protocol. All the methods were assessed and compared for their ability to retrieve members of the same family against the background data set by using several performance metrics including the Area Under the Accumulation Curve (AUAC), Area Under the Curve (AUC), F1-measure, and BEDROC metrics.
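As an illustration of how ranking metrics of this kind are computed, the following is a minimal sketch of two of them, assuming each database molecule has a retrieval score and a binary label (1 for a family member, 0 for background). The function names `roc_auc` and `f1_at_k` are hypothetical, and evaluating F1 at a fixed cutoff `k` is an assumption made here; the study's exact definitions may differ.

```python
def roc_auc(scores, labels):
    """ROC AUC as the probability that a random active outscores a random
    inactive, with ties counted as one half (pairwise form, O(n_pos * n_neg))."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one active and one inactive")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def f1_at_k(scores, labels, k):
    """F1-measure when the k top-scoring molecules are predicted as actives.
    The fixed cutoff k is an assumption of this sketch."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = sum(labels[i] for i in order[:k])  # actives among the top k
    if tp == 0:
        return 0.0
    precision = tp / k
    recall = tp / sum(labels)
    return 2 * precision * recall / (precision + recall)


# Toy example: four molecules, two of which are actives.
toy_scores = [0.9, 0.8, 0.7, 0.1]
toy_labels = [1, 0, 1, 0]
print(roc_auc(toy_scores, toy_labels))
print(f1_at_k(toy_scores, toy_labels, k=2))
```

The pairwise form of the AUC is chosen for readability; for backgrounds of the size described above, a rank-based computation after a single sort would be more practical.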

Consistent with the previous literature, the best parameter-free methods are the MAX-SIM and MIN-RANK methods, which score a molecule against a family by the maximum similarity, or minimum ranking, obtained across the family. Two new parameterized methods introduced in this study and one previously defined method, namely the Exponential Tanimoto Discriminant (ETD), the Tanimoto Power Discriminant (TPD), and the Binary Kernel Discriminant (BKD), outperform most other methods but are more complex, requiring one or two parameters to be fit to the data.
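The two parameter-free scores described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes binary fingerprints represented as Python sets of on-bit indices and Tanimoto similarity as the underlying measure, and the names `tanimoto`, `max_sim`, and `min_rank` are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def max_sim(candidate, family):
    """MAX-SIM: highest similarity of the candidate to any query molecule."""
    return max(tanimoto(candidate, q) for q in family)


def min_rank(database, family):
    """MIN-RANK: for each database molecule, the best (smallest) rank it
    reaches in any of the single-molecule searches, one per query molecule.
    Rank 1 is the most similar molecule; lower is better."""
    best = [len(database)] * len(database)
    for q in family:
        sims = [tanimoto(m, q) for m in database]
        order = sorted(range(len(database)), key=lambda i: -sims[i])
        for rank, i in enumerate(order, start=1):
            best[i] = min(best[i], rank)
    return best


# Toy example: a two-molecule query family and a three-molecule database.
family = [{1, 2, 3, 4}, {2, 3, 4, 5}]
database = [{1, 2, 3}, {7, 8, 9}, {2, 3, 4, 5, 6}]
print([max_sim(m, family) for m in database])
print(min_rank(database, family))
```

Note the difference in orientation: MAX-SIM scores are sorted in descending order to rank the database, whereas MIN-RANK values are sorted in ascending order.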

Conclusion

Fourteen methods for multiple-molecule querying of chemical databases, including the novel ETD and TPD methods, are validated using publicly available data sets, standard cross-validation protocols, and established metrics. The best results are obtained with ETD, TPD, BKD, MAX-SIM, and MIN-RANK. These results can be replicated and compared with the results of future studies using data freely downloadable from http://cdb.ics.uci.edu/.


Author and article information

Journal

Journal ID (nlm-ta): J Cheminform

Title: Journal of Cheminformatics

Publisher: BioMed Central

ISSN (Electronic): 1758-2946

Publication date Collection: 2009

Publication date (Electronic): 4 June 2009

Volume: 1

Page: 7

Affiliations

[1 ]The Bren School of Information and Computer Science, Institute for Genomics and Bioinformatics, University of California, Irvine, CA 92697-3435, USA

Article

Publisher ID: 1758-2946-1-7

DOI: 10.1186/1758-2946-1-7

PMC ID: 3225883

PubMed ID: 20298525

SO-VID: 87800bf0-36d1-4749-a191-85f2c1e91f05

License:

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

History

Date received : 1 June 2009

Date accepted : 4 June 2009
