
      QSAR Modeling of Imbalanced High-Throughput Screening Data in PubChem

      research-article


          Abstract

          Many of the structures in PubChem are annotated with activities determined in high-throughput screening (HTS) assays. Because of the nature of these assays, the activity data are typically strongly imbalanced, with a small number of active compounds contrasting with a very large number of inactive compounds. We have used several such imbalanced PubChem HTS assays to test and develop strategies to efficiently build robust QSAR models from imbalanced data sets. Different descriptor types [Quantitative Neighborhoods of Atoms (QNA) and “biological” descriptors] were used to generate a variety of QSAR models in the program GUSAR. The models obtained were compared using external test and validation sets. We also report on our efforts to incorporate the most predictive of our models in the publicly available NCI/CADD Group Web services (http://cactus.nci.nih.gov/chemical/apps/cap).
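The abstract does not say which balancing strategies or performance measures were used, so the following is only an illustrative sketch of why imbalanced HTS data need special handling. It assumes binary labels (1 = active, 0 = inactive) and shows one common option, random undersampling of the inactive class, together with balanced accuracy as an imbalance-aware metric. The function names, the `ratio` parameter, and the synthetic labels are all hypothetical; none of this is the GUSAR workflow described by the authors.

```python
import random


def undersample_inactives(labels, ratio=1.0, seed=0):
    """Return sorted indices of all actives (1) plus a random subset of inactives (0).

    ratio is the number of inactives kept per active (hypothetical parameter).
    """
    rng = random.Random(seed)
    actives = [i for i, y in enumerate(labels) if y == 1]
    inactives = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(inactives), round(ratio * len(actives)))
    return sorted(actives + rng.sample(inactives, n_keep))


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = active)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return (sensitivity + specificity) / 2


if __name__ == "__main__":
    # Synthetic labels: 10 actives among 1000 compounds (made-up numbers).
    labels = [1] * 10 + [0] * 990
    all_inactive = [0] * len(labels)  # trivial model that never predicts "active"
    accuracy = sum(t == p for t, p in zip(labels, all_inactive)) / len(labels)
    print(accuracy, balanced_accuracy(labels, all_inactive))
    print(len(undersample_inactives(labels)))
```

On the synthetic labels, a model that predicts every compound inactive reaches an accuracy of 0.99 but a balanced accuracy of only 0.5, which is why plain accuracy is a poor guide on data of this kind. The undersampled index list could then be used to select a balanced training subset while the full, imbalanced set is kept for external evaluation.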

          Most cited references (11)


          PubChem as a public resource for drug discovery.

          PubChem is a public repository of small molecules and their biological properties. Currently, it contains more than 25 million unique chemical structures and 90 million bioactivity outcomes associated with several thousand macromolecular targets. To address the potential utility of this public resource for drug discovery, we systematically summarized the protein targets in PubChem by function, 3D structure and biological pathway. Moreover, we analyzed the potency, selectivity and promiscuity of the bioactive compounds identified for these biological targets, including the chemical probes generated by the NIH Molecular Libraries Program. As a public resource, PubChem lowers the barrier for researchers to advance the development of chemical tools for modulating biological processes and drug candidates for disease treatments. Published by Elsevier Ltd.

            Machine learning methods for property prediction in chemoinformatics: Quo Vadis?

            This paper is focused on modern approaches to machine learning, most of which are as yet used infrequently or not at all in chemoinformatics. Machine learning methods are characterized in terms of the "modes of statistical inference" and "modeling levels" nomenclature and by considering different facets of the modeling with respect to input/output matching, data types, models duality, and models inference. Particular attention is paid to new approaches and concepts that may provide efficient solutions of common problems in chemoinformatics: improvement of predictive performance of structure-property (activity) models, generation of structures possessing desirable properties, model applicability domain, modeling of properties with functional endpoints (e.g., phase diagrams and dose-response curves), and accounting for multiple molecular species (e.g., conformers or tautomers).

              Comparison of random forest and Pipeline Pilot Naïve Bayes in prospective QSAR predictions.

              Random forest is currently considered one of the best QSAR methods available in terms of accuracy of prediction. However, it is computationally intensive. Naïve Bayes is a simple, robust classification method. The Laplacian-modified Naïve Bayes implementation is the preferred QSAR method in the widely used commercial chemoinformatics platform Pipeline Pilot. We made a comparison of the ability of Pipeline Pilot Naïve Bayes (PLPNB) and random forest to make accurate predictions on 18 large, diverse in-house QSAR data sets. These include on-target and ADME-related activities. These data sets were set up as classification problems with either binary or multicategory activities. We used a time-split method of dividing training and test sets, as we feel this is a realistic way of simulating prospective prediction. PLPNB is computationally efficient. However, random forest predictions are at least as good and in many cases significantly better than those of PLPNB on our data sets. PLPNB performs better with ECFP4 and ECFP6 descriptors, which are native to Pipeline Pilot, and more poorly with other descriptors we tried. © 2012 American Chemical Society

                Author and article information

                Journal
                Journal of Chemical Information and Modeling (J Chem Inf Model)
                American Chemical Society
                ISSN: 1549-9596, 1549-960X
                13 February 2015
                13 February 2014
                24 March 2014
                Volume: 54
                Issue: 3
                Pages: 705-712
                Affiliations
                CADD Group, Chemical Biology Laboratory, Center for Cancer Research, National Cancer Institute, National Institutes of Health, DHHS, NCI-Frederick, 376 Boyles St., Frederick, Maryland 21702, United States
                Basic Science Program, Leidos Biomedical, Inc., Computer-Aided Drug Design Group, Chemical Biology Laboratory, Frederick National Laboratory for Cancer Research, 376 Boyles St., Frederick, Maryland 21702, United States
                Author notes
                * E-mail: mn1@helix.nih.gov. Telephone: +1-301-846-5903.
                Article
                DOI: 10.1021/ci400737s
                PMCID: PMC3985743
                PMID: 24524735
                93dfb3e2-a633-4a7f-9175-51a4f349010a
                Copyright © 2014 American Chemical Society
                History
                11 December 2013
                Funding
                National Institutes of Health, United States
                Categories
                Article
                Custom metadata
                ci400737s
                ci-2013-00737s

                Computational chemistry & Modeling
