      Is Open Access

      Can human experts predict solubility better than computers?

      research-article


          Abstract

          In this study, we design and carry out a survey, asking human experts to predict the aqueous solubility of druglike organic compounds. We investigate whether these experts, drawn largely from the pharmaceutical industry and academia, can match or exceed the predictive power of algorithms. Alongside this, we implement 10 typical machine learning algorithms on the same dataset. The best algorithm, a variety of neural network known as a multi-layer perceptron, gave an RMSE of 0.985 log S units and an R² of 0.706. We would not have predicted the relative success of this particular algorithm in advance. We found that the best individual human predictor generated an almost identical prediction quality, with an RMSE of 0.942 log S units and an R² of 0.723. The collection of algorithms contained a higher proportion of reasonably good predictors: nine out of ten, compared with around half of the humans. We found that, for either humans or algorithms, combining individual predictions into a consensus predictor by taking their median generated excellent predictivity. While our consensus human predictor achieved very slightly better headline figures on various statistical measures, the difference between it and the consensus machine learning predictor was both small and not statistically significant. We conclude that human experts can predict the aqueous solubility of druglike molecules essentially as well as machine learning algorithms. We find that, for either humans or algorithms, combining individual predictions into a consensus predictor by taking their median is a powerful way of benefitting from the wisdom of crowds.
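
The median consensus described in the abstract can be illustrated with a minimal sketch. The toy log S values, the predictor lists and the function names below are hypothetical and are not taken from the article. R² is computed here as the coefficient of determination (1 - SS_res/SS_tot), which is an assumption; the article may define R² differently, for example as a squared correlation coefficient.

```python
import math
from statistics import median


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    mean_true = sum(y_true) / len(y_true)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all observed values are equal")
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return 1.0 - ss_res / ss_tot


def consensus_median(predictions):
    """Per-compound median across predictors.

    `predictions` holds one list per predictor (human or algorithm),
    each with one predicted value per compound, in the same order.
    """
    return [median(per_compound) for per_compound in zip(*predictions)]


# Hypothetical toy data: four compounds, three predictors (log S units).
observed = [-2.0, -3.5, -1.0, -4.2]
predictors = [
    [-2.5, -3.0, -1.5, -4.0],
    [-1.8, -4.1, -0.6, -5.0],
    [-2.1, -3.4, -2.0, -3.9],
]

consensus = consensus_median(predictors)

if __name__ == "__main__":
    for name, pred in zip(["A", "B", "C"], predictors):
        print(name, round(rmse(observed, pred), 3), round(r_squared(observed, pred), 3))
    print("consensus", round(rmse(observed, consensus), 3),
          round(r_squared(observed, consensus), 3))
```

Taking the median rather than the mean makes the consensus less sensitive to a single predictor with an extreme value for a given compound, which is one plausible reason for preferring it when individual predictors vary widely in quality.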

          Electronic supplementary material

          The online version of this article (10.1186/s13321-017-0250-y) contains supplementary material, which is available to authorized users.


                Author and article information

                Contributors
                samuel.boobier@gmail.com
                anne.osbourn@jic.ac.uk
                jbom@st-andrews.ac.uk
                Journal
                J Cheminform
                Journal of Cheminformatics
                Springer International Publishing (Cham)
                ISSN: 1758-2946
                Published: 13 December 2017
                Volume: 9
                Article number: 63
                Affiliations
                [1] ISNI 0000 0001 0721 1626, GRID grid.11914.3c, Biomedical Sciences Research Complex and EaStCHEM School of Chemistry, University of St Andrews, St Andrews, KY16 9ST, Scotland, UK
                [2] ISNI 0000 0001 2175 7246, GRID grid.14830.3e, Department of Metabolic Biology, John Innes Centre, Norwich Research Park, Norwich, NR4 7UH, UK
                Author information
                http://orcid.org/0000-0002-0379-6097
                Article
                250
                10.1186/s13321-017-0250-y
                5729181
                28316652
                34c77abc-7957-4101-9ca4-a1c3c201bd5b
                © The Author(s) 2017

                Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

                History
                Received: 15 July 2017
                Accepted: 2 December 2017
                Categories
                Research Article

                Chemoinformatics
