      An automated framework for QSAR model building

      research-article


          Abstract

          Background

          Tools based on in silico quantitative structure–activity relationship (QSAR) models are widely used to screen large compound databases and to predict the biological properties of molecules from their chemical structure. As the volume of data on synthesized and known chemicals grows exponentially, there is increasing demand for computationally efficient, automated QSAR modeling tools that are accessible to researchers who may lack extensive knowledge of machine learning. A fully automated, advanced modeling platform can therefore be an important addition to the QSAR community.

          Results

          In the presented workflow, the process from data preparation to model building and validation has been completely automated. The work focused on the most critical modeling tasks (data curation, evaluation of data set characteristics, variable selection and validation), which largely influence the performance of QSAR models. The workflow also includes the ability to quickly evaluate the feasibility of modeling a given data set. The developed framework was tested on data sets for thirty different problems. The best-optimized feature selection methodology in the workflow is able to remove 62–99% of all redundant data. On average, feature selection reduced the prediction error by about 19% and increased the percentage of variance explained (PVE) by 49% compared to models without feature selection. For models with a modelability score above 0.6, the average PVE score was 0.71. A strong correlation was verified between the modelability scores and the PVE of the models produced with variable selection.
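
          The PVE figures reported above can be illustrated with a minimal sketch. It assumes the common definition of PVE as one minus the ratio of the residual sum of squares to the total sum of squares, expressed as a fraction (so 0.71 corresponds to 71%); the abstract does not state the exact formula used by the framework. The function names `pve` and `mean_pve_above`, and the `(modelability, pve)` tuple layout, are hypothetical and chosen only for illustration.

```python
from statistics import fmean


def pve(observed, predicted):
    """Fraction of variance explained: 1 - SS_res / SS_tot.

    Assumed definition; the framework's exact formula is not given
    in the abstract.
    """
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    mean_obs = fmean(observed)
    # Residual sum of squares: error left after prediction.
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total sum of squares: spread of the observed values around their mean.
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values have zero variance")
    return 1.0 - ss_res / ss_tot


def mean_pve_above(results, threshold=0.6):
    """Average PVE over data sets whose modelability score exceeds threshold.

    `results` is a hypothetical list of (modelability_score, pve) pairs.
    """
    kept = [model_pve for score, model_pve in results if score > threshold]
    if not kept:
        raise ValueError("no data set exceeds the modelability threshold")
    return fmean(kept)
```

          Under this definition, a model that reproduces the observed values exactly has a PVE of 1, and a model that always predicts the mean of the observed values has a PVE of 0.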

          Conclusions

          We developed an extendable and highly customizable, fully automated QSAR modeling framework. The workflow requires no advanced parameterization and does not depend on users' decisions or expertise in machine learning or programming. Given only a target or problem, the workflow follows an unbiased standard protocol to develop reliable QSAR models, either by directly accessing manually curated online databases or by using private data sets. Other distinctive features of the workflow include prior estimation of data modelability, which avoids time-consuming modeling trials on non-modelable data sets; an efficient variable selection procedure; and the availability of output at each modeling task, which supports diverse applications and the reproduction of historical predictions. The results obtained on a selection of thirty QSAR problems suggest that the approach is capable of building reliable models even for challenging problems.

          Electronic supplementary material

          The online version of this article (10.1186/s13321-017-0256-5) contains supplementary material, which is available to authorized users.


                Author and article information

                Contributors
                saminakausar.bioinfo@gmail.com
                aofalcao@ciencias.ulisboa.pt
                Journal
                J Cheminform
                Journal of Cheminformatics
                Springer International Publishing (Cham )
                1758-2946
                16 January 2018
                2018
                Volume: 10
                Issue: 1
                Affiliations
                [1] ISNI 0000 0001 2181 4263, GRID grid.9983.b, LaSIGE, Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, 1749-016 Lisbon, Portugal
                [2] ISNI 0000 0001 2181 4263, GRID grid.9983.b, BioISI: Biosystems and Integrative Sciences Institute, Faculdade de Ciências, Universidade de Lisboa, 1749-016 Lisbon, Portugal
                Author information
                http://orcid.org/0000-0002-5207-7136
                Article
                256
                10.1186/s13321-017-0256-5
                5770354
                29340790
                5f98aa20-37f8-4899-a1cd-7510eea27fc1
                © The Author(s) 2018

                Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

                History
                Received: 31 May 2017
                Accepted: 27 December 2017
                Funding
                Funded by: Fundação para a Ciência e a Tecnologia
                Award ID: PTDC/EEI-ESS/4923/2014
                Award ID: SFRH/BD/111654/2015
                Categories
                Research Article
                Custom metadata
                Chemoinformatics
                Keywords: quantitative structure–activity relationship (QSAR), machine learning, feature selection, variable importance, random forests, support vector machines, KNIME, data set modelability
