
      Database fingerprint (DFP): an approach to represent molecular databases

      brief-report


          Abstract

          Background

          Molecular fingerprints are widely used in several areas of chemoinformatics, including diversity analysis and similarity searching. Fingerprint-based analysis of chemical libraries, particularly large collections, usually requires a molecular representation of each compound in the library, which may lead to issues of storage space and redundant calculations. In fact, information redundancy is inherent to the data, resulting in bit positions in the fingerprint that carry no significant information.

          Results

          Herein, a general approach is proposed to represent an entire compound library with a single binary fingerprint. The development of the database fingerprint (DFP) is illustrated first using a short fingerprint (MACCS keys) for 10 data sets of general interest in chemistry. The application of the DFP is further shown with PubChem fingerprints for the data sets used in the primary example but with a larger number of compounds, up to 25,000 molecules. The performance of the DFP was studied through differential Shannon entropy, k-means clustering, and DFP/Tanimoto similarity.
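The abstract does not spell out how the per-compound fingerprints are condensed into one library-level fingerprint. As a rough illustration only, the sketch below assumes a frequency-threshold rule (a bit is set in the DFP when at least a chosen fraction of compounds in the library has it set). The function names, the 0.5 default threshold, and the toy 4-bit libraries are all hypothetical and are not taken from the article. The per-bit Shannon entropy helper is likewise a simplified stand-in for the differential Shannon entropy analysis mentioned above.

```python
import math


def database_fingerprint(fingerprints, threshold=0.5):
    """Collapse per-compound bit vectors into one bit vector for the library.

    Assumption (not from the abstract): a bit is set in the result when its
    frequency across the library is at least `threshold`.
    """
    if not fingerprints:
        raise ValueError("need at least one fingerprint")
    n_compounds = len(fingerprints)
    n_bits = len(fingerprints[0])
    dfp = []
    for j in range(n_bits):
        frequency = sum(fp[j] for fp in fingerprints) / n_compounds
        dfp.append(1 if frequency >= threshold else 0)
    return dfp


def tanimoto(a, b):
    """Tanimoto coefficient between two equal-length bit vectors."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    # Two all-zero vectors are treated as identical here (a convention choice).
    return both / either if either else 1.0


def bit_entropy(fingerprints, j):
    """Shannon entropy (in bits) of bit position j across the library."""
    p = sum(fp[j] for fp in fingerprints) / len(fingerprints)
    if p == 0.0 or p == 1.0:
        return 0.0  # a constant bit carries no information
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


# Toy libraries with 4-bit fingerprints (hypothetical data).
library_a = [[1, 1, 0, 0], [1, 0, 0, 1], [1, 1, 0, 0]]
library_b = [[1, 0, 1, 0], [1, 0, 1, 1], [0, 0, 1, 1]]

dfp_a = database_fingerprint(library_a)
dfp_b = database_fingerprint(library_b)
similarity = tanimoto(dfp_a, dfp_b)
```

With these toy inputs, each library is reduced to a single 4-bit vector, and the two vectors can be compared directly with `tanimoto` instead of comparing every pair of compounds. Bit 0 of `library_a` is set in every compound, so `bit_entropy(library_a, 0)` is zero, which is the kind of redundant position described in the Background.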

          Conclusions

          The DFP is designed to capture key information of the compound collection and can be used to compare and assess the diversity of molecular libraries. This Preliminary Communication shows the potential of the novel fingerprint to explore inter-library relationships. A major future goal is to apply the DFP to virtual screening and to develop DFPs for other data sets based on several different types of fingerprints.

          Graphical Abstract

          Database fingerprint captures the key information of molecular databases to perform chemical space characterization and virtual screening

          Electronic supplementary material

          The online version of this article (doi:10.1186/s13321-017-0195-1) contains supplementary material, which is available to authorized users.


                Author and article information

                Contributors
                hidragyrum@gmail.com
                cesarrjacas1985@gmail.com
                kmtzm@unam.mx
                medinajl@unam.mx , jose.medina.franco@gmail.com
                Journal
                J Cheminform
                Journal of Cheminformatics
                Springer International Publishing (Cham)
                1758-2946
                6 February 2017
                2017
                : 9
                : 9
                Affiliations
                [1] Departamento de Farmacia, Facultad de Química, Universidad Nacional Autónoma de México, Avenida Universidad 3000, 04510 Mexico City, Mexico (ISNI 0000 0001 2159 0001, GRID grid.9486.3)
                [2] Instituto de Química, Universidad Nacional Autónoma de México, Avenida Universidad 3000, 04510 Mexico City, Mexico (ISNI 0000 0001 2159 0001, GRID grid.9486.3)
                [3] Escuela de Sistemas y Computación, Pontificia Universidad Católica del Ecuador Sede Esmeraldas (PUCESE), Esmeraldas, Ecuador
                Author information
                http://orcid.org/0000-0003-4940-1107
                Article
                195
                10.1186/s13321-017-0195-1
                5293704
                28224019
                07741765-dbf4-4064-8cdd-71841f9f72b7
                © The Author(s) 2017

                Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

                History
                Received: 30 September 2016
                Accepted: 23 January 2017
                Funding
                Funded by: FundRef http://dx.doi.org/10.13039/501100005739, Universidad Nacional Autónoma de México;
                Award ID: PAIP 5000-9163
                Categories
                Preliminary Communication

                Chemoinformatics
                diversity, information content, molecular fingerprints, similarity, Shannon entropy
