
      An open source chemical structure curation pipeline using RDKit

      research-article


          Abstract

          Background

          The ChEMBL database is one of a number of public databases that contain bioactivity data on small molecule compounds curated from diverse sources. Incoming compounds are typically not standardised according to consistent rules. To maintain the quality of the final database, and to make it easy to compare and integrate data on the same compound from different sources, the chemical structures in the database must be appropriately standardised.

          Results

          A chemical curation pipeline has been developed using the open source toolkit RDKit. It comprises three components: a Checker, which tests the validity of chemical structures and flags any serious errors; a Standardizer, which formats compounds according to defined rules and conventions; and a GetParent component, which removes any salts and solvents from the compound to create its parent. This pipeline has been applied to the latest version of the ChEMBL database, as well as to uncurated datasets from other sources, to test the robustness of the process and to identify common issues in database molecular structures.
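
          The division of labour between the three components can be illustrated with a minimal sketch. This is not the authors' implementation: the real pipeline operates on chemical structures through RDKit, whereas the sketch below uses only the Python standard library and treats a structure as a dot-separated SMILES string. The function names, the `Issue` class, the penalty values, the tiny `SALTS_AND_SOLVENTS` set, and the fragment-sorting rule are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical, deliberately tiny set of counterion/solvent fragments.
# A real pipeline would rely on curated salt and solvent definitions.
SALTS_AND_SOLVENTS = {"[Na+]", "[K+]", "[Cl-]", "Cl", "O"}


@dataclass
class Issue:
    penalty: int   # higher means more serious (the scale is an assumption)
    message: str


def check(smiles: str) -> List[Issue]:
    """Checker stage: flag problems without modifying the structure."""
    issues = []
    if not smiles.strip():
        issues.append(Issue(7, "empty structure"))
    if smiles.count("(") != smiles.count(")"):
        issues.append(Issue(6, "unbalanced branch parentheses"))
    if "*" in smiles:
        issues.append(Issue(2, "contains wildcard atom"))
    # Most serious issues first, so they can be prioritised for review.
    return sorted(issues, key=lambda issue: issue.penalty, reverse=True)


def standardize(smiles: str) -> str:
    """Standardizer stage: apply formatting rules.

    Toy rule only: trim whitespace and order fragments consistently.
    """
    fragments = [frag for frag in smiles.strip().split(".") if frag]
    return ".".join(sorted(fragments))


def get_parent(smiles: str) -> str:
    """GetParent stage: drop salt and solvent fragments."""
    fragments = smiles.split(".")
    kept = [frag for frag in fragments if frag not in SALTS_AND_SOLVENTS]
    # If every fragment looks like a salt or solvent, keep the input as-is.
    return ".".join(kept) if kept else smiles


def curate(smiles: str) -> Dict[str, object]:
    """Run the three stages in order and collect their outputs."""
    standard = standardize(smiles)
    return {
        "issues": check(smiles),
        "standard": standard,
        "parent": get_parent(standard),
    }


if __name__ == "__main__":
    # Sodium acetate written with the counterion first.
    print(curate("[Na+].CC(=O)[O-]"))
```

          Keeping the Checker read-only and separate from the two transforming stages mirrors the description above: flagged structures can be routed to manual curation independently of whether standardisation and parent generation succeed.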

          Conclusion

          All components of the structure pipeline have been made freely available for other researchers to use and adapt. The code is available in a GitHub repository and can also be accessed via the ChEMBL Beaker webservices. The pipeline has been used successfully to standardise the nearly 2 million compounds in the ChEMBL database, and the compound validity checker has been used to identify the compounds with the most serious issues so that they can be prioritised for manual curation.


                Author and article information

                Contributors
                arl@ebi.ac.uk
                Journal
                J Cheminform
                Journal of Cheminformatics
                Springer International Publishing (Cham)
                ISSN: 1758-2946
                1 September 2020
                Volume: 12
                Article number: 51
                Affiliations
                [1] GRID grid.225360.0, ISNI 0000 0000 9709 7726, European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, CB10 1SD Cambridgeshire, UK
                [2] T5 Informatics GmbH, Basel, 4055 Switzerland
                [3] GRID grid.423328.c, ISNI 0000 0001 2180 7418, Present Address: The Cambridge Crystallographic Data Centre, 12 Union Road, Cambridge, CB2 1EZ, UK
                [4] GRID grid.5335.0, ISNI 0000 0001 2188 5934, Present Address: Department of Oncology, University of Cambridge, Cambridge, UK
                Author information
                http://orcid.org/0000-0003-1424-480X
                Article
                Publisher ID: 456
                DOI: 10.1186/s13321-020-00456-1
                PMC ID: 7458899
                PubMed ID: 33430988
                6946153f-85e1-4b5f-8beb-3c78160060b6
                © The Author(s) 2020

                Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

                History
                Received: 11 June 2020
                Accepted: 24 August 2020
                Funding
                Funded by: Wellcome Trust (FundRef http://dx.doi.org/10.13039/100004440)
                Award ID: WT086151/Z/08/Z
                Award ID: WT104104/Z/14/Z
                Funded by: European Molecular Biology Laboratory (FundRef http://dx.doi.org/10.13039/100013060)
                Categories
                Methodology

                Chemoinformatics
                chemistry, curation, chembl, rdkit, open source, standardisation
