Bioactive Molecule Prediction Using Extreme Gradient Boosting

Abstract

Following the explosive growth in chemical and biological data, the shift from traditional drug discovery methods to computer-aided ones has made data mining and machine learning integral parts of today’s drug discovery process. In this paper, extreme gradient boosting (Xgboost), which is an ensemble of Classification and Regression Trees (CART) and a variant of the Gradient Boosting Machine, was investigated for the prediction of biological activity based on a quantitative description of the compound’s molecular structure. Seven datasets that are well known in the literature were used, and the experimental results show that Xgboost can outperform machine learning algorithms such as Random Forest (RF), Support Vector Machines (LSVM), Radial Basis Function Neural Network (RBFN) and Naïve Bayes (NB) for the prediction of biological activities. In addition to its ability to detect minority activity classes in highly imbalanced datasets, it showed remarkable performance on both high and low diversity datasets.
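To make the idea of boosting an ensemble of trees concrete, the following is a minimal sketch, not the authors' implementation and not the Xgboost library itself. It assumes binary activity labels, logistic loss, depth-1 trees (stumps), and an L2 penalty `lam` on leaf weights; the toy descriptor matrix `X`, the labels `y`, and all function names are hypothetical. Each round fits one stump to the first- and second-order gradients of the loss and adds it to the ensemble with a shrinkage factor `eta`.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stump(X, g, h, lam):
    """Pick the single split with the largest second-order gain."""
    G, H = sum(g), sum(h)
    root_w = -G / (H + lam)
    # Fallback: no split, one leaf for every sample.
    best_gain, best = 0.0, (None, None, root_w, root_w)
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            GL = sum(gi for row, gi in zip(X, g) if row[j] <= t)
            HL = sum(hi for row, hi in zip(X, h) if row[j] <= t)
            GR, HR = G - GL, H - HL
            gain = GL**2 / (HL + lam) + GR**2 / (HR + lam) - G**2 / (H + lam)
            if gain > best_gain:
                best_gain = gain
                # Leaf weights are regularised Newton steps.
                best = (j, t, -GL / (HL + lam), -GR / (HR + lam))
    return best

def stump_output(stump, row):
    j, t, w_left, w_right = stump
    return w_left if j is None or row[j] <= t else w_right

def fit_boosted_stumps(X, y, rounds=20, eta=0.3, lam=1.0):
    stumps, margins = [], [0.0] * len(X)
    for _ in range(rounds):
        p = [sigmoid(m) for m in margins]
        g = [pi - yi for pi, yi in zip(p, y)]      # gradient of logistic loss
        h = [pi * (1.0 - pi) for pi in p]          # second derivative
        j, t, wl, wr = fit_stump(X, g, h, lam)
        stump = (j, t, eta * wl, eta * wr)         # shrink each new tree
        stumps.append(stump)
        margins = [m + stump_output(stump, row) for m, row in zip(margins, X)]
    return stumps

def predict_proba(stumps, row):
    """Probability of the active class from the summed leaf weights."""
    return sigmoid(sum(stump_output(s, row) for s in stumps))

# Hypothetical toy data: two descriptors per compound, 1 = active.
X = [[0.1, 1.0], [0.2, 0.9], [0.3, 1.1], [0.8, 1.0], [0.9, 0.95], [0.7, 1.05]]
y = [0, 0, 0, 1, 1, 1]
model = fit_boosted_stumps(X, y)
print([round(predict_proba(model, row), 3) for row in X])
```

The sketch leaves out most of what makes a production system practical, such as deeper trees, column subsampling, sparsity handling and approximate split finding, so it should be read only as an illustration of the additive, gradient-driven structure.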

Most cited references 20

Random forest: a classification and regression tool for compound classification and QSAR modeling.

Vladimir Svetnik, Andy Liaw, Christopher Tong … (2003)

A new classification and regression tool, Random Forest, is introduced and investigated for predicting a compound's quantitative or categorical biological activity based on a quantitative description of the compound's molecular structure. Random Forest is an ensemble of unpruned classification or regression trees created by using bootstrap samples of the training data and random feature selection in tree induction. Prediction is made by aggregating (majority vote or averaging) the predictions of the ensemble. We built predictive models for six cheminformatics data sets. Our analysis demonstrates that Random Forest is a powerful tool capable of delivering performance that is among the most accurate methods to date. We also present three additional features of Random Forest: built-in performance assessment, a measure of relative importance of descriptors, and a measure of compound similarity that is weighted by the relative importance of descriptors. It is the combination of relatively high prediction accuracy and its collection of desired features that makes Random Forest uniquely suited for modeling in cheminformatics.
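The three ingredients named in this abstract (bootstrap samples, random feature selection, and aggregation by majority vote) can be sketched in a few lines. This is an illustrative simplification under stated assumptions rather than the method as published: it uses depth-1 trees instead of unpruned trees, draws the random feature subset once per tree, and works on a hypothetical toy dataset; every name in it is made up for the example.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def fit_stump(X, y, features):
    """Best single split over the given features, by weighted Gini impurity."""
    majority = Counter(y).most_common(1)[0][0]
    best_impurity, best = gini(y), (None, None, majority, majority)
    for j in features:
        values = sorted({row[j] for row in X})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0  # midpoint between neighbouring values
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if impurity < best_impurity:
                best_impurity = impurity
                best = (j, t,
                        Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    return best

def fit_forest(X, y, n_trees=25, mtry=1, seed=0):
    rng = random.Random(seed)
    n, forest = len(X), []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]          # bootstrap sample
        features = rng.sample(range(len(X[0])), mtry)       # random feature subset
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], features))
    return forest

def predict(forest, row):
    """Majority vote over the ensemble."""
    votes = [left if j is None or row[j] <= t else right
             for j, t, left, right in forest]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical toy data: two descriptors per compound, 1 = active.
X = [[0.1, 1.0], [0.2, 1.2], [0.3, 0.9], [0.7, 2.1], [0.8, 2.4], [0.9, 2.0]]
y = [0, 0, 0, 1, 1, 1]
forest = fit_forest(X, y)
print([predict(forest, row) for row in X])
```

An odd `n_trees` avoids tied votes in the binary case. The additional features listed in the abstract (built-in performance assessment, descriptor importance, compound similarity) are not covered by this sketch.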

Benchmarking sets for molecular docking.

Niu Huang, Brian K. Shoichet, John Irwin (2006)

Ligand enrichment among top-ranking hits is a key metric of molecular docking. To avoid bias, decoys should resemble ligands physically, so that enrichment is not simply a separation of gross features, yet be chemically distinct from them, so that they are unlikely to be binders. We have assembled a directory of useful decoys (DUD), with 2950 ligands for 40 different targets. Every ligand has 36 decoy molecules that are physically similar but topologically distinct, leading to a database of 98,266 compounds. For most targets, enrichment was at least half a log better with uncorrected databases such as the MDDR than with DUD, evidence of bias in the former. These calculations also allowed 40x40 cross-docking, where the enrichments of each ligand set could be compared for all 40 targets, enabling a specificity metric for the docking screens. DUD is freely available online as a benchmarking set for docking at http://blaster.docking.org/dud/.

XGBoost: A Scalable Tree Boosting System

Tianqi Chen, Carlos Guestrin (2016)

Tree boosting is a highly effective and widely used machine learning method. In this paper, we describe a scalable end-to-end tree boosting system called XGBoost, which is used widely by data scientists to achieve state-of-the-art results on many machine learning challenges. We propose a novel sparsity-aware algorithm for sparse data and weighted quantile sketch for approximate tree learning. More importantly, we provide insights on cache access patterns, data compression and sharding to build a scalable tree boosting system. By combining these insights, XGBoost scales beyond billions of examples using far fewer resources than existing systems.

Author and article information

Contributors

Leif A. Eriksson: Role: Academic Editor

Journal

Journal ID (nlm-ta): Molecules

Journal ID (iso-abbrev): Molecules

Journal ID (publisher-id): molecules

Title: Molecules

Publisher: MDPI

ISSN (Electronic): 1420-3049

Publication date (Electronic): 28 July 2016

Publication date Collection: August 2016

Volume: 21

Issue: 8

Electronic Location Identifier: 983

Affiliations

[1] UTM Big Data Centre, Ibnu Sina Institute for Scientific and Industrial Research, Universiti Teknologi Malaysia, Skudai, Johor 81310, Malaysia; bmismail2@live.utm.my

[2] Information Systems Department, Faculty of Computing, Universiti Teknologi Malaysia, Skudai, Johor 81310, Malaysia

Author notes

[*] Correspondence: faisalsaeed@utm.my; Tel.: +60-7-5532-406

Article

Publisher ID: molecules-21-00983

DOI: 10.3390/molecules21080983

PMC ID: 6273295

PubMed ID: 27483216

SO-VID: 1762c04e-91c1-4d74-9de1-00ed75ded853

License:

Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC-BY) license (http://creativecommons.org/licenses/by/4.0/).

History

Date received : 01 May 2016

Date accepted : 22 July 2016
