DHSpred: support-vector-machine-based human DNase I hypersensitive sites prediction using the optimal features selected by random forest

Abstract

DNase I hypersensitive sites (DHSs) are genomic regions that provide important information about the presence of transcriptional regulatory elements and the state of chromatin. Identifying DHSs in uncharacterized DNA sequences is therefore crucial for understanding their biological functions and mechanisms. Although many experimental methods have been proposed to identify DHSs, they have proven expensive for genome-wide application, so computational methods for DHS prediction are needed. In this study, we propose DHSpred (DNase I Hypersensitive Site predictor in human DNA sequences), a support vector machine (SVM)-based method for predicting DHSs that was trained with 174 optimal features. The optimal combination of features was identified, using a random forest algorithm, from a large set that included nucleotide composition and di- and trinucleotide physicochemical properties. DHSpred achieved a Matthews correlation coefficient of 0.660 and an accuracy of 0.871, which were 3% higher than those of control SVM predictors trained with non-optimized features, indicating the efficiency of the feature selection method. Furthermore, DHSpred outperformed state-of-the-art predictors. An online prediction server has been developed to assist the scientific community and is freely available at http://www.thegleelab.org/DHSpred.html
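To make two of the abstract's terms concrete, the sketch below shows one common way to compute nucleotide (k-mer) composition features from a DNA sequence, and the standard definitions of accuracy and the Matthews correlation coefficient from confusion-matrix counts. It is an illustration only, not the DHSpred implementation: the function names are hypothetical, the handling of ambiguous bases is an assumption, and the physicochemical-property features, random forest selection, and SVM training are not shown.

```python
from itertools import product
from math import sqrt


def kmer_composition(seq, k):
    """Return normalized frequencies of all 4**k DNA k-mers in seq.

    Hypothetical helper: windows containing symbols other than
    A, C, G, T (for example N) are skipped, which is an assumption.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    valid_windows = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:
            counts[window] += 1
            valid_windows += 1
    # Frequencies are listed in lexicographic k-mer order (AA, AC, AG, ...).
    return [counts[km] / valid_windows if valid_windows else 0.0 for km in kmers]


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator is zero, a common convention.
    """
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

For example, `kmer_composition(sequence, 2)` yields a 16-value dinucleotide composition vector, and `mcc(tp, tn, fp, fn)` ranges from -1 (every prediction wrong) to 1 (every prediction correct), which makes it more informative than accuracy when the two classes are imbalanced.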

Author and article information

Journal

Journal ID (nlm-ta): Oncotarget

Journal ID (iso-abbrev): Oncotarget

Journal ID (publisher-id): Oncotarget

Journal ID (publisher-id): ImpactJ

Title: Oncotarget

Publisher: Impact Journals LLC

ISSN (Electronic): 1949-2553

Publication date (Collection): 5 January 2018

Publication date (Electronic): 8 December 2017

Volume: 9

Issue: 2

Pages: 1944-1956

Affiliations

¹ Department of Physiology, Ajou University School of Medicine, Suwon, Republic of Korea

² Institute of Molecular Science and Technology, Ajou University, Suwon, Republic of Korea

Author notes

Correspondence to: Balachandran Manavalan, bala@ajou.ac.kr; Gwang Lee, glee@ajou.ac.kr

Article

Publisher ID: 23099

DOI: 10.18632/oncotarget.23099

PMC ID: 5788611

PubMed ID: 29416743

SO-VID: e01e21f8-6f5c-44e2-b8df-c92a7ea5990e

License:

This is an open-access article distributed under the terms of the Creative Commons Attribution License 3.0 (CC BY 3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

History

Date received: 6 September 2017

Date accepted: 17 November 2017
