Predicting and analyzing DNA-binding domains using a systematic approach to identifying a set of informative physicochemical and biochemical properties

There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

Abstract

Background

Existing methods of predicting DNA-binding proteins used valuable features of physicochemical properties to design support vector machine (SVM) based classifiers. Generally, selection of physicochemical properties and determination of their corresponding feature vectors rely mainly on known properties of binding mechanism and experience of designers. However, there exists a troublesome problem for designers that some different physicochemical properties have similar vectors of representing 20 amino acids and some closely related physicochemical properties have dissimilar vectors.

Results

This study proposes a systematic approach (named Auto-IDPCPs) to automatically identify a set of physicochemical and biochemical properties in the AAindex database to design SVM-based classifiers for predicting and analyzing DNA-binding domains/proteins. Auto-IDPCPs consists of 1) clustering 531 amino acid indices in AAindex into 20 clusters using a fuzzy c-means algorithm, 2) utilizing an efficient genetic algorithm based optimization method IBCGA to select an informative feature set of size m to represent sequences, and 3) analyzing the selected features to identify related physicochemical properties which may affect the binding mechanism of DNA-binding domains/proteins. The proposed Auto-IDPCPs identified m=22 features of properties belonging to five clusters for predicting DNA-binding domains with a five-fold cross-validation accuracy of 87.12%, which is promising compared with the accuracy of 86.62% of the existing method PSSM-400. For predicting DNA-binding sequences, the accuracy of 75.50% was obtained using m=28 features, where PSSM-400 has an accuracy of 74.22%. Auto-IDPCPs and PSSM-400 have accuracies of 80.73% and 82.81%, respectively, applied to an independent test data set of DNA-binding domains. Some typical physicochemical properties discovered are hydrophobicity, secondary structure, charge, solvent accessibility, polarity, flexibility, normalized Van Der Waals volume, pK (pK-C, pK-N, pK-COOH and pK-a(RCOOH)), etc.

Conclusions

The proposed approach Auto-IDPCPs would help designers to investigate informative physicochemical and biochemical properties by considering both prediction accuracy and analysis of binding mechanism simultaneously. The approach Auto-IDPCPs can be also applicable to predict and analyze other protein functions from sequences.

Related collections

Most cited references 17

Record: found
Abstract: found
Article: not found

Analysis and prediction of DNA-binding proteins and their binding residues based on composition, sequence and structural information.

Shandar Ahmad, M. Gromiha, Akinori Sarai (2004)

Though vitally important to cell function, the mechanism of protein-DNA binding has not yet been completely understood. We therefore analysed the relationship between DNA binding and protein sequence composition, solvent accessibility and secondary structure. Using non-redundant databases of transcription factors and protein-DNA complexes, neural network models were developed to utilize the information present in this relationship to predict DNA-binding proteins and their binding residues. Sequence composition was found to provide sufficient information to predict the probability of its binding to DNA with nearly 69% sensitivity at 64% accuracy for the considered proteins; sequence neighbourhood and solvent accessibility information were sufficient to make binding site predictions with 40% sensitivity at 79% accuracy. Detailed analysis of binding residues shows that some three- and five-residue segments frequently bind to DNA and that solvent accessibility plays a major role in binding. Although, binding behaviour was not associated with any particular secondary structure, there were interesting exceptions at the residue level. Over-representation of some residues in the binding sites was largely lost at the total sequence level, but a different kind of compositional preference was observed in DNA-binding proteins.

0 comments Cited 118 times – based on 0 reviews      Review now

Bookmark

Record: found
Abstract: found
Article: found

Is Open Access

Identification of DNA-binding proteins using support vector machines and evolutionary profiles

Manish Kumar, Michael Gromiha, Gajendra Raghava (2007)

Background Identification of DNA-binding proteins is one of the major challenges in the field of genome annotation, as these proteins play a crucial role in gene-regulation. In this paper, we developed various SVM modules for predicting DNA-binding domains and proteins. All models were trained and tested on multiple datasets of non-redundant proteins. Results SVM models have been developed on DNAaset, which consists of 1153 DNA-binding and equal number of non DNA-binding proteins, and achieved the maximum accuracy of 72.42% and 71.59% using amino acid and dipeptide compositions, respectively. The performance of SVM model improved from 72.42% to 74.22%, when evolutionary information in form of PSSM profiles was used as input instead of amino acid composition. In addition, SVM models have been developed on DNAset, which consists of 146 DNA-binding and 250 non-binding chains/domains, and achieved the maximum accuracy of 79.80% and 86.62% using amino acid composition and PSSM profiles. The SVM models developed in this study perform better than existing methods on a blind dataset. Conclusion A highly accurate method has been developed for predicting DNA-binding proteins using SVM and PSSM profiles. This is the first study in which evolutionary information in form of PSSM profiles has been used successfully for predicting DNA-binding proteins. A web-server DNAbinder has been developed for identifying DNA-binding proteins and domains from query amino acid sequences .

0 comments Cited 87 times – based on 0 reviews      Review now

Bookmark

Record: found
Abstract: found
Article: not found

Fuzzy C-means method for clustering microarray data.

Philippe Kastner, Doulaye Dembele (2003)

Clustering analysis of data from DNA microarray hybridization studies is essential for identifying biologically relevant groups of genes. Partitional clustering methods such as K-means or self-organizing maps assign each gene to a single cluster. However, these methods do not provide information about the influence of a given gene for the overall shape of clusters. Here we apply a fuzzy partitioning method, Fuzzy C-means (FCM), to attribute cluster membership values to genes. A major problem in applying the FCM method for clustering microarray data is the choice of the fuzziness parameter m. We show that the commonly used value m = 2 is not appropriate for some data sets, and that optimal values for m vary widely from one data set to another. We propose an empirical method, based on the distribution of distances between genes in a given data set, to determine an adequate value for m. By setting threshold levels for the membership values, genes which are tigthly associated to a given cluster can be selected. Using a yeast cell cycle data set as an example, we show that this selection increases the overall biological significance of the genes within the cluster. Supplementary text and Matlab functions are available at http://www-igbmc.u-strasbg.fr/fcm/

0 comments Cited 60 times – based on 0 reviews      Review now

Bookmark

All references

Author and article information

Conference

Journal ID (nlm-ta): BMC Bioinformatics

Title: BMC Bioinformatics

Publisher: BioMed Central

ISSN (Electronic): 1471-2105

Publication date Collection: 2011

Publication date (Electronic): 15 February 2011

Volume: 12

Issue: Suppl 1

Page: S47

Affiliations

[1 ]Department of Biological Science and Technology, National Chiao Tung University, Hsinchu, Taiwan

[2 ]Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu, Taiwan

[3 ]Department of Multimedia Entertainment Science, Asia Pacific Institute of Creativity, Miaoli, Taiwan

[4 ]Department of Automation Engineering, National Formosa University, Yunlin 632, Taiwan

Article

Publisher ID: 1471-2105-12-S1-S47

DOI: 10.1186/1471-2105-12-S1-S47

PMC ID: 3044304

PubMed ID: 21342579

SO-VID: 76bc0b98-7170-4aeb-989a-70896e3bb12b

License:

This is an open access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Conference name: The Ninth Asia Pacific Bioinformatics Conference (APBC 2011)

Conference location: Inchon, Korea

Conference date: 11–14 January 2011

History

Comments

Comment on this article

scite_

Cited by 15

See all cited by

Most referenced authors 107

See all reference authors

Predicting and analyzing DNA-binding domains using a systematic approach to identifying a set of informative physicochemical and biochemical properties

Read this article at

Abstract

Background

Results

Conclusions

Related collections

Genetoberfest

Most cited references 17

Analysis and prediction of DNA-binding proteins and their binding residues based on composition, sequence and structural information.

Identification of DNA-binding proteins using support vector machines and evolutionary profiles

Fuzzy C-means method for clustering microarray data.

Author and article information

Conference

Affiliations

Article

History

Categories

Comments

Comment on this article

Similar content 111

Cited by 15

Most referenced authors 107