nifPred: Proteome-Wide Identification and Categorization of Nitrogen-Fixation Proteins of Diaztrophs Based on Composition-Transition-Distribution Features Using Support Vector Machine

Abstract

Because inorganic nitrogen compounds are essential for the basic building blocks of life (e.g., nucleotides and amino acids), the role of biological nitrogen fixation (BNF) is indispensable. All nitrogen-fixing microbes rely on the same nitrogenase enzyme for nitrogen reduction, which is in fact an enzyme complex encoded by as many as 20 genes. However, six of these genes, viz., nifB, nifD, nifE, nifH, nifK, and nifN, have been proposed to be essential for a functional nitrogenase enzyme. Identifying these genes is therefore important both for understanding the mechanism of BNF and for exploring ways to improve BNF from the standpoint of agricultural sustainability. Although computational tools are available for the annotation and phylogenetic analysis of nifH gene sequences alone, to the best of our knowledge no tool is available for the computational prediction of all six of the above-mentioned categories of nitrogen-fixation (nif) genes or proteins. We therefore propose an approach, the first of its kind, for the computational identification of nif proteins encoded by the six categories of nif genes. Sequence-derived features were used to map the input sequences into numeric vectors, which were then supplied to a support vector machine as input. Two types of classifier were constructed: (i) a binary classifier that separates nif from non-nitrogen-fixation (non-nif) proteins, and (ii) a multi-class classifier that assigns nif proteins to one of the six categories. The combination of the composition-transition-distribution (CTD) feature set and the radial kernel gave higher accuracies than the other feature-kernel combinations. Overall accuracies exceeded 90% in both binary and multi-class classification. The approach further achieved >92% accuracy when evaluated on blind (independent) test datasets. It also achieved high accuracy in identifying nif proteins when evaluated on proteome-wide datasets of several species. Furthermore, we established a prediction server, nifPred (http://webapp.cabgrid.res.in/nifPred), to assist the scientific community in the proteome-wide identification of the six categories of nif proteins. The source code of nifPred is also available at https://github.com/PrabinaMeher/nifPred. The web server is expected to supplement transcriptional profiling and comparative genomics studies in the identification and functional annotation of genes related to BNF.
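To make the feature-mapping step concrete, the following is a minimal sketch of how a protein sequence might be turned into a composition-transition-distribution (CTD) vector. It is not the nifPred implementation. The three-class hydrophobicity grouping, the function names `encode` and `ctd_features`, and the rounding rule used for the distribution quantiles are all assumptions made for illustration; the abstract does not state which residue groupings or conventions the authors used. A vector of this kind could then be passed to any SVM library that offers a radial (RBF) kernel.

```python
from math import ceil

# ASSUMPTION: a three-class hydrophobicity grouping often used in the CTD
# literature. The groupings actually used by nifPred are not given here.
GROUPS = {
    "polar": "RKEDQN",
    "neutral": "GASTPHY",
    "hydrophobic": "CLVIMFW",
}


def encode(seq):
    """Map each standard residue to its class index (0, 1 or 2).

    Residues outside the 20 standard amino acids (e.g. X, B, Z) are skipped.
    """
    lookup = {aa: i for i, members in enumerate(GROUPS.values()) for aa in members}
    return [lookup[aa] for aa in seq.upper() if aa in lookup]


def ctd_features(seq):
    """Return a 21-element CTD vector for one residue property.

    Layout: 3 composition + 3 transition + 15 distribution values.
    """
    classes = encode(seq)
    n = len(classes)
    if n < 2:
        raise ValueError("need at least two standard residues")
    k = len(GROUPS)

    # Composition: fraction of residues that fall in each class.
    comp = [classes.count(c) / n for c in range(k)]

    # Transition: fraction of adjacent pairs that switch between two given
    # classes, in either direction.
    trans = []
    for a, b in [(0, 1), (0, 2), (1, 2)]:
        count = sum(1 for x, y in zip(classes, classes[1:]) if {x, y} == {a, b})
        trans.append(count / (n - 1))

    # Distribution: relative position of the first, 25%, 50%, 75% and last
    # occurrence of each class. ASSUMPTION: quantile ranks are rounded up.
    dist = []
    for c in range(k):
        positions = [i + 1 for i, x in enumerate(classes) if x == c]
        if not positions:
            dist.extend([0.0] * 5)
            continue
        m = len(positions)
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
            rank = max(1, ceil(frac * m))
            dist.append(positions[rank - 1] / n)

    return comp + trans + dist


features = ctd_features("RKGAVL")
print(len(features))  # 21
```

In practice CTD descriptors are usually computed for several residue properties and concatenated, so the full feature vector is a multiple of 21 values long; this sketch covers a single property only.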

Author and article information

Contributors

Prabina K. Meher: URI : http://loop.frontiersin.org/people/512880/overview

Tanmaya K. Sahu: URI : http://loop.frontiersin.org/people/458380/overview

Shachi Gahoi: URI : http://loop.frontiersin.org/people/355137/overview

Atmakuri R. Rao: URI : http://loop.frontiersin.org/people/334419/overview

Journal

Journal ID (nlm-ta): Front Microbiol

Journal ID (iso-abbrev): Front Microbiol

Journal ID (publisher-id): Front. Microbiol.

Title: Frontiers in Microbiology

Publisher: Frontiers Media S.A.

ISSN (Electronic): 1664-302X

Publication date (Electronic): 29 May 2018

Publication date Collection: 2018

Volume: 9

Electronic Location Identifier: 1100

Affiliations

[1] Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India

[2] Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India

[3] Department of Bioinformatics, Orissa University of Agriculture and Technology, Bhubaneswar, India

Author notes

Edited by: John R. Battista, Louisiana State University, United States

Reviewed by: Daan R. Speth, California Institute of Technology, United States; Bei-Wen Ying, University of Tsukuba, Japan

*Correspondence: Atmakuri R. Rao, rao.cshl.work@gmail.com

This article was submitted to Evolutionary and Genomic Microbiology, a section of the journal Frontiers in Microbiology

†These authors have contributed equally to this work.

Article

DOI: 10.3389/fmicb.2018.01100

PMC ID: 5986947

SO-VID: a8c5527b-2d02-4f0f-829c-ec00681af890

License:

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

History

Date received : 04 December 2017

Date accepted : 08 May 2018

Page count

Figures: 6, Tables: 5, Equations: 1, References: 78, Pages: 16, Words: 11436

Funding

Funded by: Indian Council of Agricultural Research 10.13039/501100001503

Award ID: Agril.Edn.4-1/2013-A&P
