Machine learning can differentiate venom toxins from other proteins having non-toxic physiological functions

Abstract

Ascribing function to sequence in the absence of biological data is an ongoing challenge in bioinformatics. Differentiating the toxins of venomous animals from homologues having other physiological functions is particularly problematic, as there are no universally accepted methods by which to attribute toxin function using sequence data alone. The bioinformatics tools that do exist are difficult to implement for researchers with little bioinformatics training. Here we announce a machine learning tool called ‘ToxClassifier’ that enables simple and consistent discrimination of toxins from non-toxin sequences with >99% accuracy, and compare it to commonly used toxin annotation methods. ‘ToxClassifier’ also reports the best-hit annotation, allowing placement of a toxin into the most appropriate toxin protein family, or relates it to a non-toxic protein having the closest homology, giving enhanced curation of existing biological databases and new venomics projects. ‘ToxClassifier’ is available for free, either to download (https://github.com/rgacesa/ToxClassifier) or to use on a web-based server (http://bioserv7.bioinfo.pbf.hr/ToxClassifier/).

Author and article information

Contributors

Paul F. Long

Journal

Journal ID (publisher-id): peerj-cs

Journal ID (pmc): peerj-cs

Journal ID (nlm-ta): PeerJ Comput. Sci.

Title: PeerJ Computer Science

Abbreviated Title: PeerJ Comput. Sci.

Publisher: PeerJ Inc. (San Francisco, USA)

ISSN (Electronic): 2376-5992

Publication date (Electronic): 10 October 2016

Volume: 2

Electronic Location Identifier: e90

Affiliations

[1] Institute of Pharmaceutical Science, King’s College London, London, United Kingdom

[2] Department of Chemistry, King’s College London, London, United Kingdom

[3] Brazil Institute, King’s College London, London, United Kingdom

[4] Faculdade de Ciências Farmacêuticas, Universidade de São Paulo, São Paulo, Brazil

Article

Publisher ID: cs-90

DOI: 10.7717/peerj-cs.90

SO-VID: 7a63def2-6c74-4682-bf6e-b8c56622a4f7

License:

This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.

History

Date received: 24 June 2016

Date accepted: 8 September 2016

Funding

Funded by: United Kingdom Medical Research Council

Award ID: G82144A

Funded by: Universidade de São Paulo

Award ID: 13.1.1502.9.8

This work was supported by the United Kingdom Medical Research Council (MRC grant G82144A). PFL is also supported as a Visiting International Research Professor by the Universidade de São Paulo (USP grant 13.1.1502.9.8). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
