31
views
0
recommends
+1 Recommend
0 collections
    0
    shares
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      Poly(A) motif prediction using spectral latent features from human DNA sequences

      research-article

      Read this article at

      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          Motivation: Polyadenylation is the addition of a poly(A) tail to an RNA molecule. Identifying DNA sequence motifs that signal the addition of poly(A) tails is essential to improved genome annotation and better understanding of the regulatory mechanisms and stability of mRNA.

          Existing poly(A) motif predictors demonstrate that information extracted from the surrounding nucleotide sequences of candidate poly(A) motifs can differentiate true motifs from the false ones to a great extent. A variety of sophisticated features has been explored, including sequential, structural, statistical, thermodynamic and evolutionary properties. However, most of these methods involve extensive manual feature engineering, which can be time-consuming and can require in-depth domain knowledge.

          Results: We propose a novel machine-learning method for poly(A) motif prediction by marrying generative learning (hidden Markov models) and discriminative learning (support vector machines). Generative learning provides a rich palette on which the uncertainty and diversity of sequence information can be handled, while discriminative learning allows the performance of the classification task to be directly optimized. Here, we used hidden Markov models for fitting the DNA sequence dynamics, and developed an efficient spectral algorithm for extracting latent variable information from these models. These spectral latent features were then fed into support vector machines to fine-tune the classification performance.

          We evaluated our proposed method on a comprehensive human poly(A) dataset that consists of 14 740 samples from 12 of the most abundant variants of human poly(A) motifs. Compared with one of the previous state-of-the-art methods in the literature (the random forest model with expert-crafted features), our method reduces the average error rate, false-negative rate and false-positive rate by 26, 15 and 35%, respectively. Meanwhile, our method makes ∼30% fewer error predictions relative to the other string kernels. Furthermore, our method can be used to visualize the importance of oligomers and positions in predicting poly(A) motifs, from which we can observe a number of characteristics in the surrounding regions of true and false motifs that have not been reported before.

          Availability: http://sfb.kaust.edu.sa/Pages/Software.aspx

          Contact: lsong@ 123456cc.gatech.edu or xin.gao@ 123456kaust.edu.sa

          Supplementary information: Supplementary data are available at Bioinformatics online.

          Related collections

          Most cited references31

          • Record: found
          • Abstract: found
          • Article: not found

          GeneMark.hmm: new solutions for gene finding.

          The number of completely sequenced bacterial genomes has been growing fast. There are computer methods available for finding genes but yet there is a need for more accurate algorithms. The GeneMark. hmm algorithm presented here was designed to improve the gene prediction quality in terms of finding exact gene boundaries. The idea was to embed the GeneMark models into naturally derived hidden Markov model framework with gene boundaries modeled as transitions between hidden states. We also used the specially derived ribosome binding site pattern to refine predictions of translation initiation codons. The algorithm was evaluated on several test sets including 10 complete bacterial genomes. It was shown that the new algorithm is significantly more accurate than GeneMark in exact gene prediction. Interestingly, the high gene finding accuracy was observed even in the case when Markov models of order zero, one and two were used. We present the analysis of false positive and false negative predictions with the caution that these categories are not precisely defined if the public database annotation is used as a control.
            Bookmark
            • Record: found
            • Abstract: found
            • Article: not found

            Ending the message: poly(A) signals then and now.

            Polyadenylation [poly(A)] signals (PAS) are a defining feature of eukaryotic protein-coding genes. The central sequence motif AAUAAA was identified in the mid-1970s and subsequently shown to require flanking, auxiliary elements for both 3'-end cleavage and polyadenylation of premessenger RNA (pre-mRNA) as well as to promote downstream transcriptional termination. More recent genomic analysis has established the generality of the PAS for eukaryotic mRNA. Evidence for the mechanism of mRNA 3'-end formation is outlined, as is the way this RNA processing reaction communicates with RNA polymerase II to terminate transcription. The widespread phenomenon of alternative poly(A) site usage and how this interrelates with pre-mRNA splicing is then reviewed. This shows that gene expression can be drastically affected by how the message is ended. A central theme of this review is that while genomic analysis provides generality for the importance of PAS selection, detailed mechanistic understanding still requires the direct analysis of specific genes by genetic and biochemical approaches.
              Bookmark
              • Record: found
              • Abstract: found
              • Article: not found

              Mismatch string kernels for discriminative protein classification.

              Classification of proteins sequences into functional and structural families based on sequence homology is a central problem in computational biology. Discriminative supervised machine learning approaches provide good performance, but simplicity and computational efficiency of training and prediction are also important concerns. We introduce a class of string kernels, called mismatch kernels, for use with support vector machines (SVMs) in a discriminative approach to the problem of protein classification and remote homology detection. These kernels measure sequence similarity based on shared occurrences of fixed-length patterns in the data, allowing for mutations between patterns. Thus, the kernels provide a biologically well-motivated way to compare protein sequences without relying on family-based generative models such as hidden Markov models. We compute the kernels efficiently using a mismatch tree data structure, allowing us to calculate the contributions of all patterns occurring in the data in one pass while traversing the tree. When used with an SVM, the kernels enable fast prediction on test sequences. We report experiments on two benchmark SCOP datasets, where we show that the mismatch kernel used with an SVM classifier performs competitively with state-of-the-art methods for homology detection, particularly when very few training examples are available. Examination of the highest-weighted patterns learned by the SVM classifier recovers biologically important motifs in protein families and superfamilies.
                Bookmark

                Author and article information

                Journal
                Bioinformatics
                Bioinformatics
                bioinformatics
                bioinfo
                Bioinformatics
                Oxford University Press
                1367-4803
                1367-4811
                1 July 2013
                19 June 2013
                19 June 2013
                : 29
                : 13
                : i316-i325
                Affiliations
                1College of Computing, Georgia Institute of Technology, Atlanta, GA 30332, USA, 2Computational Bioscience Research Center, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia and 3Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
                Author notes
                *To whom all correspondence should be addressed.
                Article
                btt218
                10.1093/bioinformatics/btt218
                3694652
                23813000
                dfe0e199-389e-4626-890c-29431672658d
                © The Author 2013. Published by Oxford University Press.

                This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License ( http://creativecommons.org/licenses/by-nc/3.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com

                History
                Page count
                Pages: 10
                Categories
                Ismb/Eccb 2013 Proceedings Papers Committee July 21 to July 23, 2013, Berlin, Germany
                Original Papers
                Sequence Analysis

                Bioinformatics & Computational biology
                Bioinformatics & Computational biology

                Comments

                Comment on this article