
      RNA-protein binding motifs mining with a new hybrid deep learning based cross-domain knowledge integration approach

      research-article


          Abstract

          Background

          RNAs play key roles in cells through their interactions with proteins known as RNA-binding proteins (RBPs), and the binding motifs of these proteins are crucial for understanding the post-transcriptional regulation of RNAs. How RBPs correctly recognize their target RNAs and why they bind specific positions is still far from clear. Machine learning-based algorithms are widely acknowledged to be capable of speeding up this investigation. Although many automatic tools have been developed to predict RNA-protein binding sites from rapidly growing multi-source data (e.g. sequence and structure), the domain-specific features and formats of these data pose significant computational challenges. One current difficulty is that the knowledge shared across sources sits at a higher abstraction level than the observed data, so directly integrating observed data across domains is inefficient. The other difficulty is how to interpret the prediction results. Existing approaches tend to stop after outputting potential discrete binding sites on the sequences, but how to assemble these sites into meaningful binding motifs is a topic worthy of further investigation.

          Results

          In view of these challenges, we propose a deep learning-based framework (iDeep) that uses a novel hybrid of a convolutional neural network and a deep belief network to predict RBP interaction sites and motifs on RNAs. This new protocol transforms the original observed data into a high-level abstract feature space using multiple layers of learning blocks, where the shared representations across different domains are integrated. To validate iDeep, we performed experiments on 31 large-scale CLIP-seq datasets. Our results show that integrating multiple sources of data improves the average AUC by 8% compared to the best single-source-based predictor, and that, through cross-domain knowledge integration at an abstraction level, iDeep outperforms the state-of-the-art predictors by 6%. Besides the overall enhanced prediction performance, the convolutional neural network module embedded in iDeep is also able to automatically capture interpretable binding motifs for RBPs. Large-scale experiments demonstrate that these mined binding motifs agree well with experimentally verified results, suggesting that iDeep is a promising approach for real-world applications.
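          The idea that a convolutional filter can act as a motif detector can be illustrated with a small, self-contained sketch. This is not the iDeep implementation: the filter weights, the toy sequences, and the helper names (`one_hot`, `scan`, `best_window`, `frequency_matrix`) are all assumptions made for illustration. A trained network would learn its filter weights from CLIP-seq data, whereas here a filter that rewards the pattern "UGCA" is set by hand. The sketch slides the filter over one-hot encoded RNA, takes the window with the strongest activation in each sequence, and summarizes the aligned windows as a base-frequency matrix, which is one common way to read a motif out of a filter.

```python
from collections import Counter

ALPHABET = "ACGU"  # column order used for the one-hot encoding


def one_hot(seq):
    """Encode an RNA sequence as rows of 4 values, one column per base."""
    return [[1.0 if base == letter else 0.0 for letter in ALPHABET] for base in seq]


def scan(seq, kernel):
    """Slide one filter (width x 4 weights) along the sequence without padding.

    Returns the filter activation at every valid start position.
    """
    encoded = one_hot(seq)
    width = len(kernel)
    return [
        sum(kernel[j][k] * encoded[i + j][k] for j in range(width) for k in range(4))
        for i in range(len(seq) - width + 1)
    ]


def best_window(seq, kernel):
    """Return the subsequence where the filter activates most strongly."""
    scores = scan(seq, kernel)
    start = max(range(len(scores)), key=scores.__getitem__)
    return seq[start:start + len(kernel)]


def frequency_matrix(windows):
    """Column-wise base frequencies of aligned, equal-length windows."""
    matrix = []
    for column in zip(*windows):
        counts = Counter(column)
        matrix.append({base: counts[base] / len(column) for base in ALPHABET})
    return matrix


# Hypothetical hand-set filter: +1 for each position that matches "UGCA".
toy_kernel = one_hot("UGCA")
# Toy sequences invented for this example; they are not CLIP-seq data.
toy_sequences = ["AAUGCAGG", "CUGCAAAC", "GGGAUGCA"]

windows = [best_window(s, toy_kernel) for s in toy_sequences]
motif = frequency_matrix(windows)

for position, column in enumerate(motif):
    print(position, column)
```

          In practice the frequency matrix would be built from many high-scoring windows across a dataset and then compared against known motifs; the single best window per sequence is used here only to keep the example short.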

          Conclusion

          The iDeep framework not only achieves better performance than the state-of-the-art predictors, but also easily captures interpretable binding motifs. iDeep is available at http://www.csbio.sjtu.edu.cn/bioinf/iDeep

          Electronic supplementary material

          The online version of this article (doi:10.1186/s12859-017-1561-8) contains supplementary material, which is available to authorized users.


                Author and article information

                Contributors
                xypan172436@gmail.com
                hbshen@sjtu.edu.cn
                Journal
                BMC Bioinformatics
                BioMed Central (London)
                1471-2105
                28 February 2017
                2017; 18: 136
                Affiliations
                [1] ISNI 0000 0001 0674 042X, GRID grid.5254.6, Department of Veterinary Clinical and Animal Sciences, University of Copenhagen, Copenhagen, Denmark
                [2] ISNI 0000 0004 0369 313X, GRID grid.419897.a, Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai, China
                Article
                1561
                10.1186/s12859-017-1561-8
                5331642
                28245811
                31143013-8fc1-420c-96f1-ae009a01a216
                © The Author(s) 2017

                Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

                History
                15 November 2016
                23 February 2017
                Funding
                Funded by: FundRef http://dx.doi.org/10.13039/501100003399, Science and Technology Commission of Shanghai Municipality;
                Award ID: 16JC1404300
                Funded by: FundRef http://dx.doi.org/10.13039/501100001809, National Natural Science Foundation of China;
                Award ID: 61671288
                Funded by: FundRef http://dx.doi.org/10.13039/501100001809, National Natural Science Foundation of China;
                Award ID: 31628003
                Categories
                Research Article
                Bioinformatics & Computational biology
                rna-binding protein,clip-seq,deep belief network,convolutional neural network,multimodal deep learning
