
      MuLan-Methyl—multiple transformer-based language models for accurate DNA methylation prediction

      research-article


          Abstract

          Transformer-based language models are successfully used to address massive text-related tasks. DNA methylation is an important epigenetic mechanism, and its analysis provides valuable insights into gene regulation and biomarker identification. Several deep learning–based methods have been proposed to identify DNA methylation, and each seeks to strike a balance between computational effort and accuracy. Here, we introduce MuLan-Methyl, a deep learning framework for predicting DNA methylation sites, which is based on 5 popular transformer-based language models. The framework identifies methylation sites for 3 different types of DNA methylation: N6-adenine, N4-cytosine, and 5-hydroxymethylcytosine. Each of the employed language models is adapted to the task using the “pretrain and fine-tune” paradigm. Pretraining is performed on a custom corpus of DNA fragments and taxonomy lineages using self-supervised learning. Fine-tuning aims at predicting the DNA methylation status of each type. The 5 models are used to collectively predict the DNA methylation status. We report excellent performance of MuLan-Methyl on a benchmark dataset. Moreover, we argue that the model captures characteristic differences between different species that are relevant for methylation. This work demonstrates that language models can be successfully adapted to applications in biological sequence analysis and that joint utilization of different language models improves model performance. MuLan-Methyl is open source, and we provide a web server that implements the approach.
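
The abstract states that the 5 fine-tuned models "collectively predict" the methylation status but does not say how their outputs are combined. The following is a minimal sketch of one common way to do this, assuming each model returns a probability that a site is methylated and that the ensemble simply averages those probabilities. The function name `ensemble_predict`, the model names, the example probabilities, and the 0.5 threshold are all hypothetical and are not taken from the article.

```python
from statistics import fmean


def ensemble_predict(model_probs, threshold=0.5):
    """Combine per-model methylation probabilities by unweighted averaging.

    model_probs: dict mapping a model name to its predicted probability
                 that the site is methylated (assumed to lie in [0, 1]).
    threshold:   assumed decision cutoff; not specified in the abstract.
    Returns (mean_probability, is_methylated).
    """
    if not model_probs:
        raise ValueError("at least one model probability is required")
    for name, prob in model_probs.items():
        if not 0.0 <= prob <= 1.0:
            raise ValueError(f"probability for {name!r} is outside [0, 1]")
    # Unweighted mean over all models (an assumption, see lead-in).
    score = fmean(model_probs.values())
    return score, score >= threshold


# Hypothetical outputs of five fine-tuned language models for one DNA fragment.
example = {
    "model_1": 0.9,
    "model_2": 0.8,
    "model_3": 0.7,
    "model_4": 0.6,
    "model_5": 0.5,
}
score, is_methylated = ensemble_predict(example)
print(f"mean probability = {score:.2f}, methylated = {is_methylated}")
```

Averaging gives every model equal weight; a weighted mean or majority vote would be equally plausible readings of "collectively predict", so this sketch should be treated as an illustration only.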


                Author and article information

                Contributors
                Journal
                GigaScience
                Oxford University Press
                ISSN: 2047-217X
                Published: 25 July 2023
                Volume: 12
                Article number: giad054
                Affiliations
                Algorithms in Bioinformatics, Institute for Bioinformatics and Medical Informatics, University of Tübingen, 72076 Tübingen, Germany
                International Max Planck Research School “From Molecules to Organisms”, Max Planck Institute for Biology Tübingen, 72076 Tübingen, Germany
                Cluster of Excellence: EXC 2124: Controlling Microbes to Fight Infection, University of Tübingen, 72076 Tübingen, Germany
                Author notes
                Correspondence address. Daniel H. Huson, Sand 14, University of Tübingen 72076 Germany. E-mail: daniel.huson@uni-tuebingen.de
                Author information
                https://orcid.org/0000-0003-1403-292X
                https://orcid.org/0000-0002-6460-6525
                https://orcid.org/0000-0002-2961-604X
                Article
                giad054
                DOI: 10.1093/gigascience/giad054
                10367125
                © The Author(s) 2023. Published by Oxford University Press GigaScience.

                This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

                History
                : 19 February 2023
                : 09 May 2023
                : 18 July 2023
                Page count
                Pages: 11
                Funding
                Funded by: BMBF, DOI 10.13039/501100002347;
                Award ID: 031A532B
                Award ID: 031A533A
                Award ID: 031A533B
                Award ID: 031A534A
                Award ID: 031A535A
                Award ID: 031A537A
                Award ID: 031A537B
                Award ID: 031A537C
                Award ID: 031A537D
                Award ID: 031A538A
                Categories
                Research
                AcademicSubjects/SCI00960
                AcademicSubjects/SCI02254

                Keywords: DNA methylation, natural language processing, model ensemble, model explainability, web server
