
      Improving protein domain classification for third-generation sequencing reads using deep learning

      research-article
      BMC Genomics
      BioMed Central


          Abstract

          Background

          With the development of third-generation sequencing (TGS) technologies, researchers can obtain DNA reads tens to hundreds of kilobases long. These long reads allow protein domain annotation without assembly and can thus yield important insights into the biological functions of the underlying data. However, the high error rate of TGS data poses a new challenge to established domain analysis pipelines. State-of-the-art methods are not optimized for noisy reads and have shown unsatisfactory domain classification accuracy on TGS data. New computational methods are needed to improve domain prediction in long noisy reads.

          Results

          In this work, we introduce ProDOMA, a deep learning model that performs domain classification on TGS reads. It uses deep neural networks with 3-frame translation encoding to learn conserved features from partially correct translations. In addition, we formulate the task as an open-set problem, so the model can reject reads that do not contain the targeted domains. In experiments on simulated long reads of protein-coding sequences and real TGS reads from the human genome, our model outperforms HMMER and DeepFam on protein domain classification.
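
To illustrate why translating a read in three frames helps with noisy data, here is a minimal sketch of forward 3-frame translation using the standard genetic code. It is not taken from ProDOMA's implementation: the function name `three_frame_translate`, the use of `X` for unresolvable codons, and the toy sequences are assumptions made for this example. The point it shows is that a single inserted base shifts the downstream residues out of frame 0, yet they remain readable in another frame, which is the kind of partially correct translation a model can learn from.

```python
# Standard genetic code (NCBI translation table 1), with codons
# enumerated in T, C, A, G order for each of the three positions.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def three_frame_translate(dna: str) -> list[str]:
    """Translate a DNA string in forward frames 0, 1 and 2.

    Codons containing characters other than A/C/G/T become 'X'
    (an assumption of this sketch); trailing partial codons are dropped.
    """
    dna = dna.upper()
    frames = []
    for offset in range(3):
        residues = []
        for pos in range(offset, len(dna) - 2, 3):
            residues.append(CODON_TABLE.get(dna[pos:pos + 3], "X"))
        frames.append("".join(residues))
    return frames


# Toy error-free coding sequence: ATG GCC AAG TGA.
clean = "ATGGCCAAGTGA"
# Same sequence with one spurious base inserted after the start codon,
# mimicking an insertion error in a long read.
noisy = "ATGCGCCAAGTGA"

for name, seq in (("clean", clean), ("noisy", noisy)):
    print(name, three_frame_translate(seq))
```

In the noisy read, frame 0 still begins with the correct first residue but is wrong afterwards, while the residues downstream of the insertion reappear in frame 1. Presenting all three frames to a classifier together keeps every correctly translated segment available regardless of where the frameshift occurs.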

          Conclusions

          In summary, ProDOMA is a useful end-to-end protein domain analysis tool for long noisy reads without relying on error correction.

          Supplementary Information

          The online version contains supplementary material available at 10.1186/s12864-021-07468-7.


                Author and article information

                Contributors
                dunan00001@gmail.com
                jyshang2-c@my.cityu.edu.hk
                yannisun@cityu.edu.hk
                Journal
                BMC Genomics
                BioMed Central (London)
                1471-2164
                9 April 2021
                Volume 22, Article 251 (2021)
                Affiliations
                [1] Computer Science and Engineering, Michigan State University, East Lansing, 48824 USA (GRID grid.17088.36; ISNI 0000 0001 2150 1785)
                [2] Electrical Engineering, City University of Hong Kong, Hong Kong, People’s Republic of China (GRID grid.35030.35; ISNI 0000 0004 1792 6846)
                Article
                7468
                10.1186/s12864-021-07468-7
                8033682
                © The Author(s) 2021

                Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

                History
                28 November 2020
                19 February 2021
                Funding
                Funded by: City University of Hong Kong (FundRef http://dx.doi.org/10.13039/100007567)
                Award ID: 7200620
                Funded by: HKIDS
                Award ID: 9360163
                Categories
                Research Article

                Genetics
