
      Within- and cross-species predictions of plant specialized metabolism genes using transfer learning

      research-article


          Abstract

          Plant specialized metabolites mediate interactions between plants and the environment and have significant agronomical/pharmaceutical value. Most genes involved in specialized metabolism (SM) are unknown because of the large number of metabolites and the challenge in differentiating SM genes from general metabolism (GM) genes. Plant models like Arabidopsis thaliana have extensive, experimentally derived annotations, whereas many non-model species do not. Here we employed a machine learning strategy, transfer learning, where knowledge from A. thaliana is transferred to predict gene functions in cultivated tomato with fewer experimentally annotated genes. The first tomato SM/GM prediction model using only tomato data performs well (F-measure = 0.74, compared with 0.5 for random and 1.0 for perfect predictions), but from manually curating 88 SM/GM genes, we found many mis-predicted entries were likely mis-annotated. When the SM/GM prediction models built with A. thaliana data were used to filter out genes where the A. thaliana-based model predictions disagreed with tomato annotations, the new tomato model trained with filtered data improved significantly (F-measure = 0.92). Our study demonstrates that SM/GM genes can be better predicted by leveraging cross-species information. Additionally, our findings provide an example for transfer learning in genomics where knowledge can be transferred from an information-rich species to an information-poor one.
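The filtering step and the F-measure mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the gene IDs, labels, and the helper names `f_measure` and `filter_by_agreement` are hypothetical, and the calls attributed to an A. thaliana-based model are hard-coded toy values rather than output from a trained classifier.

```python
from typing import Dict, Sequence


def f_measure(y_true: Sequence[str], y_pred: Sequence[str], positive: str = "SM") -> float:
    """Harmonic mean of precision and recall for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def filter_by_agreement(
    annotations: Dict[str, str], source_predictions: Dict[str, str]
) -> Dict[str, str]:
    """Keep only genes whose target-species annotation matches the source-model call."""
    return {
        gene: label
        for gene, label in annotations.items()
        if source_predictions.get(gene) == label
    }


# Toy data (assumption): made-up tomato annotations and made-up model calls.
tomato_annotations = {"gene_a": "SM", "gene_b": "GM", "gene_c": "SM", "gene_d": "GM"}
arabidopsis_model_calls = {"gene_a": "SM", "gene_b": "GM", "gene_c": "GM", "gene_d": "GM"}

# gene_c is dropped because the two label sources disagree.
filtered = filter_by_agreement(tomato_annotations, arabidopsis_model_calls)
print(sorted(filtered))
```

In a workflow like the one the abstract describes, the filtered label set would then be used to retrain the target-species model, and `f_measure` would be computed on held-out genes to compare models before and after filtering.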


                Author and article information

                Contributors
                Role: Handling Editor
                Journal
                In Silico Plants
                Oxford University Press (UK)
                ISSN: 2517-5025
                Published: 30 July 2020
                Volume 2, Issue 1, Article diaa005
                Affiliations
                [1] Department of Plant Biology, Michigan State University, East Lansing, MI, USA
                [2] Ecology, Evolutionary Biology, and Behavior Program, Michigan State University, East Lansing, MI, USA
                [3] Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI, USA
                [4] Department of Biology, The College of New Jersey, Ewing, NJ, USA
                [5] MSU-DOE Plant Research Laboratory, Michigan State University, East Lansing, MI, USA
                [6] Science Research Center, Yamaguchi University, Yamaguchi, Japan
                [7] Department of Horticulture, Michigan State University, East Lansing, MI, USA
                [8] Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI
                Author notes
                Corresponding author’s e-mail address: shius@msu.edu
                Present address: Department of Botany, University of Wisconsin-Madison, Madison, WI, USA
                Author information
                http://orcid.org/0000-0002-2104-7292
                http://orcid.org/0000-0001-6470-235X
                Article
                Article ID: diaa005
                DOI: 10.1093/insilicoplants/diaa005
                PMCID: PMC7731531
                PMID: 33344884
                3cf6c0e1-f5e5-445c-8490-15dc509a980c
                © The Author(s) 2020. Published by Oxford University Press on behalf of the Annals of Botany Company.

                This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

                History
                : 13 March 2020
                : 21 July 2020
                : 07 October 2020
                Page count
                Pages: 19
                Funding
                Funded by: National Science Foundation, DOI 10.13039/100000001;
                Award ID: IOS-1546617
                Award ID: IOS-1811055
                Award ID: DEB-1655386
                Funded by: National Institute of General Medical Sciences, DOI 10.13039/100000057;
                Funded by: National Institutes of Health, DOI 10.13039/100000002;
                Award ID: T32-GM110523
                Funded by: U.S. Department of Energy Great Lakes Bioenergy Research Center;
                Award ID: DE-SC0018409
                Funded by: Michigan AgBioResearch;
                Funded by: U.S. Department of Agriculture, DOI 10.13039/100000199;
                Funded by: National Institute of Food and Agriculture, DOI 10.13039/100005825;
                Award ID: MICL02552
                Categories
                Original Research
                AcademicSubjects/SCI01210
                AcademicSubjects/SCI01060

                cross-species gene prediction, specialized metabolism, transfer learning
