A machine learning based framework to identify and classify long terminal repeat retrotransposons


      Abstract

      Transposable elements (TEs) are repetitive nucleotide sequences that make up a large portion of eukaryotic genomes. They can move and duplicate within a genome, increasing genome size and contributing to genetic diversity within and across species. Accurate identification and classification of the TEs present in a genome is an important step towards understanding their effects on genes and their role in genome evolution. We introduce TE-Learner, a framework based on machine learning that automatically identifies TEs in a given genome and assigns a classification to them. We present an implementation of our framework for LTR retrotransposons, a particular type of TE characterized by having long terminal repeats (LTRs) at their boundaries. We evaluate the predictive performance of our framework on the well-annotated genomes of Drosophila melanogaster and Arabidopsis thaliana, and we compare our results for three LTR retrotransposon superfamilies with the results of three widely used methods for TE identification or classification: RepeatMasker, Censor and LtrDigest. In contrast to these methods, TE-Learner is the first to incorporate machine learning techniques, outperforming these methods in terms of predictive performance while being able to learn models and make predictions efficiently. Moreover, we show that our method was able to identify TEs that none of the above methods could find, and we investigated TE-Learner's predictions that did not correspond to an official annotation. It turns out that many of these predictions are in fact strongly homologous to a known TE.
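
      The structural property mentioned in the abstract, long terminal repeats at both boundaries of the element, can be illustrated with a toy sketch. This is not TE-Learner's method: the function name `longest_terminal_repeat`, the `min_len` parameter and the example sequences are hypothetical, and the check assumes exact matches, whereas real LTR pairs are typically only approximately identical.

```python
def longest_terminal_repeat(seq: str, min_len: int = 4) -> int:
    """Length of the longest exact prefix of `seq` that also ends `seq`.

    The two copies are not allowed to overlap, so at most half of the
    sequence is considered. Returns 0 if no repeat of at least
    `min_len` bases is found. (Toy illustration; assumes exact matches.)
    """
    seq = seq.upper()
    max_len = len(seq) // 2
    # Try the longest candidate first so the first hit is the answer.
    for k in range(max_len, min_len - 1, -1):
        if seq[:k] == seq[-k:]:
            return k
    return 0


# Hypothetical candidate: the same 10-base repeat flanks an internal region.
ltr = "TGTTGGAATA"
internal = "CCCGGGAAATTTCCC"
candidate = ltr + internal + ltr

repeat_length = longest_terminal_repeat(candidate)
no_repeat_length = longest_terminal_repeat("ACGTACGGTTCA")
```

      Here `repeat_length` is the size of the flanking repeat in the constructed candidate, while the second sequence has no terminal repeat of at least four bases. A realistic detector would tolerate mismatches and indels between the two repeats.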

      Author summary

      Over the years, as the acquisition of biological data has increased, extracting knowledge from this data has become more important. Understanding how biology works is essential for improving the quality of products and services that use biological data. This directly affects companies and governments, which need to remain at the knowledge frontier of an increasingly competitive economy. Transposable elements (TEs) are an example of important biological data, and understanding their role in the genomes of organisms matters for the development of products based on biological data. As an example, we can cite the production of biofuels such as those based on sugar cane. Many studies have revealed the presence of active TEs in this plant, which has gained economic importance in many countries. Understanding how TEs influence the plant should help researchers to develop more resistant varieties of sugar cane, increasing production. Thus, the development of computational methods that help biologists correctly identify and classify TEs is very important from both theoretical and practical perspectives.


            Author and article information

            Affiliations
            [1] Department of Computer Science, KU Leuven, Leuven, Belgium
            [2] Department of Public Health and Primary Care, KU Leuven Kulak, Kortrijk, Belgium
            [3] Department of Respiratory Medicine, Ghent University, and VIB Inflammation Research Center, Ghent, Belgium
            [4] Department of Computer Science, UFSCar Federal University of São Carlos, São Carlos, São Paulo, Brazil
            [5] Department of Statistics, Applied Mathematics, and Computer Science, UNESP São Paulo State University, Rio Claro, São Paulo, Brazil
            [6] Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo, São Carlos, São Paulo, Brazil
            [7] INRIA Lille Nord Europe, 40 avenue Halley, 59650 Villeneuve d’Ascq, France
            [8] Department of Biology, UNESP São Paulo State University, São José do Rio Preto, São Paulo, Brazil
            Rutgers University, UNITED STATES
            Author notes

            The authors have declared that no competing interests exist.

            Contributors
            Role: Formal analysis, Role: Investigation, Role: Methodology, Role: Writing – original draft
            ORCID: http://orcid.org/0000-0003-0983-256X, Role: Conceptualization, Role: Formal analysis, Role: Methodology, Role: Writing – original draft
            ORCID: http://orcid.org/0000-0002-2582-1695, Role: Formal analysis, Role: Methodology, Role: Validation, Role: Writing – original draft
            ORCID: http://orcid.org/0000-0002-5598-6263, Role: Conceptualization, Role: Data curation, Role: Formal analysis, Role: Funding acquisition, Role: Methodology, Role: Project administration, Role: Writing – original draft
            Role: Formal analysis, Role: Investigation, Role: Methodology, Role: Writing – review & editing
            Role: Conceptualization, Role: Funding acquisition, Role: Methodology, Role: Project administration, Role: Supervision, Role: Writing – review & editing
            Role: Conceptualization, Role: Methodology, Role: Validation, Role: Writing – review & editing
            ORCID: http://orcid.org/0000-0003-0378-3699, Role: Conceptualization, Role: Funding acquisition, Role: Project administration, Role: Supervision, Role: Writing – review & editing
            Role: Editor
            Journal
            PLoS Computational Biology (PLoS Comput Biol)
            Public Library of Science (San Francisco, CA, USA)
            1553-734X
            1553-7358
            April 2018
            23 April 2018
            Volume: 14
            Issue: 4
            PMID: 29684010, PMCID: 5933816, Manuscript: PCOMPBIOL-D-17-00156, DOI: 10.1371/journal.pcbi.1006097
            © 2018 Schietgat et al

            This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

            Counts
            Figures: 11, Tables: 6, Pages: 21
            Funding
            This work was supported by the Explorative Scientific Co-operation Programme between KU Leuven and São Paulo State University (UNESP), the Research Foundation Flanders (FWO-Vlaanderen) [project G.0413.09 to EC, postdoctoral grant to CV, grant GA.001.15N (Chist-ERANET call 2013 Adalab project) to JR], the Research Fund KU Leuven, ERC Starting Grant 240186 and IWT-SBO Nemoa to LS, the São Paulo Research Foundation (FAPESP - Brazil) [project 2015/14300-1 to RC, project 2012/24774-2 to CNF, project 2013/15070-4 to CMAC], the National Council for Scientific and Technological Development (CNPq-Brazil) [project 306493/2013-6 to CMAC], and Coordination for the Improvement of Higher Education Personnel (CAPES-Brazil) to EC. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
            Categories
            Research Article
            Biology and Life Sciences > Biochemistry > Proteins > Protein Domains
            Research and Analysis Methods > Experimental Organism Systems > Model Organisms > Drosophila Melanogaster
            Research and Analysis Methods > Model Organisms > Drosophila Melanogaster
            Research and Analysis Methods > Experimental Organism Systems > Animal Models > Drosophila Melanogaster
            Biology and Life Sciences > Organisms > Eukaryota > Animals > Invertebrates > Arthropoda > Insects > Drosophila > Drosophila Melanogaster
            Research and Analysis Methods > Experimental Organism Systems > Model Organisms > Arabidopsis Thaliana
            Research and Analysis Methods > Model Organisms > Arabidopsis Thaliana
            Biology and Life Sciences > Organisms > Eukaryota > Plants > Brassica > Arabidopsis Thaliana
            Research and Analysis Methods > Experimental Organism Systems > Plant and Algal Models > Arabidopsis Thaliana
            Biology and Life Sciences > Genetics > Genetic Elements > Mobile Genetic Elements > Transposable Elements > Retrotransposons
            Biology and Life Sciences > Genetics > Genomics > Mobile Genetic Elements > Transposable Elements > Retrotransposons
            Computer and Information Sciences > Artificial Intelligence > Machine Learning
            Biology and Life Sciences > Genetics > Genomics > Animal Genomics > Invertebrate Genomics
            Research and Analysis Methods > Database and Informatics Methods > Biological Databases > Sequence Databases
            Research and Analysis Methods > Database and Informatics Methods > Bioinformatics > Sequence Analysis > Sequence Databases
            Engineering and Technology > Management Engineering > Decision Analysis > Decision Trees
            Research and Analysis Methods > Decision Analysis > Decision Trees
            Custom metadata
            Version-of-record update to uncorrected proof: 2018-05-03
            Data and software are available at https://dtai.cs.kuleuven.be/software/te-learner.
