      A machine learning based framework to identify and classify long terminal repeat retrotransposons


          Abstract

          Transposable elements (TEs) are repetitive nucleotide sequences that make up a large portion of eukaryotic genomes. They can move and duplicate within a genome, increasing genome size and contributing to genetic diversity within and across species. Accurate identification and classification of the TEs present in a genome is an important step towards understanding their effects on genes and their role in genome evolution. We introduce TE-Learner, a framework based on machine learning that automatically identifies TEs in a given genome and assigns a classification to them. We present an implementation of our framework for LTR retrotransposons, a particular type of TE characterized by having long terminal repeats (LTRs) at their boundaries. We evaluate the predictive performance of our framework on the well-annotated genomes of Drosophila melanogaster and Arabidopsis thaliana, and we compare our results for three LTR retrotransposon superfamilies with the results of three widely used methods for TE identification or classification: RepeatMasker, Censor and LtrDigest. In contrast to these methods, TE-Learner is the first to incorporate machine learning techniques, and it outperforms them in terms of predictive performance while being able to learn models and make predictions efficiently. Moreover, we show that our method was able to identify TEs that none of the above methods could find, and we investigated those TE-Learner predictions that did not correspond to an official annotation. It turns out that many of these predictions are in fact strongly homologous to a known TE.

          Author summary

          Over the years, as the acquisition of biological data has increased, extracting knowledge from this data has become more important. Understanding how biology works is essential for improving the quality of the products and services that use biological data. This directly influences companies and governments, which need to remain at the knowledge frontier of an increasingly competitive economy. Transposable elements (TEs) are an example of important biological data, and understanding their role in the genomes of organisms is important for the development of products based on such data. One example is the production of biofuels, such as those based on sugar cane. Many studies have revealed the presence of active TEs in this plant, which has gained economic importance in many countries. Understanding how TEs influence the plant should help researchers develop more resistant varieties of sugar cane, increasing production. Thus, the development of computational methods that help biologists correctly identify and classify TEs is important from both theoretical and practical perspectives.


                Author and article information

                Contributors
                Roles: Formal analysis, Investigation, Methodology, Writing – original draft
                Roles: Conceptualization, Formal analysis, Methodology, Writing – original draft
                Roles: Formal analysis, Methodology, Validation, Writing – original draft
                Roles: Conceptualization, Data curation, Formal analysis, Funding acquisition, Methodology, Project administration, Writing – original draft
                Roles: Formal analysis, Investigation, Methodology, Writing – review & editing
                Roles: Conceptualization, Funding acquisition, Methodology, Project administration, Supervision, Writing – review & editing
                Roles: Conceptualization, Methodology, Validation, Writing – review & editing
                Roles: Conceptualization, Funding acquisition, Project administration, Supervision, Writing – review & editing
                Role: Editor
                Journal
                PLoS Computational Biology (PLoS Comput Biol)
                Public Library of Science (San Francisco, CA, USA)
                ISSN: 1553-734X, 1553-7358
                Published: 23 April 2018
                Volume: 14
                Issue: 4
                Affiliations
                [1] Department of Computer Science, KU Leuven, Leuven, Belgium
                [2] Department of Public Health and Primary Care, KU Leuven Kulak, Kortrijk, Belgium
                [3] Department of Respiratory Medicine, Ghent University, and VIB Inflammation Research Center, Ghent, Belgium
                [4] Department of Computer Science, UFSCar Federal University of São Carlos, São Carlos, São Paulo, Brazil
                [5] Department of Statistics, Applied Mathematics, and Computer Science, UNESP São Paulo State University, Rio Claro, São Paulo, Brazil
                [6] Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo, São Carlos, São Paulo, Brazil
                [7] INRIA Lille Nord Europe, 40 avenue Halley, 59650 Villeneuve d’Ascq, France
                [8] Department of Biology, UNESP São Paulo State University, São José do Rio Preto, São Paulo, Brazil
                Rutgers University, UNITED STATES
                Author notes

                The authors have declared that no competing interests exist.

                Manuscript number: PCOMPBIOL-D-17-00156
                DOI: 10.1371/journal.pcbi.1006097
                PMC ID: 5933816
                PMID: 29684010
                © 2018 Schietgat et al

                This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

                Counts
                Figures: 11, Tables: 6, Pages: 21
                Funding
                This work was supported by the Explorative Scientific Co-operation Programme between KU Leuven and São Paulo State University (UNESP), the Research Foundation Flanders (FWO-Vlaanderen) [project G.0413.09 to EC, postdoctoral grant to CV, grant GA.001.15N (Chist-ERANET call 2013 Adalab project) to JR], the Research Fund KU Leuven, ERC Starting Grant 240186 and IWT-SBO Nemoa to LS, the São Paulo Research Foundation (FAPESP - Brazil) [project 2015/14300-1 to RC, project 2012/24774-2 to CNF, project 2013/15070-4 to CMAC], the National Council for Scientific and Technological Development (CNPq-Brazil) [project 306493/2013-6 to CMAC], and Coordination for the Improvement of Higher Education Personnel (CAPES-Brazil) to EC. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
                Categories
                Research Article
                Biology and Life Sciences > Biochemistry > Proteins > Protein Domains
                Research and Analysis Methods > Experimental Organism Systems > Model Organisms > Drosophila Melanogaster
                Research and Analysis Methods > Model Organisms > Drosophila Melanogaster
                Research and Analysis Methods > Experimental Organism Systems > Animal Models > Drosophila Melanogaster
                Biology and Life Sciences > Organisms > Eukaryota > Animals > Invertebrates > Arthropoda > Insects > Drosophila > Drosophila Melanogaster
                Research and Analysis Methods > Experimental Organism Systems > Model Organisms > Arabidopsis Thaliana
                Research and Analysis Methods > Model Organisms > Arabidopsis Thaliana
                Biology and Life Sciences > Organisms > Eukaryota > Plants > Brassica > Arabidopsis Thaliana
                Research and Analysis Methods > Experimental Organism Systems > Plant and Algal Models > Arabidopsis Thaliana
                Biology and Life Sciences > Genetics > Genetic Elements > Mobile Genetic Elements > Transposable Elements > Retrotransposons
                Biology and Life Sciences > Genetics > Genomics > Mobile Genetic Elements > Transposable Elements > Retrotransposons
                Computer and Information Sciences > Artificial Intelligence > Machine Learning
                Biology and Life Sciences > Genetics > Genomics > Animal Genomics > Invertebrate Genomics
                Research and Analysis Methods > Database and Informatics Methods > Biological Databases > Sequence Databases
                Research and Analysis Methods > Database and Informatics Methods > Bioinformatics > Sequence Analysis > Sequence Databases
                Engineering and Technology > Management Engineering > Decision Analysis > Decision Trees
                Research and Analysis Methods > Decision Analysis > Decision Trees
                Data availability
                Data and software are available at https://dtai.cs.kuleuven.be/software/te-learner.

                Quantitative & Systems biology
