      BRAKER2: automatic eukaryotic genome annotation with GeneMark-EP+ and AUGUSTUS supported by a protein database

      research-article

          Abstract

          The task of eukaryotic genome annotation remains challenging. Only a few genomes can serve as standards of annotation, each achieved through a tremendous investment of human curation effort. Still, the correctness of all alternative isoforms, even in the best-annotated genomes, could be a good subject for further investigation. The new BRAKER2 pipeline generates and integrates external protein support into the iterative process of training and gene prediction by GeneMark-EP+ and AUGUSTUS. BRAKER2 continues the line started by BRAKER1, where self-training GeneMark-ET and AUGUSTUS made gene predictions supported by transcriptomic data. Among the challenges addressed by the new pipeline was the generation of reliable hints to protein-coding exon boundaries from likely homologous but evolutionarily distant proteins. In comparison with other pipelines for eukaryotic genome annotation, BRAKER2 is fully automatic. Under equal conditions, it compares favorably with other pipelines, e.g. MAKER2, in terms of accuracy and performance. Development of BRAKER2 should facilitate the harmonization of annotation of protein-coding genes in genomes of different eukaryotic species. However, we fully understand that several more innovations are needed in transcriptomic and proteomic technologies, as well as in algorithmic development, to reach the goal of highly accurate annotation of eukaryotic genomes.


                Author and article information

                Contributors
                Journal
                NAR Genom Bioinform
                NAR Genomics and Bioinformatics
                Oxford University Press
                2631-9268
                March 2021
                06 January 2021
                Volume: 3
                Issue: 1
                Article ID: lqaa108
                Affiliations
                School of Biological Sciences, Georgia Institute of Technology, Atlanta, GA 30332, USA
                Institute of Mathematics and Computer Science, University of Greifswald, 17489 Greifswald, Germany
                Center for Functional Genomics of Microbes, University of Greifswald, 17489 Greifswald, Germany
                Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA
                School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA
                Author notes
                To whom correspondence should be addressed. Email: borodovsky@gatech.edu

                The authors wish it to be known that, in their opinion, the first two authors should be regarded as Joint First Authors.

                The authors wish it to be known that, in their opinion, the last two authors should be regarded as Joint Last Authors.

                Author information
                http://orcid.org/0000-0002-1401-4046
                Article
                Article ID: lqaa108
                DOI: 10.1093/nargab/lqaa108
                PMCID: PMC7787252
                PMID: 33575650
                © The Author(s) 2021. Published by Oxford University Press on behalf of NAR Genomics and Bioinformatics.

                This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com

                History
                Received: 10 August 2020
                Revised: 26 November 2020
                Accepted: 20 December 2020
                Page count
                Pages: 11
                Funding
                Funded by: National Institutes of Health, DOI 10.13039/100000002;
                Award ID: GM128145
                Categories
                AcademicSubjects/SCI00030
                AcademicSubjects/SCI00980
                AcademicSubjects/SCI01060
                AcademicSubjects/SCI01140
                AcademicSubjects/SCI01180
                Standard Article
