BRAKER2: automatic eukaryotic genome annotation with GeneMark-EP+ and AUGUSTUS supported by a protein database

Abstract

The task of eukaryotic genome annotation remains challenging. Only a few genomes can serve as annotation standards, and those were achieved through a tremendous investment of human curation effort. Even in the best-annotated genomes, the correctness of all alternative isoforms remains a subject for further investigation. The new BRAKER2 pipeline generates external protein support and integrates it into the iterative process of training and gene prediction by GeneMark-EP+ and AUGUSTUS. BRAKER2 continues the line started by BRAKER1, in which self-training GeneMark-ET and AUGUSTUS made gene predictions supported by transcriptomic data. Among the challenges addressed by the new pipeline was the generation of reliable hints to protein-coding exon boundaries from likely homologous but evolutionarily distant proteins. In contrast to other pipelines for eukaryotic genome annotation, BRAKER2 is fully automatic. Under equal conditions, it compares favorably with other pipelines, e.g. MAKER2, in accuracy and performance. Development of BRAKER2 should facilitate harmonizing the annotation of protein-coding genes across genomes of different eukaryotic species. However, we fully understand that several more innovations are needed in transcriptomic and proteomic technologies, as well as in algorithmic development, to reach the goal of highly accurate annotation of eukaryotic genomes.

Most cited references 50

Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype

Daehwan Kim, Joseph M Paggi, Chanhee Park … (2021)

Rapid advances in next-generation sequencing technologies have dramatically changed our ability to perform genome-scale analyses. The human reference genome used for most genomic analyses represents only a small number of individuals, limiting its usefulness for genotyping. We designed a novel method, HISAT2, for representing and searching an expanded model of the human reference genome, in which a large catalogue of known genomic variants and haplotypes is incorporated into the data structure used for searching and alignment. This strategy for representing a population of genomes, along with a fast and memory-efficient search algorithm, enables more detailed and accurate variant analyses than previous methods. We demonstrate two initial applications of HISAT2: HLA typing, a critical need in human organ transplantation, and DNA fingerprinting, widely used in forensics. These applications are part of HISAT-genotype, with performance not only surpassing earlier computational methods, but matching or exceeding the accuracy of laboratory-based assays.

Fast and sensitive protein alignment using DIAMOND.

Benjamin Buchfink, Chao Xie, Daniel Huson (2015)

The alignment of sequencing reads against a protein reference database is a major computational bottleneck in metagenomics and data-intensive evolutionary projects. Although recent tools offer improved performance over the gold standard BLASTX, they exhibit only a modest speedup or low sensitivity. We introduce DIAMOND, an open-source algorithm based on double indexing that is 20,000 times faster than BLASTX on short reads and has a similar degree of sensitivity.

Tandem repeats finder: a program to analyze DNA sequences.

G. Benson (1999)

A tandem repeat in DNA is two or more contiguous, approximate copies of a pattern of nucleotides. Tandem repeats have been shown to cause human disease, may play a variety of regulatory and evolutionary roles and are important laboratory and analytic tools. Extensive knowledge about pattern size, copy number, mutational history, etc. for tandem repeats has been limited by the inability to easily detect them in genomic sequence data. In this paper, we present a new algorithm for finding tandem repeats which works without the need to specify either the pattern or pattern size. We model tandem repeats by percent identity and frequency of indels between adjacent pattern copies and use statistically based recognition criteria. We demonstrate the algorithm's speed and its ability to detect tandem repeats that have undergone extensive mutational change by analyzing four sequences: the human frataxin gene, the human beta T cell receptor locus sequence and two yeast chromosomes. These sequences range in size from 3 kb up to 700 kb. A World Wide Web server interface at c3.biomath.mssm.edu/trf.html has been established for automated use of the program.

Author and article information

Contributors

Tomáš Brůna

Katharina J Hoff

Alexandre Lomsadze

Mario Stanke

Mark Borodovsky (ORCID: http://orcid.org/0000-0002-1401-4046)

Journal

Journal ID (nlm-ta): NAR Genom Bioinform

Journal ID (iso-abbrev): NAR Genom Bioinform

Journal ID (publisher-id): nargab

Title: NAR Genomics and Bioinformatics

Publisher: Oxford University Press

ISSN (Electronic): 2631-9268

Publication date Collection: March 2021

Publication date (Electronic): 06 January 2021

Publication date PMC-release: 06 January 2021

Volume: 3

Issue: 1

Electronic Location Identifier: lqaa108

Affiliations

School of Biological Sciences, Georgia Institute of Technology, Atlanta, GA 30332, USA

Institute of Mathematics and Computer Science, University of Greifswald, 17489 Greifswald, Germany

Center for Functional Genomics of Microbes, University of Greifswald, 17489 Greifswald, Germany

Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA

School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA

Author notes

To whom correspondence should be addressed. Email: borodovsky@gatech.edu

The authors wish it to be known that, in their opinion, the first two authors should be regarded as Joint First Authors.

The authors wish it to be known that, in their opinion, the last two authors should be regarded as Joint Last Authors.

Article

Publisher ID: lqaa108

DOI: 10.1093/nargab/lqaa108

PMC ID: 7787252

PubMed ID: 33575650

SO-VID: f9891e4d-8752-4044-a172-ce6c31d77f62

License:

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com

History

Date received: 10 August 2020

Date revision received: 26 November 2020

Date accepted: 20 December 2020

Page count

Pages: 11

Funding

Funded by: National Institutes of Health, DOI 10.13039/100000002

Award ID: GM128145
