AUGUSTUS at EGASP: using EST, protein and genomic alignments for improved gene prediction in the human genome

Abstract

Background

A large number of gene prediction programs for the human genome exist. These annotation tools use a variety of methods and data sources. In the recent ENCODE genome annotation assessment project (EGASP), some of the most commonly used and recently developed gene-prediction programs were systematically evaluated and compared on test data from the human genome. AUGUSTUS was among the tools that were tested in this project.

Results

AUGUSTUS can be used as an ab initio program, that is, as a program that takes a single genomic sequence as its only input. It can also combine the genomic sequence under study with external hints from various sources of information. For EGASP, we used genomic sequence alignments as well as alignments to expressed sequence tags (ESTs) and protein sequences as additional sources of information. Among the ab initio programs, AUGUSTUS predicted significantly more genes correctly than any other. It also predicted the fewest false positive genes and the fewest false positive exons in that category. The accuracy of AUGUSTUS improved further when additional extrinsic data, such as alignments to EST, protein and/or genomic sequences, were taken into account.
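To make terms such as "predicted correctly" and "false positive exons" concrete, the following minimal sketch counts exact exon matches between a prediction and a reference annotation. It assumes the convention common in gene-finding evaluations, where sensitivity is the fraction of annotated exons that are predicted exactly and specificity is the fraction of predicted exons that are exactly correct. The function name, the tuple layout and the coordinates are hypothetical illustrations; the abstract does not describe the actual EGASP evaluation protocol, and this is not a reimplementation of it.

```python
from typing import Set, Tuple

# Hypothetical exon representation: (sequence name, start, end, strand).
Exon = Tuple[str, int, int, str]


def exon_level_accuracy(predicted: Set[Exon], annotated: Set[Exon]) -> Tuple[float, float]:
    """Return (sensitivity, specificity), counting only exact exon matches."""
    true_positives = len(predicted & annotated)
    # Sensitivity: share of annotated exons that were predicted exactly.
    sensitivity = true_positives / len(annotated) if annotated else 0.0
    # Specificity (gene-finding usage): share of predicted exons that are correct.
    specificity = true_positives / len(predicted) if predicted else 0.0
    return sensitivity, specificity


# Made-up toy data, for illustration only.
annotated = {
    ("chr22", 100, 200, "+"),
    ("chr22", 300, 450, "+"),
    ("chr22", 600, 700, "+"),
    ("chr22", 900, 1000, "+"),
}
predicted = {
    ("chr22", 100, 200, "+"),
    ("chr22", 300, 450, "+"),
    ("chr22", 610, 700, "+"),  # wrong start coordinate, so not an exact match
}

sn, sp = exon_level_accuracy(predicted, annotated)
false_positive_exons = len(predicted - annotated)
print(f"sensitivity={sn:.2f} specificity={sp:.2f} false positives={false_positive_exons}")
```

In this toy case two of the four annotated exons are recovered exactly, and one of the three predicted exons is a false positive. A gene-level measure could be sketched the same way by comparing whole sets of exons per gene.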

Conclusion

AUGUSTUS turned out to be the most accurate ab initio gene finder among the tested tools. Moreover, it is very flexible because it can take information from several sources into consideration simultaneously.

Most cited references 18

Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources

Mario Stanke, Oliver Schöffmann, Burkhard Morgenstern … (2006)

Background: In order to improve gene prediction, extrinsic evidence on the gene structure can be collected from various sources of information such as genome-genome comparisons and EST and protein alignments. However, such evidence is often incomplete and usually uncertain. The extrinsic evidence is usually not sufficient to recover the complete gene structure of all genes, and the available evidence is often unreliable. Therefore extrinsic evidence is most valuable when it is balanced with sequence-intrinsic evidence. Results: We present a fairly general method for integration of external information. Our method is based on the evaluation of hints to potentially protein-coding regions by means of a Generalized Hidden Markov Model (GHMM) that takes both intrinsic and extrinsic information into account. We used this method to extend the ab initio gene prediction program AUGUSTUS to a versatile tool that we call AUGUSTUS+. In this study, we focus on hints derived from matches to an EST or protein database, but our approach can be used to include arbitrary user-defined hints. Our method is only moderately affected by the length of a database match. Further, it exploits the information that can be derived from the absence of such matches. As a special case, AUGUSTUS+ can predict genes under user-defined constraints, e.g. if the positions of certain exons are known. With hints from EST and protein databases, our new approach was able to predict 89% of the exons in human chromosome 22 correctly. Conclusion: Sensitive probabilistic modeling of extrinsic evidence such as sequence database matches can increase gene prediction accuracy. When a match of a sequence interval to an EST or protein sequence is used, it should be treated as compound information rather than as information about individual positions.
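The idea of treating a database match as compound information, and of also using the absence of a match, can be illustrated with a deliberately simplified sketch. It is a toy, not the GHMM described in the abstract: the function name, the containment test for "supported", and the `bonus` and `malus` values are all assumptions chosen for illustration. A candidate exon receives a single multiplicative factor for the whole interval (added in log space) instead of one factor per base, so the adjustment does not grow with the length of the match.

```python
import math
from typing import List, Tuple

# Hypothetical interval representation: (start, end), both inclusive.
Interval = Tuple[int, int]


def hint_adjusted_score(
    intrinsic_log_score: float,
    candidate: Interval,
    hints: List[Interval],
    bonus: float = 10.0,  # assumed factor applied when a hint supports the candidate
    malus: float = 0.9,   # assumed factor applied when no hint supports it
) -> float:
    """Combine an intrinsic log score with one interval-level hint factor."""
    start, end = candidate
    # Assumption: a hint supports a candidate if it fully contains it.
    supported = any(h_start <= start and end <= h_end for h_start, h_end in hints)
    factor = bonus if supported else malus
    # One factor per candidate interval, not one per position.
    return intrinsic_log_score + math.log(factor)


# Made-up example: one EST-derived hint covering positions 100 to 250.
hints = [(100, 250)]
supported_score = hint_adjusted_score(-5.0, (120, 200), hints)
unsupported_score = hint_adjusted_score(-5.0, (400, 500), hints)
print(supported_score, unsupported_score)
```

With these made-up numbers the supported candidate scores above its intrinsic value of -5.0 and the unsupported one slightly below it, which mirrors the balance between intrinsic and extrinsic evidence that the abstract argues for.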

Human-mouse alignments with BLASTZ.

Scott Schwartz, W. Kent, Arian Smit … (2003)

The Mouse Genome Analysis Consortium aligned the human and mouse genome sequences for a variety of purposes, using alignment programs that suited the various needs. For investigating issues regarding genome evolution, a particularly sensitive method was needed to permit alignment of a large proportion of the neutrally evolving regions. We selected a program called BLASTZ, an independent implementation of the Gapped BLAST algorithm specifically designed for aligning two long genomic sequences. BLASTZ was subsequently modified, both to attain efficiency adequate for aligning entire mammalian genomes and to increase its sensitivity. This work describes BLASTZ, its modifications, the hardware environment on which we run it, and several empirical studies to validate its results.

GeneMark.hmm: new solutions for gene finding.

A. Lukashin (1998)

The number of completely sequenced bacterial genomes has been growing fast. Computer methods for finding genes are available, yet there is still a need for more accurate algorithms. The GeneMark.hmm algorithm presented here was designed to improve gene prediction quality in terms of finding exact gene boundaries. The idea was to embed the GeneMark models into a naturally derived hidden Markov model framework, with gene boundaries modeled as transitions between hidden states. We also used a specially derived ribosome binding site pattern to refine predictions of translation initiation codons. The algorithm was evaluated on several test sets, including 10 complete bacterial genomes. It was shown that the new algorithm is significantly more accurate than GeneMark in exact gene prediction. Interestingly, high gene-finding accuracy was observed even when Markov models of order zero, one and two were used. We present the analysis of false positive and false negative predictions with the caution that these categories are not precisely defined if the public database annotation is used as a control.

Author and article information

Journal

Journal ID (nlm-ta): Genome Biol

Title: Genome Biology

Publisher: BioMed Central (London)

ISSN (Print): 1465-6906

ISSN (Electronic): 1465-6914

Publication date (Print): 2006

Publication date (Electronic): 7 August 2006

Volume: 7

Issue: Suppl 1

Page: S11

Affiliations

[1] Institut für Mikrobiologie und Genetik, Universität Göttingen, Goldschmidtstraße, 37077 Göttingen, Germany

Article

Publisher ID: gb-2006-7-s1-s11

DOI: 10.1186/gb-2006-7-s1-s11

PMC ID: 1810548

PubMed ID: 16925833

SO-VID: 2f5198aa-8732-40cf-b92a-eb60ef968184

License:

This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
