Automatic annotation of eukaryotic genes, pseudogenes and promoters

Abstract

Background

The ENCODE gene prediction workshop (EGASP) has been organized to evaluate how well state-of-the-art automatic gene finding methods are able to reproduce the manual and experimental gene annotation of the human genome. We have used Softberry gene finding software to predict genes, pseudogenes and promoters in 44 selected ENCODE sequences representing approximately 1% (30 Mb) of the human genome. Predictions of gene finding programs were evaluated in terms of their ability to reproduce the ENCODE-HAVANA annotation.

Results

The Fgenesh++ gene prediction pipeline can identify 91% of coding nucleotides with a specificity of 90%. Our automatic pseudogene finder (PSF program) found 90% of the manually annotated pseudogenes and some new ones. The Fprom promoter prediction program identifies 80% of TATA promoter sequences with one false positive prediction per 2,000 base pairs (bp) and 50% of TATA-less promoters with one false positive prediction per 650 bp. It can be used to identify transcription start sites upstream of annotated coding parts of genes found by gene prediction software.
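
As a rough illustration of how nucleotide-level figures such as "91% of coding nucleotides with a specificity of 90%" are typically computed, the sketch below compares predicted coding intervals against annotated ones. It assumes the convention common in gene-finding evaluations, where sensitivity is TP / (TP + FN) and specificity is TP / (TP + FP); the function names and the coordinates are hypothetical and are not taken from the article or from the EGASP evaluation code.

```python
def to_positions(intervals):
    """Expand 1-based inclusive (start, end) intervals into a set of positions."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end + 1))
    return positions


def nucleotide_accuracy(annotated, predicted):
    """Return (sensitivity, specificity) at the nucleotide level.

    Assumed convention: Sn = TP / (TP + FN) and Sp = TP / (TP + FP),
    where TP counts bases that are coding in both annotation and prediction.
    """
    truth = to_positions(annotated)
    pred = to_positions(predicted)
    tp = len(truth & pred)
    sn = tp / len(truth) if truth else 0.0
    sp = tp / len(pred) if pred else 0.0
    return sn, sp


# Hypothetical coordinates, for illustration only.
annotated_cds = [(101, 200), (301, 400)]  # 200 annotated coding bases
predicted_cds = [(111, 200), (301, 410)]  # 200 predicted coding bases
sn, sp = nucleotide_accuracy(annotated_cds, predicted_cds)
print(f"Sn = {sn:.2f}, Sp = {sp:.2f}")  # Sn = 0.95, Sp = 0.95
```

Expanding intervals into position sets keeps the sketch short; for sequences on the scale of the 30 Mb ENCODE regions, an interval-sweep comparison would be more economical.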

Conclusion

We review our software and underlying methods for identifying these three important structural and functional genome components and discuss the accuracy of predictions, recent advances and open problems in annotating genomic sequences. We have demonstrated that our methods can be effectively used for initial automatic annotation of the eukaryotic genome.

Most cited references

GeneMark.hmm: new solutions for gene finding.

A. Lukashin (1998)

The number of completely sequenced bacterial genomes has been growing fast. There are computer methods available for finding genes but yet there is a need for more accurate algorithms. The GeneMark.hmm algorithm presented here was designed to improve the gene prediction quality in terms of finding exact gene boundaries. The idea was to embed the GeneMark models into naturally derived hidden Markov model framework with gene boundaries modeled as transitions between hidden states. We also used the specially derived ribosome binding site pattern to refine predictions of translation initiation codons. The algorithm was evaluated on several test sets including 10 complete bacterial genomes. It was shown that the new algorithm is significantly more accurate than GeneMark in exact gene prediction. Interestingly, the high gene finding accuracy was observed even in the case when Markov models of order zero, one and two were used. We present the analysis of false positive and false negative predictions with the caution that these categories are not precisely defined if the public database annotation is used as a control.
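
The idea of modeling gene boundaries as transitions between hidden states can be illustrated with a toy two-state hidden Markov model decoded by the Viterbi algorithm. This is only a minimal sketch: the states, the probabilities, and the zero-order emission model are invented for illustration and are far simpler than the GeneMark.hmm models described in the abstract above.

```python
import math

STATES = ("noncoding", "coding")

# Illustrative parameters only; not taken from GeneMark.hmm.
START = {"noncoding": 0.9, "coding": 0.1}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "coding": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}


def viterbi(seq):
    """Return the most probable state path for seq, computed in log space."""
    if not seq:
        return []
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for arriving in state s at this base.
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            pointers[s] = prev
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def boundaries(path):
    """Return 0-based indices at which the decoded state changes."""
    return [i for i in range(1, len(path)) if path[i] != path[i - 1]]


seq = "ATATATAT" + "GCGCGCGC" + "ATATATAT"
print(boundaries(viterbi(seq)))  # [8, 16]
```

In this toy model the predicted "gene" is simply the GC-rich block, and its start and end are the two positions where the decoded path switches state.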

Ab initio gene finding in Drosophila genomic DNA.

A. A. Salamov, V. V. Solovyev (2000)

Ab initio gene identification in the genomic sequence of Drosophila melanogaster was obtained using Fgenes (human gene predictor) and Fgenesh programs that have organism-specific parameters for human, Drosophila, plants, yeast, and nematode. We did not use information about cDNA/EST in most predictions to model a real situation for finding new genes because information about complete cDNA is often absent or based on very small partial fragments. We investigated the accuracy of gene prediction on different levels and designed several schemes to predict an unambiguous set of genes (annotation CGG1), a set of reliable exons (annotation CGG2), and the most complete set of exons (annotation CGG3). For 49 genes, protein products of which have clear homologs in protein databases, predictions were recomputed by Fgenesh+ program. The first annotation serves as the optimal computational description of new sequence to be presented in a database. Reliable exons from the second annotation serve as good candidates for selecting the PCR primers for experimental work for gene structure verification. Our results show that we can identify approximately 90% of coding nucleotides with 20% false positives. At the exon level we accurately predicted 65% of exons and 89% including overlapping exons with 49% false positives. Optimizing accuracy of prediction, we designed a gene identification scheme using Fgenesh, which provided sensitivity (Sn) = 98% and specificity (Sp) = 86% at the base level, Sn = 81% (97% including overlapping exons) and Sp = 58% at the exon level and Sn = 72% and Sp = 39% at the gene level (estimating sensitivity on std1 set and specificity on std3 set). In general, these results showed that computational gene prediction can be a reliable tool for annotating new genomic sequences, giving accurate information on 90% of coding sequences with 14% false positives. However, exact gene prediction (especially at the gene level) needs additional improvement using gene prediction algorithms. The program was also tested for predicting genes of human Chromosome 22 (the last variant of Fgenesh can analyze the whole chromosome sequence). This analysis has demonstrated that 88% of manually annotated exons in Chromosome 22 were among the ab initio predicted exons. The suite of gene identification programs is available through the WWW server of Computational Genomics Group at http://genomic.sanger.ac.uk/gf.html.

dbEST--database for "expressed sequence tags".

M. S. Boguski, T. M. Lowe, C. Tolstoshev (1993)

Author and article information

Journal

Journal ID (nlm-ta): Genome Biol

Title: Genome Biology

Publisher: BioMed Central (London)

ISSN (Print): 1465-6906

ISSN (Electronic): 1465-6914

Publication date (Print): 2006

Publication date (Electronic): 7 August 2006

Volume: 7

Issue: Suppl 1

Page: S10

Affiliations

[1] Department of Computer Science, Royal Holloway, University of London, Egham, Surrey TW20 0EX, UK

[2] Softberry Inc., Radio Circle, Mount Kisco, NY 10549, USA

Article

Publisher ID: gb-2006-7-s1-s10

DOI: 10.1186/gb-2006-7-s1-s10

PMC ID: 1810547

PubMed ID: 16925832

SO-VID: 444c81fe-d4e5-44cc-9252-e58cf54ed929

License:

This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
