
      Homology-based gene structure prediction: simplified matching algorithm using a translated codon (tron) and improved accuracy by allowing for long gaps.

      Bioinformatics
      Algorithms, Amino Acid Sequence, Animals, Caenorhabditis elegans, genetics, Codon, Genes, Helminth, Molecular Sequence Data, Protein Biosynthesis, Reproducibility of Results, Sequence Homology, Amino Acid


          Abstract

          Locating protein-coding exons (CDSs) on a eukaryotic genomic DNA sequence is the initial and essential step in predicting the functions of the genes embedded in that part of the genome. Accurate prediction of CDSs may be achieved by directly matching the DNA sequence with a known protein sequence or with a profile of homologous family members. A new convention for encoding a DNA sequence into a series of 23 possible letters (translated codon, or tron, code) was devised to improve this type of analysis. Using this convention, a dynamic programming algorithm was developed to align a DNA sequence with a protein sequence or profile so that the spliced and translated sequence optimally matches the reference, in the same way as a standard protein sequence alignment that allows for long gaps. The objective function also takes account of frameshift errors, coding potentials, and translational initiation, termination and splicing signals. This method was tested on Caenorhabditis elegans genes of known structure. The accuracy of prediction, measured in terms of a correlation coefficient (CC), was about 95% at the nucleotide level for the 288 genes tested, and 97.0% for the 170 genes whose product and closest homologue share more than 30% identical amino acids. We also propose a strategy to improve the accuracy of prediction for a set of paralogous genes by means of iterative gene prediction and reconstruction of the reference profile derived from the predicted sequences. The source code for the program 'aln', written in ANSI-C, and the test data will be available via anonymous FTP at ftp.genome.ad.jp/pub/genomenet/saitama-cc.

          Contact: gotoh@cancer-c.pref.saitama.jp
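
The abstract reports accuracy as a correlation coefficient (CC) at the nucleotide level but does not give the formula. The sketch below assumes the definition commonly used in gene-prediction benchmarks, where each nucleotide is classed as coding or non-coding and CC is the Matthews-style correlation between the annotated and predicted labels. The function names (`mask_from_exons`, `nucleotide_cc`) and the 0-based, half-open exon coordinates are illustrative choices, not taken from the article or from the 'aln' program.

```python
import math


def mask_from_exons(length, exons):
    """Build a per-nucleotide coding mask from (start, end) exon intervals.

    Coordinates are assumed to be 0-based and half-open (hypothetical
    convention chosen for this sketch).
    """
    mask = [False] * length
    for start, end in exons:
        for i in range(start, end):
            mask[i] = True
    return mask


def nucleotide_cc(true_mask, pred_mask):
    """Correlation coefficient between annotated and predicted coding masks.

    Assumes the usual benchmark definition:
    CC = (TP*TN - FP*FN) / sqrt((TP+FN)(TN+FP)(TP+FP)(TN+FN)).
    Returns 0.0 when the denominator vanishes (e.g. nothing predicted).
    """
    if len(true_mask) != len(pred_mask):
        raise ValueError("masks must have equal length")
    tp = tn = fp = fn = 0
    for t, p in zip(true_mask, pred_mask):
        if t and p:
            tp += 1
        elif not t and not p:
            tn += 1
        elif p:
            fp += 1
        else:
            fn += 1
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Toy example: a 100-nt region with two annotated exons; the prediction
# truncates the second exon by 10 nt.
annotated = mask_from_exons(100, [(10, 30), (60, 90)])
predicted = mask_from_exons(100, [(10, 30), (60, 80)])
cc = nucleotide_cc(annotated, predicted)
```

In this toy case the prediction misses 10 of 50 coding nucleotides and makes no false positives, so the CC falls below 1; a prediction identical to the annotation gives exactly 1.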
