The coding potential of the human genome: global compositional properties identify with statistical significance a plethora of new potential coding regions


Abstract

Bioinformatics predictions of coding sequences rely on detailed models of the compositional properties of genes. Such models are based on large, genome-specific training sets of known genes. Although these models are optimal for the identification of 'average' genes, they may be too heavily parameterized to recognize genes with anomalous properties, for example genes coding for very short peptides. We have developed two approaches to the identification of coding sequences that rely on more general compositional principles that we expect to be conserved over a wider variety of genes. The first approach is based on the observation that coding regions generally exhibit contrasting global compositional properties in the three codon positions, depending on the overall base composition of the sequence. For example, sequences rich in C and G bases have a much higher GC content in the third codon position and a relatively low GC content in the second codon position. General rules on the base content at the three codon positions as a function of the overall base content can be identified and exploited to score sequence regions for their coding potential. More generally, the period-three structure of coding regions imposes a compositional periodicity on the sequence that, irrespective of the specific type of contrasts that we might expect to see, results in a significantly non-random distribution of bases. Applying these principles, we have devised two algorithms to detect potential coding regions in sequences of any composition, one based on overall compositional expectations and one based on overall contrasts. We have applied our procedure to the human genome. To our surprise, we have detected a plethora of regions, not overlapping with any of the currently annotated gene sequences, that display with high statistical significance a periodic structure often conforming to expectations for coding regions in terms of base-type composition.
The frequency of these regions is far greater than the random frequency observed in corresponding scrambled sequences. Most of these regions also show levels of complexity that distinguish them from repetitive elements and that are consistent with the complexity of known genes. Our bioinformatics results provide a rich source of information for future experimental analyses and the potential for exciting new discoveries.
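As a rough illustration of the two ideas described in the abstract, the sketch below computes the GC content at each of the three codon positions and a Pearson chi-square statistic that measures how strongly base identity depends on codon position. It is not the authors' algorithm: the function names (`position_counts`, `gc_by_position`, `periodicity_chi2`), the `frame` parameter, and the choice of a chi-square test of independence are assumptions made here for illustration only.

```python
from collections import Counter

BASES = "ACGT"

def position_counts(seq, frame=0):
    """Count each base at codon positions 1, 2 and 3, starting at offset `frame`.

    Characters other than A, C, G, T (for example N) are skipped.
    """
    counts = [Counter() for _ in range(3)]
    seq = seq.upper()
    for i in range(frame, len(seq)):
        base = seq[i]
        if base in BASES:
            counts[(i - frame) % 3][base] += 1
    return counts

def gc_by_position(seq, frame=0):
    """Return the GC fraction at codon positions 1, 2 and 3 as a list of floats."""
    result = []
    for c in position_counts(seq, frame):
        total = sum(c.values())
        result.append((c["G"] + c["C"]) / total if total else 0.0)
    return result

def periodicity_chi2(seq, frame=0):
    """Pearson chi-square statistic for a 3 x 4 table of position by base.

    Under the null hypothesis that base identity is independent of codon
    position (no period-three structure), the statistic is approximately
    chi-square distributed with (3 - 1) * (4 - 1) = 6 degrees of freedom.
    """
    counts = position_counts(seq, frame)
    row_totals = [sum(c.values()) for c in counts]
    col_totals = {b: sum(c[b] for c in counts) for b in BASES}
    n = sum(row_totals)
    chi2 = 0.0
    for c, r in zip(counts, row_totals):
        for b in BASES:
            expected = r * col_totals[b] / n if n else 0.0
            if expected > 0:  # skip cells for bases absent from the sequence
                chi2 += (c[b] - expected) ** 2 / expected
    return chi2

if __name__ == "__main__":
    example = "ATGGCCGCGTCCGGCACCGCGTAA"  # hypothetical toy sequence
    print(gc_by_position(example))
    print(periodicity_chi2(example))
```

In a scan of this kind, one would presumably slide a window along the sequence, evaluate all three frames on both strands, and compare the statistic with values obtained from scrambled versions of the same window, in the spirit of the scrambled-sequence control mentioned above. Those steps are omitted here.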


Author and article information

Conference

Journal ID (nlm-ta): Genome Biol

Title: Genome Biology

Publisher: BioMed Central

ISSN (Print): 1465-6906

ISSN (Electronic): 1465-6914

Publication date (Print): 2010

Publication date (Electronic): 11 October 2010

Volume: 11

Issue: Suppl 1

Page: P7

Affiliations

[1] Department of Molecular Genetics and Microbiology, University of Florida, Gainesville, FL 32610, USA

Article

Publisher ID: gb-2010-11-s1-p7

DOI: 10.1186/gb-2010-11-s1-p7

PMC ID: 3026278

SO-VID: 92a65772-bc3e-477b-ae11-12ff83d107a2

Conference name: Beyond the Genome: The true gene count, human evolution and disease genomics

Conference location: Boston, MA, USA

Conference date: 11–13 October 2010
