      Computational analysis of core promoters in the Drosophila genome

      Genome Biology
      BioMed Central

          Abstract

          Candidate transcription start sites have been identified for about 2,000 Drosophila genes by aligning 5' expressed sequence tags (ESTs) from cap-trapped cDNA libraries to the genome. Examination of the sequences flanking these candidate transcription start sites revealed the presence of well-known core promoter motifs such as the TATA box, the initiator and the downstream promoter element (DPE).

          Background

          The core promoter, a region of about 100 base-pairs flanking the transcription start site (TSS), serves as the recognition site for the basal transcription apparatus. Drosophila TSSs have generally been mapped by individual experiments; the low number of accurately mapped TSSs has limited analysis of promoter sequence motifs and the training of computational prediction tools.

          Results

          We identified TSS candidates for about 2,000 Drosophila genes by aligning 5' expressed sequence tags (ESTs) from cap-trapped cDNA libraries to the genome, while applying stringent criteria concerning coverage and 5'-end distribution. Examination of the sequences flanking these TSSs revealed the presence of well-known core promoter motifs such as the TATA box, the initiator and the downstream promoter element (DPE). We also define, and assess the distribution of, several new motifs prevalent in core promoters, including what appears to be a variant DPE motif. Among the prevalent motifs is the DNA-replication-related element DRE, recently shown to be part of the recognition site for the TBP-related factor TRF2. Our TSS set was then used to retrain the computational promoter predictor McPromoter, allowing us to improve the recognition performance to over 50% sensitivity and 40% specificity. We compare these computational results to promoter prediction in vertebrates.

          Conclusions

          There are relatively few recognizable binding sites for previously known general transcription factors in Drosophila core promoters. However, we identified several new motifs enriched in promoter regions. We were also able to significantly improve the performance of computational TSS prediction in Drosophila.
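As a rough illustration of what "examining the sequences flanking a TSS for core promoter motifs" can look like in practice, the sketch below scans a window around a candidate TSS for two motifs. It is a minimal sketch and not the authors' method: the function name `scan_core_promoter`, the 50-base flank, and the consensus patterns (a TATAAA TATA box and a TCA[GT]T[CT] initiator) are assumptions chosen for illustration and are not taken from the article. Exact string matching is used for simplicity.

```python
import re

# Illustrative consensus patterns (assumptions, not taken from the article).
MOTIFS = {
    "TATA": re.compile(r"TATAAA"),
    "Inr": re.compile(r"TCA[GT]T[CT]"),
}


def scan_core_promoter(seq, tss_index, flank=50):
    """Report motif start positions relative to the TSS.

    seq       -- genomic sequence on the transcribed strand
    tss_index -- 0-based index of the TSS base within seq
    flank     -- number of bases examined on each side of the TSS

    Positions follow the usual convention: the TSS base is +1 and the
    base immediately upstream is -1 (there is no position 0).
    """
    start = max(0, tss_index - flank)
    end = min(len(seq), tss_index + flank)
    window = seq[start:end].upper()

    hits = {}
    for name, pattern in MOTIFS.items():
        positions = []
        # finditer reports non-overlapping matches only.
        for match in pattern.finditer(window):
            offset = start + match.start() - tss_index  # 0 at the TSS base
            positions.append(offset + 1 if offset >= 0 else offset)
        hits[name] = positions
    return hits


# Synthetic example sequence, constructed by hand for this sketch.
example = "G" * 20 + "TATAAA" + "G" * 24 + "TCAGTC" + "G" * 44
print(scan_core_promoter(example, tss_index=52))
```

A realistic analysis would replace the exact patterns with position weight matrices and compare motif frequencies against a background set, since real binding sites tolerate mismatches.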


                Author and article information

                Journal
                Genome Biol
                Genome Biology
                BioMed Central (London)
                1465-6906
                1465-6914
                2002
                20 December 2002
                Volume: 3
                Issue: 12
                Pages: research0087.1-87.12
                Affiliations
                [1] Department of Molecular and Cell Biology, University of California at Berkeley, Berkeley, CA 94720-3200, USA
                [2] Howard Hughes Medical Institute, University of California at Berkeley, Berkeley, CA 94720-3200, USA
                [3] Computer Science 5, University of Erlangen-Nuremberg, Martensstrasse 3, D-91058 Erlangen, Germany
                [4] Current address: Department of Biology, Massachusetts Institute of Technology, 77 Massachusetts Ave 68-223, Cambridge, MA 02139, USA
                Correspondence: Uwe Ohler. E-mail: ohler@mit.edu
                Article
                gb-2002-3-12-research0087
                DOI: 10.1186/gb-2002-3-12-research0087
                PMC ID: 151189
                PubMed ID: 12537576
                63c33176-e25e-4fa4-8ffc-4241aee05856
                Copyright © 2002 Ohler et al., licensee BioMed Central Ltd
                History
                : 7 October 2002
                : 19 November 2002
                : 27 November 2002
                Categories
                Research

                Genetics
