A general integrative genomic feature transcription factor binding site prediction method applied to analysis of USF1 binding in cardiovascular disease

Abstract

Transcription factors are key mediators of human complex disease processes. Identifying the target genes of transcription factors will increase our understanding of the biological network leading to disease risk. The prediction of transcription factor binding sites (TFBSs) is one method to identify these target genes; however, current prediction methods need improvement. We chose the transcription factor upstream stimulatory factor 1 (USF1) to evaluate the performance of our novel TFBS prediction method because of its known genetic association with coronary artery disease (CAD) and the recent availability of USF1 chromatin immunoprecipitation microarray (ChIP-chip) results. The specific goals of our study were to develop a novel and accurate genome-scale method for predicting USF1 binding sites and associated target genes to aid in the study of CAD. Previously published USF1 ChIP-chip data for 1 per cent of the genome were used to develop and evaluate several kernel logistic regression prediction models. A combination of genomic features (phylogenetic conservation, regulatory potential, presence of a CpG island and DNaseI hypersensitivity), as well as position weight matrix (PWM) scores, were used as variables for these models. Our most accurate predictor achieved an area under the receiver operating characteristic curve of 0.827 during cross-validation experiments, significantly outperforming standard PWM-based prediction methods. When applied to the whole human genome, we predicted 24,010 USF1 binding sites within 5 kilobases upstream of the transcription start site of 9,721 genes. These predictions included 16 of 20 genes with strong evidence of USF1 regulation. Finally, in the spirit of genomic convergence, we integrated independent experimental CAD data with these USF1 binding site prediction results to develop a prioritised set of candidate genes for future CAD studies.
We have shown that our novel prediction method, which employs genomic features related to the presence of regulatory elements, enables more accurate and efficient prediction of USF1 binding sites. This method can be extended to other transcription factors identified in human disease studies to help further our understanding of the biology of complex disease.
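
As a rough illustration of two quantities named in the abstract, the sketch below scores a DNA sequence with a position weight matrix (PWM) and computes the area under the ROC curve from scores and binary labels. It is a minimal sketch under stated assumptions, not the authors' implementation: the count matrix is a hypothetical E-box-like motif with made-up counts, the background is assumed uniform, and the function names (`log_odds_matrix`, `best_pwm_score`, `roc_auc`) are introduced here for illustration only. The kernel logistic regression models and the genomic features described above are not reproduced.

```python
import math

# Hypothetical count matrix for a 6-bp E-box-like motif (CACGTG).
# The counts are illustrative only; they are not taken from the article
# or from any motif database.
COUNTS = {
    "A": [1, 17, 1, 1, 1, 1],
    "C": [17, 1, 17, 1, 1, 1],
    "G": [1, 1, 1, 17, 1, 17],
    "T": [1, 1, 1, 1, 17, 1],
}
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # assumed uniform


def log_odds_matrix(counts, background):
    """Convert per-position base counts into log2 odds against the background."""
    width = len(counts["A"])
    totals = [sum(counts[b][i] for b in counts) for i in range(width)]
    return {
        b: [math.log2((counts[b][i] / totals[i]) / background[b]) for i in range(width)]
        for b in counts
    }


def best_pwm_score(sequence, pwm):
    """Return the highest PWM score over all windows of the sequence.

    Windows containing characters other than A, C, G, T are skipped.
    Returns -inf if no window can be scored.
    """
    width = len(pwm["A"])
    seq = sequence.upper()
    best = float("-inf")
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        if any(base not in pwm for base in window):
            continue
        score = sum(pwm[base][i] for i, base in enumerate(window))
        best = max(best, score)
    return best


def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


PWM = log_odds_matrix(COUNTS, BACKGROUND)
```

In this toy setting, `best_pwm_score(region, PWM)` would supply one variable per candidate region, and `roc_auc` would compare any predictor's scores against ChIP-chip-style bound/unbound labels. The pairwise AUC is quadratic in the number of examples, which is acceptable for a sketch but not for genome-scale evaluation.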

Author and article information

Journal

Journal ID (nlm-ta): Hum Genomics

Journal ID (iso-abbrev): Hum. Genomics

Title: Human Genomics

Publisher: BioMed Central

ISSN (Print): 1473-9542

ISSN (Electronic): 1479-7364

Publication date (Collection): 2009

Publication date (Electronic): 1 April 2009

Volume: 3

Issue: 3

Pages: 221-235

Affiliations

[1] Department of Medicine and Center for Human Genetics, Duke University Medical Center, Durham, NC 27710, USA

[2] Department of Biostatistics and Bioinformatics, Duke University, Durham, NC 27708, USA

[3] Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708, USA

[4] Department of Computer Science, North Carolina State University, Raleigh, NC 27695, USA

[5] Bioinformatics Research Center, North Carolina State University, Raleigh, NC 27695, USA

Article

Publisher ID: 1479-7364-3-3-221

DOI: 10.1186/1479-7364-3-3-221

PMC ID: 2742312

PubMed ID: 19403457

SO-VID: 2c1f6d03-a32f-4655-8aa5-e5740decc72a

History

Date received: 17 November 2008

Date accepted: 17 November 2008