
      Prediction of DNA binding motifs from 3D models of transcription factors; identifying TLX3 regulated genes

      research-article


          Abstract

          Proper cell functioning depends on the precise spatio-temporal expression of its genetic material. Gene expression is controlled to a great extent by sequence-specific transcription factors (TFs). Our current knowledge of where and how TFs bind and associate to regulate gene expression is incomplete. A structure-based computational algorithm (TF2DNA) was developed to identify the binding specificities of TFs. The method constructs homology models of TFs bound to DNA and, after optimization in a molecular mechanics force field, assesses the relative binding affinity for all possible DNA sequences using a knowledge-based potential. TF2DNA predictions were benchmarked against experimentally determined binding motifs. Success rates range from 45% to 81% and depend primarily on the sequence identity between aligned target sequences and template structures. TF2DNA was used to predict 1321 motifs for 1825 putative human TF proteins, facilitating the reconstruction of most of the human gene regulatory network. As an illustration, the predicted DNA binding site for the poorly characterized T-cell leukemia homeobox 3 (TLX3) TF was confirmed with gel shift assay experiments. TLX3 motif searches in human promoter regions identified a group of genes enriched in functions relating to hematopoiesis, tissue morphology, endocrine system and connective tissue development and function.
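
          The workflow described in the abstract (score every possible DNA site, derive a motif from the best-scoring ones, then search sequences for matches) can be illustrated with a minimal Python sketch. This is not the TF2DNA implementation: the scoring function below is a hypothetical stand-in that simply counts mismatches to an arbitrary reference site, where the actual method would evaluate a 3D homology model with a knowledge-based potential. The names `toy_binding_score`, `rank_all_sites`, `frequency_matrix`, `scan`, the reference site `ACGTAC`, and the threshold are all assumptions made for illustration.

```python
from itertools import product

BASES = "ACGT"
REFERENCE_SITE = "ACGTAC"  # arbitrary illustrative site, not a real TF motif


def toy_binding_score(site: str) -> float:
    """Hypothetical stand-in for a binding energy: mismatches to REFERENCE_SITE.

    Lower is better. A structure-based method would score a 3D model here.
    """
    return float(sum(a != b for a, b in zip(site, REFERENCE_SITE)))


def rank_all_sites(length: int, score_fn) -> list[str]:
    """Enumerate all 4**length DNA sequences and sort them, best score first."""
    sites = ("".join(p) for p in product(BASES, repeat=length))
    return sorted(sites, key=score_fn)


def frequency_matrix(sites: list[str]) -> list[dict[str, float]]:
    """Per-position base frequencies across equally long sites."""
    matrix = []
    for column in zip(*sites):
        matrix.append({b: column.count(b) / len(sites) for b in BASES})
    return matrix


def scan(sequence: str, matrix, threshold: float) -> list[tuple[int, str]]:
    """Return (offset, window) for windows whose summed frequencies reach threshold."""
    width = len(matrix)
    hits = []
    for i in range(len(sequence) - width + 1):
        window = sequence[i:i + width]
        score = sum(matrix[j][base] for j, base in enumerate(window))
        if score >= threshold:
            hits.append((i, window))
    return hits


ranked = rank_all_sites(len(REFERENCE_SITE), toy_binding_score)
top_sites = ranked[:19]  # the reference plus its 18 single-mismatch variants
motif = frequency_matrix(top_sites)
consensus = "".join(max(col, key=col.get) for col in motif)
hits = scan("GGACGTACCC", motif, threshold=5.0)
```

          Exhaustive enumeration is practical only for short sites, since the number of candidates grows as 4 to the power of the site length. A real promoter search would also typically use log-odds scores against a background model rather than raw frequencies.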


                Author and article information

                Journal
                Nucleic Acids Research (Nucleic Acids Res, nar)
                Oxford University Press
                ISSN: 0305-1048, 1362-4962
                16 December 2014
                26 November 2014
                Volume: 42
                Issue: 22
                Pages: 13500-13512
                Affiliations
                [1] Department of Systems and Computational Biology, Albert Einstein College of Medicine, 1300 Morris Park Ave., Bronx, NY 10461, USA
                [2] Department of Biochemistry, Albert Einstein College of Medicine, 1300 Morris Park Ave., Bronx, NY 10461, USA
                [3] Macromolecular Therapeutics Development, Albert Einstein College of Medicine, 1300 Morris Park Ave., Bronx, NY 10461, USA
                [4] Molecular Neuroscience Laboratory, Geisinger Clinic, 100 North Academy Avenue, Danville, PA 17822, USA
                Author notes
                [*] To whom correspondence should be addressed. Tel: +1-718-678-1068; Fax: +1-718-678-1019; Email: andras.fiser@einstein.yu.edu
                Article
                DOI: 10.1093/nar/gku1228
                PMCID: PMC4267649
                PMID: 25428367
                © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.

                This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

                History
                Received: 01 August 2014
                Revised: 17 October 2014
                Accepted: 07 November 2014
                Page count
                Pages: 13
                Categories
                16
                24
                Computational Biology
                Custom metadata
                16 December 2014

                Genetics
