      Genome-wide identification of coding and non-coding conserved sequence tags in human and mouse genomes


          Abstract

          Background

          The accurate detection of genes and the identification of functional regions remain open problems in the annotation of genomic sequences. These problems affect not only new genomes but also those of very well studied organisms such as human and mouse, where, despite great effort, the inventory of genes and regulatory regions is far from complete. Comparative genomics is an effective approach to this problem. Unfortunately, it is limited by the computational cost of genome-wide comparisons and by the difficulty of discriminating between conserved coding and non-coding sequences. This discrimination is often based on, and thus dependent on, the availability of annotated proteins.

          Results

          In this paper we present the results of a comprehensive comparison of the human and mouse genomes, performed with a new high-throughput grid-based system that allows rapid detection of conserved sequences and accurate assessment of their coding potential. By detecting clusters of coding conserved sequences, the system can also accurately identify potential gene loci.

          Following this analysis, we created a collection of human-mouse conserved sequence tags and compared our results against reliable annotations to benchmark the reliability of our classifications. Strikingly, we detected several potential gene loci that are supported by EST sequences but do not correspond to any gene annotated so far.
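
          As a rough illustration of how clusters of coding conserved sequence tags could point to candidate gene loci, the following Python sketch groups coding tags on the same chromosome that lie within a maximum gap of each other. The `Tag` record, the function name, and the `max_gap` and `min_tags` parameters (and their default values) are assumptions made for illustration only; they are not the criteria used by the system described in this abstract.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Tag:
    """A conserved sequence tag (hypothetical record layout)."""
    chrom: str
    start: int    # 0-based start coordinate
    end: int      # end coordinate (exclusive)
    coding: bool  # True if the tag was classified as coding


def cluster_coding_tags(tags: List[Tag],
                        max_gap: int = 10_000,
                        min_tags: int = 2) -> List[List[Tag]]:
    """Group coding tags into candidate loci.

    Tags are sorted by chromosome and start. A tag joins the current
    cluster if it is on the same chromosome and starts no more than
    max_gap bases after the current cluster's end. Clusters with fewer
    than min_tags tags are discarded.
    """
    coding = sorted((t for t in tags if t.coding),
                    key=lambda t: (t.chrom, t.start))
    clusters: List[List[Tag]] = []
    current: List[Tag] = []
    current_end = 0
    for tag in coding:
        starts_new = (not current
                      or tag.chrom != current[0].chrom
                      or tag.start - current_end > max_gap)
        if starts_new:
            if len(current) >= min_tags:
                clusters.append(current)
            current = [tag]
            current_end = tag.end
        else:
            current.append(tag)
            current_end = max(current_end, tag.end)
    if len(current) >= min_tags:
        clusters.append(current)
    return clusters


if __name__ == "__main__":
    # Made-up coordinates, for illustration only.
    example = [
        Tag("chr1", 100, 200, True),
        Tag("chr1", 300, 400, False),
        Tag("chr1", 5_000, 5_100, True),
        Tag("chr1", 50_000, 50_100, True),
    ]
    for locus in cluster_coding_tags(example):
        print(locus[0].chrom, locus[0].start, max(t.end for t in locus))
```

          A single distance threshold is the simplest possible grouping rule; a real pipeline would likely also weigh strand, coding scores, and transcript evidence such as ESTs.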

          Conclusion

          Here we present a new system that allows comprehensive comparison of genomes to detect conserved coding and non-coding sequences and to identify potential gene loci. Our system does not require any annotated sequence and is thus suitable for the analysis of new or poorly annotated genomes.

          Most cited references (15)

          Human-mouse alignments with BLASTZ.

          The Mouse Genome Analysis Consortium aligned the human and mouse genome sequences for a variety of purposes, using alignment programs that suited the various needs. For investigating issues regarding genome evolution, a particularly sensitive method was needed to permit alignment of a large proportion of the neutrally evolving regions. We selected a program called BLASTZ, an independent implementation of the Gapped BLAST algorithm specifically designed for aligning two long genomic sequences. BLASTZ was subsequently modified, both to attain efficiency adequate for aligning entire mammalian genomes and to increase its sensitivity. This work describes BLASTZ, its modifications, the hardware environment on which we run it, and several empirical studies to validate its results.

            An apolipoprotein influencing triglycerides in humans and mice revealed by comparative sequencing.

            Comparison of genomic DNA sequences from human and mouse revealed a new apolipoprotein (APO) gene (APOAV) located proximal to the well-characterized APOAI/CIII/AIV gene cluster on human 11q23. Mice expressing a human APOAV transgene showed a decrease in plasma triglyceride concentrations to one-third of those in control mice; conversely, knockout mice lacking Apoav had four times as much plasma triglycerides as controls. In humans, single nucleotide polymorphisms (SNPs) across the APOAV locus were found to be significantly associated with plasma triglyceride levels in two independent studies. These findings indicate that APOAV is an important determinant of plasma triglyceride levels, a major risk factor for coronary artery disease.

              Fast-evolving noncoding sequences in the human genome

              Background

              The manner in which the expression of genes is regulated defines and determines many of the cellular and developmental processes in an organism. It has been hypothesized that variation in gene regulation is responsible for much of the phenotypic diversity within and between species [1]. In particular, it was proposed a few decades ago that the phenotypic divergence between humans and chimpanzees is largely due to changes in gene regulation rather than changes in the protein-coding sequences of genes [2]. Although it has long been recognized that regulatory sequences play an important role in genome function, the fine structure and evolutionary patterns of such sequences are not well understood [3], mainly because such sequences have a much more complex functional code and appear not to be restricted to particular sequence motifs. One of the most powerful approaches with which to identify regulatory sequences has been to use multiple-species comparative sequence analysis to look for conserved noncoding (CNC) sequences [4], but these sequences represent only a subset of regulatory elements in the genome and only a subset of them are regulatory elements [5]. CNC sequences are distributed throughout the genome in a manner independent of gene density [6,7]. Studies of nucleotide variation have revealed strong selective constraints on CNC sequences in human populations [8], and so there is little doubt that a large number of them have a functional role. The abundance and genomic distribution of CNC sequences have raised intriguing questions about the functions of such sequences in the genome. Although a small fraction of the CNC sequences can be associated with transcriptional regulation (most of the most highly conserved examples of CNC sequences appear to be enhancers of early development genes [5,9]), there remains a large number of CNC sequences with unexplained function.
Although the identification of CNC sequences relies on sequence conservation, it is conceivable that some of the most interesting functional noncoding elements are also evolving under positive (directional) selection in particular lineages. Studies in Drosophila have suggested that such a pattern exists in untranslated regions and in some introns and intergenic DNA [10]. Moreover, loss-of-function mutations as well as mutations that lead to gain of novel functions are also likely to contribute to evolutionary change [11,12]. A relatively recent model for the evolution of novel gene function following gene duplication proposes that the reciprocal degeneration of regulatory elements after duplication (duplication-degeneration-complementation) [13] could drive gene subfunctionalization, and an older model of gene duplication proposed an important role for positive selection after duplication [14-16]. All of the above evolutionary processes could contribute to phenotypic evolution in the human lineage, and would result in a lineage-specific acceleration of the substitution rate of associated functional noncoding DNA. In the present study we conducted an analysis of lineage-specific acceleration of previously identified CNC sequences in vertebrates. By comparing the CNC sequences of three genomes - human, chimpanzee and macaque - we identify 1,356 CNC sequences that have an excess of human-specific substitutions relative to the chimpanzee lineage. By analyzing the genomic distribution and nucleotide variation of these fast-evolving (accelerated) CNC sequences, we find that significant numbers of them are found in the most recent (mostly human-specific) segmental duplications, and single nucleotide polymorphisms (SNPs) within them are associated with changes in gene expression. We also find a strong signal of recent directional selection in the human lineage. 
Results

Searching for fast-evolving (accelerated) conserved noncoding sequences

We selected 304,291 of the most conserved noncoding sequences of at least 100 base pairs (bp) in length to look for evidence of accelerated substitution rate in the human lineage (see Materials and methods, below), by comparing the orthologous sequences of CNC sequences between human and chimpanzee. We used a χ2-based test to detect regions of CNC sequence that are diverging at an accelerated rate in either the human or chimpanzee lineage [17]. The test requires at least four substitutions between human and chimpanzee. Of the 304,291 CNC sequences, only 26,475 have at least four human-chimpanzee substitutions. For those 26,475 CNC sequences, we generated human-chimpanzee-macaque three-way alignments to infer the direction of substitutions, and performed Tajima's one-tailed χ2 test to detect human-specific or chimpanzee-specific substitution rate acceleration, applying Yates's correction for continuity to correct for small substitution counts [17]. The chosen P value threshold was P = 0.08, because it was the P value with the minimum false discovery rate (FDR; see Materials and methods, below) in the range of P values between 0.05 and 0.15 (FDR = 75%). At this threshold we detected a total of 2,794 (10.6%) accelerated CNC sequences (hereafter referred to as accelerated noncoding [ANC] sequences) in either the human (1,356 ANC sequences [5.1%]) or the chimpanzee (1,438 ANC sequences [5.3%]) lineage (Figure 1a) with P ≤ 0.08, whereas we expected only 2,118 in total by chance. The FDR of 75% is likely to be an overestimate because Yates's correction is generally considered conservative.

Figure 1. Substitution rates of 1,356 human-specific ANC sequences. Shown are the relative rates (P distance) of substitutions of (a) the 1,356 accelerated noncoding (ANC) sequences in the human (y-axis) and chimpanzee (x-axis) lineages, and (b) the 1,145 ANC sequences excluding those within potential confounding features (segmental duplications, copy number variants, pseudogenes, and retroposons).

Comparison of the human and chimpanzee chromosomes in the alignments reveals that only 20 out of 1,356 are not on the expected syntenic chromosome (Additional data file 1). We also conducted visual and manual examination of a random sample of 5% of the ANC sequences across the whole spectrum of significance (Additional data file 1) to confirm that the signals we detect are not a result of misalignments, and we have concluded that this is very rare (only two out of 72 cases are potentially problematic). Some of the ANC sequences overlap with features that could potentially create such patterns (segmental duplications, retroposed genes, and pseudogenes), but in all of the cases that we tested the result cannot be explained by misalignment. In fact, if we exclude sequences that could generate potential alignment artefacts (segmental duplications, retroposed genes, and pseudogenes [see below]), we then detect 1,145 human ANC sequences (Figure 1b) relative to 18,289 power CNC sequences. The FDR is estimated at 40% (P 5%) within ANC sequences and power CNC sequences.

A linear regression was then performed (separately within each population) between quantitative gene expression values for 14,925 probes (a subset chosen on the basis of sufficient measurable expression levels and variability) and numerically coded genotypes (0, 1, 2) of each SNP within a 10 Mb window centered on the midpoint of each transcript probe. The statistical significance was evaluated through the use of 10,000 permutations performed separately for each gene.
In each permutation of a single gene, the most significant P value was retained, so that there were 10,000 P values for each gene. From these distributions, for each gene, we determined significance thresholds of 0.0001, 0.001, and 0.01. For each gene tested for association with SNPs in ANC sequences or power CNC sequences, the GO slim terms were tabulated in a nonredundant list (multiple transcripts were removed). For each GO slim term, the counts of genes with and without the term among significantly associated genes (at threshold 0.01) and among all genes tested were compared using 2 × 2 contingency tables and Fisher's exact test, for genes associated with SNPs in accelerated (ANC) and power CNC sequences.

Additional data files

The following additional data files are available with the online version of this paper. Additional data file 1 is a table listing the coordinates for ANC sequences, highlighting those manually checked and overlapping other elements. Additional data file 2 is a figure of the patterns and levels of nucleotide variation in ANC sequences compared with the alternatively defined fast-evolving CNC sequences.
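
The GO slim comparison described in the methods above rests on Fisher's exact test applied to a 2 × 2 contingency table. A minimal sketch of that test in plain Python follows. It assumes a two-sided test (the text does not say whether the test was one- or two-sided), the function name is hypothetical, and the counts in the usage example are invented for illustration rather than taken from the study.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    Rows could be, for example, "significantly associated genes" and
    "other tested genes"; columns "has the GO slim term" and "lacks it".
    The P value is the total hypergeometric probability of all tables
    with the same margins that are no more likely than the observed one.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    denom = comb(total, col1)

    def prob(x: int) -> float:
        # Probability that the top-left cell equals x, given the margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Small relative tolerance so ties are not lost to rounding.
    p_value = sum(p for p in (prob(x) for x in range(lowest, highest + 1))
                  if p <= p_observed * (1 + 1e-9))
    return min(1.0, p_value)


if __name__ == "__main__":
    # Invented counts: 12 of 40 associated genes carry a term,
    # versus 90 of 900 remaining tested genes.
    print(fisher_exact_two_sided(12, 28, 90, 810))
```

Because many GO slim terms are tested, the resulting P values would normally need a multiple-testing adjustment; the excerpt above does not say which one, if any, was applied.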

                Author and article information

                Journal
                BMC Genomics
                BioMed Central
                ISSN: 1471-2164
                11 June 2008
                Volume: 9
                Article number: 277
                Affiliations
                [1] Department of Structural Chemistry and Inorganic Stereochemistry, School of Pharmacy, University of Milan, Italy
                [2] Department of Biomolecular Sciences and Biotechnology, University of Milan, Italy
                [3] National Institute of Nuclear Physics, Bari, Italy
                [4] Istituto Tecnologie Biomediche, Consiglio Nazionale delle Ricerche, Bari, Italy
                [5] Dipartimento di Biochimica e Biologia Molecolare, University of Bari, Italy
                Article
                1471-2164-9-277
                DOI: 10.1186/1471-2164-9-277
                PMCID: PMC2442843
                PMID: 18547402
                Copyright © 2008 Mignone et al; licensee BioMed Central Ltd.

                This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

                History
                18 October 2007
                11 June 2008
                Categories
                Research Article

                Genetics
