Assessing the gene space in draft genomes

There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

Abstract

Genome sequencing projects have been initiated for a wide range of eukaryotes. A few projects have reached completion, but most exist as draft assemblies. As one of the main reasons to sequence a genome is to obtain its catalog of genes, an important question is how complete or completable the catalog is in unfinished genomes. To answer this question, we have identified a set of core eukaryotic genes (CEGs), that are extremely highly conserved and which we believe are present in low copy numbers in higher eukaryotes. From an analysis of a phylogenetically diverse set of eukaryotic genome assemblies, we found that the proportion of CEGs mapped in draft genomes provides a useful metric for describing the gene space, and complements the commonly used N50 length and x-fold coverage values.

Related collections

Most cited references 12

Record: found
Abstract: found
Article: not found

Profile hidden Markov models.

S. Eddy (1998)

The recent literature on profile hidden Markov model (profile HMM) methods and software is reviewed. Profile HMMs turn a multiple sequence alignment into a position-specific scoring system suitable for searching databases for remotely homologous sequences. Profile HMM analyses complement standard pairwise comparison methods for large-scale sequence analysis. Several software implementations and two large libraries of profile HMMs of common protein domains are available. HMM methods performed comparably to threading methods in the CASP2 structure prediction exercise.

0 comments Cited 1254 times – based on 0 reviews      Review now

Bookmark

Record: found
Abstract: found
Article: not found

Genome sequence of the nematode C. elegans: a platform for investigating biology.

(1999)

The 97-megabase genomic sequence of the nematode Caenorhabditis elegans reveals over 19,000 genes. More than 40 percent of the predicted protein products find significant matches in other organisms. There is a variety of repeated sequences, both local and dispersed. The distinctive distribution of some repeats and highly conserved genes provides evidence for a regional organization of the chromosomes.

0 comments Cited 656 times – based on 0 reviews      Review now

Bookmark

Record: found
Abstract: found
Article: not found

The Genome Sequence of Caenorhabditis briggsae: A Platform for Comparative Genomics

Lincoln D. Stein, Zhirong Bao, Darin Blasiar … (2003)

Introduction Comparative sequence analysis is a global approach toward recognizing much of the functional sequence in a genome. Comparisons of genomes of appropriate evolutionary distance can aid in defining protein-coding genes, in recognizing noncoding genes, and in finding regulatory sequences and other functional elements of a genome. The soil-dwelling nematode Caenorhabditis elegans has been intensively studied over the past several decades to establish the molecular genetic basis of its development and behavior. The completion of its genome sequence (C. elegans Sequencing Consortium 1998) provides a complete description of the genetic information, but decoding the program embedded in the sequence remains a challenge. Caenorhabditis briggsae is another soil-dwelling nematode that diverged from C. elegans approximately 100 million years ago (MYA) (Coghlan and Wolfe 2002) and, along with Caenorhabditis remanei, is one of C. elegans' closest known relatives (Jovelin et al. 2003). The two organisms are almost indistinguishable morphologically (Nigon and Dougherty 1949). They follow very similar developmental programs, for example, in sex determination and vulval development (Kirouac and Sternberg 2003; Stothard and Pilgrim 2003). There are, however, subtle differences between the two species. One interesting difference is the inability of C. briggsae to take up and distribute interfering RNAs (M. Montgomery, personal communication). There are also differences in the nature and timing of key events in the development of the vulva (Kirouac and Sternberg 2003), the excretory pore (Wang and Chamberlin 2002), and the male tail (Fitch and Emmons 1995; Fitch 1997). Genomic sequencing of C. briggsae actually began in the mid-1990s, when the Genome Sequencing Center at Washington University in St. Louis began sequencing large-insert clones from throughout the genome, eventually producing 12 Mbp of finished sequence. Even this relatively small amount of sequence, together with genes cloned by individual labs, has been extremely valuable to researchers, who have used it to identify putative transcription factor binding sites (Gilleard et al. 1997; Kirouac and Sternberg 2003), to infer patterns of evolutionary constraints (Civetta and Singh 1998; Castillo-Davis and Hartl 2002), to probe the dynamics of chromosome evolution (Coghlan and Wolfe 2002), and to study the expansion and contraction of large gene families (Sluder et al. 1999; Robertson 2001). We have now extended this work to the whole genome level, assembling a “draft”-quality C. briggsae sequence that covers an estimated 98% of the genome. In this report we discuss the characteristics of this draft and give our preliminary analysis of the C. briggsae genome at the nucleotide and protein levels. Results We begin by describing the sequencing procedure, followed by the results of our analyses at the nucleotide and protein levels. Data files containing the results of the analyses reported here are found as Supporting Information (the full set is Dataset S1). The C. briggsae sequence and analytic results are also available in interactive format from WormBase (http://www.wormbase.org). Genomic Sequencing To sequence the C. briggsae genome, we adopted a hybrid strategy combining whole-genome shotgun sequencing (WGS) with a high-resolution, sequence-ready physical map. The previously finished clone-based sequence was then integrated with the whole-genome assembly to produce the draft sequence. Fingerprint Map Construction Physical mapping and sequencing was performed on C. briggsae strain AF16 (Fodor et al. 1983). We constructed a physical map for C. briggsae using a high-throughput fingerprinting scheme (Marra et al. 1997). Our substrates were two large-insert C. briggsae libraries: a 6.5-fold coverage fosmid library (M. A. Marra, unpublished data) and a 10-fold coverage bacterial artificial chromosome (BAC) library (M. A. Marra and P. deJong, unpublished data). After several rounds of automated assembly and manual review, we developed a physical map consisting of 188 fingerprinted contigs (FPCs) with a mean length of approximately 450 kbp containing 17,885 BAC clones and 16,414 fosmid clones. The longest FPC contig spans more than 4 Mbp of genomic sequence. Because there is little genetic mapping information for C. briggsae, these FPC contigs cannot currently be localized to chromosomal locations. Shotgun Sequencing, Assembly, and Physical Map Integration We performed WGS of small-insert plasmid libraries using the paired-end sequencing strategy introduced by Edwards et al. (1990). This was supplemented with end sequencing of clones from the BAC library in order to facilitate integration of the assembled sequence with the physical map. In all, 2.068 million shotgun reads were assembled, giving more than 10-fold coverage of the C. briggsae genome. A total of 82% of the shotgun reads were paired. The WGS was then assembled using the Phusion assembler (Mullikin and Ning 2003) into a set of 5,341 contigs; these contigs were linked together using the read pair information into a total of 899 gapped “supercontigs” containing 105.6 Mbp of DNA sequence and another 1.9 Mbp of inferred gaps. The N50 (the length x such that 50% of the genome lies in blocks of x or longer) of the contigs was 41 kbp. The supercontigs were substantially larger, as reflected in an N50 of 474 kbp. For reasons discussed below, the actual C. briggsae genome may be slightly smaller than 105.6 Mbp. To integrate the sequence assembly with the physical map, we performed a conceptual restriction digest of the 5,341 sequence contigs and incorporated them into the fingerprint map using automated assembly followed by manual review. During this process, misassemblies in both the physical map and the sequence assembly were detected and corrected. Lastly, we integrated the sequence from 272 finished fosmid and BAC clones from many different regions of the C. briggsae genome. We used this finished sequence to fill 264 gaps in the assembly, adding 269 kbp of sequence from 155 finished clones. In the final analysis, 463 sequence supercontigs from the WGS assembly were placed in 142 FPC contigs. These 142 supercontigs had an N50 length of 1,450 kbp and a total sequence length covering 102,431,873 bp, equivalent to 94% of the summed length of all supercontigs. A total of 436 sequence supercontigs could not be placed onto the FPC map. These unplaced supercontigs were relatively small (largest, 103 kbp; mean, 13.8 kbp; SD, 16.6 kbp; N50, 34 kbp) and covered a total of 6.0 Mbp. This version of the draft assembly is referred to as cb25.agp8. Data for this assembly are available at ftp://ftp.sanger.ac.uk/pub/wormbase/cbriggsae/cb25.agp8/ Completeness and Accuracy of the Draft To assess the completeness of the C. briggsae draft sequence, we compared the WGS contigs and supercontigs to the 12 Mbp of previously finished sequence. We did this before incorporating the finished sequence into the assembly. The SSAHA algorithm (Ning et al. 2001), a fast search method for large DNA databases, was used to match previously finished contigs to the corresponding regions of the WGS draft. From this, we estimate that the contigs cover 98% of the previously finished clone-based sequences and that the scaffolds span 99.3% of the clone-based sequence. Thus, we estimate that the draft covers 98% of the C. briggsae genome. Some 65,000 of the unassembled reads matched three or more other reads in the set. These represent 2.5% of the total reads, suggesting that 2.6 Mbp of additional sequence is represented in these unplaced reads. Among these read clusters are the 5S, ribosomal DNA (rDNA), and mitochondrial sequences. Assembly of these reads yields a 14,420 base mitochondrial sequence, a 7,429–7,431 base rDNA repeat unit, and a 697 and a 938–940 base 5S repeat unit. Based on the total mitochondrial reads, we estimate on average approximately ten copies of the mitochondrial genome per haploid nuclear genome. Most other clusters appear to contain tandem copies of low complexity sequence of the type found scattered throughout the C. elegans genome. Read pair and assembly information allow both the rDNA and 5S clusters to be placed in the current assembly. The rDNA sequences lie adjacent to telomeric repeats as they do in C. elegans. The two arrays of the 5S gene lie adjacent to each other. To assess overall quality of the assembly and to evaluate the gap size estimate, we again used the comparison of the assembly to the finished clone sequence. We detected no global misassemblies. The 264 gaps that were filled with finished sequence contained a total estimated gap size of 179 kbp before filling. After gap filling, however, the total length of the assembly decreased by 323 kbp. Because of the change in the gap size estimates, we estimate that 3% of the bases in the assembly consist of undetected overlaps. Combined with the estimated coverage of 98%, this yields an estimate for the C. briggsae genome size of 104 Mbp. However, this figure contains substantial uncertainty, in part because the finished sequence is not a random selection of the genome, and this number is certain to be revised as additional finishing is performed. We assessed sequence accuracy in two ways. First, we aligned the finished sequence to the final WGS assembly and counted discrepancies. Using this method, the accuracy is 99.98%. Second, we examined the consensus quality scores (Ewing and Green 1998; P. Green, unpublished data) across the assembly. These data again suggest a sequencing accuracy of approximately 99.98%. Content of the C. briggsae Genome To characterize the C. briggsae genome, we began with the following analyses of the content of the C. briggsae genome: (1) identification of candidate protein-coding genes and development of a canonical gene set; (2) characterization of the predicted protein-coding gene set by domain analysis; (3) comparative analysis of the C. briggsae and C. elegans proteomes using the predicted protein-coding gene set; (4) identification and characterization of candidate noncoding genes; and (5) characterization of the repeat family content of the genome. In later sections we examine the C. briggsae and C. elegans genomes at the long-range, structural level and illustrate the utility of comparative analysis in aiding genome interpretation. For all C. briggsae characterizations described in the remainder of this section, we used the cb25.agp8 assembly described earlier. With minor exceptions noted below, all comparisons that involved C. elegans used the C. elegans genome and annotations contained in WormBase release version WS77 (WS77), which was current as of April 2002. Protein-Coding Genes Different protein-coding gene prediction algorithms generally have high concordance rates for predicting exons, but they tend to disagree on the grouping of exons into genes (Reese et al. 2000). Hence, four different gene prediction programs may give four strikingly different answers across the same region of the genome. An increasingly common procedure for overcoming this problem is to predict gene structures using several ab initio gene prediction programs, to compare their output in order to find a representative prediction, and then to partially or fully confirm the structures with experimental data from expressed sequence tags (ESTs), sequenced cDNAs, or protein similarity matches (Goff et al. 2002; Rogic et al. 2002). To find protein-coding genes in C. briggsae, we developed an enhancement of this procedure that uses the concordance of predictions between C. elegans and C. briggsae to predict the most likely gene model. We predicted genes in the C. briggsae genome using the programs Genefinder (version 980506; P. Green, unpublished data), FGENESH (Salamov and Solovyev 2000), TWINSCAN (Korf et al. 2001), and the Ensembl annotation pipeline (Clamp et al. 2003). These programs combine a variety of gene prediction methodologies, including ab initio predictions (Genefinder, FGENESH), EST- and protein-based comparisons (Ensembl), and sequence conservation metrics (TWINSCAN). We called genes in the C. elegans genome by combining hand-curated data models from WS77 (Stein et al. 2001) with the ab initio predictions from Genefinder and FGENESH. For technical reasons, we were unable to run TWINSCAN on the C. elegans genome, while the Ensembl methodology of predicting genes based on matches to previously predicted proteins meant that an Ensembl C. elegans gene set would essentially be a duplicate of the hand-curated WormBase set. As expected, the output of the four gene prediction programs was largely concordant with respect to the position of C. briggsae exons (80% of exons predicted identically by two or more programs, 26% predicted identically by all four programs), but discordant with respect to gene predictions (38% of genes called identically by two or more programs, just 4% called identically by all four programs). A similar pattern was seen in the genes called in C. elegans. To select among overlapping predictions produced by different programs, we reasoned that the most likely gene model is the one that maximizes the similarity between the gene sets in two related species (Figure 1). For each C. briggsae region that had multiple overlapping but inconsistent predictions, our selection procedure chose the prediction that had the most extensive similarity to the matching C. elegans prediction. The extent of similarity was measured by the fraction of the C. briggsae prediction that aligned to the matching C. elegans prediction at the protein level. Likewise, from all the predictions for a C. elegans gene, we chose the prediction having the most extensive similarity to its C. briggsae match. This selection step produced two gene sets, one each for C. briggsae and C. elegans. The gene sets were then filtered to remove transposons and putative pseudogenes. We call the gene sets produced by our procedure “hybrid” gene sets because the final gene sets are a mixture of gene predictions from multiple programs. The procedure selected predictions for both species simultaneously, yielding a C. briggsae hybrid gene set and a C. elegans hybrid gene set. To assess the accuracy of the gene prediction programs in C. elegans, we made a “gold standard” set of C. elegans gene predictions, consisting of 2,257 genes from WS77 for which every base and intron–exon junction had been confirmed by cDNA or EST data. Genefinder made 2,309 predictions that overlapped a gold-standard gene, of which 1,280 (53%) contained all confirmed bases and introns. FGENESH made 2,742 predictions that overlapped a gold-standard gene, of which 1,230 (45%) contained all the confirmed data. We also used the gold standard to assess our selection procedure. For C. elegans genes in the gold-standard set, the selection procedure chose the correct gene model for 92% of gold-standard genes, choosing an alternative (incorrect FGENESH or Genefinder) model 8% of the time. We could not assess the accuracy of the gene prediction programs or selection procedure directly in C. briggsae because we lacked an independent dataset to create a gold standard. The final C. briggsae gene set contains 19,507 genes, and the hybrid C. elegans gene set contains 20,621 genes. Some of the genes taken from WS77 have alternative splices, so the 20,621 C. elegans genes have 21,578 different splice variants. In the absence of substantial EST data, we are currently unable to call or comment on patterns of alternative splicing in C. briggsae. In order to compare the C. briggsae and C. elegans hybrid gene sets to the C. elegans WS77 gene set, we also applied our transposon and pseudogene filtering step to the C. elegans WS77 gene set. This removed 619 genes to create a “pruned” WS77 set of 18,808 genes and 19,791 splices. This pruned set is henceforth called WS77*. An important caveat is that some of the predictions discarded by our filtering step may include real exons: 29 (9%) of the 316 putative pseudogenes in C. elegans WS77 that were discarded have been partially or fully confirmed by EST or cDNA data. Files containing the C. briggsae, C. elegans hybrid, and C. elegans WS77* gene predictions, their genomic positions, and their conceptual translations are available as Dataset S2. Comparing the C. briggsae and C. elegans Gene Sets The C. briggsae gene set (19,507 genes), the C. elegans WS77* gene set (18,808 genes), and the C. elegans hybrid gene set (20,621 genes) all contain approximately the same number of genes. The recent WormBase release WS103 (June 2003; approximately 19,600 curated genes) also has a similar number. We next set about examining these sets in more detail. The unspliced lengths of genes are roughly the same in the two species (C. briggsae median, 1.9 kbp; C. elegans WS77*, 1.9 kbp; Table 1), and the total length of the C. briggsae genome occupied by the 19,507 genes, including their introns, is 56 Mbp (54% of the 102 Mbp assembly)—approximately the same fraction of the C. elegans genome occupied by the WS77* gene set. Thus, the larger size of the C. briggsae genome is not due to an increase in the number or size of protein-coding genes. The C. elegans gene sets have slightly more introns than the C. briggsae hybrid set. Some of this difference might be due to the hand-curation of the WS77 gene set, since curation may add exons that were missed by gene prediction software. However, as shown in the C. briggsae/C. elegans Orthologs section below, a portion of the intron differences can be confirmed as exhibiting true evolutionary changes. We examined codon usage for both the predicted gene sets as a whole and for C. briggsae/C. elegans ortholog pairs (described below) and found no substantial differences between the protein-coding gene codon usage in the two species. We then used the EMBOSS tool codcmp to test for a significant difference in codon usage. The GC content for the two species differed only slightly for the genome as a whole (35.4% for C. elegans versus 37.4% for C. briggsae) and for protein coding exons (42.7% versus 44.1%). Domain and Gene Ontology Analysis We used Pfam analysis to search the C. briggsae gene set for functional domains and other known sequence motifs and to assign InterPro annotation to each such feature found (Zdobnov and Apweiler 2001). These InterPro annotations were translated into Gene Ontology (GO) functional descriptions, and the descriptions were grouped into broader categories according to molecular function (13 categories) and biological process (9 categories) (“GOslim”; http://www.ebi.ac.uk/proteome; Gene Ontology Consortium 2001). We did not classify proteins by intracellular compartment due to the small number of genes in these categories for the C. elegans dataset. Of 19,507 predicted proteins in the C. briggsae dataset, 10,606 (54.4%) had one or more Pfam annotations, 9,829 were associated with one or more InterPro terms, and 6,526 of these could be assigned one or more GO terms. This annotation process touched 1,443 InterPro terms and 706 GO terms. The results of these classifications are available as Dataset S2. For comparison, we performed an identical analysis on the C. elegans proteins in the C. elegans hybrid gene set. Of the 20,621 proteins in this set (counting only the longest protein encoded by each alternatively spliced gene), 11,116 (53.9%) had Pfam annotations, 10,460 had InterPro annotations, and 6,696 of these could be assigned GO terms, touching 1,436 InterPro terms and 699 GO terms. As expected, the distribution of classifications is roughly the same in both species (Table 2), with 75%–76% of classifiable proteins associated with metabolic processes, 15%–16% associated with transport, and 14%–15% associated with cell communication. The molecular function classification showed the majority of proteins classified as having ligand-binding or carrier functions (63%–66%), followed by enzymatic activity (48%) and nucleic acid binding (28%–30%). There is a 4% difference in the proportion of proteins classified as involved in signal transduction in C. elegans relative to C. briggsae. As will be discussed in a later section, this difference is largely due to the difference in the number of predicted chemosensory receptor proteins in the two genomes. The significance of other small differences, such as the approximately 1% increase in C. elegans in the proportion of proteins involved in cell motility, is not known. C. briggsae/C. elegans Orthologs We searched for orthologs between the 19,507 C. briggsae genes and the 18,808 C. elegans WS77* genes. Although it is possible for a gene in one species to have multiple orthologs in another species if the gene has duplicated since the two species diverged, for this analysis we used the simpler definition of a pair of genes that have a common ancestor and are in a one-to-one correspondence between the two species. We found orthologs by searching for C. briggsae/C. elegans gene pairs that were each other's top BLASTP (Altschul et al. 1997) match in the opposite species. We identified 11,255 such gene pairs. We then used conserved gene order (synteny) between the two species to identify nonreciprocal best matches that were supported by the positions of flanking orthologs. This procedure netted an additional 900 orthologs. The final set of 12,155 orthologs corresponds to 62% of the C. briggsae gene set and 65% of the C. elegans WS77* gene set. To assess the accuracy of this ortholog definition, we compared the results obtained by this procedure to those obtained by building phylogenetic trees of the chemosensory receptor sra protein subfamily (see the Protein Families section below). Phylogenetic tree building identified eight ortholog pairs in this set, seven of which were called identically by the mutual-best BLASTP match procedure. The mutual-best BLASTP match procedure had one false negative and one false positive involving a recent C. briggsae gene amplification. As will be seen, the chemosensory receptor family represents the worst-case scenario of a family that is undergoing rapid evolutionary change. This provides conservative estimates of false positive and false negative rates for the mutual-best BLASTP match procedure of roughly 15%. The median percent identity between orthologs at the protein level is 80% (mean, 75%; SD, 18%), which is similar to the level of divergence between mouse/human orthologs (median identity, 78.5%; Waterston et al. 2002). The ortholog pairs are very similar in terms of exon length (median in both species, 0.15 kbp), coding length per gene (median, 1.14 kbp in C. elegans versus 1.11 kbp in C. briggsae), and gene length (median, 2.29 kbp in C. elegans versus 2.19 kbp in C. briggsae). However, orthologs are longer than the overall set of predicted genes (median length, 1.90 kbp in C. elegans), which suggests that the nonorthologous gene set includes a population of truncated or split genes. To pursue the earlier observation that C. elegans has more introns than C. briggsae, we searched for cases in which a C. elegans gene has an intron absent from its C. briggsae ortholog and vice versa. To do this, we aligned orthologous proteins and searched for cases where a single exon in one species aligned to two adjacent exons in the other species. We found 6,579 species-specific introns among the 60,775 introns in the ortholog pairs: 4,379 C. elegans-specific introns and 2,200 C. briggsae-specific introns. This approximately 2-fold ratio agrees with that reported by Kent and Zahler (2000) using a smaller dataset. Intron gains or losses have occurred at a rate of at least 0.5 per gene in the 80–110 million years (MY) since C. elegans and C. briggsae diverged (see the Estimating the C. briggsae/C. elegans Divergence Date section below). This is similar to the arthropod rate of approximately one intron gain or loss per gene per 125 MY since Drosophila and Anopheles diverged (Zdobnov et al. 2002). In contrast, in mouse and human there have been fewer than 0.01 losses or gains per gene in 75 MY (Roy et al. 2003). Since the average number of introns per gene is quite different among these species, this means that 9% of Caenorhabditis introns are species-specific, in contrast to 50% of Anopheles/Drosophila introns and only 0.05% of mouse/human introns. Thus, intron–exon structure has apparently evolved more rapidly in nematodes and arthropods than in chordates. The list of ortholog pairs can be found as Dataset S3. Rate of Neutral Evolution and Estimates of Selective Pressure Using the set of C. briggsae/C. elegans ortholog pairs, we calculated the rates of nonsynonymous (KA) and synonymous (KS) amino acid substitutions, using a maximum likelihood (ML) algorithm that corrects for reversion events (Yang 1997) and removing pairs where accurate estimates of KA and KS were impossible. Orthologous genes identified by mutual-best BLASTP hits had an average KS of 1.78 (SD, 0.62) synonymous substitutions per synonymous site and a KA of 0.11 (SD 0.09) nonsynonymous substitutions per nonsynonymous site, while orthologous gene pairs identified by the combination of mutual-best BLASTP hits and colinearity had average KS and KA of 1.73 (SD, 0.68) and 0.12 (SD, 0.131), respectively. The corrected KS rate is almost three times as high as that reported between mouse and human (KS = 0.6; Waterston et al. 2002), despite the fact that the apparent C. briggsae/C. briggsae divergence date is only 5–45 MY before the Mus musculus/Homo sapiens divergence date (see next section). However, the reported KA/KS ratio for mouse/human (0.115) is similar to the ratio that we see in C. briggsae/C. elegans (approximately 0.06), arguing that the levels of purifying selection are similar. The KA/KS ratio, a measure of selective pressure, is expected to be 1.0 for genes that are under no selective pressure (for example, pseudogenes), less than 1.0 for genes under purifying selection, and greater than 1.0 for genes under positive selection. As expected, we found that nearly all the genes in ortholog pairs are under purifying selection (Figure 2). However, the extent of this purifying selection is more marked in genes with essential functions. For example, ortholog pairs which exhibit an embryonic lethal phenotype in systematic RNA inhibition (RNAi) screens of the C. elegans ortholog partner (Maeda et al. 2001; Piano and Gunsalus 2002; Kamath et al. 2003) show a markedly lower KA/KS ratio than do pairs for which a wild-type phenotype was observed (KA/KS, 0.0445 ± 0.0340 versus 0.0627 ± 0.0494; p 1.0, we do not find evidence for positive selection in these genes, but given the high value of substitution between the two species, genes with a ratio greater than 1.0 may no longer be recognizable as orthologs. Perhaps genes under positive selection might be found among members of gene families or in genes that are no longer recognizably similar at the primary sequence level. Sequence analysis of species of intermediate evolutionary distance, if available, would be revealing. Also, since our tests for positive selection are very conservative, we would not detect any sites under positive selection if they are small in proportion to those of the gene under purifying selection (Yang et al. 2000). Estimating the C. briggsae/C. elegans Divergence Date Using the divergence of the nematodes from the arthropods 800–1,000 MYA (Blaxter 1998) to calibrate the molecular clock, we estimated the C. briggsae/C. elegans divergence date from 338 sets of orthologs. Each set comprised a C. elegans gene and its one-to-one orthologs from C. briggsae, A. gambiae, and H. sapiens. When the nematode/arthropod divergence is taken to be 800 MYA, a 95% confidence interval for the median C. briggsae/C. elegans speciation date is 78–90 MYA. If the nematode/arthropod divergence is taken to be 1,000 MYA, the interval becomes 97–113 MYA. Our best estimate of the C. briggsae/C. elegans speciation date is therefore approximately 80–110 MYA. This confidence interval is tighter than a previous estimate of 50–120 MYA made using 92 sets of orthologs from the then available C. briggsae genome (Coghlan and Wolfe 2002). The current estimate is probably more accurate due to both a larger sample size and improved C. briggsae gene predictions and ortholog assignments. Interestingly, recent studies date the mouse/human divergence to 65–75 MYA (Waterston et al. 2002), so the date of the C. briggsae/C. elegans divergence was between 5 and 45 MY before the rodent/primate divergence. C. briggsae/C. elegans Paralogs and Orphans In this section, we look in more detail at those C. briggsae proteins that could not be assigned to C. elegans orthologs. Roughly, a third of the C. elegans and C. briggsae proteins fall into this category. Of these, 4,545 (23%) C. elegans (WS77*) genes and 5,211 (28%) C. briggsae proteins have multiple BLASTP matches in the opposite species. These correspond to paralogous relationships within gene families and are examined in greater detail in the next section (Gene Families). The remaining 2,108 (11%) C. elegans and 2,141 (11%) C. briggsae genes do not have any BLASTP hit of E-value 1020 greater (less significant) than the best hit. In this way we identified 1,914 C. elegans/human orthologs and 2,498 C. elegans/A. gambiae orthologs, while 11,255 C. briggsae/C. elegans one-to-one orthologs were found by identifying mutual-best BLASTP hits as described above. For 1,397 C. elegans genes, we had a C. briggsae, a human, and a mosquito ortholog. For each of the 1,397 quartets, we aligned the four proteins using ClustalW (Thompson et al. 1994) and made a guide-tree using protdist and neighbor from the PHYLIP package (Felsenstein 1989). For each ortholog set, the alignment and guide-tree were used as input for Gu and Zhang's (1997) program GAMMA, which estimated an α parameter for the Ã distribution used to correct for rate variation among amino acid sites. For 148 trees, GAMMA could not estimate the α parameter. For the remaining 1,249 trees, we used the two-cluster test (Takezaki et al. 1995) to check for unequal rates between lineages, taking human to be the outgroup to Anopheles and Caenorhabditis (Aguinaldo et al. 1997); 338 trees passed the test at the 5% significance level. For each of these 338 trees, the branch lengths were re-estimated under the assumption of rate constancy, using Takezaki and Nei's (1995) program with the Ã correction for multiple hits. We calibrated the linearized trees by taking the nematode/arthropod divergence to be 800–1,000 MYA (Blaxter 1998). Gene family clustering Gene families were identified utilizing the TRIBE-MCL method (Van Dongen 2000; Enright et al. 2002). Briefly, TRIBE-MCL identifies gene families using a Markov Clustering (MCL) procedure operating on a matrix of expectation values computed from a similarity search of all versus all of C. briggsae and C. elegans protein sequences. We used the Smith–Waterman algorithm as implemented in SSEARCH (Smith and Waterman 1981; Pearson 1991) to achieve a greater measure of sensitivity than BLASTP. The MCL clustering was executed with an inflation value of 1.6, chosen to minimize the number of orphaned or mispaired putative orthologs without greatly expanding the total number of clusters. Where more than one splice variant existed for a gene, only the longest transcript was chosen as a representative for the gene. TRIBE cluster annotation was created by merging the information from the InterPro (Zdobnov and Apweiler 2001) and Pfam 9.0 (Bateman et al. 2002) domains that occurred in at least two of the genes in a family. The full output from the TRIBE-MCL clustering and the Pfam domains for each cluster are available as Dataset S6. To develop a phylogenetic tree of the sra chemosensory receptor protein family, we combined chemosensory receptor genes belonging to the same Pfam subfamily from both species into a joint file that was then aligned using ClustalW (Thompson et al. 1994). The output file was fed into the PHYLIP package (programs used include: seqboot, protdist, neighbor, and consense; Felsenstein 1989) to generate a tree file. TreeView, part of the PHYLIP package, was used to visualize the family relationships. To identify phylogenetic-tree-based ortholog pairs of the sra protein family, we chose C. briggsae/C. elegans protein pairs that were each other's closest neighbors in greater than 900 of 1,000 bootstrap repetitions. Physical clustering of genes in a family along the chromosomes was tested by a permutation test counting the number of genes in a sliding window of 15 genes that are members of the same TRIBE family cluster. The average per window in each species was compared to the averages for 1,000 simulated genomes of randomly ordered genes. The maximum observed value for the shuffled genomes was always found to be smaller than the observed genome state, indicating that physical clustering of gene families is significantly nonrandom. Prediction of ncRNA genes We predicted tRNA genes using tRNAscan-SE version 123 (Lowe and Eddy 1997) in eukaryotic mode with default parameters and a threshold of 20 bits. To predict rRNA and miRNA genes, we extracted sequences from the sequence databases and searched for homologous sequences in the C. briggsae genome using BLASTN (Altschul et al. 1997). True matches were defined as those having greater than 95% identity over greater than 95% of the query length. Fragmentary matches were defined as all other hits with p value of <0.001. We predicted all other ncRNA genes using the Rfam 3.0 library of covariance models (Griffiths-Jones et al. 2003) and the INFERNAL software suite (Eddy 2002) with Rfam family-specific score thresholds. Repeat families We used RepeatMasker (A. F. A. Smit and P. Green, unpublished data) to identify presumptive repetitive elements in C. briggsae. Two different repeat library files were used to run RepeatMasker. The first library file, elegans.lib, was obtained from RepBase (Jurka 2000). RepBase includes all previously identified C. elegans repeat elements, as well as repeat elements from a variety of vertebrate and invertebrate species. When elegans.lib failed to identify the expected number of repeats in C. briggsae, we built a second repeat library file using RECON (Bao and Eddy 2002). For the initial all versus all pairwise comparison of genomic sequences, the result of which serves as input to RECON, we used WUBLAST (W. R. Gish, unpublished data; http://blast.wustl.edu/) with options “−kap E=0.00001 wordmask=dust wordmask=seg maskextra=20 −hspmax 5000 M=5 N=−11 Q=22 R=11.” All parameters of RECON were left at their default values. For RECON-defined families with more than ten copies, a consensus sequence for each was constructed by aligning the ten longest members of the family with DIALIGN2 (Morgenstern 1999), with its default options, then selecting a simple majority rule consensus residue for each column in the multialignment. For C. briggsae, 723 consensuses were built. A total of 554 consensus sequences were built for C. elegans following the same protocol. RECON has a known artifact in which it identifies conserved protein family domain and multicopy noncoding genes such as tRNAs and rRNAs in addition to finding bona fide repeat family elements. We removed these cases from the C. elegans and C. briggsae RECON library files by applying a series of filters. Before filtering, C. elegans and C. briggsae RECON library files contained 554 and 723 entries, respectively. To remove known ncRNA species, we ran INFERNAL (Eddy 2002) together with Rfam (Griffiths-Jones et al. 2003). For C. briggsae, this step removed 22 tRNAs, two histone H3 genes, and one each of U1, U2, U5, and U6. For C. elegans, this step removed 20 tRNAs, three rRNAs, one histone H3 gene, and one each of U1, U2, U5, and U6. Prior to removing gene family domains from the RECON libraries, it was necessary to remove transposons and other repetitive gene calls from the gene prediction sets. We used elegans.lib to identify “trusted repeats” in the RECON libraries, finding 188 trusted repeats in the RECON library for C. elegans and 72 trusted repeats in the RECON repeat library for C. briggsae. We then used BLAST (Altschul et al. 1997) to identify and remove trusted repeat sequences from the DNA sequences of the C. elegans and C. briggsae gene sets. A gene prediction was taken to be a transposable element if it had a TBLASTN hit of E-value <10−30 in the RepBase C. elegans library (version 8.2; Jurka 2000) or if its individual exons all had BLASTN hits of E-value <10−10 in the RECON repeat library for that species. This step removed 303 genes from C. elegans WS77 gene set, 434 genes from C. elegans hybrid gene set, and 627 genes from C. briggsae gene set. We next used TBLASTN to match the library of RECON-identified repeat elements against the filtered C. elegans and C. briggsae hybrid gene sets as well as SwissProt, and we removed repeat families that matched proteins with an E-value <10−10. A total of 181 and 382 entries were removed from the C. elegans and C. briggsae RECON libraries, respectively. Finally, we manually examined those repeat entries with significant protein hits and added an additional eight entries to the trusted repeat library for C. briggsae. Manual examination did not yield more trusted repeats for C. elegans. The final filtered RECON libraries contained 473 C. briggsae and 377 C. elegans entries. The repeat family data can be found as Dataset S7. Nucleotide-level sequence alignments We used the WABA algorithm (Kent and Zahler 2002) to align the C. briggsae WGS assembly to the complete C. elegans genomic sequence. All WGS supercontigs were “cut” into 100 kbp pieces and individually aligned to the six C. elegans chromosomes. The coordinates of the regions of alignment were then transformed back into supercontig coordinates for further analysis. To characterize raw WABA aligning segments, we partitioned the C. elegans genome into eight compartments corresponding to CDS, introns, 5′ and 3′ UTRs, upstream regions, downstream regions, and repeat regions. The remainder of the genome was considered to be intergenic. For consistency, upstream regions were considered to be 1,000 bp upstream of the translational start site, and downstream regions were considered to be 1,000 bp downstream of the translational stop, regardless of whether or not the UTRs of the gene were known. In the case of a region that could be scored in two or more ways, such as a region that is within the 1,000 bp downstream window of one gene and upstream of another, the region was assigned to one partition on the basis of left-to-right priority in the list above. For example, CDSs have priority over an intron. A WABA segment was scored as belonging to a partition if it shared at least one base overlap with a region in that partition. For the purposes of comparison, we also repeated the whole-genome alignment with the BLASTZ algorithm (Schwarz et al. 2003). The results were highly comparable, but the coverage was slightly lower with BLASTZ (56% coverage with BLASTZ versus 65% coverage with WABA, not adjusted for overlaps). Synteny block construction To identify putative syntenic regions from the raw WABA alignment blocks, we merged adjacent “strong,” “weak,” and “coding” blocks into a smaller number of contiguous blocks. This reduced the 1.3 million raw WABA blocks to 104,097 contiguous blocks. From this set, we discarded blocks in which more than five C. elegans segments aligned to a segment of C. briggsae or vice versa. The remaining alignments were merged using the “simple merge” algorithm, which searches for and merges uninterrupted series of alignments that are monotonically increasing in both the C. elegans and C. briggsae genomes. This procedure yielded 26,231 merged blocks of alignment. We then performed an additional round of merging using a dynamic programming algorithm to identify the longest monotonically increasing set of alignment blocks. Unlike the simple merge, this algorithm can jump over interrupted regions of colinearity. For each C. briggsae supercontig, the algorithm first finds the longest series of blocks then finds the next longest series using those left over from the first iteration. This continues until no blocks remain. During the process, the identification of monotonically increasing blocks is constrained so that no gap in the series can be greater than 100 kbp in either genome and so that no single gap can cause a relative expansion of greater than 5-fold in either the C. elegans or C. briggsae coordinates. This step yielded 13,467 merged blocks. Examination of the distribution of synteny block lengths showed a large asymmetric peak at 1,250 bp and a long tail of longer block lengths (data not shown). Most of these blocks involved a single unmerged WABA alignment and correlated poorly with the positions of previously identified orthologs. To exclude these small blocks from further analysis, we filtered the blocks for a lower size limit of 1,850 bp, which reduced the coverage of the C. elegans genome by 1.5% but excluded 64% of the merged blocks, leaving a final candidate list of 4,837 synteny blocks. Each block was classified as an inversion, transposition, or reciprocal translocation by examining the breakpoint junctions a/b and c/d in C. briggsae: A block was classified as an inversion if a was adjacent to c and b was adjacent to d from the perspective of C. elegans coordinates. A block was classified as a transposition if a and d were adjacent in C. elegans and a reciprocal event could not be identified. A block was classified as involving one or two reciprocal translocations if another breakpoint could be found such that a was adjacent to f or e was adjacent to d in C. elegans. Updating the C. elegans gene set We used TBLASTN searches (Altschul et al. 1997) of the C. briggsae genome to see how many of the gene model changes suggested in the C. elegans hybrid gene set, compared to the WS77 gene set, were supported by C. briggsae similarity. We only considered new hybrid exons and extensions and truncations or deletions of existing WS77 exons where the extended or truncated region was greater than or equal to five amino acids (15 bp) long. If the C. elegans hybrid gene set predicted a new exon in an existing WS77 gene, we considered the new exon to be well-supported if the exon had a TBLASTN hit of E-value <10−3 in the C. briggsae genome that covered greater than or equal to 10 amino acids of the new exon. If the C. elegans hybrid gene set predicted an extension of an existing WS77 exon, we considered the extension to be well-supported if the extended exon had a TBLASTN hit of <10−3 in the C. briggsae genome which covered greater than or equal to 10 amino acids of the extended part. In contrast, if the C. elegans hybrid gene set predicted that an existing exon in a WS77 gene should be deleted, we considered the exon deletion to be well supported if the WS77 exon did not have any TBLASTN hit of E-value <0.1 that covered greater than or equal to five amino acids of the exon. Likewise, if the C. elegans hybrid gene set predicted that an existing exon in a WS77 gene should be truncated, we considered the exon truncation to be well-supported if the WS77 exon did not have any TBLASTN hit of E-value <0.1 that covered greater than ot equal to five amino acids of the truncated part. Supporting Information Dataset S1 Full Set of Supporting Information (64 MB GZ). Click here for additional data file. Dataset S2 Gene Prediction Directory This directory comprises cb.hybrid.gff, a file of C. briggsae gene predictions using the “hybrid” method in GFF format (http://www.sanger.ac.uk/Software/formats/GFF/); cb.fa, file that contains conceptual translations of C. briggsae gene predictions in FASTA format; ws77.hybrid.gff, a file that contains C. elegans gene predictions using the “hybrid” method in GFF format; ws77.hybrid.fa, a file that contains C. elegans hybrid genes' conceptual translations in FASTA format; ws77.gff, a file that contains the canonical C. elegans gene predictions from Wormbase version WS77* (after filtering for transposase-containing genes and other artefacts); ws77.fa, a file that contains C. elegans WS77* genes' conceptual translations in FASTA format; gene2ip2go.briggsae, a file that contains InterPro annotations of predicted C. briggsae gene products, plus GO terms when available; and gene2ip2go.elegans, a file that contains InterPro annotations of predicted C. elegans gene products from WS77*, plus GO terms when available. (20 MB ZIP). Click here for additional data file. Dataset S3 Orthologs and Orphans Directory This directory comprises orthologs.txt, a file that contains the list of C. briggsae/C. elegans ortholog pairs; orthologs-kaks.txt, a file that contains KA/KS values for ortholog pairs, as well as other information used to determine regional and functional variation in KA/KS; and cb_orphans and ce_orphans, files that contain lists of putative “orphan” genes that are found in one species but not in another. (433 KB ZIP). Click here for additional data file. Dataset S4 Alignment Directory This directory comprises cb.waba_state.gff, a file that contains WABA alignments between C. briggsae and C. elegans; and cb.waba.gff, a file that contains WABA alignments in which transitions between adjacent WABA block types (“strong,” “weak” and “coding” ) have been merged. (41.8 MB ZIP). Click here for additional data file. Dataset S5 Synteny Directory This directory comprises synteny_blocks.txt, a file that contains 4,837 synteny blocks identified between C. elegans and C. briggsae; and sorted_ultracontigs.txt, a file that contains 382 C. briggsae ultracontigs that have been ordered and oriented on the C. elegans genome relative to their largest synteny block. (120 KB TXT). Click here for additional data file. Dataset S6 Gene Family Directory This directory comprises tribe_cluster.txt, a file that contains TRIBE-MCL co-clusters of C. briggsae and C. elegans predicted proteins, with the clusters sorted with the largest first; tribe_cluster_classify_short.txt, a file that contains the Pfam and InterPro classification of co-clusters; and tribe_cluster_classify_long.txt, which is similar to the previous file, except that all InterPro/Pfam annotations are shown. (360 KB ZIP). Click here for additional data file. Dataset S7 Repeats This directory comprises Cb_repeat.lib, a FASTA-format file of RECON-predicted repeat elements from C. briggsae after removal of protein families; Ce_repeat.lib, a FASTA-format file of C. elegans predicted repeat elements; Cb_RECON_repeat.gff, a GFF-format file indicating the positions of predicted repeats in the C. briggsae genome; and Ce_RECON_repeat.gff, a GFF-format file indicating the positions of predicted repeats in the C. elegans genome. (1.9 MB TXT). Click here for additional data file. Poster S1 The Genome Sequence of Caenorhabditis briggsae: A Platform for Comparative Genomics The nematodes C. briggsae and C. elegans diverged 80–110 MYA, near to the time of divergence of human from mouse, but are almost indistinguishable morphologically. These figures demonstrate the high degree of morphological similarity between C. briggsae and C. elegans and show how the patterns of evolutionary conservation between the two species vary across the five autosomes (I–V) and sex chromosome (X) of C. elegans. (50.6 MB PDF). Click here for additional data file. Accession Numbers The C. briggsae genomic sequence has been submitted to the GenBank WGS division as the 578 entries accessioned CAAC01000001 through CAAC01000578. The genes discussed in this paper can be found in the WormBase database: C06G3.7 (http://wormbase.org/db/gene/gene?name=C06G3.7); CBG05747 (http://wormbase.org/db/gene/gene?name=CBG05747); mir-35 (http://wormbase.org/db/gene/gene?name=mir-35); mir-41 (http://wormbase.org/db/gene/gene?name=mir-41); and snt-1 (http://wormbase.org/db/gene/gene?name=snt-1).

0 comments Cited 303 times – based on 0 reviews      Review now

Bookmark

All references

Author and article information

Journal

Journal ID (nlm-ta): Nucleic Acids Res

Journal ID (iso-abbrev): Nucleic Acids Res

Journal ID (publisher-id): nar

Journal ID (hwp): nar

Title: Nucleic Acids Research

Publisher: Oxford University Press

ISSN (Print): 0305-1048

ISSN (Electronic): 1362-4962

Publication date (Print): January 2009

Publication date (Electronic): 28 November 2008

Publication date PMC-release: 28 November 2008

Volume: 37

Issue: 1

Pages: 289-297

Affiliations

¹UC Davis Genome Center, University of California Davis, Davis, CA, USA and ²The Wellcome Trust Sanger Institute, Genome Campus, Hinxton, CB10 1SA, UK

Author notes

*To whom correspondence should be addressed. Tel: +1 530 754 4989; Email: ifkorf@ 123456ucdavis.edu

The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.

Article

Publisher ID: gkn916

DOI: 10.1093/nar/gkn916

PMC ID: 2615622

PubMed ID: 19042974

SO-VID: 0da010a1-9606-47ee-946f-d2c142a2bbf5

License:

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License ( http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

History

Date received : 17 July 2008

Date revision received : 28 October 2008

Date accepted : 30 October 2008

Comments

Comment on this article

scite_

Cited by 238

See all cited by

Most referenced authors 2,113

See all reference authors

Assessing the gene space in draft genomes

Read this article at

Abstract

Related collections

Genes & Diseases

Most cited references 12

Profile hidden Markov models.

Genome sequence of the nematode C. elegans: a platform for investigating biology.

The Genome Sequence of Caenorhabditis briggsae: A Platform for Comparative Genomics

Author and article information

Journal

Affiliations

Author notes

Article

History

Categories

Comments

Comment on this article

Similar content 362

Cited by 238

Most referenced authors 2,113