      Is Open Access

      Deep Sequencing-Based Analysis of the Cymbidium ensifolium Floral Transcriptome

      research-article


          Abstract

          Cymbidium ensifolium is a Chinese Cymbidium with an elegant shape, beautiful appearance, and a fragrant aroma. C. ensifolium has a long history of cultivation in China and has excellent commercial value as a potted plant and cut flower. The development of C. ensifolium genomic resources has been delayed because of its large genome size. Taking advantage of technical and cost improvements in RNA-Seq, we extracted total mRNA from flower buds and mature flowers and obtained a total of 9.52 Gb of filtered nucleotides comprising 98,819,349 filtered reads. The filtered reads were assembled into 101,423 isotigs, representing 51,696 genes. Of the 101,423 isotigs, 41,873 were putative homologs of annotated sequences in the public databases, of which 158 were associated with floral development and 119 were associated with flowering. The isotigs were categorized according to their putative functions. In total, 10,212 of the isotigs were assigned to 25 eukaryotic orthologous groups (KOGs), 41,690 to 58 gene ontology (GO) terms, 9,830 to 126 Arabidopsis Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways, and 9,539 to 123 rice pathways. Comparison of the isotigs with those of the two related orchid species P. equestris and C. sinense showed that 17,906 isotigs are unique to C. ensifolium. In addition, a total of 7,936 SSRs and 16,676 putative SNPs were identified. To our knowledge, this transcriptome database is the first major genomic resource for C. ensifolium and the most comprehensive transcriptomic resource for the genus Cymbidium. These sequences provide valuable information for understanding the molecular mechanisms of floral development and flowering. Sequences predicted to be unique to C. ensifolium should provide more insight into C. ensifolium gene diversity. The numerous SNPs and SSRs identified in the present study will contribute to marker development for C. ensifolium.


          Most cited references (55)


          Genic microsatellite markers in plants: features and applications.

          Expressed sequence tag (EST) projects have generated a vast amount of publicly available sequence data from plant species; these data can be mined for simple sequence repeats (SSRs). These SSRs are useful as molecular markers because their development is inexpensive, they represent transcribed genes and a putative function can often be deduced by a homology search. Because they are derived from transcripts, they are useful for assaying the functional diversity in natural populations or germplasm collections. These markers are valuable because of their higher level of transferability to related species, and they can often be used as anchor markers for comparative mapping and evolutionary studies. They have been developed and mapped in several crop species and could prove useful for marker-assisted selection, especially when the markers reside in the genes responsible for a phenotypic trait. Applications and potential uses of EST-SSRs in plant genetics and breeding are discussed.
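          As an illustration of how transcript sequences can be mined for SSRs, the following is a minimal sketch that finds perfect tandem repeats with a regular expression. It is not the method used in any of the studies listed here; the function name find_ssrs and the minimum repeat counts per motif length are assumptions chosen for the example.

```python
import re

def find_ssrs(seq, min_repeats=None):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs in seq."""
    # Hypothetical thresholds: minimum number of repeats for each motif length.
    if min_repeats is None:
        min_repeats = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}
    seq = seq.upper()
    hits = []
    for motif_len, min_rep in sorted(min_repeats.items()):
        # A motif of motif_len bases followed by at least (min_rep - 1) exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (motif_len, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. "AA" or "ATAT"),
            # so each repeat is reported once under its shortest motif.
            if any(motif == motif[:k] * (motif_len // k)
                   for k in range(1, motif_len) if motif_len % k == 0):
                continue
            hits.append((m.start(), motif, len(m.group(0)) // motif_len))
    return sorted(hits)

# Toy sequence (made up): an (AT)7 repeat and an (AGC)5 repeat.
toy = "GGC" + "AT" * 7 + "CCG" + "AGC" * 5 + "T"
print(find_ssrs(toy))
```

          Real pipelines typically also handle imperfect and compound repeats and record flanking sequence for primer design, which this sketch does not attempt.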

            De novo assembly and analysis of RNA-seq data.

            We describe Trans-ABySS, a de novo short-read transcriptome assembly and analysis pipeline that addresses variation in local read densities by assembling read substrings with varying stringencies and then merging the resulting contigs before analysis. Analyzing 7.4 gigabases of 50-base-pair paired-end Illumina reads from an adult mouse liver poly(A) RNA library, we identified known, new and alternative structures in expressed transcripts, and achieved high sensitivity and specificity relative to reference-based assembly methods.

              A comprehensive evolutionary classification of proteins encoded in complete eukaryotic genomes

              Background

              Comparative analysis of genomes from distant species provides new insights into gene functions, genome evolution and phylogeny. In particular, the comparative genomics of prokaryotes has revealed previously underappreciated major trends in genome evolution, namely, extensive lineage-specific gene loss and horizontal gene transfer (HGT) [1-7]. To efficiently extract functional and evolutionary information from multiple genomes, rational classification of genes based on homologous relationships is indispensable. The two principal classes of homologs are orthologs and paralogs [8-11]. Orthologs are defined as homologous genes that evolved via vertical descent from a single ancestral gene in the last common ancestor of the compared species. Paralogs are homologous genes, which, at some stage of evolution, have evolved by duplication of an ancestral gene. Orthology and paralogy are intimately linked because, if a duplication (or a series of duplications) occurs after the speciation event that separated the compared species, orthology becomes a relationship between sets of paralogs, rather than individual genes (in which case, such genes are called co-orthologs). Correct identification of orthologs and paralogs is of central importance for both the functional and evolutionary aspects of comparative genomics [12,13]. Orthologs typically occupy the same functional niche in different organisms; in contrast, paralogs evolve to functional diversification as they diverge after the duplication [14-16]. Therefore, robustness of genome annotation depends on accurate identification of orthologs. A clear demarcation of orthologs and paralogs is also required for constructing evolutionary scenarios, which include, along with vertical inheritance, lineage-specific gene loss and HGT [5,7].
In principle, orthologs, including co-orthologs, should be identified by means of phylogenetic analysis of entire families of homologous proteins, which is expected to define orthologous protein sets as clades [17-19]. However, for genome-wide protein sets, such analysis remains extremely labor-intensive, and error-prone as well. Accordingly, procedures have been developed for identifying sets of likely orthologs without explicit referral to phylogenetic analysis. These procedures are based on the notion of a genome-specific best hit (BeT), that is, the protein from a target genome that is most similar (typically in terms of similarity scores computed using BLAST or another sequence-comparison method) to a given protein from the query genome [20,21]. The assumption central to this approach is that orthologs have a greater similarity to each other than to any other protein from the respective genomes. When multiple genomes are analyzed, pairs of probable orthologs detected on the basis of BeTs are combined into orthologous clusters represented in all or a subset of the analyzed genomes [20,22]. This approach, amended with additional procedures for detecting co-orthologous protein sets and for treating multidomain proteins, was implemented in the database of Clusters of Orthologous Groups (COGs) of proteins [20,23,24]. The current COG set includes approximately 70% of the proteins encoded in 69 genomes of prokaryotes and unicellular eukaryotes [25]. The COGs have been used for functional annotation of new genomes [26-29], target selection in structural genomics [30-32], identification of potential drug targets [33,34] and genome-wide evolutionary studies [4,13,35-38]. Sonnhammer and co-workers independently developed a similar methodology for identification of co-orthologous protein sets from pairwise genome comparisons and applied it to the sequenced eukaryotic genomes [39]. 
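The best-hit idea described above can be sketched in a few lines. This is a simplified illustration, not the COG construction procedure itself: it computes genome-specific best hits from a table of pairwise similarity scores and keeps only reciprocal pairs. The function names, the protein and genome labels, and the scores are all hypothetical; merging pairs into multi-genome clusters and the handling of co-orthologs and multidomain proteins are not shown.

```python
def genome_specific_best_hits(scores, genome_of):
    """Map (query protein, target genome) to the best-scoring target protein.

    scores: {(query, target): similarity score}, e.g. from a sequence comparison.
    genome_of: {protein: genome label}.
    """
    best = {}
    for (query, target), score in scores.items():
        if genome_of[query] == genome_of[target]:
            continue  # hits within the same genome are not BeTs
        key = (query, genome_of[target])
        if key not in best or score > best[key][1]:
            best[key] = (target, score)
    return {key: hit for key, (hit, _) in best.items()}

def reciprocal_best_hits(scores, genome_of):
    """Return pairs of proteins that are each other's genome-specific best hit."""
    bets = genome_specific_best_hits(scores, genome_of)
    pairs = set()
    for (query, _), target in bets.items():
        if bets.get((target, genome_of[query])) == query:
            pairs.add(frozenset((query, target)))
    return pairs

# Toy data (made up): two proteins in each of two genomes, A and B.
genome_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
scores = {
    ("a1", "b1"): 200, ("a1", "b2"): 50,
    ("a2", "b2"): 120, ("a2", "b1"): 30,
    ("b1", "a1"): 190, ("b1", "a2"): 40,
    ("b2", "a2"): 110, ("b2", "a1"): 60,
}
print(reciprocal_best_hits(scores, genome_of))
```

Requiring reciprocity is one common way to reduce false pairings when a true ortholog has been lost from one genome; it is used here only as an assumption for the example.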
A central notion introduced in the context of the COG analysis is that of a phyletic pattern, that is, the pattern of representation (presence-absence) of analyzed species in each COG [13,20]. Similar concepts have been independently developed and applied by others [40,41]. The COGs show a remarkable scatter of phyletic patterns, with only a small minority represented in all sequenced genomes. A recent quantitative study showed that parsimonious evolutionary scenarios for most COGs involve multiple events of gene loss and HGT [7]. Both similarity and complementarity among the phyletic patterns of COGs, in conjunction with other information, such as conservation of gene order, have been successfully employed to predict gene functions [13,42,43]. The comparison of phyletic pattern has been formalized in set-theoretical algorithms and systematically applied to the computational and experimental analysis of bacterial flagellar systems, which demonstrated the considerable robustness of this approach [44]. We recently extended the system of orthologous protein clusters to complex, multicellular eukaryotes [25]. Here, we examine the phyletic patterns of KOGs in connection with known and predicted protein functions. In-depth analysis of some of these KOGs resulted in prediction of previously uncharacterized, but apparently essential, conserved eukaryotic protein functions. We also reconstruct the parsimonious scenario of evolution of the crown-group eukaryotes by assigning the loss of genes (KOGs) and emergence of new genes to the branches of the phylogenetic tree and explicitly delineate the minimal gene sets for various ancestral forms. To our knowledge, this is the first systematic, genome-wide examination of the sets of orthologous genes in eukaryotes. 
Results and discussion

KOGs for seven sequenced eukaryotic genomes: functional and evolutionary implications of phyletic patterns

Eukaryotic KOGs were constructed on the basis of the comparison of proteins encoded in the genomes of three animals (Homo sapiens [45], the fruit fly Drosophila melanogaster [46] and the nematode Caenorhabditis elegans [47]), the green plant Arabidopsis thaliana (thale cress) [48], two fungi (budding yeast Saccharomyces cerevisiae [49] and fission yeast Schizosaccharomyces pombe [50]) and the microsporidian Encephalitozoon cuniculi [51]. The procedure for KOG construction was a modification of the one previously used for COGs [20,24] and is described in greater detail elsewhere ([25]; see also Materials and methods). An important difference stems from the fact that complex eukaryotes encode many more multidomain proteins than prokaryotes and, furthermore, orthologous eukaryotic proteins often differ in domain composition, with additional domains accrued in more complex forms [3,45]. Accordingly, and unlike the original COG construction procedure, probable orthologs with different domain architectures were assigned to one KOG and were not split if they shared a common core of domains. In addition to the KOGs, which consisted of at least three species, clusters of putative orthologs from two species (TWOGs) and lineage-specific expansions (LSEs) of paralogs from each of the analyzed genomes were identified ([25,52]; see also Materials and methods). In most of the analyses discussed below, KOGs and TWOGs are treated together, unless otherwise specified. Figure 1 shows the assignment of the proteins from each of the analyzed eukaryotes to KOGs with different numbers of species, TWOGs and LSEs. The fraction of proteins assigned to KOGs tends to decrease with the increasing genome size, from 81% for S. pombe to 51% for the largest, the human genome.
(For reasons that remain unclear, but might be related to its intracellular parasitic lifestyle, E. cuniculi has a relatively small fraction of conserved proteins that belonged to KOGs: approximately 60%.) The contribution of LSEs shows the opposite trend, being the greatest in the largest genomes, that is, human and Arabidopsis, and minimal in the microsporidian (Figure 1). A notable difference was observed between eukaryotes in terms of their representation in KOGs found in different numbers of species. While the three unicellular organisms are represented mainly in the highly conserved seven- or six-species KOGs, a much larger fraction of the gene set in animals and Arabidopsis is accounted for by LSEs, and by KOGs found in three or four genomes. These include animal-specific genes and genes that are shared by plants and animals but not by fungi and the microsporidian (Figure 1). The large number of KOGs in the latter group (700 KOGs represented in Arabidopsis and at least two animal species) is notable and probably results from massive, lineage-specific loss of genes during eukaryotic evolution (see below). The phyletic patterns of KOGs reveal both the existence of a conserved eukaryotic gene core and substantial diversity. The 'pan-eukaryotic' genes, which are represented in each of the seven analyzed genomes, account for around 20% of the KOGs, and approximately the same number of KOGs include all species except for the microsporidian, an intracellular parasite with a highly degraded genome [51]. Among the remaining KOGs, a large group includes representatives of the three analyzed animal species (worm, fly and humans) but a substantial fraction (approximately 30%) are KOGs with unexpected patterns, for example, one animal, one plant and one fungal species (see [53] and examples in Table 1). 
During the manual curation of the KOG set, the KOGs with unexpected patterns were scrutinized in an effort to detect potential highly diverged members from one or more of the analyzed genomes. Some of these unexpected patterns might indicate that a gene is still missing in the analyzed set of protein sequences from one or more of the species included; reports of newly discovered genes have appeared since the release of the initial reports on genome sequences of complex eukaryotes, for example, as a result of massive sequencing of human cDNAs [54], exhaustive annotation of the Drosophila genome [55] and comparative analysis of closely related yeast genomes [56]. The unexpected phyletic patterns seem, however, largely to reflect the extensive, lineage-specific gene loss that is characteristic of eukaryotic evolution [57]; on many occasions, this scenario is supported by the presence of orthologs in other eukaryotic lineages and/or in prokaryotes (Table 1). However, interesting exceptions to the multiple loss explanation might exist as exemplified by the ATP/ADP-translocase, which is present in Arabidopsis and Encephalitozoon and could have evolved via independent HGT from intracellular bacterial parasites ([58] and Table 2). Common phyletic patterns of genes that otherwise were not suspected to be functionally linked might suggest the existence of such connections and prompt additional analysis leading to concrete functional predictions [42,59-61]. The pair of KOG5324 and KOG4246 is a case in point that has not been described previously. The initial observation that these KOGs share the same unusual pattern of presence-absence in eukaryotes, and have similar phyletic patterns in prokaryotes, with a ubiquitous presence in archaea, prompted a more detailed examination of the multiple alignments of the respective proteins and the conservation of the (predicted) operon organization in archaea and bacteria (Table 2 and data not shown). 
The combination of clues from these analyses suggests that the two proteins interact in a still uncharacterized pathway of RNA processing, which also includes RNA 3'-phosphate cyclase (KOG3980) [62] and cytosine-C5-methylase (NOL1/NOP2 in eukaryotes; KOG1122). The proteins in KOG3833 and KOG4528 are likely to represent novel enzyme families, possibly a kinase-phosphatase pair (E.V.K. and L. Aravind, unpublished data). Notably, these predicted new enzymes are present in animals and E. cuniculi but not in Arabidopsis or yeasts. In contrast, KOG3980 is present in all analyzed eukaryotic genomes except for Arabidopsis, whereas KOG1122 is pan-eukaryotic. These differences in the phyletic patterns of the components of the predicted pathway are concordant with the patterns in eukaryotes in that. Figure 2 shows the distribution of known and predicted functions of eukaryotic proteins among 20 functional categories for the entire set of KOGs and, separately, for KOGs represented in six or seven species and the animal-specific KOGs. Compared to the functional breakdown of prokaryotic COGs [25], the prevalence of signal transduction is notable among eukaryotes. This feature is particularly prominent in animal-specific KOGs, whereas the highly conserved set is comparatively enriched in proteins that are involved in translation, transcription, chaperone-like functions, cell cycle control and chromatin dynamics (Figure 2). The large number of KOGs for which only general functional prediction was feasible, and those whose functions remain unknown, even among the subset that is represented in six or seven eukaryotic species, emphasizes that our current understanding of eukaryotic biology is seriously lacking, even with respect to the functions of highly conserved genes. The distribution of KOGs by the number of paralogs in each genome is shown in Figure 3.
The preponderance of lineage-specific duplication of conserved genes, that is, intra-KOG LSEs, in multicellular eukaryotes is obvious. Cases when a single gene in yeast or, particularly, Encephalitozoon, has two or more co-orthologs in animals and/or plants are most common in KOGs, whereas the reverse situation is rare. These observations support the notion of the major contribution of LSE to the evolution of eukaryotic complexity [52]. However, 131 KOGs are represented by a single ortholog in all genomes compared (Table 2) and a substantial number of KOGs have one member from a majority of the genomes (data not shown). Recent theoretical modeling of the evolution of paralogous families has suggested that, in general, ancient protein families tend to have multiple paralogs [5,63]. Therefore, whenever a KOG has a single member in all or most species, this should be attributed to selection against duplication of this particular gene. A prominent cause of such selection could be the involvement of the respective gene products in essential multisubunit complexes, such that imbalance between subunits leads to deleterious effects [64].

Known and new functions of single-member, pan-eukaryotic KOGs

We examined in greater detail the 131 KOGs that are represented by a single gene in each of the seven genomes (Table 2). As can be envisaged from their presence in diverse eukaryotic taxa, including the 'minimal' genome of Encephalitozoon, and as shown by comparison with the knockout phenotype data (Table 2 and see below), these pan-eukaryotic KOGs are of particular biological importance. For the great majority of these KOGs (113 of the 131), the function has been experimentally determined or confidently predicted to a varying degree of detail using computational methods (Table 2).
However, around 20 KOGs from this set remained uncharacterized at the time of this analysis and, for all but two of these, substantial functional inferences could be drawn through a combination of sequence-profile analysis, structure prediction and genomic-context analysis of prokaryotic homologs (Table 2). Some of these predicted new functions are variations on well-known themes, such as two predicted PP-loop ATPases, which are probably involved in novel, essential RNA modifications (KOGs 2522 and 2316) or two predicted E3 components of ubiquitin ligases (KOGs 0396 and 3800). Other predicted functions appear to be completely new, such as proteins in KOG3176 and 3303 which are likely to be essential components of eukaryotic replication and/or repair systems. Each of these uncharacterized but ubiquitous and largely essential eukaryotic genes is an attractive target for experimental studies. Examination of the experimentally characterized and predicted functions of pan-eukaryotic, single-member KOGs leads to interesting conclusions. Nearly all the functionally characterized KOGs in this set consist of proteins that are subunits of known multiprotein complexes (Table 2). The most prominent of these are the complexes involved in rRNA processing and ribosome assembly, such as the recently discovered rRNA processosome and the pre-40S subunit, as well as the spliceosome, and various complexes involved in transcription (Table 2). Accordingly, this set of KOGs is markedly enriched for proteins involved in various forms of RNA processing, assembly of ribonucleoprotein (RNP) particles and transcription. In addition, KOGs in the single-member pan-eukaryotic set include subunits of molecular complexes that are not directly related to RNA processing, such as the proteasome, the TCP-1 chaperonin complex [65] and the TRAPP complex involved in protein trafficking [66]. 
Altogether, more than 80% of the yeast proteins in the pan-eukaryotic, single-member KOGs belong to known macromolecular complexes included in the MIPS database [67], as compared to around 64% for all yeast proteins in the KOGs, which is a moderate but statistically highly significant excess (data not shown). This preponderance of multiprotein complex formation among the single-member pan-eukaryotic KOGs is fully compatible with the balance hypothesis [64]. The most unexpected observation regarding the single-member, pan-eukaryotic KOGs, is probably that in 14 of these proteins, the only detectable domain was the WD40 repeat (Table 2). This is particularly notable because WD40-repeat proteins, which are extremely abundant in eukaryotes and are present in several prokaryotic lineages as well [68], are not generally known to form well-defined, one-to-one orthologous relationships. The WD40 proteins in the pan-eukaryotic KOGs listed in Table 2 are exceptions, which is probably due to their unique and essential roles in the assembly of RNA-processing complexes. It has recently been demonstrated that, in S. cerevisiae, seven of these proteins are subunits of the 18S rRNA processosome, or at least are involved in ribosomal assembly [69,70]. Taking these results together with the unusual phyletic pattern, it seems possible to predict with considerable confidence that those WD40 proteins in the 131-KOG set that remain uncharacterized belong to the same or similar RNA-processing complexes (Table 2). With some notable exceptions, such as the WD40 proteins, the KOGs in the single-member, pan-eukaryotic set show remarkable patterns of evolutionary conservation: they are either (nearly) ubiquitous in the three kingdoms of life, for example, RNA polymerase subunits, or are universally conserved in eukaryotes and archaea but missing in bacteria, such as most of the proteins implicated in RNA processing (Table 2). 
Thus, it appears that elaborate molecular machines central to the functioning of the eukaryotic cell have evolved, largely from ancestral archaeo-eukaryotic components, at the onset of eukaryotic evolution, and both loss and duplication of the respective genes have been strongly selected against throughout the rest of eukaryotic evolution.

Variation of evolutionary rates among KOGs

Genome-wide analysis of protein evolutionary rates shows a broad range of variation [71]. Here, we investigate the variation of evolutionary rates among the ubiquitous KOGs represented in all seven analyzed genomes and the connection between the evolutionary rate and protein function in the KOG set. The characteristic evolutionary rate of each KOG, which included a member(s) from Arabidopsis, was determined by measuring the mean evolutionary distance from Arabidopsis (the outgroup in the phylogenetic tree; see below) to the other species. Even among the KOGs that include all seven species and, accordingly, appear to represent the conserved core of eukaryotic genes, the evolutionary rates differ by a factor of 20 between the fastest- and the slowest-evolving KOGs. Excluding 5% of the KOGs from each tail of the distribution still leaves almost a fourfold difference in evolutionary rates (Figure 4a). We then compared the distributions of evolutionary rates for different functional categories of KOGs (Tables 3,4 and Figure 4b). Although all the distributions substantially overlapped, there was a statistically highly significant difference between the evolutionary rates for proteins with different functions (Tables 3,4 and Figure 4b). The slowest-evolving proteins are those involved in translation and RNA processing, the fastest-evolving ones are involved in cellular trafficking and transport, whereas components of replication and transcription systems have intermediate evolutionary rates (Tables 3,4 and Figure 4b).
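The two summary quantities used above, a per-KOG rate taken as the mean distance from the outgroup and the fold difference between the fastest and slowest KOGs with or without trimming the tails, can be sketched as follows. The function names and all numbers are assumptions for illustration; how the evolutionary distances themselves were estimated is not specified in this passage and is not modeled here.

```python
from statistics import mean

def kog_rate(distance_from_outgroup):
    """Characteristic rate of one KOG: mean distance from the outgroup to the other species.

    distance_from_outgroup: {species: evolutionary distance from the outgroup member}.
    """
    return mean(distance_from_outgroup.values())

def fold_range(rates, trim_fraction=0.0):
    """Ratio of the largest to the smallest rate after trimming each tail.

    trim_fraction=0.05 drops 5% of the values from each end of the sorted list.
    """
    ordered = sorted(rates)
    k = int(len(ordered) * trim_fraction)
    kept = ordered[k:len(ordered) - k]
    return kept[-1] / kept[0]

# Toy numbers (made up), not data from the study.
print(kog_rate({"Hsa": 0.5, "Dme": 0.7, "Sce": 0.9}))
print(fold_range([1, 2, 4, 8]))          # full range
print(fold_range([1, 2, 4, 8], 0.25))    # one value trimmed from each tail
```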
A parsimonious scenario of gene loss and emergence in eukaryotic evolution and reconstruction of ancestral eukaryotic gene sets

Assuming a particular species tree topology, methods of evolutionary parsimony analysis can be used to construct a parsimonious scenario of evolution, that is, mapping of different types of evolutionary events onto the branches of the tree. With prokaryotes, the problem is confounded by the major contributions from both lineage-specific gene loss and HGT to genome evolution, with the relative likelihoods of these events remaining uncertain [5,7]. The possibility of substantial HGT between major lineages of eukaryotes can apparently be safely disregarded, providing for an unambiguous most parsimonious scenario that includes only gene loss and emergence of new genes as elementary events. Some crucial aspects of the phylogenetic tree of the eukaryotic crown group remain a matter of contention. The consensus of many phylogenetic analyses appears to point to an animal-fungal clade and clustering of microsporidia with the fungi. However, a major uncertainty remains with respect to the topology of the animal tree: the majority of studies on protein phylogenies support a coelomate (chordate-arthropod) clade [72-74], whereas rRNA phylogeny and some protein family trees point to the so-called ecdysozoan (arthropod-nematode) clade [75-78]. We treated the phyletic pattern of each KOG as a string of binary characters (1 for the presence of the given species and 0 for its absence in the given KOG) and constructed the parsimonious scenarios of gene loss and emergence during evolution of the eukaryotic crown group for both the coelomate and the ecdysozoan topologies of the phylogenetic tree. For the purpose of this reconstruction, the Dollo parsimony approach was adopted [79]. Under this approach, gene loss is considered irreversible; thus, a gene (a KOG member) can be lost independently in several evolutionary lineages but cannot be regained.
This assumption is justified by the implausibility of HGT between eukaryotes (the Dollo approach is not valid for reconstruction of prokaryotic ancestors). In the resulting parsimonious scenarios, each branch was associated with both gene loss and emergence of new genes, with the exception of the plant branch and the branch leading to the common ancestor of fungi and animals, to which gene losses could not be assigned with the current set of genomes (Figure 5a,b). There is little doubt that, once genomes of early-branching eukaryotes are included, gene loss associated with these branches will become apparent. The principal features of the reconstructed scenarios include massive gene loss in the fungal clade, with additional elimination of numerous genes in the microsporidian; emergence of a large set of new genes at the onset of the animal clade; and subsequent substantial gene loss in each of the animal lineages, particularly in the nematodes and arthropods (Figure 5a,b). The estimated number of genes lost in S. cerevisiae after its divergence from the common ancestor with the other yeast species, S. pombe, closely agreed with a previous estimate produced by a different approach [57]. The switch from the coelomate topology of the animal sub-tree to the ecdysozoan topology resulted in relatively small changes in the distribution of gains and losses: the most notable difference was the greater number of genes lost in the nematode lineage and the smaller number of genes lost in the insect lineage under the ecdysozoan scenario compared to the coelomate scenario (Figure 5a,b). The parsimony analysis described above involves explicit reconstruction of the gene sets of ancestral eukaryotic genomes. Under the Dollo parsimony model, which was used for this analysis, an ancestral gene (KOG) set is the union of the KOGs that are shared by the respective outgroup and each of the remaining species. 
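The Dollo reconstruction described above can be illustrated with a small sketch. Under the Dollo assumption a gene emerges once, at the last common ancestor of all species in which it is present, and every maximal clade below that ancestor that lacks the gene counts as one independent loss. The code below is an illustration under stated assumptions, not the authors' implementation: the species abbreviations, their order in the binary string, the nested-tuple encoding of the coelomate topology and the function names are all hypothetical.

```python
# Hypothetical species order for the binary phyletic pattern string.
SPECIES = ("Ath", "Sce", "Spo", "Ecu", "Cel", "Dme", "Hsa")

# Assumed encoding of the coelomate topology: plant outgroup, a fungal-microsporidian
# clade, and animals with fly and human grouped together.
COELOMATE = ("Ath", ((("Sce", "Spo"), "Ecu"), ("Cel", ("Dme", "Hsa"))))

def pattern_from_bits(bits, species=SPECIES):
    """Turn a string such as '1000011' into the set of species with the gene."""
    return {name for name, bit in zip(species, bits) if bit == "1"}

def leaves(tree):
    """Set of species names under a node (a leaf is a plain string)."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))

def dollo_events(tree, present):
    """Return (clade where the gene emerged, list of clades where it was lost)."""
    present = set(present)
    if not present:
        raise ValueError("phyletic pattern must contain at least one species")
    # Descend to the last common ancestor of all species that carry the gene.
    node = tree
    while not isinstance(node, str):
        holders = [child for child in node if leaves(child) & present]
        if len(holders) != 1:
            break
        node = holders[0]
    losses = []

    def collect(n):
        if not (leaves(n) & present):
            losses.append(frozenset(leaves(n)))  # whole clade lacks the gene: one loss
        elif not isinstance(n, str):
            for child in n:
                collect(child)

    collect(node)
    return frozenset(leaves(node)), losses

# A gene present in the plant, fly and human only.
gain, losses = dollo_events(COELOMATE, pattern_from_bits("1000011"))
print(sorted(gain), [sorted(clade) for clade in losses])
```

Swapping in a different nested tuple for the ecdysozoan topology would change which losses are merged into a single event, which is the kind of difference between the two scenarios discussed above.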
Thus, the gene set for the common ancestor of the crown group includes all the KOGs in which Arabidopsis co-occurs with any of the other analyzed species. Similarly, the reconstructed gene set for the common ancestor of fungi and animals consists of all KOGs in which at least one fungal species co-occurs with at least one animal species. These are conservative reconstructions of ancestral gene sets because, as already indicated, gene losses in the lineages branching off the deepest bifurcation could not be detected. Under this conservative approach, 3,413 genes (KOGs) were assigned to the last common ancestor of the crown group (Figure 5a,b). More realistically, it appears likely that a certain number of ancestral genes have been lost in all, or all but one, of the analyzed lineages during subsequent evolution, such that the gene set of the eukaryotic crown group ancestor might have been close in size to those of modern yeasts. In terms of the functional composition, the reconstructed core gene set of the crown-group ancestor resembled more the highly conserved KOGs than the animal-specific KOGs (Figure 3) in being enriched in housekeeping functions such as translation, transcription and RNA processing (data not shown). The functional profiles of the gene sets that were lost in different lineages showed substantial differences (Table 5). Thus, for example, in the lineage leading to the common ancestor of the animals, the greatest loss among genes assigned to functional categories was seen in amino acid and coenzyme metabolism; in contrast, in the fly and the nematode, more substantial degradation was observed among transcription factors and proteins with chaperone-like functions. Genes for proteins involved in RNA processing and translation are, in general, not heavily affected by loss except in the highly degraded parasite E. cuniculi. 
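The two co-occurrence rules stated above can be written down directly. The sketch below is a literal reading of those rules under stated assumptions: the KOG identifiers and phyletic patterns are invented, the species abbreviations are hypothetical, and placing the microsporidian in the fungal group is an assumption of this example rather than something the text specifies.

```python
def co_occurrence_set(kogs, group_a, group_b):
    """KOGs containing at least one species from group_a and at least one from group_b.

    kogs: {KOG identifier: set of species present in that KOG}.
    """
    group_a, group_b = set(group_a), set(group_b)
    return {kog for kog, species in kogs.items()
            if species & group_a and species & group_b}

# Hypothetical groupings and toy phyletic patterns (not data from the study).
PLANT = {"Ath"}
FUNGI = {"Sce", "Spo", "Ecu"}   # assumption: microsporidian grouped with the fungi
ANIMALS = {"Cel", "Dme", "Hsa"}

toy_kogs = {
    "kogA": {"Ath", "Hsa"},          # plant plus one animal
    "kogB": {"Sce", "Dme"},          # one fungus plus one animal
    "kogC": {"Cel", "Dme", "Hsa"},   # animal-specific
}

crown_ancestor = co_occurrence_set(toy_kogs, PLANT, FUNGI | ANIMALS)
fungi_animal_ancestor = co_occurrence_set(toy_kogs, FUNGI, ANIMALS)
print(sorted(crown_ancestor), sorted(fungi_animal_ancestor))
```

As the text notes, sets built this way are conservative: a gene lost in all but one of the lineages on either side of the bifurcation cannot be assigned to the ancestor.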
On many occasions, the switch from the coelomate to the ecdysozoan topology replaces two independent, parallel losses in the insect and nematode clades with a single loss at the base of the ecdysozoan branch, although, on the whole, trees based on gene content support the coelomate topology [74]. In particular, the ecdysozoan topology, unlike the coelomate topology, implies early loss of several genes involved in translation, transcription and repair (Table 6). Notably, a large fraction of genes lost in each lineage has only a general functional prediction or no prediction at all (Table 5). This emphasizes the paucity of our current understanding of lineage-specific gene sets. As noticed previously during the analysis of the genes lost in S. cerevisiae after its divergence from the common ancestor with S. pombe, functionally connected genes tend to be co-eliminated during evolution [57]. The present study generalizes this conclusion as many functionally coherent groups of co-eliminated KOGs become apparent (Table 5). Importantly, different branches of the same complex systems tend to be eliminated in parallel in different lineages, for example, largely non-overlapping sets of genes for proteins of the ubiquitin-proteasome-signalosome systems are lost in the fungal-microsporidial lineage and in the nematodes (Table 6). It seems likely that elimination of these genes reflects independent trends for simplification of regulatory processes in these lineages. An interesting trend seen in these data is the deterioration of the mitochondrial ribosome, which occurred in several eukaryotic lineages and appears to have been partly parallel (as it occurred independently in fungi-microsporidia and in animals) and partly consecutive: early loss in the ancestral animal line was followed by elimination of additional genes for ribosomal proteins in individual lineages (Table 6). C. elegans has one of the shortest mitochondrial rRNAs and might have a 'minimal' mitochondrial ribosome [80]; the present analysis details the stages leading to this ultimate degradation of the mitochondrial ribosome. An exhaustive analysis of the patterns of gene loss is beyond the scope of this work. It seems clear that such analysis has the potential to improve our understanding of eukaryotic evolution and functional predictions through examination of co-eliminated gene groups.

Evolutionary relationships between eukaryotic and prokaryotic orthologous gene sets

The prokaryotic COGs and eukaryotic KOGs were identified in separate genome comparisons, although an overlap existed because both sets included the unicellular eukaryotes, namely two yeasts and the microsporidian. To identify the prokaryotic counterparts of the KOGs, the sequences of the eukaryotic proteins included in the KOGs were compared using the RPS-BLAST program to the position-specific scoring matrices (PSSMs) constructed for all prokaryotic COGs ([81]; see Materials and methods for details). The results were checked manually and also by comparing the assignment of proteins from unicellular eukaryotes to each of the orthologous gene sets. Altogether, probable orthologous relationships were established between 2,456 eukaryotic KOGs and TWOGs (44% of the total) and 1,516 prokaryotic COGs. A more detailed breakdown of the relationships between eukaryotic and prokaryotic orthologous gene clusters could reveal important evolutionary trends. Figure 6a compares the occurrence of prokaryotic counterparts for the entire set of eukaryotic KOGs and its subsets conserved at different levels. Clearly, the reconstructed gene set of the common ancestor of the crown group and, particularly, the pan-eukaryotic KOGs are enriched in ancient KOGs (those with prokaryotic counterparts) as compared to the full KOG collection.
In contrast, among KOGs that are inferred to have evolved in individual lineages within the crown group, a significantly lower fraction has detectable prokaryotic counterparts (Figure 6a). Early evolution of eukaryotes is known to have involved duplication of ancient genes inherited from prokaryotes [82], and this was apparent in the KOGs against COGs comparison. Although one-to-one relationships were predominant, in around 30% of cases, two or more eukaryotic KOGs corresponded to the same prokaryotic COG (Figure 6b). This indicates extensive duplication of ancestral genes at early stages of eukaryotic evolution; moreover, a substantial fraction of these genes have undergone repeated duplications, resulting in a one-to-many relationship between prokaryotic and eukaryotic orthologs (Figure 6b). An in-depth analysis of the relationships between eukaryotic and prokaryotic orthologous gene clusters should include an attempt to decipher their evolutionary history, that is, classification of the C/KOGs represented both in eukaryotes and prokaryotes into: those that have been inherited from the last universal common ancestor; the archaeo-eukaryotic subset; and those that are shared because of HGT between bacteria and eukaryotes at various stages of eukaryotic evolution. This analysis is beyond the scope of the present work. Perhaps the principal message to stress here is that, using a fairly sensitive sequence comparison method, prokaryotic homologs could be detected for only some 44% of the eukaryotic KOGs, and this fraction increased to around 54% for those genes that could be traced to the last common ancestor of the crown group (Figure 6a). 
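The distinction drawn above between one-to-one and many-to-one relationships amounts to grouping KOGs by their prokaryotic counterpart and counting group sizes. The following is a minimal sketch under stated assumptions: the input is a hypothetical mapping from each KOG to a single COG, the identifiers are invented, and the fraction is taken over COGs, which may not be the denominator used for Figure 6b.

```python
from collections import defaultdict
from typing import Dict, Set


def group_by_cog(kog_to_cog: Dict[str, str]) -> Dict[str, Set[str]]:
    """Invert a KOG -> COG mapping into COG -> set of KOGs."""
    members: Dict[str, Set[str]] = defaultdict(set)
    for kog, cog in kog_to_cog.items():
        members[cog].add(kog)
    return dict(members)


def summarize(kog_to_cog: Dict[str, str]) -> Dict[str, float]:
    """Count COGs matched by exactly one KOG versus by two or more KOGs."""
    members = group_by_cog(kog_to_cog)
    one_to_one = sum(1 for kogs in members.values() if len(kogs) == 1)
    many_to_one = len(members) - one_to_one
    fraction = many_to_one / len(members) if members else 0.0
    return {
        "cogs": len(members),
        "one_to_one": one_to_one,
        "many_to_one": many_to_one,
        "fraction_many_to_one": fraction,
    }


# Toy mapping with hypothetical identifiers: three KOGs share COG_1.
toy_mapping = {
    "KOG_1": "COG_1",
    "KOG_2": "COG_1",
    "KOG_3": "COG_2",
    "KOG_4": "COG_3",
    "KOG_5": "COG_1",
}
summary = summarize(toy_mapping)
print(summary)
```

A COG matched by several KOGs is the signature of duplication after the divergence from prokaryotes; the size of each group in `group_by_cog` gives a rough count of how many times the ancestral gene was duplicated and retained.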
This observation emphasizes the major amount of innovation that accompanied the emergence and early evolution of eukaryotes; even those KOGs for which prokaryotic counterparts will be eventually identified through more sensitive sequence and structure comparison apparently experienced rapid evolution during the prokaryote-eukaryote transition.

Phyletic patterns of KOGs and dispensability of yeast and worm genes

There are 860 KOGs with at least one representative from each of the seven analyzed genomes. In accord with the 'knockout rate' hypothesis [83], which has been largely supported by recent, genome-wide analysis of gene conservation [38,84], it could be expected that these highly conserved genes were essential for the survival of eukaryotic organisms. This appears particularly plausible given the near-minimal eukaryotic gene complement of the microsporidian. The prediction was put to the test using the recently published functional profile of the yeast S. cerevisiae genome, which includes the data on the growth rates of homozygous deletion strains for 96% of the open reading frames (ORFs) in the yeast genome [85]. Growth rates have been previously interpreted as a measure of fitness [84]. When the phyletic patterns of the KOGs were superimposed on the data on gene dispensability (with essential genes operationally defined as those whose deletion had a lethal effect in a rich medium) [85], it was found that 45% of the essential genes were conserved in all seven species and 25% were represented in six species (typically with the exception of E. cuniculi); 15% of the essential yeast genes had no orthologs in the other analyzed genomes (Figure 7a). In striking contrast, among non-essential genes, only 16.5% were represented in all compared genomes and 28.5% had no detectable orthologs (Figure 7a).
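Superimposing phyletic patterns on dispensability data, as described above, is a cross-tabulation of each gene's essentiality against the number of genomes in which its KOG is represented. The sketch below shows one way to compute such a profile; the input format and the toy gene list are hypothetical and do not reproduce the published counts.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Gene = Tuple[bool, int]  # (is_essential, number of genomes with an ortholog, including yeast)


def conservation_profile(genes: Iterable[Gene], n_total: int = 7) -> Dict[bool, Dict[int, float]]:
    """For essential and non-essential genes, return the fraction found in 1..n_total genomes."""
    counts = {True: Counter(), False: Counter()}
    for essential, n_species in genes:
        counts[essential][n_species] += 1
    profile: Dict[bool, Dict[int, float]] = {}
    for essential, counter in counts.items():
        total = sum(counter.values())
        profile[essential] = (
            {n: counter[n] / total for n in range(1, n_total + 1)} if total else {}
        )
    return profile


def essential_fraction_among_conserved(genes: Iterable[Gene], n_total: int = 7) -> float:
    """Reciprocal view: fraction of essential genes among those present in all genomes."""
    flags: List[bool] = [essential for essential, n in genes if n == n_total]
    return sum(flags) / len(flags) if flags else 0.0


# Toy data (hypothetical): four essential and four non-essential genes.
toy_genes = [
    (True, 7), (True, 7), (True, 6), (True, 1),
    (False, 7), (False, 3), (False, 1), (False, 1),
]
profile = conservation_profile(toy_genes)
print(profile[True][7], profile[False][7])
print(essential_fraction_among_conserved(toy_genes))
```

In this representation a value of 1 for the species count means the gene has no orthologs in the other analyzed genomes, and a value of 7 corresponds to the pan-eukaryotic class.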
The reciprocal comparison is equally illustrative: essential genes composed 18.5% of the entire set of yeast genes but 35% of the genes (KOGs) represented in all seven species. This translates into a statistically highly significant dependence between a gene's (in)dispensability and conservation over long evolutionary distances. The probability of the set of highly conserved genes being so enriched for essential genes as a result of chance was estimated at 0.5) were discarded. As the divergence times for all KOGs are presumed to be the same (and equal to the time elapsed since the last common ancestor for the eukaryotic crown group), the mean evolutionary distance in a KOG is a measure of the KOG's evolutionary rate.

The parsimonious evolutionary scenario, which included gene losses and emergence of KOGs mapped to the branches of the eukaryotic phylogenetic tree, was constructed by using the DOLLOP program of the PHYLIP package [97]; this program is based on the Dollo parsimony method, which assumes irreversibility of character loss [79].

For the analysis of domain accretion, conserved domains from the NCBI CDD database were detected in the eukaryotic proteins that belonged to the KOGs by using the RPS-BLAST program [81] with an E-value cut-off of 0.001. Domains with biased amino acid sequence composition, which tend to produce a high false-positive rate in RPS-BLAST searches, were excluded from the analysis.

The eukaryotic KOG set is accessible at [98] and via ftp at [99]. The reconstructed ancestral gene sets are available at [100].
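The Dollo principle mentioned above (a gene arises once and, once lost, is not regained) can be illustrated on a fixed tree. The sketch below is not the DOLLOP program and does not search tree space; it assumes a given topology, places the single origin at the last common ancestor of all species carrying the gene, and records a loss for every maximal subtree below that node in which the gene is absent. The species codes and both topologies are assumptions made for illustration.

```python
from typing import List, Optional, Set, Tuple, Union

Tree = Union[str, tuple]  # a leaf is a species code; an internal node is a tuple of subtrees

# Assumed topologies; only the placement of the nematode (Cel) and fly (Dme) differs.
FUNGI = (("Sce", "Spo"), "Ecu")
COELOMATE = ("Ath", (FUNGI, ("Cel", ("Dme", "Hsa"))))
ECDYSOZOAN = ("Ath", (FUNGI, (("Cel", "Dme"), "Hsa")))


def leaves(tree: Tree) -> Set[str]:
    """Return the set of species codes under a node."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def dollo(tree: Tree, present: Set[str]) -> Tuple[Optional[Tree], List[Tree]]:
    """Return (node where the gene originates, list of subtrees where it is lost)."""
    present = set(present) & leaves(tree)
    if not present:
        return None, []
    # Descend to the last common ancestor of all species that carry the gene.
    node = tree
    while not isinstance(node, str):
        holders = [child for child in node if leaves(child) & present]
        if len(holders) != 1:
            break
        node = holders[0]
    # Below the origin, every maximal subtree without the gene counts as one loss.
    losses: List[Tree] = []

    def walk(n: Tree) -> None:
        if not (leaves(n) & present):
            losses.append(n)
        elif not isinstance(n, str):
            for child in n:
                walk(child)

    walk(node)
    return node, losses


# A hypothetical gene absent from the fly, the nematode and the microsporidian.
pattern = {"Ath", "Sce", "Spo", "Hsa"}
_, coelomate_losses = dollo(COELOMATE, pattern)
_, ecdysozoan_losses = dollo(ECDYSOZOAN, pattern)
print(len(coelomate_losses), len(ecdysozoan_losses))
```

For this pattern the coelomate tree requires separate losses in the fly and the nematode, whereas the ecdysozoan tree needs a single loss on their shared branch, which is the effect of switching topologies described in the text.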

                Author and article information

                Contributors
                Role: Editor
                Journal
                PLoS One
                Public Library of Science (San Francisco, USA)
                ISSN: 1932-6203
                31 December 2013
                Volume 8, Issue 12, e85480
                Affiliations
                [1] Institute of Horticulture, Zhejiang Academy of Agricultural Sciences, Hangzhou, People's Republic of China
                [2] Department of Gastroenterology, The First Affiliated Hospital, College of Medicine, Zhejiang University, Hangzhou, People's Republic of China
                [3] College of Life Sciences, Hubei University, Wuhan, People's Republic of China
                National Rice Research Center, United States of America
                Author notes

                Competing Interests: The authors have declared that no competing interests exist.

                Conceived and designed the experiments: XL JL TY. Performed the experiments: XL LX JL. Analyzed the data: XL JL. Contributed reagents/materials/analysis tools: XL LX JL. Wrote the manuscript: XL LX FJ JL CS MX DQ.

                Article
                PONE-D-13-26332
                10.1371/journal.pone.0085480
                3877369
                6081a26e-a473-4d06-88f8-6ff28e94e267
                Copyright © 2013

                This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

                History
                : 25 June 2013
                : 27 November 2013
                Funding
                This research was supported by the National Basic Research Program funded by the Nature Science Foundation of China (No. 31201648), the Postdoctoral Science Foundation of China (No. 2012M521203), the Special Postdoctoral Science Foundation of China (No. 2013T60607), the Foundation for Selected Postdoctoral project of Zhejiang (Bsh1201032), the Qianjiang talents project (No. 2013R10081), the Scientific and technical innovation promotion project of ZAAS (2012R05Y01E04), and the New Variety of Flowers Breeding Group Project of Zhejiang Province (No. 2012C12909-10). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
                Categories
                Research Article
