      Is Open Access

      Systematic Analysis and Comparison of Nucleotide-Binding Site Disease Resistance Genes in a Diploid Cotton Gossypium raimondii

      research-article


          Abstract

          Plant disease resistance genes are a key component of plant defense against a range of pathogens. The majority of these resistance genes belong to a superfamily whose members harbor a nucleotide-binding site (NBS). A number of studies have focused on NBS-encoding genes in disease-resistance breeding programs for diverse plants. However, little information has been reported with an emphasis on systematic analysis and comparison of NBS-encoding genes in cotton. To fill this knowledge gap, we identified and investigated the NBS-encoding resistance genes in cotton using the whole genome sequence information of Gossypium raimondii. In total, 355 NBS-encoding resistance genes were identified. Analyses of the conserved motifs and structural diversity showed that the two most distinctive features of these genes are the high proportion of non-regular NBS genes and the high diversity of N-terminal domains. Analyses of the physical locations and duplications of NBS-encoding genes showed that gene duplication of disease resistance genes could play an important role in cotton by leading to an increase in the functional diversity of the cotton NBS-encoding genes. Analyses of phylogenetic comparisons indicated that, in cotton, the NBS-encoding genes with a TIR domain not only have their own evolution pattern, different from that of genes without a TIR domain, but also have their own species-specific pattern that differs from those of TIR genes in other plants. Analyses of the correlation between disease resistance QTL and NBS-encoding resistance genes showed that more than half of the disease resistance QTL could be associated with the NBS-encoding genes in cotton, which agrees with previous studies establishing that more than half of plant resistance genes are NBS-encoding genes.


          Most cited references (47)


          SMART, a simple modular architecture research tool: identification of signaling domains.

          Accurate multiple alignments of 86 domains that occur in signaling proteins have been constructed and used to provide a Web-based tool (SMART: simple modular architecture research tool) that allows rapid identification and annotation of signaling domain sequences. The majority of signaling proteins are multidomain in character with a considerable variety of domain combinations known. Comparison with established databases showed that 25% of our domain set could not be deduced from SwissProt and 41% could not be annotated by Pfam. SMART is able to determine the modular architectures of single sequences or genomes; application to the entire yeast genome revealed that at least 6.7% of its genes contain one or more signaling domains, approximately 350 more than previously annotated. The process of constructing SMART predicted (i) novel domain homologues in unexpected locations such as band 4.1-homologous domains in focal adhesion kinases; (ii) previously unknown domain families, including a citron-homology domain; (iii) putative functions of domain families after identification of additional family members, for example, a ubiquitin-binding role for ubiquitin-associated domains (UBA); (iv) cellular roles for proteins, such as predicted DEATH domains in netrin receptors further implicating these molecules in axonal guidance; (v) signaling domains in known disease genes such as SPRY domains in both marenostrin/pyrin and Midline 1; (vi) domains in unexpected phylogenetic contexts such as diacylglycerol kinase homologues in yeast and bacteria; and (vii) likely protein misclassifications exemplified by a predicted pleckstrin homology domain in a Candida albicans protein, previously described as an integrin.

            The Genomes of Oryza sativa: A History of Duplications

            Introduction

            The importance of the rice genome is reflected in the fact that rice was the first crop plant to have its genome sequenced; astonishingly, it was sequenced by four independent research teams at Beijing Institute of Genomics [1], Syngenta [2], International Rice Genome Sequencing Project (IRGSP) [3,4,5], and Monsanto. Beijing analyzed the two parental strains, 93–11 and PA64s, for a popular land race of super-hybrid rice, LYP9, and released a 4.2x draft for 93–11, a cultivar of the indica subspecies. This draft was acquired by a whole-genome shotgun (WGS) method [6]. Syngenta and IRGSP worked on Nipponbare, a cultivar of the japonica subspecies. Syngenta also used a WGS method and published a 6x draft. IRGSP used the clone-by-clone method [7] and released a 10x draft that incorporates the Syngenta data. Their publications include the finished version of Chromosomes 1, 4, and 10. These efforts have been widely hailed not only because rice feeds much of the world's population but also because rice is expected, through comparative analyses, to play a major role in understanding the grass family of crop plants [8,9,10,11,12,13]. We will report on an improved version of Beijing indica, which brings the coverage of the 93–11 dataset up to 6.28x. In addition, we improved Syngenta japonica by reassembling their sequence from the raw traces (National Center for Biotechnology Information Trace Archive; http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?) and combining that information with our 93–11 assembly. We achieved almost three orders of magnitude of improvement in long-range contiguity, and put essentially all the genes on the map, by combining the two WGS assemblies in a manner that preserves the single nucleotide polymorphism (SNP) information for indica–japonica differences. Both of these WGS assemblies were constructed independently of the information in IRGSP japonica.
Hence, the two japonica assemblies allow us to compare the WGS and clone-by-clone methods objectively. By taking the clone-by-clone assembly as a “gold standard,” we can estimate the intrinsic misassembly rates for our two WGS assemblies (not just the japonica WGS but also the indica WGS), as identical assembly procedures are used and both contain 6x coverage. If we compare IRGSP japonica to Beijing indica, any increases in the discrepancy rate beyond this intrinsic misassembly rate can be attributed to indica–japonica differences. In the same spirit, genes are identified for all three assemblies using the same annotation procedures, to assess gene content differences without the methodological inconsistencies that have plagued previous comparisons. Finally, we introduce a simple method for analyzing gene duplications that resolves the contradictory claims that rice is an ancient aneuploid [14] and an ancient polyploid [15]. In the process, we demonstrate that duplication of individual genes plays a major role in the continuing evolution of the grass genomes. Both WGS sequences, and details of our analyses, are available from our own Web site (Beijing Genomics Institute Rice Information System; http://rise.genomics.org.cn) [16]. The version of IRGSP japonica that we use was downloaded October 5, 2003, from GenBank and DNA Data Bank of Japan according to the guidelines at http://www.genome.arizona.edu/shotgun/rice/status and the physical map at http://rgp.dna.affrc.go.jp/IRGSP/download.

Results

WGS Assembly of indica and japonica

Many legitimate concerns have been raised about the differing qualities of the rice sequences that have been published [17,18] and on the idea that they must be “finished” [19,20]. Higher quality is of course a good thing, but it does come at a cost, and lost in the discussion is the reality that cost–benefit factors have always been important in sequencing.
Most notably, all genome projects to date have focused primarily on the euchromatic regions that can be cloned and sequenced, even though important genes are missed as a result. For example, an essential 5.1-Mb fertility gene [21] resides in the heterochromatic Y chromosome of the Drosophila genome. In plant genomes, costs are primarily driven by the intergenic retrotransposon clusters [22] that account for about half of the rice genome, and even more of the larger maize (6x) and wheat (38x) genomes. Hence, our objective is merely to have all the genes assembled in one piece, without fragmentation, and anchored to the maps. A similar objective has been proposed [23,24] for crop genomes in general. Our benchmark is the set of full-length japonica cDNAs from the Knowledge-Based Oryza Molecular-Biological Encyclopedia [25] that contains 19,079 nonredundant cDNAs (nr-KOME). We begin with a few definitions. At the end of any WGS, a substantial fraction of the reads (specifically, those whose sequences are highly repeated across the genome) are invariably left unassembled. The usable reads are assembled into contigs, scaffolds, and super-scaffolds. In a contig, the identity of every base is defined. In contrast, scaffolds and super-scaffolds have gaps (regions of known length but otherwise unknown base content). The difference is that one refers to the sequence before any linking information from indica and japonica sources are combined (scaffold) and the other refers to the sequence after they are combined (super-scaffold). All of the raw data that went into these WGS assemblies are listed in Table S1, and the assembly procedure itself is outlined in Figure 1. Compared with our previous 4.2x assembly of indica, more shotgun reads and a few directed finishing reads were added to increase the coverage to 6.28x. We did not use the older assembly at all. 
Instead, we went back to the raw reads and reassembled them, with an updated version of RePS [26,27] that incorporates some recent concepts from Phusion [28]. Increasing coverage is essential for reducing single-base error rates. Based on the estimates from RePS, 97.2% and 94.6% of our new assembly has an error rate of better than 10⁻³ and 10⁻⁴, respectively. For the older assembly, the percentages were only 90.8% and 83.5%, respectively. Equally important, and as expected from Poisson sampling statistics [29], increasing coverage improves the scaffold size to a point where, even without additional finishing effort, most of the nr-KOME cDNAs can be aligned in one piece, without fragmentation. All we had to do was find a way to link these scaffolds together to create larger super-scaffolds, which could then be anchored to the physical [30] and genetic [31] maps. Mapped super-scaffolds for Beijing indica have an N50 size (the size above which half of the total length of a sequence dataset is found) of 8.3 Mb, which is a thousand times better than our previous draft, as shown in Table 1. We used an unorthodox method to construct super-scaffolds of megabase size from initial scaffolds of 30-kb size. Most of the increase in long-range contiguity came from combining the two WGS assemblies, not from the bacterial artificial chromosome (BAC) end pairs, which were of limited utility because their insert sizes were too large. Notice that in combining indica and japonica data, we use the alternate subspecies only for order and orientation information, not to fill missing bases. In other words, every base in the indica assembly is from indica. Not one single base is from japonica. Another key point is that Syngenta japonica is our reassembly of their raw data, not the published assembly. By using RePS for both WGS assemblies, we obtain error estimates for every base, which will later be essential for use in polymorphism detection.
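
The N50 statistic defined above can be illustrated with a minimal Python sketch. The function name `n50` and the example lengths are hypothetical and are not taken from the paper's data; the sketch only follows the stated definition (the piece size at which the cumulative length, counted from the largest piece down, first reaches half of the total).

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Pieces are sorted from largest to smallest; the N50 is the length of
    the piece at which the running total first reaches half of the sum.
    """
    if not lengths:
        raise ValueError("at least one length is required")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical super-scaffold lengths (arbitrary units), for illustration only.
example_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(example_lengths))
```

In this toy case the total is 300, and the two largest pieces (80 + 70) already account for half of it, so the N50 is the length of the second piece.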
We would concede that if genes are ordered differently in indica and japonica, there is a small probability that by forcing the two subspecies together, we lose this information. However, there is no evidence of a major reordering of the genes because, if there were, it would have been seen in all these years of genetic mapping. The benefits thus outweigh the risks. The total genome size, including the unassembled reads and the unmapped pieces of all sizes, is 466.3 Mb for Beijing indica and 433.2 Mb for Syngenta japonica. For this estimate, we added up all the pieces at the scaffold level (as opposed to the super-scaffold level, where the gap size estimates are taken from the alternate subspecies and may not be representative of the underlying genome). We believe this difference is real, because the two genome sizes are based on the same procedures and similar WGS datasets. Although many smaller pieces fall between the cracks in the maps, these unmapped pieces turn out to be extremely gene poor. Hence, in our submission to DNA Data Bank of Japan/European Molecular Biology Laboratory/GenBank, we omit unassembled reads and unmapped pieces smaller than 2 kb, which has the advantage of also filtering out nonrice contaminants from inevitable mix-ups in the lab. Physical distance is defined along a pseudo-chromosome where gaps of estimated size larger than 200 kb (a typical BAC) are collapsed to 200 kb. Between adjacent super-scaffolds, where by definition we do not have an estimated gap size, we insert a 5-kb gap. To validate the long-range accuracy of our assemblies, we compared physical and genetic distances, as shown in Figures S1 and S2. We use only those 1,519 markers that can be found in all three rice assemblies by Blastn at E-values of 10⁻¹⁰⁰. There are two classes of discrepancies. First, the marker is on different chromosomes. All three rice assemblies agree with each other but not with the genetic map in 135 of 152 such markers.
In the second class, the disagreement is on positions within a chromosome, and all three rice assemblies agree with each other but not with the genetic map in 41 of 60 such markers. Only a small handful of discrepancies are unique to any one assembly. It is highly unlikely that all three rice assemblies will make the same mistake, so we conclude that on the scale of hundreds of kilobases, our WGS data are better than the genetic map. Computed over every five markers, the mean (median) recombination rate is 4.5 (4.2) cM/Mb. We do expect smaller-scale misassemblies in the WGS data, as, for example, in Beijing indica, 98.1%, 71.0%, and 39.3% of the unassembled, assembled-but-unmapped, and mapped pieces, respectively, contain 20-mer repeats that are estimated to occur at least twice in the genome. About half of these 20-mer repeats are recognizable transposable elements (TEs) in RepeatMasker (http://www.repeatmasker.org), and TE compositions in different categories of assembled data are summarized in Table S2. The most problematic misassemblies are those that occur within genes, as these affect our ability to annotate the genome. Hence, we compared the WGS data to gene sequences defined by nr-KOME and excised from IRGSP japonica. We searched for alignment discrepancies of at least 500 bp, consistent with misassembled reads, and interpreted any increase in the discrepancy rate from Syngenta japonica to Beijing indica as being due to polymorphic differences. There are remarkably few genes with discrepancies in coding exons, only 0.23% in Syngenta japonica and 1.44% in Beijing indica. If we include UTR exons and introns, the rates are 0.84% in Syngenta japonica and 5.65% in Beijing indica. Hence, the implication is that WGS misassemblies occur less frequently than polymorphic differences. Table 2 shows the number of nr-KOME cDNAs that are found in each of the three rice assemblies, using the criterion that 95% of the coding region must be alignable in BLAT [32].
Some cDNAs align to multiple pieces of the assembly, but most align to one single piece. Even if we consider only the latter case, all three rice assemblies are at least 91.2% complete. Regardless of the assembly, the gaps seem to be random, as genes that are fragmented in one assembly are often intact in another. Of the cDNAs, 98.1% can be found in one piece in either Beijing indica or Syngenta japonica (if we also insist that they be anchored to the map, this number becomes 97.7%). Combining all three rice assemblies results in 98.6% completeness. Strikingly, only 0.7% of the genes align to the unmapped Beijing indica sequence, despite the fact that these unmapped data were 12.3% of the searched sequence. This is the first of many examples that we will provide to support the idea that the unmapped pieces are extremely gene poor.

Gene Identification and Classification

We used an unorthodox method for gene identification. The conventional method, epitomized by Ensembl [33], uses sequence similarity to known genes and proteins to remove erroneous predictions, which are a serious problem for vertebrates because of the preponderance of large, multiexon genes, some of which can be megabases in size. However, plant genes are only a few kilobases in size, and given that Arabidopsis is still the only other sequenced plant, the Ensembl approach would remove many valid genes in a misguided effort to control a less serious problem. We removed erroneous predictions by relying instead on the fact that most of them are actually TEs that are mistakenly called genes. Ultimately, our method is vindicated by whole-genome microarray experiments using 70-mer oligos that are hybridized to mRNA from five different tissue types. One finds that 82% of predicted rice genes with no homologs in Arabidopsis can be detected in this manner, as opposed to 88% of predicted rice genes with homologs (L. Ma, J. Wang, C. Chen, X. Liu, N. Su, et al., unpublished data).
For the purpose of discussion, we will classify rice genes as WH (with homolog) or NH (no homolog), based on sequence similarity to Arabidopsis, with the stringency set to a level that is typically found in the literature. Nucleotide sequences are translated into protein sequences, and the Arabidopsis genome is searched in all six reading frames using TBlastN at E-values of 10⁻⁷. Putative exons are chained together, and success is declared if we can account for either 50% of the protein or 100 residues. We are not concerned that more sensitive search algorithms might identify homologies that we missed. Even the best algorithms are limited in their ability to identify structural homology by sequence similarity [34]. The main objective is to show how genes that are highly homologous or nonhomologous are sufficiently different as to merit special attention in data analysis, and the simplest way to emphasize this is to draw a dividing line. For methodological consistency, we annotated all three rice assemblies using the same procedures. We use FGENESH [35] for gene prediction because it has been shown to be the best of the available ab initio algorithms for rice [1]. An updated performance assessment is shown in Figure S3. The challenge in removing erroneous predictions resulting from TEs lies in how we compensate for the fact that the database used by RepeatMasker is incomplete. Figure 2 demonstrates how grass genomes are organized as gene islands of low copy number separated by intergenic repeat clusters of high copy number. We set a dividing line at copy number 10, not because there are no TEs below it but because there are few genes above it. Specifically, for genes defined by nr-KOME, 99.4% of the exons and 98.1% of the introns are attributed to 20-mers of copy number under 10.
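
The WH/NH rule stated earlier in this passage (chained exons must account for either 50% of the protein or 100 residues) can be sketched in Python as below. This is only an illustration under assumptions: the function names are hypothetical, hits are assumed to be given as 1-based inclusive (start, end) coordinates on the query protein, and they are assumed to have already passed the TBlastN E-value cutoff. The actual exon-chaining procedure used by the authors is not described in enough detail to reproduce here.

```python
def covered_residues(hits):
    """Count protein residues covered by the union of (start, end) hits.

    Coordinates are 1-based and inclusive; overlapping hits are not
    counted twice.
    """
    covered = 0
    current_end = 0
    for start, end in sorted(hits):
        if end <= current_end:
            continue  # hit lies entirely inside an already counted region
        covered += end - max(start, current_end + 1) + 1
        current_end = end
    return covered


def has_homolog(protein_length, hits, min_fraction=0.5, min_residues=100):
    """Apply the rule: 50% of the protein OR 100 residues accounted for."""
    n = covered_residues(hits)
    return n >= min_fraction * protein_length or n >= min_residues


# Hypothetical hits on a 300-residue protein: 80 residues covered in total.
print(has_homolog(300, [(1, 50), (40, 80)]))
# The same hits on a 120-residue protein cover more than half of it.
print(has_homolog(120, [(1, 50), (40, 80)]))
```

Under this rule a gene would be labeled WH when `has_homolog` is true for its Arabidopsis hits and NH otherwise.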
Using the finished sequence of Chromosomes 1 and 10, we show in Figure S4 that the mean (median) sizes are 23.7 kb (9.6 kb) for gene islands and 5.6 kb (3.5 kb) for intergenic repeat clusters. Applying RepeatMasker to these intergenic repeat clusters only identifies 47.6% as TEs, overwhelmingly gypsy and copia. We therefore propose to filter the predictions by removing genes for which 50% of their coding region is attributable to any combination of RepeatMasker TEs or 20-mers of copy number over 10. Although this filter might remove some real genes, it removes only a small fraction of them, as demonstrated by the nr-KOME cDNAs, where it eliminates 0.9% of these genes. In contrast, applying this same filter to the FGENESH predictions eliminates 19%–22% of the gene set, as indicated in Table 3. We believe that most of the removed predictions are TEs and that the benefits of removing these artifacts outweigh the risks of removing real genes. After this procedure, the gene counts range from 49,088 (Beijing indica) to 45,824 (Syngenta japonica) to 43,635 (IRGSP japonica). Previous estimates for Chromosomes 1, 4, and 10 made no such correction and found slightly larger numbers. About 45%–47% of predicted genes are NH, in contrast to 34.3% of nr-KOME cDNAs. This discrepancy is due to a combination of prediction errors and the fact that NH genes are difficult to clone because they are poorly expressed (data not shown). Radically different numbers have been given for mean gene size, from 2.6 kb in Chromosome 10 to 4.5 kb in our previous article. As we show in Table 4, much of this discrepancy can be explained by differences in definition. Predicted genes have a mean (median) size of 2.5 kb (1.8 kb). We get the same result for nr-KOME if we exclude UTRs, but we get a size of 3.6 kb (2.9 kb) if we include UTRs. If we restrict the genes to WH genes, this raises the gene size to 4.0 kb (3.4 kb). 
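
The repeat filter proposed in this passage (remove a predicted gene when 50% of its coding region is attributable to RepeatMasker TEs or to 20-mers of copy number over 10) can be sketched as an interval-overlap calculation. The sketch below is an assumption-laden illustration: the function names are hypothetical, intervals are half-open (start, end) genomic coordinates, the repeat intervals are assumed to already combine both kinds of repeat annotation, and the 50% threshold is treated as inclusive.

```python
def merge(intervals):
    """Merge overlapping or touching half-open intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def repeat_fraction(coding, repeats):
    """Fraction of coding bases that fall inside repeat intervals."""
    coding = merge(coding)
    repeats = merge(repeats)
    total = sum(end - start for start, end in coding)
    if total == 0:
        return 0.0
    overlap = 0
    for c_start, c_end in coding:
        for r_start, r_end in repeats:
            overlap += max(0, min(c_end, r_end) - max(c_start, r_start))
    return overlap / total


def is_likely_te(coding, repeats, threshold=0.5):
    """Flag a prediction whose coding region is mostly repeat-derived."""
    return repeat_fraction(coding, repeats) >= threshold


# Hypothetical gene with two coding exons (600 bp of coding sequence).
exons = [(0, 300), (500, 800)]
print(is_likely_te(exons, [(200, 700)]))  # 300 of 600 coding bases masked
print(is_likely_te(exons, [(250, 300)]))  # 50 of 600 coding bases masked
```

Merging both interval sets first keeps overlapping annotations (for example a TE that also contains high-copy 20-mers) from being counted twice.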
Even after removing likely TEs, two particular subclasses warrant caution, as they contain a higher than normal rate of erroneous predictions, which is reflected in a reduced rate of confirmation by ESTs. Overall, we used 200,648 ESTs from indica, japonica, and other rice subspecies. The confirmation rule is exact match over 100 bp. Genes predicted in unmapped sequences are confirmed at much lower rates than genes predicted in mapped sequences—about 11 times lower, even after removing 3.4 times as many unmapped genes as likely TEs. Genes unique to only one assembly also show lower confirmation rates, by a factor of roughly nine, when compared with the 35,052–36,940 genes that are shared by all three assemblies, as summarized in Figure 3. A more detailed analysis is given in Table S3. What is important is that few of these genes are likely to be real. We can use the ratio of the EST confirmation rates to correct our gene count estimates. Beijing indica is computed as [(36,940 × 39.6) + (1967 × 28.1) + (1586 × 20.4) + (8595 × 4.9)]/39.6 = 40,216. Similarly, we get 37,794 for Syngenta japonica and 37,581 for IRGSP japonica. If unique genes are truly expressed at lower levels than shared genes, this procedure might underestimate the gene count. One should thus interpret these numbers as lower bounds. Using the same EST adjustments, the number of predicted genes in Beijing indica that are not found in either japonica assembly is 1,064. Conversely, Syngenta japonica has 1,517 predicted genes that are not in indica (the number for IRGSP japonica is 1,479). As a fraction of the totals, 2.2% and 3.3% of indica and japonica genes, respectively, are unique to the subspecies, which is plausibly comparable to the amount of sequence that might still be missing. There is little difference in gene content between indica and japonica, but major differences are seen in the intergenic regions. Only 260 Mb (72%) of the mapped sequences can be aligned. 
This remains true no matter how much we relax the alignment parameters, and despite the fact that we had 34,190 “anchor points” (see Figure 1), which ensure that the indica–japonica comparisons are always made between the same regions of the chromosomes from the two subspecies. This unalignable fraction would be even larger if unmapped and unassembled sequences were included. Notice also that 20-mer repeat content is 59.2% in mapped-but-unaligned regions, as compared to 31.8% in mapped-and-aligned regions. Everything that we see is consistent with the fact that plant intergenic regions are rapidly evolving [36]. As further proof of this fact, Table 5 shows the SNP rates in these alignable regions. The rates vary from as little as 3.0 SNP/kb in coding regions to as much as 27.6 SNP/kb in identifiable TEs. Biological functions are inferred by and displayed within the Bioverse framework [37,38] by combining more than seven of the latest computational techniques, including profile–profile comparison to well-curated protein families, motif discovery, and structural assignment/prediction. Note that we do not use transitive annotations, as their error propagation rates are too high. We present these results in Gene Ontology (GO) [39] and InterPro [40] formats. Functions are assigned to 60.2% of WH genes and even to 17.5% of NH genes, reflecting the fact that Bioverse uses highly sensitive techniques. Figure 4 shows a couple of our GO comparisons, focused on plant-specific categories in Gramene [41]. From the fraction of the gene set in each category, rice and Arabidopsis are remarkably similar. FGENESH-predicted genes and nr-KOME cDNAs exhibit very similar patterns too, confirming the unbiased nature of these cDNAs. InterPro domain categories tell much the same story, and these data are summarized in Table S4. Bioverse is distinguished from other annotation pipelines in that it also determines protein–protein interactions. 
Two proteins are predicted to interact if they are both similar in sequence to proteins involved in known interactions. The known interactions are taken from numerous sources, including Protein Data Bank [42] and the Database of Interacting Proteins (which stores yeast two-hybrid studies, affinity column studies, and literature searches) [43]. The resultant network has 1,879 proteins/nodes with 8,902 unique interactions. Figure 5 highlights a small portion of this network, for defense proteins (i.e., classified as “defense related” under GO molecular function or “defense response” under GO biological process) and their direct neighbors in the network. Many occupy central positions, meaning the network would fall apart if they were removed. Such genes are essential for cell survival [44]. More details can be found at http://bioverse.compbio.washington.edu. Figure S5 shows that, near the centromeres, there is an increase in TE density (especially for large class I TEs like gypsy and copia) and a decrease in gene density. A more detailed view is given by the pullout figures of Figure S6, right down to the level of individual genes and TEs, to emphasize the excellent level of concordance between the two different WGS assemblies: Beijing indica and Syngenta japonica.

Evidence of Whole-Genome Duplication

Duplication of individual genes, chromosomal segments, or even entire genomes is an important source of raw materials for gene genesis [45]. In the extreme case of a whole-genome duplication (WGD), convincing examples are difficult to find because of the expected rapid loss of duplicated genes and because the rate of individual gene duplication is high enough to mask any remnants of an ancient WGD [46]. Yeast was the first genome in which a WGD was detected [47]. In plants, the existence issue is not disputed, as polyploidy is common [48,49,50,51,52,53], but even with complete genome sequence, many details remain obscure.
For Arabidopsis, the number and timing of these duplication events are still unknown [54,55,56,57,58,59]. For rice, segmental duplications were known [60,61,62] before the rice genome sequence was published. However, detailed analysis of this sequence has resulted in the contradictory assertions that rice is an ancient aneuploid [14] and an ancient polyploid [15]. Here, we resolve this conflict by showing that every conceivable class of duplication that could have happened did in fact happen, including a WGD. We accept that every class of duplication is present in the same genome, and we thus explicitly assign, to every homolog pair, a status as to the class of duplication from which it came. For the sake of discussion, we define three classes: segmental duplication of multiple genes along a chromosome, tandem duplication of individual genes, and a category called background duplications to encompass everything else that cannot be so easily classified. In this conception, a WGD is a collection of segmental duplications that cover a majority of the genome, all of which date back to a common time in evolutionary history. All three rice assemblies give the same result, so we show only Beijing indica. Unlike previous analyses, we avoid predicted genes. Instead, we define a homolog pair to be a single nr-KOME cDNA and one of its potentially many homologs within rice. These homologs are defined by translating the cDNA's coding sequence into protein and searching the rice genome in all six reading frames for putative exons, with TBlastN at E-values of 10⁻⁷. Exons in the same order and orientation are linked together, and success is declared if these linked exons can account for 50% of the original protein sequence. This technique has the advantage that the homolog need not be a cDNA or a predicted gene (as neither dataset is likely to be complete). In fact, the homolog might even be a remnant of an ancient duplication that is no longer a functional gene.
Complications are found at two extremes. Many cDNAs have no homologs, but many others have too many homologs. In particular, 24.5% of WH genes have no homologs in rice, whereas 64.4% of NH genes have no homologs in rice. Because NH genes are dispersed throughout the genome, sandwiched between WH genes, we cannot adopt a strict colinearity rule in our search for duplicated segments. There would be too many exceptions. Conversely, when there is at least one homolog in rice, the mean (median) number of homologs per cDNA is 40 (5). Rather than deal with the complexities of this situation, we focus first on the cDNAs with one and only one homolog. This reduces the background duplication noise and allows us to identify trend lines indicative of segmental and tandem gene duplications. We can then add back those cDNAs with more than one homolog that we had rejected earlier by using our newly defined trend lines to constrain the choices. The above procedure leaves us with 2,271 homolog pairs (or cDNAs). We adopt a graphical approach, because in the presence of massive background noise, trend lines are often easier to identify by eye than by software. Figure 6 depicts Chromosomes 2 and 6, and Figure S7 depicts all 12 chromosomes. There are 18 pairs of duplicated segments that together cover 65.7% of the length of all the mapped super-scaffolds. The mean (median) number of homolog pairs per segment is 34 (23). The segment sizes are 6.9 Mb (5.4 Mb), and they differ by 43% (42%) within a segment pair, which is not at all unexpected given the rapidly evolving nature of the rice intergenic regions. Instances of multiple duplicated segments on the same chromosomal region are extremely rare, covering only 0.9% of the total length. No additional multilevel duplications are detected if we use cDNAs with up to two homologs, as opposed to those with only one. Notice also that there are duplicated segments on all 12 rice chromosomes, as summarized in Figure 7. 
One can date the duplications by computing the number of substitutions per silent site (Ks). Multiple substitution corrections are done within K-Estimator [63]. To improve our statistics, we now include the higher-order homologs (those cDNAs with more than one homolog that we had removed before). Table 6 shows that this doubles or triples the number of homolog pairs in every segment and brings the mean (median) to 74 (53). The resultant Ks distribution is shown in Figure 8. One pair of segments on Chromosomes 11 and 12 is more recent in origin and has more homolog pairs per unit length than all the others. It was previously identified in many publications. If we ignore this segment pair, the mean Ks is 0.69, dating the duplication event to 53 million years ago (Mya), assuming a neutral evolutionary rate of 6.5 × 10⁻⁹ substitutions per silent site per year [64]. Most of the uncertainties are due to the multiple-substitution corrections for Ks. Another popular algorithm for Ks [65] dates the duplication event to 94 Mya. The molecular clock can also vary between genes and between taxa [66,67]. Evidence for the former is seen in the width of the distribution for Ks in Figure 8, which has a standard deviation of 49.8% based on individual homolog pairs (as opposed to 14.5% when based on duplicated segment pairs). We believe that the variation between genes will cancel out, but we cannot remove the systematic error resulting from the multiple substitution corrections or the potential error in the 6.5 × 10⁻⁹ evolutionary rate (which was derived from a small number of genes). However, all we really want to know is whether the duplication event occurred before or after the origin of the grasses, 55–70 Mya [68]. To this end, phylogenetic approaches can be used, albeit for a limited number of genes, because so few plants have been fully sequenced. A majority of these phylogenies indicate that the duplication event occurred before this pivotal point in evolution [14].
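
The conversion from Ks to a duplication date quoted above can be checked with a short Python sketch. The text does not spell out the formula; the sketch assumes the standard molecular-clock relation T = Ks / (2r), where the factor of two reflects substitutions accumulating independently on both duplicated copies. That assumption reproduces the quoted figure of about 53 Mya from a mean Ks of 0.69 and a rate of 6.5 × 10⁻⁹ substitutions per silent site per year. The function name is hypothetical.

```python
def duplication_age_years(ks, rate_per_site_per_year=6.5e-9):
    """Estimate the age of a duplication from silent-site divergence.

    Assumes T = Ks / (2 * r): both copies diverge from their common
    ancestor at the same neutral rate r.
    """
    return ks / (2 * rate_per_site_per_year)


# Mean Ks of the duplicated segments, excluding the Chromosome 11/12 pair.
age = duplication_age_years(0.69)
print(f"{age / 1e6:.1f} Mya")
```

Because the date scales linearly with Ks and inversely with the assumed rate, any systematic error in either quantity carries over proportionally to the estimated age.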
Almost certainly, the duplication event occurred after the divergence of monocots and eudicots, 170–235 Mya [69]. However, the best evidence for the statement that the duplication event must have predated the origin of the grasses is the fact that there is no other way to reconcile it with the widely observed synteny between different grass genomes [70]. In striking contrast, the Chromosome 11 to 12 duplication dates back to just 21 Mya, which postdates the origins of the grasses by a comfortable margin. If we accept that a WGD occurred before the divergence of maize–rice, and that a duplication in Chromosomes 11 and 12 occurred afterward, we might then expect to find two levels of duplication in this region of rice. We thus extended our analysis to consider cDNAs that map to as many as four loci. No indications of such a multilevel duplication could be found. Undaunted, we decided to try another approach and analyzed the maize–rice synteny, starting from the maize genetic map [71]. The results are given in Figures S8 and S9. We found 35 pairs of syntenic segments covering 71.4% and 52.9% of the maize and rice genomes, respectively. All previously identified segments are confirmed, except for those on Chromosomes 11 and 12 of rice. No synteny is found in the vicinity of this recent duplication. There are many explanations, and they need not contradict our hypothesis, as only 65.7% of the rice genome is in identifiably duplicated segments, and the region from Chromosome 11 to 12 is a minuscule 3.0% of the genome. It is possible that any traces of the WGD had already been lost by the time this recent duplication occurred. The region is also sufficiently small that any synteny with maize would be difficult to detect. It is too early to draw conclusions, especially as maize–rice synteny appears to be much more complicated than previously thought [72]. 
Given how so much of the rice genome is covered by segmental duplications, and the fact that all but one of our 18 segment pairs date back to the same time, give or take a standard deviation of 14.5%, the simplest interpretation is that a WGD did occur and that it happened before the origin of the grasses. However, it is equally clear that other classes of duplications are also present, and these are worth investigating too. Ongoing Individual Gene Duplications Tandem duplications are represented by the trend along the diagonal, Y = X, that is observed in all chromosomes (see Figures 6 and Figure S7). Segmental duplications within the same chromosome are possible, but their trend would not be along the diagonal, and none were actually seen in our analysis. As an indicator of the prevalence of the three different duplication classes, we use the number of homolog pairs before and after the inclusion of higher-order homologs. Segmental duplications contain 609 and 1,340 pairs, whereas tandem duplications contain 311 and 957 pairs. We can increase the tandem numbers by relaxing our definitions to allow two TBlastN homologs of an nr-KOME cDNA to count as a homolog pair (instead of insisting that one always be a cDNA). This is what we use in the Ks distribution plot of Figure 8, which contains 1,696 homolog pairs. Rather than a maximum in the distribution at some nonzero Ks, we find a big peak at zero Ks, followed afterward by an exponential decay. The implication is that tandem duplication is an ongoing evolutionary process that provides an endless source of raw materials for gene genesis. If we adopt the methods and parameters of the Arabidopsis genome paper, we find that 16.5% of the rice genome is tandemly duplicated, compared to 16.2% of the Arabidopsis genome. 
Note, however, that the Ks distribution for tandemly duplicated genes in Arabidopsis is highly unusual, in the sense that it does not exhibit the big peak at zero Ks that is seen in virtually every other plant genome [52]. In addition to segmental and tandem duplications, there is a third and last class of duplications that looks like background noise in our figures. The number of homolog pairs is 1,351 and 32,384 before and after higher-order homologs, respectively, although with no trend line to constrain the choice of homologs, that second number is almost certainly an overestimate, since only 4,212 cDNAs are involved. Surprisingly few of these higher-order homologs are the result of processed pseudogenes, as the number of cases in which a multiexon cDNA pairs with a single-exon TBlastN homolog is 9.8%. To demonstrate how overwhelmingly these higher-order homologs contribute to the background noise, Figure 9 depicts what Chromosome 2 would have looked like if we had included them. For simplicity of interpretation, Figure 8 is the Ks distribution of the cDNAs with one and only one homolog. This distribution has characteristics of the distribution for tandem duplications—large peak at zero Ks followed by exponential decay—except that the magnitudes of the Ks are much larger for background duplications. We believe that most of these background duplications were originally tandem duplications that, over time, migrated to other parts of the genome, but we cannot rule out the possibility of direct duplications to remote loci. Some older duplications may even be due to migration of genes from segmental duplications, but these are a small part of the overall picture. However we do the counting, it appears that this combination of recent tandem and background duplications, which we call individual gene duplications, would rival any contribution from the segmental duplications. 
Tandem and segmental duplications show markedly different Ka/Ks distributions, a popular test for evolutionary selection, where Ka and Ks refer to the fraction of nonsynonymous and synonymous sites, respectively, that are changed within a homolog pair [73]. Ka/Ks is one under neutrality, below one under purifying selection, and above one under adaptive selection. Tandem duplications tend to have larger Ka/Ks values, as we show in Figure 10. The averages are 0.720 (tandem) and 0.365 (segmental), and more homolog pairs exhibit Ka/Ks > 1 in tandem duplications. This is consistent with the observation that more recent duplications tend to have larger Ka/Ks values [74] and with the idea that, immediately after duplication, one of the two genes undergoes a fast evolving phase [75]. Finally, let us consider again those nr-KOME cDNAs with one and only one homolog. Among the ones assigned to a tandem duplication, 65.3% are NH, but among the ones assigned to a segmental duplication, 23.8% are NH. Hence, there is a marked correlation between NH genes and tandem duplications. Our WGD is in good agreement with the results of Paterson et al. [15], but we can also explain the seemingly contradictory results of Vandepoele et al. [14]. First, they did not have a complete genome; about two-thirds of their segmental duplications were interrupted by a break in the assembly. Second, their algorithms were very likely confounded by the many NH genes with no homologs in rice itself and by the many individual gene duplications that in aggregate masked the WGD. In fact, their segmental duplications had a Ks distribution similar to ours, but they only covered 15% of the genome. Then, when they examined the distribution of Ks for all duplicates, what they found was a big peak at zero Ks. This led them to conclude there was no WGD, when, in fact, almost every class of duplication that had been hypothesized was present, and they needed only to allow for that.
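The Ka/Ks interpretation used above for tandem and segmental duplications can be sketched as a small classifier. This is only an illustration of the rule as stated (one under neutrality, below one under purifying selection, above one under adaptive selection). The function name, the tolerance band around one, and the handling of Ks = 0 are assumptions introduced here, not part of the original analysis.

```python
def selection_regime(ka: float, ks: float, tolerance: float = 0.05) -> str:
    """Classify a homolog pair by its Ka/Ks ratio.

    `tolerance` is a hypothetical band around 1.0 treated as neutral;
    the source text states only the ideal thresholds.
    """
    if ks == 0:
        # Very recent duplicates can have Ks = 0, so the ratio is undefined.
        return "undefined"
    ratio = ka / ks
    if abs(ratio - 1.0) <= tolerance:
        return "neutral"
    return "purifying" if ratio < 1.0 else "adaptive"


# Ratios equal to the reported averages for tandem and segmental duplications.
print(selection_regime(0.720, 1.0))  # purifying
print(selection_regime(0.365, 1.0))  # purifying
```

Both reported averages fall below one, so the difference between the two classes is one of degree: tandem duplicates are on average under weaker constraint, and more of them exceed one.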
Discussion Until recently, Arabidopsis was the only sequenced plant genome. When two rice genomes were first published in draft format, the comparative analyses that could be done were hindered by a lack of long-range contiguity. Now, there are three plant genomes (indica rice, japonica rice, and Arabidopsis) with multimegabase contiguity. In our analyses, we strived to maintain methodological consistency. To assess the accuracy of our assemblies, we first compared IRGSP japonica to Syngenta japonica, so that polymorphic differences would not be a confounding factor. To compare gene content in the three rice assemblies, we annotated them all with the same procedures. Our conclusion is that, even if the WGS method does fall just slightly short of the clone-by-clone method in terms of accuracy and completeness, it comes remarkably close. This is why all the genome-sequencing projects now being funded by the National Human Genome Research Institute (in the United States) are being done with WGS methods (http://www.genome.gov/11007951). Rice is also now one of the few organisms with the luxury of having a complete genome sequence for two important subspecies. Comparisons of indica and japonica reveal strikingly little difference in the gene content, but there are massive intergenic differences. This vindicates our strategy to focus on genic sequences, because if the intergenic sequences are so unstable even between indica and japonica, they are highly unlikely to be functional. Our analysis of the duplication history in rice resolves a simmering dispute and, at the same time, raises some intriguing questions. We find evidence for an ancient WGD, a recent segmental duplication, and massive ongoing individual gene duplications. This last phenomenon can explain certain unexpected findings. Sequencing of orthologous loci between grass genomes has identified many smaller-scale rearrangements that were not seen in the original map-based studies. 
Many of these exceptions to synteny are due to tandem duplications [76,77,78], which makes sense, given how these duplications are a frequent and ongoing event for grass genome evolution. In addition, the massive ongoing individual gene duplications provide a never-ending source of raw material for gene genesis. We believe that the large number of rice NH genes is a transient effect of this ongoing process. The contrary argument is that any such transients cannot be long-lived, as one of the two genes must decay rapidly to avoid the dosage-doubling problem [79,80]. We believe this is irrelevant when there is a continual injection of new gene duplicates. Additional details must, however, be deferred to a future article, in which we can better address other important issues, such as the critical need to confirm NH genes in proteomics and conservation in the maize genome sequence. Looking toward the future, we would point out that the Chinese Superhybrid Rice Genome Project was designed to include not only a major subspecies of rice, namely, the indica variety represented by 93–11, but also the maternal strain of the LYP9 superhybrid, PA64s, which has a complex breeding history incorporating genetic material from indica, japonica, and javanica—all of the major subspecies of cultivated rice. Work on PA64s is continuing at our Beijing center. For the research community, we will be providing DNA microarrays to facilitate the systematic studies of gene expression in different tissues and developmental stages, and under different physiological and environmental conditions. We will develop molecular markers for mapping causative genes in mutant lines and marker-assisted breeding. This publication, and the associated data release, is also a fitting way to celebrate the end of 2004, which the General Assembly of the United Nations declared to be the International Year of Rice (http://www.fao.org/rice2004).
Materials and Methods Construction of reference cDNAs: nr-KOME The initial Knowledge-Based Oryza Molecular-Biological Encyclopedia dataset [25] had 28,444 japonica cDNAs with complete open reading frames. These cDNAs were aligned to Syngenta japonica, and when two alignments overlapped by at least 100 bp, the smaller cDNA was removed. A small number of clones could not be aligned—not even partially—to any of our three rice assemblies (Beijing indica, Syngenta japonica, and IRGSP japonica). Removing these as nonrice contaminants gave a set of 19,079 nonredundant cDNAs that we call nr-KOME. Because the sequence quality is so high, we could use the longest open reading frame for the overwhelming majority of these cDNAs, without having to correct for sequencing errors. Minor corrections are applied to 2.5% of these cDNAs, following the methods first developed for GenScan [81]. Repeats and their effects on WGS misassembly The basic procedure for converting sequence reads into contigs and scaffolds was described in our original publication on RePS [26], our WGS assembler. A common source of confusion is the distinction between mathematically defined repeats (MDRs) and biologically defined repeats. What we focus on are MDRs, which refer to 20-mer sequences that are exactly repeated in the genome, without regard to their underlying biological context. In our nomenclature, “depth” refers to the number of times that a 20-mer appears in the unassembled sequence reads and “copy number” refers to the number of times that it appears in the (correctly assembled) genome. “Coverage” is the number of times that the genome is redundantly sampled, and therefore depth = copy number × coverage. Special procedures are used to compute depths efficiently [27]. In a WGS assembly, the problems arise from the MDRs, which are not equivalent to the biologically defined repeats. 
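The relation depth = copy number × coverage can be illustrated with a naive k-mer counter. This is a minimal sketch and not the RePS implementation, which uses special procedures [27] to compute depths efficiently. The function names are hypothetical, and the sketch ignores reverse complements and sequencing errors for simplicity.

```python
from collections import Counter


def kmer_depths(reads, k=20):
    """Count how many times each k-mer appears in the unassembled reads."""
    depths = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            depths[read[i:i + k]] += 1
    return depths


def estimated_copy_number(depth, coverage):
    """Invert depth = copy number x coverage to estimate copy number."""
    if coverage <= 0:
        raise ValueError("coverage must be positive")
    return depth / coverage


# Toy example with k = 4 so the counts are easy to verify by hand.
reads = ["ACGTACGT", "CGTACGTA"]
depths = kmer_depths(reads, k=4)
print(depths["ACGT"])                   # 3
print(estimated_copy_number(12, 6.0))   # 2.0
```

A 20-mer whose estimated copy number is well above one would be treated as a mathematically defined repeat, regardless of whether it lies in a TE, a gene duplicate, or anything else.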
For example, TEs qualify as biologically defined repeats, and they can be recognized, even after many millions of years of degradation, by specialized programs like RepeatMasker (http://www.repeatmasker.org). However, the degradation makes it trivial to distinguish between two copies of an ancient TE, so these do not cause assembly problems. It is also relatively easy to distinguish between gene duplicates, because their introns and flanking intergenic regions are under fewer evolutionary constraints than their exons. Even for recent TEs and gene duplicates, assembly problems can be avoided, because RePS computes the copy number for every 20-mer in the WGS assembly, and it will refuse to join anything that might be ambiguous. Indeed, the only way a misassembly can occur is if there is a low copy MDR and its copy number is underestimated by RePS. All of our tests show that, although this can happen, it is a rare event. On the usefulness (or not) of BAC end pairs The fundamental challenge was that we had to create super-scaffolds of megabase size from scaffolds of 30-kb size. It is generally thought that BAC end pairs are useful for this purpose, but this is not true when the BAC inserts, typically 122–187 kb, are much bigger than the scaffold sizes. Instead of linking adjacent scaffolds, they link every fourth to sixth scaffold. The fact that the density of BAC ends is 2.3 kb does not help, because there is no way to determine the order and orientation of the overlapping BACs. Fingerprint maps do provide some ordering information, but nothing like 2.3-kb resolution, and orientation information is still missing. The danger in using the BACs at this point is that you end up with a morass of interleaving super-scaffolds [26], with no way to untangle them. We actually did an assembly with only the BACs, and the result was that the super-scaffolds were 87% larger than they should have been. 
In the mouse project [82], the solution was to use fosmid end pairs, because these inserts are constrained to an almost ideal size of 40 kb. In the case of rice, we did not need to sequence fosmid end pairs, because by combining the indica and japonica WGS assemblies, it is possible to get linking information at the requisite length scales. We did of course use all available BAC end pairs [83] (http://rgp.dna.affrc.go.jp/blast/runblast.html), but they were only useful after the intermediate-range linking that came from combining WGS assemblies. Misassemblies versus polymorphic differences To verify our WGS assemblies on the smaller-length scales that are more characteristic of genes, we compare them with IRGSP japonica, taking the latter as the “gold standard” not because it is perfect but because it is more likely to be correct. We focus on gene regions by aligning nr-KOME cDNAs to IRGSP japonica and excising the sequences from the 5′ to 3′ UTRs, including introns and an additional 500 bp at both ends. What we search for are potential misassemblies due to misplaced reads. Given that a typical read is 500 bp, these should appear as segments of 500 bp or more in which the excised gene sequence cannot be aligned with the WGS assembly. Such discrepancies are noted based on where they occur in the context of the gene. Although it is possible to detect more than one discrepancy per gene, we only count the most serious discrepancy in each gene based on the likelihood of it being functional. The prioritization is from coding exon, to UTR exon, to intron. Notice that discrepancies of this nature are not always from misassemblies. In the Beijing indica comparison, they can also be due to polymorphic differences. Although there is no way to tell what any particular discrepancy is, we know the misassembly rate from the Syngenta japonica comparison. Therefore, any increase in the discrepancy rate in the Beijing indica comparison can be attributed to polymorphic differences.
Ab initio predictions in WH versus NH genes FGENESH [35] behaves very differently for WH and NH genes, as defined by nr-KOME. Following the methods of our recent review [84], we compute false positive (FP) and false negative (FN) rates. Error rates are given on a per amino acid basis. This means that in addition to correctly identifying the coding bases, we require the reading frame to be correctly determined. WH genes show very low error rates (FP = 0.10 and FN = 0.05). Although NH genes show higher error rates (FP = 0.35 and FN = 0.25), these are not that much worse than human genes (FP = 0.30 and FN = 0.12), and like it or not, error rates like these are the state of the art in ab initio prediction. On closer examination, it is clear that most of the problems in rice are caused by single-exon genes with small coding regions, which are more prevalent among NH genes and form a category that all ab initio algorithms handle poorly. This category of genes does not affect the gene count because FP and FN cancel each other out. We therefore focus on removing TEs that are mistakenly called genes. Comparison of indica-japonica to identify SNPs The sequence alignments for indica and japonica are straightforward, with almost no chance of paralog confusion, because of our 34,190 unique “anchor points” (see Figure 1). We partition the sequence into four nonoverlapping categories called unassembled, assembled-but-unmapped, mapped-but-unaligned, and aligned. The last category is where almost all of the genes are, and where we can get polymorphism data. Detailed sequence alignments are computed with CrossMatch, a Smith-Waterman algorithm that is included in Phrap (http://www.phrap.org). This is preferred to any of the BLAST alignment tools, which, although they are faster, occasionally miss subtle details. To discriminate between polymorphisms and sequencing errors, we use the error probability p attached to every base, and given as Q = −10 × log(p). 
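The quality score defined above can be converted to and from an error probability with a few lines of code. The sketch assumes the logarithm is base 10, which is the standard Phred convention; the text writes only "log". The function names are hypothetical.

```python
import math


def error_prob_to_quality(p: float) -> float:
    """Q = -10 * log10(p), assuming the standard base-10 Phred convention."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -10.0 * math.log10(p)


def quality_to_error_prob(q: float) -> float:
    """Invert the definition: p = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


# A base with Q = 20 has a 1-in-100 chance of being wrong.
print(quality_to_error_prob(20))
print(error_prob_to_quality(0.001))
```

Under this convention each additional 10 quality units corresponds to a tenfold lower error probability, so quality thresholds translate directly into tolerated error rates.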
Following the rules established in the early days of large-scale polymorphism discovery [85], we use thresholds of Q > 23 at the SNP site and Q > 15 for the two flanking 5-bp regions. Experience has taught us that higher thresholds (30 and 22, respectively) are required for the indels. For comparison, an independent analysis [86] reported mean rates of 7.1 SNP/kb and 2.0 indel/kb, with 98% of these SNPs experimentally confirmed. Our SNP rates are two times higher because we aligned more of the intergenic sequence. If we eliminate this factor, say, by restricting our rates to the introns of the genic regions defined by nr-KOME, our rates are 6.1 SNP/kb and 1.3 indel/kb, which are actually lower than the rates from that independent analysis. On the reliability of the p–p interaction data Bioverse annotations in this article are dated July 2003 (FGENESH) and November 2002 (nr-KOME). Two proteins are said to interact if they are similar to two other proteins that are known to interact. Our criterion is that the product of the similarity measures (percentage identity) must exceed 0.15. For example, two proteins with 45% and 30% identity to two other proteins that are experimentally determined to interact would be rejected, as their score is 0.45 × 0.30 = 0.135. The reliability of this approach, especially for transfer of interaction data between organisms, has been demonstrated in Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, and Helicobacter pylori analyses [87]. As an example of a predicted interaction for rice that has been independently confirmed, Bioverse identification numbers 21736 and 8526 (score 0.21) show an interaction between CDK-activating kinase and H-type cyclins [88]. A general way to verify the predicted interactions is to compare them against known protein complexes in the Protein Data Bank. Unfortunately, there are few Protein Data Bank structures from rice, and even fewer are of protein complexes. 
Given this dearth of experimentally determined interactions for rice, Bioverse is almost the only source of large-scale interaction data. Details of the duplication and synteny analysis We defined a homolog pair as a single nr-KOME cDNA and its TBlastN homolog, but occasionally that TBlastN homolog will overlap with another cDNA. To avoid double counting, we keep only the larger of these two cDNAs. Segmental duplications identified by visual inspection must have at least five homolog pairs, with no more than 5 Mb between adjacent homolog pairs. We approximate the trend line with a second- or third-order polynomial, and to capture what our eyes indicate should be captured, we accept homolog pairs within a 500-kb radius of this polynomial. Slightly different definitions are used for tandem duplications, depending on application. For Ks, we allow two TBlastN homologs to count as a homolog pair and accept homolog pairs within a 50-kb radius of the diagonal, although the mean (median) center-to-center distance is 6.8 kb (4.7 kb). To compare tandem duplications in rice and Arabidopsis, we use the methods described in the Arabidopsis genome paper and analyze predicted genes with BlastP at E-values of 10^−20. To determine the maize–rice synteny, we began with 1,063 maize genetic markers [71] and searched for BlastN alignments to rice of at least 100-bp size and 80% identity. Given the segmental allotetraploid origins of maize [89], many markers are associated with two loci in maize. Each marker aligns to a mean (median) of 1.9 (1) loci in rice. We used only the longest of these alignments and verified in retrospect that using all of them would not have mattered. In the end, there are 35 pairs of syntenic segments, which cover 71.4% and 52.9% of the maize and rice genomes, respectively, and the mean (median) number of markers per syntenic segment is 18 (12).
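Two of the numerical rules in the duplication analysis above lend themselves to a short sketch: the 50-kb window around the diagonal for tandem pairs, and the requirement that a segmental duplication contain at least five homolog pairs with no more than 5 Mb between adjacent pairs. This is a simplified illustration, not the authors' procedure, which relied on visual inspection and polynomial trend lines. The function names are hypothetical, the tandem test uses the coordinate difference |x − y| as the distance from the diagonal, and the run finder works along a single chromosome axis only.

```python
def is_tandem(chrom_a, pos_a, chrom_b, pos_b, radius=50_000):
    """True if both loci lie on the same chromosome within `radius` bp."""
    return chrom_a == chrom_b and abs(pos_a - pos_b) <= radius


def candidate_runs(positions, max_gap=5_000_000, min_pairs=5):
    """Group homolog-pair positions into runs with gaps of at most `max_gap`.

    Only runs containing at least `min_pairs` positions are kept.
    """
    runs, current = [], []
    for pos in sorted(positions):
        # Start a new run when the gap to the previous pair is too large.
        if current and pos - current[-1] > max_gap:
            if len(current) >= min_pairs:
                runs.append(current)
            current = []
        current.append(pos)
    if len(current) >= min_pairs:
        runs.append(current)
    return runs


positions = [1_000_000, 2_000_000, 3_000_000, 4_000_000, 5_000_000,
             20_000_000, 21_000_000]
print(candidate_runs(positions))  # one run of five; the trailing two are dropped
print(is_tandem("2", 100_000, "2", 130_000))  # True
```

A fuller treatment would also require the partner positions on the second chromosome to follow a consistent trend, which is the role the polynomial fit plays in the text.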
Supporting Information Figure S1 Genetic Versus Physical Map Distance for All 12 Rice Chromosomes, Based on Beijing indica Similar results are seen with the other two assemblies, Syngenta japonica and IRGSP japonica. (1 MB EPS). Figure S2 Number of Discrepant Markers in Comparisons of Genetic and Physical Maps for 1,519 Markers Found in All Three Rice Assemblies We count discrepancies where the markers are found (A) on different chromosomes and (B) in different locations on the same chromosome. (458 KB ZIP). Figure S3 Gene Prediction by FGENESH, Tested against nr-KOME cDNAs Genomic size refers to the unspliced transcript, with introns, but constrained to the region from the start to stop codons. CDS size refers to the spliced transcript, without introns. Predictions are assessed with FP and FN rates, where per-aa (per amino acid) refers to the fact that we check whether the reading frame is correct. (351 KB ZIP). Figure S4 Distribution of Sizes for Gene Islands and Intergenic Repeat Clusters, Based on Complete Sequence of Chromosomes 1 and 10 from IRGSP japonica Intergenic repeat clusters are regions of size larger than 1.5 kb (i.e., between a MITE and a gypsy/copia TE), where most of the 20-mer copy numbers exceed ten. Lower copy number regions are tolerated up to a “maximum gap size,” which defaults to 150 bp. Regions lying between two adjacent intergenic repeat clusters are taken to be gene islands. (233 KB ZIP). Figure S5 Gene and TE Densities for Beijing indica Chromosome 7, as a Percentage of Sequence Length Near the centromeres, there is an increase in TE density (especially for the large, class I TEs such as gypsy and copia) and a decrease in gene density.
This is not an artifact of the fact that WGS assemblies underrepresent larger TEs, as much the same effect is observed when we use IRGSP japonica instead (data not shown). (362 KB ZIP). Figure S6 Coordinated Annotation of the Individual Chromosomes for Beijing indica and Syngenta japonica We depict all the genetic markers, nr-KOME cDNAs, FGENESH gene predictions, and transposable elements identified by RepeatMasker. Genes are depicted as WH (colored blue) or NH (colored red) based on their similarity to Arabidopsis. TEs are decomposed into classes I, II, and III. Correspondence between indica and japonica is indicated by drawing a connecting line between the 5′ ends of the nr-KOME cDNAs that clearly align to both assemblies. (9.6 MB ZIP). Figure S7 Duplicated Segments in the Beijing indica Assembly for All 12 Chromosomes, Plotted in the Manner of Figure 6, and with a Total of 12 Panels (507 KB ZIP). Figure S8 Complete Synteny between Maize and Rice I Each point indicates the genomic positions for a maize genetic marker and its highest confidence match in rice. The x-axis shows a specific chromosome for one genome, and the y-axis shows all chromosomes for a second genome, with the chromosome numbers color-coded as per the legend. We show here 12 panels for rice. (311 KB ZIP). Figure S9 Complete Synteny between Maize and Rice II Each point indicates the genomic positions for a maize genetic marker and its highest confidence match in rice. The x-axis shows a specific chromosome for one genome, and the y-axis shows all chromosomes for a second genome, with the chromosome numbers color-coded as per the legend. We show here ten panels for maize. (288 KB ZIP).
Table S1 Raw Data for Beijing indica and Syngenta japonica Assemblies Read length is the number of Q20 bases with an error rate of 10^−2 or better. Effective coverage is based on the depth of reads in contigs over 5 kb in size, ignoring regions with 20-mer repeats. Clone insert sizes are specified in terms of tenth and 90th percentiles. (16 KB XLS). Table S2 Transposable Elements Identified with RepeatMasker Are Put into Classes I, II, and III As a result of our efforts to identify indica–japonica polymorphisms, the sequence is divided into four nonoverlapping categories: unassembled, assembled-but-unmapped, mapped-but-unaligned, and aligned (with all the SNPs). (28 KB XLS). Table S3 Detailed Analysis of Gene Overlaps from Figure 3 For each region of the Venn diagram, we use BLAT to align the predicted gene to the other assembly (or assemblies) where the gene is supposedly missing. The objective is to determine whether it is the sequence that is missing, or whether the discrepancy is due to the errors in the ab initio predictions. What we find is a bit of both. However, fragmented sequence assemblies are not a problem. If the gene is found at all, it is usually found in one piece. What is striking is that predicted genes that are unique to the two WGS assemblies do tend to be genuinely missing from IRGSP japonica sequence. This supports the idea that the WGS method can sometimes identify genes that are not well represented in the BAC clone libraries. (17 KB XLS). Table S4 Table of InterPro Domain Rankings One table compares predicted genes from Arabidopsis and Beijing indica. The second table compares predicted genes from Beijing indica with nr-KOME cDNAs. (169 KB XLS).
Accession Numbers The DNA Data Bank of Japan/European Molecular Biology Laboratory/GenBank (BGI-RIS http://rise.genomics.org.cn [16]) project accession numbers for the WGS sequences discussed in this article are Beijing indica (AAAA00000000, version AAAA02000000) and Syngenta japonica (AACV00000000, version AACV01000000). Note Added in Proof The idea that TEs are often mistakenly annotated as genes was also suggested in a recent paper by Bennetzen et al. [90].

              Plant NBS-LRR proteins: adaptable guards

              Most of the disease resistance genes (R genes) in plants cloned to date encode nucleotide-binding site leucine-rich repeat (NBS-LRR) proteins characterized by nucleotide-binding site (NBS) and leucine-rich repeat (LRR) domains as well as variable amino- and carboxy-terminal domains (Figure 1). These large, abundant, proteins are involved in the detection of diverse pathogens, including bacteria, viruses, fungi, nematodes, insects and oomycetes. There have been numerous extensive reviews since the first NBS-LRR-encoding genes were cloned from plants in 1994 (for example [1-5]). This article aims to provide a current overview of the structure and function of this protein family as well as to highlight recent advances. Plant NBS-LRR proteins are similar in sequence to members of the mammalian nucleotide-binding oligomerization domain (NOD)-LRR protein family (also called 'CARD, transcription enhancer, R (purine)-binding, pyrin, lots of leucine repeats' (CATERPILLER) proteins), which function in inflammatory and immune responses [6]. But although mammalian NOD-LRR proteins have the same tripartite domain organization as plant NBS-LRR proteins, including a nucleotide-binding domain and a LRR domain, the functional similarities between NBS-LRR and mammalian NOD proteins are probably the result of convergent evolution [7]. There are no NOD-related proteins in Caenorhabditis elegans or Drosophila melanogaster and the downstream partners of the two families differ [7,8]. The human NOD protein apoptotic protease activating factor 1 (APAF-1) has an NBS domain with greater protein-sequence similarity to plant NBS-LRR proteins than to other mammalian NOD proteins; however, it shares neither the amino-terminal nor the carboxy-terminal LRR domains characteristic of plant NBS-LRR proteins. Evolution and genome organization Plant NBS-LRR proteins are numerous and ancient in origin. They are encoded by one of the largest gene families known in plants. 
There are approximately 150 NBS-LRR-encoding genes in Arabidopsis thaliana, over 400 in Oryza sativa [3,9,10], and probably considerably more in larger plant genomes that have yet to be fully sequenced. Many NBS-encoding sequences have now been amplified from a diverse array of plant species using PCR with degenerate primers based on conserved sequences within the NBS domain and there are currently over 1,600 NBS sequences in public databases (Additional data file 1). They are found in non-vascular plants and gymnosperms as well as in angiosperms; orthologous relationships are difficult to determine, however, owing to lineage-specific gene duplications and losses [11,12]. In several lineages, NBS-LRR-encoding genes have become amplified, resulting in family-specific subfamilies (Figure 2; Additional data file 1) [13]. Of the 150 NBS-LRR sequences in Arabidopsis, 62 have NBS regions more similar to each other than to any other non-Brassica sequences (Figure 2; Additional data file 2). Different subfamilies have been amplified in the legumes (which includes beans), the Solanaceae (which includes tomato and potato), and the Asteraceae (which includes sunflower and lettuce) [13-15]. The spectrum of NBS-LRR proteins present in one species is not therefore characteristic of the diversity of NBS-LRR proteins in other plant families. NBS-LRR-encoding genes are frequently clustered in the genome, the result of both segmental and tandem duplications [3,10,16,17]. There can be wide intraspecific variation in copy number because of unequal crossing-over within clusters [18,19]. NBS-LRR-encoding genes have high levels of inter- and intraspecific variation but not high rates of mutation or recombination [19]. Variation is generated by normal genetic mechanisms, including unequal crossing-over, sequence exchange, and gene conversion, rather than genetic events particular to NBS-LRR-encoding genes [3,19-21]. 
The rate of evolution of NBS-LRR-encoding genes can be rapid or slow, even within an individual cluster of similar sequences. For example, the major cluster of NBS-LRR-encoding genes in lettuce includes genes with two patterns of evolution [19]: type I genes evolve rapidly with frequent gene conversions between them, whereas type II genes evolve slowly with rare gene conversion events between clades. This heterogeneous rate of evolution is consistent with a birth-and-death model of R gene evolution, in which gene duplication and unequal crossing-over can be followed by density-dependent purifying selection acting on the haplotype, resulting in varying numbers of semi-independently evolving groups of R genes [19,22]. The impact of selection on the different domains of individual NBS-LRR-encoding genes is also heterogeneous [19]. The NBS domain seems to be subject to purifying selection but not to frequent gene-conversion events, whereas the LRR region tends to be highly variable. Diversifying selection, as indicated by significantly elevated ratios of non-synonymous to synonymous nucleotide substitutions, has maintained variation in the solvent-exposed residues of the β-sheets of the LRR domain (see below) [19,23]. Unequal crossing-over and gene conversion have generated variation in the number and position of LRRs, and in-frame insertions and/or deletions in the regions between the β-sheets have probably changed the orientation of individual β-sheets. There are, on average, 14 LRRs per protein and often 5 to 10 sequence variants for each repeat; therefore, even within Arabidopsis, there is the potential for well over 9 × 10^11 variants, which emphasizes the highly variable nature of the putative binding surface of these proteins.

There are two major subfamilies of plant NBS-LRR proteins, defined by the presence of Toll/interleukin-1 receptor (TIR) or coiled-coil (CC) motifs in the amino-terminal domain (Figure 1).
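The order of magnitude of the LRR variant estimate above can be checked with simple arithmetic. The following Python sketch is illustrative only: it assumes that each of the 14 repeats varies independently and takes the same number of forms, which is a simplification and not a claim made by the text. The function name `lrr_combinations` is hypothetical.

```python
# Illustrative arithmetic only. Assumption (not from the source): every
# repeat varies independently and has the same number of sequence variants.
MEAN_LRRS = 14  # average number of LRRs per protein, as stated in the text


def lrr_combinations(variants_per_repeat: int, repeats: int = MEAN_LRRS) -> int:
    """Count distinct repeat arrays if each repeat independently takes one
    of `variants_per_repeat` forms."""
    return variants_per_repeat ** repeats


# Tabulate the range of 5 to 10 variants per repeat quoted in the text.
for v in range(5, 11):
    print(f"{v} variants per repeat -> {lrr_combinations(v):.2e} combinations")
```

Under this simplification the count spans roughly 6 × 10^9 combinations (5 variants per repeat) to 10^14 (10 variants per repeat), so the result depends strongly on how many variants are assumed per repeat.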
Although TIR-NBS-LRR proteins (TNLs) and CC-NBS-LRR proteins (CNLs) are both involved in pathogen recognition, the two subfamilies are distinct both in sequence and in signaling pathways (see below) and cluster separately in phylogenetic analyses using their NBS domains (see Additional data file 2) [24,25]. TNLs are completely absent from cereal species, which suggests that the early angiosperm ancestors had few TNLs and that these were lost in the cereal lineage. The presence or absence of TNLs in basal monocots is not currently known. CNLs from monocots and dicots cluster together, indicating that angiosperm ancestors had multiple CNLs (Figure 2) [26]. There are also 58 proteins in Arabidopsis that are related to the TNL or CNL subfamilies but lack the full complement of domains [3,27]. These include 21 TIR-NBS (TN) and five CC-NBS (CN) proteins that have amino-terminal and NBS domains but lack a LRR domain [27]. The function of these proteins is not known, but they have the potential to act as adaptors or regulators of TNL and CNL proteins.

Characteristic structural features

NBS-LRR proteins are some of the largest proteins known in plants, ranging from about 860 to about 1,900 amino acids. They have at least four distinct domains joined by linker regions: a variable amino-terminal domain, the NBS domain, the LRR region, and variable carboxy-terminal domains (Figure 1). Four subfamilies of CNLs and eight subfamilies of TNLs were identified in Arabidopsis from sequence homology, motifs, intron positions and intron phase [3]. No crystal structures have been determined for any part of a plant NBS-LRR protein; crystal structures of mammalian NBS and LRR domains are, however, available as templates for homology-modeling approaches.

The amino-terminal domain

There is little experimental information on the function of the amino-terminal domain. In animals, the TIR domain is involved in signaling downstream of Toll-like receptors.
Many plant NBS-LRR proteins are thought to monitor the status of ('guard') targets of pathogen virulence effectors (see below). Given the presence of TIR or CC motifs as well as the diversity of these domains, the amino termini are thought to be involved in protein-protein interactions, possibly with the proteins being guarded or with downstream signaling components [4]. Polymorphism in the TIR domain of the flax TNL protein L6 affects the specificity of pathogen recognition [28]. An alanine-polyserine motif that may be involved in protein stability is located immediately adjacent to the amino-terminal methionine in many TNLs (but not CNLs) in Arabidopsis [3]. Four conserved TIR motifs span 175 amino acids within the TIR domain of TNLs [27]. A CC motif is common but not always present in the 175 amino acids amino-terminal to the NBS of CNLs [3]. Some CNLs have large amino-terminal domains; tomato Prf, for example, has 1,117 amino acids amino-terminal of the NBS, much of which is unique to this protein.

The NBS domain

More is known of the structure and function of the NBS domain, which is also called the NB-ARC (nucleotide binding adaptor shared by NOD-LRR proteins, APAF-1, R proteins and CED4) domain. This domain contains several defined motifs characteristic of the 'signal transduction ATPases with numerous domains' (STAND) family of ATPases, which includes the mammalian NOD proteins [29,30]. STAND proteins function as molecular switches in disease signaling pathways. Specific binding and hydrolysis of ATP has been shown for the NBS domains of two tomato CNLs, I2 and Mi [31]. ATP hydrolysis is thought to result in conformational changes that regulate downstream signaling. The first report of NBS-LRR protein oligomerization, a critical event in signaling from mammalian NOD proteins, is the oligomerization of tobacco N protein (a TNL) in response to pathogen elicitors [32].
In Arabidopsis, eight conserved NBS motifs have been identified through analysis with MEME, a program for motif identification [3]. NBS domains of TNLs and CNLs are distinguished by the sequences of three resistance NBS (RNBS) motifs within them (RNBS-A, RNBS-C, and RNBS-D motifs; see Additional data file 3) [3]. Threading plant NBS domains onto the crystal structure of human APAF-1 provides informative insights into the spatial arrangement and function of the motifs conserved in the plant NBS domains (Figure 3) [30,33]. The nucleotide-binding domain of APAF-1 consists of three subdomains: a three-layered α/β subdomain (containing the anchor region), a helical subdomain (containing the kinase-2 motif and P-loop) and a winged-helix subdomain (containing the MHDV motif; Figure 3). The specific binding of ADP by human APAF-1 is achieved by a total of eight direct and four water-mediated hydrogen bonds; the P-loop portion of the helical subdomain interacts with the α- and β-phosphates of ADP, a histidine and a serine residue on the winged-helix subdomain interacts with a phosphate and the sugar of ADP, and a small anchor region in the α/β subdomain stabilizes the adenine base [33]. The binding pocket and patterns of binding to ADP are well conserved in the threading models of TNLs (exemplified by the Arabidopsis protein RPS4) and CNLs (exemplified by the Arabidopsis protein RPS5; Figure 3) ([30] and P.K., unpublished work). The NBS domains of TNLs contain additional loops absent in the NBS domain of CNLs. TNLs and CNLs have four conserved motifs that are located around the catalytic cleft: the P-loop, the anchor region, and the MHDV motif (specifically the histidine residue), all of which serve to orient the ADP molecule, as well as the GLPL motif (the MHDV and GLPL motifs are named after their constituent amino acids in the single-letter code). 
While there is no obvious contact between ADP and the GLPL motif in human APAF-1, the conservation of its position on top of the binding site in APAF-1, RPS4 and RPS5 indicates that it may be involved in binding ADP. In addition, the last two aspartic acids in the kinase-2 motif are positioned to interact with the third phosphate of ATP, consistent with their role of coordination for the divalent metal ion required for phosphotransfer reactions, for example the Mg²⁺ of Mg-ATP (Figure 3). The anchor region in the α/β subdomain of APAF-1, which consists of the sequence Val-Thr-Arg, is present as Phe-Gly-Asn in RPS4 and as Val-Gly-Gln in RPS5. This anchor region, consisting of a hydrophobic (Val or Phe), a small (Gly or Thr) and a polar (Arg, Asn or Gln) amino acid, was previously unrecognized, but is highly conserved in plant NBS-LRR proteins (see Additional data file 3). Autoactivating mutations in two CNLs, potato Rx (Asp460Val) and tomato I2 (Asp495Val), map next to the histidine in the MHDV motif; these mutations may perturb the binding of the β-phosphate of ADP and result in a more open structure [30].

The LRR domain

The LRR domain is a common motif found in more than 2,000 proteins, from viruses to eukaryotes, and it is involved in protein-protein interactions and ligand binding [1]. The crystal structures of more than 20 LRR proteins have revealed that LRR domains characteristically contain a series of β-sheets that form the concave face shaped like a horseshoe or banana [34]. Less is known, however, about the quaternary arrangements of LRR proteins. At least three different types of dimers have been observed, involving interactions of either their concave surfaces [35] or their convex surfaces [36,37], or by concatenation involving an antiparallel β-sheet at the interface [38].
Threading of the LRR domain of Arabidopsis RPS5 onto the crystal structure of the bovine decorin protein, a member of the small LRR proteoglycans (SLRP) protein family with a protein core composed of LRRs [35], provided a model consistent with a curved horseshoe-like surface of β-sheets (Figure 4; P.K., unpublished work). The number of repeats in the LRR domains in TNLs and CNLs of Arabidopsis is similar (mean 14, range 8 to 25), but this number can be considerably higher in other species. In the lettuce CNL Resistance Gene Candidate 2 (RGC2) proteins, an example of which is Dm3, the LRR domain appears to be duplicated and there can be as many as 47 LRRs in total [19]. Each LRR comprises a core of about 26 amino acids containing the Leu-xx-Leu-xx-Leu-x-Leu-xx-Cys/Asn-xx motif (where x is any amino acid), which forms a β-sheet; each core region is separated by a section of variable length that varies from zero to 30 amino acids. In many NBS-LRR proteins, the putative solvent-exposed residues (shown as x in the consensus sequence above) show significantly elevated ratios of nonsynonymous to synonymous substitutions, indicating that diversifying selection has maintained variation at these positions. The LRR domain is involved in determining the recognition specificity of several R proteins (for example [18,39-42]); direct interaction with pathogen proteins has rarely been shown, however. The LRR domain may be involved predominantly in regulatory intramolecular interactions. The LRR domain of the potato CNL Rx interacts with the NBS domain even when expressed in trans; this interaction is disrupted by the potato virus X elicitor, a viral coat protein that can induce a host defense response [43]. Also, the inner, concave surface of the β-sheets may not be the only binding surface. 
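The LRR consensus motif quoted above can be written as a pattern and searched for in a protein sequence. The Python sketch below is a minimal illustration that reads the consensus literally: Leu at the four fixed positions, Cys or Asn at the fifth, and any residue at each x. Real repeats often carry other hydrophobic residues in place of Leu, so this strict pattern would miss many genuine LRRs. The function name `find_lrr_cores` and the toy sequence are hypothetical and are not taken from any real protein.

```python
import re

# Literal reading of Leu-xx-Leu-xx-Leu-x-Leu-xx-Cys/Asn-xx, where "." is any
# residue (x). The lookahead lets overlapping candidate matches be reported.
LRR_CORE = re.compile(r"(?=(L..L..L.L..[CN]..))")


def find_lrr_cores(protein: str) -> list[tuple[int, str]]:
    """Return (0-based start, matched 14-residue segment) for each hit."""
    return [(m.start(), m.group(1)) for m in LRR_CORE.finditer(protein.upper())]


# Hypothetical toy sequence built to contain two consensus-like segments.
toy = "MKT" + "LSGLPALTLDLNNK" + "GG" + "LEELPELRLDACYP"
for start, segment in find_lrr_cores(toy):
    print(start, segment)
```

Because the x positions are unconstrained in the pattern, they correspond to the putative solvent-exposed residues that the text describes as the most variable.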
The LRR domain of TLR3, a human Toll-like receptor, is predicted to form a heterodimer and to bind double-stranded RNA from pathogens against its looped surface, on the opposite side from the β-sheets [37]. Analysis using MEME identified few motifs in common between the LRR domains of TNLs and CNLs in Arabidopsis [3]. The third LRR was one of the few that contained a conserved motif. Mutation in this LRR of the CNL RPS5 results in epistatic inhibitory effects on multiple NBS-LRR proteins, suggesting that the LRR may interact with downstream signaling components [5,44]; also, a mutation within this LRR in the CNL Rx of potato results in a constitutively active form [45].

The carboxyl termini

CNLs and TNLs differ markedly in the size and composition of their carboxy-terminal domains. Those of TNLs are larger and more variable than those of CNLs. CNLs typically have only 40-80 amino acids carboxy-terminal to the LRR domain, whereas the carboxyl termini of TNLs often have an additional 200-300 amino acids, equaling the size of the LRR domain. Several TNLs have extensions with similarity to other proteins [3]. One of the larger TNLs in Arabidopsis, RRS1, which becomes localized to the nucleus in response to infection, encodes a 1,388 amino-acid protein with a nuclear localization signal and a WRKY motif (a motif also found in zinc-finger transcription factors and containing the sequence Trp-Arg-Lys-Tyr) at the carboxyl terminus [46].

Function, localization and regulation

Disease resistance is the only function so far demonstrated for NBS-LRR proteins; however, a role in resistance has yet to be confirmed for most. Functions in other areas of plant biology cannot be excluded, particularly for the more divergent members of the family. The simplest model for NBS-LRR R protein function is as receptors that bind effector molecules secreted by pathogens, but direct interactions between NBS-LRR R proteins and effector proteins have been detected only rarely [47,48].
In an alternative model, the 'guard hypothesis', NBS-LRR R proteins monitor the status of plant proteins targeted by pathogen effectors [49,50]. Such indirect detection of pathogens allows a limited number of NBS-LRR R proteins to detect the activity of multiple pathogen effectors that target points of vulnerability in the plant. This has been best characterized in Arabidopsis: the CNL protein RPM1 detects the phosphorylation of RPM1-Interacting Protein 4 (RIN4) by the pathogen effectors AvrB and AvrRpm1 from Pseudomonas syringae pv. glycinea and pv. maculicola, respectively, and elicits the resistance response (Figure 5) [51]. The elicitation of this response can be abrogated by a third effector, AvrRpt2 from P. syringae pv. tomato, a protease that cleaves RIN4 [52,53]. The disappearance of RIN4 is detected, however, by a second CNL, RPS2, that in turn elicits the defense response [54,55]. There is increasing evidence from several systems that other R proteins similarly act as guards of host targets rather than direct receptors, at least for bacterial effectors [56-58]. NBS-LRR proteins function as components of macromolecular complexes [59]. Yeast two-hybrid and, more recently, co-immunoprecipitation experiments have identified multiple interacting proteins. All of the constituents and details of the dynamics of these complexes have yet to be determined, however. Oligomerization of animal NOD proteins through the NBS domain or oligomerization of Toll-like receptors through the TIR domain is important for activating the signaling pathway in animal innate immune systems [60-64], but there are currently few data on the oligomerization of plant NBS-LRR proteins. Effector-induced self-oligomerization of the tobacco N protein (a TNL) has recently been demonstrated in Nicotiana benthamiana; the ability to oligomerize was retained after loss-of-function mutations in the RNBS-A motif and TIR domain, but lost after P-loop mutations [32]. 
Little is known about the regulation of the plant genes that encode NBS-LRRs. Consistent with the need for a rapid response to pathogen attack, many NBS-LRR-encoding genes are constitutively expressed at low levels in healthy, unchallenged tissue, although some show tissue-specific expression (X.T., unpublished work). They are upregulated, however, in response to bacterial flagellin, which induces basal resistance, suggesting that plants can establish a state of heightened sensitivity to pathogen attack [65,66]. Both TNLs and CNLs include members that undergo alternative splicing. Alternative splicing of Toll-like receptors in animals is common and splice variants of the mouse Toll-like receptor TLR4 may be part of a regulatory feedback loop inhibiting excessive responses to bacterial lipopolysaccharide [67,68]. The induction of splice variants upon pathogen recognition has been observed for plant NBS-LRR proteins, suggesting that alternative splicing may have a regulatory role in the plant defense response [68]. Multiple transcripts have been detected for several TNL-encoding genes (RPP5, RPS4, and RAC1 in Arabidopsis, L6 in flax, N in tobacco, Y-1 in potato, and Bs4 in tomato) and fewer CNL-encoding genes [69-76], although their significance to disease resistance is unclear. The ratio of transcripts from the tobacco N gene is critical for resistance to tobacco mosaic virus [71]. Both full-length and alternative transcripts are necessary for resistance mediated by RPS4 in Arabidopsis [73]. Triggering of basal resistance and/or cell death associated with specific resistance imposes a heavy cost and is therefore likely to be tightly regulated. There is growing evidence for multiple layers of negative regulation, paralleling that observed in mammals. One layer involves RIN4; the disappearance of RIN4 triggers the basal resistance response (see above) [4,51]. 
Another level involves the interaction between the LRR and NBS regions; the LRR can act in trans as a negative regulator of the NBS in the CNLs potato Rx and tomato Mi [42,43]. A third layer involves the conformational change of the NBS following hydrolysis of ATP [31]. NBS-LRR R protein activity may also be subject to regulation by heat-shock proteins such as the Hsp90 proteins [4]; both CNLs such as Arabidopsis RPM1 and potato Rx and TNLs such as the tobacco N protein require cytosolic HSP90 for their function [77-79]. The role of protein degradation in resistance signaling is unclear, but there is increasing evidence for its importance [80]. Two proteins, 'Required for Mla12 Resistance 1' (RAR1) and 'Suppressor of G2 Allele of SKP1' (SGT1), are required for the function of several R proteins that signal through different pathways [59]. The COP9 signalosome, a multiprotein complex involved in protein degradation, is required for resistance to tobacco mosaic virus mediated by the tobacco TNL N protein [81]. The Arabidopsis CNL protein RPM1 is degraded at the onset of the hypersensitive response [82]; RING-finger E3 ubiquitin ligases in Arabidopsis are involved in RPM1- and RPS2-mediated elicitation of the hypersensitive response [83]. Therefore, either specific or general proteolysis may have roles in controlling the amplitude of the defense response and the extent of cell death associated with the hypersensitive response.

Most NBS-LRR proteins lack a signal peptide or membrane-spanning regions and are therefore assumed to be cytoplasmic. Fractionation studies and interactions in yeast with membrane-associated proteins suggest that several are localized to the inner side of the membrane [51,54,55,82]. Localization studies are challenging, however, because of the probable dynamic nature of complexes and because of the low endogenous expression levels of NBS-LRR proteins; consequently, data from overexpression studies are difficult to interpret.
Plant NBS-LRR proteins act through a network of signaling pathways and induce a series of plant defense responses, such as activation of an oxidative burst, calcium and ion fluxes, mitogen-activated protein kinase cascade, induction of pathogenesis-related genes, and the hypersensitive response [4,84-86]. At least three independent, genetically defined signaling pathways in Arabidopsis are induced by NBS-LRR proteins [87]. TNLs and CNLs tend to signal through different downstream pathways: TNLs signal through the 'Enhanced Disease Susceptibility' protein EDS1 and CNLs through the 'Non-race specific Disease Resistance' protein NDR1, although this correlation is not absolute. A separate pathway independent of EDS1 and NDR1 is activated by the Arabidopsis CNLs RPP8 and RPP13. Several small signaling molecules in the plant defense response, such as salicylic acid, jasmonic acid, ethylene, and nitric oxide, are involved downstream of NBS-LRR proteins and there is complicated cross-talk between the different signaling pathways, involving both synergism and mutual antagonism between pathways [88-91].

Frontiers

The scope and complexity of this protein family provide many opportunities and challenges for both evolutionary and functional studies. An important immediate goal is to obtain crystal structures of NBS-LRR proteins, either in their entirety or as individual domains with and without their ligands. The coevolution of NBS-LRR proteins with their cognate bacterial effectors and their plant targets is of considerable interest, particularly as understanding these genetic changes and selective forces could lead to strategies for generating plants with more durable disease resistance.
We also need to address an intriguing conundrum: if the LRR domain is acting as a negative regulator of the NBS domain and NBS-LRR proteins are monitoring the status of conserved host proteins, why is there frequently a strong evolutionary signal of divergent selection acting on solvent-exposed residues on the concave surface of the LRR? Numerous questions remain at the functional level. Are all NBS-LRR proteins involved in plant defense, or do some have other functions? What are the constituents of the macromolecular complexes involving NBS-LRR proteins and what events occur upon pathogen challenge? Do these complexes often contain multiple NBS-LRR proteins [92]? Are pathogen effectors usually detected indirectly, through monitoring their activity on plant targets, or are some effectors, for example from oomycetes or fungi, detected directly by NBS-LRR proteins? Do the proteins with only some of the domains, such as the TN and CN proteins [27], function as regulatory or adaptor molecules? Other questions include the functions of the variable amino- and carboxy-terminal domains and the multiple layers of positive and negative regulation (transcriptional, alternative splicing, phosphorylation and particularly protein degradation). Also, what is the functional significance of the lack of TNLs in cereals, and does this result in a different spectrum of resistance responses? Finally, what is the molecular basis of 'restricted taxonomic functionality' (resistance function restricted to within a plant family) of NBS-LRR proteins [93] and which additional proteins are required for function in plants other than the source species? Ultimately, once the evolutionary mechanisms and structure-function relationships are understood in detail, it might be possible to generate NBS-LRR proteins with new recognition specificities that target key pathogen constituents, resulting in new, durable forms of resistance. 
Additional data files

The following additional data files are available: Additional data file 1 shows an alignment of 65 amino acids from 1,600 NBS sequences used to generate the neighbor-joining trees shown in Figure 2 and Additional data file 2; in both additional data files, parts (a) show TNL sequences and parts (b) CNL sequences. Additional data file 3 shows an alignment of NBS sequences used to generate the models of the NBS domain of RPS4 and RPS5 shown in Figure 3; PHYRE, a threading service available at [94], identified APAF-1 (PDB code 1z6t) as a reliable template to model the RPS4 and RPS5 NBS domains, with Z-scores of 5 × 10^-23 and 1 × 10^-18, respectively. The PHYRE pairwise sequence alignments of APAF-1 and RPS4 and of APAF-1 and RPS5 were collated into a single alignment without further refinement. Boxes show the positions of the eight motifs identified by Meyers et al. [3] and the position of the anchor region.

Supplementary Material

Additional data file 1: An alignment of 65 amino acids from 1,600 NBS sequences used to generate the neighbor-joining trees shown in Figure 2 and Additional data file 2

Additional data file 2: A more detailed version of Figure 2

Additional data file 3: An alignment of NBS sequences used to generate the models of the NBS domain of RPS4 and RPS5 shown in Figure 3

                Author and article information

                Contributors
                Role: Editor
                Journal
                PLoS One
                Public Library of Science (San Francisco, USA)
                1932-6203
                2013
                6 August 2013
                Volume: 8
                Issue: 8
                Article: e68435
                Affiliations
                [1] Key Laboratory of Crop Germplasm, Department of Agronomy, Zhejiang University, Hangzhou, Zhejiang, China
                [2] State Key Laboratory of Cotton Biology, Cotton Research Institute, Chinese Academy of Agricultural Sciences, Anyang, Henan, China
                Pennsylvania State University, United States of America
                Author notes

                Competing Interests: The authors have declared that no competing interests exist.

                Conceived and designed the experiments: JZ HW SZ. Performed the experiments: HW WL. Analyzed the data: HW. Contributed reagents/materials/analysis tools: XS. Wrote the paper: HW. Technical guidance: JZ.

                Article
                PONE-D-13-07546
                10.1371/journal.pone.0068435
                3735570
                23936305
                50187be4-e152-4aca-9663-df5253f2ce5a
                Copyright © 2013

                This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

                History
                Received: 19 February 2013
                Accepted: 29 May 2013
                Page count
                Pages: 13
                Funding
                This work was supported in part by the National Basic Research Program of China (2011CB109306, 2009CB118404). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. No additional external funding was received for this study.
                Categories
                Research Article
                Agriculture
                Crops
                Fibers
                Cotton
                Crop Diseases
                Biology
                Computational Biology
                Molecular Genetics
                Gene Duplication
                Genetics
                Molecular Genetics
                Gene Duplication
                Genomics
                Functional Genomics
                Plant Science
                Plant Pathology
                Plant Pathogens
                Plant Genomics
                Plants
