20
views
0
recommends
+1 Recommend
0 collections
    0
    shares
      • Record: found
      • Abstract: found
      • Article: not found

      A Genomic Resource for the Development, Improvement, and Exploitation of Sorghum for Bioenergy

      research-article

      Read this article at

      ScienceOpenPublisherPMC
      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          With high productivity and stress tolerance, numerous grass genera of the Andropogoneae have emerged as candidates for bioenergy production. To optimize these candidates, research examining the genetic architecture of yield, carbon partitioning, and composition is required to advance breeding objectives. Significant progress has been made developing genetic and genomic resources for Andropogoneae, and advances in comparative and computational genomics have enabled research examining the genetic basis of photosynthesis, carbon partitioning, composition, and sink strength. To provide a pivotal resource aimed at developing a comparative understanding of key bioenergy traits in the Andropogoneae, we have established and characterized an association panel of 390 racially, geographically, and phenotypically diverse Sorghum bicolor accessions with 232,303 genetic markers. Sorghum bicolor was selected because of its genomic simplicity, phenotypic diversity, significant genomic tools, and its agricultural productivity and resilience. We have demonstrated the value of sorghum as a functional model for candidate gene discovery for bioenergy Andropogoneae by performing genome-wide association analysis for two contrasting phenotypes representing key components of structural and non-structural carbohydrates. We identified potential genes, including a cellulase enzyme and a vacuolar transporter, associated with increased non-structural carbohydrates that could lead to bioenergy sorghum improvement. Although our analysis identified genes with potentially clear functions, other candidates did not have assigned functions, suggesting novel molecular mechanisms for carbon partitioning traits. These results, combined with our characterization of phenotypic and genetic diversity and the public accessibility of each accession and genomic data, demonstrate the value of this resource and provide a foundation for future improvement of sorghum and related grasses for bioenergy production.

          Related collections

          Most cited references18

          • Record: found
          • Abstract: found
          • Article: not found

          The Sorghum bicolor genome and the diversification of grasses.

          Sorghum, an African grass related to sugar cane and maize, is grown for food, feed, fibre and fuel. We present an initial analysis of the approximately 730-megabase Sorghum bicolor (L.) Moench genome, placing approximately 98% of genes in their chromosomal context using whole-genome shotgun sequence validated by genetic, physical and syntenic information. Genetic recombination is largely confined to about one-third of the sorghum genome with gene order and density similar to those of rice. Retrotransposon accumulation in recombinationally recalcitrant heterochromatin explains the approximately 75% larger genome size of sorghum compared with rice. Although gene and repetitive DNA distributions have been preserved since palaeopolyploidization approximately 70 million years ago, most duplicated gene sets lost one member before the sorghum-rice divergence. Concerted evolution makes one duplicated chromosomal segment appear to be only a few million years old. About 24% of genes are grass-specific and 7% are sorghum-specific. Recent gene and microRNA duplications may contribute to sorghum's drought tolerance.
            Bookmark
            • Record: found
            • Abstract: found
            • Article: not found

            A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase.

            We present a statistical model for patterns of genetic variation in samples of unrelated individuals from natural populations. This model is based on the idea that, over short regions, haplotypes in a population tend to cluster into groups of similar haplotypes. To capture the fact that, because of recombination, this clustering tends to be local in nature, our model allows cluster memberships to change continuously along the chromosome according to a hidden Markov model. This approach is flexible, allowing for both "block-like" patterns of linkage disequilibrium (LD) and gradual decline in LD with distance. The resulting model is also fast and, as a result, is practicable for large data sets (e.g., thousands of individuals typed at hundreds of thousands of markers). We illustrate the utility of the model by applying it to dense single-nucleotide-polymorphism genotype data for the tasks of imputing missing genotypes and estimating haplotypic phase. For imputing missing genotypes, methods based on this model are as accurate or more accurate than existing methods. For haplotype estimation, the point estimates are slightly less accurate than those from the best existing methods (e.g., for unrelated Centre d'Etude du Polymorphisme Humain individuals from the HapMap project, switch error was 0.055 for our method vs. 0.051 for PHASE) but require a small fraction of the computational cost. In addition, we demonstrate that the model accurately reflects uncertainty in its estimates, in that probabilities computed using the model are approximately well calibrated. The methods described in this article are implemented in a software package, fastPHASE, which is available from the Stephens Lab Web site.
              Bookmark
              • Record: found
              • Abstract: found
              • Article: found
              Is Open Access

              Whole-genome sequencing reveals untapped genetic potential in Africa’s indigenous cereal crop sorghum

              Increasing global demand for food, driven by population growth and rising affluence, will impose severe challenges on the agricultural and social systems supporting the world’s poorest people1. The challenges facing these people will be further exacerbated by decreasing vital resources for agriculture and the inequitable negative impacts of climate change2. Input-intensive crops will fall into disfavour, particularly in agricultural systems that are resource-poor3. Sorghum, already a staple for half a billion people, and tolerant of low input levels4, will become increasingly critical in these systems5, particularly in Sub-Saharan Africa. Globally, sorghum is also an important source of animal feed and forage, an emerging biofuel crop and model for C4 grasses particularly genetically complex sugarcane6. The full exploitation of sorghum’s potential requires an understanding of genetic diversity at the genomic sequence level. Cultivated grain sorghum (Sorghum bicolor subsp. bicolor) is a domesticate of the wild progenitor S. bicolor subsp. verticilliflorum 7. It is hypothesised that sorghum, the only globally important cereal from Africa, was first domesticated in Ethiopia and Sudan >8,000 years ago8. Subsequently, sorghum spread to west and southern Africa and into Asia as far east as China. Stable weedy hybrids between cultivated sorghum and wild types are known as S. bicolor subsp. drummondii. Notably, these three subspecies appear to be sympatric in most areas that sorghum is cultivated and there is ample evidence for gene flow in both directions. Although bidirectional gene flow is not unique among major crops plants9, sorghum is the only major cereal where the wild/weedy relatives have followed the cultivated species by inadvertent introduction from Africa into the Americas, Asia and Australia; hence, crop–wild–crop gene flow occurs almost everywhere sorghum is cultivated. Here we present genomic sequences of 44 sorghum accessions spanning the dimensions of geographic origin, crop management and subgroup/race at mid–high coverage levels. Our findings indicate that sorghum offers unique and underdeveloped genetic resources amongst the major cereals. We observed strong racial structure and complex domestication events. Foremost, we find that modern cultivated sorghum is derived from a limited sample of racial variation. A great untapped pool of diversity exists not only in the other races of S. bicolor but also in the allopatric Asian species, S. propinquum. This diversity is easily accessed in breeding endeavours because of well-documented interfertility among races and beyond other sorghum species. These sequences and associated data represent an unmatched resource for research into the genetic improvement of this cereal crop to meet future demands. Results Sequencing and variation calling We selected 44 accessions to represent all major races of cultivated S. bicolor, in addition to its progenitors and S. propinquum (Supplementary Fig. S1; Supplementary Data 1). Among these lines, 18 are considered to be landraces, 17 are improved inbreds and 7 are wild and weedy sorghums (Supplementary Fig. S2). Recent studies10 11 have indicated that the guinea-margaritiferums are highly divergent from other cultivated S. bicolor races and possibly represent a separate domestication event. Consequently the guinea-margaritiferum genotypes were separated into a distinct group. Resequencing of the 44 sorghum genotypes yielded 7.97 billion 90-bp paired-end reads, which comprised 717 Gb of high-quality raw data (Supplementary Data 1; Supplementary Fig. S3). We combined this with 27 Gb (three genotypes) of publically available raw data12. Sequence reads were aligned to the sorghum reference genome, which has an effective genome length of ~700 Mb6, using BWA software13. The mapping rate in different accessions varied from 78% to 99%, averaging 97% in S. bicolor lines, 89% among wild relatives. Observed differences in mapping rates may be due to divergence between the sequenced genotypes and the reference genome, BTx623. The average final effective mapping depth achieved was ~22 × per line, ranging from 16 × to 45 × . Using a conservative quality filter pipeline (see Supplementary Methods), we identified 4,946,038 SNPs genome-wide in S. bicolor alone (Table 1), with 809,460 SNPs in the 29,346 high-confidence filtered gene set (FGS). The 1,982,971 insertions and deletions (indels) identified ranged from 1–66 bp in length. Targeted sequencing of 2,917 predicted base positions validated 99.85% of SNPs and indels (Supplementary Data 2, 3). Further, whole-genome de novo assembly of representative lines showed a common position SNP calling accuracy rate of 99.72% (Supplementary Data 4, 5). To our knowledge, this represents the largest high-quality SNP and indel data set obtained in sorghum. Of the 4.9 million high-quality SNPs, most (83%) were located in intergenic regions, with an average of 4.5% (ranging from 4.1% for landraces to 5.3% for improved inbreds) located in coding sequences (Supplementary Table S1; Supplementary Fig. S4; Supplementary Data 6). Coding regions displayed lower diversity levels relative to intron and UTR sequences (Supplementary Fig. S5). Among coding regions, there were 112,255 synonymous and 112,108 non-synonymous SNPs, resulting in a non-synonymous-to-synonymous substitution ratio of 1 and a corresponding overall Ka/Ks value of 0.441 (Supplementary Tables S2,S3). Previous studies in sorghum14 15 have also found fewer non-synonymous than synonymous substitutions. In comparison to other genome-wide studies, sorghum’s non-synonymous to synonymous substitution ratio is within the range reported for other plant species (soybean 1.37 (ref. 16), rice 1.2 (ref. 17), Arabidopsis 0.83 (ref. 18)). Non-synonymous to synonymous substitution rates varied genome-wide with heterochromatic regions having higher Ka/Ks values (euchromatin: 0.423; heterochromatin: 0.502), consistent with soybean19 (Supplementary Table S4). The average Ks value in the heterochromatin (0.0169) was lower than in euchromatin (0.0201), indicating a slower rate of evolution in heterchromatic regions. Little difference was observed between the Ka values in the heterochromatin (0.0058) and euchromatin (0.0054), which could suggest that both genomic regions experience similar levels of selective constraints. The non-synonymous to synonymous substitution ratios in duplicated genes (0.779) was marginally lower than the whole-genome average (0.999) suggesting that non-synonymous changes have not accumulated more rapidly in recently duplicated genes, in agreement with a recent report in soybean16 (Supplementary Fig. S6). A total of 1,982,971 small-to-medium length indels were identified with nearly equal numbers of insertions (872,080) and deletions (1,110,891) (Supplementary Tables S5,S6). Most indels (86%) were small (1–6 bp), with only 2.5% greater than 20 bp in length (Supplementary Fig. S7). The majority (83%) were located in intergenic regions, with only 1.5% (29,697) located in coding regions, of which 35% resulted in frame-shift mutations. In contrast, indels with lengths that were multiples of three were less prevalent in non-coding regions (an average of 18.3% versus 63%). Based on coverage depth and a recently published event-testing algorithm20, 120,929 copy number variations (CNVs) were identified, 16% of which occurred in genic regions (Supplementary Fig. S8). In total, 28% of sorghum genes in the FGS had CNVs, of which 1,379 (16%) were duplicated genes. Of these, 224 were in common across all three groups and were enriched for auxin responsive proteins and zinc finger proteins, in line with a previous sorghum study12 (Supplementary Fig. S9). The number of SNPs was higher in wild and weedy sorghum genotypes than in landraces and improved inbreds (Table 1; Supplementary Fig. S10) Wild-specific alleles (34%) were more abundant than improved inbred specific alleles (8%) and landrace specific alleles (18%). Similarly, the average total number of SNPs and indels per genotype was lowest in the improved inbreds and highest in the wild species, equating to an average of 1 SNP per 1,543 bp, 1,282 bp and 763 bp for the improved inbreds, landraces and wild and weedy groups, respectively. Guinea-margaritiferum genotypes had approximately twice the SNP density of the landrace group (1 SNP per 691 bp versus 1 SNP per 1,282 bp, respectively). The lower level of diversity in the improved inbreds was also detected using θπ and θw (Table 1) and was significantly lower (P 0.99 were further extracted to identify the putative SNP based on the following criteria; copy number ≤1.5, sequencing depth according to average depth of each accession, 100≤sequencing depth≤1,300, and distance of SNPs, the SNPs had to be a minimum of 5 bp apart, with the exception of minor allele frequencies (MAF ≥0.05) where SNPs were retained when the distance between SNPs was less than 5 bp. In this procedure, a total of 15,564,426 raw population SNPs were identified, which were further filtered to 6,873,194 when the additional filtering criteria were applied. Second, SOAPsnp48 was used to calculate the likelihood of genotypes of each individual. In each individual, SNPs were filtered by the quality value (≥20), the minimum number of required reads supporting each SNP (≥4), the maximum overall depth (≤100), the maximum copy number of flanking sequences (≤1.5) and the P value of the rank-sum test (P>0.05). Third, the final SNP set was obtained by combining the two sets of possible SNPs above; 4,946,038 were identified as SNPs in both SNP sets with a missing data rate of less than 50% in the populations. To detect insertions and deletions, the Dindel pipeline49 that uses a realignment algorithm was used, with the default parameters. Identification of regions identical-by-descent Regions that were expected to be identical-by-descent (IBD) between genotypes were masked before some analyses (Supplementary Data 17). These regions were identified using pairwise SNP density comparisons and looking for contiguous 10 kb windows with low SNP density between samples. To be considered IBD, at least 100 consecutive 10 kb windows must have pairwise SNP densities of 25 or less at an average read depth of ten or more. A single window within the 100-window blocks was permitted to exceed the minimum number of SNPs and regions were assessed using a sliding window approach. Once individual blocks meeting these criteria were identified, overlapping features were merged and a consensus set of coordinates was extracted from regions common to all genotypes in the IBD analysis. Copy number variation A modified version of the recently described event-wise testing algorithm20 was used to identify copy number variations. Read depth per 100 bp window was computed using this modified software, which adjusts for bias in read depth caused by GC content. SNP and indel validation SNP calling accuracy in addition to predicted base positions were validated utilizing Sanger sequencing and whole genome de novo assembly. Indel validation was determined via Sanger sequencing only. Sanger reads were aligned to the reference genome by BLAST software with only the best hit alignment used. All mismatch positions were recorded as SNPs. Likewise, predicted base positions that matched Sanger sequencing reads were also recorded as another measure of sequencing accuracy. The same alignments generated using BLAST software were used for indel calling; however, rather than mismatch sequences being analyzed, the gap positions were extracted. In total, 2,917 genotype-predicted base positions were detected, two of which were resequencing error. This equated to a genotype calling accuracy of 99.93%. Among the 2,917 base positions, 660 SNPs were identified, 1 of which was shown to be the result of sequencing error. Hence, SNP calling accuracy was 99.85% (Supplementary Table S2a). After data filtering using the Dindel pipeline, indels matching Sanger sequence equated to 42. Only a single indel was recorded as being erroneous, resulting in an indel calling accuracy of 97.62% (Supplementary Table S2b). Further validation of SNP calling accuracy was completed by comparing whole genome de novo assembly of two samples, S. bicolor subsp. verticilliflorum (PI300119) and S. propinquum 369-1 (both with sequencing depth > × 40) with their resequenced counterparts. Both genomes were assembled using SOAPdenovo50 (K-mer=41). Only sequences (scaffold and contig) with length ≥1,000 bp and without N bases were retained. For genotypes PI300119 and S. propinquum 369-1, comparing the common position resequencing SNPs (≈0.72 and 1.27 M, respectively) with de novo data, we identified the same percentages (99.76% and 99.68%, respectively) (Supplementary Table S3). Analysis of duplicate genes Annotated genes of S. bicolor were downloaded from the JGI website and a self-to-self BLAST performed. The four-fold degenerate transversion (4DTv) ratio was calculated for each best hit. The distribution of the 4DTv ratio of all gene pairs indicated that gene pairs in which both genes had a 4DTv ratio lower than 0.497 were recently duplicated, in agreement with previous studies6. Population genetics analysis SNPs were used to calculate the genetic distance between individuals. The neighbour-joining tree was constructed with treebest under the p-distances model, with bootstrapping (1,000). The software MEGA5 (ref. 51) was used for visualizing the phylogenetic tree. Principal component analysis of the SNPs was performed using the software EIGENSOFT52. The population structure was determined using the software FRAPPE53, we set MaxIter (Maximum iteration) parameter to 10,000, and the number of clusters (K) was considered from 2 to 12. The average pairwise divergence within a group (θπ) and the Watterson’s estimator (θw) were estimated for the whole genome of the three groups. A non-overlapping window size of 10 kb was used to estimate θπ, θw and Tajima’s D across the whole genome, in addition to a per FGS gene model basis (CDS, mRNA, intron, gene, 5′ UTR, 3′ UTR per predicted gene model). In each window, these parameters were calculated using a BioPerl module and an in-house perl script. F ST was calculated, based on the same windows, to measure population differentiation using another BioPerl module. Calculation of LD Correlation coefficient values (r 2) of alleles were calculated using Haploview54 to measure the LD level in the three populations. The parameters were set as follows: -dprime -minMAF 0.1 -hwcutoff 0.001 -memory 2000 -maxdistance 1000. The average r 2 value was calculated for each length of distance and LD decay figures were drawn using an R script55 for the three groups of sorghum genotypes. Identification of novel genes Unmapped reads from each sample were assembled into contigs using SOAPdenovo50 (default parameters). Assembled contigs for each sample were then filtered based on two criteria: (i) contigs 30% and identify >80% with BTx623. The remaining 3,960 contigs were considered to be either real novel sequences or located in non-assembled heterochromatic regions. The GC ratio of the 20.8 Mb sequences was 41.4%, comparable to the GC ratio of the whole genome (41.4%). De novo gene annotation was conducted using the software AUGUSTUS56. Only one copy of the genes with more than 90% identity and 90% coverage by BLAT was retained. In total, 276 candidate novel genes were annotated. BLASTP was then used to compare the candidate novel genes against the NCBI nr database (high-quality hits ≥60% identity and 60% coverage). Identification of gene loss events Gene loss events were identified using read depth at 100 bp resolution from all predicted Sbicolor_79 gene models across all genotypes. In addition to the sequence bound by the gene models, 1 kb (10 windows) upstream and downstream of the annotated gene boundaries were used as reference read depths for each gene. A minimum average read depth of 10 was required for the flanking regions to assess deletions within genes. Large deletions of at least 50% of the annotated gene sequence in a single block were identified by comparing the read depth of consecutive windows within the genic regions to the average read depth of the 1 kb flanking sequences. Genic windows with less than 1% of the average read depth of the flanking windows were treated as putative deletions and gene deletions were identified by grouping consecutive windows meeting these criteria into single blocks that exceeded 50% of the total gene length. Identification of candidate genes under selection Regions of the genome under purifying selection are expected to have a lower diversity and reduced allele frequency in the descendant population compared with the same region in the ancestral population. The FGS gene-based population genetics summary statistics (θπ, θw, Tajima’s D and F ST) were used to identify candidate genes in the following three population pairwise comparisons: first, wild and weedy versus landraces to identify domestication events; second, landraces versus improved inbreds to identify improvement events; and third, wild and weedy versus improved inbreds to identify both domestication and improvement events. The following criteria were used to identify candidate genes in each of the three pairwise comparisons; F ST values ≥95% of population pairwise distribution; θπ and θw higher in ancestral population and ≤10% of descendant population distribution; negative Tajima’s D values in descendant population. In total, 772 unique candidate genes under purifying selection were identified across all three pairwise comparisons: 396 candidate domestication genes based on the wild and weedy versus landraces comparison, 251 candidate improvement genes based on the landraces versus improved inbreds comparison and 324 candidate domestication and/or improvement genes based on the wild and weedy versus improved inbreds comparison. A subset of 422 of these candidate genes under selection were used as input, together with 38 neutral genes, for the mlHKA test31 for validation purposes. The MLHKA program was run under a neutral model, where numselectedloci=0, and then under a selection model, where numselectedloci>0. Significance was assessed by the mean log likelihood ratio test statistic, where twice the difference in log likelihood between the models is approximately chi-squared distributed with df equal to the difference in the number of parameters. GO-SLIM categories44 were used to assess gene family enrichment across subsets of genes under selection and invariant genes. Author contributions D.R.J., I.D.G, X.L., J.L., E.S.M. and J.W. managed the project; D.R.J., I.D.G., E.K.G., P.J.P., B.C.C., A.C., S.T. and E.S.M. designed the experiments; T.S. and B.C.C. prepared the samples; S.T. and E.S.M. led the data analysis with contributions from Y.L., L.B., W.H., X.H., C.D., H.Z., X.W., M.W., Z.S., C.F., D.J.I., E.K.D., P.J.P., I.D.G. and D.R.J.; E.S.M., I.D.G., P.J.P., B.C.C., E.K.G., A.C., D.J.I., C.F., S.T. and D.R.J. wrote the manuscript. Additional information Accession codes: The sequencing data have been deposited in the NCBI Short Read Archive under accession codes SRS378430 to SRS378473. How to cite this article: Mace, E. S. et al. Whole-genome sequencing reveals untapped genetic potential in Africa’s indigenous cereal crop sorghum. Nat. Commun. 4:2320 doi: 10.1038/ncomms3320 (2013). Supplementary Material Supplementary Figures and Tables Supplementary Figures S1-S49 and Supplementary Tables S1-S11 Supplementary Data 1 Summary of sorghum genotypes, depth of sequencing and depth of mapped reads. Supplementary Data 2 Validation of SNPs and invariant base positions using Sanger sequencing for 6 genes Supplementary Data 3 Validation of Indels using Sanger sequencing for 4 genes. Supplementary Data 4 Validation of SNPs for lines S. propinquum 369-1 and PI300119 through whole genome de novo assembly Supplementary Data 5 Whole genome de novo assembly data summary Supplementary Data 6 Total number of SNPs across chromosomes per genotype Supplementary Data 7 List of candidate domestication genes, including GO-SLIM categories Supplementary Data 8 List of candidate improvement genes, including GO-SLIM categories. Supplementary Data 9 List of candidate domestication and improvement genes, including GO-SLIM categories. Supplementary Data 10 List of candidate domestication and improvement features under selection Supplementary Data 11 List of invariant genes between the improved inbreds and landrace groups. Supplementary Data 12 List of invariant genes between the improved inbreds and wild and weedy groups. Supplementary Data 13 List of invariant genes between landrace and wild and weedy groups. Supplementary Data 14 List of genes with LE-SNPs per group Supplementary Data 15 List of genes with gene loss events identified Supplementary Data 16 List of novel genes identified Supplementary Data 17 List of genome co-ordinates of regions identical-by-descent (IBD) in SC lines (sorghum conversion program) in the landrace group.
                Bookmark

                Author and article information

                Journal
                Genetics
                Genetics
                genetics
                genetics
                genetics
                Genetics
                Genetics Society of America
                0016-6731
                1943-2631
                September 2016
                27 June 2016
                27 June 2016
                : 204
                : 1
                : 21-33
                Affiliations
                [* ]Institute of Translational Genomics, Clemson University, South Carolina 29634
                []Department of Genetics and Biochemistry, Clemson University, South Carolina 29634
                []Donald Danforth Plant Science Center, St. Louis, Missouri 63132
                [§ ]Department of Mathematical Sciences, Clemson University, South Carolina 29634
                [** ]Department of Agronomy, Kansas State University, Manhattan, Kansas 66506
                Author notes
                [1 ]Corresponding authors: Institute of Translational Genomics, Clemson University, 317 Biosystems Research Complex, 105 Collings St., Clemson, SC 29634. E-mails: zwbrent@ 123456clemson.edu and skresov@ 123456clemson.edu
                Article
                183947
                10.1534/genetics.115.183947
                5012387
                27356613
                3a9b7af0-fe8e-4c3b-9ca4-2fb34e165824
                Copyright © 2016 by the Genetics Society of America

                Available freely online through the author-supported open access option.

                History
                : 21 October 2015
                : 21 June 2016
                Page count
                Figures: 5, Tables: 3, Equations: 0, References: 58, Pages: 13
                Categories
                Investigations
                Multi-parental Block

                Genetics
                bioenergy association panel,carbon partitioning,biomass composition,nonstructural sugars,multiparent advanced generation inter-cross (magic),multiparental populations,mpp

                Comments

                Comment on this article