Blog
About

43
views
0
recommends
+1 Recommend
0 collections
    3
    shares
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      Large-scale SNP discovery and construction of a high-density genetic map of Colossoma macropomum through genotyping-by-sequencing

      Read this article at

      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          Colossoma macropomum, or tambaqui, is the largest native Characiform species found in the Amazon and Orinoco river basins, yet few resources for genetic studies and the genetic improvement of tambaqui exist. In this study, we identified a large number of single-nucleotide polymorphisms (SNPs) for tambaqui and constructed a high-resolution genetic linkage map from a full-sib family of 124 individuals and their parents using the genotyping by sequencing method. In all, 68,584 SNPs were initially identified using minimum minor allele frequency (MAF) of 5%. Filtering parameters were used to select high-quality markers for linkage analysis. We selected 7,734 SNPs for linkage mapping, resulting in 27 linkage groups with a minimum logarithm of odds (LOD) of 8 and maximum recombination fraction of 0.35. The final genetic map contains 7,192 successfully mapped markers that span a total of 2,811 cM, with an average marker interval of 0.39 cM. Comparative genomic analysis between tambaqui and zebrafish revealed variable levels of genomic conservation across the 27 linkage groups which allowed for functional SNP annotations. The large-scale SNP discovery obtained here, allowed us to build a high-density linkage map in tambaqui, which will be useful to enhance genetic studies that can be applied in breeding programs.

          Related collections

          Most cited references 40

          • Record: found
          • Abstract: found
          • Article: found
          Is Open Access

          Switchgrass Genomic Diversity, Ploidy, and Evolution: Novel Insights from a Network-Based SNP Discovery Protocol

          Introduction In the past decade, switchgrass (Panicum virgatum L.) has been targeted as a prime candidate energy crop. As a C4 grass, switchgrass has high biomass production with minimal field-based inputs. Its adaptability allows it to be grown productively in large areas of the USA, including marginal lands. In addition, propagation by seed and the perennial growth habit of switchgrass enable relatively effortless establishment, field management and harvest. Although switchgrass shows great promise as a bioenergy feedstock, it would never be considered a model species for genetic or genomic research. Most of the fundamental characteristics of its biology render switchgrass a difficult taxon for the genetic dissection of even the simplest of its useful biofuel-related traits. Switchgrass is a largely self-incompatible and highly heterozygous species [1]. In contrast to species with inbred lines, both forward and reverse genetics are difficult to conduct in switchgrass. In addition, there is evidence of extensive chromosome-number variation, including multiple ploidy levels, as well as aneuploidy [2]. Moreover, switchgrass has a relatively large genome size [2], [3] and lacks a reference genome, both of which hamper the development of an effective marker system. Overall, these challenges are not unique to switchgrass: there are thousands of key species with similar characteristics, and we need tools that can be applied to all of them. Many of the challenges posed by switchgrass can be overcome through genotyping by sequencing (GBS). This protocol is a multiplexed, high-throughput, and low-cost method to explore the genetic diversity in populations [4]. It employs a reduced representation library (RRL) strategy [5] to target a fraction of the genome for sequencing, thereby decreasing cost and increasing the SNP-calling accuracy. GBS is the simplest of the RRL approaches developed thus far [6]–[9], and has already seen extensive application in a wide diversity of taxa, i.e., in barley and wheat [10], as well as, maize [4], [11], rice, grape and cacao (many publications in progress). Currently, the RRL strategy has been used for diversity evaluation in various species, resulting in the discovery of hundreds of thousands of SNPs. In most of these cases, the libraries were sequenced on the Illumina platform, and the SNP calling relied on having a reference. The reference could be a high-quality genome sequence [12]–[16], de novo assembly from deep sequencing [17]–[20] or transcriptome sequences [21]. The reference (ideally a reference genome) not only physically orders the SNPs, but also provides the sequence context for paralogs, assigning them to different sites. This reduces the false SNP calls from paralogs, especially in wholly or partly duplicated, or transposon-saturated genomes. However, in the absence of a reference genome, SNP calling may be much less accurate with short-read sequencing technologies, because true SNPs, sequencing errors and SNPs between paralogs can be difficult to distinguish. The Illumina platform and Roche GS-FLX are an effective combination to call SNPs when lacking a reference genome [17]–[20], but additional labor, time and cost are required to build a rough reference with GS-FLX. Therefore, we designed a universal and unconditionally reference-free SNP calling approach to analyze short sequence data from RRLs of any species, especially for the majority which lack a reference genome. To enable genome-wide association studies (GWAS) and genomic selection (GS) in switchgrass, we developed both linkage and association populations. Phenotypic data from these populations were collected over three field seasons. All 840 individuals in the linkage and association populations were genotyped with GBS. To overcome the inherent difficulties of the lack of a reference genome, multiple ploidy levels and high heterozygosity, a bioinformatics pipeline for SNP discovery based on a species-wide network approach called the Universal Network-Enabled Analysis Kit (UNEAK) was developed. This pipeline was validated in maize and then successfully applied to switchgrass GBS data. High density SNPs were generated to enable future GWAS and GS. Further analysis of the SNP data sets provided unparalleled insights into the diversity, genomic complexity, population structure, phylogeny, phylogeography, and evolutionary dynamics of switchgrass. Results Development of UNEAK (Universal Network Enabled Analysis Kit), a universal SNP–calling pipeline When a reference genome is available, SNP discovery can be easily performed by aligning reads to the physical map. However, when there is no reference genome, as is the case for the majority of species, significant challenges arise. The UNEAK pipeline overcomes many of these challenges. The general design of UNEAK is as follows (Figure 1): Reads are trimmed to 64 bps. The trimmed parts of the reads are ignored because the sequencing errors are enriched at the ends of reads. Identical 64-bp reads are collapsed into tags. Pairwise alignment identifies tag pairs having a single base pair mismatch (Figure 1C). These single base pair mismatches are candidate SNPs. Because of the complexity of the genome, many of the tag pairs form networks (Figure 1D and Figure 2). A network filter is employed to discard complicated networks, which are usually a mixture of repeats, paralogs and error tags (Figure 1E and Figure 2). Ideally, after application of the network filter, the only networks remaining are composed of reciprocal tag pairs, which can then be used for SNP calling. 10.1371/journal.pgen.1003215.g001 Figure 1 The analytical framework of UNEAK. (A) Multiple DNA samples are digested and sequenced using GBS (red arrows represent cut sites). The inputs of UNEAK are Illumina Qseq or Fastq files. All of the reads are computationally trimmed to 64 bp. The solid colored lines represent error-free (“real”) reads, while the dashed lines are reads containing one or more sequencing errors. (B) Identical reads are classified as a tag. The colored bars are real tags, whereas the shaded bar is a rarer error tag. (C) Pairwise alignment is performed to find tag pairs differing by only a single bp mismatch. (D) Topology of tag networks. The colored circles are real tags. The shaded circles are error tags. Lines (“edges”) are drawn only between tags that differ by a single bp mismatch. (E) Only reciprocal, real tag pairs are retained as SNPs. 10.1371/journal.pgen.1003215.g002 Figure 2 The networks of 802 representative tags from actual switchgrass data. The red circles are putative “real” tags. The blue circles are low frequency, putative error tags (see Methods). The size of each circle denotes the count of a tag. Lines connecting the circles (“edges”) join tags that differ by a single bp mismatch. Of the 802 tags, 192 (24%) formed reciprocal tag pairs and thus, were identified as SNPs by the network filter. To account for sequencing errors, we introduced a parameter called the error tolerance rate (ETR) to improve our initial network filter (see Methods). Without this feature (ETR = 0), sequencing errors can have a substantial negative impact upon the number of retained SNPs, especially when the depth of coverage is high. When sequencing errors occur and error detection is not employed, affected tag pairs are no longer reciprocal and therefore are removed from the data set (Figure 1E). By employing an appropriate ETR, the edges between error tags and real tags are cut. In this manner, complicated networks can be separated into different sub-networks, and only those sub-networks composed of reciprocal real tag pairs are kept (Figure S1). Hence, the SNPs with higher coverage, the most valuable part of the data set, are more likely to be retained (Figure 1E and Figure 2). Validation of UNEAK using maize GBS data To validate the UNEAK pipeline, we tested it with GBS data from a single RIL family (B73×B97) from the maize nested association mapping (NAM) population [22]. The large and complex genome of maize [23] makes this a useful test. The 199 inbred lines were processed using the GBS protocol applied in switchgrass. The only difference was that the maize samples were sequenced on the Illumina Genome Analyzer which has about 10% of the throughput of an Illumina Hiseq 2000. To evaluate the effectiveness of the network filter, we ignored the existence of the maize B73 reference genome and called SNPs at two stages in the pipeline, before and after application of the network filter. The first data set had 336,020 SNPs, which were composed of all tag pairs with a 1 bp mismatch. The second data set was comprised of the 92,951 SNPs that passed the network filter. Only 23.3% of the SNPs in the first data set aligned to a unique site in the maize reference genome. In contrast, after application of the network filter, 78.6% of the SNPs aligned to unique positions. Here, for a uniquely mapped SNP, one tag had a single perfect match to the reference; the other had a single best match at the same site. For the other 21.4%, either or both tags of a SNP aligned equally well to more than one site. Among them, 48.6% SNPs aligned to two sites. Considering that ApeKI is a partially methylation sensitive enzyme [4], and that potential tags from long restriction fragments are generally absent from GBS data due to PCR bias, some of the tags that aligned to multiple sites in fact may have come from a single site. To quantify this effect, we performed an in silico ApeKI digest of the maize reference genome and identified 8,420,424 potential B73 GBS tag loci. All of the 6,994,161 tags from the B73×B97 family were then aligned to the reference genome; 2,966,692 of these matched perfectly, and thus were B73 tags. These B73 tags accounted for only 35.2% (1,045,475) of the potential B73 tags from the in silico digest. Hence, there is a strong possibility that a large proportion of the 21.4% of SNPs from reciprocal tag pairs that align to multiple positions in fact derived from a single genomic position. For example, those SNPs that aligned to two sites (10.4% of total SNPs) had only a 35.2% chance of originating from two genomic positions. Therefore, we estimated that the SNPs essentially aligned to unique positions should be greater than 85%. The marked difference between the allele frequency distributions before and after application of the network filter demonstrates that this filter substantially improves the quality of the data. In contrast to the pre-network filter distribution, in which only low and high frequency error peaks were discernible, the post-network filter distribution was dominated by a central peak around the expected allele frequency of 0.5. At the same time, the two error peaks located at the two ends of the distribution were significantly reduced (Figure S2). The 92,951 SNPs were also validated by both linkage disequilibrium (LD) analysis and sequence alignment. First, we calculated LD (r 2) between these SNPs and the 1106 Illumina Golden Gate SNP markers developed in NAM [22], based on the assumption that valid SNPs should be in LD with adjacent markers. For the 20,402 SNPs with call rates >0.3, the average r 2 with the four adjacent markers were calculated. The results showed that 92.8% of the GBS SNPs were in LD with a flanking NAM SNP with an r 2 greater than 0.2 (Figure S3). Second, we aligned the non-B73 tags of these SNPs to the B97 whole genome shotgun sequences from maize HapMap2 data [24], which were sequenced at 4.2× and supposedly covered the majority of the B97 genome. The results showed that 93.2% of the GBS SNPs corresponded to HapMap2 SNPs from B97. SNP discovery in switchgrass To enable GWAS and GS in switchgrass, we created a full-sib linkage population (n = 130), a half-sib linkage population (n = 168) and an association panel (66 diverse populations, n = 540) (Table S1 and Table S2). Using GBS, approximately 350 Gb of sequence were generated from an Illumina HiSeq 2000. The UNEAK pipeline called 400,107, 476,005, and 700,236 SNPs from the full-sib, half-sib, and association populations, respectively. All together, about 1,242,860 putative SNPs were discovered in switchgrass. All of these SNPs had minor allele frequencies (MAFs) greater than 0.05. There were 29,838 (6.9%), 69,605 (12.8%) and 112,099 (13.8%) SNPs with MAFs less than 0.05 in the three populations, respectively. Because we cannot distinguish low frequency SNPs from sequencing errors, SNPs with a MAF less than 0.05 were removed from further analysis. The average coverage of the three data sets was less than 1×, but for some SNPs the coverage was as greater than 6× (Figure S4). The SNP calls can be found at http://www.maizegenetics.net/snp-discovery-in-switchgrass. Tetraploid switchgrass behaves like a diploid The parents of the full-sib population are upland tetraploids. In general, a stretch of DNA sequence should have four orthologous copies in tetraploids. Therefore, when considering the allele frequency distribution of an F1 population, we expected to see seven peaks, representing all possible allele frequency ratios of two parents (e.g., 1∶7, 2∶6, 3∶5, etc.). However, only three peaks were observed (1∶3, 1∶1 and 3∶1) after the network filter was implemented (Figure S2D), the signature of an F1 population of a heterozygous diploid. From this, we infer that tetraploid switchgrass is thoroughly diploidized. After the network filtering step, a second filter is implemented in UNEAK to remove remaining sequencing errors and paralogs. This filter is a goodness-of-fit χ2 test (α = 0.05) based on the null hypothesis that, in diploid species, the counts of the two paired tags of a SNP are equal in all heterozygous individuals. A substantial number of incorrect SNP calls were removed from the data set of the F1 full-sib population (compare Figure S2D to Figure 3). The three peaks of the allele frequency distribution for the remaining SNPs (Figure 3) represent the crosses of AA×Aa (expected allele frequencies of 0.25 and 0.75, with 1∶1 segregation of AA and Aa genotypes), AA×aa (no segregation), and Aa×Aa (expected allele frequencies of 0.5 with genotypic segregation ratios of 1∶2∶1). 10.1371/journal.pgen.1003215.g003 Figure 3 Allele frequency of 50,000 SNPs (call rate >0.8) in the full-sib F1 population (n = 130) of upland tetraploid switchgrass, showing the classic signature of a cross between two heterozygous diploids. The diploid nature of tetraploid switchgrass can also be recognized in individual plants. The tetraploid parents of the full-sib linkage population, U518 and U418, were sequenced at a high coverage of 6×. We ran UNEAK to call SNPs from loci that were heterozygous in both parents. The results showed that the two alleles at heterozygous loci have equal read frequencies within each tetraploid of 0.5 (Figure S5A and Figure S5B), providing more evidence that tetraploid switchgrass is diploidized. To compare the distribution pattern of octoploid and tetraploid, we sequenced one octoploid switchgrass, K101, also at 6×. In contrast to the tetraploids, the read frequency distribution within K101 had three peaks, at 0.25, 0.5 and 0.75 (Figure S5C). This result indicated that octoploid switchgrass behaves more like an autotetraploid. Eighteen linkage groups perfectly match the chromosome number of tetraploid switchgrass GBS is a low coverage genotyping approach, especially when the genome is digested with ApeKI, which is a frequently cutting restriction enzyme. Before constructing a high-density linkage map, we first evaluated the quality of the switchgrass genotypic data set. The low depth of sequencing, relative to the number of restriction fragments within GBS size range, has two effects on genotypic data quality. The first is a large amount of missing data. The SNP call rate increases with coverage (Figure S4). Across the 400,107 markers discovered in the switchgrass full-sib linkage population, we achieved a median coverage of 0.54×, which translated into a median SNP call rate of 40%. The second effect of low coverage is that heterozygous SNPs can be miscalled as homozygotes, even at markers with high call rates. To quantify the rate of miscalled heterozygotes, we selected markers with expected allele frequency ratios of 1∶1 (MAF>0.45 in the full sib progeny) that appeared, based upon high coverage GBS data from the parents, to be homozygous in both parents (AA×aa). These markers should be heterozygous in all of the full-sib progeny. As expected, the proportion of miscalled heterozygotes is very high at low coverage markers and declines substantially as coverage increases (Figure S6). In the subset of markers with the highest coverage (>4× coverage, or >90% SNP call rate) we estimate that 0.9, or >4× coverage (Figure S4) and with 0.5 were chosen to add to the 36 linkage groups to construct high density linkage maps. Out of these 88,217 SNPs, 9,437 could be aligned to unique positions in the foxtail millet genome; physical and genetic chromosomal assignments agreed for 7,245 of these 9,437. Based on the strong synteny between switchgrass and foxtail millet, we were able to order the 7,245 uniquely aligned SNPs, resulting in paternal and maternal framework maps consisting of 3,244 and 4,001 ordered markers, respectively. To check the quality of the synteny based order in the framework maps, we calculated the pairwise Spearman's rank correlation coefficients for the markers. High coefficient values were distributed along the diagonal in the heat map (Figure S9). This indicates that the synteny between switchgrass and foxtail millet is high enough to provide a reasonable order for switchgrass SNPs. The remainder of the 88,217 SNPs was then placed on the framework maps according to the r2 within their assigned chromosomes. A paternal linkage map (18 chromosomes, 41,709 markers) and a maternal map (18 chromosomes, 46,508 markers) were constructed. Both framework maps and the high density linkage maps can be found at http://www.maizegenetics.net/snp-discovery-in-switchgrass. Phylogenetic groups reflect ecotype, ploidy level, and geographic distribution In addition to the genomic analysis of the bi-parental populations, phylogenetic analysis was performed using the SNPs discovered in the diverse association panel. We selected all of the markers with call rates greater than 0.5 in 540 individuals, which included both tetraploid (4×) and octoploid (8×) plants. Because the size of the 8× genome is approximately twice the size of the 4× genome, the SNPs may be biased towards 8× specific SNPs. Furthermore, the octoploid plants may have half the sequencing depth of the tetraploids. Both of these factors have the potential to affect the phylogeny reconstruction. Hence, this data set was evaluated for ploidy specific SNPs as well as for coverage of 4× and 8× switchgrass. Based upon a χ2 test, 2.4% and 6.6% of the SNPs had a significantly larger number of genotype calls in 4× and 8× switchgrass, respectively (p 92% of the SNPs with MAFs greater than 0.05 were legitimate. The other 8% were probably due to sequencing errors or paralogs. However, these false positives can be significantly reduced through use of a minor allele frequency filter in biparental populations. For example, for SNPs with a MAF>0.3, the validation rate reaches 96.2%. The UNEAK pipeline was designed to perform SNP discovery in a broad range of species, therefore only the network filter, which is the key to UNEAK, was implemented and evaluated in the maize test. Essentially, based on the tag count file output from UNEAK, end users can design filters specific to the biology of their study population. For example, repulsion of alleles in inbred lines, or equal tag count of alleles at heterozygous sites in diploid species can be tested to filter for higher quality SNPs. Although UNEAK generates reliable SNPs, it cannot guarantee high quality genotype calls when sequencing coverage is low. Coverage has a major impact on genotype quality (Figure S4 and Figure S6). Call rate increases with coverage, but with diminishing returns (Figure S4). In inbred lines or haploid germplasm, especially in species with a reference genome, too much coverage is a waste of sequencing resources. On the other hand, in highly heterozygous germplasm, high rates of coverage may be required to distinguish heterozygous sites from homozygous, depending on the desired level of error tolerance. In this case, to obtain high quality genotypes from GBS, it may be helpful to use enzymes with longer recognition sequences than ApeKI. Linkage map construction using GBS A high quality linkage map is essential for QTL mapping and the assembly of whole genome sequence. The average coverage of the SNPs in the switchgrass full-sib population is only 0.95×. Even so, we were able to identify linkage groups. Using the 3,000 SNPs with the highest call rate (>0.9), 18 linkage groups were successfully identified and confirmed by alignment to the foxtail millet genome, for both paternal and maternal markers. However, without relying on synteny with foxtail millet, we were unable to order the markers within the linkage groups, even after trying many different algorithms. Our inability to order the markers stemmed from the low sequencing coverage; this led to high amounts of missing data, and heterozygotes often being miscalled as homozygotes, even for SNPs with high call rates (Figure S4 and Figure S6). For the same reason, the high density linkage maps (41,709 SNPs in the paternal map and 46,508 SNPs in the maternal map) we developed based on the lower call rate (>0.5) had approximately 14% markers grouped to wrong chromosomes. This percentage was largely reduced for the SNPs uniquely aligned to the foxtail millet genome and clustered into the one of the syntenic linkage groups of switchgrass. These SNPs comprise the framework maps, which were ordered based on the strong synteny between the two species [33]. Both high density maps (41,709 and 46,508 SNPs) and high quality framework maps (3,224 and 4,001 SNPs) are available at our website. These linkage maps should be useful for the current switchgrass genome assembly effort by the Joint Genome Institute. For example, they can be used to differentiate contigs derived from the two subgenomes as well as to order contigs on a chromosome. The GBS protocol generally provides low coverage if the enzyme ApeKI is used, but higher coverage can be obtained by choosing enzymes with longer recognition sites [4]. There is a tradeoff between coverage and number of SNPs. For linkage map construction, where the number of recombination events is limited, thousands of SNPs usually provide sufficient resolution. However, in breeding applications, a higher density of SNPs should provide a better chance to find SNPs that are tightly associated with QTLs. Therefore, further development of UNEAK will focus on linkage map construction using a six base cutter such as PstI in GBS, which is expected to result in at least 8 times higher coverage than ApeKI and thus should provide a better balance between coverage and total number of SNPs. The diploidization of switchgrass Ploidy level variation significantly complicates genetic research in switchgrass. The F1 full-sib population of switchgrass in this study was made by crossing two upland tetraploids. Although seven peaks were expected in the allele frequency distribution in the F1s, the three peaks clearly indicated that the tetraploid switchgrass behaves like a diploid (Figure 3). We also observed nearly equal intra-individual allele read frequencies of ∼0.5 at heterozygous loci within individual tetraploid plants. In addition, the MMC method successfully clustered paternal and maternal markers into 18 and 19 linkage groups respectively, with the extra maternal linkage group later merged with another via synteny. The correlations within the linkage groups are visibly higher than between groups (Figure 4). Construction of an SSR-based linkage map for switchgrass also indicated that chromosomes pair preferentially in meiosis [47]. These four lines of evidence indicate that tetraploid switchgrass shows disomic inheritance and has undergone diploidization over the past one million years [48]. The diploid nature of tetraploid switchgrass will greatly simplify genetic research and whole genome sequencing efforts. Population structure, phylogeography, and evolution of switchgrass Based on 29,221 markers, the phylogenetic analysis in this study provides high resolution to cluster the 540 individuals from 66 populations. As reported by previous studies using either sequence from chloroplast genomes [49]–[51] or SSR markers in nuclear genomes [51], [52], early divergence of the upland and lowland ecotypes was also observed in this study. In most cases, individuals from the same population were grouped together. Additionally, we found distinct subgroups within each ecotype based on their geographic distribution. This result clearly indicates isolation by distance, which could not be detected by random amplified polymorphic DNA (RAPD) markers [53]. The fact that subgroups of different ploidy form distinct clades indicates that the two ploidy levels are reproductively isolated [1]. To investigate if non-random patterns of shared missing genotypes between individuals affected the tree topology, we evaluated the ploidy specific SNPs and coverage of the 4× and 8× switchgrass. Only 6% of the SNPs (slightly higher than the type I error rate) were octoploid specific, and sequence coverage was quite similar in the two ploidy groups. These observations indicate that octoploid switchgrass behaves like an autotetraploid (Figure S5). Hence, it does not contain many private genomic regions relative to tetraploid switchgrass. Moreover, only the sites with genotype calls in both individuals were used to calculate the genetic distance, which should minimize the impact of differential amounts of missing data. In addition, we reconstructed multiple trees at different call rates ranging from 0.15 to 0.9, and the overall topology with respect to the main groups, for example Upland 8× West, Upland 4× North, Upland 8× East, Upland 8× South, Lowland South and Lowland Northeast, was stable regardless of call rate (data not shown). When call rate was greater than 0.9, there were only about 800 SNPs in the data set, many of which were repetitive paralogs. When the call rate is less than 0.15, there were not enough shared sites between individuals to calculate genetic distances. Thus, in spite of the low coverage of GBS, we concluded that the phylogenetic relationship constructed in this study is reliable. These results also suggest that missing genotypes do not alter the performance of phylogenetic analysis, provided that large numbers of SNPs are used. Our results suggest that the upland 4× arose from upland 8×. We used a stepwise method to designate the lowland ecotype as the outgroup of the upland ecotype. Phylogenetic analysis indicated the upland 8× is closer to this external, lowland branch. Additionally, upland 4× showed less diversity than upland 8×. Both lines of evidence support the hypothesis that upland 8× gave rise to upland 4×. This conclusion is contradictory to the accepted evolutionary trajectory of higher ploidy level being derived from either auto- or allo-polyploid events involving lower ploidy taxa. A reversion to the lower ploidy could occur via apomixis, whereby an unfertilized haploid (4×) gamete becomes a viable embryo. This is a well-documented phenomenon in perennial grasses [54]. Confirmed haploidy of switchgrass has been observed in two laboratories, in both cases at extremely low frequencies, on the order of 10−4 to 10−2, from the 2n = 4x = 36 to the 2n = 2x = 18 ploidy level [49], [55]. The relatively high frequency of tetraploid accessions in the northern USA (Figure 6) cannot be explained simply by this phenomenon, suggesting that selection may play a role in favoring upland tetraploid genotypes in certain northern environments. According to the phylogeny (Figure 7) and the geographic distribution of upland switchgrass (Figure 6), we confirmed a south to north migration path of upland switchgrass, which agrees with a previous study [51]. Our data also indicated a loss of diversity during the migration, manifested largely by the shift in ploidy from 8× to 4× in the north. For the lowland switchgrass, our phylogenetic analysis cannot tell the migration direction. However, it is very likely that the common ancestor of upland and lowland ecotypes came from the southern area, then migrated to the north, not vice versa. Therefore, we inferred a general south to north migration path of switchgrass (Figure S14). The natural barrier of the Appalachian Mountains split the northern spread of switchgrass into two subgroups. To the west of the mountains, the most recent common ancestor (MRCA) of the Upland ecotype was 8×. During migration to the north, a ploidy level shift occurred and the Upland 4× emerged. To the east of the Appalachian Mountains, the lowland ecotype was favored, and continued spreading northward along the coastal plains. Subsequently, the Lowland 4× Northeast subgroup diverged both geographically and genetically from the Lowland 4× South. The switchgrass germplasm used for this phylogenetic study derived mainly from northern-adapted populations, with very restricted sampling of populations from southern and central regions. Based on the geographic distribution and deep divergence of the upland and lowland ecotypes [51], [52], we expect the switchgrass from southern regions, particularly in Mexico, Texas and Florida, to be clustered with the lowland ecotype. Switchgrass from the central regions might provide useful information about the divergence of switchgrass into two ecotypes, the origin of upland ecotypes, and how the ploidy level shifted during the migration. More extensive sampling of switchgrass from all regions of North America will undoubtedly improve our understanding of switchgrass evolution. Materials and Methods Switchgrass germplasm The association panel consisted of 66 diverse switchgrass populations grown from seed in the greenhouse of the USDA-ARS Dairy Forage Research Center in Madison WI in 2007. This panel was mainly composed of northern adapted upland populations (Table S1). In addition, two tetraploid F1 linkage populations were propagated at the same time. One was derived from the bi-parental cross of two upland accessions from the germplasm collection of MDC, named U518 and U418. The other was a half-sib population whose maternal parent is U601. The numbers of individuals in these two populations were 130 and 168, respectively. For the association panel, ten clones from each population were initially planted in replicated field plots in Ithaca, NY and Arlington, WI in 2008. All genotyping was conducted on plants from the Ithaca site. Reduced representation libraries construction and sequencing The reduced representation libraries were constructed and sequenced according to the published GBS protocol [4] with one modification. Specifically, a titration experiment showed that ∼0.1 pmol of each adapter was appropriate for switchgrass (rather than ∼0.06 pmol), and that amount was used with 100 ng of genomic DNA. DNA samples were digested with the restriction enzyme ApeKI, which has a 4.5 bp cut site (CGWGC, where W = A or T). The resulting libraries were sequenced on the Illumina HiSeq 2000. Ninety five samples (plus a blank negative control) were sequenced per lane. SNP discovery and genotyping The non-reference pipeline UNEAK (http://www.maizegenetics.net/gbs-bioinformatics) was developed for SNP discovery and genotyping in species like switchgrass (Figure 1). In UNEAK, Illumina reads are trimmed to 64 bp and stored in bit format, which greatly reduces the amount of storage space and enables relatively fast computation. About 40GB of data from one lane of an Illumina HiSeq 2000 can be processed to SNP genotypes in 20 minutes on a personal computer with 2.67 GHz CPU and 8 GB memory. More technical details of UNEAK are described in Text S1. The network filter is the key step for identifying and removing paralogs. The simplest networks, reciprocal tag pairs, are more likely to be real SNPs than tag pairs that are part of complicated networks. Rare tags, with read counts below a specified error tolerance rate (ETR), are assumed to result from sequencing error. For two tags (t1 and t2) with a 1 bp mismatch and read counts c1 and c2, if c1/(c1+c2) 0.9). The MMC method [25] was used to cluster the markers into linkage groups. In the MMC input file, homozygous genotypes were assigned a value of 0 or 2, and heterozygous genotypes and missing data were assigned a value of 1. Spearman's rank correlation coefficient was used by MMC. Markers representing the cross of AA×Aa segregate from either a paternal or maternal linkage group. Using the 36 linkage groups produced by MMC from these initial 6,000 markers as seeds, 88,217 markers (0.12 0.5) were then assigned to linkage groups based upon their Spearman's rank correlation coefficient to each seed linkage group. To order the markers within each linkage group, we relied upon extensive synteny between the switchgrass and foxtail millet genomes. Markers were mapped to the foxtail millet genome (http://www.phytozome.net/foxtailmillet.php) via Basic Local Alignment Search Tool (BLAST) [56] with a P-value cutoff of 1e-5. The 7,245 markers that mapped to a single site of the foxtail millet genome, and clustered with one of the syntenic linkage groups of switchgrass, were used to construct the synteny based framework linkage maps. To construct the high density linkage maps, the rest of the 80,972 markers were mapped to the framework marker with highest value of Spearman's rank correlation coefficient within their assigned linkage groups. Phylogenetic analysis and evolution A pairwise genetic distance matrix between individuals was calculated and an un-rooted NJ tree constructed using TASSEL [57]. All of the 29,221 markers with a call rate greater than 0.5 in the diverse populations were used in this analysis. To assess the robustness of the topology of the tree, 500 bootstrap replicates were performed using MEGA [58]. To address the evolutionary trajectory of upland switchgrass, a two-step phylogenetic analysis was performed. In the first step, foxtail millet was used as an outgroup. A NJ tree was reconstructed based on the 3,144 SNPs that could be aligned to unique positions in the foxtail millet genome. This first step identified the lowland ecotype as ancestral to the remaining switchgrass ecotypes studied herein. The second step omitted foxtail millet and used the lowland ecotype as the outgroup. This second NJ tree was reconstructed based on 29,221 markers (alignment to foxtail millet not required). An MDS plot was also generated based upon the kinship matrix of individuals calculated from the 29,221 markers in TASSEL [57]. Supporting Information Figure S1 Details of the network filter. The dots represent tags. The size of dots increases with tag count. Blue dots are putative sequencing errors (rare tags). Red dots are real, more common tags. Arrows from the blue dots to the red dots indicate where the errors come from. (A) A network of tags. (B) The sequencing errors are identified if their counts are much fewer than the counts of adjacent tags. (C) The edges connecting the real tags and errors are sheared. (D) The network is divided into sub-networks. The reciprocal tag pair is kept as a potential SNP. The network with multiple tags is discarded. (E) Possible tag topologies of potential SNPs after passing through the network filter. (TIF) Click here for additional data file. Figure S2 Effect of the network filter on actual allele frequency distributions in biparental populations. SNPs were called in a single family (B73×B97) from the maize NAM population [22] (A and B) and in a switchgrass full-sib F1 linkage population (C and D). We called SNPs based on finding tag pairs mismatching at a single base (A and C) and then filtered these SNPs with the network filter (B and D). The peaks at the two ends of the distributions correspond to artifactual SNPs with low minor allele frequencies resulting from sequencing errors. (TIF) Click here for additional data file. Figure S3 LD distribution of SNPs generated from UNEAK versus 1106 external SNP markers in the maize NAM family B73×B97 [22]. For each UNEAK SNP, the average LD (r 2) with 4 adjacent, external SNPs was calculated. (TIF) Click here for additional data file. Figure S4 The relationship between coverage and SNP call rate in the switchgrass data sets. The call rate represents the proportion of individuals that was covered by at least one read. A total of 3,000 SNPs are plotted in each subfigure. (A) is from the full-sib population (130 individuals). (B) is from the half-sib population (168 individuals). (C) is from the association populations (540 individuals). (TIF) Click here for additional data file. Figure S5 Relative depth of coverage (read frequency) of the two SNP alleles at heterozygous loci in the tetraploid U518 (A), the tetraploid U418 (B), and the octoploid K101 (C). (TIF) Click here for additional data file. Figure S6 The relationship between sequencing coverage and the proportion of miscalled heterozygous genotypes. A total of 3,000 SNPs are plotted. Due to the limited sequencing depth, heterozygotes are often miscalled as homozygotes. (TIF) Click here for additional data file. Figure S7 Marker selection and linkage detection via a pseudo-testcross strategy. The red and blue blocks represent bi-allelic SNPs in each site. (A) Only SNPs with allele frequencies falling within 0.2
            Bookmark
            • Record: found
            • Abstract: found
            • Article: not found

            SNP discovery and allele frequency estimation by deep sequencing of reduced representation libraries.

            High-density single-nucleotide polymorphism (SNP) arrays have revolutionized the ability of genome-wide association studies to detect genomic regions harboring sequence variants that affect complex traits. Extensive numbers of validated SNPs with known allele frequencies are essential to construct genotyping assays with broad utility. We describe an economical, efficient, single-step method for SNP discovery, validation and characterization that uses deep sequencing of reduced representation libraries (RRLs) from specified target populations. Using nearly 50 million sequences generated on an Illumina Genome Analyzer from DNA of 66 cattle representing three populations, we identified 62,042 putative SNPs and predicted their allele frequencies. Genotype data for these 66 individuals validated 92% of 23,357 selected genome-wide SNPs, with a genotypic and sequence allele frequency correlation of r = 0.67. This approach for simultaneous de novo discovery of high-quality SNPs and population characterization of allele frequencies may be applied to any species with at least a partially sequenced genome.
              Bookmark
              • Record: found
              • Abstract: found
              • Article: not found

              A review on SNP and other types of molecular markers and their use in animal genetics

              During the last ten years, the use of molecular markers, revealing polymorphism at the DNA level, has been playing an increasing part in animal genetics studies. Amongst others, the microsatellite DNA marker has been the most widely used, due to its easy use by simple PCR, followed by a denaturing gel electrophoresis for allele size determination, and to the high degree of information provided by its large number of alleles per locus. Despite this, a new marker type, named SNP, for Single Nucleotide Polymorphism, is now on the scene and has gained high popularity, even though it is only a bi-allelic type of marker. In this review, we will discuss the reasons for this apparent step backwards, and the pertinence of the use of SNPs in animal genetics, in comparison with other marker types.
                Bookmark

                Author and article information

                Journal
                Sci Rep
                Sci Rep
                Scientific Reports
                Nature Publishing Group
                2045-2322
                07 April 2017
                2017
                : 7
                Affiliations
                [1 ]Animal Science department, University of São Paulo (USP)/Luiz de Queiroz College of Agriculture (ESALQ) , Piracicaba, São Paulo, Brazil
                [2 ]The Fish Molecular Genetics and Biotechnology Laboratory, Aquatic Genomics Unit, School of Fisheries, Aquaculture and Aquatic Sciences and Program of Cell and Molecular Biosciences, Auburn University , Auburn, AL, 36849, United States of America
                [3 ]Nature and Culture Institute, Federal University of Amazon (UFAM) , Benjamin Constant, Amazonas, Brazil
                [4 ]Unit of Biotechnology, University of Mogi das Cruzes , P.O. Box 411, 08701-970, Mogi das Cruzes, SP, Brazil
                [5 ]Brazilian National Institute for Research of the Amazon, Laboratory of Ecophysiology and Molecular Evolution , Manaus, Amazonas, Brazil
                [6 ]University Nilton Lins, Aquaculture Graduate Program , Manaus, Amazonas, Brazil
                Author notes
                [*]

                These authors contributed equally to this work.

                Article
                srep46112
                10.1038/srep46112
                5384230
                28387238
                Copyright © 2017, The Author(s)

                This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/

                Categories
                Article

                Uncategorized

                Comments

                Comment on this article