208
views
0
recommends
+1 Recommend
0 collections
    0
    shares
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      Development and implementation of high-throughput SNP genotyping in barley

      research-article

      Read this article at

      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          Background

          High density genetic maps of plants have, nearly without exception, made use of marker datasets containing missing or questionable genotype calls derived from a variety of genic and non-genic or anonymous markers, and been presented as a single linear order of genetic loci for each linkage group. The consequences of missing or erroneous data include falsely separated markers, expansion of cM distances and incorrect marker order. These imperfections are amplified in consensus maps and problematic when fine resolution is critical including comparative genome analyses and map-based cloning. Here we provide a new paradigm, a high-density consensus genetic map of barley based only on complete and error-free datasets and genic markers, represented accurately by graphs and approximately by a best-fit linear order, and supported by a readily available SNP genotyping resource.

          Results

          Approximately 22,000 SNPs were identified from barley ESTs and sequenced amplicons; 4,596 of them were tested for performance in three pilot phase Illumina GoldenGate assays. Data from three barley doubled haploid mapping populations supported the production of an initial consensus map. Over 200 germplasm selections, principally European and US breeding material, were used to estimate minor allele frequency (MAF) for each SNP. We selected 3,072 of these tested SNPs based on technical performance, map location, MAF and biological interest to fill two 1536-SNP "production" assays (BOPA1 and BOPA2), which were made available to the barley genetics community. Data were added using BOPA1 from a fourth mapping population to yield a consensus map containing 2,943 SNP loci in 975 marker bins covering a genetic distance of 1099 cM.

          Conclusion

          The unprecedented density of genic markers and marker bins enabled a high resolution comparison of the genomes of barley and rice. Low recombination in pericentric regions is evident from bins containing many more than the average number of markers, meaning that a large number of genes are recombinationally locked into the genetic centromeric regions of several barley chromosomes. Examination of US breeding germplasm illustrated the usefulness of BOPA1 and BOPA2 in that they provide excellent marker density and sensitivity for detection of minor alleles in this genetically narrow material.

          Related collections

          Most cited references23

          • Record: found
          • Abstract: found
          • Article: not found

          Efficient and Accurate Construction of Genetic Linkage Maps from the Minimum Spanning Tree of a Graph

          Introduction Genetic linkage mapping dates back to the early 20th century when scientists began to understand the recombinational nature and cellular behavior of chromosomes. In 1913 Sturtevant studied the first genetic linkage map of chromosome X of Drosophila melanogaster [1]. Genetic linkage maps began with just a few to several tens of phenotypic markers obtained one by one by observing morphological and biochemical variations of an organism, mainly following mutation. The introduction of DNA-based markers such as restriction fragment length polymorphism (RFLP), randomly amplified polymorphic DNA (RAPD), simple sequence repeats (SSR) and amplified fragment length polymorpshim (AFLP) caused genetic maps to become much more densely populated, generally into the range of several hundred to more than a thousand markers per genome. More recently, the number of markers has surged well above 1,000 in a number of organisms with the adoption of DArT, SFP and especially SNP markers, the latter providing avenues to 100,000 s to millions of markers per genome. In plants, one of the most densely populated maps is that of Brassica napus [2], which was developed from an initial set of 13,551 markers. High density genetic maps facilitate many biological studies including map-based cloning, association genetics and marker assisted breeding. Because they do not require whole genome sequencing and require relatively small expenditures for data acquisition, high density genetic linkage maps are currently of great interest. A genetic map usually is built using input data composed of the states of loci in a set of meiotically derived individuals obtained from controlled crosses. When an order of the markers is computed from the data, the recombinational distance is also estimated. To characterize the quality of an order, various objective functions have been proposed, e.g., minimum Sum of Square Errors (SSE) [3], minimum number of recombination events (COUNT) [4], Maximum Likelihood (ML) [5], Modified Maximum Likelihood (MML) [6] which tries to incorporate the presence of possible genotype errors into the ML model, maximum Sum of adjacent LOD scores (SALOD) [7], minimum Sum of Adjacent Recombination Fractions (SARF) [8], minimum Product of Adjacent Recombination Fractions (PARF) [9]. Searching for an optimal order with respect to any of these objective functions is computationally difficult. Enumerating all the possible orders quickly becomes infeasible because the total number of distinct orders is proportional to n!, which can be very large even for a small number n of markers. The connection between the traveling salesman problem and a variety of genomic mapping problem is well known, e.g., for the physical mapping problem [10],[11], the genetic mapping problem [12],[13] and the radiation hybrid ordering problem [14]. Various searching heuristics that were originally developed for the traveling salesman problem, such as simulated annealing [15], genetic algorithms [16], tabu search [17],[18], ant colony optimization, and iterative heuristics such as K-opt and Lin-Kernighan heuristic [19] have been applied to the genetic mapping problem in various computational packages. For example, JoinMap [5] and Tmap [6] implement simulated annealing, Carthagene [12],[20] uses a combination of simulated annealing, tabu search and genetic algorithms, AntMap [21] exploits the ant colony optimization heuristic, [22] is based on genetic algorithms, and [23] takes advantage of evolutionary algorithms. Finally, Record [4] implements a combination of greedy and Lin-Kernighan heuristics. Most of the algorithms proposed in the literature for genetic linkage mapping find reasonably good solutions. Nonetheless, they fail to identify and exploit the combinatorial structures hidden in the data. Some of them simply start to explore the space of the solutions from a purely random order (see, e.g., [12],[23],[5],[21]), while others start from a simple greedy solution (see, e.g., [4],[3]). Here we show both theoretically and empirically that when the data quality is high, the optimal order can be identified very quickly by computing a minimal spanning tree of the graph associated with the genotyping data. We also show that when the genotyping data is noisy or incomplete, our algorithm consistently constructs better genetic maps than the best available tools in the literature. The software implementing our algorithm is currently available as a web tool under the name MSTmap. Materials and Methods We are concerned with genetic markers in the form of single nucleotide polymorphism (SNP), more specifically biallelic SNPs. By convention, the two alternative allelic states are denoted as A and B respectively. The organisms considered here are diploids with two copies of each chromosome, one from the mother and the other from the father. A SNP locus may exist in the homozygous state if the two allele copies are identical, and in the heterozygous state otherwise. Various population types have been studied in association with genetic mapping, which includes Back Cross (BC1), Doubled Haploid (DH), Haploid (Hap), Recombinant Inbred Line (RIL), advanced RIL, etc. Our algorithm can handle all of the aforementioned population types. For the sake of clarity, in what follows we will concentrate on the DH population (see the section on barley genotyping data for details on DH populations). The application of our method to Hap, advanced RIL and BC1 populations is straightforward. In Supplementary Text S1, we will discuss the extension of our method to the RIL population (see, e.g., [24] for an introduction to RIL populations). Building a genetic map is a three-step process. First, one has to partition the markers into linkage groups, each of which usually corresponds to a chromosome (sometimes multiple linkage groups can reside on the same chromosome if they are far apart). More specifically, a linkage group is defined as a group of loci known to be physically connected, that is, they tend to act as a single group (except for recombination of alleles by crossing-over) in meiosis instead of undergoing independent assortment. The problem of assigning markers to linkage groups is essentially a clustering problem. Second, given a set of markers in the same linkage group, one needs to determine their correct order. Third, the genetic distances between adjacent markers have to be estimated. Before we describe the algorithmic details, the next section is devoted to a discussion on the input data and our optimization objectives. Genotyping Data and Optimization Objective Functions The doubled haploid individuals (a set collectively denoted by N) are genotyped on the set M of markers, i.e., the state of each marker is determined. The genotyping data are collected into a matrix of size m×n, where m = |M| and n = |N|. Each entry in corresponds to a marker and individual pair, which is also called an observation. Due to how DH mapping populations are produced (please refer to section on barley genotyping data for details), each observation can exist in two alternative states, namely homozygous A or homozygous E, which are denoted as A and B respectively. The case where there is missing data will be discussed later in this manuscript. For a pair of markers l 1, l 2 ∈ M and an individual c ∈ N, we say that c is a recombinant with respect to l 1 and l 2 if c has genotype A on l 1 and genotype B on l 2 (or vice versa). If l 1 and l 2 are in the same linkage group, then a recombinant is produced if an odd number of crossovers occurred between the paternal chromosome and the maternal chromosome within the region spanned by l 1 and l 2 during meiosis. We denote with P i,j the probability of a recombinant event with respect to a pair of markers (li ,lj ). P i,j varies from 0.0 to 0.5 depending on the distance between li and lj At one extreme, if li and lj belong to different LGs, then P i,j  = 0.5 because alleles at li and lj are passed down to next generation independently from each other. At the other extreme, when the two markers li and lj are so close to each other that no recombination can occur between them, then P i,j  = 0.0. Let (li ,lj ) and (lp ,lq ) be two pairs of markers on the same linkage group. We say that the pair (li ,lj ) is enclosed in the pair (lp ,lq ) if the region of the chromosome spanned by li and lj is fully contained in the region spanned by lp and lq . A fundamental law in genetics is that if (li ,lj ) is enclosed in (lp ,lq ) then P i,j ≤P p,q . As mentioned in the Introduction, a wide variety of objective functions have been proposed in the literature to capture the quality of the order (SSE, COUNT, ML, MML, SALOD, SARF, PARF, etc.). With the exception of SSE and MML, the rest of the objective functions listed above can be decomposed into a simple sum of terms involving only pairs of markers. Thus, we introduce a weight function w: M×M→ℜ to be defined on pairs of markers. The function w is said to be semi-linear if w(i, j)≤w(p, q) for all (li ,lj ) enclosed in (lp ,lq ). For example, if we have three markers in order {l 1,l 2,l 3} and an associated weight function w that satisfies semi-linearity, we have w(1,3)≥w(1,2) and w(1,3)≥w(2,3) since (l 1,l 2) and (l 2,l 3) are enclosed in (l 1,l 3), but it is not necessary the case that w(1,3) = w(1,2)+w(2,3). The concept of semi-linearity is essential for the development of our marker ordering algorithm as explained below. For example, the function w(i, j) = P i,j is semi-linear. Another commonly used weight function is wlp (i, j) = log(P i,j ). Since the logarithm function is monotone, then wlp (i, j) is also semi-linear. A more complicated weight function is wml (i, j) = −[P i,j log(P i,j )+(1−P i,j )log(1−P i,j )]. It is relatively easy to verify that wml (i, j) is a monotonically increasing function of P i,j when 0≤P i,j ≤0.5, and therefore wml is also semi-linear. Observe that all these weight functions are functions in P i,j . Although the precise value of P i,j is unknown, we can compute their estimates from the total number of recombinants in the input genotyping data. For DH populations, the total number of recombinants in N with respect to the pair (li ,lj ) can be easily determined by computing the number di ,j of positions in which row and row do not match, which corresponds to the Hamming distance between and . It is easy to prove that di ,j /n corresponds to the maximum likelihood estimate (MLE) for P i,j . When we replace P i,j by its maximum likelihood estimate di ,j /n, we obtain the following approximate weight functions: wp ′(i, j) = di ,j /n, wlp ′ (i, j) = log(di ,j /n), and . Our optimization objective is to identify a minimum weight traveling salesman path with respect to either of the aforementioned approximated weight functions, which will be discussed in further details below. We should mention that if wp ′ is used as the weight function, then our optimization objective is equivalent to the SARF or COUNT objective functions (up to a constant). If instead wlp ′ is used, then our optimization objective is equivalent to the logarithm of the PARF objective function (up to a constant). Lastly, if wml ′ is employed, our objective function is equivalent to the negative of the logarithm of the ML objective function as being employed in [3],[5],[12],[20] (again, up to a constant). Unless otherwise noted, wp ′ is the objective function being employed in the rest of this paper. The experimental results will show that the specific choice of objective function does not have a significant impact on the quality of the final map. In particular, both functions wp ′ and wml ′ produce very accurate final maps. Clustering Markers into Linkage Groups First observe that when two markers li and lj belong to two different linkage groups, then P i,j  = 0.5 and consequently di ,j will be large with high probability. More precisely, let li and lj be two markers that belong to two different LGs, and let di ,j be the Hamming distance between and . Then, where δ 0 for all i, j ∈ M. The excluded markers for which di ,j  = 0 are called co-segregating markers, and they identify regions of chromosomes that do not recombine. In practice, we coalesce co-segregating markers into bins, where each bin is uniquely identified by any one of its members. Let G(M, E) be an edge-weighted complete undirected graph on the set of vertices M, and let w be one of the weight functions defined above. A traveling salesman path (TSP) Γ in G is a path that visits every marker/vertex once and only once. The weight w(Γ) of a TSP Γ is the sum of the weights of the edges on Γ. The main theoretical insight behind our algorithm is the following. When w is semi-linear, the minimum weight TSP of G corresponds to the correct order of markers in M. Furthermore when the minimum spanning tree (MST) of G is unique, the minimum weight TSP of G (and thus, the correct order) can be computed by a simple MST algorithm (such as Prim's algorithm). Details of these mathematical facts (with proofs) are given in Supplementary Text S1. We now turn our attention to the problem of finding a minimum weight TSP in G with respect to one of the approximate weight functions. When the data are clean and n is large, the maximum likelihood estimates di ,j /n will be close to the true probabilities P i,j . Consequently it is reasonable to expect that those approximate weight functions will be also semi-linear, or “almost” semi-linear. Although only in the former case our theory (in particular, Lemma 1 in Supplementary Text S1) guarantees that the minimum weight TSP will correspond to the true order of the markers, in our simulations the order is recovered correctly in most instances. In order to find the minimum weight TSP, we first run Prim's algorithm on G to compute the optimum spanning tree, which takes O(nlogn). If the MLEs are accurate so that the approximate weight function is semi-linear, our theory (in particular Lemma 2 in Supplementary Text S1) ensures that the MST is a TSP. In practice, due to noise in the genotyping data or due to an insufficient number of individuals, the spanning tree may not be a path – but hopefully “very close” to a path. This is exactly what we observed when running MST algorithm on both real data and noise-free simulated data – the MST produced is always “almost” a path. In Results and Discussion we compute the fraction ρ of the total number of markers in the linkage group that belong to the longest path of the MST. The closer is ρ to 1.0, the closer is the MST to a path. Table 1 on the barley datasets and Figure 1 on simulated data show that ρ is always very close to 1.0 when the data is noise-free. 10.1371/journal.pgen.1000212.g001 Figure 1 Average ρ (rho) for thirty runs on simulated data for several choices of the error rates (and no missing data). The variable n represents the number of individuals, and m represents the number of markers. 10.1371/journal.pgen.1000212.t001 Table 1 Summary of the clustering results for the barley data sets. Data set # markers (# bins) # LGs Sizes of the LGs OWB 1562(509) 7 168(65), 235(73), 255(91), 211(60), 278(89), 202(64), 213(67) 0.9978 SM 1270(396) 8 148(49), 217(57), 242(63), 130(49), 225(80), 122(40), 183(57), 3(1) 0.9971 MB 1652(443) 8 215(60), 279(72), 246(77), 141(39), 299(74), 219(54), 248(65), 5(2) 1.0000 is the average ρ of the seven largest LGs in each population. The numbers inside the parentheses are the number of bins. When a tree is not a path, we proceed as follows. First, we find the longest path in the MST, hereafter referred to as the backbone. The nodes that do not belong to the path will be first disconnected from it. Then, the disconnected nodes will be re-inserted into the backbone one by one. Each disconnected node is re-inserted at the position which incurs the smallest additional weight to the backbone. The path obtained at the end of this process is our initial solution, which might not be locally optimal. Once the initial solution is computed, we apply three heuristics that iteratively perform local perturbations in an attempt to improve the current TSP. First, we apply the commonly-used K-opt (K = 2 in this case) heuristic. We cut the current path into three pieces, and try all the possible rearrangements of the three pieces. If any of the resulting paths has less total weight, it will be saved. This heuristic is illustrated in Figure 2-C1 . This procedure is repeated until no further improvement is possible. In the second heuristic, we try to relocate each node in the path to all the other possible positions. If this relocation reduces the weight, the new path will be saved. The second heuristic is illustrated in Figure 2-C2 . 10.1371/journal.pgen.1000212.g002 Figure 2 An illustration of the MST-based algorithm. (A) The MST obtained for a synthetic example; the MST is not a TSP yet; the backbone of the MST is shown with dotted edges. (B) An initial TSP obtained from the backbone (see text for details). The dotted edges represent marker pairs in the wrong order. Several local improvement operations are applied to further improve the TSP, namely 2-OPT (C1), node-relocation (C2) and block-optimize (C3). The final TSP is shown in (D). In our experiments, we observed that K-opt or node relocation may get stuck in local optima if a block of nodes have to be moved as a whole to a different position in order to further improve the TSP. In order to work around this limitation, we designed a third local optimization heuristic, which is called block-optimize. The heuristic works as follows. We first partition the current TSP into blocks consisting of consecutive nodes. Let l 1, l 2,…, lm be the current TSP. We will place li and li +1 in the same block if (1) w(i, i+1)≤w(i, j) for all i+1 m(m−1)/4, one of the maps is flipped and E is recomputed. Notice that E is more sensitive to global reshuffling than to local reshuffling. For example, assume that the true order is the identity permutation. The value of E for the following order is m(m−1)/4, whereas E for the order 2,1,4,3,6,5,…,m,m−1 is m(m−1)/2. For reasonably large m, m(m−1)/2 is much smaller than m(m−1)/4. The fact that E is more sensitive to global reshuffling is a desirable property since biologists are often more interested in the correctness of the global order of the markers than the local order. The number of erroneous marker pairs conveys the overall quality of the map produced by MSTmap, however E depends on the number m of markers. The larger is m, the larger E will be. Sometimes it is useful to normalize E by taking the transformation 1−(4E/(m(m−1))). The resulting statistic is essentially the Kendall's τ statistic. The τ statistic ranges from 0 to 1. The closer is the statistic to 1, the more accurate the map is. We will present the τ statistic along with the E statistic when it is necessary. The next three statistics we collected are the percentage of true positives, the percentage of false positives, and the percentage of false negatives, which are denoted as %t_pos, %tf_pos and %f_neg respectively. For each dataset, the list of suspicious observations identified by MSTmap is compared with the list of true erroneous observations that were purposely added when the data was first generated. The value of %t_pos is the number of suspicious observations that are truly erroneous divided by the total number nm of observations. The value %f _pos is the number suspicious observations that are in fact correct divided by the total number of observations. Likewise, %f_neg is the number of erroneous observations that are not identified by MSTmap. The three performance metrics are intended to capture the overall accuracy of the error detection scheme. Finally, we collected the running time on each data set. Table 2 summarizes the statistics for n = 100, m = 100, when the error rate and the missing rate range from 0% to 15%. An inspection of the table reveals that irrespective of the choice of η and γ our error detection scheme is able to detect most of the erroneous observations without introducing too many false positives. When the input data are noisy, the quality of the final maps with error detection is significantly better than those without. However, if the input data are clean (corresponding to rows in the table where η = 0), the quality of the maps with error detection deteriorates slightly. Results for other choices of m and n are presented in Table 4 and Table 5. Similar conclusions can be drawn. 10.1371/journal.pgen.1000212.t002 Table 2 Summary of the accuracy and effectiveness of our error detection scheme for m = 100, n = 100 and various choices of η and γ. n, m = 100 E γ η %f_pos %t_pos %f_neg wp ′ w′ ml wp ′ no err. 0.00 0.00 0.00186 0.00000 0.00000 1.50 1.43 1.80 0.00 0.01 0.00441 0.00956 0.00049 15.10 15.80 38.93 0.00 0.05 0.00442 0.04643 0.00357 41.37 42.93 165.50 0.00 0.10 0.00682 0.08754 0.01229 96.53 96.07 468.63 0.00 0.15 0.01086 0.12188 0.02720 221.03 238.77 1187.60 0.01 0.00 0.00177 0.00000 0.00000 1.83 3.27 1.27 0.05 0.00 0.00150 0.00000 0.00000 6.47 6.17 5.23 0.10 0.00 0.00135 0.00000 0.00000 18.07 18.60 9.23 0.15 0.00 0.00124 0.00000 0.00000 16.13 16.40 10.00 0.01 0.01 0.00357 0.00966 0.00050 11.47 11.83 44.20 0.05 0.05 0.00421 0.04305 0.00433 52.90 54.13 144.67 0.10 0.10 0.00631 0.07641 0.01300 140.67 150.40 532.47 0.15 0.15 0.00994 0.09494 0.03277 379.17 353.53 1040.70 Each row in the table is an average of 30 independent runs. The columns wp ′ and w′ ml correspond to the number of erroneous marker pairs (E) made by MSTmap under the objective function wp ′ and w′ ml respectively with error detection. The column “wp ′ no err.” corresponds to the number of erroneous markers pairs made by MSTmap under the objective function wp ′ without error detection. In Table 2, we also compare the quality of the final maps under different objective functions. The objective functions (SARF) and (ML) give very similar results. Similar results are observed for other choices of n and m (data not shown). Evaluation of the Accuracy of the Ordering In the fourth and final evaluation, we use simulated data to compare our tool against several commonly used tools including JoinMap [5], Carthagene [12] and Record [4]. JoinMap is a commercial software that is widely used in the scientific community. It implements two algorithms for genetic map construction, one is based on regression [3] whereas the other based on maximum likelihood [5]. Our experimental results for JoinMap are obtained with the “maximum likelihood based algorithm” since it is orders of magnitude faster than the “regression based algorithm” (see the manual of JoinMap for more details). Due to the fact that JoinMap is GUI-based (non-scriptable), we were able to collect statistics for only a few datasets. Carthagene and Record on the other hand are both scriptable, which allows us to carry out more extensive comparisons. However, due to the slowness of Carthagene (when n = 300, it takes more than several hours to finish), we applied it only to small data sets (n = 100). The most complete comparison was carried out between MSTmap and Record. As we have done in the previous subsection, we employ the number of erroneous pairs to compare the quality of the maps obtained by different tools. The results for n = 100 and m = 100 are summarized in Table 3. A more thorough comparison of MSTmap and Record is presented in Table 4 and Table 5. Several observations are in order. First, MSTmap constructs significantly better maps than the other tools when the input data are noisy. When the data are clean and contain many missing observations (i.e., η = 0 and γ is large), Carthagene produces maps which are slightly more accurate than those by MSTmap. However, if we knew the data were clean, by turning off the error-detection in MSTmap we would obtain maps of comparable quality to Carthagene in a much shorter running time. Please refer to the “wp ′ no err” column for the E statistics of MSTmap when the error detection feature is turned off. Second, Carthagene appears to be better than Record when the data are clean (η = 0). When the data are noisy, Record constructs more accurate maps than Carthagene. Third, MSTmap and Record are both very efficient in terms of running time, and they are much faster than Carthagene. A clearer comparison of the running time between MSTmap and Record is presented in Figure 4. The figure shows that MSTmap is even faster than Record when the data set contains no errors. However as the input data set becomes noisier, the running time for MSTmap also increases. This is because our adaptive error detection scheme needs more iterations to identify erroneous observations, and consequently takes more time. However, this lengthened execution does pay off with a significantly more accurate map. Fourth and last, Table 3, 4 and 5 show that the overall quality of the maps produced by MSTmap is usually very high. In most scenarios, the τ statistic is greater than 0.99. 10.1371/journal.pgen.1000212.g004 Figure 4 Running time of MSTmap and Record with respect to error rate or missing rate or error and missing rate. Every point in the graph is an average of 30 runs. The lines “missing only” correspond to data sets with no error (η = 0, γ is on the x-axis). Similarly, lines “error only” correspond to data sets with no missing (γ = 0, η is on the x-axis), and lines “error and missing” correspond to data sets with equal missing rate and error rate (η = γ is on the x-axis). 10.1371/journal.pgen.1000212.t003 Table 3 Comparison between MSTmap, JoinMap, Carthagene and Record for n = 100 and m = 100. n, m = 100 MSTmap Record Carthagene JoinMap γ η E time E time E time E time 0.00 0.00 1.50 1.3 1.3 2.1 2.5 255.0 1.7 <60 0.00 0.01 15.10 4.0 46.6 1.7 58.2 275.7 - - 0.00 0.05 41.37 7.7 129.1 2.3 300.3 267.4 - - 0.00 0.10 96.53 12.0 450.8 3.2 680.0 265.2 - - 0.00 0.15 221.03 17.0 1064.8 3.7 1378.5 276.8 - - 0.01 0.00 1.83 1.3 34.7 2.0 2.8 300.1 - - 0.05 0.00 6.47 1.2 44.0 2.4 5.4 363.1 - - 0.10 0.00 18.07 1.1 49.7 2.7 7.0 416.6 - - 0.15 0.00 16.13 1.0 64.8 2.9 9.0 486.2 - - 0.01 0.01 11.47 4.2 54.8 2.7 49.0 310.0 53.7 <60 0.05 0.05 52.90 9.9 164.1 3.9 296.4 368.4 370.2 <60 0.10 0.10 140.67 17.1 683.4 6.0 837.0 430.2 - - 0.15 0.15 379.17 23.7 1387.4 7.6 1273.3 500.1 - - Each number presented in the table is averaged over 30 independent runs, except for those of JoinMap, which are averaged over 10 independent runs. Column E reports the average number of erroneous marker pairs. The running time is reported as number of seconds. 10.1371/journal.pgen.1000212.t004 Table 4 Comparison between MSTmap and Record for m = 300,500 and n = 100. MSTmap MSTMap no err. detection Record γ η E τ time E τ time E τ time M = 300, n = 100 0.00 0.00 5.2 0.9998 12.1 6.3 0.9997 11.4 6.6 0.9997 25.7 0.00 0.01 42.7 0.9981 39.0 135.0 0.9940 29.1 135.5 0.9940 17.5 0.00 0.05 136.9 0.9939 80.1 407.9 0.9818 55.4 423.0 0.9811 23.3 0.00 0.10 338.1 0.9849 147.7 1104.5 0.9507 67.9 2946.3 0.8686 26.3 0.00 0.15 612.0 0.9727 221.0 5662.3 0.7475 78.8 8202.9 0.6342 31.9 0.01 0.00 6.1 0.9997 13.5 6.7 0.9997 11.5 107.5 0.9952 19.6 0.05 0.00 19.6 0.9991 12.8 14.4 0.9994 10.6 133.0 0.9941 25.4 0.10 0.00 34.8 0.9984 13.1 17.7 0.9992 10.0 156.8 0.9930 28.8 0.15 0.00 52.1 0.9977 11.5 30.3 0.9986 8.4 197.4 0.9912 32.3 0.01 0.01 32.6 0.9985 41.1 134.4 0.9940 31.6 153.5 0.9932 26.6 0.05 0.05 150.6 0.9933 114.1 399.1 0.9822 73.5 510.7 0.9772 34.7 0.10 0.10 402.0 0.9821 228.1 1089.9 0.9514 84.2 3626.1 0.8383 42.4 0.15 0.15 757.8 0.9662 356.7 5605.2 0.7500 95.6 10970.7 0.5108 54.5 m = 500, n = 100 0.00 0.00 8.6 0.9999 33.6 10.6 0.9998 32.4 10.4 0.9998 32.5 0.00 0.01 68.8 0.9989 105.4 219.8 0.9965 83.4 233.5 0.9963 57.0 0.00 0.05 207.4 0.9967 239.5 663.5 0.9894 171.6 1308.4 0.9790 75.3 0.00 0.10 532.1 0.9915 467.6 2226.6 0.9643 198.8 9797.2 0.8429 78.8 0.00 0.15 1014.9 0.9837 698.7 9652.8 0.8452 247.7 32850.8 0.4733 90.2 0.01 0.00 11.5 0.9998 32.4 11.9 0.9998 32.2 183.3 0.9971 60.2 0.05 0.00 30.0 0.9995 29.2 20.6 0.9997 28.7 225.6 0.9964 84.4 0.10 0.00 63.4 0.9990 27.4 34.6 0.9994 27.0 249.1 0.9960 95.8 0.15 0.00 86.2 0.9986 24.7 55.8 0.9991 24.4 312.0 0.9950 104.1 0.01 0.01 53.7 0.9991 106.5 224.8 0.9964 91.9 238.3 0.9962 81.4 0.05 0.05 238.0 0.9962 349.3 639.6 0.9897 234.5 4794.9 0.9231 99.0 0.10 0.10 629.0 0.9899 739.0 1694.7 0.9728 291.0 23968.4 0.6157 121.6 0.15 0.15 1256.5 0.9799 1256.0 8501.9 0.8637 267.0 37382.9 0.4007 162.0 The columns under “MSTMap no err. detection” contains results obtained when running MSTmap with no error detection. We report Kendall's τ statistic and the number E of erroneous marker pairs. The running time is reported in seconds. 10.1371/journal.pgen.1000212.t005 Table 5 Comparison between MSTmap and Record for m = 100,300,500 and n = 200. MSTmap MSTMap no err. detection Record γ η E τ time E τ time E τ time M = 100, n = 200 0.00 0.00 3.2 0.9987 4.0 0.8 0.9997 3.9 0.5 0.9998 4.4 0.00 0.01 6.7 0.9973 8.3 19.1 0.9923 6.3 18.4 0.9926 3.8 0.00 0.05 19.1 0.9923 12.2 65.5 0.9735 7.2 55.6 0.9775 5.0 0.00 0.10 58.5 0.9764 17.1 192.3 0.9223 8.4 196.0 0.9208 6.5 0.00 0.15 112.0 0.9548 22.6 533.0 0.7847 9.1 513.9 0.7924 7.9 0.01 0.00 2.3 0.9991 4.2 0.6 0.9998 3.8 16.9 0.9932 4.0 0.05 0.00 5.3 0.9979 3.8 1.2 0.9995 3.6 19.2 0.9922 4.5 0.10 0.00 11.2 0.9955 3.4 2.0 0.9992 3.2 22.7 0.9908 5.2 0.15 0.00 11.2 0.9955 3.2 3.2 0.9987 3.0 26.1 0.9895 5.6 0.01 0.01 4.5 0.9982 8.4 19.9 0.9919 6.5 22.9 0.9908 4.9 0.05 0.05 25.2 0.9898 15.6 67.8 0.9726 7.7 91.1 0.9632 10.3 0.10 0.10 71.2 0.9712 24.8 171.8 0.9306 8.5 293.2 0.8815 13.5 0.15 0.15 138.2 0.9442 40.0 587.0 0.7628 9.3 1020.9 0.5875 13.2 m = 300, n = 200 0.00 0.00 8.2 0.9996 38.1 1.9 0.9999 34.9 2.4 0.9999 98.1 0.00 0.01 19.7 0.9991 75.8 59.1 0.9974 58.1 59.6 0.9973 35.7 0.00 0.05 64.4 0.9971 111.8 188.3 0.9916 73.5 186.6 0.9917 49.0 0.00 0.10 169.3 0.9925 187.0 525.8 0.9766 93.9 534.0 0.9762 55.2 0.00 0.15 328.2 0.9854 254.3 1350.3 0.9398 94.0 2269.8 0.8988 63.5 0.01 0.00 8.5 0.9996 35.4 1.8 0.9999 34.9 48.3 0.9978 38.3 0.05 0.00 15.9 0.9993 32.4 3.8 0.9998 31.7 54.1 0.9976 43.1 0.10 0.00 28.4 0.9987 29.6 8.9 0.9996 29.0 62.5 0.9972 49.6 0.15 0.00 38.2 0.9983 27.5 12.2 0.9995 26.8 81.0 0.9964 55.4 0.01 0.01 16.7 0.9993 75.7 63.5 0.9972 61.0 66.1 0.9971 45.6 0.05 0.05 66.9 0.9970 146.9 194.0 0.9914 84.3 227.3 0.9899 63.9 0.10 0.10 205.7 0.9908 281.3 517.4 0.9769 96.8 758.6 0.9662 82.3 0.15 0.15 418.5 0.9813 457.8 1179.5 0.9474 113.5 5464.8 0.7563 104.7 m = 500, n = 200 0.00 0.00 12.1 0.9998 99.0 2.8 1.0000 97.1 4.2 0.9999 123.9 0.00 0.01 32.3 0.9995 206.7 100.1 0.9984 162.1 100.0 0.9984 113.8 0.00 0.05 104.1 0.9983 295.5 282.5 0.9955 223.2 323.6 0.9948 149.9 0.00 0.10 291.3 0.9953 495.0 810.5 0.9870 287.8 1011.1 0.9838 158.7 0.00 0.15 542.5 0.9913 772.4 2099.5 0.9663 323.0 11335.7 0.8183 184.6 0.01 0.00 15.1 0.9998 96.9 3.3 0.9999 95.5 79.8 0.9987 120.9 0.05 0.00 28.7 0.9995 90.1 6.0 0.9999 88.4 88.6 0.9986 133.5 0.10 0.00 47.4 0.9992 83.0 10.7 0.9998 81.3 107.1 0.9983 146.5 0.15 0.00 56.5 0.9991 77.0 17.7 0.9997 75.5 123.1 0.9980 167.2 0.01 0.01 30.2 0.9995 216.0 98.4 0.9984 172.7 183.8 0.9971 143.9 0.05 0.05 112.0 0.9982 463.0 331.4 0.9947 260.9 632.2 0.9899 193.2 0.10 0.10 342.1 0.9945 852.5 940.9 0.9849 349.4 2110.1 0.9662 226.1 0.15 0.15 725.9 0.9884 1506.2 2190.2 0.9649 368.7 15200.3 0.7563 274.5 The columns under “MSTMap no err. detection” contains results obtained when running MSTmap with no error detection. We report Kendall's τ statistic and the number E of erroneous marker pairs. The running time is reported in seconds. An extensive comparison of MSTmap and Record for other choices of m and n is presented in Table 4 and Table 5. Notice that even without error detection, MSTmap is more accurate than Record. We have also compared MSTmap, Record, JoinMap and Carthagene on real genotyping data for the barley project. We carried out several rounds of cleaning the input data after inspecting the output from MSTmap (in particular, we focused on the list of suspicious markers and genotype calls reported by MSTmap), then the data set was fed into MSTmap, Record, JoinMap and Carthagene. The results show that the genetic linkage maps obtained by MSTmap and JoinMap are identical in terms of marker orders. MSTmap, Carthagene and Record differ only at the places where there are missing observations. At those locations, MSTmap groups markers in the same bin, while Carthagene and Record split them into two or more bins (at a very short distance, usually less than 0.1 cm). Conclusion We presented a novel method to cluster and order genetic markers from genotyping data obtained from several population types including doubled haploid, backcross, haploid and recombinant inbred line. The method is based on solid theoretical foundations and as a result is computationally very efficient. It also gracefully handles missing observations and is capable of tolerating some genotyping errors. The proposed method has been implemented into a software tool named MSTMap, which is freely available in the public domain at http://www.cs.ucr.edu/~yonghui/mstmap.html. According to our extensive studies using simulated data, as well as results obtained using a real data set from barley, MSTMap outperforms the best tools currently available, particularly when the input data are noisy or incomplete. The next computational challenge ahead of us involves the problem of integrating multiple maps. Nowadays, it is increasingly common to have multiple genetic linkage maps for the same organism, usually from a different set of markers obtained with a variety of genotyping technologies. When multiple genetic linkage maps are available for the same organism it is often desirable to integrate them into one single consensus map, which incorporates all the markers and ideally is consistent with each individual map. The problem of constructing a consensus map from multiple individual maps remains a computationally challenging and interesting research topic. Supporting Information Text S1 Supplementary Text: Efficient and Accurate Construction of Genetic Linkage Maps. (0.08 MB PDF) Click here for additional data file.
            Bookmark
            • Record: found
            • Abstract: found
            • Article: not found

            Identification and characterization of shared duplications between rice and wheat provide new insight into grass genome evolution.

            The grass family comprises the most important cereal crops and is a good system for studying, with comparative genomics, mechanisms of evolution, speciation, and domestication. Here, we identified and characterized the evolution of shared duplications in the rice (Oryza sativa) and wheat (Triticum aestivum) genomes by comparing 42,654 rice gene sequences with 6426 mapped wheat ESTs using improved sequence alignment criteria and statistical analysis. Intraspecific comparisons identified 29 interchromosomal duplications covering 72% of the rice genome and 10 duplication blocks covering 67.5% of the wheat genome. Using the same methodology, we assessed orthologous relationships between the two genomes and detected 13 blocks of colinearity that represent 83.1 and 90.4% of the rice and wheat genomes, respectively. Integration of the intraspecific duplications data with colinearity relationships revealed seven duplicated segments conserved at orthologous positions. A detailed analysis of the length, composition, and divergence time of these duplications and comparisons with sorghum (Sorghum bicolor) and maize (Zea mays) indicated common and lineage-specific patterns of conservation between the different genomes. This allowed us to propose a model in which the grass genomes have evolved from a common ancestor with a basic number of five chromosomes through a series of whole genome and segmental duplications, chromosome fusions, and translocations.
              Bookmark
              • Record: found
              • Abstract: found
              • Article: not found

              A high density barley microsatellite consensus map with 775 SSR loci.

              A microsatellite or simple sequence repeat (SSR) consensus map of barley was constructed by joining six independent genetic maps based on the mapping populations 'Igri x Franka', 'Steptoe x Morex', 'OWB(Rec) x OWB(Dom)', 'Lina x Canada Park', 'L94 x Vada' and 'SusPtrit x Vada'. Segregation data for microsatellite markers from different research groups including SCRI (Bmac, Bmag, EBmac, EBmag, HVGeneName, scsssr), IPK (GBM, GBMS), WUR (GBM), Virginia Polytechnic Institute (HVM), and MPI for Plant Breeding (HVGeneName), generated in above mapping populations, were used in the computer program RECORD to order the markers of the individual linkage data sets. Subsequently, a framework map was constructed for each chromosome by integrating the 496 "bridge markers" common to two or more individual maps with the help of the computer programme JoinMap 3.0. The final map was calculated by following a "neighbours" map approach. The integrated map contained 775 unique microsatellite loci, from 688 primer pairs, ranging from 93 (6H) to 132 (2H) and with an average of 111 markers per linkage group. The genomic DNA-derived SSR marker loci had a higher polymorphism information content value (average 0.61) as compared to the EST/gene-derived SSR loci (average 0.48). The consensus map spans 1,068 cM providing an average density of one SSR marker every 1.38 cM. Such a high-density consensus SSR map provides barley molecular breeding programmes with a better choice regarding the quality of markers and a higher probability of polymorphic markers in an important chromosomal interval. This map also offers the possibilities of thorough alignment for the (future) physical map and implementation in haplotype diversity studies of barley.
                Bookmark

                Author and article information

                Journal
                BMC Genomics
                BMC Genomics
                BioMed Central
                1471-2164
                2009
                4 December 2009
                : 10
                : 582
                Affiliations
                [1 ]Department of Botany & Plant Sciences, University of California (UCR), Riverside, CA, 92521, USA
                [2 ]Department of Computer Sciences, University of California (UCR), Riverside, CA, 92521, USA
                [3 ]Scottish Crop Research Institute (SCRI), Invergowrie, Dundee, DD2 5DA, UK
                [4 ]Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Corrensstrasse 3, D-06466, Gatersleben, Germany
                [5 ]USDA-ARS Biosciences Research Lab, Fargo, ND, 58105-5674, USA
                [6 ]Department of Crop and Soil Science, Oregon State University, Corvallis, OR, 97331, USA
                [7 ]Research Institute for Bioresources, Okayama University, Kurashiki, 710-0046, Japan
                [8 ]USDA-ARS, Cornell University, Ithaca, NY, 14853, USA
                [9 ]Department of Crop and Soil Sciences, Washington State University, Pullman, WA, 99164, USA
                [10 ]Department of Agronomy and Plant Genetics, University of Minnesota, St. Paul, MN, 55108, USA
                [11 ]Southern California Genotyping Consortium, University of California, Los Angeles, CA, 90095, USA
                [12 ]Monsanto Research Centre, Bangalore, 560092, India
                [13 ]Google, Mountain View, CA, 94043, USA
                [14 ]Faculty of Biology, University of Latvia, 4 Kronvalda Boulevard, Riga, LV-1586, Latvia
                [15 ]University of Copenhagen, Frederiksberg C, DK-1871, Denmark
                [16 ]NIH National Cancer Institute, Neuro-Oncology Branch, Bethesda, MD, 20892, USA
                [17 ]Department of Plant Pathology, Iowa State University, Ames, Iowa, 50011, USA
                [18 ]International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Patancheru - 502 324, Andhra Pradesh, India
                [19 ]NetSocial Marketing, Le Puech, 15600 Montmurat, France
                Article
                1471-2164-10-582
                10.1186/1471-2164-10-582
                2797026
                19961604
                6146e043-55f4-4c76-aa10-837f3e1797b2
                Copyright ©2009 Close et al; licensee BioMed Central Ltd.

                This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

                History
                : 31 July 2009
                : 4 December 2009
                Categories
                Research article

                Genetics
                Genetics

                Comments

                Comment on this article