Introduction Man's earliest agricultural systems were based on the captive management of sheep and goats. The transition from hunting to animal husbandry involved human control over the reproduction, diet, and protection of animals. The process of domestication was initiated approximately 11,000 years ago in the Fertile Crescent . The impact was a profound redirection of human society, as domesticated livestock and plants increased the stability of human subsistence and fuelled population growth and expansion. Domestication also reshaped the morphology, behaviour, and genetics of the animals involved, with the first consequences likely to have included changes to coat pigmentation and horn morphology. Sheep were first reared for access to meat before human mediated specialisation for wool and milk commenced ca 4,000–5,000 years ago . Phenotypic radiation under selection is ongoing, resulting in a spectrum of modern breeds adapted to a diverse range of environments and exhibiting the specialised production of meat, milk, and fine wool. The last few hundred years has seen the pace of genetic gain increase dramatically through the division of animals into breeds, the implementation of quantitative genetics methodology, and the use of artificial insemination to prioritise genetically superior rams. Patterns of genetic variation have long proven insightful for the study of domestication, breed formation, population structure, and the consequences of selection. Variation within the mitochondrial genome has documented the global dispersal of two major haplogroups in modern sheep ,. Analysis of endogenous retroviruses suggests the development of breeds has occurred in multiple waves, where primitive breeds have been displaced by populations which display improved production traits . Investigations into the genetic relationship between populations have primarily relied on a modest collections of autosomal microsatellites –, Y chromosomal markers , or SNP . To date, the majority of populations tested have been European-derived breeds. This prompted assembly of the global sheep diversity panel, which contains animals from 74 diverse breeds sampled from Asia, Africa, South-West Asia (the Middle East), the Caribbean, North and South America, Europe, and Australasia. Our goal in assembling this animal resource was 2-fold. Firstly, we sought to examine levels and gradients of genetic diversity linking global sheep populations to better understand the genetic composition and history of sheep. We therefore genotyped all of the animals in the global diversity panel using the ovine SNP50 Beadchip, an array consisting of approximately 50,000 evenly spaced SNP. We present the relationship between breeds in terms of divergence time, estimated from the extent of haplotype sharing. Secondly, we sought to characterise the genetic legacy that selection and adaptation have imparted on the sheep genome. By performing a genome-wide scan for the signatures of selection, 31 genomic regions were identified that contain genes for coat pigmentation, skeletal morphology, body size, growth, and reproduction. By combining the collection of a global sample of ovine breeds with the ability to interrogate 50,000 genetic loci, the results provide unprecedented insight into the phylogeographic structure of sheep populations and the results of centuries of breeding practices. Results High Levels of Polymorphism and Genetic Diversity Analysis of genetic variation was performed for 2,819 animals in the global sheep diversity panel. Breeds were sampled from each continent across the species range (Figure 1), including six breeds from both Africa and America, seven from South-West Asia (the Middle East), eight from Asia, and the rest from northern, north-western, central, and southern and south-western Europe (Table S1 lists the breed and their geographic origin). All animals were genotyped using the ovine SNP50 Beadchip, an array consisting of SNP derived from three separate sequencing experiments (Roche 454, Illumina GA and Sanger sequencing; Table S2). A series of quality control filters were applied to identify 49,034 SNP used in subsequent analysis (Table S3). Levels of SNP polymorphism were generally high, with greater than 90% of loci displaying polymorphism within the majority of breeds (Table S4). The distribution of minor allele frequency (MAF) differed between population groups chosen to reflect the geographic origin of breed development. African and Asian breeds had an excess of low MAF SNP ( 0.05) using PLINK to identify 22,678 SNP. Secondly, 32,847 SNP that retain polymorphism within wild feral sheep were subjected to the same LD-based SNP pruning (>0.05) to identify 20,279 SNP. The PCA results obtained did not differ significantly dependent on the SNP set used. Model-based clustering was performed using the admixture model, correlated allele frequencies, and 15,000 burnin and 35,000 simulation cycles in STRUCTURE version 2.3 . Convergence was checked using two runs for each value of K (number of subpopulations). For supervised clustering, prior population information was introduced from six meta-populations consisting of regional pool of breeds considered to represent ancestral populations. The same meta-populations were used for updating the allele frequencies during the simulations. NeighborNet graphs were constructed from a matrix of Reynolds' distances using Splitstree . Estimates of Historic Effective Population Size, Extent of LD, Haplotype Sharing, and Divergence Times Between Breeds To estimate historic effective population size for each breed, the degree of linkage disequilibrium (LD) was calculated as r2 between all SNP pairs where MAF for each SNP in the pair was >0.10. r2 values were grouped into bins based on the distance between SNP from the physical map. Nt was then calculated as (1−r2 )/(4cr2 ), where c is the distance between the SNPs in Morgans (we assumed 100 Mb = 1 Morgan) and Nt is the effective population size t generations ago, where t = 1/2c. The most recent estimate of effective size was taken as Nt when c = 1 Mb. We performed simulations to assess the sensitivity of the estimates of effective population size over generations based on LD, in populations with and without admixture events (Figure S6). A mutation-drift model was used in the simulations following . The population consisted of individuals made up of a chromosome segment 50 Mb long with 6,901 SNP. A population of individuals was simulated with an initially very large population size 10,000 generations ago, declining to a small effective population size in recent generations. In the final 420 generations, the population was split into two “breeds.” In the non-admixed population, there was complete divergence between the breeds for the 420 generations. In the first admixed population, there was an admixture event, with crossing between the breeds (matings chosen at random across the two breeds) 220 generations ago. The admixing lasted 20 generations, after which the breeds diverged for a further 200 generations, with no more admixture events. LD (r2) was calculated between all marker papers and Ne estimated at different times in the past as described for the real data. Five replicate simulations were performed for each scenario. The extent of haplotype sharing among populations was characterised with the r statistic, where r is a signed r2 . A high correlation between r values for all locus pairs separated by the same physical distance among two breeds requires that the same haplotypes are found within both breeds. This means the sign of the r statistic is preserved across breeds only if the phase relationship among alleles is the same in both populations (leading to a high value for r if this is the case). The correlation of r between breeds was calculated for SNP separated by <10 kb, 10–25 kb, 25–50 kb, 50–100 kb, and 100–250 kb (Figures S7, S8, S9). There will be some error in calculating the correlation of r between two breeds due to finite sampling of haplotypes within a breed (e.g., limited sample size). To determine the extent of this error, we calculated the correlation of the r values at these different lengths of haplotypes for the Merino and Industry Merino samples, which are samples from the same breed. This gave a correlation between the r values for each bin size of 0.6. All correlations of r values for all breed comparisons were then divided by 0.6 to correct for sampling. Only corrected values are presented. As detailed in  the change in correlation of r between two breeds with increasing marker distance can be used to estimate generations since divergence from a common ancestral population. From , the expectation for r after T generations of divergence is E(r T) = e −2cT . The natural logarithm of the expected correlation of r then follows a linear decrease as a function of distance with slope −2T, and this was used to calculate divergence time between all breeds (Figure S10). Detection of Genomic Regions Under Selection Global F ST was calculated as described by . Raw values were ranked and used to identify regions under position selection. Centred on the top SNP (0.1%), neighbouring markers were included until consecutive markers were encountered ranking outside of the top 5%. The second marker was excluded and the Mb position of each region was determined using sheep genome assembly version 1. SNP-specific F ST values were smoothed using a local variable bandwidth estimator as described in  and plotted as a line in Figures 6 and 7. To identify genomic regions with shared selection signals across breeds, raw F ST within each population was smoothed into 500 genomic divisions (98 SNP per region). The number of breeds with smoothed F ST in excess of one standard deviation of the mean was plotted for values at each tail of the distribution. Analysis was performed to identify gene ontology (GO) terms that were significantly overrepresented in 181 genes residing within the 31 regions under selection (Table 1). The terms associated with the 181 genes were compared against a background set of 11,098 genes. Each of the 11,098 genes contain a SNP present on the SNP50 Beadchip, or a SNP within 2.5 Kb. Comparison of the two gene lists (target and background) was performed using the software GOrilla, which implements a hypergeometric distribution and mHG p value approach to determine significance . Supporting Information Figure S1 Minor allele frequency (MAF) based on SNP type. MAF was estimated for each of five geographically defined breed groupings separately using either 33,115 SNP derived using Roche 454 (top panel) or 15,427 SNP derived using Illumina GAII (bottom panel). The breed membership of each group is given in Table S4 and the percentage of SNP is plotted for each frequency bin. The excess of low MAF SNP (<0.1) present within African and Asia breeds was more pronounced within the 454 derived SNP when compared to those obtained using Illumina GA sequencing. This does not reflect differences between sequencing technologies, but rather the composition of animals used during the two SNP discovery experiments. The 454 SNP were discovered using six animals, none of which were sampled from Africa or Asia. The GA SNP were discovered using 60 animals, 21 of which were drawn from Africa and Asia (Table S2). (TIF) Click here for additional data file. Figure S2 Diversity between breeds compared using different SNP panels. Expected heterozygosity was calculated using four SNP panels: “49034” contains all SNP passing quality control (Table S3); “20279” was derived by pruning 32,847 SNP that retain polymorphism within wild feral sheep using LD (refer to the Materials and Methods section for detail); “454” contains 33,115 SNP ascertained using a small discovery panel; and “GAII” contains 15,247 SNP ascertained using a larger and genetically diverse discovery panel. Bold lines indicate breeds used in SNP discovery, and colours are used to indicate breed origin. Comparison of heterozygosity obtained from each panel revealed common SNP enriched within the 20279 panel produced the highest values for each breed, and that 454 SNP returned higher values than the GAII SNP in most breeds. Importantly, while the absolute value of heterozygosity was dependent on the SNP panel used, the ranking of breeds remained generally stable when calculated using different panels. This indicates conclusions concerning relative diversity between breeds and regions are unlikely to be heavily influenced by ascertainment bias. (TIF) Click here for additional data file. Figure S3 Principal components analysis (PCA) of European-derived sheep. To visualise the complex relationships between European-derived populations, PCA was performed separately using northern European breeds (880 animals, left panel) and central and southern European breeds (438 animals, right panel). The inbred Soay, Boreray, and Macarthur Merino have been omitted. In the left panel, PC1 and PC2 resolve all British and most central European breeds and show the intermediate position of the Swiss crossbred Swiss Alpine, Swiss Mirror, and Swiss Brown-Black Mountain sheep. Individuals from three geographically distinct populations of Texel formed a single cluster (German, New Zealand, and Scottish Texel, Table S1). This indicates that even where sampling is very broad, individuals from the same breed form a single cluster separate from other related breeds. In the right panel most Central European, Merino, and Mediterranean breeds were resolved. PC3 separates Rambouillet, Chinese Merino, and Australian Merino. PC1 to PC4 did not resolve the Australian, Australian Polled, and the Australian Industry Merino and only partially the Italian and Spanish breeds. (TIF) Click here for additional data file. Figure S4 Model-based clustering of sheep. The diverse origin of the sheep genome was examined by model-based clustering using STRUCTURE . This visualized a decomposition of the genome in a predefined number of K components, which may correspond to the genomes of ancestral populations. Representative results are shown for unsupervised clustering using K = 2, 5, and 8. At K = 2, a marked division is observed between European and Asian breeds, which corresponded to the first principal component in PCA (Figure 2). For increasing values of K up to K = 9, the results were reproducible and revealed clustering into British, Mediterranean (Southern and Western European), South-West Asian (Middle Eastern), Asian, and African breeds. Separate clusters were assigned to the Soay and Boreray (at K = 5), Dorset (at K = 6), Friesian (K = 7), and Wiltshire breeds (K = 9). The admixed nature of Brazilian and Caribbean animals can be seen at higher values of K. Essentially the same patterns were obtained with the Admixture program  using either 20,279 SNP or 22,678 SNP datasets, although some clusters appeared at different K (unpublished data). Supervised clustering (S, bottom panel) was performed using six predefined regional genomic components reconstructed by pooling breeds as indicated by the coloured horizontal line. (TIF) Click here for additional data file. Figure S5 Proportion of variance explained in principal component analysis. Genetic distance between each pair of animals in the global sheep diversity panel was analysed using PCA. The proportion of variance explained by each of the first 50 PCs is given at left (blue diamonds) and the cumulative proportion is shown at right (red triangles). The first PC explained approximately 3% of the variance, while the first 20 PCs together explain 16.3% of the total variation. (TIF) Click here for additional data file. Figure S6 Effective population size inferred from linkage disequilibrium (r2) with and without an admixture event in a simulated sheep population. The decline in effective population size was simulated to be similar to that observed in the real HapMap data for many breeds. The admixture event occurred 220 to 200 generations ago and lasted 20 generations before the breeds diverged for a further 200 generations without further admixture. LD (r2) was calculated between SNP and Ne was estimated at different times in the past as described for the real data. This shows the admixture event did affect the inferred Ne, with a higher estimate for the generations in which the admixture event is occurring. However, impact on estimates of Ne for generations that were not within 100 generations of the admixture event were minimal. Further, the pattern of inferred Ne for the admixed population suggests that there is information in the pattern of LD to pick up such events. (TIF) Click here for additional data file. Figure S7 Extensive sharing of short-length haplotypes between breeds. The persistence of haplotype phase between breeds was evaluated using the signed r statistic . The correlation of r was calculated using SNP pairs separately by short physical distances (0–10 Kb). The values are given as heat map, where each cell represents haplotype sharing between a breed pair. This revealed a high degree of conservation in LD phase between sheep populations. (TIF) Click here for additional data file. Figure S8 Haplotypes sharing between breeds at intermediate physical distances. The persistence of haplotype phase between breeds was evaluated using the signed r statistic . The correlation of r was calculated using SNP pairs separately by short physical distances (10–25 Kb). The values are given as heat map, where each cell represents haplotype sharing between a breed pair. The values were used to plot haplotype sharing in Figure 1C. (TIF) Click here for additional data file. Figure S9 Long-distance haplotype sharing between breeds. The persistence of haplotype phase between breeds was calculated in the same way as in Figure S6 using SNP pairs separated by 100–250 kb. This revealed much lower levels of conservation in LD phase between breeds. Some pair-wise population comparisons retained high LD phase, including the three geographically dispersed populations of the Texel, the Merino and its derivatives, and finally selections lines within the same breed such as the Meat Lacaune and Milk Lacaune. Long-range haplotype sharing was also detected between Swiss breeds (e.g., SBS, SMS, and SWA) and both British and Merino-type breeds, suggesting a mixed origin. (TIF) Click here for additional data file. Figure S10 Divergence time between breeds. Divergence time (in generations) between breeds was calculated from the extent of haplotype sharing that persists at increasing physical distance between SNP pairs . Breed pairs separated by short divergence times are represented by dark squares, while breeds separated by longer divergence are given as progressively lighter squares. The divergence time values were used to generate the NeighborNet graph in Figure 3. (TIF) Click here for additional data file. Figure S11 The effect of SNP ascertainment on breed relationships. To evaluate the effect of SNP ascertainment on visualised relationships between sheep breeds, NeighborNet graphs were constructed and compared using five different SNP sets. Graphs were constructed using all 49,034 SNP (A); 33,115 SNP identified using Roche 454 (B); 15,427 SNP identified using Illumina GA (C); 22,678 SNP identified by application of LD pruning of the full SNP set (D); and 20,279 SNP polymorphic in non-domestic sheep that displayed LD<0.05 (E). The topology of each graph is very similar, indicating SNP ascertainment doe not have a strong impact on interpretation of genetic relationships. (TIF) Click here for additional data file. Table S1 Global Sheep Diversity Panel. (DOC) Click here for additional data file. Table S2 SNP discovery for the ovine SNP50 BeadChip. (DOC) Click here for additional data file. Table S3 Quality control filters used to remove SNP. (DOC) Click here for additional data file. Table S4 Genetic diversity and recent effective population size. (DOC) Click here for additional data file. Table S5 Enrichment analysis of gene ontology (GO) terms for genes in regions under selection. (DOC) Click here for additional data file. Table S6 Selection signals identified in both sheep and cattle. (DOC) Click here for additional data file.