Introduction The vast phenotypic diversity of the domestic dog, its unique breed structure, and growing genomic resources present a powerful system for genetic dissection of traits with complex inheritance (reviewed in [1]). In the past three years, dozens of genetic variants and Quantitative Trait Loci (QTL) have been identified which influence breed-defining traits including those for skeletal size [2], coat color [3],[4], leg length [5], hairlessness [6], wrinkled skin [7], hair length, curl, and texture [8], and presence of a dorsal fur ridge [9]. Here, we describe the development and application of a high-density map of common genetic variation in the domestic dog (the “CanMap Project”). We simultaneously delineate genomic regions underlying 57 morphological traits defined at the breed level, including body weight, absolute and relative length and width of long bones, absolute and proportional skull length and width, teeth characters, and a key domestication correlate—prick versus floppy ears (see Figure S1). We are particularly interested in assessing whether the majority of phenotypic variation among breed-affiliated dogs is a consequence of QTLs of large effect or whether much of the variation is attributable to many QTLs of modest or small effect. The latter situation resembles the emerging picture from genome-wide association studies in humans, laboratory animals, and outcrossed domesticated plants such as maize [10],[11]. In those systems, the genetic architecture of most phenotypes tested to date—including body size, body mass index (BMI), lipid level, and flowering time—appear to be under the control of hundreds of genes, each contributing a very modest amount to the overall heritability of the trait. The alternative model is that mutations of large phenotypic effect underlie most of these traits in dogs and that the same variants have been transferred to a wide diversity of dog breeds leading to phenotypic diversity from a narrow genetic base [5],[8],[12]. To distinguish between these two genetic models and to understand the extent to which domestication and artificial selection have shaped the dog genome, we genotyped more than 120,000 potential single nucleotide polymorphisms using DNA isolated from 915 dogs representing 80 American Kennel Club (AKC) recognized breeds as well as 83 wild canids that included wolves, jackals, and coyotes and 10 Egyptian shelter dogs [13]. We developed a new genotype-calling algorithm for Affymetrix array data (MAGIC) that relaxes key assumptions and limitations of current callers such as Hardy-Weinberg equilibrium among genotype clusters. This dramatically improved the performance of the Affymetrix v2 Canine GeneChip, producing 99.9% concordance across 154 technical replicates for 60,968 SNPs (see Methods). The high density of markers and the inclusion of wild canids and outbred village dogs allowed for unprecedented resolution of the effect of domestication and artificial selection on the dog genome. Detailed results can be obtained from the Canine SNP browser track hosted at http://genome-mirror.bscb.cornell.edu/. Results Genomic Signatures of Dog Demography History To investigate how human-directed breeding has altered the landscape of the dog genome, we quantified pairwise SNP linkage disequilibrium (LD), haplotype diversity across 15-SNP windows (as inferred by fastPhase [14]) and runs of homozygosity (ROHs) greater than 1 Mb for each individual (indicative of autozygosity) using the genotype data from the 59 breeds with ≥10 individuals and a population of village dogs and wolves (see Methods). Long ROHs are a product of recent inbreeding, indicative of contemporary population size and mating system, whereas haplotype diversity and LD across shorter genomic scales ( 0.55 and minor allele frequency (MAF) >15% across CanMap breeds. Derived Allele Frequency Marker FST Dog Wolf Coyote Jackal FST Region Trait X.105092232 0.795 0.594 1.000 0.000 0.000 1045486877–108201633 body size; skull shape 10.11000274 0.713 0.593 0.031 0.000 0.000 10707312–11616330 ear type [23]; body size [7],[23],[47] 13.11659792 0.710 0.337 0.000 0.000 0.000 11659792–11660194 furnishings[8] 15.44267011 0.673 0.437 0.008 0.000 0.000 44187156–44427593 body size[ [2],[23],[47] ] 18.23298242 0.671 0.196 0.287 0.042 0.778 singleton height [5] X.87234117 0.658 0.642 0.505 0.000 0.267 86813164–87299370 limb/tail length 3.93933450 0.650 0.219 0.111 0.000 0.250 singleton body size 24.26359293 0.641 0.426 0.000 0.000 0.000 26359293–26370499 coat color[4] 20.24889547 0.630 0.561 0.382 0.286 0.000 24674148–24969549 coat color[22] 1.96282083 0.594 0.580 0.227 0.000 0.667 96103038–96338823 snout ratio[23] 5.66708382 0.576 0.437 0.016 0.000 0.000 singleton coat color[21] 1.71150281 0.573 0.160 0.177 0.000 0.000 71150281–71206706 26.10903577 0.569 0.158 0.000 0.000 0.000 singleton 23.8488359 0.567 0.483 0.024 0.250 0.000 singleton 1.59179746 0.554 0.188 0.629 0.550 0.000 59179746–59182125 snout length 21.51391768 0.554 0.293 0.414 0.929 0.000 singleton 15.32610857 0.554 0.294 0.009 0.000 0.000 32383555–33021330 1.114924791 0.553 0.209 0.000 0.000 0.000 114914236–114924791 29.30499875 0.553 0.205 0.359 0.000 0.000 singleton 16.55231367 0.551 0.155 0.145 0.125 0.000 singleton 2.18668826 0.551 0.475 0.066 0.000 0.000 singleton 10.69071007 0.550 0.435 0.140 0.500 0.000 69071007–69166227 Derived allele determined by the minor allele in jackals (black-backed and side-striped) and coyotes. Each FST region is defined as the genomic region surrounding the top FST hit where neighboring SNPs on the array also had FSTs above the 95th percentile (FST = 0.4). Traits with associations to each region are listed; underlining denotes an association from this study. Genome-Wide Association Mapping of Morphological Differences among Dog Breeds We investigated the genetic architecture of morphological variation in dogs using a breed-mapping approach to look for correlations between allele frequency and average phenotypic values across 80 breeds at 60,968 SNPs (see Methods). We computed male breed-average phenotypes for each of 20 different tape measurements, and also computed breed averages from museum specimens for 15 long bone and 20 skull/tooth dimensions (Figure S1). For all 55 measures, we conducted association scans with and without controlling for overall breed body size, and also controlled for breed relatedness by using breed-average relatedness as a random effect in the linear mixed model. We also looked for genomic regions underlying body size variation and ear floppiness. Body size variation is greater across dog breeds than in any other terrestrial species [25], with smaller stature likely being selected for during domestication and with large and small body sizes being alternatively selected for in different breeds. Our scan for body size (defined as log(body weight)) yielded several significant genomic associations, with the six strongest signals occurring at CFA15.44226659, CFAX.106866624, CFA10.11440860, CFAX.86813164, CFA4.42351982, and CFA7.46842856. The corresponding P-P plot compares the observed distribution of −log10 p values (i.e., blue and red points in Figure 3A) to the expected distribution under a model of no-association (i.e., dashed line which represents equality of Expected and Observed) and demonstrates an excess of significant signals since the tail of the distribution is well above the diagonal dashed line. When the top six regions (and linked SNPs) are removed, the observed p value distribution (i.e., gray points in Figure 3A) is strongly shifted towards the null expectation, suggesting these six QTLs account for the bulk of the association signal in our data. The first four signals are among the highest FST regions in the dog genome (Table 1) with the CFA4 signal also exhibiting an elevated FST (0.46), consistent with diversifying selection among breeds for body size. The signal on CFA15 corresponds to the location of IGF1 which encodes a growth factor previously described to control a significant proportion of size variation across dog breeds [2]. The CFA10 signal corresponds to the location of HMGA2, a gene known to affect body size variation in humans [26] and mice [27]. Both HMGA2 and a locus corresponding to the CFA7 signal, SMAD2, have been previously associated with dog body size [23]. In contrast, the signals on CFA4 and CFAX hits have not previously been associated with body size variation in dogs. Interestingly, the CFA4 signal contains (among other genes) the STC2 locus, a known growth inhibitor in mice [28]. The two signals on the X chromosome lie in separate LD blocks that each contains dozens of genes. Other than IGF1, all the other regions will require fine-mapping in order to confirm a single candidate gene. In all six regions, wolves are not highly polymorphic (MAF 70%) of the variance in most of the morphological traits we studied across domestic dog breeds. A similar pattern of few QTLs of large effect is apparent in a few other genetic systems (e.g., sticklebacks [34]), suggesting this genetic architecture could be a result of recent adaptation and a hallmark of diversifying selection. The dominance of a few genes of large effect likely reflects several unique aspects of selection in dogs. First, many of the modern breeds were created during the Victorian Era where novelty was a focus of selection and breeders favored the preservation of discrete mutations. A single discrete mutation could be placed on a variety of genetic backgrounds by crossing which expanded the range of phenotypic diversity across breeds. For example, the same retrogene responsible for chondrodysplasia or foreshortened limbs (fgf4) is found in nearly 20 distinct breeds today [5]. In contrast, the progressive selection in other domestic species aimed at economically useful quantitative traits such as a high growth rate and fecundity involved subtle differences among individuals selected across many generations and, therefore, likely utilized genes of small effect segregating in an ancestral population [11]. Mutations of large effect are the basis of some domesticated phenotypes, such as the double-muscling gene in cattle [35], but the selective breeding practiced for agriculture was more intensive and sustained and drew on a segregating variation that involved the detection of small differences among individuals. Selection for novelty also led to extreme founder events and/or bottlenecks in many breeds. Coupled with the dog domestication bottleneck, this likely simplified the genetic architecture of quantitative traits, including complex disease phenotypes that are not fixed within breeds and were not the subject of selection for novelty. The rapid genetic drift between isolated breeds (pairwise FST of 25%–30% among any given set of breeds with very few pairs of breeds having significantly smaller FST) enables efficient mapping of the genomic regions underlying variation, even in some cases with un-genotyped collections such as museum specimens. The extreme phenotypic diversity observed among modern domestic dogs is unique among mammalian species, and as such, it offers unique insight regarding both the constraints and potential of evolutionary change under domestication. Methods SNP Calling We genotyped 1,659 samples on Affymetrix v2 Canine arrays which contain probes for over 127,000 SNP markers, and attempted to call genotypes on the 1,400 arrays with highest signal-to-noise intensity ratios. SNP content for this array includes variants found from the boxer genome assembly (25.5% of SNPs), comparison of the boxer and poodle assemblies (11.4% of SNPs), comparison of the boxer to low coverage sequencing from other breeds (59.9% of SNPs), and comparison between dog and wolf sequences (3.2% of SNPs). Similar to previous studies [4], we found that the BRLMM-P algorithm yielded approximately 45,000 SNPs (out of 127K markers present on the array) that passed quality control filtering, and that it consistently over-called heterozygous genotypes. Consequently, we developed a novel genotype calling algorithm, MAGIC (Multidimensional Analysis for Genotype Intensity Clustering), which did not use prior information regarding cluster locations, assumptions about Hardy-Weinberg equilibrium, or complex normalization of probe intensities (see Text S1 and Tables S2 and S3). On these same 1,400 chips, MAGIC called 60,968 SNPs that passed our quality control filters, yielding a call rate of 94.6% and a concordance rate of over 99.9% for samples run in duplicate. Over 99% of SNPs used in our analysis are within 121 kb of another SNP (median = 8.5 kb). As a final quality control step, we applied the hidden Markov model described in [16] to detect genomic regions of autozygosity within each of the 1,400 CanMap individuals. Since mean autozygosity was above 20% in the dataset, we expect nearly 300 individuals to be within an autozygous segment at any SNP on the array. All of these ∼300 individuals should have homozygous genotype calls for that SNP, although in practice some heterozygous calls can be expected owing to gene conversion or imperfect inference of the autozygous segments. SNPs with poor genotyping quality, specifically those SNPs with a spurious excess of heterozygous calls, will exhibit relatively high rates of heterozyosity even within inferred segments of autozygosity. We excluded 451 SNPs with elevated heterozygosity within autozygous segments (here defined as >10%). Visual inspection of the cluster plots suggests many of these SNPs occurred within segmental duplications or copy number variable regions, or contained a substantial fraction of null alleles mistakenly called as heterozygous. LD Decay We summarized pairwise LD by the genotype correlation coefficient (r 2). For all pairs of autosomal SNPs, r 2 was calculated using the --r2 --ld-window 99999 --ld-window-r2 0 command in PLINK. Since we calculated r 2 using the genotypes directly without phasing the data, this analysis is robust to phasing ambiguities. To compare LD decay among breeds with different sample sizes, we selected a random subset of 10 dogs from each of the 59 breeds having 10 or more dogs. Within each breed, we calculated r 2 between all pairs of SNPs where both SNPs had MAF≥15% and 100 kb spanning at least 25 SNPs. Genome-Wide Scans Phenotypic values The traits we investigated here include body weight, external measurements (e.g., height at withers, body length, etc.), and bone measurements (skull and skeleton measurements). Since these measurements are not available for most of the genotyped samples in the CanMap dataset, we treated breed averages as breed characteristics and assigned them to each individual of the same breed as phenotypic values as has been suggested previously [23]. The breed averages of body weight were obtained from [38]. The breed averages of external measurements were computed from questionnaire data, provided by dog owners, and contain 58 breeds that have genotyped individuals in the CanMap dataset. Using dogs older than 1 y, we computed the breed average of each trait for which at least two individuals had been measured. The breed averages of bone measurements were computed from the samples described in [39]. We used the breed averages of the traits for model selection, i.e., to prioritize SNPs for association, model fitting, and to fit the predictive model using associated SNPs. To account for allometry, we included log(body weight) as a covariate in the model. Some samples in the CanMap dataset also had individual body weights or external measurements. These data were used for the purpose of model validation. Genome-wide scans for associations All the SNPs that passed the quality filtering were considered in the scans for association. The allele frequencies were computed for each breed for all SNPs. An individual-by-individual IBS similarity matrix was calculated and then averaged within breeds to obtain a breed-average IBS matrix, which was used to control for genetic relatedness among breeds. For continuous traits, a linear mixed model [40],[41], as implemented in EMMA [42], was used to test each of the SNPs for association while also controlling for relatedness. Here, because mapping is being done at the breed level, we used EMMA to control for relatedness between (rather than within) breeds. The random effects were assumed to follow a multivariate normal distribution with a mean of 0 and the correlation matrix being the breed average IBS matrix [42]. For allometric traits, we used log(breed average body weight) as a covariate in the linear mixed model for all traits except for those skull- and tooth-related skeletal traits, for which we used log(breed average total skull length). For dichotomous traits, a weighted bootstrap method was used to test each of the SNPs for association. The phenotypes were bootstrapped with weights accounting for breed relatedness, and the empirical distributions of test statistics were obtained for calculating p values. Each round of bootstrap consisted of N steps where the sample size was N. The IBS matrix was denoted as K with the value between breed i and breed j equal to . At step i, we sampled a phenotype for the ith individual from the jth individual, where j is chosen with probabilities proportional to row i of the IBS matrix. Specifically, we chose the phenotype corresponding to individual j with probability . A correlation test-statistic was obtained for each round of bootstrapped phenotype and the SNP breed frequencies. The empirical p value was the number of bootstrap replicates that showed the test-statistic bigger than the test-statistic obtained from the observed phenotype. For all the scans, naïve tests without accounting for breed relatedness were also employed for comparison. Model fitting and validation We use the results of single marker EMMA scans described above in constructing multi-SNP models for predicting phenotype from genotype. Specifically, we use forward stepwise regression with breed average value of the trait as the dependent variable and a design matrix consisting of individual dog genotypes across the most highly associated SNPs from the EMMA breed-level scan. For those traits with individual phenotype and genotype measurements (such as body weight), we used the multi-SNP predictive models for validation. Specifically, for all individuals with both genotype and phenotype data, we predicted an individual's phenotype by applying the multi-SNP model to their individual genotype data and compared the observed and predicted values. The predictive models for body weight were also validated on a dataset of 50 village dogs with individual body weights and genotypes across the associated SNPs [13]. Analysis of Population Structure in Breed Dogs Using Structure and PCA A potential confounding factor in our study is relatedness among breeds that share traits of interest. For example, if small dog breeds are more closely related, on average, to each other than large dog breeds, then the loci identified may simply be distinguishing genomic regions associated with historical relatedness (and not size, per se). To test this notion, we undertook a systematic dissection of the population structure of modern dog breeds. Using 5,157 unlinked SNPs genotyped on 890 dogs from 80 breeds, we evaluated population structure using Principal Component Analysis (PCA; [43]) and the Bayesian clustering program Structure (Figure S4) [44],[45]. Both methods distinguish “ancient” and “modern” breeds in their initial clustering (K = 2 or PC1) as previously observed with boxers (one of the two main breeds used for SNP ascertainment) and basenjis (the only African breed in the sample) being identified next (K = 3,4 or PC2/3). Importantly, in both methods breed groups did not tend to form clusters; instead, single breeds or pairs of closely related breeds are “pulled out” as one examines higher dimension PCs or adds new Structure groups (i.e., increases K). When Structure was run at K = 80, three pairs of breeds and one trio were indistinguishable (Samoyeds – American Eskimo Dogs, Collies – Shetland Sheepdogs, Bull Terriers – Miniature Bull Terriers, and Chow Chows – Akitas – Shar Pei) and some of the 80 clusters became degenerate, as has been reported previously with cluster analysis using microsatellites [17],[46]. However these breeds were still separated out by PCA (for example, PC29 separates Chows and Akitas, PC42 separates Shetland Sheepdogs and Collies, etc.). This pattern was consistent with modern breeds being, for the most part, a recent adaptive radiation (star phylogeny) with few significant internal branches. In fact, a Molecular Analysis of Variance suggested only 4% of the genetic variance was attributable to major phenotypic groupings (such as herding/gun/toy, see also [12]). Supporting Information Figure S1 Diagrams depicting a subset of measurements used to calculate breed averages for morphological trait mapping. (A) Body measurements taken on live dogs. Red lines represent the path of superficial measurements. The skeleton is shown for anatomical clarity. Measurements collected included: height at withers (1), height at base of tail (2), snout length (5), head length (6), neck length (7), body length (8), tail length (9), neck girth (11), abdominal girth (12), hind foot length (14), hind foot circumference (15), lower hind leg length (16), upper hind leg length (17), fore foot length (18), fore foot circumference (19), lower fore leg length (20), upper fore leg length (21). (B) The skull measurements taken on the museum specimens. The measurements include: total skull length (TSL), face length (FL), upper tooth row length (TRL), upper P3 length (P3L), upper P4 length (P4L), upper M1 length (M1L), upper M2 length (M2L), maximum cranial width (MCW), zygomatic width (ZW), least cranial width (LCW), cranial depth (CD), premaxilla depth (PD), mandible height (MH), mandible length (ML), lower M1 length (M1L), basicranial length (BCL). The cranioskeletal diagram was reproduced with author permission from Wayne, R. (Evolution 40, 243–261, 1986). (0.45 MB PDF) Click here for additional data file. Figure S2 Correlation between the allele frequency of the most highly associated SNP (lower diagonal) and the phenotype for (A) log(body weight) and (B) ear floppiness (upper diagonal) across the 80 CanMap breeds. (0.24 MB PPT) Click here for additional data file. Figure S3 Fine-scale resolution of CFA10 region associated with both body size traits and ear floppiness. Single-marker analysis shows strongest association with body weight near HMGA2, while the strongest association with ear floppiness is near MSRB3. High FST between small- and large-breed dogs and reduced heterozygosity in small breed dogs extends several hundred kb away from HMGA2. The strongest ear flop association and FST signal between erect- and floppy-eared breeds are relatively localized within 100 kb region near MSRB3, although reduced heterozygosity in floppy-eared breeds extends for 500 kb. (0.49 MB PDF) Click here for additional data file. Figure S4 Genome-wide association scans using naïve tests without accounting for breed relatedness. Scans show (A) log(body weight), (B) ear erectness (floppy versus erect ears), (C) proportional snout length, (D) proportional palatal length, and (E) snout type (brachycephalic versus average). (0.30 MB PPT) Click here for additional data file. Figure S5 Population structure across CanMap breeds determined by PCA (top) and Structure (bottom). Each individual is a thin column and individuals are grouped by breed (black vertical lines separate breeds, with bold lines denoting separation between breed groups). Values for PC1 through PC80 are shown in descending order for each individual by color with blue indicating lower-than-average PC values and red indicating higher-than-average values. The height of each PC is proportional to the proportion of variance explained by each principal component (shown on right axis). Ordering of individuals along the x-axis (6–12 individuals per breed) is identical for both panels. (0.47 MB PDF) Click here for additional data file. Table S1 Proportion variance explained by models incorporating the top one to six SNPs for each trait. Blanks denote traits with too few significant SNPs to parameterize a full model. (0.10 MB DOC) Click here for additional data file. Table S2 Comparison of BRLMM-P and MAGIC genotype calling algorithms using common Affymetrix .cel files and QC filters. Note that the 1,400 arrays used for the analyses in this study are a subset of the arrays used to conduct this head-to-head comparison, so total SNP counts differ somewhat between the datasets. (0.05 MB DOC) Click here for additional data file. Table S3 List of SNPs that were sequenced to validate the MAGIC genotyping algorithm. Red SNPs indicate discordant homozygous calls between MAGIC and BRLMM, which are indicative of the presence of “null alleles” (individuals lacking specific binding to either probe, usually because of a variant at the probe binding site). (0.06 MB DOC) Click here for additional data file. Text S1 Algorithmic details and validation of MAGIC genotype calling program. (0.12 MB DOC) Click here for additional data file.