A test for deviations from expected genotype frequencies on the X chromosome for sex-biased admixed populations

Abstract

Genome-wide scans for deviations from expected genotype frequencies, as determined by the Hardy–Weinberg equilibrium (HWE), are commonly applied to detect genotyping errors and deviations from random mating. In contrast to the autosomes, genotype frequencies on the X chromosome do not reach HWE within a single generation. Instead, if allele frequencies in males and females initially differ, they oscillate for a few generations as they converge toward equilibrium. Allele frequency differences between the sexes are expected in populations that have experienced recent sex-biased admixture, that is, populations whose male and female founders differed in ancestry. Under sex-biased admixture, standard tests for HWE on X are not valid, because deviations are naturally expected even under random mating (post admixture) and error-free genotyping. In this paper, we develop a likelihood ratio test and a χ² test to detect deviations from expected genotype frequencies on X, beyond natural deviations due to sex-biased admixture. We demonstrate by simulations that our tests are powerful for detecting deviations due to non-random mating, while at the same time they do not reject the null under historical sex-biased admixture and random mating thereafter. We also demonstrate that when applied to 1000 Genomes project populations, our likelihood ratio test rejects fewer SNPs than other tests, but we describe limitations in the interpretation of the results.
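The oscillation mentioned in the abstract follows from the textbook single-locus recursion for an X-linked allele under random mating: sons receive their only X from their mother, while daughters receive one X from each parent. The sketch below iterates that recursion. It is only an illustration of why the sexes differ for several generations, not the authors' likelihood ratio or χ² test, and the function name and starting frequencies are hypothetical.

```python
def x_allele_frequencies(p_male, p_female, generations):
    """Iterate the X-linked allele frequency recursion under random mating.

    Returns a list of (p_male, p_female) tuples, one per generation,
    starting with the initial values.
    """
    history = [(p_male, p_female)]
    for _ in range(generations):
        # Sons inherit their single X from their mother; daughters inherit
        # one X from each parent, so their frequency is the parental average.
        p_male, p_female = p_female, (p_male + p_female) / 2
        history.append((p_male, p_female))
    return history


if __name__ == "__main__":
    # Hypothetical founder frequencies after sex-biased admixture.
    for gen, (pm, pf) in enumerate(x_allele_frequencies(0.2, 0.8, 6)):
        print(f"generation {gen}: males={pm:.4f} females={pf:.4f} diff={pf - pm:+.4f}")
```

In this recursion the female-minus-male difference halves and changes sign every generation, and both frequencies approach the weighted mean (p_male + 2 × p_female) / 3 of the starting values, because females carry two thirds of the X chromosomes.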

Most cited references 38

A note on exact tests of Hardy-Weinberg equilibrium.

Janis E Wigginton, David Cutler, Gonçalo R Abecasis (2005)

Deviations from Hardy-Weinberg equilibrium (HWE) can indicate inbreeding, population stratification, and even problems in genotyping. In samples of affected individuals, these deviations can also provide evidence for association. Tests of HWE are commonly performed using a simple χ² goodness-of-fit test. We show that this χ² test can have inflated type I error rates, even in relatively large samples (e.g., samples of 1,000 individuals that include approximately 100 copies of the minor allele). On the basis of previous work, we describe exact tests of HWE together with efficient computational methods for their implementation. Our methods adequately control type I error in large and small samples and are computationally efficient. They have been implemented in freely available code that will be useful for quality assessment of genotype data and for the detection of genetic association or population stratification in very large data sets.
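For orientation, the simple goodness-of-fit test that this abstract refers to can be sketched as follows for a single autosomal biallelic SNP. This is the asymptotic test whose type I error the authors caution about, not their exact test; the function name and argument names are hypothetical, and the p-value uses the closed form for a chi-square distribution with one degree of freedom.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square goodness-of-fit test of HWE for one autosomal biallelic SNP.

    Takes the three genotype counts and returns (statistic, p_value),
    with the p-value from a chi-square distribution with 1 df.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes observed")
    p = (2 * n_aa + n_ab) / (2 * n)  # estimated frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    # Classes with zero expected count (monomorphic SNPs) are skipped.
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


if __name__ == "__main__":
    print(hwe_chi_square(25, 50, 25))  # counts in exact HWE proportions
    print(hwe_chi_square(50, 0, 50))   # complete absence of heterozygotes
```

The asymptotic approximation is least reliable when the expected count of the rare homozygote is small, which is the setting in which the abstract recommends exact tests.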

Quality control and quality assurance in genotypic data for genome-wide association studies.

Peter Kraft, Claire Harris, Xiuwen Zheng … (2010)

Genome-wide scans of nucleotide variation in human subjects are providing an increasing number of replicated associations with complex disease traits. Most of the variants detected have small effects and, collectively, they account for a small fraction of the total genetic variance. Very large sample sizes are required to identify and validate findings. In this situation, even small sources of systematic or random error can cause spurious results or obscure real effects. The need for careful attention to data quality has been appreciated for some time in this field, and a number of strategies for quality control and quality assurance (QC/QA) have been developed. Here we extend these methods and describe a system of QC/QA for genotypic data in genome-wide association studies (GWAS). This system includes some new approaches that (1) combine analysis of allelic probe intensities and called genotypes to distinguish gender misidentification from sex chromosome aberrations, (2) detect autosomal chromosome aberrations that may affect genotype calling accuracy, (3) infer DNA sample quality from relatedness and allelic intensities, (4) use duplicate concordance to infer SNP quality, (5) detect genotyping artifacts from dependence of Hardy-Weinberg equilibrium test P-values on allelic frequency, and (6) demonstrate sensitivity of principal components analysis to SNP selection. The methods are illustrated with examples from the "Gene Environment Association Studies" (GENEVA) program. The results suggest several recommendations for QC/QA in the design and execution of GWAS. (c) 2010 Wiley-Liss, Inc.

Reconstructing the Population Genetic History of the Caribbean

Andres Moreno-Estrada, Simon-Pierre Gravel, Fouad Zakharia … (2013)

Introduction

Genomic characterization of diverse human populations is critical for enabling multi-ethnic genome-wide studies of complex traits [1]. Genome-wide data also affords reconstruction of population history at finer scales, shedding light on evolutionary processes shaping the genetic composition of peoples with complex demographic histories. This genetic reconstruction is especially relevant in recently admixed populations from the Americas. Native peoples throughout the American continent experienced a dramatic demographic change triggered by the arrival of Europeans and the subsequent African slave trade. Important progress has been made to characterize genome-wide patterns of these three continental-level ancestral components in admixed populations from the continental landmass [2] and other Hispanic/Latino populations [3], including recent genotyping and sequencing studies involving Puerto Rican samples [4], [5], [6]. However, no genomic survey has focused on multiple populations of Caribbean descent, and critical questions remain regarding their recent demographic history and fine-scale population structure. Several factors distinguish the Antilles and the broader Caribbean basin from the rest of North, Central, and South America, resulting in a unique territory with particular dynamics impacting each of its ancestral components. First, native pre-Columbian populations suffered dramatic population bottlenecks soon after contact. This poses a challenge for reconstructing population genetic history because extant admixed populations have retained a limited proportion of the native genetic lineages [7]. Second, it is widely documented that the initial encounter between Europeans and Native Americans, such as the first voyages of Columbus, took place in the Caribbean before involving mainland populations.
However it remains unclear whether the earlier onset of admixture in the Caribbean translates into substantial differences in the European genetic component of present-day admixed Caribbean genomes, compared to other Hispanic/Latino populations impacted by later, and probably more numerous, waves of European migrants. Third, the Antilles and surrounding mainland of the Caribbean were the initial destination for much of the trans-Atlantic slave trade, resulting in admixed populations with higher levels of African ancestry compared to most inland populations across the continent. However, the sub-continental origins of African populations that contributed to present-day Caribbean genomes remain greatly under-characterized. Disentangling the origin and interplay among ancestral components during the process of admixture enhances our knowledge of Caribbean populations and populations of Caribbean descent, informing the design of next-generation medical genomic studies involving these groups. Here, we present SNP array data for 251 individuals of Caribbean descent sampled in South Florida using a parent-offspring trio design and 79 native Venezuelans sampled along the Caribbean coast. The family-based samples include individuals with grandparents of either Cuban, Haitian, Dominican, Puerto Rican, Colombian, or Honduran descent. The 79 native Venezuelan samples are of Yukpa, Warao, and Bari tribal affiliation. We construct a unique database which includes public and data access committee-controlled data on genomic variation from over 3,000 individuals including HapMap [8], 1000 Genomes [6], and POPRES [9] populations, and African [10] and Native American [11] SNP data from diverse sub-continental populations employed as reference panels. 
We apply admixture deconvolution methods and develop a novel ancestry-specific PCA method (ASPCA) to infer the sub-continental origin of haplotypes along the genome, yielding a finer-resolution picture of the ancestral components of present-day Caribbean and surrounding mainland populations. Additionally, by analyzing the tract length distribution of genomic segments attributable to distinct ancestries, we test demographic models of the recent population history of the Greater Antilles and mainland populations since the onset of inter-continental admixture. Results Population structure of the Caribbean To characterize population structure across the Antilles and neighboring mainland populations, we combined our genotype data for the six Latino populations with continental population samples from western Africa, Europe, and the Americas, as well as additional admixed Latino populations (see Table S1). To maximize SNP density, we initially restricted our reference panels to representative subsets of populations with available Affymetrix SNP array data (Figure 1A). Using a common set of ∼390 K SNPs, we applied both principal component analysis (PCA) and an unsupervised clustering algorithm, ADMIXTURE [12], to explore patterns of population structure. Figure 1B shows the distribution in PCA space of each individual, recapitulating clustering patterns previously observed in Hispanic/Latino populations [3]: Mexicans cluster largely between European and Native American components, Colombians and Puerto Ricans show three-way admixture, and Dominicans principally cluster between the African and European components. 
Ours is the first study to characterize genomic patterns of variation from (1) Hondurans, which we show have a higher proportion of African ancestry than Mexicans, (2) Cubans, which show extreme variation in ancestry proportions ranging from 2% to 78% West African ancestry, and (3) Haitians, which showed the largest average proportion of West African ancestry (84%). Additional clustering patterns obtained from higher PCs are shown in Figure S1. 10.1371/journal.pgen.1003925.g001 Figure 1 Population structure of Caribbean and neighboring populations. A) Areas in red indicate countries of origin of newly genotyped admixed population samples and blue circles indicate new Venezuelan (underlined) and other previously published Native American samples. B) Principal Component Analysis and C) ADMIXTURE [12] clustering analysis using the high-density dataset containing approximately 390 K autosomal SNP loci in common across admixed and reference panel populations. Unsupervised models assuming K = 3 and K = 8 ancestral clusters are shown. At K = 3, Caribbean admixed populations show extensive variation in continental ancestry proportions among and within groups. At K = 8, sub-continental components show differential proportions in recently admixed individuals. A Latino-specific European component accounts for the majority of the European ancestry among Caribbean Latinos and is exclusively shared with Iberian populations within Europe. Notably, this component is different from the two main gradients of ancestry differentiating southern from northern Europeans. Native Venezuelan components are present in higher proportions in admixed Colombians, Hondurans, and native Mayans. We used the program ADMIXTURE to fit a model of admixture in which an individual's genome is composed of sites from up to K ancestral populations. We explored K = 2 through 15 ancestral populations (Figure S2) to investigate how assumptions regarding K impact the inference of population structure. 
Assuming a K = 3 admixture model, population admixture patterns are driven by continental reference samples with no continental subdivision (Figure 1C, top panel). However, higher Ks show substantial substructure in all three continental components. Log likelihoods for successively increasing levels of K continue to increase substantially as K increases (Figure S3a), which is not unexpected since higher values of K add more parameters to the model (thereby improving the fit). Using cross-validation we found that K = 7 and K = 8 have the lowest predicted error (Figure S3b); thus, we focused on these two models. The first sub-continental components that emerge are represented by South American population isolates, namely the three Venezuelan tribes of Yukpa, Warao, and Bari. At higher-order Ks, we recapitulate the well-documented North-to-South American axis of clinal genetic variation described by us [13] and others [11], [14], as Mesoamerican (Maya/Nahua) and Andean (Quechua/Aymara) populations are assigned to different clusters (Figure S2). Interestingly, Mayans are the only group showing substantially higher contributions from the native Venezuelan components (Figure 1C, bottom panel). Both Mesoamerican and Andean Native American samples contain considerable amounts of European ancestry, due to post-Columbian admixture. Above K = 7, we observe a North-to-South European differentiation, which is consistent with previous analyses [15], [16]. Surprisingly, we observe another European-specific component emerge as early as K = 5 and remain constant through K = 15 (Figure S2). This component accounts for the majority of the Caribbean Latinos' European ancestry, and it only appears in Mediterranean populations, including Italy, Greece, Portugal, and Spain at intermediate proportions. 
Throughout this paper, we refer to this component as the “Latino European” component, and it can be seen clearly in Figure 1C (“black” bars represent the Latino European component, “red” bars represent the “Northern European”, and pink the “Mediterranean” or “Southern European” component). At K = 8, when the clinal gradient of differentiation between Southern and Northern Europeans appears, the Latino European component is seen only in low proportions in individuals from Portugal and Spain, whereas it is the major European component among Latinos (Figure 1C, bottom panel). To identify possible sex-biased gene flow in Caribbean populations, we compared the ancestry proportions of the X chromosome vs. the autosomes in each population. We observe a significant skew towards a higher proportion of Native American ancestry on the X chromosome than on the autosomes (p-value < 0.05, Figure S4). Overall, we find evidence of a high Native American, and to a lesser extent African, female contribution in Caribbean populations. Additionally, our data show a strong signature of assortative mating based on genetic ancestry among Caribbean Latinos, as suggested by previous studies [17]. In particular, we see a strong correlation between maternal and paternal ancestry proportions (Figure S5). To assess significance, we compared correlation of ancestry assignments among parent pairs to 100,000 permuted male-female pairs for each continental ancestry. All p-values were highly significant (p 3% global Native American ancestry together with the full reference panel of ancestral populations (Figure S7). ASPC1 separates the northernmost populations of the continent from the rest, while the Brazilian Surui and Central American Cabecar define the extremes of ASPC2. 
Most Native American haplotypes from the admixed genomes fall along this second axis of variation, forming two overlapping population clusters: one represented primarily by Colombians and Hondurans, and the other by Cubans, Dominicans, and Puerto Ricans (no Haitian haplotypes were included due to low levels of Native American ancestry). Figure 4A shows a closer view, in which Colombians and most Hondurans cluster closer to Chibchan-speaking groups from Western Colombia and Central America, including the Kogi, Embera, and Waunana. In contrast, most Caribbean islanders cluster with Amazonian groups from Eastern Colombia, Brazil, and Guiana. The closest ancestral populations include the Guahibo, Piapoco, Ticuna, Palikur, and Karitiana, among others, some of which are settled along fluvial territories of the Orinoco-Rio Negro basin. This location may have facilitated communication from the rainforest to the coast, explaining the relationship with Caribbean native components. 10.1371/journal.pgen.1003925.g004 Figure 4 Sub-continental origin of Native American components in the Caribbean. A) Ancestry-specific PCA analysis restricted to Native American segments from admixed Caribbean individuals (colored circles) and a reference panel of indigenous populations (gray symbols) from [11], grouped by sampling location. Darker symbols denote countries of origin with populations clustering closer to our Caribbean samples. Indigenous Colombian populations were classified into East and West of the Andes to ease the interpretation of their differential clustering in ASPCA. Population labels are shown for samples defining PC axes and representative clusters within locations. B) ADMIXTURE model for K = 16 ancestral clusters considering additional Latino samples, a representative subset of African and European source populations, and 52 Native American populations from [11], plus three additional Native Venezuelan tribes genotyped for this project. 
Vertical thin bars represent individuals and white spaces separate populations. Native American populations from [11] are grouped according to linguistic families reported therein. Labels are shown for the populations representing the 12 Native American clusters identified at K = 16. Clusters involving multiple populations are identified by those with the highest membership values. C) Map showing the major indigenous components shared across the Caribbean basin as revealed by ADMIXTURE at K = 16 from B). Namely, Mesoamerican (blue), Chibchan (yellow), and South American (green). Colored bars represent individuals and their approximate sampling locations. Bars pooling genetically similar individuals from more than one population are plotted from left to right following north to south coordinates as listed by population labels. Guarani, Wichi, and Chane from north Argentina are pooled with Arara but only the location of the latter is shown to allow us to provide a zoomed view of the Caribbean region (see [11] for the full map of sampling locations). The thick arrow represents schematically the most accepted origin of the Arawak expansion from South America into the Great Antilles around 2,500 years ago according to linguistic and archaeological evidence [30]. Asterisks next to population labels denote Arawakan populations included in our reference panel. The thin arrow indicates gene flow between South America and Mesoamerica, possibly following a coastal or maritime route, accounting for the Mayan mixture and supporting pre-Columbian back migrations across the Caribbean. Interestingly, the indigenous component of insular Caribbean samples seems to be shared across the different islands, suggesting gene flow across the Caribbean basin in pre-Columbian times. To explore this possibility into more detail, we performed a model-based clustering analysis using the full reference panel of 52 Native American populations from Reich et al. 
[11] in addition to our three native Venezuelan populations. Individual admixture proportions from K = 2 through 20 are given in Figure S8. Focusing on Native American components, the first sub-continental signal (at K = 4) comprised a Chibchan component mainly represented by the Cabecar from Costa Rica and the Bari from Venezuela. Higher-order clusters pulled out Amazonian population isolates such as the Surui and Warao, as well as northern populations including the Eskimo-Aleut and Pima, in agreement with the outliers detected in our ASPCA analysis (Figure S7). Interestingly, from K = 5 through 10, the Chibchan component is shared at nearly 100% with the Yukpa sample located near the Venezuelan coast, and at nearly 20% with Mayans from the Yucatan peninsula and Guatemala (Figure S8). Higher-order clusters maintain the connection between Mayans and South American components. For example, at K = 16 (the model with the lowest cross-validation error; Figure S9b), an average of 35% of the genome in Mayans is shared with a mixed South American component mainly represented by the Ticuna, Piapoco, Guahibo, Arhuaco, Kogi, Embera, Palikur, and Wichi, among others (Figure 4B and C). The presence of considerable proportions of Central and South American components in the Mayan sample is indicative of possible “back” migrations from Central America and northern South America into the Yucatan peninsula, revealing active gene flow across the Caribbean, probably following a coastal or maritime route. This observation is in agreement with our ASPCA results from admixed genomes and reinforces the notion of an expansion of South American-based Native American components across the Caribbean basin. 
European ancestral components We performed ASPCA analysis restricted to European segments of admixed individuals with >25% of European ancestry and a panel of European source populations, including 1,387 individuals from Europe sampled as part of the POPRES project [9], as well as additional Iberian samples from Galicia, Andalusia, and the Basque country in Spain [24]. The combined dataset included 2,882 European haplotypes and 255 haplotypes of European ancestry from the admixed populations. Figure 5 shows the first two PCs, where, as reported previously, the reference samples recapitulate a map of Europe [15], [25]. While most of the additional Iberian samples cluster together with the POPRES individuals sampled as Portuguese and Spanish, the Basques cluster separately from the centroid of most Iberian samples. The Basques are known for their historical and linguistic isolation, which could explain their genetic differentiation from the main cluster due to drift. Given the known Iberian origin of the first European settlers arriving into the Caribbean and surrounding territories of the New World, one would expect that European blocks derived from admixed Latino populations should cluster with other European haplotypes from present-day Iberians. Indeed, our Latino samples aggregate in a well-defined cluster that overlaps with the cluster of samples from the Iberian Peninsula (i.e., Portugal and Spain). However, we observed that the centroid is substantially deviated with respect to the Iberian cluster (bootstrap p-value 25% European ancestry derived from insular Caribbean (black symbols) and mainland populations (gray symbols) combined with a reference panel (colored labels) of 1,387 POPRES European samples with four grandparents from the same country [15], and 54 additional Iberian individuals (in yellow) from [24]. 
PC1 values have been inverted and axes rotated 16 degrees counterclockwise to approximate the geographic orientation of population samples over Europe. Population codes are detailed in Table S1 and regions within Europe are labeled as in [16]. Inset map: countries of origin for POPRES samples color-coded by region (areas not sampled in gray and Switzerland in intermediate shade of green to denote shared membership with EUR W, EUR C, and EUR S). Most Latino-derived European haplotypes cluster around the Iberian cluster. One of the two Haitian individuals included in the analysis clustered with French speaking Europeans (black arrow), in agreement with the colonial history of Haiti and illustrating the fine-scale resolution of our ASPCA approach. Importantly, when we applied ASPCA using the exact same reference panel of European samples but analyzing Mexican haplotypes of European ancestry (Moreno-Estrada, Gignoux et al., in preparation), we did not observe a deviated clustering pattern from the Iberian cluster: the effect is much weaker and not significant (bootstrap p-value = 0.099, see Figure S10). Furthermore, the deviation of the European segments of Mexican individuals from the distribution of the rest of Iberian samples is even smaller than the deviation of the Portuguese from the Spanish samples. We further evaluated whether the dispersion of the different subpopulations within the Caribbean cluster follow particular patterns along ASPC2, the axis driving the deviation from the Iberian centroid. We observed that Colombians and Hondurans tend to account for lower (more deviated) ASPC2 values compared to Cubans, Dominicans, and Puerto Ricans (Figure S11), suggesting a mainland versus insular population differentiation. We performed a Wilcoxon rank test to contrast ASPC2 for mainland (Colombia and Honduras) versus island (Cuba, Dominican Republic and Puerto Rico) populations, resulting in a highly significant p-value (1.5×10−15). 
Because >25% of European ancestry was required for inclusion in ASPCA, only two Haitian haplotypes were analyzed, and thus these were not included in the statistical analysis. Nonetheless, it is noteworthy that one of them clusters with the French, in agreement with historical and linguistic evidence regarding European settlements on the island (see arrow on Figure 5). Among European populations, Iberians also have the highest proportion of identical by descent (IBD) segments that are shared with Latino populations, as measured by a summed pairwise IBD statistic that is informative of the total amount of shared DNA between pairs of populations (see Materials and Methods and Figure S12). To explore the distribution of IBD sharing within continental groups, we considered Caribbean Latinos and Europeans separately by summing the cumulative amount of DNA shared IBD between each pair of individuals within each group. If European segments from Latino populations derive from a reduced number of European ancestors, then IBD sharing should be higher among Caribbean individuals compared to Europeans. Indeed, we observed a higher number of pairs sharing larger total IBD segment lengths among Latino individuals than among Europeans (Figure S13). Within-population cryptic relatedness is also compatible with increased IBD sharing. However, this is more likely to occur between individuals from the same subpopulation (e.g., COL-COL) rather than individuals from geographically separated subpopulations (e.g, COL-PUR). For this reason, we repeated the analysis, excluding within-population pairs of Latino individuals, and compared the IBD distribution to that of Iberian source populations (i.e., Spanish and Portuguese). Once again, we observed an increased proportion of IBD sharing among Latinos, arguing for a shared founder effect (Figure S13). 
These results are in agreement with our cluster-based analysis focused on global ancestry proportions, where the European ancestry of Latinos is dominated by a shared Latino-specific component differentiated from both southern and northern European components, although shared to some extent with Spanish and Portuguese (Figure 1C). Bottlenecked populations may exhibit differentiation from their parental gene pool due to loss of genetic diversity and stochastic shifts in allele frequencies. One way of quantifying the extent of genetic drift is to compare FST estimates among the K = 8 ancestral clusters from Figure 1C. In the absence of drift, we would expect the southern-derived Latino component and the southern European component to show a very low level of FST. However, we observe an FST = 0.021 (Table S3). To put this into perspective, the FST of southern vs. northern Europe is FST = 0.02, meaning that the differentiation of the Latino-specific component with respect to southern Europeans is at least as high as the north-south differentiation within Europe. This observation was replicated when including additional Latino and ancestral populations (Figure S8). Given the increased number of divergent clusters, we focused on K = 18 through 20, in which all sub-continental European components were jointly detected. In this case, the Latino-specific component shows further fragmentation into two components: one predominantly shared among insular Caribbean samples and the other among mainland Latinos. The FST value for southern versus northern European differentiation was 0.039, while values for southern versus insular (0.041) or mainland Latinos (0.04) were slightly inflated (Table S4), supporting the notion of additional differentiation impacting the European component of present-day admixed Latinos. 
African ancestral components

The Caribbean region has a complex history of population exchange with the African continent as a result of slave trade practices during European colonialism. Its proximity to the North Atlantic Ocean facilitated nautical contact with the West African coast, increasing the exposure of the local population to slave trade routes and ultimately resulting in genetic admixture between Caribbean and African individuals. We found the proportion of African ancestry to be higher in Caribbean populations compared to those from the mainland (Figure 1C), a finding that is consistent across studies [3], [6], [26]. To explore the sub-continental composition of African segments derived from Caribbean admixed genomes, we performed ASPCA analysis on individuals with more than 25% of African ancestry using a diverse panel of African populations as potential sources (see Table S1). Our first approximation showed no dispersion of Afro-Caribbean haplotypes over PCA space. Instead, they form a relatively tight cluster that overlaps with that of the Yoruba sample from southwestern Nigeria (Figure S14). This is a plausible result, given the extensive historical record supporting a West African origin for the African lineages in the Americas. However, according to our tract length analysis, there is strong genetic evidence for the occurrence of at least two pulses of African migrants imprinting different genomic signatures in present day admixed Caribbean populations. This result raises the question of whether both pulses involved the same source population during the admixture process. If this were the case, it would easily explain our ASPCA results, where all African haplotypes point to a single source.
Alternatively, if more than one source were involved and if enough mixing occurred since the two pulses, it is possible that what we see in ASPCA is the midpoint of the two source populations, causing the difference to remain undetected by our standard approach (which gives a point estimate averaging the signature of all African blocks along the genome). Hence, we applied a different strategy, in which ASPCA is performed separately for short (thus older) and long (younger) ancestry tracts. For this purpose, we split the African segments of each haploid genome into two categories based on a 50-cM length cutoff and intersected the data with a reference panel of West African populations (Figure 6A). Then, for each individual, we computed assignment probabilities of coming from each of the putative parental populations based on bivariate normal distributions fitted around each PCA cluster (see Materials and Methods, Figure S15). In Figure 6B we present the scaled mean probabilities for long (>50 cM) versus short ( 50 cM in red) ancestry tracts. African ancestry tracts for Puerto Ricans are shown and results for all populations are available in Figure S16. C) Proportion of African ancestry of inferred Mandenka origin as a function of block size in the combined set of Caribbean genomes. By running PCAdmix within the previously inferred African segments, we obtained posterior probabilities for Mandenka versus Yoruba ancestry. Overall, we found evidence for a differential origin of the African lineages in present day Afro-Caribbean genomes, with shorter (and thus older) ancestry tracts tracing back to Far West Africa (represented by Mandenka and Brong), and longer tracts (and thus younger) tracing back to Central West Africa. One caveat of this analysis is that short ancestry tracts are more likely to be misassigned. To rule this out as a source of the signal, we added an intermediate block size category (>5 cM and 10%. 
Four trios were not considered for trio phasing due to an excess of Mendelian errors (>100 K), two trios were removed due to 3rd or higher degree of relatedness between parents as inferred by IBD, and five trios were filtered due to cryptic relatedness between members of different trios above 10% IBD. After filtering, 65 complete trios remained for haplotype-based analyses. To study population structure and demographic patterns involving relevant ancestral populations, 79 previously collected samples from three native Venezuelan tribes were genotyped using the same array (i.e., 25 Yukpa [aka Yucpa], 29 Bari, and 25 Warao). We combined our data with publicly available genomic resources and assembled a global database incorporating genome-wide SNP array data for 3,042 individuals from which two datasets with different SNP densities were constructed (see Table S1). The high-density dataset included populations with available SNP data from Affymetrix arrays; namely African, European, and Mexican HapMap samples [8], Europeans from POPRES [9], West Africans from Bryc et al. [10], and Native Americans from Mao et al. [35]. After merging and quality control filtering, 389,225 SNPs remained and representative population subsets were used in different analyses as detailed through sections below. Our lower density dataset (30,860 SNPs) resulted from the intersection of our high-density dataset with available SNP data generated on Illumina platform arrays, including 52 additional Native American populations [11], as well as additional Latino populations sampled in New York City [7] and 1000 Genomes Latino samples [6]. The resulting dataset combines genomic data for 1,262 individuals from 80 populations. Full details on the population samples are available in Table S1. 
Population structure An unsupervised clustering algorithm, ADMIXTURE [12], was run on our high-density dataset to explore global patterns of population structure among a representative subset of 641 samples, including seven Native American, eleven POPRES European, HapMap3 Nigerian Yoruba, HapMap3 Mexican, and our six new Caribbean Latino populations (see Table S1). Fourteen ancestral clusters (K = 2 through 15) were successively tested. Log likelihoods and cross-validation errors for each K clusters are available in Figure S3. FST based on allele frequencies was calculated in ADMIXTURE v1.22 for each identified cluster at K = 8 and values are available in Table S3. Our low-density dataset comprising 1,262 samples (detailed in Table S1) was used to run K = 2 through 20. Log likelihoods, cross validation errors and FST values from ADMIXTURE are available in Figure S9 and Table S4. Principal component analysis (PCA) was applied to both datasets using EIGENSOFT 4.2 [36] and plots were generated using R 2.15.1. Sex bias in ancestry contributions was evaluated by selecting only females (to ensure we compare a diploid X chromosome to diploid autosomes), and running ADMIXTURE at K = 3 on the X chromosome and autosomes separately. The Wilcoxon signed rank test, a non-parametric version of the paired Student's t-test that does not require the normality assumption, was applied to assess the significance of the difference in X and autosomal ancestry proportions. This tests whether the average difference of ancestry proportions assigned to a given source population for the X and for the autosomes of each sample is significantly different from zero. The test was applied to the entire collection of Latino samples, revealing an over-arching trend, and then to each population in turn to identify any between-population differences. 
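As an illustration of the paired comparison described above, the following is a minimal Python sketch of a two-sided Wilcoxon signed rank test on per-sample X versus autosomal ancestry proportions. It is not the authors' code: the function name, the input layout, and the use of the normal approximation without a tie correction to the variance are all assumptions made for this example, and the text does not state which implementation was used.

```python
import math

def wilcoxon_signed_rank(x_anc, auto_anc):
    """Two-sided Wilcoxon signed rank test (normal approximation).

    x_anc, auto_anc: paired ancestry proportions for the same samples
    (hypothetical inputs, e.g. Native American ancestry on X and autosomes).
    Returns (W+, z, p).
    """
    # Paired differences; zero differences are dropped, as is conventional.
    diffs = [x - a for x, a in zip(x_anc, auto_anc) if x != a]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    # Sum of ranks of the positive differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)

    # Null mean and standard deviation of W+ (no tie correction: assumption).
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return w_plus, z, p
```

For small samples an exact null distribution would be preferable to the normal approximation used here.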
A rejection of the null hypothesis means that the ancestry proportions on the X and the autosomes are significantly different from one another but does not imply which proportion is larger. We provide box plots as a visual aid to show the direction of the difference (Figure S4). Global ancestry estimates from ADMIXTURE at K = 3 were used to test the correlation between male and female ancestry proportions considering all trio founders within each Caribbean population as well as within the full set of admixed trios. Linear models and permutations (up to 100,000) were performed using R 2.15.1. Phasing and local ancestry assignment Family trio genotypes from our six Caribbean populations and continental reference samples were phased using BEAGLE 3.0 software [37]. Local ancestry assignment was performed using PCAdmix (http://sites.google.com/site/pcadmix/ [19]) at K = 3 ancestral groups. This approach relies on phased data from reference panels and the admixed individuals. To maintain SNP density and maximize phasing accuracy, we restricted the reference panel to a subset of samples with available Affymetrix 6.0 trio data, namely 10 YRI, 10 CEU HapMap3 trios, and 10 Native American trios from Mexico [5]. Each chromosome is analyzed independently, and local ancestry assignment is based on loadings from Principal Components Analysis of the three putative ancestral population panels. The scores from the first two PCs were calculated in windows of 70 SNPs for each panel individual (in previous work we estimated that roughly 10,000 windows is a suitable number to break the genome into when inferring local ancestry using PCAdmix; in this case, 743,735 SNPs remained after merging Affymetrix 6.0 data from admixed and reference panels, so dividing by 10,000 gives a window length of ∼70 SNPs). For each window, the distribution of individual scores within a population is modeled by fitting a multivariate normal distribution. 
Given an admixed chromosome, these distributions are used to compute likelihoods of belonging to each panel. These scores are then analyzed in a Hidden Markov Model with transition probabilities as in Bryc et al. [10]. The g (generations) parameter in the HMM transition model was determined iteratively so as to maximize the total likelihood of each analyzed population. Local ancestry assignments were determined using a 0.9 posterior probability threshold for each window using the forward-backward algorithm. In analyses that required estimating the length of continuous ancestry tracts, the Viterbi algorithm was used. An assessment of the accuracy of this approach is given in [5]. Tract length analysis We used the software Tracts [20] to identify the migratory model that best explains the genome-wide distribution of ancestry patterns. Specifically, we considered three migration models, each featuring a panmictic population absorbing migrants from three source populations. The models differ by the number of allowed migration events per population. In the simplest model, the population is founded by Native American and European individuals, and later receives a pulse of African migrants. The initial ancestry proportion and timing, as well as the African migration amplitude and timing, are fitted to the data as described below. The other two models feature an additional input of either European or African migrants; the timing and magnitude of this additional pulse result in two additional parameters that must be fitted to the data. Here, the data consisted of Viterbi calls from PCAdmix (see previous section and Figure 2), that is, the most probable assignment of local ancestry along the genomes. To fit parameters to these data, we tallied the inferred continuous ancestry tracts according to inferred ancestry and tract length using 50 equally spaced length bins per population, and one additional bin to account for full chromosomes. 
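The tallying step, together with the kind of Poisson likelihood and BIC comparison that is built on the binned counts, can be sketched in Python as follows. This is an illustrative sketch only and not the Tracts implementation: the function names, the choice of the chromosome length as the upper edge of the binning range, and the way expected counts are supplied are all assumptions introduced for this example.

```python
import math

def bin_tracts(lengths_cm, chrom_len_cm, n_bins=50):
    """Tally tract lengths into n_bins equal-width bins plus one extra bin
    for tracts spanning a full chromosome (binning range is an assumption)."""
    counts = [0] * (n_bins + 1)
    width = chrom_len_cm / n_bins
    for length in lengths_cm:
        if length >= chrom_len_cm:
            counts[n_bins] += 1  # full-chromosome bin
        else:
            # Guard against floating-point spill into the last regular bin.
            counts[min(int(length // width), n_bins - 1)] += 1
    return counts

def poisson_loglik(observed, expected):
    """Log-likelihood of observed bin counts, treating each bin as an
    independent Poisson variable with the given expected count (> 0)."""
    return sum(
        k * math.log(mu) - mu - math.lgamma(k + 1)
        for k, mu in zip(observed, expected)
    )

def bic(loglik, n_params, n_obs):
    """Bayesian Information Criterion: -2 log(L) + k log(n)."""
    return -2.0 * loglik + n_params * math.log(n_obs)
```

Under this layout, three ancestries with 51 bins each would give 153 binned observations; that this is where a value of n = 153 comes from is an inference from the numbers, not something the text states.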
Given a migration model and parameters, Tracts calculates the expected counts per bin. Assuming that counts in each bin are Poisson distributed, it produces a likelihood estimate that is used to fit model parameters. For each population, we report the model with the best Bayesian Information Criterion (BIC), −2 log(L) + k log(n), with n = 153. Because we imposed a fixed number of migration pulses, we must keep in mind that migrations are likely to have been more continuous than what is displayed in the best-fitting models. One way to interpret the pulses is as time points that the migrations probably spanned. Resolving the duration of each pulse would likely require refined models and a great deal more data. Ancestry-Specific Principal Component Analysis (ASPCA) To explore within-continent population structure, we applied the following approach for each of the continental ancestries (i.e., Native American, European, and African) of admixed genomes. The general framework is shown in Figure 2. It comprises locus-specific continental ancestry estimation along the genome, followed by PCA restricted to ancestry-specific portions of the genome combined with sub-continental reference panels of ancestral populations. For this purpose, we used our continental-level local ancestry estimates provided by PCAdmix to partition each genome into ancestral haplotype segments, and retained for subsequent analyses only those haplotypes assigned to the continental ancestry of interest. This is achieved by masking (i.e., setting to missing) all segments from the other two continental ancestries. Because ancestry-specific segments may cover different loci from one individual to another, a large amount of missing data results from scaling this approach to a population level, which limits the resolution of PCA. To overcome this problem, we adapted the subspace PCA (ssPCA) algorithm introduced by Raiko et al. 
[38] to implement a novel ancestry-specific PCA (ASPCA) that allows accommodating phased haploid genomes with large amounts of missing data. Our method is analogous to the ssPCA implementation by Johnson et al. [23], which operates on genotype data. In contrast, ASPCA operates on haplotypes, allowing us to use much more of the genome (rather than just the parts estimated to have two copies of a certain ancestry) and to independently analyze the two haploid genomes of each individual. Finally, ancestry-specific haplotypes derived from admixed individuals are combined with haplotypes derived from putative parental populations and projected together onto PCA space. Details of the ASPCA algorithm and constructed datasets are described in Text S1. Differentiation of sub-European ancestry components To measure the observed deviation in ASPCA of European haplotypes derived from admixed Caribbean populations with respect to the cluster of Iberian samples, a bootstrap resampling-based test was performed. The null distribution was generated from comparing bootstraps of Portuguese and Spanish ASPCA values as models of the intrinsic Iberian population structure. We then compared the ASPCA values of the admixed individuals and tested if the observed differences between Iberian ASPCA values and those of the admixed individuals are more extreme than the differences within Iberia. The distance was determined using the chi-squared statistic of Fisher's method combining ASPC1 and ASPC2 t-tests for each bootstrap. We ran 10,000 bootstraps to determine one-tailed p-values. As Iberians we considered: POPRES Spanish, POPRES Portuguese, Andalusians, and Galicians; and as Caribbean Latinos: CUB, PUR, DOM, COL, and HON. Additional tests were performed comparing Portuguese versus the rest of Iberians and between an independent dataset of Mexican individuals analyzed by Moreno-Estrada, Gignoux et al. 
(in preparation) projected onto ASPCA space using the same reference panel of European populations. A bivariate test was performed to measure the relative deviation from the Iberian cluster of the distribution given by the Caribbean versus the Mexican dataset. To determine whether insular versus mainland Caribbean populations disperse over significantly different ranges in ASPC2, a Wilcoxon rank test was performed between (COL+HON) versus (CUB, PUR, DOM). Haitians were excluded due to low sample size (N = 2 haplotypes). Boxplot is available in Figure S11. Population differentiation estimates between clusters inferred with ADMIXTURE were visualized and compared across runs where both the Latino-specific and southern European components were detected. Values are available in Table S3 and Table S4. To provide independent evidence on the sub-continental ancestry of European haplotypes, we considered segments that are identical by descent (IBD) between unrelated Latino individuals and a representative subset of European populations. We used our high-density dataset to extract a subset of 203 POPRES European individuals and the founders of the 65 complete admixed trios. We first performed a genome-wide pairwise IBS estimation using PLINK [39] to ensure that the dataset contains no samples with more than 10% IBS with any other sample. Then we used fastIBD [37] to phase the data and estimate segments shared IBD longer than 2 Mb to eliminate false positive IBD matches and assuming that ancestry will be shared among pairwise IBD hits of segments this long. All 2 Mb or greater segments shared IBD between pairs of individuals were summed, and histograms were created for pairwise matches within each group (i.e., POPRES Europeans, Iberians, and Caribbean Latinos). 
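The per-pair summation of long IBD segments described above can be illustrated with a short Python sketch. It assumes a hypothetical input of (sample_a, sample_b, start_bp, end_bp) tuples rather than the actual fastIBD output format, and the function name is invented for this example.

```python
from collections import defaultdict

def summed_ibd_per_pair(segments, min_len_mb=2.0):
    """Sum the lengths (in Mb) of IBD segments of at least min_len_mb
    for every unordered pair of individuals.

    segments: iterable of (sample_a, sample_b, start_bp, end_bp) tuples
    (hypothetical layout, not the fastIBD file format).
    """
    totals = defaultdict(float)
    for a, b, start_bp, end_bp in segments:
        length_mb = (end_bp - start_bp) / 1e6
        if length_mb >= min_len_mb:
            # Sort the pair so that (a, b) and (b, a) share one key.
            totals[tuple(sorted((a, b)))] += length_mb
    return dict(totals)
```

The resulting per-pair totals are the quantities one would histogram within each group.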
To inform about the proportion of shared DNA between pairs of populations we calculated a summed pairwise IBD statistic, which is the sum of lengths of all segments inferred to be shared IBD between a given European population and each Latino population, normalized by sample size. Size-based ASPCA analyses Given the evidence from our tract length analysis for a second pulse of African migrants into the admixture of insular Caribbean Latinos, a modified size-based ASPCA analysis was performed. A reference panel was built integrating three different resources [8], [10], [40] and focusing on putative source populations from along the West African coast, including Mandenka from Senegal, Yoruba and Igbo from Nigeria, Bamoun and Fang from Cameroon, Brong from western Ghana, and Kongo from the Democratic Republic of the Congo. We begin with the continental local ancestry inference from PCAdmix K = 3. For each individual we then divide African ancestry tracts into small (0 to 50 cM) and large (>50 cM) size classes. Given a partition of African ancestry tracts, we take all sites included in one tract class, say short tracts, and run PCA on our sub-continental West African reference populations for only these sites. Using the first two PCs from this analysis, we fit a bivariate normal distribution to each reference population cluster. We then project our test sample into this PCA space, and estimate the probability of it coming from each reference population using the fitted distributions. This procedure is repeated for each tract class, for each individual. For each admixed Caribbean population, we can then estimate the probability that a given class of African ancestry tracts comes from a specific West African source population as the average probability of assignment to this population across all individuals. 
Finally, under the assumption that a given class of African tracts must come from one of the provided reference populations, we rescale these probabilities to sum to one. Each assignment estimate is also provided with error bars representing the standard error of the mean. We compare the short and long assignment probabilities for each Caribbean population to identify distinct sources for “older” and “younger” West African migratory source populations. Haitians were not included in the analysis due to low sample size (n = 4). Due to concerns that shorter tracts have a higher likelihood of mis-assignment, we added a medium tract size class (5 cM to 50 cM) to see if the results were simply due to very short (0 cM to 5 cM) European or Native American tracts being mis-classified as African. We compared the results for short and medium tracts and found that the trends are maintained, suggesting that the observation that older, shorter tracts derive primarily from the Mandenka and Brong source populations is not simply due to short-tract mis-assignment. Local ancestry estimation within African tracts To identify likely regions of Yoruba versus Mandenka ancestry in the African component, we modified our implementation of PCAdmix to perform local ancestry deconvolution solely of the African segments of the admixed genomes. The modification is achieved in the final step of the algorithm: whereas the standard approach estimates a single HMM across an entire chromosome, here we fit J disjoint HMMs spanning each of the J blocks of African ancestry in a given chromosome for a given individual. Applying the method, we obtained posterior probabilities for Mandenka versus Yoruba ancestry within the previously inferred African segments. We then selected only those sub-regions that were confidently called as Mandenka or Yoruba, and stratified them by physical size. 
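The size-based assignment procedure (fit a bivariate normal to each reference cluster in PC1/PC2 space, score each projected admixed sample, average across individuals, and rescale to sum to one) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the bivariate normal density is used as the unnormalized "probability" of a sample coming from a cluster, and the PCA projection itself is taken as given.

```python
import math

def fit_bivariate_normal(points):
    """Sample mean and covariance of (pc1, pc2) points for one reference cluster."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def density(point, mean, cov):
    """Bivariate normal density at point, with cov given as (sxx, sxy, syy)."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    # Squared Mahalanobis distance via the closed-form 2x2 inverse.
    maha = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.exp(-0.5 * maha) / (2.0 * math.pi * math.sqrt(det))

def population_assignment(sample_pcs, reference_clusters):
    """Average each reference cluster's density over all projected samples
    of one tract class, then rescale the averages to sum to one."""
    fits = {name: fit_bivariate_normal(pts)
            for name, pts in reference_clusters.items()}
    mean_density = {
        name: sum(density(s, mean, cov) for s in sample_pcs) / len(sample_pcs)
        for name, (mean, cov) in fits.items()
    }
    total = sum(mean_density.values())
    return {name: d / total for name, d in mean_density.items()}
```

In this sketch the function would be called once per tract class (for example short and long tracts) and per admixed population, so that the rescaled values can be compared between classes.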
Supporting Information Figure S1 Principal component 1 versus lower order PCs defining sub-continental components among Native American populations. Top: PC5 separates Venezuelan population isolates from the rest of Native Americans. Bottom: PC7 separates Mesoamerican from Andean groups. Mexicans and Hondurans distribute between the European and Mesoamerican clusters, whereas Colombians slightly deviate towards the Andean and Venezuelan clusters. Global PCA analysis based on the high-density dataset (∼390 K SNPs) and thus limited to reference panel populations with available Affymetrix SNP array data (see Table S1 for details). (TIF) Figure S2 ADMIXTURE results from K = 2 through 15 based on the high-density dataset (∼390 K SNPs) including 7 admixed Latino populations and 19 reference populations. A low-frequency Southern European component restricted to Mediterranean populations at lower order Ks and specifically to Iberian populations at higher order Ks, accounts for the majority of European ancestry among Latinos (black bars). It further decomposes into population-specific clusters (purple bars) denoting higher similarities within the European portion among Latinos compared to European source populations. (TIF) Figure S3 ADMIXTURE metrics at increasing K values based on Log-likelihoods (A) and cross-validation errors (B) for results shown in Figure S2. (TIF) Figure S4 Comparison of ADMIXTURE estimates obtained from autosomes and the X chromosome in different Latino/Caribbean populations. A) Cluster-based results for K = 3 using the same set of ancestral populations as in Figure S2. Because the X chromosome is diploid only in females, the analysis was restricted to female individuals from the seven admixed Latino populations. Within each population, individuals are sorted from largest to smallest proportion of European ancestry. 
B) Box plot showing the directionality of the difference between X and autosomal ancestry proportions considering all populations together. P-values on top correspond to the Wilcoxon signed rank test applied to assess statistical significance (see Materials and Methods). C) Box plots and statistical tests for each population (Haitians excluded due to low sample size). The observed pattern strongly supports the presence of sex-biased gene flow during the process of admixture throughout the Caribbean, with significantly higher contribution from Native American, and to a lesser extent West African, ancestors into the composition of the X chromosome, which largely reflects the female demographic history of a population. (TIF) Figure S5 Correlation between male and female continental ancestries. Parents' ancestry proportions from each trio were used to compare correlation coefficients between the observed values and 100,000 permuted male-female pairs (p-values shown for the combined set of Latino Caribbean samples and for each population in Table S2). (TIF) Figure S6 Ancestry tract lengths distribution per population and demographic model tested in Tracts. For each demographic scenario, the observed distribution is compared to the predictions of the best-fitting migration model (displayed below each distribution). Solid lines represent model predictions and shaded areas are one-sigma confidence region surrounding the predictions. Three different demographic scenarios were considered, all of which assume the involvement of European and Native American tracts at the onset of admixture, followed by the introduction of African migrants (denoted by EUR,NAT+AFR). The second and third models allow for an additional pulse of European (EUR,NAT+AFR+EUR) and African (EUR,NAT+AFR+AFR) ancestry, respectively. Likelihood values for each model are shown on top of each plot. 
Pie charts above each migration model are proportional to the estimated number of migrants being introduced at each point in time (black arrows). GA: generations ago. (TIF) Figure S7 ASPCA analysis of Native American haplotypes derived from admixed genomes (solid circles) and reference panel populations from [11] grouped by linguistic families as reported therein. Top panels: ASPCA with the full reference panel of Native American populations. Bottom panels: Filtered ASPCA without extreme outliers (Aleutians, Greenlanders, and Surui excluded from the analysis). Each individual from the reference panel is represented by the corresponding population label centered on its PCA coordinates. A zoomed version of PC1 vs. PC2 for the filtered set (bottom left) grouped by geographic sampling location is available in Figure 4A. (TIF) Figure S8 ADMIXTURE results from K = 2 through 20 based on the low-density dataset (∼30 K SNPs) including additional admixed Latino and Native American reference populations (see Table S1 for details). The presence of the Latino European component (black and gray bars) is recaptured among independently sampled Latino populations. FL: Florida (this study); NY: New York; 1KG: 1000 Genomes Project samples. Native American populations from [11] are grouped according to linguistic families reported therein. Labels are shown for the populations representing the 15 Native American clusters identified at K = 20 (four of the remaining five being of European ancestry and one of West African ancestry). Clusters involving multiple populations are identified by those with the highest membership values. Throughout lower and higher order Ks, several South American components (yellow and green bars), show varying degrees of shared genetic membership with Mesoamerican Mayans, accounting for up to nearly half of their genome composition (see Figure 4 for more details). 
(TIF) Figure S9 ADMIXTURE metrics at increasing K values based on Log-likelihoods (A) and cross-validation errors (B) for results shown in Figure S8. (TIF) Figure S10 ASPCA distribution of Iberian samples (red circles) compared to European haplotypes derived from our Latino Caribbean samples (top panel) and from an independent cohort of Mexican samples (bottom panel). The relative deviation from the Iberian cluster is significantly different comparing the Caribbean versus the Mexican dataset (see the main text for details). (TIF) Figure S11 ASPC2 values per population from the European-specific PCA analysis shown in Figure 5 and Figure S10. Population codes as in Table S1. The boxplot shows that low ASPC2 values are enriched with mainland Colombian and Honduran haplotypes, whereas insular Caribbean populations show less deviated values from the Iberian cluster. A Wilcoxon rank test between mainland (COL, HON) versus insular samples (CUB, PUR, DOM) demonstrated that these two groups disperse over significantly different ranges in ASPC2 (Haitians excluded due to low sample size). (TIF) Figure S12 IBD sharing between different Caribbean Latino populations and a representative subset of POPRES European populations as measured by a summed pairwise IBD statistic. For each Latino population, maximum pairwise IBD levels were observed in those pairs involving Spanish and, to a lesser extent, Portuguese samples, in agreement with our ASPCA results. (TIF) Figure S13 IBD sharing between pairs of individuals within A) Caribbean Latinos and B) a representative subset of POPRES European populations. Inset histograms display counts lower than 50 for the same binning categories. 
The overall count of pairs sharing short segments of total IBD is higher among Europeans, probably as a result of an older shared pool of source haplotypes. In contrast, the higher frequency of longer IBD matches among Latinos is compatible with a recent European founder effect. After excluding within-population pairs of Latino individuals (top right), there are still more and longer IBD matches among Caribbean populations compared to Iberians (bottom right). (TIF) Figure S14 ASPCA analysis of African haplotypes derived from admixed genomes with >25% of African ancestry (black symbols) and a representative subset of African HapMap3 and other West African reference panel populations from [10]. Colombians and Hondurans excluded due to lower overall proportions of African ancestry. (TIF) Figure S15 ASPCA analysis of short versus long African ancestry tracts from admixed genomes and West African reference panel populations. To exemplify our size-based ASPCA approach, the African genome of a Puerto Rican individual is displayed (denoted by PUR). Left: PUR clusters with Mandenka when only sites within short ancestry tracts ( 50 cM). (TIF) Figure S16 African ancestry size-based ASPCA results per population sample. Considering three different classes of ancestry tract lengths (black: short; red: long; blue: intermediate), scaled assignment probabilities are shown for each African source population. Values on the y-axis are the average probability of assignment to each potential source population across all individuals within each Latino population (see Materials and Methods for details). (TIF) Table S1 Summary of Latino populations and assembled reference panels. (PDF) Table S2 Correlation p-values of male vs. female ancestry. (PDF) 
Table S3 FST divergences between estimated populations for K = 8 using ADMIXTURE. (PDF) Table S4 FST divergences between estimated populations for K = 20 using ADMIXTURE. (PDF) Text S1 Methodology of the Ancestry-Specific PCA (ASPCA) implementation. (PDF)