16
views
0
recommends
+1 Recommend
0 collections
    0
    shares
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      Cancer- and behavior-related genes are targeted by selection in the Tasmanian devil ( Sarcophilus harrisii)

      research-article
      * , ,
      PLoS ONE
      Public Library of Science

      Read this article at

      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          Devil Facial Tumor Disease (DFTD) is an aggressive cancer notorious for its rare etiology and its impact on Tasmanian devil populations. Two regions underlying an evolutionary response to this cancer were recently identified using genomic time-series pre- and post-DTFD arrival. Here, we support that DFTD shaped the genome of the Tasmanian devil in an even more extensive way than previously reported. We detected 97 signatures of selection, including 148 protein coding genes having a human orthologue, linked to DFTD. Most candidate genes are associated with cancer progression, and an important subset of candidate genes has additional influence on social behavior. This confirms the influence of cancer on the ecology and evolution of the Tasmanian devil. Our work also demonstrates the possibility to detect highly polygenic footprints of short-term selection in very small populations.

          Related collections

          Most cited references82

          • Record: found
          • Abstract: not found
          • Article: not found

          Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing

            Bookmark
            • Record: found
            • Abstract: found
            • Article: not found

            A New Approach for Using Genome Scans to Detect Recent Positive Selection in the Human Genome

            Introduction In approximately the past 50,000 years, anatomically modern human emerged from Africa and colonized most of the globe. During this relatively short time, humans encountered numerous novel environments with drastically different climates, pathogens, and food sources. In addition, several important cultural developments undoubtedly had far-reaching consequences, such as the introduction of various forms of agriculture, the domestication of animals, and increasing population density, resulting in conditions favorable for epidemics and infectious diseases. It thus seems likely that there has been ample opportunity for human populations to have adapted by natural selection to the extreme changes that have accompanied their recent expansion. Detection of recent, local positive selection has long been a pivotal issue of population genetics and is of increasing interest to human geneticists [1–7]. However, demonstrating conclusively that local selection has operated on a gene remains a difficult task because it involves several aspects: demonstrating that patterns of allelic variation at the gene are not consistent with neutrality; that there is a functional difference between alleles; and finally, that the functional difference would result in a phenotypic effect that would be influenced by selection. Such efforts have been mostly focused on individual candidate genes [6], and very few have had all these aspects demonstrated to some degree at least; examples of genes for which all these aspects have been shown include the lactose tolerance gene LCT [8], salt-regulation genes at the CYP3A cluster [9], several disease resistance genes, e.g., G6PD and CASP12 [10,11], and pigmentation genes, e.g., SLC24A5 and MATP [12,13]. Recently, high-density surveys of genetic variation across the genome are available for several major human populations [14,15]; and the advent of genotyping technology makes such whole-genome surveys increasingly feasible in any specific group [16,17]. This allows systematic characterization of recent positive selection in the human genome, and several studies have recently used such datasets to identify regions of the human genome that harbor signatures of positive selection [18–23]. The obvious strategy, from a statistical standpoint, is to match the empirical data to a sensible neutral model and thereby detect significant departures from neutrality. However, this is complicated by the need for a neutral model that incorporates the demographic history of humans, which is unknown and hence arguably not practical, as well as the high sensitivity of conventional test statistics to the departures from neutral assumptions, such as bottlenecks and population structure [24]. Thus, aside from the composite-likelihood–based test [20], which shows good robustness towards several demographic factors, previous studies were largely restricted to identifying outlier loci from the genome-wide distribution of a certain test statistic, rather than defining clear cutoffs for distinguishing selection from instances of extreme neutral drift. This greatly constrained the use of genomic scans, because it is impossible to address essential questions such as the magnitude of recent positive selection across the genome and the false positive and false discovery rates of any particular approach. Among the various statistics used for recognizing signals of positive selection from polymorphism data, the extended haplotype homozygosity (EHH), first introduced by Sabeti et al. [25] is particularly useful (e.g., [22,23,26–28]). EHH incorporates information on both allele frequency and the association between single nucleotide polymorphism (SNP) sites, and may provide higher testing power than conventional statistics [25]; furthermore, it is designed to work with SNP rather than sequencing data, being less sensitive to ascertainment bias than other approaches. These properties make EHH a very promising candidate strategy for genome-wide scans of recent positive selection. Voight et al. [21] recently introduced a powerful method for identifying alleles that have been driven to intermediate frequencies by positive selection, based on the comparison of EHH between alleles within a population, and detected wide-spread signals of positive selection in the human genome. However, this method lacks power to detect selective sweeps that have resulted in near or complete fixation of an allele in a population, and hence may fail to detect a significant fraction of loci that have experienced local positive selection. Here, we describe a novel genome-scan approach for detecting local positive selection that is designed to detect selective events that have resulted in complete or near-complete fixation of a beneficial allele. Rather than comparing the EHH between alleles within one population, this approach contrasts the EHH patterns of the same allele between populations, analogous to other statistics (lnRV and lnRH [29,30]) that are also based on contrasting genetic diversity between populations. We introduce a simple counting algorithm to estimate EHH-related statistics directly from genotype data, which avoids the time-consuming, computationally intensive estimation of haplotypes. Moreover, we use simulations to demonstrate that our test has high power to detect fixed, strong selective sweeps, and that our new summary statistic is robust over a variety of demographic models of human history, while capturing apparent departures of the empirical data from neutrality. We apply our method to two genome-wide SNP datasets, define candidate regions for local positive selection, and identify functional gene categories that contain an excess of candidates. Results The Test Statistics The primary goal of our study was to detect evidence of recent, local positive selection from the whole-genome SNP data of both the International HapMap Project and Perlegen Sciences [14,15]. For the Perlegen dataset, we used the data from all 71 unrelated individuals sampled in three groups: African American (23), European American (24), and Han Chinese (24). For the HapMap dataset, we only included unrelated individuals from three groups; specifically 60 Yorubans, 40 Europeans, and 45 Han Chinese (see Methods). Given the obvious shared ancestry between the groups in Perlegen and HapMap, we hereafter refer to them as Africans (Afr), Europeans (Eur), and Chinese (Chn), respectively. Our approach is based on the idea of extended haplotype homozygosity (EHH). First proposed by Sabeti et al. [25], the EHH statistic is a measure of the decay of identity of haplotypes as a function of distance from a “core” allele, and the EHH associated with an allele that has risen to a particular frequency under neutrality is expected to differ from the EHH of an allele that has risen to the same frequency by positive selection. Under neutral genetic drift, a young derived allele that is at low frequency will have few associated recombination events, and therefore will have low haplotype diversity and high EHH, whereas a high-frequency ancestral allele will have high haplotype diversity and low EHH because of the many recombination events that have occurred. A young derived allele under positive selection, however, rises rapidly in frequency while retaining extensive EHH, and leaves the alternative allele in low frequency with low EHH. Previous approaches compare the EHH decay between the alleles (hereafter, we refer to the EHH of an allele as EHHA) of a site/core-haplotype within a single population, so that the alleles with excessive EHH and high allele frequency indicate positive selection [21,25]. An obvious caveat of this approach is that the intrapopulation comparison has low power when the selected allele is at high frequency, and becomes impossible when the selected allele is fixed. Seeking a novel strategy to overcome this problem, our approach compares the decay of EHH of an individual SNP site (EHHS), rather than EHHA, between populations. EHHS is defined as the decay of identity of haplotypes starting from the tested SNP site of a population as a function of distance. Starting at site i, the normalized EHHS at site j would be: This is the haplotype homozygosity between i and j normalized by the homozygosity at site i. Note that both the haplotype homozygosity and homozygosity calculations are based on the site, regardless of the status of each allele. In principle, EHHS is roughly the average EHHA for the two alternative alleles weighted by their squared allele frequencies, and starts at a value of one and decays towards zero (Figure 1A). EHHS is therefore largely determined by the EHHA of the high-frequency allele, decaying very fast under neutrality when the dominating allele is the ancestral allele, or remaining extensive when the beneficial derived allele sweeps to a very high allele frequency or to fixation (Figure 1A). Figure 1 Signal of Positive Selection around the SCL24A5 Gene in Europeans from the HapMap Data (A) The EHH decay plot for i, a neutral site outside the selective sweep; and ii, a site in the middle of the SCL24A5 gene, which shows strong evidence of positive selection in Europeans (Eur). Afr, Africans; Chn, Chinese. (B) iES plotted against physical distance, back to back for Europeans and Africans. (C) Signals of positive selection shown in the plot of Rsb (European/African) against physical distance. When haplotype data are available, EHHS can be derived in an analytical way: Where Pk, (hap ij) is the kth haplotype between i and j, and Pl, (alle i) is the lth allele (variant of a SNP site) for site i. However, inferring haplotypes from genotype data at a genome-wide scale is very computationally intensive and may not be accurate over long genomic distances [31–33]. Here, we propose a new way to estimate EHHS directly from the genotypic data. Assuming random mating in a population, each extant individual represents an instance of random chromosome (haplotype) pairing; the proportion of homozygotes in a population, therefore, serves as an estimator of the homozygosity. The EHH can then simply be estimated as the proportion of individuals that remain homozygous for intervals starting from the tested SNP and extending in both directions. Here the Ik, (hap ij) is the identity of the two haplotypes between site i and j in one individual, and Il, (alle i) is the identity of the alleles at site i. Because this procedure is based on counting the number of homozygotes, hereafter, we call it the counting algorithm. Our major analyses, including the simulations and the analysis of the empirical HapMap and Perlegen genotype data, use this procedure. Treatment of missing data is discussed in the Methods section. This algorithm is fast and not computationally intensive. Although there is some inevitable loss of information compared to analyses based on phase-known data, we show later that the analysis based on the counting algorithm is comparable to that based on phase-known data. We use the integrated EHHS (iES) to summarize the EHHS decay for one single site, which basically integrates the area under the curve of EHHS against distance (see Methods; Figure 1A). The log ratio of iES values between population (ln(Rsb)′) then compares the EHHS decay of a single site between two populations: Because recombination rates are largely conserved among human populations [34], the comparison between populations thereby provides an internal control that largely cancels out the effect of heterogeneous recombination rates. Extreme values of ln(Rsb)′ indicate much slower EHH decay in one population than the other, and therefore represent possible evidence of selection. To achieve robustness across different demographic scenarios, ln(Rsb)′ is standard transformed as: Here SD refers to standard deviation; the median is used instead of the mean because it is less sensitive to extreme data points. We calculated ln(Rsb) for every SNP site in the HapMap phase II and Perlegen data, and for each pairwise comparison among the three populations, namely African versus European (AE), African versus Chinese (AC), and Chinese versus European (CE). We compared the ln(iES) and ln(Rsb)′ distributions estimated from the counting algorithm and from the analytical calculation based on a phase-known dataset, the phase I HapMap data (see Methods). The agreement between the two methods is considerable; r 2 ranged from 0.855 ~ 0.914 across different comparisons for ln(iES) and from 0.647 ~ 0.731 for ln(Rsb). Higher SNP densities, such as in the full Perlegen and HapMap datasets, should give even better agreement. Effects of Recombination Rate and SNP Density In principle, our approach should be robust against varying local recombination rates and SNP density, because ln(Rsb) compares the relative iES of the same SNP between two populations, and such local effects are thus controlled. Nevertheless, we tested for any influence of recombination rate or SNP density by linear regression. Sites around large gaps in the SNP distribution were first excluded from further analysis because iES estimation is sensitive to such gaps (see Methods). We observed statistically significant, but low-level, associations between Rsb and either the recombination rate or the SNP density for all the comparisons (r 2 between ln(Rsb) and recombination rate is 0.0014–0.0086 for the HapMap data, and 0.0025–0.0153 for the Perlegen data; r 2 between ln(Rsb) and SNP density is 0.0002–0.0087 for the HapMap data, and 0.0004–0.02 for the Perlegen data, respectively). Given the low magnitude of these associations, we did not include any further corrections for recombination rate and/or SNP density. Comparison between HapMap and Perlegen Datasets The Perlegen and HapMap datasets vary greatly in their SNP density, sample size, and ascertainment; it is, therefore, of interest to compare the iES and Rsb distributions between the two. We observed good agreement between the Perlegen and HapMap data for the iES distributions, especially for the comparisons in Europeans and Chinese (Figure 2A). Lower similarity in iES distributions is seen between the two African samples, which likely reflects admixture in the African Americans sampled by Perlegen (Figure 2B). The iES distribution is more variable in the Perlegen than in the HapMap dataset, probably due to the smaller sample size and lower SNP density of the former. There is a moderate correlation of the Rsb distribution between the two datasets (Figure 2C). Figure 2 Comparisons of iES and Rsb Distributions between the Perlegen and HapMap Datasets The iES and Rsb values are plotted for a 20-Mb region on Chromosome 15. (A and B) iES against physical distance in Europeans (Eur) and Africans (Afr), respectively, back to back for the HapMap and Perlegen data. (C) Plot of Rsb of Europeans over Africans against physical distance, back to back for the HapMap and Perlegen data. Genome Evidence of Departures from Neutrality Do regions with extreme ln(Rsb) values indicate local selection, or the extremes of the neutral distribution? To answer this question without accurate information concerning the demographic history of the populations, the ideal statistic would be robust against violations of the neutral assumption due to demographic history, yet at the same time sensitive to departures from neutrality due to selection. Previous studies have shown via simulations that neutral distributions are heavily influenced by demographic parameters; exploration of a broad, neutral parameter space is necessary, but results are often ambiguous [21,24]. Nonetheless, an important feature shared by all demographic factors (but not selection) is that varying demographic parameters are expected to influence all regions of the genome equally; we therefore hypothesized that the shape of any neutral distribution should have certain invariant properties. More specifically, the density distributions of the standard transformed ln(Rsb) should show little change across a broad range of neutral parameters and varying ascertainment settings. To investigate this, we generated a series of neutral demographic models using coalescent simulation. These include the simplest constant population-size model (model 1) and models with: different recombination rates (model 2) or mutation rates (model 3); independent ancestors for Europeans and Chinese (model 4); two different bottleneck scenarios (models 5 and 6, see Methods); population expansion (model 7); and population structure (model 8). We also considered three complex models matching real data in multiple aspects—one with no migration (model 9), one with migration and uniform recombination rate (model 10), and one with no migration but heterogeneous recombination rates (model 11) [35]. Ascertainment analogous to that of the Perlegen data, but much simplified, was applied to these models. (see Methods). Three other ascertainment processes were also simulated for model 10 (hereafter referred to as the standard neutral model because it is used as the null distribution to compare to the empirical data) to assess the impacts of low-quality sequencing (model 10-asc1), skewed constitution in the diversity panel (model 10-asc2), and high genotyping failure rate (model 10-asc3, see Methods for details on all models). It should be noted that the ascertainment schemes we use here do not capture the full complexity of the empirical data. Nonetheless, they provide perspectives as to how ascertainment in general affects ln(Rsb) distribution. Data from the neutral simulations described above, as well as the Perlegen data, were compared to the standard neutral model (model 10) by three approaches, namely quantile–quantile plots (QQ plots), superimposing histograms, and the Kolmogorov-Smirnov D (KSD) statistic (see Methods). The QQ plot shows a straight line when two distributions can be linearly transformed into each other. Nonstandard ln(Rsb)′ distributions were compared in pairwise fashion using QQ plots. For superimposing the histogram, the distributions of standard transformed ln(Rsb) were superimposed onto one another using the same coordinates (Figures 3 and S1–S13). The KSD value was also determined for the pairwise comparisons of the ln(Rsb) distributions as a quantitative measure of difference between distributions (Figure 4). The p-values for the Kolmogorov-Smirnov test are omitted due to the lack of independence between SNPs. Figure 3 Comparisons of the ln(Rsb) Distributions between Various Neutral Simulations and between the Full Neutral Model and Empirical Data (A) shows the full neutral model (model 10) and several other neutral models; (B) shows model 10 and empirical data. The ln(Rsb) distributions are compared by both a QQ plot and superimposing standardized histograms. In every QQ plot, quantile points of model 10 are plotted on the x-axis. For all of the superimposed standardized histogram plots, the blue color designates the full model (model 10) and the red color designates the alternative model. Figure 4 KSD Statistic for the Comparisons of Standardized ln(Rsb) Distributions within Neutral Models, and between Neutral Models and Empirical Data Categories on the x-axis denote the neutral models 1 to 10-asc3, where 10-asc1, 10-asc2, and 10-asc3 indicate the three different ascertainment schemes (see Methods). Each neutral distribution is compared to the corresponding empirical (in blue) or model 10 (in yellow) distribution. A cluster of three bars in the same color denotes the KSD values in the three pairwise comparisons in the order AE, AC, and CE. The QQ plots indicate that, as hypothesized, varying the neutral parameters does not heavily influence the shape of the density distributions (Figures 3 and S1–S13). Most models showed good agreement with the standard model (always plotted on the x-axis), except in some cases, such as the comparisons to the models with simple assumptions (models 1, 2, and 3), in which departures from linearity are more obvious at both ends of the QQ plot curve (Figures S1–S13). This is to be expected because the demographic assumptions of models 1, 2, and 3 differ very much from the standard model, and also because QQ plots of bell-shaped distributions have fewer data points and, therefore, more fluctuation at the tails of the curve. On the other hand, comparisons between empirical data and the standard neutral model showed that the QQ plots clearly departed from linearity even in the central segment, indicating a difference between the empirical data and the standard model that seemingly is not accounted for by demographic factors. Histograms superimposing and KSD values show similar trends. Most ln(Rsb) distributions for the different neutral models with varying demographic parameters match very well with the standard model, except a few with noticeable mismatches, e.g., the models with different recombination and mutation rates (models 2 and 3) in the AC comparison (KSD = 0.0134 in both cases), and the model with heterogeneous recombination rate (KSD = 0.0131, 0.0136, and 0.0155 for AE, AC, and CE comparisons, respectively). However, superimposing the distributions for the empirical data and the standard model revealed much more pronounced differences: the modes of the empirical EA and CA distributions shift to the left, although with larger right tails compared to the simulated data; the empirical CE distribution is more symmetrical, but also shows larger tails at both ends than the neutral distribution (Figure 3). The KSD values for each neutral model against the empirical data generally exceed the corresponding comparisons against the standard model by a factor of two, with the majority (34/42) ranging between 0.02 and 0.0316 (Figure 4). One exception is with heterogeneous recombination rates (model 11), in which the KSD values against the empirical data exceed only slightly, and in the case of the CE comparison actually fall below, the KSD value for the simulated model versus the standard model. This shows that ln(Rsb) is robust towards most demographic factors, but relatively sensitive to heterogeneous recombination rate (i.e., recombination hotspots). Nonetheless, we found that the difference in ln(Rsb) distributions between the recombination hotspot model and the uniform-recombination model does not come from the tail part of the distribution, which determines the robustness of the false-positive rate. In fact, for the right tail of the distribution, when we fix the false-positive rate of the standard model to be 0.5%, and apply the corresponding critical value to model 11, 0.71%, 0.77%, and 0.66% of SNPs are significant for the AC, AE, and CE comparisons, respectively, which are not drastically elevated from 0.5%. Defining Cutoffs for Top SNPs In general, to determine cutoff values corresponding to a particular significance level, an empirical distribution should be compared to a null distribution by matching the neutral part to the null distribution. Here we simply matched the first and third quartile points of the empirical and the standard neutral model distributions. The resulting superimposed distributions resemble those of the ln(Rsb) histogram superimposition (Figures 3B and S14): For the AC and AE comparisons, we infer that the excess in the right tail of either distribution indicates the signal of positive selection in the non-African populations, whereas an excess in the left tail would indicate positive selection in the Africans. The two empirical distributions, when compared to their neutral distribution, have their modes shifted towards the left with a bigger tail at the right end, suggesting an even larger departure from the null distribution for the non-Africans if the modes are matched. There is a corresponding deficit in the left tail, obviously underevaluating the signals in Africans. In the CE comparison, although both the left and right tails extend farther than the null distribution, the empirical distribution has lower shoulders than the null distribution. Overall, these observations suggest that our approach is conservative for identifying ln(Rsb) values that truly deviate from neutral expectations. We defined the level of significance to be 0.005 for the AE and AC comparisons and 0.001 for the CE comparison so as to make the cutoffs for the empirical distribution approximately two to three times that of the level of significance. The cutoffs thus defined are 1.35% for Europeans in the EA comparison, 1.14% for Chinese in the CA comparison, 0.303% for Europeans in the CE comparison, and 0.245% for Chinese in the CE comparison. Although the African group is also expected to have experienced local positive selection, our method down-weights any signals of selection in Africans, resulting in cutoff values for Africans that are even lower than the test level. Evaluation of positive selection in Africans, therefore, is not possible by this approach. It should be noted that, given the conservative nature of our approach, the cutoffs thus defined may not precisely estimate the fractions of the human genome under the effect of recent positive selection. However, in contrast to any arbitrary cutoffs, this conservative approach should provide a lower boundary for the distribution of recent, local positive selection. A better strategy, one that precisely matches the neutral part of the empirical data to the null hypothesis, is needed in the future to investigate the signal of positive selection in Africans as well as to provide more accurate cutoffs for non-Africans. Determining Candidate Regions To further reduce noise in the data, the variance of ln(Rsb) values was estimated by bootstrapping across the genotyping panels. SNPS with ln(Rsb) values above the cutoff values (top SNPs, or tSNPs) but with high bootstrap ln(Rsb) variances were removed from further analysis (see Methods). The remaining tSNPs tend to cluster tightly into discrete regions, marked as sharp peaks in the Rsb map (Figure 1), which is expected from positive selection because the flanking SNPs are then influenced by hitchhiking. In order to define candidate regions, we connected tSNPs within 200 kilobases (kb) of each other into discrete regions. Regions that span over 40 kb (half of the genome has linkage disequilibrium [LD] blocks of size 40 kb or longer for Europeans and Asians [36]) and that include more than 14 tSNPs were considered candidate regions and were extended 50 kb in both directions (50 kb is roughly the distance LD decays by half). We defined candidate regions for the HapMap and Perlegen data separately, each consisting of four candidate region lists, namely Eur-AE, Chn-AC, Eur-CE, and Chn-CE, where Eur-AE, for example, indicates candidate regions exhibiting a signature of selection in Europeans when compared to Africans. The two candidate region sets defined in the HapMap and Perlegen datasets were also combined into a common candidate region set, by merging overlapping regions and including those that encompass any tSNPs in the alternative dataset. Figure 1 shows one candidate region in the HapMap data, which contains the SLC24A5 gene, a pigmentation gene for which there is prior evidence of selection in Europeans [12]. Table 1 lists examples of candidate regions from the common candidate region set with strong signals of local selection and the genes these regions contain. Complete lists of candidate regions are given in Tables S2–S13. The numbers of candidate regions from the HapMap data (in the order of Eur-AE, Chn-AC, Eur-CE, and Chn-CE) are: 298, 240, 62, and 49, respectively, roughly double that of the Perlegen data (147, 115, 25, and 22 regions, respectively), and also more than that of the combined candidate region set (216, 143, 23, and 19 regions, respectively). The sizes of the candidate regions in the combined set are on average 391 kb, 405 kb, 305 kb, and 339 kb for the Eur-AE, Chn-AC, Eur-CE, and Chn-CE comparisons, respectively, and range from 140 kb to 1,893 kb. The fact that the AE and AC comparisons produced many more candidate regions than the CE comparison is consistent with the notion that a common ancestral population of Europeans and Chinese diverged much earlier from Africans before subsequently splitting. Table 1 Some Examples of Candidate Regions with Strong Signals Power of Tests Based on ln(Rsb) To evaluate the power of ln(Rsb)-based tests, we performed simulations of selective sweeps using the two-population model described in Thornton and Jensen [37], in which a derived population experiences a bottleneck while splitting from an ancestral population, and then selection occurs in the derived, but not the ancestral, population. Loci that were 2 megabases (Mb) long were generated for both neutral and selection scenarios (see Methods). To allow EHH to decay substantially before reaching the ends of loci, ln(Rsb) values were calculated only for SNPs residing in the central 1Mb. The first test simply uses the ratio of tSNP over the total number of SNPs in each central 1-Mb region as the test statistic. The tSNP cutoff value is set to be the ln(Rsb) value that defines the top 0.5% tail area of the neutral distribution. SNPs with ln(Rsb) value greater than this value are assigned as tSNPs, and loci with a ratio of tSNPs above a critical value are called significant. The second test is analogous to the ad hoc procedure used to define candidate regions in the empirical data. In this test, tSNPs are defined as above, and tSNPs within 200 kb of each other are connected into regions. A region is called a candidate region if it is longer than 40 kb in physical distance and contains more than 84 tSNPs (based on the 14 tSNP criterion used for the empirical Perlegen data, adjusted for the different SNP density in the simulated data). An accepted candidate region is extended 50 kb farther from both ends. Figure 5 is the power plot for both tests, at a false-positive rate of 0.01 for the first test. Both tests have almost no power to detect weak selection (alpha = 100). However, when the selection coefficient is relatively high (alpha = 1,000 and 1,500), they both show high power: for test 1, the power ranges from 0.85 to 0.98 when the selected site is in the central 1 Mb of the simulated locus; from 0.67 to 0.94 if the selected site lies within 200 kb outside of the central 1Mb region, and from 0.59 to 0.85 for the entire 2-Mb locus. For test 2, the power ranges from 0.86 to 0.96 for alpha = 1,000 and 1,500. Test 2 has a varying, but low, false-positive rate around 0.02 for f = 0.1, 0.029 for f = 0.2, and 0.033 for f =0.4, where f is a measure of the severity of the bottleneck experienced by the derived population (see Methods). The power of the test tends to be negatively correlated with the severity of bottleneck, which is to be expected since more severe bottlenecks create more fluctuations and also decrease the genetic diversity of the neutral loci. Notably, our tests have comparable power to the composite-likelihood test and its related tests, which are generally considered to be the best of the available tests [20,38,39] Figure 5 Power of the Two ln(Rsb)-Based Tests for Different Values of f (the Strength of the Bottleneck Experienced by the Derived Population) (A) The power plot for test I, which uses ratio of tSNPs as the test statistic (see Methods). The critical value is set by independent neutral simulation at a level of 0.01. Curves in red show the power to detect selection that occurs in the central 1 Mb. Curves in blue show the power to detect selection that occurs outside the central 1 Mb, but within 200 kb. Dashed curves in green are the overall power for the entire simulated 2-Mb segment. (B) The power plot for test II, the ad hoc test. A case is called significant only if the selected site lies within a candidate region. Overlap of Signals between the Perlegen and HapMap Datasets There is significant overlap in the candidate regions from the HapMap and Perlegen datasets. Pairwise comparisons revealed 8%–40% overlap of the corresponding candidate region lists between the two sets (p < 0.001, Table 2). The concordance between the candidate regions identified in the HapMap and Perlegen datasets, although significant, is not very high, which is consistent with the moderate correlation of the ln(Rsb) distribution between the two datasets (Figure 1). This may in part reflect differences between these two datasets in SNP density, SNP discovery scheme, ascertainment, and sample size. Stochastic factors can also produce fluctuations in the magnitude of signals across different datasets, whereas conservative cutoffs decrease the chances of rediscovering the same regions. However, the biggest difference may reflect different population groups, especially in the case of Africa: the Perlegen samples consist of African Americans, who have admixed substantially with Europeans [40,41], thereby reducing the signature of selection when comparing African Americans with Europeans, whereas the HapMap sample consists of Yorubans from Nigeria, and hence are more representative of African genetic diversity. In fact, there is more overlap between the Perlegen and HapMap data for AC comparisons than for AE comparisons (fractions of overlap for the Eur-AE and Chn-AC lists are 0.13 and 0.19 in the HapMap data, and 0.26 and 0.40 in the Perlegen data, respectively, Table 2), which probably reflects the fact that the AC comparisons are less influenced by the European admixture in African Americans. Overall, the signals of selection from the HapMap data are probably more reliable because, aside from including native Africans, the HapMap data also has higher SNP density and larger sample sizes. Table 2 Pairwise Overlapping Test of Different Candidate Region Lists of Positive Selection Overlap within Candidate Region Sets Table 2 shows that the overlap profile between candidate region lists, e.g., Eur-AE and Eur-CE, are similar across the HapMap, Perlegen, and combined sets: there is extensive (p < 0.005) overlapping between the Eur-AE and the Chn-AC regions, supporting the notion that a common population ancestral to Europeans and Asians experienced shared selection pressures before or during the out-of-Africa migration. On the other hand, substantial numbers of candidate regions from the CE comparisons overlap with regions from the same populations in the AE or AC comparisons, e.g., Eur-AE or Chn-AC (28% ~ 53%), but are almost absent from the lists of the opposite population (0% ~ 5%). Such signals are population specific and likely happened after the divergence of the two non-African populations. Overlapping of ln(Rsb) Signals with LRH and Integrated Haplotype Score Signals Although our approach is designed to be most powerful for sweeps at or near fixation, it is interesting to see whether the method detects signals of partial sweeps as well. To investigate this, a long-range haplotype (LRH) test was constructed and applied to the ln(Rsb) candidates (see Methods). The LRH test assigns three p-values (p 75, p 95, and p mean) based on different criteria to each candidate region (see Methods). A high proportion of candidate regions are significant for positive selection under the LRH test, ranging from 0.193 to 0.556 in different candidate region sets (Table S1). We also determined the overlap between candidate regions defined by the ln(Rsb) statistic to those identified previously by the integrated haplotype score (iHS) statistic, which was shown to be most powerful for detecting partial sweeps [21]. A significant percentage of the candidate regions from the ln(Rsb) method overlap with candidate regions identified by the iHS method (30% ~ 74%, p < 0.001, Table 2). Given this substantial overlap, are there any differences in the properties of the ln(Rsb) and iHS candidate regions? To investigate this, we compared the major allele frequency spectra of SNPs between the iHS candidate regions, which are all 100 kb in size, and the 100-kb intervals around the centers of the ln(Rsb) candidate regions (Figure 6). The Perlegen genotype data were used for this comparison because of the more consistent ascertainment scheme. The allele frequency spectra of both iHS and Rsb regions differ substantially from that of the whole genome (Figure 6), exhibiting a deficiency of intermediate-frequency alleles and an excess of common/rare variants, a hallmark signature of positive selection. The Rsb spectra also deviate from the iHS spectra, with a slight excess of alleles near fixation (0.9 < allele frequency < 1) and a much higher abundance of fixed alleles (allele frequency = 1); on the other hand, iHS tends to have an excess of alleles between frequency 0.7 and 0.9, which is particularly evident in the European comparison (Figure 6). This clearly demonstrates the different specificity of these two methods: the ln(Rsb) method detects more fixed and nearly fixed selective sweeps, whereas the iHS approach has higher power to detect partial sweeps. It should be noted that the frequency of fixed alleles in the iHS spectra does not differ from that of the whole genome, whereas the frequency of fixed alleles in the ln(Rsb) spectra greatly exceeds that of the whole genome. This suggests that there may be a high proportion of recent sweeps that have reached fixation in individual populations. Figure 6 Comparisons of the Major Allele Spectra between the iHS and Rsb Candidate Regions, and the Genetic Diversity Pattern in the SLC24A5 Region (A and B) The frequency spectra of the major alleles for SNPs are compared between the iHS candidate regions (in blue) and the 100-kb regions around the centers of the Rsb candidate regions (in orange) in either (A) Europeans or (B) Chinese. Overlapping parts are in dark purple. For comparison, the corresponding major allele–frequency spectra for the whole genome are shown as black curves. (C) shows, as an example, the genetic diversity pattern for the 200-kb region around the SLC24A5 gene in Africans (Afr; upper section) and in Europeans (Eur; lower section), from the HapMap phase I data. Rows denote individual chromosomes, and columns denote the SNP sites. Genes in the Candidate Regions One important question is whether signatures of selection are enriched in genic versus non-genic regions. We used the Refseq gene coordinates to define genic regions; because the genic regions thus defined constitute a large proportion (~1/3) of the whole genome, we asked whether our candidate regions span more genic regions than expected by chance (Methods). The fraction of the total physical distance encompassed by the candidate regions that are genic ranges between 0.317 and 0.420 for different candidate region lists. For most cases, there is neither an excess nor a deficit of genic regions in the candidate regions, except there is a significant enrichment of genic regions in the EUR-CE candidate regions from the Perlegen data (54.8%, p = 0.005). This seemingly disagrees with the expectation that positive selection should happen more often in genes. One possible explanation is that our method detects selection sweep patterns that influence a large interval around the causal locus, which may not be gene rich. Another possibility is that regulatory regions outside genes might also have experienced substantial recent positive selection. Our approach captured some genes that have previously been shown to exhibit strong signatures of positive selection. These include drug-metabolizing genes CYP3A4 and CYP3A5 (Europeans [9]), skin pigmentation genes MYO5A (Europeans [21,42]) and SLC24A5 (Europeans [12]), immune system transcription factor interleukin 4, sepsis resistant gene Caspase12 (Chinese, Europeans [11]), the lactose tolerance gene (Europeans [8]), and the dietary calcium absorption-related ion channel gene TRPV6 (Chinese, Europeans [43,44]). Rediscovery of such positive examples supports the high power of our approach. When looked at in more detail, these genes all show the pattern expected for sweeps near to fixation. It is also notable that the sweep signals of two genes, CASP12 and SLC24A5, were not detected by the iHS approach [21], but are among the strongest signals found in our study. An intriguing question is what kinds of genes were involved in recent, local positive selection and therefore have had a potentially strong impact on recent human evolution. We performed a gene ontology (GO) analysis to determine whether any categories of genes are overrepresented in our candidate regions (Table 3, Methods). One GO category, interleukin-1 receptor antagonist activity, is found to be highly significantly overrepresented (raw p-values = 4.62 × 10−9, 2.94 × 10−12. When controlled for multiple testing, the false discovery rate [FDR] is 0.0021). However, all six of the genes in this GO category cluster in a 400-kb region in Chromosome 2. It is therefore not clear whether the signal of local selection comes from one or from multiple genes in this GO category. Other categories that show significant enrichment include those for genes involved in growth factor activity, cytokine activity, monooxygenase activity, ceramidase activity, calmodulin binding, various transporter and cofactor transporter activities, and vitamin transport. Cytokine and growth factor activity genes are involved in cell proliferation, development, and bone morphology, which might contribute to morphological differences between groups. Significant GO groups, such as cofactor transporter activity and vitamin transport, may have been involved in the adaptation to new food sources, or agriculture. Table 3 Summary of the Gene Ontology Analysis Results In contrast to a previous study of local selection [21], we did not detect significant enrichment of gamatogenesis or related genes, nor did the olfaction or chemosensory perception genes emerge as overrepresented groups. This may reflect differences between the tests used in the GO analysis, but may also suggest that the selective pressures that acted on these functional groups differ, one type being more ancient or more dramatic and causing complete or nearly complete sweeps, the other representing more recent and/or weaker selection and hence partial sweeps. It should be noted, however, that GO analysis for genome candidate regions is biased and limited in power for several reasons. First, functionally related genes are often spatially close to each other; significant results may thus arise via physical association rather than independent selection. Second, in each GO category, there are usually a very small number of member genes, and the multiple testing corrections can easily mask already weak signals. Third, in each candidate region, there may be more than one gene, and it is not clear which one hosted the selection event (or, indeed, there may even be more than one selection event in a candidate region). To gain more insights, each gene should be checked individually. Discussion Much progress has been made recently in developing methods to detect the signature of local positive selection in genomic data [18–23]. In this study, we offer three improvements and advances on existing methods. First, we have established a fast and simple counting algorithm to estimate EHH-derived statistics, such as iES, directly from dense genotype data from unrelated individuals without prior estimation of haplotype frequencies. The results of the counting algorithm are in good agreement with analytical estimation based on phased data. More importantly, with the counting algorithm, one can already achieve high test power for detecting recent, strong selective sweeps. Furthermore, by avoiding the intensive computation needed to estimate haplotypes from genotypic data, the counting algorithm provides superior flexibility for data updating, error checking, bootstrapping, and analyses of other sources of high-density SNP data, such as those produced by genotyping chips. Second, distinguishing signals of selection from extreme patterns produced by neutrality has been a major obstacle in the study of positive selection; highly variable patterns given by certain demographic factors, such as bottlenecks and population structure, further obscure the signals [1,6]. Here, we show that by using the standard transformed ln(Rsb) statistic, neutral simulation distributions remained largely invariant over a wide range of demographic parameter values and various ascertainment schemes. The ln(Rsb) distribution thus is robust to departures from neutrality due to various demographic events or other factors that are expected to have an equal impact across the genome. Moreover, the observed standardized ln(Rsb) distributions departed markedly from neutral simulations, providing strong evidence that recent positive selection has indeed been occurring in our species. It should be emphasized that the method we used to match the empirical distribution to the neutral distribution (namely, matching the two quartile points) is conservative and masked any signals of selection in Africans. Nonetheless, we defined the signal cutoffs through a rigid statistical procedure, and they are thus likely to represent the lower boundary of the fraction of recent positive selection in the human genome. Third, we introduce a new statistic, ln(Rsb), designed to complement existing statistics such as iHS [21]. The iHS statistic compares EHH between alternative alleles at a SNP, and hence has low power when one allele is at high frequency (and no power if an allele is fixed). However, the Rsb statistic, which compares EHH for the same SNP in two different populations, can provide evidence of selection in such cases. A power test demonstrates that tests based on ln(Rsb) possess high power for detecting strong selective sweeps that have reached fixation. Furthermore, our method detected numerous genes that have been previously shown to have strong evidence of selection, including CYP3A, MYO5A, SLC24A5, IL4, CASP12, and LCT [8,9,11,12,21,42,45], most of which are near or at fixation. It is especially notable that SLC24A5 and CASP12, which were not detected by the iHS test, gave strong signals in the ln(Rsb) test, supporting the reliability and power of this approach. Nonetheless, our approach failed to detect some other genes that previously showed strong evidence of selection, such as PDYN, MATP, and the 17q21 inversion [13,46,47], probably because these genes/regions have been subject to partial sweeps. Nonetheless, both the LRH test and the overlap with iHS candidates revealed that our approach should have considerable power to detect even partial selection sweeps. One potential problem is that the ln(Rsb) statistic has low power when selection was shared by all human populations, e.g., before or during the initial migration out of Africa. A more comprehensive view on recent positive selection in our species should be obtained by combining signals and investigating candidates from this study with those from other approaches such as iHS [21]. Our approach has identified approximately 500 candidate regions in the combined list, after excluding redundancy from overlaps. However, it is very unlikely that there was a separate selection event for each of these signals; multiple genes controlling a complex trait, such as pigmentation, may have been co-selected. Strong single selection events might also influence multiple genes, thereby producing multiple candidate regions. There is also substantial overlapping of candidate regions between Chinese and Europeans (0.19 ~ 0.33), which further indicates either shared selection events or parallel selection. Analyzing interactions between genes under selection pressures should provide further insights into human evolution. Finally, we detected some different GO groups that are enriched in our candidate regions. Some might help explain interpopulation differences in morphology, while others might have been involved in adaptation to novel food resources, pathogens, etc. Further detailed investigation of interesting candidate genes identified by this study should provide insights into the role and nature of adaptive events in the recent history of our species. Materials and Methods SNP databases. We used two publicly available SNP datasets, the approximately 1.5 million SNPs from Perlegen Science [14] and the approximately 3.8 million SNPs from the international HapMap Project [15]. Perlegen identified SNPs mainly by resequencing a discovery panel of 24 individuals, and genotyped 23, 24, and 24 unrelated individuals from African Americans, European Americans, and Han Chinese, respectively. Ascertainment of HapMap SNPs was heterogeneous and complicated [48]; genotypes were obtained for family trios in both Yorubans and Europeans, and unrelated individuals from both Han Chinese and Japanese. For the evaluation of the counting algorithm and the LRH test, we used the HapMap phase I data, for which the whole-chromosome phases were provided for about 1 million SNPs. Only SNPs that overlapped among all four populations were included in the analyses. LRH test. We used the HapMap phase I data, for which the haplotypes are available, for long-range haplotype (LRH) tests. The LRH statistic calculated here is similar to the iHS statistic introduced by [21]. Briefly, EHHA was first calculated for both alleles of a SNP, extending in both directions from the SNP until EHHA decreased to 0.05. The area under the EHHA curve was determined by integration to give the integrated EHHA (iEA), and the relative iEA within population (Raw) was calculated as: where the subscripts 1 and 2 denote the alternative alleles for SNP j. Singletons were removed from the analysis. Because the iEA calculation is sensitive to large gaps between successive SNPs, SNPs whose EHH decay plot hit gaps greater than 50 kb were removed from further analysis. The Raw values were then binned by major allele frequency. Within each bin, Raw values were ranked and assigned corresponding percentile values (LRHp). We then evaluated the percentages of SNPs with LRHp above the 5% or 25% threshold, and also the average LRHp for each candidate region, and derived three corresponding p-values: p 95, p 75, and p mean by comparing them to 1,000 iterations of random genome regions of the same size. The iES calculation. The calculation of the statistic iES starts by estimating EHHS around a core SNP as described in the Results. EHHS thus defined starts from one and decays towards zero as flanking sites are included that are progressively farther from the core SNP (Figure 2A). In this study, we calculated EHHS until it reached 0.1. The iES statistic then integrates the area under the EHHS curve: where a and b are the two ending positions where EHHS drops below 0.1, and Posi is the physical position of site j. Treatment of missing data for the counting algorithm. Missing genotypes were assigned as probability vectors conditioned on nearby genotypes as follows: where gj is the missing genotype, gj− 1 is the vector of genotype probabilities (g 11, g 12, g 22) in the preceding site, e.g., taking (1, 0, 0) if gj −1 is homozygous A1A1 for the first allele, and P′(j,j−1) is the matrix of conditional probability of genotype j given j − 1, defined by the non-missing data. Influence of recombination rate and SNP density. Because iES is sensitive to large gaps in the distribution of SNPs, SNPs close to such regions were excluded from analysis. To identify such SNPs, gaps larger than 200 kb were identified, and if any iES value within 200 kb of a gap exceeded 300,000, then the gap boundary was extended by 200 kb and the procedure repeated until no extreme iES values were found. Gaps derived from the three different populations were merged to give universal gaps, and SNPs within these gaps were discarded. To investigate how Rsb is influenced by varying recombination rates and SNP densities, Rsb was plotted against local recombination rate and SNP density, and correlations were calculated. Recombination rates were taken from the 1-Mb deCODE map [49]. SNP density was defined as the number of SNPs within a 20-kb window centered at each SNP. Neutral and positive selection simulations. A series of neutral demographic models were simulated, ranging from the simplest constant population-size model to complex models matching real data in multiple aspects [35]. The basic model (model 1) used N e =10,000 for all three populations, with the mutation rate μ and recombination rate γ assigned values of 1.5 × 10−8 and 1.3 × 10−8 site−1generation−1, respectively. The migration out of Africa and subsequent split between Asians and Europeans were set to be 3,500 and 2,000 generations ago, respectively. Based on this, further models doubled the recombination rate (model 2) or mutation rate (model 3), or split the ancestral population directly into three individual populations 3,500 generation ago (model 4). Two additional models added brief, severe bottlenecks immediately after population splitting (lasting for 100 generations, during which the population size shrank to 1%) either during the out-of-Africa migration and the origin of Europeans (model 5), or during the out-of-Africa migration and the origin of all three populations (model 6). The effect of population expansion was also studied by retaining the present population sizes and other configurations from the basic model while decreasing the population sizes back in time (model 7, present population sizes all set to 100,000; African population size changes to 24,000 at 200 generation ago; European and Chinese population sizes both change to 7,700 at 350 and 400 generations ago, respectively; and the ancient, common ancestor population size set to 12,500 at 17,000 generations ago). The effect of population structure was analyzed by modeling a hypothetical European population that split into two isolated groups (ratio of group sizes of 7:3) between 20 and 1,500 generations ago (model 8). Finally, we analyzed models featuring the complete best-fitting conditions, but one without migration and with uniform recombination rates (model 9), one with migration and uniform recombination rates (model 10, the standard model), and one without migration, but featuring heterogeneous recombination rates and recombination hotspots (model 11 [35]). All of the above neutral models except model 11 were simulated with the ms coalescent simulation package [50]. Model 11, which uses heterogeneous recombination rates, was simulated with Cosi [35]. Two hundred fragments of 2 Mb were generated for each model, and only SNPs in the central 1 Mb were used for iES/ln(Rsb) calculations. The iES and ln(Rsb) calculations were carried out by first randomly pairing the simulated chromosomes to form pseudo-genotype data, and than applying the counting algorithm . The ascertainment scheme approximating that of the Perlegen dataset was applied to the simulated data from models 1 through 10. Basically, the genotyping sample panels (23 Africans, 24 Chinese, and 24 Europeans) and discovery panel (six Africans, seven Chinese, and seven Europeans) were generated simultaneously, and each SNP from the genotyping sample was included only if it was polymorphic in the diversity panel. In addition, three different ascertainment processes were simulated for the standard mode. The first (model 10-asc1) assesses the influence of low-quality sequence data by using a very small discovery panel (two Africans, three Chinese, and three Europeans); the second (model 10-asc2) assesses a skewed constitution in the discovery panel (two Africans, four Europeans, and seven Chinese); the third model randomly eliminated 90% of the genotyping SNPs to match the empirical SNP density. Simulations of selective sweeps were performed using the two-population model described in Thornton and Jensen [37]. This model simulates a selective sweep in a derived population that has experienced a population bottleneck, and also generates data from an ancestral, equilibrium population. The bottleneck parameters are t r, the time at which the population recovered from the bottleneck, d, the duration of the bottleneck, and f, the relative size of the population during the bottleneck. The parameters t r and d are measured in units of 4Ne generations. The sweep parameters are, α = 2Nes, the strength of selection, X, the location of the selected site, and τ, the time in the past when the beneficial allele reached fixation in the population. Genome scan datasets consisting of 100 loci were simulated, with ten loci experiencing a selective sweep. Loci were 2 Mb in length, with θ = 0.001/site, and ρ = 4Ner = 0.00052/site. For loci experiencing a sweep, X was assigned uniformly at random along the 2 Mb. The parameters t r, d, and τ were fixed at 0.0075, 0.055, and 0.0076, respectively. We varied f to represent bottlenecks of different severity, with f = 0.1, 0.2, and 0.4. The strength of selection was varied from alpha = 100, 1,000, and 1,500. The cutoff for tSNPs and the critical value of the ratio of tSNPs in test 1 were obtained by simulating 5,000 independent neutral loci as above for each f = 0.1, 0.2, and 0.4. For both the ancestral and the derived populations, we sampled 60 chromosomes. The ancestral population was used to represent Africans and the derived to represent an out-of-Africa population. Ascertainment is similar to the neutral simulation, except there are only two populations. The discovery panel used 12 chromosomes from either population, and either genotyping panel included 48 chromosomes. Defining cutoffs. The full model (model 10) distribution and the empirical data were transformed and superimposed by matching the first and third quartile points as follows: where Q 1st and Q 3rd are the first and third quantile values of ln(Rsb). The neutral model was treated as the null hypothesis, and threshold values were positioned on the null distribution at the desired significance levels. Noise reduction by bootstrapping. The full Perlegen and HapMap genotyping panels were bootstrapped over individuals for 50 times each, and iES and Rsb were calculated for each SNP in each bootstrap resample, providing an estimated standard deviation of the ln(Rsb) value for every SNP. As inconsistent signals are more likely to come from SNPs with large variances in ln(Rsb), we constructed the distribution of SD(ln(Rsb))/ln(Rsb) for SNPs above the cutoffs, and discarded those SNPs in the bottom 50% of this distribution. The remaining SNPs are defined as tSNPs. Testing enrichment of evidence for selection in genic regions. We investigated whether genic regions exhibit more evidence of local selection than non-genic regions. We calculated the proportion of the full length of candidate regions in each candidate region list that overlapped genes, and obtained p-values by comparing the observed proportions to those observed in 1,000 samples of random genome regions of the same size. Overlapping of candidate regions between datasets. Pairwise overlapping was calculated between the candidate regions from the HapMap, Perlegen, and combined datasets. We also checked the overlapping of our combined candidate regions against a previous genome scan for positive selection [21]. Significance of the number of overlapping regions was evaluated by comparing the observed overlap to 1,000 iterations of random samples of two sets of regions with the same size as the candidate regions. Gene ontology analysis. Candidate regions obtained from the HapMap and Perlegen data were merged to produce a single universal set of genomic regions exhibiting evidence of recent local selection. This list was used to investigate the types of genes involved in local selection. We used the FUNC GO analysis package for this purpose (http://func.eva.mpg.de [51]). BioMart (http://www.biomart.org/index.html) was used to assign GO IDs for genes listed in the ENSEMBL known-gene database. The AmiGO gene ontology database (http://amigo.geneontology.org) was used to construct the GO tree. The hypergeometric test was used to find GO categories that are overrepresented in the candidate regions. Supporting Information Figure S1 Comparisons of the log(Rsb) Distributions between Neutral Model 1 and Neutral Model 10 Upper row: the QQ plots. In every QQ plot, quantile points of model 10 are plotted on the x-axis. Lower row: standardized ln(Rsb) histograms are superimposed onto one another. The blue curve designates model 10, and the red curve designates model 1. (99 KB PPT) Click here for additional data file. Figure S2 Comparisons of the log(Rsb) Distributions between Neutral Model 2 and Neutral Model 10 Upper row: the QQ plots. In every QQ plot, quantile points of model 10 are plotted on the x-axis. Lower row: standardized ln(Rsb) histograms are superimposed onto one another. The blue curve designates model 10, and the red curve designates model 2. (71 KB PPT) Click here for additional data file. Figure S3 Comparisons of the log(Rsb) Distributions between Neutral Model 3 and Neutral Model 10 Upper row: the QQ plots. In every QQ plot, quantile points of model 10 are plotted on the x-axis. Lower row: standardized ln(Rsb) histograms are superimposed onto one another. The blue curve designates model 10, and the red curve designates model 3. (71 KB PPT) Click here for additional data file. Figure S4 Comparisons of the log(Rsb) Distributions between Neutral Model 4 and Neutral Model 10 Upper row: the QQ plots. In every QQ plot, quantile points of model 10 are plotted on the x-axis. Lower row: standardized ln(Rsb) histograms are superimposed onto one another. The blue curve designates model 10, and the red curve designates model 4. (70 KB PPT) Click here for additional data file. Figure S5 Comparisons of the log(Rsb) Distributions between Neutral Model 1 and Neutral Model 10 Upper row: the QQ plots. In every QQ plot, quantile points of model 10 are plotted on the x-axis. Lower row: standardized ln(Rsb) histograms are superimposed onto one another. The blue curve designates model 10, and the red curve designates model 5. (70 KB PPT) Click here for additional data file. Figure S6 Comparisons of the log(Rsb) Distributions between Neutral Model 6 and Neutral Model 10 Upper row: the QQ plots. In every QQ plot, quantile points of model 10 are plotted on the x-axis. Lower row: standardized ln(Rsb) histograms are superimposed onto one another. The blue curve designates model 10, and the red curve designates model 6. (70 KB PPT) Click here for additional data file. Figure S7 Comparisons of the log(Rsb) Distributions between Neutral Model 7 and Neutral Model 10 Upper row: the QQ plots. In every QQ plot, quantile points of model 10 are plotted on the x-axis. Lower row: standardized ln(Rsb) histograms are superimposed onto one another. The blue curve designates model 10, and the red curve designates model 7. (70 KB PPT) Click here for additional data file. Figure S8 Comparisons of the log(Rsb) Distributions between Neutral Model 8 and Neutral Model 10 Upper row: the QQ plots. In every QQ plot, quantile points of model 10 are plotted on the x-axis. Lower row: standardized ln(Rsb) histograms are superimposed onto one another. The blue curve designates model 10, and the red curve designates model 8. (70 KB PPT) Click here for additional data file. Figure S9 Comparisons of the log(Rsb) Distributions between Neutral Model 9 and Neutral Model 10 Upper row: the QQ plots. In every QQ plot, quantile points of model 10 are plotted on the x-axis. Lower row: standardized ln(Rsb) histograms are superimposed onto one another. The blue curve designates model 10, and the red curve designates model 9. (70 KB PPT) Click here for additional data file. Figure S10 Comparisons of the log(Rsb) Distributions between Neutral Model 11 and Neutral Model 10 Upper row: the QQ plots. In every QQ plot, quantile points of model 10 are plotted on the x-axis. Lower row: standardized ln(Rsb) histograms are superimposed onto one another. The blue curve designates model 10, and the red curve designates model 11. (73 KB PPT) Click here for additional data file. Figure S11 Comparisons of the log(Rsb) Distributions between Standard Ascertainment and the Alternative Ascertainment 1 for Model 10 Upper row: the QQ plots. In every QQ plot, quantile points of the standard ascertainment are plotted on the x-axis. Lower row: standardized log(Rsb) histograms are superimposed onto each other. The blue color designates the standard ascertainment for model 10, and the blue color designates the alternative ascertainment 1. (70 KB PPT) Click here for additional data file. Figure S12 Comparisons of the log(Rsb) Distributions between Standard Ascertainment and the Alternative Ascertainment 2 for Model 10 Upper row: the QQ plots. In every QQ plot, quantile points of the standard ascertainment are plotted on the x-axis. Lower row: standardized log(Rsb) histograms are superimposed onto each other. The blue color designates the standard ascertainment for model 10, and the blue color designates the alternative ascertainment 2. (71 KB PPT) Click here for additional data file. Figure S13 Comparisons of the log(Rsb) Distributions between Standard Ascertainment and the Alternative Ascertainment 3 for Model 10 Upper row: the QQ plots. In every QQ plot, quantile points of the standard ascertainment are plotted on the x-axis. Lower row: standardized log(Rsb) histograms are superimposed onto each other. The blue color designates the standard ascertainment for model 10, and the blue color designates the alternative ascertainment 3. (70 KB PPT) Click here for additional data file. Figure S14 The Empirical ln(Rsb) Distributions Superimposed onto the Null Distribution to Define the Cutoffs The empirical ln(Rsb) distributions of the AE, AC, and CE comparisons (a, b, and c) were matched to their corresponding null distributions, defined by model 10, by the first and third quartile points. The blue curves designate the null distribution, and the red curves designate the empirical data. The cutoff positions are indicated by the short vertical lines. (59 KB PPT) Click here for additional data file. Table S1 Proportions of Candidate Region Lists That Are Significant in the LRH Test Pmn, P95, and P75 are related LRH tests that use the mean, number of SNPs above 95%, or number of SNPs above 75% as the test statistics, respectively (see Methods). (36 KB DOC) Click here for additional data file. Table S2 Candidate Regions for Europeans from the AE Comparison in the HapMap Data (37 KB XLS) Click here for additional data file. Table S3 Candidate Regions for Chinese from the AC Comparison in the HapMap Data (32 KB XLS) Click here for additional data file. Table S4 Candidate Regions for Europeans from the CE Comparison in the HapMap Data (24 KB XLS) Click here for additional data file. Table S5 Candidate Regions for Chinese from the CE Comparison in the HapMap Data (22 KB XLS) Click here for additional data file. Table S6 Candidate Regions for Europeans from the AE Comparison in the Perlegen Data (37 KB XLS) Click here for additional data file. Table S7 Candidate Regions for Chinese from the AC Comparison in the Perlegen Data (34 KB XLS) Click here for additional data file. Table S8 Candidate Regions for Europeans from the CE Comparison in the Perlegen Data (19 KB XLS) Click here for additional data file. Table S9 Candidate Regions for Chinese from the CE Comparison in the Perlegen Data (18 KB XLS) Click here for additional data file. Table S10 Combined Candidate Regions for Europeans from the AE Comparisons in the Perlegen and HapMap Data (41 KB XLS) Click here for additional data file. Table S11 Combined Candidate Regions for Chinese from the AC Comparisons in the Perlegen and HapMap Data (41 KB XLS) Click here for additional data file. Table S12 Combined Candidate Regions for Europeans from the CE Comparisons in the Perlegen and HapMap Data (20 KB XLS) Click here for additional data file. Table S13 Combined Candidate Regions for Chinese from the CE Comparisons in the Perlegen and HapMap Data (19 KB XLS) Click here for additional data file.
              Bookmark
              • Record: found
              • Abstract: found
              • Article: not found

              Genome Sequencing and Analysis of the Tasmanian Devil and Its Transmissible Cancer

              Introduction Cancers are clonal cell lineages that arise due to somatic changes that promote cell proliferation and survival. Although natural selection operating on cancers favors the outgrowth of malignant clones with replicative immortality, the continued survival of a cancer is generally restricted by the life span of its host. Tasmanian devil facial tumor disease (DFTD) is an unusual cancer that has survived beyond the death of the individual that spawned it by acquiring adaptations for transmission between hosts. This cancer has spread through the Tasmanian devil population and is threatening the species with extinction (Hawkins et al., 2006; McCallum et al., 2009). The genomes of the Tasmanian devil and its transmissible cancer, DFTD, are thus of interest both from the perspective of conservation of a threatened species as well as for the insights they may provide into the origins, somatic evolution and population genetics of an extraordinarily divergent neoplastic clonal lineage. The Tasmanian devil (Sarcophilus harrisii) is a marsupial carnivore endemic to the island of Tasmania, Australia. Tasmanian devils are solitary nocturnal scavengers that weigh up to 12 kg and generally live for 5 or 6 years in the wild (Owen and Pemberton, 2005). They are seasonal breeders and females rear a maximum of four pouch young each year (Owen and Pemberton, 2005). The species has limited genetic diversity, although three genetically distinct geographically defined subpopulations have been described (Jones et al., 2004; Miller et al., 2011). DFTD was first observed in northeastern Tasmania in 1996 (Hawkins et al., 2006). The disease is characterized by the appearance of tumors, usually on the face and inside the mouth of affected animals, which frequently metastasise and usually cause death within months (Figure 1) (Hawkins et al., 2006; Lachish et al., 2007; Loh et al., 2006a). Most commonly observed in sexually mature individuals of 2 years or older, DFTD occurs equally in male and female devils (Hawkins et al., 2006; Loh et al., 2006a). The cancer has spread rapidly through the Tasmanian devil population and has been associated with devil population decline (Hawkins et al., 2006; Lachish et al., 2007). Epidemiological studies have documented the expansion of the disease down the east coast of Tasmania and its continuing progression toward the west coast (Hawkins et al., 2006; Lachish et al., 2007). DFTD spreads by the direct transfer of living cancer cells, usually through bites inflicted on the face during mating and feeding interactions (Hamede et al., 2008; Pearse and Swift, 2006). DFTD is believed to be of neural crest origin, and the cancer cells express a number of genes of the Schwann cell lineage (Loh et al., 2006b; Murchison et al., 2010). The mechanism whereby the clone avoids immune rejection during colonization of allogeneic hosts remains unknown. Although low genetic diversity may contribute to DFTD susceptibility, experiments have indicated that devils are normally capable of mounting immune reactions to allogeneic grafts (Kreiss et al., 2011; Siddle et al., 2007). DFTD is one of two known naturally occurring clonally transmissible cancers, the other being the canine transmissible venereal tumor (CTVT) of dogs (Murchison, 2009). CTVT is a sexually transmitted lineage that is found around that world and that may have first arisen thousands of years ago from the cells of a wolf or East Asian breed of dog (Murgia et al., 2006; Rebbeck et al., 2009). Genetic analysis of the global diversity of CTVT cancers has indicated that the lineage has achieved considerable heterogeneity, with substantial lineage diversity present worldwide (Murgia et al., 2006; Rebbeck et al., 2009). Divergence time estimates suggest that modern CTVT may represent a recent global sweep of an ancient lineage (Rebbeck et al., 2009). In contrast to CTVT, little is known of the population diversity of DFTD or the dynamics of its spread through its host population. Cancer genomes are characterized by somatic changes including single-base substitution mutations, small insertions and deletions (indels), structural rearrangements, and copy number alterations. Analysis of the catalog of somatic mutations in cancer genomes can lead to greater understanding of the mutational events that triggered clonal outgrowth and the exposures or DNA repair defects that were responsible for the mutations in the first place (Pleasance et al., 2010a, 2010b). DFTD is a transmissible clone that has spread through the devil population in a process similar to metastasis. Its widely divergent lineage as a malignant clone make it an almost unique model for studying the genomic stability and long term evolution of cancer cells. We have sequenced, assembled, and annotated the normal genome of the Tasmanian devil, and we have used this reference to analyze the genomes of a second normal Tasmanian devil and two geographically distant DFTD cancer subclones. In addition, we have analyzed the genetic diversity present in 104 DFTD tumors collected from distant locations throughout Tasmania over a period of 7 years. Our analysis has led to the identification of genetic features of the original devil that gave rise to DFTD, a description of the underlying mutational processes that have characterized DFTD progression, annotation of gene variants that may have contributed to DFTD pathogenesis, and a map of the clonal dynamics of the disease during its spread through Tasmania. Results The Tasmanian Devil Genome To generate a reference genome for the Tasmanian devil, we sequenced and assembled the genome of a 5-year-old female Tasmanian devil. Sequencing libraries were prepared from genomic DNA extracted from a cell line derived from normal fibroblasts. These were sequenced from both ends, yielding 2.87 × 109 pairs of 100 bp sequence reads. Additional “mate pair” libraries, produced by circularising genomic DNA fragments of between 3 kilobase pairs (kb) and 10 kb in length, were generated and 50 bp was sequenced from both ends in order to assist with genome assembly. The genome was assembled with the Phusion2 assembly pipeline (Mullikin and Ning, 2003), and assembly features are summarized in Table 1. We estimated the size of the Tasmanian devil genome to be between 2.89 and 3.17 gigabase pairs (Gb) using both sequencing and flow cytometry data (Table S1 and Figure S1 available online). This is comparable with previous estimates of the Tasmanian devil and other marsupial genome sizes (Mikkelsen et al., 2007; Miller et al., 2011; Renfree et al., 2011). The Tasmanian devil genome has a G+C content of 36.4%, similar to that of the opossum (37.8%) but lower than that of humans (45.2%). To determine the chromosomal locations of our assembled contigs, we individually sorted each of the seven Tasmanian devil chromosomes from the female devil fibroblast cell line using a flow cytometer. Fifty thousand copies of each devil chromosome were collected, amplified, and sequenced. Alignment of the chromosome reads with the assembled contigs was used to assign the contigs to chromosomes; in addition, this method was used to detect and correct assembly errors by identifying contigs with homology to more than one chromosome. Using this method, we were able to assign 35,534 supercontigs (99%) to chromosomes. The number of bases assigned to each chromosome correlated with the flow cytometry measurement of chromosome DNA content (Table S2). Tasmanian devil chromosomes were named according to the system described by Pearse and Swift (Pearse and Swift, 2006), which differs in the naming of chromosomes 1 and 2 to a previous karyotype nomenclature for devils (Eldridge and Metcalf, 2006; Martin and Hayman, 1967). We used cross-species chromosome painting to determine the homology between devil and opossum chromosomes (Figure S2). We then used conservation with the opossum genome as a template for ordering supercontigs on each devil chromosome. Tasmanian devil genes were identified using the Ensembl genome annotation pipeline (Curwen et al., 2004; Potter et al., 2004), modified to incorporate devil transcriptome data. In total 18,775 protein-coding gene models were constructed, 1,213 of which did not have orthologs in the human or opossum genomes. We specifically searched the Tasmanian devil genome for a set of 451 genes that have been causally linked with cancer in humans (Futreal et al., 2004). Orthologs for 398 of these genes could be detected in the Tasmanian devil genome. Three hundred sixty-three microRNAs (miRNAs) were annotated in the Tasmanian devil genome by aligning small RNA sequence reads (Murchison et al., 2010). One hundred nineteen predicted miRNAs were devil orthologs of previously identified miRNAs and 244 were predicted novel miRNAs. Tasmanian Devil Cancer Genome Landscape We conducted cytogenetic analyses and sequenced the genomes of two DFTD cell lines, 87T and 53T, from geographically different regions of Tasmania (Figure 2A). 87T is derived from a tumor from a devil captured in 2007 in southeast Tasmania, and 53T was established in 2007 from a lung metastasis in a devil from the north coast of Tasmania. Alignment of the DFTD cancer cell line genomes with the reference genome yielded 691,328 and 699,156 single base substitutions in 87T and 53T, respectively, and 317,240 and 307,613 indels in 87T and 53T, respectively (Figure 2B). The number of variants in the DFTD genomes was somewhat higher than the number of variants observed in the normal female devil, and in a second normal male genome sequenced to assess normal variation (Figure 2B). This is not surprising, given that the DFTD genome contains variants that were present in the constitutional genome of the devil that first gave rise to the DFTD clone (the founder devil), as well as somatic variants that have arisen since DFTD has been a malignant clonal lineage (Figure 2B). These estimates are consistent with previous studies (Miller et al., 2011). Cytogenetic analyses indicated that the two DFTD subclones have differences in their karyotypes (Figure 2A). 87T is pseudodiploid with 13 chromosomes, whereas 53T is pseudotetraploid with 32 chromosomes. We used labeled flow-sorted chromosomes derived from a normal devil cell line as probes for forward chromosome painting of 87T. This experiment revealed several cytogenetic changes in DFTD (Figures 2C and 2D). Reverse chromosome painting, using labeled DNAs derived from flow-sorted chromosomes from 87T, provided further insights into the translocations in 87T and revealed heterozygous deletions on chromosomes 1, 2, and 3, as well as trisomy 5p (Figure 2E). Copy number analysis indicated that 87T has few detectable hemizygous deletions and no detectable high-level amplifications (Figure 2F and Figure S3). These analyses indicate that the DFTD genome contains substitutions, indels, copy number changes, and rearrangements. We next devised methods to identify subsets of variants of germline origin (i.e., those that were present in the constitutional genome of the founder devil) and those of somatic origin (i.e., those that arose during clonal proliferation of the DFTD lineage) in order to investigate the origin and somatic evolution of the DFTD clone. Origin of DFTD DFTD was first observed in 1996 in northeast Tasmania (Hawkins et al., 2006). Previous studies have indicated that the cancer is derived from the cells of one devil (the DFTD founder), and has subsequently spread through the devil population as a clone (Pearse and Swift, 2006). We do not have DNA from the founder's normal genome, as this animal was a wild Tasmanian devil that lived and probably died prior to 1996. However, variants from this devil's constitutional genome remain within the DFTD cells that make up the tumors of thousands of devils. We sought to reconstruct the genome of the founder devil by searching for common variants between the genomes of the two DFTD subclones, 87T and 53T. These variants will include normal variation that was present in the founder's genome as well as somatic variants in DFTD that arose prior to divergence of the 87T and 53T lineages. We found 700,436 common single base substitutions and 251,257 common indels between 87T and 53T (Figure 2B). At least 563,877 single-base substitution variants and 235,610 indels are likely to be the founder's germline variants, as we also found them in either the female or male normal devil genomes. The remaining 136,559 substitutions and 14,647 indels will include private germline variants that were specific to the founder devil and not found in the two normal genomes that we sequenced as well as somatic mutations that have been acquired by the DFTD lineage. The gender of the founder devil is unknown. Like other marsupials, Tasmanian devils have X and Y sex chromosomes, and males are the heterogametic sex. Previous studies have indicated that neither of the sex chromosomes is cytogenetically identifiable in DFTD (Pearse and Swift, 2006). It is possible that the sex chromosomes initially present in the constitutional genome of the founder devil have been lost during DFTD carcinogenesis or that these chromosomes have been rearranged in the DFTD genome such that they are not cytogenetically identifiable. We first searched for the presence of the Y chromosome gene SRY in DFTD. As expected, the SRY gene could be amplified from the genome of a male devil but not from a female devil; however, our assays could not detect SRY in the DFTD genome (Figure 3A). We next searched for evidence of the X chromosome in the DFTD genome. Reverse chromosome painting experiments and copy number analysis of 87T indicated that the X chromosome is present in approximately two copies in this genome (Figure 2 and Figure S3). These are likely to be a homologous pair rather than recent duplicates, as the number of single-base substitution variants mapping to the X chromosome in the two DFTD genomes was comparable to the number of variants found on the X chromosome in the female normal devil genome and approximately double the number of X chromosome variants found in the male normal genome (Figure 3B). The data therefore suggest that the DFTD founder devil was a female. DFTD was first observed in northeast Tasmania (Hawkins et al., 2006). To explore further the geographic origin of the founder devil we sequenced the mitochondrial genomes (excluding the control region) of 92 Tasmanian devils from 25 locations in Tasmania and constructed a phylogenetic tree based on their sequences (Figure 3C). We found evidence for six mitochondrial haplotypes among normal devils. Three of these had widespread distributions throughout Tasmania and three were confined to locations in the northwest of Tasmania, consistent with other studies (Miller et al., 2011). The 87T and 53T DFTD mitochondrial genomes were most closely related to one of the widespread devil haplotypes. There is evidence of horizontal transfer of mitochondrial genomes between hosts and cancers in another transmissible cancer lineage, CTVT, which has led to multiple distinct clades of CTVT mitochondrial haplotypes (Rebbeck et al., 2011). To test whether horizontal transfer of mitochondria occurs in DFTD, we sequenced the mitochondrial genomes of 104 DFTD tumors and included their haplotypes on the phylogenetic tree (Figure 3C). All of the DFTD mitochondria were either identical to or apparently derived from a single devil haplotype, suggesting that they are clonally derived from the founder devil. These analyses suggest that mitochondrial horizontal transfer does not occur or is not widespread in DFTD, and indicate that the founder devil belonged to a haplogroup that is currently widespread throughout Tasmania. DFTD was first observed in 1996 (Hawkins et al., 2006). However, we do not know the timing of the emergence of the DFTD clone. Given the overt and disfiguring symptoms of DFTD, as well as its dramatic recent effects on devil population size (Hawkins et al., 2006; Lachish et al., 2007), it seems unlikely that the disease remained undetected for a long period prior to 1996. Indeed, retrospective studies of devil skulls, preserved specimens and pelts collected between 1941 and 1989 revealed no evidence for DFTD prior to the 1990s (Loh et al., 2006a). However, it is possible that the current DFTD epidemic is the most recent manifestation of an ancient clone with a long history of coexistence with the Tasmanian devil population. Our mitochondrial genome analysis indicates that the founder devil's mitochondrial genome is identical to those found in many modern devils (Figure 3C). In addition, DFTD mitochondrial genomes are in most cases more closely related to the founder than to each other (Figure 3C). These observations are consistent with a recent origin for DFTD. Somatic Evolution of DFTD Having identified genetic features and variants present in the constitutional genome of the DFTD founder devil, we next performed a detailed analysis of DFTD variants of somatic origin. Somatic variants are those that have arisen during the establishment and progression of DFTD as a clonal lineage. Analysis of somatic variants in two divergent DFTD lineages may provide insight into the mutational processes that have operated in DFTD as well as the genetic changes that have driven its growth. We cannot directly ascertain the set of somatic variants in DFTD because the founder devil died in obscurity in the Tasmanian bush more than a decade ago. However, we compiled a set of DFTD single-base substitutions enriched for somatic mutations by identifying variants that were present in one DFTD genome but absent in the other. We identified 15,160 single-base substitutions that were present in 87T but not 53T, and 17,790 that were in 53T but absent from 87T (Figure 2B). These variants could have arisen as somatic base substitution mutations. Alternatively, as we do not know the germline genotype of the DFTD founder devil, they could have been heterozygous germline variants that were lost in either of the two DFTD lineages. However, we established that most of these variants are likely to have arisen as somatic substitutions by demonstrating the absence of 15 out of 16 in the genomes of 110 normal devils. Moreover, the nonsynonymous to synonymous (NS/S) ratios for the 87T and 53T unique variants were 2.78 and 2.08 respectively, a range typical of somatic variants in human cancers and compatible with that expected from random mutagenesis (Figure 4A). By contrast, the NS/S ratios of germline devil single-nucleotide polymorphisms (SNPs) were 0.9 and 0.98 for the normal male and female devil genomes, respectively, similar to that of common SNPs in humans and indicative of substantial negative selection. Finally, as somatic mutations are likely to arise in the heterozygous state, the observation that variants unique to each DFTD lineage contain a high proportion of heterozygous variants provides further evidence for these sets being strongly enriched for somatic mutations (Figure 4B). These estimates suggest that 87T and 53T have each acquired between 15,000 and 17,000 single-base substitution mutations since divergence from their most recent common ancestor tumor. We do not know how many somatic mutations were present in the DFTD lineage prior to 87T and 53T divergence. However, the observation that the total number of private variants inferred in the most recent common ancestor tumor (136,559) is comparable to the number of private variants in a normal male genome (135,134), as well as the NS/S ratio for these variants (0.8), suggests that the large majority of the private variants in the common ancestor tumor were of germline origin. This suggests that the prevalence of somatic substitution mutations in DFTD may not be substantially greater than 17,000. This is somewhat higher than the number of mutations observed in many human tumor types (approximately 5,000 per cancer genome) (Greenman et al., 2007). However, it is less than are found in many human melanomas and lung cancers, which are often the result of past mutagenic exposures, or in human cancers with mutator phenotypes due to DNA mismatch repair defects (Pleasance et al., 2010a, 2010b). Cancers often have mutational processes that are different to those which operate in the germline. Comparison of the mutation spectra of Tasmanian devil germline variation to the sets of variants highly enriched for somatic mutations in 87T and 53T revealed that, as expected, devil germline SNPs were enriched for transitions (Figure 4C). However, we also observed elevated proportions of A:T → T:A, A:T → C:G, and G:C → T:A transversion mutations in DFTD (Figure 4C). This pattern was independently detectable in 87T and 53T since their divergence from their most recent common ancestor tumor, but was not detectable in the variants inferred in these two tumors' most recent common ancestor (Figure 4C). This suggests that this mutation profile is the result of an endogenous mutational process—for example, a defect in DNA repair—that was acquired before the divergence of the two lineages, or that it was caused by independent exposure of the two lineages to a carcinogenic environmental agent. Copy number changes and structural rearrangements are commonly somatically acquired by cancer genomes. Although the majority of the copy number variants that we identified in 87T and 53T were common to both lineages (Figure S3), some copy number variants, including, for example, the hemizygous deletion on chromosome 3 in 87T, occurred in only one of the two tumors (Figure 4D). Such variants are likely to have arisen since the divergence of the 87T and 53T tumor lineages and have therefore been somatically acquired during DFTD evolution. We identified and validated 11 and 17 rearrangements that were specific to the 87T and 53T DFTD genomes, respectively (Figure 4E and Table S3). Thirteen of the 28 rearrangements unique to either 87T or 53T had between two and six bases of microhomology at the breakpoint region, indicating that DFTD may employ microhomology-mediated end joining as a repair process for double-stranded DNA breaks. Most of the somatic variants that are present in DFTD are likely to be selectively neutral passenger mutations. However, a subset of somatic variants in the DFTD genome will be driver mutations that have provided selective advantage to the cancer during passage through its devil hosts. Three hundred twenty-four genes were predicted to contain nonsynonymous substitution and indel variants that were present in 87T and 53T but not in either of the normal devil genomes (Table S4). These included 313 genes with single-base substitutions and 11 genes with indels. A search for predicted nonsynonymous mutations in a set of 138 genes that are known to be mutated by single-base substitutions and indels in human cancers (Futreal et al., 2004) yielded heterozygous single-base substitutions in RET and FANCD2 that were not present in either of the two normal genomes that we sequenced. Both mutations were predicted to cause single base substitution mutations that have not previously been described in cancer (Table S4). Changes in the copy number of cancer genes and truncation or fusion of genes through rearrangements can also promote oncogenesis. Two genes, MAST3 and a novel gene with similarity to BTNL9, were predicted to be homozygously deleted in DFTD. The functions of these two genes are not well understood, although the butyrophilin gene family, of which the BTNL9-like gene is a member, may be involved in immune modulation (Arnett et al., 2009; Stammers et al., 2000). Neither of the DFTD genomes contained any predicted regions of high-level amplification (Figure S3). We found several putative rearrangements involving genes, including a balanced translocation involving PDGFA (Table S4). Although DFTD is not virally transmitted, it is possible that a virus may have contributed to DFTD pathogenesis. We searched for the presence of virus DNA in DFTD by aligning virus-derived DNA sequences contained in the RefSeq database with the assembled DFTD genomes as well as the normal devil genome assembly. We did not find evidence for exogenous viruses in the DFTD genome. However, it is possible that DFTD contains viral sequences that were not detectable using this method. DFTD colonizes its devil hosts as an allogeneic graft. In order to investigate the mechanisms whereby DFTD evades host immune rejection, we searched for genetic variants in 25 genes involved in the antigen processing and presentation machinery (described by gene ontology IDs GO:0019885 and GO:0019882). Fifteen of these genes could be identified in the devil genome, and one gene, NOD1, had a predicted rearrangement that was predicted to be present in both DFTD genomes but absent from the normal Tasmanian devil genomes (Table S4). Further analysis and annotation of immune genes in the Tasmanian devil genome will be required to elucidate the genetic mechanisms of DFTD immune evasion. Divergence and Clonal Dynamics of DFTD Lineages We have described the somatic changes that have occurred in two DFTD cancers, 87T and 53T, collected in the Forestier Peninsula in the southeast of Tasmania and Narawntapu National Park on the north coast of Tasmania, respectively, since divergence from their most recent common ancestor tumor. Observational epidemiological studies have indicated that DFTD first arrived in the Forestier Peninsula in 2004. The first DFTD case observed in Narawntapu National Park was in 2007. However, we do not know the routes that were followed by these lineages across Tasmania, nor do we have any information about the clonal dynamics of DFTD disease spread. We investigated whether DFTD progression into new territories is characterized by linear colonization and occupation, or rather by repeated waves of lineage replacement. The evolutionary dynamics of the DFTD clone during its expansion across Tasmania can be traced by analysis of the observed patterns of somatic mutation. We collected 104 DFTD tumors from 69 Tasmanian devils captured in several locations throughout Tasmania between 2004 and 2010 (Figure 5). We genotyped this set of tumors for 16 variants that we had previously identified either in 87T or 53T but not in both tumors and thus are likely to be somatic (Table S5). In addition, we analyzed the mitochondrial genomes (excluding the control region) from the entire set of DFTD tumors, leading to the identification of 21 somatic mitochondrial DFTD variants (Table S5). These experiments revealed differences in the population of DFTD in different regions of Tasmania. The observation that all of the tumors in the isolated Forestier Peninsula cluster into a single lineage suggests that this tumor population was founded by a single subclone of DFTD, precursors of which are located on the east coast of Tasmania (Figure 5). Divergence within this lineage after its introduction has given rise to a number of tumor subclones found only within the Forestier Peninsula. One of these lineages (illustrated in Figure 5 with a green dot, black outline) appears to have increased in frequency between 2007 and 2010 in a manner resembling a selective sweep (Figure 5, lower panel). These fluctuations in the dominant tumor type could be due to selection, or they could alternatively be due to simple neutral processes. In contrast to the Forestier Peninsula, the mainland Tasmanian DFTD population shows the emergence and simultaneous maintenance of several distinct tumor subclones (Figure 5). The tumor lineage to which 53T belongs appears to be a dominant clone in the north and northwest (Figure 5, dots with green outline). Several tumors were found to have unique patterns of variation. For example, each of the two tumors that we sampled from northeastern Tasmania, the location where DFTD was first observed in 1996, had their own individual patterns of variation (Figure 5, orange and black dots, gray outline), suggesting that tumor diversity may be greater in this region, perhaps reflecting its status as the possible origin of DFTD (Hawkins et al., 2006). DFTD Diversity within Individual Hosts We collected two or more DFTD tumors from 20 individual devils in our set. In some cases the additional tumors were facial or oral, and in others they were in submandibular lymph nodes and internal organs. Genotyping was unable to distinguish between multiple tumors derived from the same host in 14 of the 20 cases, suggesting that most additional tumors are metastases of primary tumors originating from a single DFTD bite. There were six cases, however, in which an individual animal had tumors with two different genotypes (Figure S4). In three of these cases, both genotypes were found in tumors in other animals, indicating that the two tumors were probably derived from separate DFTD bites. This suggests that prior exposure to DFTD does not protect devils from subsequent DFTD inoculations. However, in each of the remaining three cases, one of the two genotypes was not found in a tumor in any of the other animals that we sampled. In these instances, the two genotypes differed only by a single variant, and it is possible that the novel genotypes may have arisen as new variants in these animals (Figure S4). Discussion Cancer genomes bear imprints of carcinogenic exposures, endogenous DNA repair processes and selective pressures to which the clone has been subject. DFTD has existed as a malignant clonal lineage for at least 15 years by repeated subcloning through the Tasmanian devil population. Our analysis of the genomes of two geographically distant DFTD subclones has indicated that DFTD is continuing to acquire new variations in its karyotype, genomic copy number and DNA sequence. Despite evidence for ongoing somatic change in the DFTD lineage, the overall level of mutation that has been accrued by two DFTD lineages since divergence from their most recent common ancestor tumor is comparable with the number of changes that are observed within some human cancers (Pleasance et al., 2010a, 2010b). This is perhaps surprising, given that DFTD has probably undergone a greater number of mitoses than most human cancers. It indicates, however, that DFTD is a relatively stable lineage and that a high level of genomic instability has not been required for the cancer to become transmissible. Analysis of the genetic diversity of DFTD subclones throughout mainland Tasmania suggests that the evolution of DFTD has been characterized by linear radiation of DFTD subtypes from their common origin. Geographical analysis of DFTD lineage diversity indicates a wide distribution of variant DFTD subclones as well as local coexistence of different subclones, sometimes even within a single host. Our analysis identified a DFTD founder population on an isolated peninsula. Divergence within this lineage has led to the appearance of several DFTD subtypes, one of which has recently become dominant in a manner resembling a selective sweep. Future genomic analysis of hundreds of DFTD genomes will provide further insight into the diversity and evolution of DFTD, and may perhaps help to predict the future trajectory of this clone and its impact on the devil population. The transfer of cancer cells between individuals is normally prevented both by physical barriers and by the action of the immune system. The ability of transmissible cancers to circumvent these obstacles demonstrates the potential of cancer cells to become parasitic clonal lineages. DFTD and CTVT are the only two known naturally occurring transmissible clonal lineages. Our studies have highlighted similarities and differences between the two lineages. Previous reports have indicated that the most recent common ancestor of today's globally distributed CTVT clones existed between 47 and 2,000 years ago (Murgia et al., 2006; Rebbeck et al., 2009). DFTD, however, is probably not more than 20 years old. CTVT has been observed to periodically take up mitochondria from its host by horizontal transfer (Rebbeck et al., 2011); in contrast, we do not find any evidence for this phenomenon in DFTD. Interestingly, it has been proposed that many modern CTVT tumors represent the most recent global sweep of a subclone of the disease (Rebbeck et al., 2009). We observed a similar sweep of DFTD tumors on the Forestier Peninsula. Both CTVT and DFTD continue to acquire new copy number variations (Thomas et al., 2009). Future analysis of both the DFTD and CTVT lineages and their hosts will help to determine the common and unique features of these two cancers and will perhaps reveal common genetic changes that favor the outgrowth and progression of clonally transmissible cancers. Although there are no known naturally occurring transmissible cancers that affect humans, there are rare reports of cancer transmission between two or more humans. These involve accidental transfer of cancer cells through organ transplantation or during surgical procedures, deliberate transfer of cancer cells between humans for experimental purposes, or transfer of cancer cells in utero (Gandhi and Strong, 2007; Gärtner et al., 1996; Moore et al., 1957; Tolar and Neglia, 2003). Further comparative studies of transmissible cancer genomes may indicate the mechanisms that permit cancer transmission between individuals. Cancer is an inevitable outcome of the potential of cells to reproduce and to adapt to their environment; their environment is usually limited to a single host, but cancers can sometimes escape from their hosts and become parasitic clonal lineages. Here we have described a whole-genome analysis of such a cancer, and our studies have provided insights into the genetic identity of the individual that founded the DFTD clone, as well as patterns of ongoing DFTD somatic evolution and clonal dynamics. This work will enable more detailed studies of the structure and history of the Tasmanian devil population and its response to the DFTD epidemic. Understanding the interaction between the genomes of DFTD and its host and the identification of patterns of disease spread and host response may provide information that will assist with the conservation of the Tasmanian devil. Experimental Procedures Whole-Genome Sequencing, Assembly, and Annotation DNA from a female Tasmanian devil fibroblast cell line (with trisomy 6) and from two DFTD cell lines (87T and 53T) was extracted and used to prepare short insert libraries and mate pair libraries with insert sizes from 3–10 kb (fibroblast cell line only) for paired end sequencing as previously described (Bentley et al., 2008). Short-insert library sequencing with 100 bp paired-end reads was performed on an Illumina HiSeq2000 instrument and mate-pair library sequencing with 50 bp paired-end reads was performed on an Illumina GA2 instrument. In addition, short-insert libraries were constructed from DNA extracted from the liver of a male Tasmanian devil. The library was sequenced on an Illumina Genome Analyzer IIx machine with 108 bp reads. Sequencing, raw data processing, and quality-control checks were performed as previously described (Bentley et al., 2008). The genome of the female devil was assembled with the Phusion2 Assembly Pipeline (Mullikin and Ning, 2003). In brief, paired-end sequence reads were processed to generate kmer words (k = 61). K-tuples were merged and sorted into a table, and shared kmer words were linked in a relation matrix. Small read clusters with ∼100,000 reads were used to generate contigs with Phrap (http://www.phrap.com/). RPono, a package in the Phusion2 pipeline, was then used to build supercontigs with mate-pair sequences. Genome size was estimated using kmer frequency information, flow karyotype analysis and nuclear DNA content analysis (see the Extended Experimental Procedures, Figure S1, and Table S1). The genome was annotated with the Ensembl Genebuild Pipeline (Curwen et al., 2004; Potter et al., 2004). Transcriptome sequencing of pooled RNA from 12 devil tissues was used to assist with annotation (more details are in the Extended Experimental Procedures). Chromosome Assignment Each of the seven Tasmanian devil chromosomes was individually sorted from the female devil fibroblast cell line with a flow cytometer. Fifty thousand copies of each devil chromosome were collected, amplified, and sequenced on two lanes of an Illumina Genome Analyzer IIx instrument with 100 bp paired-end reads. Alignment of chromosome-derived reads with contigs was used to assign contigs to chromosomes and to correct assembly errors. We assigned 35,534 supercontigs (99%) to individual chromosomes (Table S2). Supercontigs were ordered on chromosomes using conservation with opossum, and supercontigs that could not be assigned to a chromosome were assigned to “ChrU.” Chromosome assignment was validated with fluorescence in situ hybridization. Sorted chromosomes were also used as probes for chromosome painting (Extended Experimental Procedures). Variant Analysis Reads were aligned with the reference genome using BWA (Li and Durbin, 2009). Single-base substitutions (Figure 2B) were called using SAMtools and filtered by coverage (minimum 10, maximum 150), read quality (minimum quality, 30), mapping quality (minimum quality, 30), base quality (minimum quality, 30), and end of contig (minimum distance to end of contig, 500 bp). Indels (Figure 2B) were called using CASAVA with parameters Q(snp) ≥ 30 and Q(max_gtype) ≥ 5. One hundred eleven of 117 (95%) single-base substitutions and 119 of 124 (96%) indels, randomly selected, were confirmed with capillary sequencing. Variants from each genome were compared and subtracted to identify the set of variants that were unique to each genome. Structural rearrangements were identified as previously described (Pleasance et al., 2010a). In brief, read pairs were aligned with the draft Tasmanian devil assembly with BWA (Li and Durbin, 2009). Discordant pairs that mapped with an unexpected insert distance or orientation or to different supercontigs were identified and clustered to form regions of interest. We discarded groups which did not have at least seven reads of mapping quality ≥30 supporting the variant, as well as all reads that were within 500 bp of the end of a contig. Structural variants were filtered for those that were specific to individual samples and a subset were validated with PCR, gel electrophoresis, and sequencing from both ends with an ABI 3730xl DNA analyzer. Mitochondrial genomes (excluding the control region) were sequenced with the capillary platform, and variants were called with NovoSNP (Weckx et al., 2005). Copy number variants were identified using the DNAcopy package (Olshen et al., 2004) with a nonoverlapping window of 2,000 bp. A subset of copy number variants were validated with quantitative real-time PCR. Tasmanian Devil Samples Tissue samples were collected from wild and captive Tasmanian devils under research authorities 33/2004-2005 and 24/2006-2008 (extended) issued by the Tasmanian Department of Primary Industries, Water, and the Environment. The research was reviewed by the Wellcome Trust Sanger Institute Animal Ethics Committee. Extended Experimental Procedures Tasmanian Devil Samples and Cell Lines Tasmanian devil (Sarcophilus harrisii) samples were collected from captive animals or from wild animals either in the field, or at postmortems undertaken at the Tasmanian Department of Primary Industries Animal Health Laboratories. Cell lines were established from two DFTD tumors and from the skin of a normal captive-bred five-year-old female Tasmanian devil. The female fibroblast cell line developed trisomy 6 during cell culture. Tasmanian Devil Genome Sequencing DNA from a female Tasmanian devil fibroblast cell line and from two DFTD cell lines (87T and 53T) was extracted and used to prepare short insert libraries (fibroblast and DFTD cell lines) and mate pair libraries with insert sizes from 3–10 kb (fibroblast cell line only) for paired end sequencing as previously described (Bentley et al., 2008), with the following modifications; libraries were generated with TruSeq Adapters and a single cycle of PCR. Short insert library sequencing with 100 bp paired end reads was performed on an Illumina HiSeq2000 instrument and mate pair library sequencing with 50 bp paired end reads was performed on an Illumina GA2 instrument. In addition, short insert libraries were constructed from DNA extracted from the liver of a male Tasmanian devil; the library was generated with TruSeq Adapters and 10 cycles of PCR. The library was sequenced on an Illumina Genome Analyzer IIx machine with 108 bp reads. Sequencing, raw sequencing processing and quality control checks were performed as previously described (Bentley et al., 2008). Tasmanian Devil Genome Assembly We developed the Phusion2 genome assembly pipeline to assemble large eukaryotic genomes using Illumina short sequence reads. Read files from individual lanes were processed to generate kmer words at a given size (k = 61). K-tuples were then merged and sorted into a table so that kmer words shared by different reads could be linked. A relation matrix was used to record the shared kmer words among all the reads. Setting a minimum threshold of shared k-tuples, short reads can then be clustered into groups using kmer sharing information in the relational matrix. After obtaining small read clusters with a controllable size (∼100,000 reads), we used Phrap (http://www.phrap.com/) to generate contigs. For the normal female Tasmanian devil genome, we generated 2.87 × 109 100 bp reads. Using k = 61, the kmer frequency peaks at 25. Any kmer word occurring more than 35 times was not used in further analysis, thus removing the repetitive reads. Since single end reads are combined into “pairs,” the gaps due to missing single repetitive reads are filled by paired partners. After read clustering, we obtained 390,842 read groups containing 1.022 × 109 read pairs. Using Phrap, we generated an assembly with 2.93 Gb of assembled bases and an N50 of 20 kb. Incorporating mate pair sequence data (194 × 106 paired reads of 2x50 bp with insert sizes from 3 to 10 kb) allowed us to obtain a scaffold assembly. The PCR duplication rate of the mate pair sequence reads was very low (<5%). We removed chimeric read pairs by excluding read pairs if one or two ends were found in the middle of the contig, but the edge length was larger than the insert size. To minimize the effect of read pairs overlapping biotin junctions (the circularization junction point that is introduced during mate pair library construction) we only used those pairs with full-length genome alignment. We then used RPono, a package in the Phusion2 pipeline, to build supercontigs. At the end of this process, we had an assembly of 3.15 Gb with an N50 of 839 kb. To further improve our assembly, we aligned our supercontigs with the opossum genome assembly (Monodelphis5.0); this yielded our draft assembly with a total of 3.17 Gb assembled bases and with an N50 of 1.84 megabases (Mb) (Table 1). This version of the genome assembly (Tasmanian devil genome assembly version 7.1) was used for all of the analyses described in this paper (Table 1). In addition, we performed de novo assemblies of the two DFTD genomes, 87T and 53T, using similar methods. These assemblies have been deposited in the GenBank/DDBJ/EMBL database. The Tasmanian devil mitochondrial genome was amplified from DNA extracted from the female Tasmanian devil fibroblast cell line. Amplicons were sequenced from both ends with an ABI 3730xl DNA Analyzer and manually assembled. The control loop region could not be completely sequenced due to its repetitive structure. Genome Size Estimation Using Kmer Frequency We define the genome size as the total number of effective kmer words divided by the kmer depth or the kmer occurrence number at the peak kmer frequency Dp : (S1) G s = ( K n − K s ) D p . Here Kn is the total number of kmer words and Ks is the number of single or unique kmer words. The peak depth values for the female Tasmanian devil genome and DFTD cell lines 87T and 53T are Np  = 25, 18 and 27 respectively. With Kn  = 81.99 × 109, 68.75 × 109 and 91.20 × 109 and the numbers of single occurrence kmer words are Ks  = 6.249 × 109, 13.656 × 109 and 9.135 × 109, we estimate the genome size to be Gs  = 3.03 × 109, 3.06 × 109 and 3.04 × 109 respectively (Table S1). Flow Karyotype Analysis of Tasmanian Devil DNA Content and Genome Size To estimate the genome and chromosome size of the Tasmanian devil by flow karyotype analysis, normal Tasmanian devil fibroblast chromosome suspensions were compared to chromosome suspensions from human lymphoblastoid cell lines, both prepared as previously described (Gribble et al., 2009). Briefly, Tasmanian devil and human chromosome suspensions as well as human-devil mixed chromosome suspensions were stained overnight with Hoechst 33258 (HO) and Chromomycin A3 (CA3) and analyzed on a flow cytometer as described previously (Gribble et al., 2009). A total of 50,000 to 200,000 events were acquired for each chromosome preparation and displayed on a bivariate plot of HO versus CA3 fluorescence. Data collected from the experiments were analyzed using the Summit analysis software (Beckman Coulter). In order to estimate the Tasmanian devil chromosomal DNA content and genome size, we made use of a technique described by Trask and Schmitz (Schmitz et al., 1992; Trask et al., 1989) using a ‘DNA line’ in the human flow karyotype as a standard to estimate the Tasmanian devil chromosome DNA size. This approach involved acquiring data from a mixed human-Tasmanian devil chromosome suspension and calculating the mean HO and CA3 fluorescence intensity for a few selected human chromosomes and each of the Tasmanian devil chromosome peaks along the human DNA line. The human DNA line is a projection through the origin and the peak of a normal human chromosome 4 at an angle (α) to the x axis which in this case was 50°. DNA values for chromosome peaks were computed using the formula described by Trask (Trask et al., 1989), (S2) D n = HO n X sin α + CA3 n X cos α. D n = HO n X sin α + CA 3 n X cos α . The chromosomal DNA content for each of the Tasmanian devil chromosome peaks were determined by linear regression using the measured human D n value for selected chromosomes and the estimated human chromosomal DNA content (Trask et al., 1989). The genome size of the Tasmanian devil was obtained through the summation of the chromosomal DNA content of all the measured chromosome peaks (Table S1). DNA Content Analysis of Cells Using Propidium Iodide The genome size (DNA content) of Tasmanian devil cells was estimated by DNA index analysis. DNA cell suspensions were prepared from lysed whole blood isolated from four different Tasmanian devils. Four different human lymphoblastoid cell lines were analyzed together using propidium iodide (PI) staining. Commercially available chicken erythrocytes nuclei (CEN) of known DNA content acted as an internal reference standard. The DNA cell samples from fixed lysed whole blood of Tasmanian devil and human lymphoblastoid cell lines were stained with PI solution before analyzing on a flow cytometer (Darzynkiewicz and Juan, 1997). A total of 10,000 events were acquired per DNA cell sample. The cells were gated on PI fluorescence area versus PI fluorescence width to discriminate any doublets and clumps. The gated events were displayed on a histogram plot of PI fluorescence area (Figure S1). Data collected from the experiments were analyzed using the Summit analysis software (Beckman Coulter). The Tasmanian devil genome size was estimated as follows. An aliquot of CEN was mixed with 1 × 106 Tasmanian devil cells at a ratio of 1:20. Flow cytometric data were acquired from the mixed CEN-Tasmanian devil DNA cell suspension and the DNA Index (DI) was measured by calculating the mean PI fluorescence intensity of each peak and finding the ratio of the mean value of the Tasmanian devil peak to CEN as shown in Figure S1. The genome size of the Tasmanian devil was calculated using the formula Genome size(Mbp) = CEN genome size(pg)×(DI: TD mean peak / CEN mean peak)×(978 Mbp / pg), Genome size ( Mbp ) = CEN genome size ( pg ) × ( DI: TD mean peak / CEN mean peak ) × ( 978 Mbp / pg ) , with CEN genome size = 1.25 pg. The average genome size of Tasmanian devil was computed and is shown in Table S1. Human cells were used as a validation of the DNA index analysis (Figure S1). The genome size for human was calculated using the formulas above and compared to the value obtained from current sequence estimation. A comparison of the Tasmanian devil and human genome sizes is shown in Figure S1. In Silico Chromosome Assignment To determine the chromosomal locations of our assembled supercontigs, each of the seven Tasmanian devil chromosomes was individually sorted from the female devil fibroblast cell line using a flow cytometer. 50,000 copies of each devil chromosome were collected, amplified, and sequenced. The chromosomes were ∼90% pure, based on previous experience; however, it is possible that there was some degree of contamination between chromosomes with similar positions on the flow karyotype (e.g., chromosomes 2 and 3). Chromosomes were amplified using the Genomiphi Whole Genome Amplification Kit (GE Healthcare). Individual libraries were prepared for the flow sorted chromosomes with average insert sizes of ∼300 bp. Construction of paired end libraries was essentially as described previously (Quail et al., 2008). DNA fragmentation was targeted to 500-700bp model peak (AFA; Covaris settings: duty cycle 20%, intensity 5, cycle burst 200 for 30 s). The NEBNext DNA Sample Prep Reagent Set 1 was used with incubation times for end repair and A-tail stages of one hour and a two hour ligation step, each step was followed by column purification using QIAquick spin columns (QIAGEN). A single agarose gel size selection step was performed (selecting DNA from 500-800bp) prior to PCR amplification of 5-8 cycles depending on input DNA (estimated using the Agilant 2100 Bioanalyser). The PCR product was cleaned using 0.6 vols SPRI beads (Agencourt) and the concentration of sequencable products estimated by qPCR (using the KAPA Library Quantification Kit). Each chromosome library was sequenced on two lanes of an Illumina Genome Analyzer IIx instrument with 100 bp paired end reads with read coverage from 22-95x depending on the chromosome size. We first aligned all of the flow-sorted chromosome reads with the assembled contigs. We calculated the total number of mapped reads and the numbers of mapped reads from each chromosome library for each contig. Since the sizes of chromosome are different, we introduced an effective number of mapped reads for each contig: (S3) N i , e = ( C i , s N i , c ) N i ; ( i = 1 , 2 , 3 , 4 , 5 , 6 , 7 ) . Here Ci,s is the size of chromosome i; Ni,c is the number of raw flow-sorted chromosome sequence reads aligning to chromosome i; Ni is the number of mapped reads from flow sorted chromosome sequence library i. For a given contig, we first calculated the maximum and the second maximum values of the effective mapping numbers: Nm,e and Nm-1,e . If the ratio of Nm-1,e / Nm,e was less than 0.4, this contig was assigned to the chromosome from which the effective number of mapped reads had the maximum value. If the ratio was larger than 0.4, we sorted the mapping coordinates along the contig and divided the mapped reads into blocks of 20 reads. For these small units, we examined blocks one by one for a transition from one chromosome library to another; such a transition is strong evidence for misassembly. This method allowed us to detect and correct 2,827 misassembled contigs. Once we had an assembly in which the majority of contigs had been assigned to chromosomes and potential assembly errors had been corrected, we produced supercontigs using mate pair sequence reads and synteny with opossum. In these two steps, small unassigned contigs could be assigned to a chromosome together with other assigned contigs within one supercontig. Using this method we were able to assign 35,534 supercontigs (99%) to individual chromosomes. Supercontigs that could be assigned to a chromosome but could not be aligned with the opossum assembly were placed at the end of each chromosome assembly. Supercontigs that could not be assigned to a chromosome were assigned to “ChrU.” Transcriptome Sequencing Transcriptome sequencing was performed on cDNA libraries constructed from pooled RNA from 12 devil tissues (heart, liver, skin, spleen, testis, brain, kidney, lung, bone marrow, pancreas, adrenal gland and salivary gland). Three transcriptome libraries were prepared (total RNA library, mRNA library and mRNA-UDG library). Brief methods used for the preparation of each library, and details of the sequencing and analysis are provided below. Library Preparation Ligation TotalRNA-Seq Library The totalRNA library was constructed Illumina's modified “directional mRNA-Seq Sample Preparation” protocol. Briefly, 100 ng of total RNA was fragmented with divalent cations under elevated temperature. The ends of the fragmented RNA were modified with polynucleotide kinase for a 5′ mono phosphate group and a 3′ hydroxyl group. A preadenylated oligo and a RNA oligo were ligated sequentially to the 3′end and 5′end of the RNA respectively. Adaptor ligated RNA was reverse transcribed and amplified with 15 cycles of PCR. DSN rRNA depletion is carried out following Illumina's “DSN normalization sample prep application note.” Briefly, 100 ng of amplified PCR products were hybridized in 1x hybridization buffer (50 mM HEPES, 0.5 M NaCl) at 68°C for 5 hr. 2U of DSN enzyme was used at 68c for 25min to digest double stranded DNA, and the remaining single stranded molecules are amplified with 12 cycles of PCR. mRNA-Seq Library The mRNA-seq library was constructed with the Illumina truseq RNA sample preparation protocol. Briefly, poly A+ RNA was purified from 100 ng of totalRNA with oligo-dT beads. Purified RNA was fragmented with divalent cations under elevated temperature. cDNA was synthesized with random hexamers. Double stranded cDNA was end repaired, an A base was added, and the product was ligated to Illumina PE adaptors. Adaptor ligated cDNA was amplified with 15 cycles of PCR. DSN cDNA normalization was carried out according to modified Illumina “DSN normalization sample prep application note.” mRNA-Seq-UDG Library The mRNA-seq library was constructed with a modified Illumina truseq RNA sample prep protocol. Briefly, poly A+ RNA was purified from 100 ng of totalRNA with oligo-dT beads. Purified RNA was fragmentated with divalent cations under elevated temperature. cDNA was synthesized with random hexamers. dUTP was incorporated during second strand cDNA synthesis. Double stranded cDNA was end repaired, an A base was added, and ligated to Illumina PE adaptors. DSN cDNA normalization was carried out according to the modified Illumina “DSN normalization sample prep application note.” Sequencing and Analysis We sequenced 863,429,107 reads in total on the Illumina platform (675 × 106 pairs with 100/35bp read length, 188 × 106 unpaired reads with 150 bp read length). About 40% of the reads were sequenced from a stranded RNA protocol. To test the gene coverage of our draft genome, we performed an independent de novo assembly of the paired RNA-seq reads using T-IDBA (Peng et al., 2011). We assembled 470,729 transcripts in 239,560 Mb. The RNaseq de-novo assembly had an N50 of 687 bp (94,029 contigs) and an N80 of 314 bp (250,268 contigs). The largest transcript had a size of 83,452 bp. We aligned them to the genome using BLAST (Altschul et al., 1990). 95.3% of the transcripts could be aligned with 0.05% mismatches and 0.01% gaps to the assembly which indicates that our genome assembly has coverage for the majority of coding sequences in the devil genome. Gene Annotation Tasmanian devil gene annotation was performed using the Ensembl Genebuild pipeline (Curwen et al., 2004; Potter et al., 2004), summarized below. 1. Raw Compute Stage: Searching for Sequence Patterns, Aligning Proteins and cDNAs to the Genome The annotation process of the Tasmanian devil assembly began with the “raw compute” stage whereby the genomic sequence was screened for sequence patterns including repeats using RepeatMasker (version 3.2.8) (Smit et al., 1996–2010), Dust (Morgulis et al., 2006) and TRF (Benson, 1999). RepeatMasker and Dust combined masked 49.0% of the devil genome. Transcription start sites were predicted using Eponine-scan (Davuluri et al., 2001; Down and Hubbard, 2002). CpG islands and tRNAs were also predicted (Lowe and Eddy, 1997). Genscan (Burge and Karlin, 1997) was run across repeat masked sequence and the results were used as input for UniProt (Goujon et al., 2010), UniGene (Sayers et al., 2010), and vertebrate RNA alignments by WU-BLAST (Altschul et al., 1990). (Passing only Genscan results to BLAST is an effective way of reducing the search space and therefore the computational resources required). This resulted in 269,535 UniProt, 309,553 UniGene and 284,670 Vertebrate RNA sequences aligning to the genome. 2. Targeted Stage: Generating Coding Models from Devil Evidence Devil protein sequences were downloaded from public databases (UniProt SwissProt/TrEMBL and GenBank) and filtered to remove sequences based on predictions. The devil sequences were mapped to the genome using Pmatch (Slater and Birney, 2005). Models of the coding sequence (CDS) were produced from the proteins using Genewise (Birney et al., 2004). Two sets of models were produced, one with only consensus splice sites and one where non-consensus splices were allowed; where a single protein sequence had generated two different coding models at the same locus, the BestTargeted module was used to select the coding model that most closely matched the source protein to take through to the next stage of the gene annotation process. The generation of transcript models using devil-specific data is referred to as the “Targeted stage.” This stage resulted in 215 of 281 devil proteins used to build 215 coding models. 3. cDNA and EST Alignment Devil cDNAs were downloaded from GenBank, clipped to remove polyA tails, and aligned to the genome using Exonerate (Birney et al., 2004). Of these, 21 of 27 devil cDNAs aligned with a cut-off of 90% coverage and 97% identity. 4. Similarity Stage: Generating Additional Coding Models Using Proteins from Related Species Due to the paucity of devil specific protein and cDNA evidence, the majority of the gene models were based on proteins from other species. UniProt alignments from the Raw Compute step were filtered to favor proteins classed by UniProt's Protein Existence (PE) classification level 1 and 2. Proteins from other PE levels were used where no other evidence was available; similarly, mammalian proteins were favored over non-mammalian. WU-BLAST was rerun for these sequences and the results were passed to Genewise to build coding models. The generation of transcript models using data from related species is referred to as the “Similarity stage.” This stage resulted in 109,194 coding models. 5. Filtering Coding Models Coding models from the Similarity stage were filtered using modules such as TranscriptConsensus, RNA-Seq spliced alignments supporting introns were used to help filter the set. 61,937 models were rejected as a result of filtering. 6. Addition of RNA-Seq Models The largest set of devil specific evidence was from Illumina paired end RNASeq, this was used where appropriate to help inform our gene annotation. A set of 1.6 × 109 reads was aligned to the genome using BWA (Li and Durbin, 2009). The Ensembl RNA-Seq pipeline was used to process the BWA alignments and create a further 86 × 106 split read alignments using Exonerate (Slater and Birney, 2005). The split reads and the processed BWA alignments were combined to produce 41,011 transcript models in total; one transcript per locus. The predicted open reading frames were compared to Uniprot Protein Existence (PE) classification level 1 and 2 proteins using WU-BLAST, models with no BLAST alignment or poorly scoring BLAST alignments were discarded. The resulting models were added into the gene set where they produced a novel model or splice variant, in total 5,663 models were added. 7. Pseudogenes, Protein Annotation, Non-coding Genes, Cross Referencing, Stable Identifiers The gene set was screened for potential pseudogenes. Before public release the transcripts and translations were given cross references to external databases, while translations were searched for domains/signatures of interest and labeled where appropriate. Stable Identifiers were assigned to each gene, transcript, exon and translation. (When annotating a species for the first time, these identifiers are auto-generated. In all subsequent annotations the stable identifiers are propagated based on comparison of the new gene set to the previous gene set.) Small structured non-coding genes were added using annotations taken from RFAM (Griffiths-Jones et al., 2003) and miRBase (Griffiths-Jones et al., 2006). In addition, devil miRNAs were identified and annotated as described (Murchison et al., 2008) by aligning small RNA reads that were sequenced in a previous study (Murchison et al., 2010). The final gene set consists of 18,775 protein coding genes containing 22,391 transcripts, 178 pseudogenes, 1,446 ncRNAs including 363 miRNAs. 8. Cancer and Immune Gene Annotation We specifically searched for and annotated sets of genes with known causative roles in cancer (Futreal et al., 2004) or with known involvement in antigen processing and display in the immune system (described by Gene Ontology IDs GO:0019885 “antigen processing and presentation of endogenous peptide antigen via MHC class I” and GO:0019882 “antigen processing and presentation”). 380 of 451 (84%) cancer genes and 14 of 25 (56%) immune genes were identified with the Ensembl pipeline. An additional 18 cancer genes and 1 immune gene were manually annotated by searching for syntenic regions using the Ensembl “Multi-species view” tool and Exonerate (Slater and Birney, 2005). In order to determine our power for calling variants in the 413 cancer and immune genes, we measured average sequencing depth in the two DFTD genomes and two normal devil genomes across each exon for each gene. We identified 7 exons with average sequence coverage <10 in both of the DFTD genomes, listed below. It is possible that we missed variants in these exons due to lack of sequencing coverage. ERCC2: transcript ID, ENSSHAT00000002476; exon 13 WAS: transcript ID, ENSSHAT00000003305; exon 12 MEN1: transcript ID, ENSSHAT00000007485; exon 5 EP300: transcript ID, ENSSHAT00000007760; exon 23 AP3D1: transcript ID, ENSSHAT00000011155; exon 2 CHEK2: transcript ID, ENSSHAT00000011864; exon 3 CDH1: transcript ID, ENSSHAT00000013098; exon 3 MYD88: transcript ID, ENSSHAT00000014458; exon 2 It is also possible that some genes were incompletely annotated due to assembly errors or gaps in the assembly. Chromosome Painting Tasmanian devil or DFTD chromosomes were analyzed and flow sorted on a flow cytometer (Mo-Flo®, Beckman Coulter) as described previously (Ng et al., 2007; Rabbitts et al., 1995). In brief, 5,000 copies of selected chromosomes were flow sorted separately into sterile 500 μl Eppendorf tubes. The flow-sorted chromosomes were amplified first using GenomePlex® complete whole genome amplification (WGA) kit (WGA2, Sigma-Aldrich) following the protocol provided by the manufacturer. To make chromosome-specific paint probes, biotin-16-(Roche), Cy5-, Cy3-(Enzo), Green- (Abbotts) and Texas red-12- (Invitrogen) dUTPs were incorporated into the WGA2 products via a round of reamplification using a modified protocol adapted specially for labeling probes. Briefly, for 25 μl of probe labeling reaction, the labeling mix was created by adding 18.2 μl of dH2O, 2.5 μl of 10 × amplification master mix (A5606, Sigma-Aldrich) that came with the WGA3 kit, 2.5 μl of home-made 10 × dNTP mixture containing 2mM each of dATP, dCTP and dGTP, 2mM (dTTP and dUTP), 0.5 μl of 50 mM MgSO4, 0.3 μl of BioTaq polymerase (5 U/μl, Bioline), and 1 μl of WGA2 product. The concentrations of dTTP and labeled-dUTP were adjusted according to the labeled-dUTP. For biotin-16-dUTP and Cyanine 3- and Cyanine 5-dUTP, the concentrations of dTTP and dUTP were 1.4 mM dTTP, 0.6 mM dUTP, respectively; for Green-dUTP (Abbotts Molecular), the concentrations of dTTP and dUTP were adjusted to 1.6mM dTTP, 0.4 mM dUTP; for Texas red-dUTP (Abbotts Molecular), 1.8 mM dTTP and 20 mM dUTP. The thermo cycling followed the program suggested by the manufacturer except that the number of thermo cycles was increased from 14 cycles to 18 cycles. Karyotype analysis was performed according to standard protocols. Metaphase chromosomes were prepared from one fibroblast cell line and four tumor cell lines following the procedure described previously (Yang et al., 1995). The karyotypes were analyzed by a combination of DAPI (4',6-diamidino-2-phenylindole) banding and mutlicolor fluorescence in situ hybridization (FISH) with painting probes as described previously (Yang et al., 1995; Yang and Graphodatsky, 2009). Biotin-labeled probes were detected using Cy5.5- conjugated anti-biotin made in goat (Rockland). Images were captured using a monochrome digital camera (ORCA-EA, Hamamatsu) mounted on an epi-fluorescence microscope (Imager D1, Carl-Zeiss) equipped with narrow bandpass filters specific for Cy5.5, Cy5, Cy3.5, FITC (fluorescein isothiocyanate) and DAPI fluorescence as well as a 200W metal halide light source (Lumen 200, Prior Scientific). The FISH images were processed using SmartCapture® and SmartType softwares (Digital Scientific UK).
                Bookmark

                Author and article information

                Contributors
                Role: Formal analysisRole: InvestigationRole: SoftwareRole: VisualizationRole: Writing – original draftRole: Writing – review & editing
                Role: Formal analysisRole: VisualizationRole: Writing – original draftRole: Writing – review & editing
                Role: MethodologyRole: SoftwareRole: SupervisionRole: Writing – review & editing
                Role: Editor
                Journal
                PLoS One
                PLoS ONE
                plos
                plosone
                PLoS ONE
                Public Library of Science (San Francisco, CA USA )
                1932-6203
                13 August 2018
                2018
                : 13
                : 8
                : e0201838
                Affiliations
                [001]GABI, INRA, AgroParisTech, Université Paris-Saclay, Jouy-en-Josas, France
                University of Sydney, AUSTRALIA
                Author notes

                Competing Interests: The authors have declared that no competing interests exist.

                Author information
                http://orcid.org/0000-0001-8564-1429
                Article
                PONE-D-18-06286
                10.1371/journal.pone.0201838
                6089428
                30102725
                3f1026a5-7c04-45e2-93bc-06ac330e4fbb
                © 2018 Hubert et al

                This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

                History
                : 27 February 2018
                : 22 July 2018
                Page count
                Figures: 2, Tables: 1, Pages: 15
                Funding
                JNH was supported by the French Ministry of Higher Education and Research. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
                Categories
                Research Article
                Biology and Life Sciences
                Organisms
                Eukaryota
                Animals
                Vertebrates
                Amniotes
                Mammals
                Marsupials
                Tasmanian Devils
                Biology and Life Sciences
                Genetics
                Molecular Genetics
                Biology and Life Sciences
                Molecular Biology
                Molecular Genetics
                Biology and Life Sciences
                Evolutionary Biology
                Evolutionary Genetics
                Biology and Life Sciences
                Anatomy
                Nervous System
                Central Nervous System
                Medicine and Health Sciences
                Anatomy
                Nervous System
                Central Nervous System
                Medicine and Health Sciences
                Oncology
                Metastasis
                Medicine and Health Sciences
                Oncology
                Basic Cancer Research
                Metastasis
                Biology and Life Sciences
                Genetics
                Genomics
                Biology and Life Sciences
                Population Biology
                Population Metrics
                Population Density
                Biology and Life Sciences
                Genetics
                Gene Types
                Regulator Genes
                Custom metadata
                Data are from the Epstein et al. (2016) study and can be accessed on Dryad ( https://doi.org/10.5061/dryad.r60sv).

                Uncategorized
                Uncategorized

                Comments

                Comment on this article