Introduction
How much do different classes of sequence polymorphisms contribute to human phenotypic variation and disease susceptibility? Traditionally, because they are abundant and easily detectable, single nucleotide polymorphisms (SNPs) have been expected to contribute most. Larger-scale polymorphisms, such as duplications, deletions, translocations, and inversions, are less frequent and thus might be thought to have a lesser effect [1]. However, as techniques have improved for detecting polymorphisms at larger scales, evidence has accumulated that these occur far more frequently than hitherto suspected. Some disease-associated genomic rearrangements, for example, are known to arise at least an order of magnitude more frequently than point mutations in human autosomal dominant traits [1]. Moreover, several hundred regions that are variable in copy number have been identified in both human populations [2–5] and mouse strains [6]. Although whether these large-scale copy-number variants (CNVs) are associated with disease is as yet unknown, their abundance and size imply that they may yet be found to underlie functional variation. Nonetheless, relatively few of the human CNVs detected thus far in independent studies overlap [7], indicating that, although numerous, individual CNVs may occur with low minor allele frequencies in the human population.
Sequence variations are usually not uniformly distributed within genomes. In yeast, SNPs are more frequent towards telomeric chromosomal ends [8], as are segmental duplications [9,10], but not apparently CNVs in human DNA [5]. SNPs also occur more frequently within a sequence that is high in G + C content, that has experienced elevated nucleotide substitution rates, and/or that has been subject to reduced selective constraints [11,12]. Consequently, it appears that SNPs have both arisen by mutation and been purified by natural selection, nonuniformly in the human genome.
The assembled human genome sequence is a composite since it is derived from the DNA of many individuals. For any region there is no guarantee that it presents the major allele found in a human population. Indeed, there are three reasons to suppose that rare large-scale sequence variations such as CNVs are not only present, but are overrepresented, in this reference sequence. First, contributing genomes that have been sequenced across boundaries between adjoining paralogous CNV sequences will be favoured for incorporation in the assembly. Second, clone selection for sequencing was biased towards larger insert clones because of the desirability of constructing a minimal tiling set [13]. As a result, clones containing high copy-number regions would be preferred for sequencing over those containing low copy-number regions. Third, because human CNVs, genome assembly gaps, and segmental duplications frequently coincide [2,3,4,5,14], it is plausible that minor allele sequences might be confounding sequence assembly of these regions. We thus predict that an as-yet-unknown proportion of the 5% of the human genome that is highly sequence similar [3,14–16] represents minor allele frequency CNV sequence. It remains to be determined how this 5% partitions between duplications that have been fixed, and thus are present throughout the human population, and others that are polymorphic and are not fixed. The presence of large-scale minor allelic variants in the reference human genome sequence complicates both CNV experimental design and CNV data interpretation. For example, virtually identical paralogous human sequences are substantially underrepresented in oligonucleotide arrays, thus diminishing the distinction of their copy-number variations in experiments. Furthermore, hybridisation absences may be interpreted as genomic deletions, whereas instead they arise from assaying for minor allelic variants in the reference sequence. 
Some CNVs may have been maintained in a subset of the human population due to selective advantage [17], particularly those present at relatively high minor allele frequency. For example, unusually high copy numbers of the CCL3L1 and CYP2D6 genes are associated with decreased susceptibility to HIV/AIDS [18] and increased drug metabolism [19], respectively. However, their frequencies suggest that most CNVs have been subject to purifying selection [3]. The fate of CNVs—either fixation or else loss by purifying selection or drift—has been considered theoretically for many decades [17]. Wright's physiological theory [20] predicts that haploinsufficient genes (i.e., those whose loss-of-function alleles strongly affect the phenotype of heterozygotes) experience enhanced fixation of duplicates resulting from selection for increased dosage. Such genes preferentially encode proteins with signalling roles or with binding, regulatory, and structural functions [21,22]. Selective advantage of duplicates due to gene dosage appears to have occurred, for example, for CCL3L1 [18] and CYP2D6 [19]. The neutral theory of molecular evolution [23] predicts that a duplicated gene is more rapidly lost by random genetic drift when it arises within larger populations [24,25]. In very large populations virtually all duplications that are rapidly fixed are thus strongly adaptive. By contrast, very small populations are more heterozygous with larger proportions of neutral, slightly advantageous, or disadvantageous duplicates persisting [24]. We were interested in investigating whether CNVs occur preferentially within particular sites and types of human sequence and whether neutral, purifying, or diversifying selection has acted upon them. Our null hypothesis is that CNVs arise uniformly in a genome and are selectively neutral. In this model we expect CNVs not to be enriched in protein-coding genes or other evolutionary, structural, and functional characteristics. 
To test the model, we surveyed 13 different properties relating to CNVs and CNV genes of human and mouse, and compared these to their genome-wide distributions. Our study relies on recent surveys of CNVs, in particular those of Sebat et al. [3], Iafrate et al. [2], Tuzun et al. [4], and Sharp et al. [5]. We assume that these CNVs have been sampled uniformly from those present in the human population. We tested whether CNVs occur more frequently, like synonymous substitutions [26], close to telomeres or to pericentromeres, and whether they contain unusually high densities of genes, repeats, or G + C base content. We also examined the relative evolutionary rates of CNV genes and their functions. We find that CNVs occur more frequently towards telomeres and centromeres, are enriched in protein-coding genes and simple tandem repeats, but are not elevated in G + C content. Human CNV genes have experienced elevated synonymous and nonsynonymous nucleotide substitution rates, have a deficit of Mendelian disease genes, and have a surfeit of genes encoding secreted and immunity proteins. Mouse CNVs, on the other hand, possess significantly fewer of the genes that are overrepresented in human CNVs, although they demonstrate the same significant elevation in synonymous nucleotide substitution rates seen for human CNVs. These results indicate that natural selection has acted nonrandomly upon CNVs. We suggest that the different characteristics of human and mouse CNVs we observe may be consequences of these species' contrasting effective population sizes.
Results
CNV Properties Relative to Those for the Human Genome
Known human CNVs are neither significantly overpopulated nor underpopulated in densities of RNA genes, interspersed repeats (either considered together, or short or long interspersed nuclear elements considered separately), CpG islands, or G + C content relative to the whole genome (p > 0.05).
The apparent lack of bias of interspersed repeats and G + C content within CNVs, relative to the remainder of the genome, argues that our conclusions (below) should not be adversely affected by sequence-dependent variations in hybridisation signals [27]. Tissue-specific genes (see Materials and Methods) are also not significantly (p > 0.05) over- or underrepresented in CNVs, and no single tissue possessed unusually high or low numbers of CNV genes expressed in that tissue. By way of contrast, several properties of CNVs are significantly different (p 0.05) (Table 1). They also exhibit no significant overrepresentation close to telomeres, although this may reflect reduced coverage of BACs in these regions. Nevertheless, the genes encoded in mouse CNVs, and their associated functions, are strikingly different from those in human CNVs. In only three instances did human and mouse 1:1 orthologues overlap known CNV regions from both species. This finding is unexpected, since the probability of finding this number of 1:1 orthologues, or fewer, in both human and mouse CNVs is 4 × 10⁻³. (This probability was calculated with the hypergeometric distribution from the observations that among approximately 13,000 human:mouse 1:1 orthologues, 418 overlap human CNVs, and 340 overlap mouse CNVs.) As described above, human CNV genes are enriched in paralogous clusters of the reference genome assembly, they possess elevated KA/KS values, and they encode signal peptide-containing secreted proteins. However, exactly the opposite is true for mouse CNV genes: they are typically not overrepresented in paralogous clusters, they possess significantly decreased KA/KS values, and they are significantly enriched in proteins that lack signal peptides (Table 1). Moreover, in contrast to human CNVs, for which olfactory receptor genes are overrepresented, in mouse CNVs we find these genes to be underrepresented (Table 5).
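The hypergeometric probability quoted above for the shared 1:1 orthologues can be approximated with a short calculation. The following is only a sketch under stated assumptions: the orthologue total is taken as exactly 13,000 (the text says "approximately"), the function names are hypothetical, and it is not the authors' own script. It asks how likely it is to see three or fewer orthologues in both CNV sets if the 340 mouse CNV orthologues were drawn at random from a pool in which 418 overlap human CNVs.

```python
from math import comb


def hypergeom_pmf(k, population, successes, draws):
    """P(X = k) when `draws` items are taken without replacement from a
    population containing `successes` marked items."""
    return (
        comb(successes, k)
        * comb(population - successes, draws - k)
        / comb(population, draws)
    )


def hypergeom_cdf(k_max, population, successes, draws):
    """P(X <= k_max): the lower tail of the hypergeometric distribution."""
    return sum(
        hypergeom_pmf(k, population, successes, draws)
        for k in range(k_max + 1)
    )


# Assumed inputs, taken from the text (13,000 is approximate there).
TOTAL_ORTHOLOGUES = 13_000   # human:mouse 1:1 orthologues
IN_HUMAN_CNVS = 418          # orthologues overlapping human CNVs
IN_MOUSE_CNVS = 340          # orthologues overlapping mouse CNVs
IN_BOTH = 3                  # orthologues overlapping CNVs in both species

# Overlap expected by chance, and probability of an overlap this small.
expected_overlap = IN_HUMAN_CNVS * IN_MOUSE_CNVS / TOTAL_ORTHOLOGUES
p_lower_tail = hypergeom_cdf(
    IN_BOTH, TOTAL_ORTHOLOGUES, IN_HUMAN_CNVS, IN_MOUSE_CNVS
)

print(f"expected overlap by chance: {expected_overlap:.1f}")
print(f"P(overlap <= {IN_BOTH}) = {p_lower_tail:.2e}")
```

Under these assumptions roughly eleven shared orthologues would be expected by chance, and the lower-tail probability comes out on the order of 10⁻³, in line with the value reported in the text. The exact figure depends on the true orthologue count.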
Only carbohydrate-binding genes are significantly (p 0.05 were considered to indicate that the CNV data were not significantly different from the genome data taken as a whole. The likelihood that a GO annotation is over- or underrepresented among CNV genes was estimated using the hypergeometric distribution [60]. The probability that two sets of KA, KS, or KA/KS values are sampling an equivalent distribution was calculated using the two-sided Kolmogorov-Smirnov test [61]. The likelihood that CNVs are overrepresented in regions close to telomeres or centromeres was estimated by fitting to a Gaussian distribution (using Origin 7.5 software from OriginLab, Northampton, Massachusetts, United States).
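The two-sided Kolmogorov-Smirnov comparison described above can be illustrated as follows. This is a minimal sketch, not the procedure of reference [61]: the function names are hypothetical, the p-value uses a common large-sample (asymptotic) approximation with a small-sample correction factor, and the example values are invented placeholders rather than data from this study. In practice a library routine such as `scipy.stats.ks_2samp` would normally be used.

```python
from bisect import bisect_right
from math import exp, sqrt


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic D: the largest absolute gap between the
    empirical cumulative distribution functions of the two samples."""
    a_sorted, b_sorted = sorted(sample_a), sorted(sample_b)
    n, m = len(a_sorted), len(b_sorted)
    d = 0.0
    for x in a_sorted + b_sorted:
        # Fraction of each sample that is <= x.
        cdf_a = bisect_right(a_sorted, x) / n
        cdf_b = bisect_right(b_sorted, x) / m
        d = max(d, abs(cdf_a - cdf_b))
    return d


def ks_pvalue_asymptotic(d, n, m, terms=100):
    """Approximate two-sided p-value from the Kolmogorov limiting
    distribution (an assumption; accurate only for reasonably large n, m)."""
    effective_n = sqrt(n * m / (n + m))
    lam = (effective_n + 0.12 + 0.11 / effective_n) * d
    if lam < 0.2:
        # The alternating series converges poorly here; p is effectively 1.
        return 1.0
    total = sum(
        (-1) ** (j - 1) * exp(-2.0 * j * j * lam * lam)
        for j in range(1, terms + 1)
    )
    return min(1.0, max(0.0, 2.0 * total))


# Placeholder KA/KS-like values, for illustration only.
cnv_genes = [0.31, 0.42, 0.18, 0.55, 0.47, 0.29, 0.38, 0.61]
all_genes = [0.12, 0.20, 0.09, 0.25, 0.15, 0.33, 0.11, 0.19, 0.22, 0.14]

d_stat = ks_statistic(cnv_genes, all_genes)
p_value = ks_pvalue_asymptotic(d_stat, len(cnv_genes), len(all_genes))
print(f"D = {d_stat:.3f}, approximate p = {p_value:.3g}")
```

A small p-value here would suggest that the two sets of values are unlikely to be samples from the same distribution, which is the sense in which the test is applied to KA, KS, and KA/KS values in the text.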