Introduction Genome comparisons have revealed significant variation in gene family size, both within and between species, e.g. [1]–[7]. This variation can result from either the gain or loss of genes, each of which in turn may be favored by selection. Variation in the number of genes may have important consequences for understanding differences between species, especially for key morphological, physiological, and behavioral traits, e.g. [8], [9], [10]. The observed variation in gene numbers may represent genetic diversity resulting from the evolution of gene families [11], but may also have been incorrectly inferred from sequencing and assembly artifacts. In order to assess the genomic content of a particular species, current methods rely on published genome assemblies. Unfortunately, a major problem in genomics is assembly quality, especially given that it is very difficult to determine the accuracy of de novo assemblies [12], [13] and the fact that different assembly algorithms may give very different results [14]. Both computational and experimental methods have been applied to improve upon an assembly: computational approaches include innovations in the assembly algorithms themselves, e.g. [15], as well as methods developed to compare, validate, and gauge the quality of a particular assembly, e.g. [16]–[19]. Experimental approaches have been aimed at improving the connectivity of contigs and scaffolds e.g. [20], assigning and ordering scaffolds on chromosomes, e.g. [21], [22], and validating and refining the annotated genes using RNA data, e.g. [23], [24], [25]. Often computational and experimental methods are used in conjunction to improve an assembly, as further experimental evidence will be integrated or reassembled with the original draft assembly, e.g. [26]. Improvements in sequencing technology do not necessarily mean that assemblies as a whole have improved; indeed, shorter reads have increased the computational complexity of the assembly problem, e.g. [27], [28] and have resulted in more fragmented assemblies (i.e. there are a larger number of contigs). A number of factors confound accurate assembly, including the presence of transposable elements and other repetitive sequences [29], and the allelic variation present when heterozygous individuals are sequenced, e.g. [30]. Despite these obvious problems the number of assemblies produced is increasing, and thousands of genome sequencing projects are planned or in progress [31]. In many cases, gene annotation from the closest annotated relative will be transferred to these new genomes, and will further propagate the annotation problems to many new genome sequences. Low-quality assemblies result in low-quality annotations [18], [27], and these annotation errors cause both the over- and under-estimation of gene numbers, e.g. [32], [33]. One cause of the over-estimation of gene numbers is the splitting of allelic variation (i.e. haplotypes present in heterozygous individuals) into separate loci (Fig. 1A); we refer to such cases as “split” genes. Split genes appear as highly similar duplicated loci within genome assemblies, and are often placed in tandem to one another or with one copy on a small scaffold by itself, e.g. [34], [35]. A second cause of the over-estimation of gene numbers is the fragmentation of a single gene onto multiple contigs or scaffolds (Fig. 1B); we refer to such cases as “cleaved” genes. Because ab initio gene predictors less likely to accurately infer gene models across sequence gaps, genes fragmented onto multiple contigs or scaffolds will be predicted as multiple separate genes, e.g. [30]. Note that gene models may also be cleaved simply because ab initio predictors have failed to join distant exons together in a single transcript, e.g. [36], [37], though this type of error may be independent of the underlying assembly quality. A common cause of the under-estimation of gene number is the collapse of truly paralogous gene copies into a single locus (Fig. 1C). This occurs because newly formed duplicates are highly similar in sequence, and therefore hard to assemble as separate loci, e.g. [30], [38]. A second cause of under-estimation is simply that genes may not be represented in low-coverage genomes due to a large number of gaps (Fig. 1D). In such cases both total gene numbers and the size of individual gene families may be severely underestimated, e.g. [39]. 10.1371/journal.pcbi.1003998.g001 Figure 1 Examples of missassembly leading to misannotation. Each row shows the true state of the genome on the left (“Expected assembly”) and a common misassembly error on the right (“Observed misassembly”). A) A single gene may be assembled as two apparently paralogous loci, increasing the predicted gene count. B) A singe gene may be fragmented into multiple pieces, each on different contigs or scaffolds. This cleavage can increase the number of predicted genes. C) Two paralogous genes may be collapsed into a single gene, decreasing the predicted gene count. D) A gene may be partially or entirely missing from the assembly, decreasing the number of predicted genes. Many genome assemblies and annotations have improved over time due to further efforts aimed at both increasing sequence contiguity and adding functional data (e.g. RNA-seq) in order to correct gene models. Individual researchers may also contribute to the deconvolution of specific assembly errors, e.g. [27], [40] or to the improvement of specific gene models, e.g. [41], [42]. However, it is often the case that a great deal of research will be based upon the draft assembly before it has reached a finished state, and erroneous conclusions may result, e.g. [40]. As an extreme example, the initial draft human genome contained 223 bacterial genes thought to have been gained by horizontal gene transfer [43]. Closer analysis of this result suggested that many of these cases were simply bacterial contaminants incorrectly assembled into the human genome [44]. As a less extreme example, the initial human genome predicted between 30–40,000 protein-coding genes [43], [45]. As the draft assembly was updated and the gene annotation process was improved, the estimated number of genes in human has continued to fall, and is 20,805 as of February 2014 according to Ensembl [46]. This pattern repeats itself for nearly every draft genome, but is especially true of vertebrate genomes because of their size and complexity [28], [40]. The cascading effects of these errors may affect many downstream conclusions, from inferences about the evolutionary histories of genes to the ability to map genes involved in disease. Although many consequences of low-quality assemblies have been described, e.g. [27], [28], [47]–[49], few analyses have specifically examined the effect on gene copy-number but see [32], [33]. Because many new, next-generation sequencing technologies are being used to construct genome sequences, we would also like to know the error-characteristics inherent to different platforms. Here we examine gene numbers in multiple genome assemblies, using multiple sequencing technologies, and from multiple species. Our results suggest that low-quality assemblies can result in huge numbers of both added and missing genes, and that most of the additional genes are due to genome fragmentation (“cleaved” gene models). Based on these results we present simulation analyses that suggest that published genomes with surprisingly high numbers of genes may be in error, and further show how these problems can be corrected. Results/Discussion Errors in de novo assemblies of the chicken genome To determine how total gene numbers are affected by genome assembly quality we compared predicted gene models in multiple versions of the chicken genome. We examined five different assemblies that were based on different sequencing technologies and sequencing depths. These assemblies vary in size and average coverage (Table 1; for more details on these assemblies, see [28]). The 2X fosmid-based assembly (average read length ∼950 bp) may be considered the least complete assembly, as it is the most fragmented, smallest in size, and has the least coverage of the five assemblies considered. The 13X 454-based assembly of the chicken genome was built with 454 single-end reads (average length ∼330 bp), 3 kb mate-pair inserts, and 20 kb mate-pair inserts using the Newbler assembler. The 82X Illumina-based assembly was built with high coverage of paired-end short-insert reads (average length 100 bp) and integrated with inserts of 2 kb in length using the SOAP assembler. The draft chicken reference genome (v2.1) was a 6X Sanger-based assembly that was improved with fosmid and BAC-end sequencing and reassembled with the PCAP assembler (it is also referred to as Galgal3 in some repositories). The final assembly used as a reference, the current chicken reference (v4.0; also referred to as Galgal4 in some repositories), was a further improvement to version 2.1. This hybrid assembly, which was already covered to 6X with Sanger reads, improved to 6.6X with BAC and fosmids, was again reassembled using the following additional 454 sequences: 10X fragment reads, 1.7X 3 kb inserts, and 1.2X 20 kb inserts; again, the PCAP assembler was used to integrate all the data into the final reference assembly. Although it is of high quality, even this reference is considered a “draft” genome. 10.1371/journal.pcbi.1003998.t001 Table 1 Chicken genome assemblies, predicted partial and full-length GENSCAN genes, and completeness of conserved orthologs as assessed by CEGMA. Assembly Coverage Contigs Partial genes Full-length genes Completeness Fosmids 2X 281711 138354 21250 14.1% 454 12X 45554 73262 36210 68.2% Ref 2.1 6.6X 71609 86543 38199 66.5% Illumina 82X 27093 64552 33324 74.6% Ref 4.0 12X 25017 61405 35537 80.7% We predicted genes on each of these five assemblies using the ab initio prediction methods implemented in GENSCAN [50] and Fgenesh [51]. GENSCAN was used with the “eukaryotic” model specified, and Fgenesh was used with the specific model for chicken available in the package. GENSCAN (Table 1) found a greater number of genes than Fgenesh (S1 Table), which typically produced more conservative counts but also more complete gene models. Both gene predictors found tens of thousands of genes for each assembly, and we found that the assemblies with the most scaffolds also had the most predicted genes (Table 1). However, a great many of the predicted genes (often more than 50%; Table 1) were lacking either a start or stop codon, or both. We suspected that the enrichment of small scaffolds was increasing the number of incomplete predictions, and filtered very small scaffolds ( 95% similarity over 80% of their length) were classified as “allelic splits.” Generating simulated Drosophila assemblies We attempted to transform the high-quality, near-complete D. melanogaster assembly into one resembling the Daphnia pulex assembly. In order to do this, we first collected information about the Daphnia pulex assembly from wFleabase ([68], http://wfleabase.org/), specifically, the scaffold lengths as well as positions and lengths of all gaps within those scaffolds. This filtered scaffold set contained 5,191 scaffolds [68]. However, when we examined the assembled scaffolds we found that nearly 25% of bases were gaps, represented by stretches of N's in the sequence. To understand how gene prediction software would handle such gaps, we manually inserted stretches of N's into the sequence of known D. melanogaster genes, and then predicted genes on the artificially created sequence. We found a limitation in the length of a gap that the gene prediction software could span and still predict a single gene. GENSCAN, for instance, could not predict a single full-length gene across a gap of length 50 or greater. This implies that individual contigs are the fundamental unit useful for predicting genes, and that even individual large scaffolds fragmented into many contigs may result in the over-prediction of genes. We therefore chose 50 bp as a minimum cutoff length for the length of gaps, separating scaffolds into individual contigs when stretches of N's longer than fifty characters were found. Applying this cutoff to the Daphnia pulex assembly revealed 17,924 “contigs” useful for gene prediction. Drosophila melanogaster assembly release 5.44 was obtained from Flybase [69], in the form of six chromosome files. Using the distribution of contig sizes found in the Daphnia pulex assembly, we generated 10 simulated D. melanogaster assemblies with different numbers of contigs (Table 4). To do this, for any specified number, x, of contigs needed for the simulated D. melanogaster genome we took the longest x contigs from the Daphnia pulex assembly. The reference D. melanogaster genome was then fragmented into x pieces by randomly cutting contigs of the lengths drawn from the Daphnia pulex assembly, while ensuring that the entire D. melanogaster sequence was included in each simulated dataset. Because the Daphnia pulex genome is roughly 170 Mb in length (not including N's) while the D. melanogaster genome is 138 Mb, we are conservatively excluding the class of extremely small scaffolds found in Daphnia pulex from our simulated genomes. We predicted genes on each simulated assembly using GENSCAN, Fgenesh, AUGUSTUS, and MAKER. Although GENSCAN was used with a pre-specified human model, this has been shown to be sufficient for most eukaryotes e.g. [51]. Fgenesh has a specific Drosophila model, and as a consequence produced much lower gene counts. RNA-seq analysis Paired-end RNA-seq data from an experiment by the Berkeley Drosophila Genome Project [70], was obtained from the public database ENA ([71], http://www.ebi.ac.uk/ena/). These paired end reads were mapped against the simulated D. melanogaster assembly that had ∼18,000 contigs using the software BWA [72] with default parameters. Additional processing of the alignment was performed using samtools [73]. We filtered by read quality and mapping quality, and sought connecting paired-end reads where each end mapped to different scaffold. We used the positions of every exon in the predicted gene set for our simulated assembly to determine which exons were associated by the connecting paired-end reads. A set-merging algorithm was applied to chain together connected exons before the resulting gene set was analyzed. Supporting Information S1 Table Assembly statistics and gene models predicted by Fgenesh for chicken genome assemblies. (DOCX) Click here for additional data file.