
      gEVAL — a web-based browser for evaluating genome assemblies

      brief-report


          Abstract

          Motivation: For most research approaches, genome analyses are dependent on the existence of a high quality genome reference assembly. However, the local accuracy of an assembly remains difficult to assess and improve. The gEVAL browser allows the user to interrogate an assembly in any region of the genome by comparing it to different datasets and evaluating the concordance. These analyses include: a wide variety of sequence alignments, comparative analyses of multiple genome assemblies, and consistency with optical and other physical maps. gEVAL highlights allelic variations, regions of low complexity, abnormal coverage, and potential sequence and assembly errors, and offers strategies for improvement. Although gEVAL focuses primarily on sequence integrity, it can also display arbitrary annotation including from Ensembl or TrackHub sources. We provide gEVAL web sites for many human, mouse, zebrafish and chicken assemblies to support the Genome Reference Consortium, and gEVAL is also downloadable to enable its use for any organism and assembly.

          Availability and Implementation: Web Browser: http://geval.sanger.ac.uk, Plugin: http://wchow.github.io/wtsi-geval-plugin.

          Contact: kj2@sanger.ac.uk

          Supplementary information: Supplementary data are available at Bioinformatics online.


          Most cited references (9)


          Modernizing Reference Genome Assemblies

          The Rationale for the GRC The availability of a high quality human genome assembly has revolutionized biomedical research. Genomics has now entered the realm of clinical genetics, with many groups using either whole genome sequencing [1],[2] or whole exome sequencing [3] to identify variants underlying diseases and informing treatment options [4]. Advances in technology have increased the number of sequenced human genomes; however, de novo assembly of next generation sequencing reads is still problematic. The alignment of sequencing reads from these new genomes to a high quality reference genome remains a critical aspect of data interpretation [5]. While the human reference assembly is the highest quality mammalian assembly available, it is not without shortcomings. The “finished” assembly [6] contained over 300 gaps in the euchromatic portion of the genome, tiling path errors and regions represented by uncommon alleles. Furthermore, assessment of genome-wide variation revealed regions of the genome with complex, structurally diverse, allelic representations [7]–[9] that were insufficiently represented in the reference genome. Other analyses identified sequences that failed to align to the reference assembly either because the reference assembly contained a valid deletion allele or underrepresented multi-copy genes [10]–[13]. The Genome Reference Consortium (GRC) was formed to address these issues. The GRC (the GRC consists of The Genome Institute at Washington University, The Wellcome Trust Sanger Institute, The European Bioinformatics Institute, and The National Center for Biotechnology Information) is an international consortium with expertise in genome mapping, sequencing, and informatics. The goal of the GRC is to provide high quality genome assemblies that will allow a user to place any sequence greater than 500 bp into a chromosome context. 
While this report focuses largely on recent GRC advances concerning the human reference assembly, the GRC is also responsible for the mouse and zebrafish reference assemblies. Continued improvement of the human reference assembly is critical as we move towards an era of clinical and personal genomics. The reference genomes of mouse and zebrafish are similarly critical in light of their importance as model organisms and the significant investments made in creating community resources such as gene knockout collections. Assembly Management Two major problems faced the GRC at the outset of this project, the decentralized nature of the Human Genome Project and the lack of a suitable data model for representing complex genomes. Much of the data underlying curation decisions had not been captured nor standardized. The human reference assembly had never been submitted to the International Nucleotide Sequence Database Collaboration (INSDC) [14] and thus lacked stable, trackable sequence identifiers that could be accessed from any INSDC database. Initial efforts at assembling the human genome were guided by the concept of “a golden path” [15], a single clone tiling path that could be reduced to one non-redundant haploid representation of the human genome. While this model fit well with the prediction that single nucleotide variants (SNVs) would be the predominant source of variation in the population, it is now clear that structural variation is a much larger source of genomic diversity than previously recognized [16],[17]. Additionally, this model did not deal robustly with sequences that were not part of chromosome assemblies. These often represent sequences that cannot be easily ordered or oriented on the chromosome assembly due to structural complexity but frequently contain genes that may be of biological interest [18] or represent alternate haplotypes of regions in the chromosome assembly [9],[19]. 
Earlier versions of the reference genome assembly included some of these allelic variants (such as at the MHC region) but the sequences themselves often were not used because they had no relation to the chromosome sequence and could not be easily distinguished from sequences reflecting biological or artificial duplication. The GRC has addressed these problems by establishing common tools and standard operating procedures (SOPs) so that the genome assembly is now constructed in a regularized fashion. We have developed a single database to store all data underlying the genome assembly. Finally, we have developed a system to track individual regions that are under review. All of these data are made publicly available through our Web site (http://genomereference.org/). Additionally, the GRC has formalized an assembly model (Figure 1 and Box 1) that provides for improved accounting for all sequences, including those that are not part of chromosome assemblies, and facilitates genome annotation by placing additional structure on those sequences. Structurally complex regions can be represented by more than one tiling path; one of which will be integrated into the chromosome assembly while the others will be instantiated as an independent sequence that, by alignment to the chromosome, provides the chromosome context for the alternate allele. 10.1371/journal.pbio.1001091.g001 Figure 1 Assembly representation for GRCh37.p3. The top panel shows an ideogram representation of the human genome. The primary assembly unit contains sequences for the non-redundant haploid assembly; this includes the scaffolds that make up the chromosome sequence as well as unplaced and unlocalized scaffolds that are thought to represent novel sequence (not shown in this picture). Alternate loci and patches are placed in separate assembly units to facilitate annotation. 
Note the seven alternate scaffolds in the MHC region are all placed in different assembly units, as they all represent different representations of the same sequences. Other alternate loci can be added to these assembly units at the next major release if they don’t overlap the existing alternates. All patches are placed in the PATCHES assembly unit and minor releases are cumulative such that the latest minor release will contain all patches. The red triangle, yellow circles, and blue circles represent regions that contain additional sequences that are not given actual chromosome coordinates, but rather are given a chromosome context via alignment to the primary assembly. The red triangles represent regions’ alternate loci; these are sequences that provide an additional tiling path to the one given in the chromosome representation and are essential for representing structurally complex loci. The circles represent patch sequences; these are minor updates made to the assembly outside of the major build cycle. Yellow circles represent “fix” patches: regions of the chromosome assembly that will change with the next major assembly update. Blue circles represent “novel” patches: these are sequences that represent new alternate loci in the next major assembly update. Unlocalized and unplaced sequences are not represented in this figure. Sequences within the assembly are placed within containers known as assembly units. Note: a region can point to more than one type of extra chromosomal sequence; for example, a region could point to an alternate locus and to a fix or novel patch. Box 1. Assembly Definitions AGP: A file used to describe the instructions for building a contig, scaffold, or chromosome sequence. This file specifies the order, orientation, and switch points for each component. Alternate Locus: A sequence that provides an alternate representation of a locus found in a largely haploid assembly. 
These sequences don’t represent a complete chromosome sequence, although there is no hard limit on the size of the alternate locus; currently these are less than 5 Mb. Assembly: A set of sequences (chromosomes, unlocalized, unplaced, and alternate loci) used to represent an organism’s genome. Assembly Unit: Collections of sequences used to define discrete parts of an assembly. Component: The basic genomic level sequence used to construct the genome; typically these are clone sequences, Whole Genome Shotgun sequences, or PCR fragments. These sequences must be submitted to GenBank/EMBL/DDBJ. Contig: A contiguous sequence generated from determining the non-redundant path along an ordered set of component sequences. A contig should contain no gaps. Patch: A genome patch is a scaffold sequence that is part of a minor genome release. These sequences either correct errors in the assembly (a FIX patch) or add additional alternate loci (a NOVEL patch). These sequences allow us to update the assembly information without disrupting the chromosome coordinate system. FIX patches will be removed at the next major assembly release, as the changes will be rolled into the new assembly. NOVEL patches will be moved from the PATCHES assembly unit to a proper assembly unit. Primary Assembly Unit: Represents the collection of sequences that, when combined, represent a non-redundant haploid genome. Scaffold: An ordered and oriented set of contigs. A scaffold will contain gaps, but there is typically some evidence to support the contig order, orientation, and gap size estimates. TPF: Tiling Path File; this provides the order of the component sequences that are used to build a higher order sequence (contig, scaffold, or chromosome). Switch Point: The base at which the contig sequence stops being generated from one component sequence and switches to using the next component sequence. There must be at least one switch point between adjacent component sequences in a contig. 
Unlocalized sequence: A sequence found in an assembly that is associated with a specific chromosome, but that cannot be ordered or oriented on that chromosome. Unplaced sequence: A sequence found in an assembly that is not associated with any chromosome.

We have also introduced the concept of a “minor” assembly update, in the form of genome patches. This mechanism provides users with timely access to genome improvements without inducing frequent changes to the coordinate system upon which assembly annotations are based. Because genome patches take the same form as alternate loci, the two forms of data can be similarly managed. Major assembly updates will not follow a fixed schedule. In order to minimize the need for frequent re-annotation, major assembly updates will occur infrequently, only when we have produced at least 100 fix patches or affected >1% of the euchromatic sequence. The GRC will announce planned updates on its Web site at least 6 months in advance of any major assembly release. Additional, detailed information regarding major releases will be publicly announced via the Web site as data freeze dates approach. Minor assembly updates will be made quarterly.

Assembly Quality and Improvement

We have produced a major release of the human reference assembly, GRCh37, which was submitted in June of 2009 to the INSDC (GCA_000001405.1), and four minor assembly updates, with the last patch, GRCh37.p4 (GCA_000001405.5), released in April 2011. Detailed information concerning genome assembly construction is on our Web site (http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/info/index.shtml). The top part of Figure 2 shows the distribution of issue types that were resolved for these assembly releases.
Some assembly updates are relatively minor, involving the correction of a single nucleotide discrepancy in the assembly (e.g., HG-445; http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/issue_detail.cgi?id=HG-445), while others involved multiple components and required generation of new, region-specific tiling paths (e.g., HG-2; http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/issue_detail.cgi?id=HG-2; Figure 2) [20].

Figure 2 (10.1371/journal.pbio.1001091.g002). Distribution of issues addressed and an example region. (Top Panel) Issues for GRCh37, GRCh37.p1, and GRCh37.p2, broken down by type. Issue types are: Clone Problem: The issue is contained within a single clone. This may be a single nucleotide difference or a clone mis-assembly. Path Problem: There is evidence that the tiling path within a given region is incorrect and we will need to update the path. GRC Housekeeping: Changes used to help regularize the tiling path. Missing Sequence: Sequence that we can’t yet place on the assembly. Mapping studies are ongoing to help place these sequences. Variation: There is evidence to suggest that complex variation is complicating a region and an alternate allele may need to be produced. Gap: The issue concerns filling a gap. Unknown: Issue is still under investigation for classification. (Bottom Panel) Details for issue HG-2, a Path Problem. The representation in NCBI36 was a mixed haplotype. The tiling paths for NCBI36 and GRCh37 are shown. Blue clones are anchor clones that are in NCBI36, the GRCh37 chr4 path, and the GRCh37 alternate locus path. Red clones represent the UGT2B17 insertion path and dark gray clones represent the UGT2B17 deletion path. The light gray clone was not used in NCBI36, but was used in GRCh37 to complete the alternate locus.

While the model changes described above facilitated our assembly management and reporting, we also wished to investigate whether these updates would allow for improved genome analysis.
To investigate this, we first tried to recover sequence identified as novel in a personal genome, the YH1 human assembly [12]. Roughly 25% could be placed in a chromosome context using GRCh37.p2 (see supplemental table 1 and supplemental figure 1 at http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/supplement/). The remaining sequences are being investigated to determine if they warrant inclusion in a future assembly release. We also wished to investigate the impact on alignment of next generation sequencing reads. We selected two samples from the 1,000 Genomes project [21], NA12156 and NA12878 (SRA accessions ERX000125 and ERX000080, respectively), and aligned their reads to GRCh37, with and without the alternate loci. We demonstrated that removal of the alternate loci leads to misalignment of approximately two-thirds of the alternate-locus specific reads (see supplemental table 2 and supplemental figure 2 at http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/supplement/). These data clearly demonstrate that inclusion of alternate representations for genomic loci can improve alignment quality and thus avoid spurious variation calls.

Policy Implications

We envision the high quality reference assemblies generated by the GRC having a long-term role in biomedical research because they most accurately capture all forms of human genetic variation and facilitate investigation of human disease in model organisms. With this in mind, we have built a reference assembly infrastructure to support transparent curation and assembly production. We have also updated the assembly model so that it better represents our current understanding of genome structure and diversity. We will use this model to encompass new discoveries and ultimately capture all significant variations in the human population structure as discovered through projects such as the 1,000 Genomes project.
Additionally, we wish to engage the research and clinical communities to identify regions that require targeted effort and to incorporate information from groups performing detailed work on specific loci. The GRC can only be truly successful with community input. Users can report problems directly to the GRC via our Web page (http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/ReportAnIssue.shtml). It is difficult to overstate the importance of the human reference assembly, even in the age of personal genomics. Given current sequencing and assembly technology, there is a clear need for a high quality reference that can represent structural diversity across all populations. Providing a representation of this diversity is critical for next generation sequence analysis. Even using an assembly with only three regions with alternative alleles, we show improved alignment quality and by extension variation calling, which is the primary product of personal genomics. More genomic alignment tools that can take the alternate representations into account need to be developed. Understanding how genotype influences phenotype necessitates an accurate and complete picture of all loci in multiple populations. For many genomic regions, this can be denoted by a sequence with annotated SNPs and small indels, but other loci will require multiple sequence instances for complete representation. Some human loci, such as the 1q21 region, which remains misassembled in GRCh37.p2, are sufficiently complex that significant effort is needed to obtain even one correct sequence for the region. Additional work is required to sort out the haplotypes segregating among various populations, many of which contribute to phenotypes associated with multiple developmental disorders [22]. While assemblies using next generation sequencing are beginning to approach the quality of long-read Whole Genome Shotgun assemblies [23], they continue to fail in complex regions. 
While it is likely that sequencing and assembly technology will improve such that de novo assembly of individual genomes will approach the quality of the human reference, it is not clear when this will happen. However, even when this is a common occurrence, we see a role for the GRC in integrating the data from thousands of human genomes to produce a “gold-standard” reference assembly. We anticipate a continued need for a high quality reference assembly that will allow any human sequence to be placed into a chromosome context quickly and easily. As we march down the path of personal genomics it is critical that we devote resources to the current reference assembly in order to support clinical applications. As we continue to understand how genotype influences phenotype, the best possible reference assembly available must be made available to the research community.
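The patch life cycle described in Box 1 and the release policy stated above can be illustrated with a minimal Python sketch. It is not GRC software: the class `Patch`, the functions `major_release_due` and `roll_over`, the unit label `ALT_LOCI`, and the patch names are all hypothetical and introduced only for illustration. The two thresholds (at least 100 fix patches, or more than 1% of the euchromatic sequence affected) and the fate of FIX and NOVEL patches are taken from the text.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Patch:
    name: str  # hypothetical patch identifier
    kind: str  # "FIX" or "NOVEL", as defined in Box 1


def major_release_due(n_fix_patches: int, euchromatic_fraction_affected: float) -> bool:
    """Release rule from the text: >= 100 fix patches or > 1% of euchromatin affected."""
    return n_fix_patches >= 100 or euchromatic_fraction_affected > 0.01


def roll_over(patches: List[Patch]) -> Dict[str, List[str]]:
    """Empty the PATCHES unit at a major release.

    FIX patches are rolled into the new chromosome assembly; NOVEL patches
    move to a regular assembly unit (called ALT_LOCI here, an assumed name).
    """
    result: Dict[str, List[str]] = {"ROLLED_INTO_ASSEMBLY": [], "ALT_LOCI": [], "PATCHES": []}
    for patch in patches:
        if patch.kind == "FIX":
            result["ROLLED_INTO_ASSEMBLY"].append(patch.name)
        elif patch.kind == "NOVEL":
            result["ALT_LOCI"].append(patch.name)
        else:
            raise ValueError(f"unknown patch kind: {patch.kind!r}")
    return result


if __name__ == "__main__":
    pending = [Patch("patch_a", "FIX"), Patch("patch_b", "NOVEL")]
    if major_release_due(n_fix_patches=120, euchromatic_fraction_affected=0.004):
        print(roll_over(pending))
```

Keeping the release rule separate from the roll-over step mirrors the text, where the decision to cut a major release and the handling of existing patches are described independently.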

            An overview of Ensembl.

            Ensembl (http://www.ensembl.org/) is a bioinformatics project to organize biological information around the sequences of large genomes. It is a comprehensive source of stable automatic annotation of individual genomes, and of the synteny and orthology relationships between them. It is also a framework for integration of any biological data that can be mapped onto features derived from the genomic sequence. Ensembl is available as an interactive Web site, a set of flat files, and as a complete, portable open source software system for handling genomes. All data are provided without restriction, and code is freely available. Ensembl's aims are to continue to "widen" this biological integration to include other model organisms relevant to understanding human biology as they become available; to "deepen" this integration to provide an ever more seamless linkage between equivalent components in different species; and to provide further classification of functional elements in the genome that have been previously elusive.

              Extensive Error in the Number of Genes Inferred from Draft Genome Assemblies

              Introduction Genome comparisons have revealed significant variation in gene family size, both within and between species, e.g. [1]–[7]. This variation can result from either the gain or loss of genes, each of which in turn may be favored by selection. Variation in the number of genes may have important consequences for understanding differences between species, especially for key morphological, physiological, and behavioral traits, e.g. [8], [9], [10]. The observed variation in gene numbers may represent genetic diversity resulting from the evolution of gene families [11], but may also have been incorrectly inferred from sequencing and assembly artifacts. In order to assess the genomic content of a particular species, current methods rely on published genome assemblies. Unfortunately, a major problem in genomics is assembly quality, especially given that it is very difficult to determine the accuracy of de novo assemblies [12], [13] and the fact that different assembly algorithms may give very different results [14]. Both computational and experimental methods have been applied to improve upon an assembly: computational approaches include innovations in the assembly algorithms themselves, e.g. [15], as well as methods developed to compare, validate, and gauge the quality of a particular assembly, e.g. [16]–[19]. Experimental approaches have been aimed at improving the connectivity of contigs and scaffolds e.g. [20], assigning and ordering scaffolds on chromosomes, e.g. [21], [22], and validating and refining the annotated genes using RNA data, e.g. [23], [24], [25]. Often computational and experimental methods are used in conjunction to improve an assembly, as further experimental evidence will be integrated or reassembled with the original draft assembly, e.g. [26]. 
Improvements in sequencing technology do not necessarily mean that assemblies as a whole have improved; indeed, shorter reads have increased the computational complexity of the assembly problem, e.g. [27], [28] and have resulted in more fragmented assemblies (i.e. there are a larger number of contigs). A number of factors confound accurate assembly, including the presence of transposable elements and other repetitive sequences [29], and the allelic variation present when heterozygous individuals are sequenced, e.g. [30]. Despite these obvious problems the number of assemblies produced is increasing, and thousands of genome sequencing projects are planned or in progress [31]. In many cases, gene annotation from the closest annotated relative will be transferred to these new genomes, and will further propagate the annotation problems to many new genome sequences. Low-quality assemblies result in low-quality annotations [18], [27], and these annotation errors cause both the over- and under-estimation of gene numbers, e.g. [32], [33]. One cause of the over-estimation of gene numbers is the splitting of allelic variation (i.e. haplotypes present in heterozygous individuals) into separate loci (Fig. 1A); we refer to such cases as “split” genes. Split genes appear as highly similar duplicated loci within genome assemblies, and are often placed in tandem to one another or with one copy on a small scaffold by itself, e.g. [34], [35]. A second cause of the over-estimation of gene numbers is the fragmentation of a single gene onto multiple contigs or scaffolds (Fig. 1B); we refer to such cases as “cleaved” genes. Because ab initio gene predictors are less likely to accurately infer gene models across sequence gaps, genes fragmented onto multiple contigs or scaffolds will be predicted as multiple separate genes, e.g. [30]. Note that gene models may also be cleaved simply because ab initio predictors have failed to join distant exons together in a single transcript, e.g.
[36], [37], though this type of error may be independent of the underlying assembly quality. A common cause of the under-estimation of gene number is the collapse of truly paralogous gene copies into a single locus (Fig. 1C). This occurs because newly formed duplicates are highly similar in sequence, and therefore hard to assemble as separate loci, e.g. [30], [38]. A second cause of under-estimation is simply that genes may not be represented in low-coverage genomes due to a large number of gaps (Fig. 1D). In such cases both total gene numbers and the size of individual gene families may be severely underestimated, e.g. [39].

Figure 1 (10.1371/journal.pcbi.1003998.g001). Examples of misassembly leading to misannotation. Each row shows the true state of the genome on the left (“Expected assembly”) and a common misassembly error on the right (“Observed misassembly”). A) A single gene may be assembled as two apparently paralogous loci, increasing the predicted gene count. B) A single gene may be fragmented into multiple pieces, each on different contigs or scaffolds. This cleavage can increase the number of predicted genes. C) Two paralogous genes may be collapsed into a single gene, decreasing the predicted gene count. D) A gene may be partially or entirely missing from the assembly, decreasing the number of predicted genes.

Many genome assemblies and annotations have improved over time due to further efforts aimed at both increasing sequence contiguity and adding functional data (e.g. RNA-seq) in order to correct gene models. Individual researchers may also contribute to the deconvolution of specific assembly errors, e.g. [27], [40] or to the improvement of specific gene models, e.g. [41], [42]. However, it is often the case that a great deal of research will be based upon the draft assembly before it has reached a finished state, and erroneous conclusions may result, e.g. [40].
As an extreme example, the initial draft human genome contained 223 bacterial genes thought to have been gained by horizontal gene transfer [43]. Closer analysis of this result suggested that many of these cases were simply bacterial contaminants incorrectly assembled into the human genome [44]. As a less extreme example, the initial human genome predicted between 30,000 and 40,000 protein-coding genes [43], [45]. As the draft assembly was updated and the gene annotation process was improved, the estimated number of genes in human has continued to fall, and is 20,805 as of February 2014 according to Ensembl [46]. This pattern repeats itself for nearly every draft genome, but is especially true of vertebrate genomes because of their size and complexity [28], [40]. The cascading effects of these errors may affect many downstream conclusions, from inferences about the evolutionary histories of genes to the ability to map genes involved in disease. Although many consequences of low-quality assemblies have been described, e.g. [27], [28], [47]–[49], few analyses have specifically examined the effect on gene copy-number (but see [32], [33]). Because many new, next-generation sequencing technologies are being used to construct genome sequences, we would also like to know the error characteristics inherent to different platforms. Here we examine gene numbers in multiple genome assemblies, using multiple sequencing technologies, and from multiple species. Our results suggest that low-quality assemblies can result in huge numbers of both added and missing genes, and that most of the additional genes are due to genome fragmentation (“cleaved” gene models). Based on these results we present simulation analyses that suggest that published genomes with surprisingly high numbers of genes may be in error, and further show how these problems can be corrected.
Results/Discussion

Errors in de novo assemblies of the chicken genome

To determine how total gene numbers are affected by genome assembly quality we compared predicted gene models in multiple versions of the chicken genome. We examined five different assemblies that were based on different sequencing technologies and sequencing depths. These assemblies vary in size and average coverage (Table 1; for more details on these assemblies, see [28]). The 2X fosmid-based assembly (average read length ∼950 bp) may be considered the least complete assembly, as it is the most fragmented, smallest in size, and has the least coverage of the five assemblies considered. The 12X 454-based assembly of the chicken genome was built with 454 single-end reads (average length ∼330 bp), 3 kb mate-pair inserts, and 20 kb mate-pair inserts using the Newbler assembler. The 82X Illumina-based assembly was built with high coverage of paired-end short-insert reads (average length 100 bp) and integrated with inserts of 2 kb in length using the SOAP assembler. The draft chicken reference genome (v2.1) was a 6X Sanger-based assembly that was improved with fosmid and BAC-end sequencing and reassembled with the PCAP assembler (it is also referred to as Galgal3 in some repositories). The final assembly used as a reference, the current chicken reference (v4.0; also referred to as Galgal4 in some repositories), was a further improvement to version 2.1. This hybrid assembly, which was already covered to 6X with Sanger reads and improved to 6.6X with BACs and fosmids, was again reassembled using the following additional 454 sequences: 10X fragment reads, 1.7X 3 kb inserts, and 1.2X 20 kb inserts; again, the PCAP assembler was used to integrate all the data into the final reference assembly. Although it is of high quality, even this reference is considered a “draft” genome.
Table 1 (10.1371/journal.pcbi.1003998.t001). Chicken genome assemblies, predicted partial and full-length GENSCAN genes, and completeness of conserved orthologs as assessed by CEGMA.

Assembly   Coverage   Contigs   Partial genes   Full-length genes   Completeness
Fosmids    2X         281711    138354          21250               14.1%
454        12X        45554     73262           36210               68.2%
Ref 2.1    6.6X       71609     86543           38199               66.5%
Illumina   82X        27093     64552           33324               74.6%
Ref 4.0    12X        25017     61405           35537               80.7%

We predicted genes on each of these five assemblies using the ab initio prediction methods implemented in GENSCAN [50] and Fgenesh [51]. GENSCAN was used with the “eukaryotic” model specified, and Fgenesh was used with the specific model for chicken available in the package. GENSCAN (Table 1) found a greater number of genes than Fgenesh (S1 Table), which typically produced more conservative counts but also more complete gene models. Both gene predictors found tens of thousands of genes for each assembly, and we found that the assemblies with the most scaffolds also had the most predicted genes (Table 1). However, a great many of the predicted genes (often more than 50%; Table 1) were lacking either a start or stop codon, or both. We suspected that the enrichment of small scaffolds was increasing the number of incomplete predictions, and filtered very small scaffolds ( 95% similarity over 80% of their length) were classified as “allelic splits.”

Generating simulated Drosophila assemblies

We attempted to transform the high-quality, near-complete D. melanogaster assembly into one resembling the Daphnia pulex assembly. In order to do this, we first collected information about the Daphnia pulex assembly from wFleabase ([68], http://wfleabase.org/), specifically, the scaffold lengths as well as positions and lengths of all gaps within those scaffolds. This filtered scaffold set contained 5,191 scaffolds [68].
However, when we examined the assembled scaffolds we found that nearly 25% of bases were gaps, represented by stretches of N's in the sequence. To understand how gene prediction software would handle such gaps, we manually inserted stretches of N's into the sequence of known D. melanogaster genes, and then predicted genes on the artificially created sequence. We found a limit on the length of a gap that the gene prediction software could span and still predict a single gene. GENSCAN, for instance, could not predict a single full-length gene across a gap of length 50 or greater. This implies that individual contigs are the fundamental unit useful for predicting genes, and that even individual large scaffolds fragmented into many contigs may result in the over-prediction of genes. We therefore chose 50 bp as the gap-length cutoff, separating scaffolds into individual contigs when stretches of N's longer than 50 characters were found. Applying this cutoff to the Daphnia pulex assembly revealed 17,924 “contigs” useful for gene prediction.

Drosophila melanogaster assembly release 5.44 was obtained from Flybase [69], in the form of six chromosome files. Using the distribution of contig sizes found in the Daphnia pulex assembly, we generated 10 simulated D. melanogaster assemblies with different numbers of contigs (Table 4). To do this, for any specified number, x, of contigs needed for the simulated D. melanogaster genome, we took the longest x contigs from the Daphnia pulex assembly. The reference D. melanogaster genome was then fragmented into x pieces by randomly cutting contigs of the lengths drawn from the Daphnia pulex assembly, while ensuring that the entire D. melanogaster sequence was included in each simulated dataset. Because the Daphnia pulex genome is roughly 170 Mb in length (not including N's) while the D. melanogaster genome is 138 Mb, we are conservatively excluding the class of extremely small scaffolds found in Daphnia pulex from our simulated genomes.

We predicted genes on each simulated assembly using GENSCAN, Fgenesh, AUGUSTUS, and MAKER. Although GENSCAN was used with a pre-specified human model, this has been shown to be sufficient for most eukaryotes (e.g. [51]). Fgenesh has a specific Drosophila model, and as a consequence produced much lower gene counts.

RNA-seq analysis

Paired-end RNA-seq data from an experiment by the Berkeley Drosophila Genome Project [70] were obtained from the public database ENA ([71], http://www.ebi.ac.uk/ena/). These paired-end reads were mapped against the simulated D. melanogaster assembly that had ∼18,000 contigs using the software BWA [72] with default parameters. Additional processing of the alignment was performed using samtools [73]. We filtered by read quality and mapping quality, and sought connecting paired-end reads in which each end mapped to a different scaffold. We used the positions of every exon in the predicted gene set for our simulated assembly to determine which exons were associated by the connecting paired-end reads. A set-merging algorithm was applied to chain together connected exons before the resulting gene set was analyzed.

Supporting Information

S1 Table. Assembly statistics and gene models predicted by Fgenesh for chicken genome assemblies. (DOCX)
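The set-merging step in the RNA-seq analysis is not specified further in the text. One common way to chain linked items into groups is a disjoint-set (union-find) structure, sketched below in Python. The use of union-find, the names `DisjointSet` and `chain_exons`, and the exon identifiers are all assumptions made for illustration, not the authors' implementation.

```python
from collections import defaultdict


class DisjointSet:
    """Minimal union-find with path compression (illustrative only)."""

    def __init__(self):
        self.parent = {}

    def find(self, item):
        # Register unseen items as their own root.
        self.parent.setdefault(item, item)
        root = item
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited node directly at the root.
        while self.parent[item] != root:
            next_item = self.parent[item]
            self.parent[item] = root
            item = next_item
        return root

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def chain_exons(exons, links):
    """Group exon IDs that are connected, directly or transitively, by links.

    exons: iterable of exon identifiers (hypothetical labels).
    links: iterable of (exon_a, exon_b) pairs, each assumed to be supported
           by a paired-end read whose two ends map to different contigs.
    Exons that appear only in `links` are not reported.
    """
    exons = list(exons)
    ds = DisjointSet()
    for exon in exons:
        ds.find(exon)
    for a, b in links:
        ds.union(a, b)
    groups = defaultdict(set)
    for exon in exons:
        groups[ds.find(exon)].add(exon)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical example: e1-e2 and e2-e3 are linked, e4 and e5 are not.
merged = chain_exons(["e1", "e2", "e3", "e4", "e5"],
                     [("e1", "e2"), ("e2", "e3")])
```

In this toy case the three linked exons end up in one group and the two unlinked exons remain separate. In a real analysis the links would come from the filtered alignments rather than a hand-written list.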

                Author and article information

                Journal
                Bioinformatics
                Oxford University Press
                1367-4803
                1367-4811
                15 August 2016
                07 April 2016
                Volume: 32
                Issue: 16
                Pages: 2508-2510
                Affiliations
                1Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, UK
                2East Anglian Medical Genetics Centre, Cambridge University Hospitals, NHS Foundation Trust, Cambridge CB2 0QQ, UK
                3National Institute of Agricultural Botany, Cambridge CB3 0LE, UK
                Author notes
                *To whom correspondence should be addressed.

                Associate Editor: John Hancock

                Article
                btw159
                10.1093/bioinformatics/btw159
                PMCID: PMC4978925
                PMID: 27153597
                © The Author 2016. Published by Oxford University Press.

                This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

                History
                Received: 27 January 2016
                Revised: 16 March 2016
                Accepted: 17 March 2016
                Page count
                Pages: 3
                Categories
                Applications Notes
                Genome Analysis

                Bioinformatics & Computational biology
