162
views
0
recommends
+1 Recommend
0 collections
    4
    shares
      • Record: found
      • Abstract: not found
      • Article: not found

      FLASH: fast length adjustment of short reads to improve genome assemblies

      ,  
      Bioinformatics
      Oxford University Press (OUP)

      Read this article at

      ScienceOpenPublisherPMC
      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          Next-generation sequencing technologies generate very large numbers of short reads. Even with very deep genome coverage, short read lengths cause problems in de novo assemblies. The use of paired-end libraries with a fragment size shorter than twice the read length provides an opportunity to generate much longer reads by overlapping and merging read pairs before assembling a genome. We present FLASH, a fast computational tool to extend the length of short reads by overlapping paired-end reads from fragment libraries that are sufficiently short. We tested the correctness of the tool on one million simulated read pairs, and we then applied it as a pre-processor for genome assemblies of Illumina reads from the bacterium Staphylococcus aureus and human chromosome 14. FLASH correctly extended and merged reads >99% of the time on simulated reads with an error rate of <1%. With adequately set parameters, FLASH correctly merged reads over 90% of the time even when the reads contained up to 5% errors. When FLASH was used to extend reads prior to assembly, the resulting assemblies had substantially greater N50 lengths for both contigs and scaffolds. The FLASH system is implemented in C and is freely available as open-source code at http://www.cbcb.umd.edu/software/flash. t.magoc@gmail.com.

          Related collections

          Most cited references3

          • Record: found
          • Abstract: found
          • Article: found
          Is Open Access

          Aggressive assembly of pyrosequencing reads with mates

          Motivation: DNA sequence reads from Sanger and pyrosequencing platforms differ in cost, accuracy, typical coverage, average read length and the variety of available paired-end protocols. Both read types can complement one another in a ‘hybrid’ approach to whole-genome shotgun sequencing projects, but assembly software must be modified to accommodate their different characteristics. This is true even of pyrosequencing mated and unmated read combinations. Without special modifications, assemblers tuned for homogeneous sequence data may perform poorly on hybrid data. Results: Celera Assembler was modified for combinations of ABI 3730 and 454 FLX reads. The revised pipeline called CABOG (Celera Assembler with the Best Overlap Graph) is robust to homopolymer run length uncertainty, high read coverage and heterogeneous read lengths. In tests on four genomes, it generated the longest contigs among all assemblers tested. It exploited the mate constraints provided by paired-end reads from either platform to build larger contigs and scaffolds, which were validated by comparison to a finished reference sequence. A low rate of contig mis-assembly was detected in some CABOG assemblies, but this was reduced in the presence of sufficient mate pair data. Availability: The software is freely available as open-source from http://wgs-assembler.sf.net under the GNU Public License. Contact: jmiller@jcvi.org Supplementary information: Supplementary data are available at Bioinformatics online.
            Bookmark
            • Record: found
            • Abstract: found
            • Article: found
            Is Open Access

            ALLPATHS 2: small genomes assembled accurately and with high continuity from short paired reads

            Background Recent advances in sequencing technology [1-5] have rapidly driven down the cost of DNA sequence data. However, with the reduction in cost come challenges to using the data. The new technologies deliver novel data types that vary from each other in format. In addition, they are significantly different from traditional Sanger chemistry sequencing data in both read length and error rate, making it challenging to adapt them to many applications, including de novo assembly. We consider here the problem of assembling short, high-quality reads, with the goal of producing genome sequences that are as good as or better than the 'draft sequence' quality that is the current standard. This standard is illustrated by two representative data sets: 87 bacterial genomes were recently sequenced using Roche/454 chemistry, having average contig and scaffold N50 (length-weighted median) lengths of 84 kb and 1.0 Mb, respectively (Table S1 in Additional data file 1); 13 bacterial genomes were sequenced by Sanger chemistry and subsequently finished, for which the draft sequences (prior to finishing) are, on average, 98.4% complete and have a base accuracy of less than 1 error per 104 bases (Table S2 in Additional data file 1). These data sets represent an appropriate standard of quality to which the community is accustomed and for which short read assemblies should strive. Recent work has begun to explore the possibilities of short read assembly [6-14], but high-quality assembly from experimentally generated paired reads has not been demonstrated, even for small genomes. In this work we used reads from the Illumina platform generated from two libraries, yielding linking information of different sizes (approximately 200 and 4,000 bp). The inclusion of the library from longer fragments greatly increased the potential contiguity of the assemblies. We previously demonstrated that high-quality assemblies could be obtained from simulated short paired reads [10]. We now demonstrate that high-quality assemblies can be obtained from real data as well. Our test cases consist of five finished and homozygous microbial genomes. Importantly, finished genome sequences represent an extremely high-quality standard with which to compare our assemblies, allowing us to evaluate the quality of our results precisely and rigorously. We also ran the assembly programs Velvet [12] and EULER-SR [9,14] on the same data sets and provide a side-by-side comparison. Supplemental material is included with the paper. The Illumina sequence data, reference sequences, assemblies, and software used in this paper are freely available at [15,16]. The Illumina sequence data are also available from the NCBI Short Read Archive, using project identifiers 40071, 40073, 40075, 40077, and 40079. Results We sequenced the genomes of three bacteria (Staphylococcus aureus, Escherichia coli, and Rhodobacter sphaeroides) and two fungi (Schizosaccharomyces pombe and Neurospora crassa) using the Illumina platform (Tables 1, 2, 3, 4 and 5; Table S3 in Additional data file 1). In all cases finished reference sequences were available. However, we found what appeared to be biological differences between our isolates and the reference sequences. There were only 11 differences in total for the first two bacteria, but 374 for R. sphaeroides. To provide a control data set for precisely evaluating our bacterial assemblies, we corrected the bacterial reference sequences using data from Illumina, then carefully validated the corrections using data from 454 and Sanger chemistries (Additional data file 1). For the fungi, we did not attempt to reconcile base-level differences, and thus, as compared to our samples, the accuracy of the reference sequences is lower. Table 1 ALLPATHS, Velvet and EULER assemblies of five microbial genomes: source data S. aureus E. coli R. sphaeroides S. pombe N. crassa Source data  Strain USA300 K12 MG1655 2.4.1 972 h 74A  Reference sequence Finished, curated Finished, curated Finished, curated Finished Finished  GC composition (%) 33 51 69 36 49  Genome size (kb) 2,873 4,639 4,603 12,554 39,226  Reference N50 (kb) 2,873 4,639 3,188 4,509 665  Sequence coverage (x) 89 139 370 148 123 For five genomes, statistics from ALLPATHS, Velvet, and EULER assemblies are shown. In all cases identical data were supplied to both programs. Contigs of size Q60 Q35 Q29   S. pombe Q42 Q37 Q29   N. crassa Q39 Q32 Q28 Misassemblies  % in class IV to V chunks   S. aureus 0.0% 6.6% 8.2%   E. coli 0.0% 6.9% 7.0%   R. sphaeroides 0.3% 3.0% 4.7%   S. pombe 0.5% 3.4% 7.2%   N. crassa 2.2% 13.6% 23.2% Long-range validity  At 100-kb distance   S. aureus 100.0% 45.6% 99.8%   E. coli 100.0% 86.4% N/A   R. sphaeroides 100.0% 75.8% N/A   S. pombe 99.8% 80.2% 100.0%   N. crassa 99.8% 13.6% N/A For five genomes, statistics from ALLPATHS, Velvet, and EULER assemblies are shown. Base accuracy: the inferred quality score for the bases in chunk classes I to III, obtained by dividing the number of errors in these chunks by the total number of bases, and then taking -10*log10 of this quantity. Misassemblies: total fraction of bases in chunks in classes IV to V. Long-range validity: we chose 10,000 pairs of 100-base sequences at random from the scaffolds, where the two sequences were separated by 100,000 bases, considering only cases where both sequences could be placed uniquely on the reference. We scored the placement as 'valid' if both sequences were placed on the same chromosome, in the same orientation, with separation 100,000 ± 25% bases. We report the fraction of valid placements. N/A: all scaffolds were of length 10 kb was broken into ⌊ n/104 ⌋ equal-sized chunks, ± 1 base. Contigs of size ≤10 kb were treated as separate chunks.) We then found the 'best' alignment of each chunk to the reference sequence, meaning the alignment that had the smallest total number of substitution and indel bases. We only considered alignments that subsumed a 100 base perfect match to the reference. (The special case where there was no alignment is described below.) From the best alignment we inferred the error rate of the chunk, understanding that these errors could either be errors in the assembly or in the reference sequence. Then we divided the chunks into six classes by their error rate. Class I contains the perfect chunks, class II chunks have error rates up to 0.1%, class III chunks have error rates up to 1%, class IV chunks have error rates up to 10%, and class V chunks have error rates of at least 10%. Class VI consists of chunks that appear not to match the reference at all, presumably due to either contamination of the sample or omissions from the reference, although very short and inaccurate chunks could also be in this category. We report the fraction of assembly bases in each class. Base accuracy We assayed the 'base accuracy' of the assemblies by considering the error rate for bases in chunk classes I through III (error rate 500 bp and >99% identity. The E. coli assembly has exactly one error: a single base mismatch. This event occurs in a perfectly repeated sequence of length 80 bp. The R. sphaeroides assembly has three errors. First, a 234 base deletion from the assembly, adjacent to a repeat. Second, a 10-kb component contains a 6.4-kb edge that matches a plasmid perfectly, but also a 3.4-kb edge that is misassembled and consists of repeated sequence. Third, a 160-kb component joins similar sequence between two plasmids, although all edges in the component match the reference sequence perfectly. This defect does not appear at all in the linearized scaffold. For the ALLPATHS fungal genome assemblies, contiguity and completeness were lower than for the bacterial genomes. Thus, the N50 contig size for S. pombe was 51 kb, and for N. crassa, 19 kb. A lower fraction of the genome was represented: genome coverage was 95.9% for S. pombe and 89.5% for N. crassa. These assemblies were accurate, although not as accurate as the bacterial assemblies. Base quality was computed to be about Q40, although this is a floor estimate, since errors in the reference sequences or biological differences would have been reported as assembly errors. Long-range validity is very good: the odds of being correctly connected at distance 100 kb in a scaffold are about 99.8%. Several factors can limit both the contiguity of assemblies and the fraction of the genome represented, including repeat sequence, extremes of GC composition, and non-equimolar sequences such as plasmids. We consider how these factors applied to two of the genomes: R. sphaeroides and N. crassa. R. sphaeroides There was 98.5% of the genome present in the assembly; the missing regions consisted mostly of repetitive sequences. In our experience with the Illumina sequencing method employed here, representation decreases significantly for sequences of higher GC composition (Figure S1 in Additional data file 1), such as this genome (69% GC). We therefore sequenced very deeply to compensate for reduced coverage in GC-rich parts. The genome contains plasmids with two additional characteristics that challenged assembly: they are present at higher molarity than the chromosomes and contain long near-perfect repeat sequences. N. crassa There was 89.5% of the genome present in the assembly. The 10.5% of the genome missing from the assembly is enriched in repetitive sequences and regions of very low GC composition. In summary, for bacteria, the ALLPATHS assemblies were markedly complete, contiguous and accurate. The observed base accuracy of less than one error per million rivaled that of finished sequence [18,19]. Indeed, there were only five edges with any errors in the three bacterial assemblies. These assemblies were better than the accepted community draft standard. For the fungi, and especially N. crassa, the assemblies were accurate, but at a lower level of completeness and contiguity. To understand how the ALLPATHS assemblies would compare to assemblies produced by existing software, we also assembled the identical datasets with Velvet [12] and EULER-SR [9,14], using standardized arguments for each assembler applied to all five genomes. In each case, we initially tested a range of arguments, with the goal of finding a single choice for settings that would optimize assembly quality. We note, however, that some choices optimized continuity at the expense of accuracy whereas other choices did the reverse. For Velvet and EULER-SR, we arrived at a single formula for each that was used in all assemblies presented here (Additional data file 1). We first compare the results of ALLPATHS and Velvet (see Tables 2, 3, 4 and 5 for details). For S. aureus and E. coli, the ALLPATHS contigs were about five times longer than the Velvet contigs, whereas for the other three species, contig lengths were comparable. Scaffolds were longer for ALLPATHS for two species, and longer for Velvet for the other three; however, the ALLPATHS scaffolds were far more accurate. The odds of being correctly connected at distance 100 kb in an ALLPATHS scaffold were 100%, 100%, 100%, 99.8%, or 99.8%, depending on the species, whereas the same odds for Velvet were 45.6%, 86.4%, 75.8%, 80.2%, or 13.6%. In all five cases, the ALLPATHS assemblies were somewhat more complete. The base accuracy of the ALLPATHS assemblies was higher than for Velvet. For the bacterial genomes, where the reference sequences were essentially base perfect, thus enabling highly accurate measurement, we found that the ALLPATHS base quality was approximately Q60, whereas the Velvet base quality was approximately Q34. The reported base accuracies of the fungal assemblies were also higher for ALLPATHS. Finally, misassemblies were also much less frequent in the ALLPATHS assemblies: the misassembly rates for the ALLPATHS bacterial assemblies were 0%, 0%, and 0.3%, whereas those for the Velvet bacterial assemblies were 6.6%, 6.9%, and 3.0%. The ALLPATHS fungal assemblies also had about six-fold fewer misassemblies. We also report results for EULER-SR assemblies (see Tables 2, 3, 4 and 5 for details). We note that EULER-SR scaffolding was minimal, yielding scaffolds of about the same size as contigs. In all cases EULER-SR contigs were shorter than either ALLPATHS or Velvet contigs. The EULER-SR assemblies were also substantially less accurate, in terms of both base accuracy and misassembly rate, than those produced by ALLPATHS and Velvet. Finally, we carried out a series of 50 assemblies with the goal of understanding whether the results of Velvet or EULER-SR might be improved by using only the reads from approximately 200-bp fragments, thus excluding the shorter 26 base reads from long ('jumped') fragments, that might hinder the performance of the algorithms. For each of the two programs we tried two versions of the code, as well as multiple values of K: for Velvet we tried K = 25, 28, and 31, whereas for EULER-SR we used K = 25 and 28, the maximum allowed value. However, the EULER-SR assemblies with K = 28 terminated prematurely so we were unable to include the results. Table S5 in Additional data file 2 displays the results of these auxiliary 'jump-free' assemblies. In brief, we note the following. Not surprisingly, scaffolds are much shorter for the auxiliary assemblies. For example, for S. aureus, the Velvet scaffolds are six-fold (or more) shorter than those in the assembly that uses all of the data. In some other ways, some of the assemblies from less data were better. For example, the auxiliary assembly of R. sphaeroides obtained from the older version of Velvet using K = 28 yielded contigs whose N50 size was 19% less than for the assembly of all the data, but which covered 2% more of the genome and were more accurate (0.5% misassembled versus 3%). Discussion Here we demonstrate that high-quality assemblies of small genomes can be obtained from short reads, provided they are paired. At least for bacterial genomes, the ALLPATHS 2 assembly quality clearly exceeds that of draft assemblies from Sanger method data. Indeed, the base accuracy of 10-6 greatly exceeds that of the finished sequence standard of 10-4 originally set for the human genome (which has to come to represent a standard for genome finishing), and the standard of 10-5 that was achieved [19]. Sequencing costs are already low enough that the same laboratory and computational methods might be applied to hundreds or thousands of such genomes. Given the current competitive environment among sequencing technologies, we expect costs to fall substantially further. Difficult regions, such as recent duplications, will continue to pose a challenge, as they did with Sanger method sequencing. By creating assembly graphs, rather than linear assemblies (consisting of contigs and scaffolds), we can, in principle, capture these regions without fully disambiguating them [10,20]. Resolving these ambiguities will require improvements to the molecular biology, for example, very long reads. In evaluating our results, we compared ALLPATHS to Velvet and EULER-SR, two other short read assemblers, on five data sets, and observed that the ALLPATHS assemblies are substantially more accurate. We note that all three programs were invoked with standardized settings for all genomes. Had we tuned the settings specifically for each genome using our knowledge of the reference sequences, performance would likely have been better in many cases, but the results less generalizable. We now consider further improvements and new challenges for short read assembly. While we have demonstrated that nearly perfect bacterial assemblies can be generated from short reads, typically approximately 1% of the genome is missed. The fungal genomes, which are both more complex and five- to ten-fold larger, push the limits of short read assembly. Beyond this, we would like to assemble genomes even larger and more complex, such as those of mammals. Ingredients for success would include the following. Longer reads Read lengths from Illumina's technology have increased to 100 bases in the past year, and longer reads are likely soon (data not shown). Better representation By avoiding cloning bias, the new sequencing technologies can access regions that were previously recalcitrant, but they introduce new biases, for example by under-representing GC-rich regions. Early laboratory developments in this area are promising [21,22]. Efficient library construction Libraries with large numbers of distinct read pairs are required for assembly of large genomes. The key is to control process losses so that, from available amounts of starting material, sufficiently large libraries are produced. The importance of this increases proportionally with genome size as more read pairs are needed to cover the genome. We note that assembly algorithms must account for developments in laboratory methods. For example, a recent method [5] based on blunt end ligation rather than restriction generates jumping construct libraries of sufficient complexity for large genomes and not having a hard size limit on end reads. However, this method yields many reads containing the ligation point, thus posing a substantial algorithmic challenge. Scalable computation Methods must be adapted so that the computational requirements for large genomes are feasible. All of these ingredients could be at hand within a year, enabling genome assembly to fully cross the threshold from the Sanger method era to the 'short read' era, with comparable quality assemblies produced at approximately 100-fold lower cost. Materials and methods ALLPATHS was first tested on simulated data [10]. Below we describe modifications that were needed for it to work well on real data. Removal of sequencing artifacts We now discard all read pairs where both reads of the pair consist of 90% or more A bases. On the Illumina platform, such pairs are nearly always artifacts of the sequencing process, and on some runs can be sufficiently abundant as to cause problems for the assembly algorithms. Trusted K-mer identification We identify putatively correct (or 'trusted') K-mers in the reads based on quality scores. This affects how we create unipaths and how we correct errors, as described below. For each K-mer that appears in the reads, we find all its instances in the reads, and examine the collection of read quality scores for a given base in the K-mer. That base is called trusted if there are enough good scores (default: 4 scores of at least 25). The entire K-mer is called trusted if each of its bases is trusted. Rehabilitation of K-mers between short fragment pairs If for such a pair there is no path of trusted K-mers from one end to the other, but there is a path that uses some untrusted K-mers, we rehabilitate those K-mers, changing their status to trusted. Unipath creation We find the (K+1)-mers in the reads whose first and last K-mer are trusted. The unipaths are mathematically defined by the trusted K-mers together with these (K+1)-mers, which define adjacencies between the trusted K-mers. Error correction This has been modified to use the new definition of trusted K-mers. Unipath graph shaving After the initial unipath creation, there are many cases in which read errors result in short terminal branches within the graph. Those shorter than 20 K-mers are now removed, provided that there is a longer alternative branch. Unipath recovery This code identifies unipaths that are not represented in the assembly, extends them unambiguously where possible, and then adds them to the assembly. Typically this finds small regions that have relatively high copy number. ALLPATHS computational requirements The five genomes were assembled on a 16-processor Dell server having 128 GB of memory. Some of the code is parallelized. The wall-clock times for the assemblies were: S. aureus, 1.7 hours; E. coli, 8.2 hours; R. sphaeroides, 10.2 hours; S. pombe, 80.5 hours; N. crassa, 86.6 hours. Abbreviations N50: length-weighted median. Authors' contributions AG, JM, KM, SR and LW developed methodology for EcoP15I ditag jumping libraries and constructed the libraries that were used. SY and TPS curated reference sequences and carried out assembly experiments using Velvet and EULER. SG and IM led ALLPATHS computational R&D. JB, DP and IS designed and implemented algorithms. DJ participated in R&D and wrote the manuscript. CN edited the manuscript and coordinated activities between groups. All authors read and approved the final manuscript. Additional data files The following additional data are available with the online version of this paper: Tables S1 to S4, Figure S1, and several supplemental explanations (Additional data file 1); Table S5 (Additional data file 2). Supplementary Material Additional data file 1 Tables S1 to S4, Figure S1, and several supplemental explanations. Click here for file Additional data file 2 Table S5. Click here for file
              Bookmark
              • Record: found
              • Abstract: found
              • Article: found
              Is Open Access

              Phenotypic connections in surprising places

              The human disease landscape The past few years have witnessed a growing number of well documented connections between and among human disease phenotypes, whose relationship would not have been obvious within the current disease classification framework. The evidence stems from a variety of sources, spanning clinical epidemiology, computational genomics and various model systems [1-5]. The implications are potentially so fundamental to disease etiology, drug development and the general diagnostic paradigm that there have been calls for an NIH Roadmap (large trans-institute transformational grants) focused on delineating the human disease landscape - a quantitative bipartite correlation map relating disease phenotypes and their genetic structures. A new study of phenotype correlations [6] now takes this idea further. One of the earliest comprehensive studies of human phenotype correlations was by Rzhetsky et al. [5], who established a correlation network between 161 diseases and disorders using evidence of comorbidities obtained from some 1.5 million patient records. Among other results, they found suggestive evidence for genetic relations between autism, which manifests in childhood, and several late onset diseases, including bipolar disorder and schizophrenia. A different approach was taken by Goh et al. [1], who constructed a human disease network by linking genetic disorders that are known to share causative genes. The network was based on an analysis of the Online Mendelian Inheritance in Man (OMIM) database, the most comprehensive compendium of well established associations between human disorders and their associated genes [7]. Among their findings was a large subnet formed by 516 of the 1,284 disorders studied, clearly showing many-to-many relationships between phenotypes and genes. For example, KRAS (encoding a small GTPase), BRCA1 and BRCA2 (both encoding tumor suppressors) are all involved in breast cancer, but KRAS is also implicated in pancreatic cancer, whereas BRCA1 and BRCA2 are associated with papillary carcinoma and prostate cancer, respectively [7]. The results of these and other studies support the concept that genes underlying a disorder tend to be functionally related; for example, they could be part of a particular protein complex, a particular pathway or process, or a particular set of coexpressed genes. A disorder can then be viewed as a phenotype that emerges from dysfunction of one or more components of a functionally coherent gene module. An important aspect of modularization is that any gene in the module that was not previously identified with the phenotype is a candidate for association with it [8]. Moreover, a particular module can, to varying extents, underlie related phenotypes, opening up the possibility of linking phenotypes not only on the basis of shared genes, but also on the basis of functionally related genes that are not shared. Inferring disease connectivity through functional linkage starts with a network of genes whose functions are correlated; that is, each pair of nodes (genes or proteins) is connected by one or more sources of evidence supporting its functional coherence, such as physical interaction, correlated expression, adjacency in the same metabolic pathway or genetic interaction [2-4]. Connections can then be inferred between the diseases whose associated genes are linked in the gene network. For instance, we and our colleagues [3] constructed a network for the human genome by integration of diverse types of evidence using a Bayesian model, and annotated it with all OMIM disease genes for diseases known to be associated with five or more genes. Genes that are most tightly linked to those known to be associated with a given disease are immediate candidates for association with that disease. Furthermore, connections between disease pairs can be quantitatively identified on the basis of the magnitude of the functional linkage between their disease-associated genes. Thus, two diseases can be linked even if they do not share disease-associated genes. The associations found ranged from phenotypically disparate disease pairs, such as multiple sclerosis with malaria, to phenotypically similar pairs, such as muscular dystrophy with myopathy. Such results suggest that the current disease classification may be much less informative than is commonly believed. Recognition of molecular connections between disease phenotypes provides immediate insight into the molecular mechanisms underlying different diseases and can therefore generate novel hypotheses for therapeutic strategies. This is especially valuable if one disease is well studied and the other is not; it might also be valuable if viable drug targets have been found for one of the diseases but not the other. This prospect of drug repositioning could accelerate the introduction of therapies by years [9]. Human disease models and phenotypic connections in surprising places Recent work from Edward Marcotte's laboratory by McGary et al. [6] takes these concepts, and indeed the entire field of phenomics, to an entirely new level by developing, establishing and exploring a quantitative method to find non-obvious phenotypic connections, not just within a species, but across species. Such cross-species phenotypic connections stem from the evolutionary conservation of the underlying associated gene modules [8,10]. The importance of the work reported by McGary et al. [6] can be glimpsed by recalling that whereas OMIM, which is the most extensive database of well established human gene-phenotype associations, contains approximately 5,400 unique associations [7], the numbers for some model organisms are 1 to 2 orders of magnitude higher [6]. Some of these relations - such as between obesity and its implicated genes in mice - have obvious equivalences in humans, and when they do, the laboratory model can serve as a useful surrogate to study the disease or disorder. Other phenotypic pairs are entirely non-obvious, such as the fact that retinal cancer in humans and ectopic vulvae in the nematode can both be caused by disruption of the human retinoblastoma 1 (RB1) gene or its nematode ortholog [6]. Because such non-obvious relations can occur with surprising frequency, a method that can rapidly and reliably link human genes to non-obviously related phenotypes in model systems offers the prospect of radically accelerating the rate at which we can explore disease landscapes in both humans and model organisms. McGary et al. [6] provide more than a glimpse at the possibilities. They begin by defining 'phenologs' as cross-species mutant phenotypes that share a significant number of orthologous genes. The statistical significance of phenologs, whose emergence results from disruption of orthologous genes, can be estimated as the probability that the observed number of orthologues common to the two phenotypes would be found by chance and correcting for multiple hypotheses. In this way, using approximately 300 human diseases and over 6,000 phenotypes in model organisms, including mouse, worm, yeast and Arabidopsis, they identify 4,390 significant phenologs. As one of the positive controls, the authors note that the 3,755 mouse-human phenologs identified contain many of the known disease models - including cataracts, deafness and retinal disease - all at P-values well below 10-8. Given that phenologs map gene-phenotype associations across the phyla, an association known in one species can be used to find non-established relations in another. Cross-validations presented by McGary et al. [6] show that phenologs can predict genes associated with about a third to a half of tested human diseases. The work [6] offers tantalizing evidence for several counterintuitive mammalian disease models, including reduced growth rate of yeast deletion strains in medium enriched with the cholesterol-lowering drug lovastatin as a model for abnormal angiogenesis in mice, and negative gravitropism defects in Arabidopsis as a model for human Waardenburg syndrome (which causes deafness with defects in neural-crest-derived tissues). Furthermore, using the yeast model, they demonstrate that SOX13 (encoding a transcription factor related to the sex-determining gene SRY) is a new gene that regulates angiogenesis, and using the Arabidopsis model they show that SEC23IP (encoding a protein that interacts with SEC23, a component of the COPII complex that controls endoplasmic reticulum-to-Golgi trafficking) is probably a new Waardenburg syndrome gene. Notwithstanding the fact that many functionally coherent gene modules are conserved across different species [10], such demonstrations - especially the identification of phenologs that predate kingdom divergence - seem to mark one of those uncommon occasions in science in which intuition built on years of experience fails completely. Additional connections As staggering as these results [6] are, it seems possible that many phenologs have been missed, because the method is confined to connections based only on phenotype pairs sharing known orthologous genes. Gene-phenotype association data may, however, be far from complete; if so, many orthologous phenotypes will be missed. One possible way to increase the discovery rate would be to consider the functional relatedness for the genes associated with each phenotype. As described earlier, gene-gene functional relatedness has been successfully used to identify phenotypic connections within the same species [2-4], and it is possible that the same principle can be applied to identify phenotype connections between different species. A deeper and far more important connection relates to the impact of McGary et al. [6] on efforts to develop a more complete picture of the topology of the human phenome. It is evident that the way phenomics looks today, the amount of information and the incredible interconnectivities, could not have been imagined even 3 years ago. It seems likely that the methods developed and demonstrated by McGary et al. [6], and their inevitable extensions, will add unprecedented knowledge and quantitative detail to the interrelated landscapes of mutant phenotypes for humans and other species, and this will offer many more surprising correlations between and among human diseases. As our picture of the human disease landscape continues to take shape, perhaps the one thing that we should not be surprised about will be the need to fundamentally rethink the current disease classification and its associated diagnostic paradigm.
                Bookmark

                Author and article information

                Journal
                Bioinformatics
                Bioinformatics
                Oxford University Press (OUP)
                1367-4803
                1460-2059
                October 21 2011
                November 01 2011
                September 07 2011
                November 01 2011
                : 27
                : 21
                : 2957-2963
                Article
                10.1093/bioinformatics/btr507
                3198573
                21903629
                95e8753a-a244-43fe-818f-090f602cd2dd
                © 2011

                Comments

                Comment on this article