Introduction
An emerging challenge in genomics is the ability to examine multiple disease regions within the human genome and to recognize a subset of key genes that are involved in a common cellular process or pathway. This is a key task in translating experimentally ascertained disease regions into a meaningful understanding of pathogenesis. The importance of this challenge has been highlighted by advances in human genetics that are facilitating the rapid discovery of disease regions in the form of genomic regions around associated SNPs (single nucleotide polymorphisms) or CNVs (copy number variants). These disease regions often overlap multiple genes, though only one is typically relevant to pathogenesis and the remainder are spuriously implicated by proximity. The difficulty of this task is heightened by the limited state of cataloged interactions, pathways, and functions for the vast majority of genes. However, undefined gene relationships might often be conjectured from the literature, even if they are not explicitly described yet. The general strategy of using function to prioritize genes in disease regions has been substantially explored. However, predicted disease genes have not, in general, been easily validated. Thus far, published approaches have utilized a range of codified gene information including protein-interaction maps, gene expression data, carefully constructed gene networks based on multiple information sources, predefined gene sets and pathways, and disease-related keywords. We propose, instead, to use a flexible metric of gene relatedness that not only captures clearly established close gene relationships, but also has the ability to capture potential undocumented or distant ones. Such a metric may be a more powerful tool for approaching this problem than relying on incomplete databases of gene functions, interactions, or relationships.
To this end, we use established statistical text mining approaches to quantify relatedness between two genes; specifically, gene relatedness is the degree of similarity in the text describing them within article abstracts. The published literature represented in online PubMed abstracts encapsulates years of research on biological mechanisms. We and others have shown the great utility of statistical text mining to rapidly obtain functional information about genes, including protein-protein interactions, gene function annotation, and measuring gene-gene similarity. Text is an abundant and underutilized resource in human genetics, and currently a total of 140,000 abstracts from articles that reference human genes are available through PubMed. Additional valuable information can be seamlessly gained by including more than 100,000 references from orthologous genes; many important pathways have been more thoroughly explored in model systems than in humans. We have developed a novel statistical method to evaluate the degree of relatedness among genes within disease regions: Gene Relationships Among Implicated Loci (GRAIL). Given only a collection of disease regions, GRAIL uses our text-based definition of relatedness (or alternative metrics of relatedness) to identify a subset of genes that are more highly related than expected by chance; it also assigns a select set of keywords that suggest putative biological pathways. It uses no information about the phenotype, such as known pathways or genes, and is therefore not tethered to potentially biased pre-existing concepts about the disease. In addition to a flexible text-based metric of relatedness, GRAIL's ability to successfully connect genes also leverages a statistical framework that carefully accounts for differential gene content across regions.
We assume that each region contains a single pathogenic gene; therefore narrow regions with one or just a few genes are more informative than expansive regions with many genes, since expansive regions are likely to contain many irrelevant genes. To take advantage of this, we have designed GRAIL to set a lower threshold in considering relatedness for those genes in narrow regions, allowing for more distant relationships to be considered; on the other hand it sets a more stringent threshold for genes located in expansive multigenic regions and considers only the very closest of relationships. This strategy prevents large regions with many genes from dominating the analysis. In this paper we apply GRAIL to four phenotypes. In each case GRAIL is able to identify subsets of genes enriched for relatedness, more than expected by random chance. We demonstrate enrichment for relatedness among true disease regions rigorously based on both GRAIL's theoretically derived p-value and also based on parallel analysis of either (1) carefully selected random regions matched for gene content and size or (2) experimentally derived false positive disease regions. GRAIL is able to identify subsets of highly related genes among validated SNP associations. First we use GRAIL to identify related genes from SNPs associated with serum lipid levels; GRAIL correctly identifies genes already known to influence lipid levels within the cholesterol biosynthesis pathway. In comparison to randomly selected matched SNP sets, the set of lipid SNPs demonstrates significantly more relatedness. Second, we use GRAIL to identify significantly related genes near height-associated SNPs; these genes highlight plausible pathways involved in height. In comparison to randomly selected matched SNP sets, the set of height SNPs also demonstrates significantly more relatedness.
Encouraged by GRAIL's ability to recognize biologically meaningful connections, we tested its ability to distinguish true disease regions from false positive regions in two practical applications in human genetics. First, in Crohn's disease, we start with a long list of putative SNP associations from a recent GWA (genome-wide association) meta-analysis. We demonstrate that a substantial fraction of these SNPs contain highly related genes, far beyond what can be expected by chance. We demonstrate that many of these SNPs subsequently validate in an independent replication genotyping experiment. Second, in schizophrenia, we previously identified an over-representation of rare deletions in schizophrenia cases compared to controls. Despite the statistical excess, it is challenging to identify exactly which case deletions are causal, given the relatively high background rate of rare deletions in controls. Using GRAIL, however, we are able to demonstrate that a subset of case deletions contain related genes. We further demonstrate that these genes are highly and significantly enriched for central nervous system (CNS) expressed genes. In stark contrast, GRAIL finds no excess relatedness among genes implicated by control deletions.
Results
Summary of statistical approach
GRAIL relies on two key methods: (1) a novel statistical framework that assesses the significance of relatedness between genes in disease regions and (2) a text-based similarity measure that scores two genes for relatedness to each other based on text in PubMed abstracts. Details for both are presented in the Methods. The GRAIL statistical framework consists of four steps (see Figure 1). First, given a set of disease regions we identify the genes overlapping them (Figure 1A); for SNPs we use LD (linkage disequilibrium) characteristics to define the region. Second, for each overlapping gene we score all other human genes by their relatedness to it (Figure 1B).
In this paper we use a text-based similarity measure; alternative measures of relatedness, for example similarity in gene annotations or expression data, could be easily applied instead. Third, for each gene we count the number of independent regions with at least one highly related gene (Figure 1C); here the threshold for relatedness varies between regions depending on the number of genes within them. We assign a p-value to that count. Fourth, for each disease region we select the single most connected gene as the key gene. We assign the disease region that key gene's p-value after adjusting for multiple hypothesis testing (if there are multiple genes within the region) (Figure 1D). This final score is listed in this paper as pmetric, where the metric is text, expression, or annotation based. Very low ptext scores for one region indicate that a gene within it is more related to genes in other disease regions through PubMed abstracts than expected by chance. Simulations on random groups of SNPs demonstrate that the ptext values approximately estimate Type I error rates, being approximately uniformly distributed under the null hypothesis (see Figure S1). However, we recommend the use of careful simulations or controls rather than actual theoretical p-values to reinforce the significance of GRAIL's findings, as we do in the examples below.
10.1371/journal.pgen.1000534.g001 Figure 1. The Gene Relationships Among Implicated Loci (GRAIL) method consists of four steps. (A) Identifying genes in disease regions. For each independent associated SNP or CNV from a GWA study, GRAIL defines a disease region; then GRAIL identifies genes overlapping the region. In this region there are three genes. We use gene 1 (pink arrow) as an example. (B) Assessing relatedness to other human genes. GRAIL scores each gene contained in a disease region for relatedness to all other human genes.
GRAIL determines gene relatedness by looking at words in gene references; related genes are defined as those whose abstract references use similar words. Here gene 1 has word counts that are highly similar to gene A but not to gene B. All human genes are ranked according to text-based similarity (green bar), and the most similar genes are considered related. (C) Counting regions with similar genes. For each gene in a disease region, GRAIL assesses whether other independent disease regions contain highly similar genes. GRAIL assigns a significance score to the count. In this illustration gene 1 is similar to genes in three of the regions (green arrows), including gene A. (D) Assigning a significance score to a disease region. After all of the genes within a region are scored, GRAIL identifies the most significant gene as the likely candidate. GRAIL corrects its significance score for multiple hypothesis testing (by adjusting for the number of genes in the region), to assign a significance score to the region.
The text-based similarity metric is based on standard approaches used in statistical text mining. To avoid publications that report on or are influenced by disease regions discovered in the recent scans, we use only those PubMed abstracts published prior to December 2006, before the recent onslaught of GWA papers identifying novel associations. This approach effectively prevents the evaluation of gene relationships from being confounded by papers listing genes in regions discovered as associated to these phenotypes. In addition to including primary abstract references about genes listed in Entrez Gene, we augment our text compendium with references to orthologous genes listed in Homologene; this increases the number of articles available per gene from 6 to 12 (see Table 1).
We note that the distribution of articles per gene is skewed toward a small number of genes with many references; 0.4% of genes are referenced by >500 articles, while 26% of genes are referenced by 0.1. The scatter plot on the right illustrates ptext values for actual serum cholesterol associated SNPs (blue dots). Black horizontal line marks the median ptext value. We assessed the same SNP with similarity metrics based on gene annotation (green dots) and gene expression correlation (purple dots). (B) 42 SNPs associated with height. Similar plot for 42 height associated SNPs. The histogram on the left of the graph illustrates ptext values for random SNP sets carefully matched to height-associated SNP set. 86.5% of those SNPs have ptext values that are >0.1. The scatter plot on the right illustrates ptext values for actual SNPs associated with height (blue dots). Black horizontal line marks the median ptext value. We assessed the same SNP with similarity metrics based on gene annotation (green dots) and gene expression correlation (purple dots). On the right we list for each ptext threshold the number of expected SNPs less than the threshold based on matched sets, and the number of observed SNPs less than the threshold among height associated SNPs. Despite relatively comprehensive lipid biology annotation, GO does not identify relationships between regions as effectively as published text (Figure 2A). A total of 12 out of the 19 associated SNPs obtained pannotation 10−4); the remaining 22 regions had intermediate levels of significance following replication (and can be considered as yet unresolved associations) . We applied GRAIL prospectively to these 74 nominally associated SNPs. GRAIL was initially operated independent of any knowledge of the contemporaneous replication genotyping experiment. Each region contained between 1 and 34 genes, except for two regions that contained no genes and were not scored. 
GRAIL identified 13 regions as significant (achieving ptext scores 0.1.
10.1371/journal.pgen.1000534.t002 Table 2. High scoring regions from a Crohn's disease GWA meta-analysis.
SNP         Chr  Position (HG17)  passociation  Replication Study Result  N (genes)  Implicated Gene  ptext
rs2066845   16   49314041         1.5E-24       VALIDATED                 3          NOD2             0.00010
rs10863202  16   84545499         1.4E-05       INDETERMINATE             4          IRF8             0.00058
rs10045431  5    158747111        1.9E-13       VALIDATED-NOVEL           1          IL12B            0.00066
rs11465804  1    67414547         3.3E-63       VALIDATED                 1          IL23R            0.00094
rs2476601   1    114089610        7.3E-09       VALIDATED-NOVEL           8          PTPN22           0.0014
rs762421    21   44439989         7.0E-10       VALIDATED-NOVEL           1          ICOSLG           0.0023
rs2188962   5    131798704        1.2E-18       VALIDATED                 9          IRF1             0.0026
rs917997    2    102529086        1.1E-05       INDETERMINATE             5          IL18RAP          0.0027
rs11747270  5    150239060        1.7E-16       VALIDATED                 3          IRGM             0.0032
rs2738758   20   61820069         2.7E-06       INDETERMINATE             10         TNFRSF6B         0.0038
rs9286879   1    169593891        7.7E-10       VALIDATED-NOVEL           4          TNFSF18          0.0042
rs2301436   6    167408399        5.2E-13       VALIDATED-NOVEL           3          CCR6             0.0052
rs4263839   9    114645994        1.3E-10       VALIDATED                 2          TNFSF8           0.008
rs3828309   2    233962410        1.2E-32       VALIDATED                 4          USP40            0.019
rs744166    17   37767727         3.4E-12       VALIDATED-NOVEL           2          STAT3            0.023
rs7758080   6    149618772        4.4E-06       INDETERMINATE             4          SUMO4            0.033
rs7161377   14   75071147         2.3E-05       INDETERMINATE             1          BATF             0.09
Here we list the subset of the 74 regions from a Crohn's disease GWA meta-analysis to which GRAIL assigned the most compelling ptext scores. The first three columns list information about the associated SNP. The fourth column lists the combined p-value of association from a GWA meta-analysis and subsequent replication. The fifth column indicates whether the region was validated, indeterminate, or failed in replication. Those regions that represent novel findings, not previously published, are also indicated. The sixth column lists the number of genes in the disease region, and the seventh column lists the candidate gene identified by GRAIL. The eighth column lists the region's ptext score.
Using these Crohn's results, we have compared GRAIL's performance to four other competing algorithms that also use functional information to prioritize genes, and GRAIL's performance is superior at predicting true positive associations (see Text S1, Figure S2, Table S5, Table S6). As a further test of GRAIL, we then evaluated the next most significant 75 associated SNPs that emerged from the Crohn's disease GWA meta-analysis (association p-values ranging from 5×10−5 to 2×10−4). Out of the 75 regions, 8 are not near any gene, and we did not score them. The remaining 67 regions were tested with GRAIL for relationships to the 52 replicated and indeterminate regions that emerged following replication. Two emerge with highly significant GRAIL scores: rs8178556 on chromosome 21 (IFNAR1, ptext = 1.7×10−4) and rs12928822 on chromosome 16 (SOCS1, ptext = 8.2×10−4), suggesting these independent regions may lead to novel associated SNPs for Crohn's disease (see Table S7). We next applied GRAIL to recently published sets of rare deletions seen in schizophrenia cases and matched controls. Multiple groups have recently demonstrated that extremely rare deletions, many of which are likely de novo, are notably enriched in schizophrenia. However, since rare deletions occur frequently in healthy individuals as well, many of these case deletions will also be non-pathogenic. In fact, we previously found that large (>100 kb), gene-overlapping, singleton deletions were present in 4.9% of cases but also in 3.8% of controls, suggesting that over two-thirds of these deletions are not relevant to disease. We identified 165 published de novo or case-only deletions of >100 kb overlapping at least one gene; a total of 511 genes are deleted or disrupted by these deletions. Additionally, we identified 122 similar control-only deletions; a total of 252 genes are deleted or disrupted by these deletions. We applied GRAIL separately to both the case and control sets of deletions.
In the case deletions, we identified a subset containing highly connected genes (Figure 4A). Specifically, 12 of the 165 regions obtain ptext scores 0.5 (Figure 3B). These regions might have been missed because the relevant gene is poorly studied or, even if the gene is well studied, the relevant function of that gene is not well documented in the text. An alternative possibility is that the SNP is tagging non-genic regulatory elements. Additionally, the SNP may be the first discovered representative association for a critical pathway that is not represented by other SNP associations, and therefore cannot be connected to them. In this case future discoveries will clarify the significance of that association. In cases where there is no apparent published connection between associated genes, other similarity metrics based on experimentally derived data, such as gene expression, protein-protein interactions and transcription factor binding sites, could also complement the text-based approaches presented here. In fact, we demonstrate how annotation-based metrics or gene expression-based metrics are able to identify a subset of the associated SNPs in lipid metabolism. As these and other metrics are optimized, they could be used in conjunction with the novel GRAIL statistical framework that we present here to help understand gene relationships.
Methods
Scoring regions for functional relatedness
The Gene Relationships Among Implicated Loci (GRAIL) method has four basic steps that are outlined below. It has two input sets of disease regions: (1) a collection of NSEED seed regions (SNPs or CNVs) and (2) a collection of NQUERY query regions. Genes in query regions are evaluated for relationships to genes in seed regions, and query regions are then assigned a significance score. In most applications, where we examine a set of regions for relationships between implicated genes, the query regions and the seed regions are identical.
In other circumstances where we have a set of putative regions that are being tested against validated ones, the putative regions are defined as query regions, and the validated ones are defined as seed regions.
Step 1. Defining disease regions and identifying overlapping genes
For each query and seed SNP we find the furthest neighboring SNPs in the 3′ and 5′ direction in LD (r2>0.5, CEU HapMap). We then proceed outwards in each direction to the nearest recombination hotspot. The interval between those two hotspots, which would include the SNP of interest and all SNPs in LD, is defined as the disease region. The associated SNP could feasibly be tagging a stronger signal from another SNP in that region. All genes that overlap that interval are considered implicated by the SNP. If there are no genes in that region, the interval is extended an additional 250 kb in either direction; we chose 250 kb because that is a range within which non-coding variants might affect gene regulation. For each query and seed CNV we define an interval that represents the deleted or duplicated region; all genes that overlap that interval are associated with the CNV for testing.
Step 2. Ranking gene relatedness
For each gene near a query region, we rank all human genes for relatedness. Ranking may be based on text similarity, or other metrics (see below for examples). Rank values range from 1 (most related) to NG (least related), where NG is the number of available human genes, which in our application is 18,875 (see Table 1).
Step 3. Scoring candidate genes against regions
To avoid double counting nearby regions, we first combine any seed regions sharing one or more genes. For a given gene g in a query region, we examine the degree of similarity to any of the ns genes in a given seed region s. To ensure independence, we only look at a seed region s if it does not share a single gene with the query region that contains gene g.
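The region bookkeeping described at the start of Step 3 (merging seed regions that share genes, then skipping any seed region that shares a gene with the query region) can be sketched in Python as follows. This is a minimal illustration under assumptions, not the GRAIL implementation: the function names, the representation of a region as a set of gene symbols, and the choice to drop gene-free regions are all hypothetical.

```python
from typing import Dict, List, Set


def merge_regions_sharing_genes(regions: Dict[str, Set[str]]) -> List[Set[str]]:
    """Combine regions that share one or more genes (transitively).

    `regions` maps a hypothetical region label to the set of genes it overlaps.
    Regions with no genes are skipped (an assumption; the text says such
    regions are not scored).
    """
    merged: List[Set[str]] = []
    for genes in regions.values():
        if not genes:
            continue
        current = set(genes)
        kept: List[Set[str]] = []
        for group in merged:
            if group & current:
                current |= group      # absorb any group that overlaps
            else:
                kept.append(group)    # leave non-overlapping groups alone
        kept.append(current)
        merged = kept
    return merged


def independent_seed_regions(query_genes: Set[str],
                             seeds: List[Set[str]]) -> List[Set[str]]:
    """Keep only seed regions that share no gene with the query region."""
    return [s for s in seeds if not (s & query_genes)]


if __name__ == "__main__":
    seeds = {"s1": {"A", "B"}, "s2": {"B", "C"}, "s3": {"D"}, "s4": {"C", "E"}}
    merged = merge_regions_sharing_genes(seeds)
    print(sorted(sorted(group) for group in merged))
    print(independent_seed_regions({"D", "X"}, merged))
```

In this toy input, s1, s2 and s4 are chained together through shared genes and collapse into one seed region, while a query region containing gene D is scored only against the merged region that does not contain D.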
We identify in each region s the rank of the most similar (or lowest ranking) gene in it to gene g, Rg,s. We convert the rank to a proportion: $p = R_{g,s} / N_G$. To transform this proportion to a uniformly distributed entity under the null, we recognize that Rg,s was the lowest rank selected from ns genes, and we correct accordingly for multiple hypothesis testing: $p_{g,s} = 1 - (1 - R_{g,s}/N_G)^{n_s}$. Now we identify those seed regions where pg,s is less than a pre-specified threshold pf as regions connected to gene g. For all applications presented here pf is arbitrarily set to 0.1. The number of seed regions containing at least one gene passing this threshold, nhit, can be approximated under a random model with a Poisson distribution. We assign a greater weight to those cases where there is greater similarity; that is in the cases where pg,s is particularly small: Under a random model, if pg,s 0.2. We restrict keywords to those that appear in >500 documents, contain >3 letters, and have no numbers. For each term, i, we calculate a score which is the difference between averaged term frequencies among candidate genes and all genes. The top twenty highest scoring terms are selected as keywords.
Annotation based relatedness
We defined a relatedness metric between genes based on similarity in Gene Ontology annotation terms. We downloaded Gene Ontology structure and annotations on December 19, 2006. In addition to human gene GO annotations, we added orthologous gene annotations. Since GO is a hierarchically structured vocabulary, for each gene annotation we also added all of the more general ancestral terms. This resulted in a total of 843,898 annotations for 18,050 genes with 10,803 unique GO terms; this corresponds to a median of 40 terms per gene. We weighted annotations proportionally to the inverse of their frequency, so common annotations received less emphasis.
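As an aside on Step 3 above, the conversion from a best rank to a region-level p-value and the counting of connected regions can be sketched as follows. The sketch assumes the usual minimum-of-ns correction, 1 − (1 − R/NG)^ns, for the lowest of ns ranks; the function names are hypothetical, and the unweighted Poisson tail with mean pf × (number of seed regions) is a simplification for illustration only, since the method described in the text additionally weights hits by how small pg,s is.

```python
import math
from typing import Dict, List

N_G = 18875   # number of ranked human genes, as stated in the text
P_F = 0.1     # connection threshold pf, as stated in the text


def region_p_value(best_rank: int, n_s: int, n_total: int = N_G) -> float:
    """Probability that the best of n_s random ranks is <= best_rank."""
    proportion = best_rank / n_total
    return 1.0 - (1.0 - proportion) ** n_s


def connected_regions(best_ranks: Dict[str, int],
                      region_sizes: Dict[str, int],
                      p_f: float = P_F) -> List[str]:
    """Seed regions whose corrected p-value falls below the threshold."""
    return [region for region, rank in best_ranks.items()
            if region_p_value(rank, region_sizes[region]) < p_f]


def poisson_tail(n_hit: int, n_seed: int, p_f: float = P_F) -> float:
    """P(X >= n_hit) for X ~ Poisson(p_f * n_seed); unweighted simplification."""
    lam = p_f * n_seed
    below = sum(math.exp(-lam) * lam ** k / math.factorial(k)
                for k in range(n_hit))
    return 1.0 - below


if __name__ == "__main__":
    # The same best rank is a hit in a one-gene region but not in a ten-gene one.
    ranks = {"narrow": 1000, "wide": 1000}
    sizes = {"narrow": 1, "wide": 10}
    print(connected_regions(ranks, sizes))
    print(poisson_tail(n_hit=5, n_seed=20))
```

The toy example shows the intended behaviour of the correction: a gene ranked 1000th counts as a connection to a single-gene region, but the same rank is not convincing when it is the best of ten genes in a large region.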
We used a weighting scheme analogous to the one we used for word weighting: where gij represented the weighted code i for gene j, NG is the total number of genes, and gfi (or GO frequency) is the number of genes annotated with the term i. Gene relatedness was the correlation between these weighted annotation vectors.
Gene expression based relatedness
To calculate gene relatedness based on expression we downloaded the Novartis Gene Expression Atlas. The data set consists of measurements for 33,689 probes across 158 conditions. Probes were averaged into 17,581 gene profiles. Gene relatedness was calculated as the correlation between expression vectors.
Lipid and height applications
We applied GRAIL to score 19 lipid-associated SNPs and separately to score 42 height-associated SNPs. Specific SNPs are listed in Table S1 and Table S2. We used the SNP sets as both the seed and the query set to look for relatedness between genes across regions. We scored SNPs separately using text, annotation, and expression similarity metrics. We compiled the best candidate genes and scores for the SNP regions.
Crohn's disease application
Prior to replication, we had access to 74 independent SNP regions that had emerged from a meta-analysis of Crohn's disease. All 74 SNPs were used as both the query set and the seed set in GRAIL. We assessed whether those SNPs that replicated had different text-based significance values than those that failed to replicate. To identify additional regions of interest, we identified the next 75 most significant regions in the Crohn's disease meta-analysis and used them in GRAIL as a query set; the seed set included all SNPs that did not fail in replication.
Schizophrenia application
We identified singleton deletions or confirmed de novo deletions reported by one of three groups. We selected those deletions that were in cases only or in controls only, were at least 100 kb large, and included at least one gene.
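The selection criteria just stated (case-only or control-only, at least 100 kb, at least one gene) amount to a simple filter, sketched below. The `Deletion` record, its field names, and the helper functions are hypothetical and serve only to make the three criteria explicit; they do not describe the format of the published deletion lists.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

MIN_LENGTH_BP = 100_000  # "at least 100 kb" in the text


@dataclass
class Deletion:
    chrom: str
    start: int                       # base-pair coordinates (hypothetical layout)
    end: int
    genes: Set[str] = field(default_factory=set)
    in_cases: bool = False
    in_controls: bool = False


def is_eligible(d: Deletion) -> bool:
    """Apply the three selection criteria described in the text."""
    exclusive = d.in_cases != d.in_controls          # cases only or controls only
    long_enough = (d.end - d.start) >= MIN_LENGTH_BP
    return exclusive and long_enough and len(d.genes) > 0


def split_case_control(deletions: List[Deletion]) -> Tuple[List[Deletion], List[Deletion]]:
    """Return (case-only, control-only) deletions passing the filter."""
    eligible = [d for d in deletions if is_eligible(d)]
    cases = [d for d in eligible if d.in_cases]
    controls = [d for d in eligible if d.in_controls]
    return cases, controls
```

Each of the two resulting lists would then be passed to GRAIL separately, with the same set serving as both query and seed regions.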
We obtained singleton deletions online published by the International Schizophrenia Consortium (2008) at . We obtained de novo deletions published by Xu et al (2008) from Table 1. We obtained singleton deletions published in Walsh et al (2008) from Table 2. We identified a total of 165 case-only deletions and 122 control-only deletions. We applied the GRAIL algorithm separately to cases and controls. We speculated that the case deletions might hit genes from a common pathway and that GRAIL p-values may therefore be enriched for significant scores. On the other hand, we hypothesized that control deletions might be located effectively at random, and so no particular pathway or common function should necessarily be enriched in this collection. To examine genes for tissue-specific expression in the CNS, we obtained a large publicly available human tissue expression microarray panel (GEO accession: GSE7307). We analyzed the data using the robust multi-array (RMA) method for background correction, normalization and polishing. We filtered the data, excluding probes with either 100% ‘absent’ calls (MAS5.0 algorithm) across tissues, expression values <20 in all samples, or an expression range <100 across all tissues. To represent each gene, we selected the corresponding probe with the greatest intensity across all samples. The data contained expression profiles for 19,088 genes. We included expression profiles from 96 normal tissues and excluded disease tissues and treated cell lines. We averaged expression values from replicated tissues into a single value. To assess whether genes had differential expression for CNS tissues, we compared the 27 tissue profiles that represented brain or spinal cord to the remaining 69 tissue profiles with a one-tailed Mann-Whitney rank-sum test. Genes obtaining p<0.01 were identified as preferentially expressed.
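The CNS test described above can be illustrated with a small, dependency-free sketch of a one-tailed Mann-Whitney rank-sum test. It uses midranks for ties and a normal approximation with continuity correction but no tie correction to the variance; these are simplifying assumptions for illustration, and the function names are hypothetical rather than taken from the authors' analysis code.

```python
import math
from typing import Sequence


def mann_whitney_greater(x: Sequence[float], y: Sequence[float]) -> float:
    """One-tailed p-value (normal approximation) that x tends to exceed y."""
    n1, n2 = len(x), len(y)
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        midrank = (i + j) / 2 + 1          # average rank for a run of ties
        for k in range(i, j + 1):
            ranks[k] = midrank
        i = j + 1
    rank_sum_x = sum(r for r, (_, grp) in zip(ranks, pooled) if grp == 0)
    u = rank_sum_x - n1 * (n1 + 1) / 2     # U statistic for sample x
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean - 0.5) / sd              # continuity-corrected z score
    return 0.5 * math.erfc(z / math.sqrt(2))


def is_cns_preferential(cns_values: Sequence[float],
                        other_values: Sequence[float],
                        alpha: float = 0.01) -> bool:
    """Flag a gene whose CNS expression is higher at the p<0.01 level."""
    return mann_whitney_greater(cns_values, other_values) < alpha


if __name__ == "__main__":
    cns = [100.0 + i for i in range(27)]    # toy values for 27 CNS profiles
    other = [float(i) for i in range(69)]   # toy values for 69 other profiles
    print(is_cns_preferential(cns, other))
```

For real data a vetted statistics library would be preferable, since exact p-values and tie corrections matter when sample sizes are small.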
Evaluation against other published methods
We compared GRAIL's ability to prospectively predict Crohn's associations to that of four other published methods. The selection of these methods and the evaluation are detailed in Text S1.
Software
An online version of this method is available (http://www.broad.mit.edu/mpg/grail/).
Supporting Information
Figure S1 GRAIL p-value scores for random SNPs. We scored 100 random groups of 50 SNPs with GRAIL. The y-axis is the fraction of SNPs in the group with values below the threshold; the x-axis lists the specific threshold. For each threshold, we plot the distribution of the fraction of the 50 SNPs below that threshold as a box plot. The bar is the median; the mean value is listed explicitly below the box plot. The box at each threshold spans the 25%–75% range. The error bars depict 1.5 times the inter-quartile range. The black dots illustrate outliers outside 1.5 times the inter-quartile range. (0.39 MB PDF)
Figure S2 Sensitivity versus specificity for prioritization algorithms. We used 5 algorithms to score the 74 most promising putative SNP associations from the Crohn's meta-analysis study. We assessed each algorithm's ability to predict those SNP associations that ultimately validated in follow-up genotyping. For each algorithm, we created a receiver operating characteristic (ROC) curve. (0.40 MB PDF)
Table S1 19 Lipid regions scored with Text based GRAIL strategy. Here we scored 19 SNPs associated with lipid metabolism. In the first three columns we list information about the SNP. In the fourth column we list the number of genes in the SNP associated regions. In the fifth column we list the highest scoring gene in the associated region based on GRAIL using a text-based metric. In the sixth column we list the ptext values for the associated regions. We have bolded those candidate genes that are the known likely causative genes.
The seventh and eighth columns list similar results for GRAIL with a GO annotation-based metric. The ninth and tenth columns list similar results for GRAIL with an expression-based metric. (0.15 MB DOC)
Table S2 42 Height regions scored with Text based GRAIL strategy. Here we scored 42 SNPs associated with height. In the first three columns we list information about the SNP. In the fourth column we list the number of genes in the SNP associated regions. In the fifth column we list the highest scoring gene in the associated region for the SNP based on GRAIL using a text-based metric. In the sixth column we list the ptext values for the associated regions. The seventh and eighth columns list similar results for GRAIL with an annotation-based metric. The ninth and tenth columns list similar results for GRAIL with an expression-based metric. (0.28 MB DOC)
Table S3 Keywords for Lipid and Height SNPs. We identified keywords associated with lipid and height associated SNPs; here we list the top 20. (0.06 MB DOC)
Table S4 Crohn's Disease SNPs from a meta-analysis of GWA studies. Here we list GRAIL results and summarize genotyping results for Crohn's disease SNPs. These 74 SNPs emerged from a meta-analysis and, as a result of replication genotyping, were either validated (A), indeterminate (B), or failed (C). For each of the regions we list the SNP ID and the chromosome in the second and third columns. In the fourth column we list the final combined association significance score of the SNP to Crohn's disease. In the fifth, sixth, and seventh columns we list GRAIL results including the number of genes in the region, the best candidate gene, and the text-based significance score for the region. (0.21 MB DOC)
Table S5 Algorithms to prioritize candidate genes.
Our search of the literature identified nine algorithms that could be used to prioritize genes for replication. Four methods require no user-specified disease information (unsupervised), and five require some disease information from the user. We list in each row the name of the method, the website, the necessary genetic data, the functional data used to prioritize genes, the disease-specific information that must be included, and the availability of the method. (0.09 MB DOC)
Table S6 Performance measures for prioritization algorithms. We used five algorithms (column 1) to score putatively associated SNPs from the Crohn's meta-analysis. After calculating an ROC curve for each algorithm, we calculated the AUC (column 2). We also calculated a p-value with a one-tailed rank-sum test comparing the median rank of the validated SNPs to the median rank of the failed SNPs (column 3). (0.04 MB DOC)
Table S7 Other promising regions in Crohn's Disease GWA meta-analysis. Information about the top six regions identified by GRAIL from the next 75 most significant regions from the Crohn's GWA study. All associations are indeterminate, and association p-values are taken from the GWA meta-analysis; these regions have not yet been replicated. (0.05 MB DOC)
Table S8 Rare or de novo schizophrenia case deletions. Here we list all of the deletions that GRAIL identified as most related to other deleted genes (ptext <0.05). For each deletion we list the chromosome, the range of the deletion, the GRAIL p-value for the region, and the best candidate gene in the region identified by GRAIL. Most genomic coordinates are listed in HG17. * HG18 coordinates. (0.06 MB DOC)
Text S1 A. Random SNP groups; B. Comparison of GRAIL to other related algorithms. (0.09 MB DOC)