A corpus of full-text journal articles is a robust evaluation tool for revealing differences in performance of biomedical natural language processing tools

Abstract

Background

We introduce the linguistic annotation of a corpus of 97 full-text biomedical publications, known as the Colorado Richly Annotated Full Text (CRAFT) corpus. We further assess the performance of existing tools for performing sentence splitting, tokenization, syntactic parsing, and named entity recognition on this corpus.

Results

Many biomedical natural language processing systems demonstrated large differences between their previously published results and their performance on the CRAFT corpus when tested with the publicly available models or rule sets. Trainable systems differed widely with respect to their ability to build high-performing models based on this data.

Conclusions

The finding that some systems were able to train high-performing models based on this corpus is additional evidence, beyond high inter-annotator agreement, that the quality of the CRAFT corpus is high. The overall poor performance of various systems indicates that considerable work needs to be done to enable natural language processing systems to work well when the input is full-text journal articles. The CRAFT corpus provides a valuable resource to the biomedical natural language processing community for evaluation and training of new models for biomedical full-text publications.

Most cited references 16

The Sequence Ontology: a tool for the unification of genome annotations

Karen Eilbeck, Suzanna E. Lewis, Christopher Mungall … (2005)

Background

Why a sequence ontology is needed

Genomic annotations are the focal point of sequencing, bioinformatics analysis, and molecular biology. They are the means by which we attach what we know about a genome to its sequence. Unfortunately, biological terminology is notoriously ambiguous; the same word is often used to describe more than one thing and there are many dialects. For example, does a coding sequence (CDS) contain the stop codon or is the stop codon part of the 3'-untranslated region (3' UTR)? There really is no right or wrong answer to such questions, but consistency is crucial when attempting to compare annotations from different sources, or even when comparing annotations performed by the same group over an extended period of time. At present, GenBank [1] houses 220 viral genomes, 152 bacterial genomes, 20 eukaryotic genomes and 18 archaeal genomes. Other centers such as The Institute for Genomic Research (TIGR) [2] and the Joint Genome Institute (JGI) [3] also maintain and distribute annotations, as do many model organism databases such as FlyBase [4], WormBase [5], The Arabidopsis Information Resource (TAIR) [6] and the Saccharomyces Genome Database (SGD) [7]. Each of these groups has its own databases and many use their own data model to describe their annotations. There is no single place at which all sets of genome annotations can be found, and several sets are informally mirrored in multiple locations, leading to location-specific version differences. This can make it hazardous to exchange, combine and compare annotation data. Clearly, if genomic annotations were always described using the same language, then comparative analysis of the wealth of information distributed by these institutions would be enormously simplified; hence the Sequence Ontology (SO) project.
SO began 2 years ago, when a group of scientists and developers from the model organism databases - FlyBase, WormBase, Ensembl, SGD and MGI - came together to collect and unify the terms they used in their sequence annotation. The goal of the SO is to provide a standardized set of terms and relationships with which to describe genomic annotations and provide the structure necessary for automated reasoning over their contents, thereby facilitating data exchange and comparative analyses of annotations. SO is a sister project to the Gene Ontology (GO) [8] and is part of the Open Biomedical Ontologies (OBO) project [9]. The scope of the SO project is the description of the features and properties of biological sequence. The features can be located in base coordinates, such as gene and intron, and the properties of these features describe an attribute of the feature; for example, a gene may be maternally_imprinted.

SO terminology and format

Like other ontologies, SO consists of a controlled vocabulary of terms or concepts and a restricted set of relationships between those terms. While the concepts and relationships of the sequence ontology make it possible to describe precisely the features of a genomic annotation, discussions of them can lead to much lexical confusion, as some of the terms used by SO are also common words; thus we begin our description of SO with a discussion of its naming conventions, and adhere to these rules throughout this document. Wherever possible, the terms used by SO to describe the parts of an annotation are those commonly used in the genomics community. In some cases, however, we have altered these terms in order to render them more computer-friendly so that users can create software classes and variables named after them. Thus, term names do not include spaces; instead, underscores are used to separate the words in phrases.
Numbers are spelled out in full, for example five_prime_UTR, except in cases where the number is part of the accepted name. If the commonly used name begins with a number, such as 28S RNA, the stem is moved to the front - for example, RNA_28S. Symbols are spelled out in full where appropriate, for example, prime, plus, minus; as are Greek letters. Periods, points, slashes, hyphens, and brackets are not allowed. If there is a common abbreviation it is used as the term name, and case is always lower except when the term is an acronym, for example, UTR and CDS. Where there are differences in the accepted spelling between English and US usage, the US form is used. Synonyms are used to record the variant term names that have the same meaning as the term. They are used to facilitate searching of the ontology. There is no limit to the number of synonyms a term can have, nor do they adhere to SO naming conventions. They are, however, still lowercase except when they are acronyms. Throughout the remainder of this document, the terms from SO are highlighted in italics and the names of relationships between the terms are shown in bold. The terms are always depicted exactly as they appear in the ontology. The names of EM operators are underlined.

SO, SOFA, and the feature table

To facilitate the use of SO for the markup of gene annotation data, a subset of terms from SO consisting of some of those terms that can be located onto sequence has been selected; this condensed version of SO is especially well suited for labeling the outputs of automated or semi-automated sequence annotation pipelines. This subset is known as the Sequence Ontology Feature Annotation, or SOFA. SO, like GO, is an 'open source' ontology. New terms, definitions, and their location within the ontology are proposed, debated, and approved or rejected by an open group of individuals via a mailing list.
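The surface rules of the naming conventions described above (no spaces, periods, slashes, hyphens, or brackets; words joined by underscores; symbols such as the prime mark spelled out) can be checked mechanically. The following is a minimal sketch, not part of SO or any SO tooling: the function names `looks_like_so_name` and `forbidden_characters` are hypothetical, and the check covers only character-level rules, not case rules or the spelling-out of numbers.

```python
import re

# Hypothetical helper: letters and digits, with words joined by single underscores.
_SO_STYLE = re.compile(r"[A-Za-z0-9]+(?:_[A-Za-z0-9]+)*")

# Characters the text says are not allowed, plus the space and the prime mark
# (which the conventions say should be replaced by "_" and "prime").
_FORBIDDEN = " ./-()[]{}'"


def looks_like_so_name(name: str) -> bool:
    """Return True if `name` uses only letters, digits, and single underscores."""
    return _SO_STYLE.fullmatch(name) is not None


def forbidden_characters(name: str) -> list:
    """List the disallowed characters found in `name`, in order of first appearance."""
    seen = []
    for ch in name:
        if ch in _FORBIDDEN and ch not in seen:
            seen.append(ch)
    return seen


if __name__ == "__main__":
    for candidate in ["five_prime_UTR", "RNA_28S", "3'-UTR", "transposable element"]:
        print(candidate, looks_like_so_name(candidate), forbidden_characters(candidate))
```

A check like this only flags names that cannot be SO-style; passing it does not mean the name is an actual term in the ontology.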
SO is maintained in OBO format and the current version can be downloaded from the CVS repository of the SO website [10]. For development purposes, SOFA was stabilized and released (in May 2004) for at least 12 months to allow development of software and formats. SO is a directed acyclic graph (DAG), and can be viewed using the editor for OBO files, OBO-Edit [11]. The terms describing sequence features in SO and SOFA are richer than those of the Feature Table [12] of the three large genome databanks: GenBank [1], EMBL [13] and the DNA Data Bank of Japan (DDBJ) [14]. The Feature Table is a controlled vocabulary of terms describing sequence features and is used to describe the annotations distributed by these data banks. The Feature Table does provide a grouping of its terms for annotation purposes, based on the degree of specificity of the term. The relationships between the terms are not formalized; thus the interpretation of these relationships is left to the user to infer, and, more critically, must be hard-coded into software applications. Most of the terms in the Feature Table map directly to terms in SO, although the term names may have been changed to fit SO naming conventions. In general, SO contains a more extensive set of features for detailed annotation. There are currently 171 locatable sequence features in SOFA compared to 65 in the Feature Table. There are 11 terms in the Feature Table that are not included in SO. These terms fall into two categories: remarks and immunological features, both of which have been handled slightly differently in SO. A mapping between SO and the Feature Table is available from the SO website [10].

Database schemas, file formats and SO

SO is not a database schema, nor is it a file format; it is an ontology. As such, SO transcends any particular database schema or file format. This means it can be used equally well as an external data-exchange format or internally as an integral component of a database.
The simplest way to use SO is to label data destined for redistribution with SO terms and to make sure that the data adhere to the SO definition of the data type. Accordingly, SO provides a human-readable definition for each term that concisely states its biological meaning. Usually the definitions are drawn from standard authoritative sources such as The Molecular Biology of the Cell [15], and each definition contains a reference to its source. Defining each term in such a way is important as it aids communication and minimizes confusion and disputes as to just what data should consist of. For example, the term CDS is defined as a contiguous RNA sequence which begins with, and includes, a start codon and ends with, and includes, a stop codon. According to SO, the sequence of a three_prime_UTR does not contain the stop_codon - and files with such sequences are SO-compliant; files of three_prime_UTR containing stop_codons are not. This is a trivial example, illustrating one of the simplest use cases, but it does demonstrate the power of SO to put an end to needless negotiations between parties as to the details of a data exchange. This aspect of SO is especially well suited for use with the generic feature format (GFF) [16]. Indeed, the latest version, GFF3, uses SO terms and definitions to standardize the feature type described in each row of a file and SO terms as optional attributes to a feature. SO can also be employed in a much more sophisticated manner within a database. CHADO [17] is a modular relational database schema for integrating molecular and genetic data and is part of the Generic Model Organism Database project (GMOD) [18], currently used by both FlyBase and TIGR. The CHADO relational schema is extremely flexible, and is centered on genomic features and their relationships, both of which are described using SO terms.
This use of SO ensures that software that queries, populates and exports data from different CHADO databases is interoperable, and thus greatly facilitates large-scale comparisons of even very complex genomics data. Like GFF3, Chaos-XML [19] is a file format that uses SO to label and structure data, but it is more intimately tied to the CHADO project than is GFF3. Chaos-XML is a hierarchical XML mapping of the CHADO relational schema. Annotations are represented as an ontology-typed feature graph. The central concept of Chaos-XML is the sequence-feature, which is any sequence entity typed by SO. The features are interconnected via feature relationship elements, whereby each relationship connects a subject feature and an object feature. Features are located via featureloc elements which use interbase (zero-based) coordinates. Chaos-XML and CHADO are richer models than GFF3 in that feature_relationships are typed, and a more sophisticated location model is used. Chaos-XML is the substrate of a suite of programs called Comparative Genomics Library (CGL), pronounced 'seagull' [20], which we have used for the analyses presented in our Results section. The basic types in SOFA, from which other types are defined, are region and junction, equivalent to the concepts of interiors and boundaries defined in the field of topological relationships [21]. A region is a length of sequence such as an exon or a transposable_element. A junction is the space between two bases, such as an insertion_site. Building on these basic data types, SOFA can be used to describe a wide range of sequence features. Raw sequence features such as assembly components are captured by terms like contig and read. Analysis features, defined by the results of sequence-analysis programs such as BLAST [22] are captured by terms such as nucleotide_match. Gene models can be defined on the sequence using terms like gene, exon and CDS. Variation in sequence is captured by subtypes of the term sequence_variant. 
These terms have multiple parentages with either region or junction. SOFA (and SO) can also be used to describe many other sequence features, for example, repeat, reagent, remark. Thus, SOFA together with GFF3 or Chaos-XML provides an easy means by which parties can describe, standardize, and document the data they distribute and exchange. The SO and SOFA controlled vocabularies can be used for de novo annotation. Several groups including SGD and FlyBase now use either SO or SOFA terms in their annotation efforts. SO is not restricted to new annotations, however, and may be applied to existing annotations. For example, annotations from GenBank may be converted into SO-compliant formats using Bioperl [23] (see Materials and methods).

SO relationships

One essential difference between a controlled vocabulary, such as the Feature Table, and an ontology is that an ontology is not merely a collection of predefined terms that are used to describe data. Ontologies also formally specify the relationships between their terms. Labeling data with terms from an ontology makes the data a substrate for software capable of logical inference. The information necessary for making logical inferences about data resides in the class designations of the relationships that unite terms within SO. We detail this aspect of the ontology below. For purposes of reference, a section of SO illustrating the various relationships between some of its terms is shown in Figure 1. Currently, SO uses three basic kinds of relationship between its terms: kind_of, derives_from, and part_of. These relationships are defined in the OBO relationship types ontology [24]. kind_of relationships specify what something 'is'. For example, an mRNA is a kind_of transcript. Likewise an enhancer is a kind_of regulatory_region. kind_of relationships are valid in only one direction. Hence, a regulatory_region is not a kind_of enhancer.
One consequence of the directional nature of kind_of relationships is that their transitivity is hierarchical - inferences as to what something 'is' proceed from the leaves towards the root of the ontology. For example, an mRNA is a kind_of processed_transcript AND a processed_transcript is a kind_of transcript. Thus, an mRNA is a kind_of transcript. kind_of relationships are synonymous with is_a relationships. We adopted the 'kind_of' notation to avoid the lexical confusion often encountered when describing relationships, as the phrase 'is a' is often used in conjunction with other relationships in English - for example 'is a part_of'. SO uses the term derives_from to denote relationships of process between two terms. For example, an EST derives_from an mRNA. derives_from relationships imply an inverse relationship, derives. Note that although a polypeptide derives_from an mRNA, a polypeptide cannot be derived from an ncRNA (non-coding RNA), because no derives_from relationship unites these two terms in the ontology. This fact illustrates another important aspect of how SO handles relationships: children always inherit from parents but never from siblings. An ncRNA is a kind_of transcript as is an mRNA. Labeling something as a transcript implies that it could possibly produce a polypeptide; labeling that same entity with the more specific term ncRNA rules that possibility out. Thus, a file that contained ncRNAs and their polypeptides would be semantically invalid. part_of relationships pertain to meronomies; that is to say 'part-whole' relationships. An exon, for example, is a part_of a transcript. part_of relationships are not valid in both directions. In other words, while an exon is a part_of a transcript, a transcript is not a part_of an exon. Instead, we say a transcript has_part exon. SO does not explicitly denote whole-part relationships, as every part_of relationship logically implies the inverse has_part relationship between the two terms.
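The directional, transitive behavior of kind_of and the implied has_part inverse described above can be illustrated with a small graph walk. This is a minimal sketch under stated assumptions: the dictionaries hold only relationships named in the text, and the helper names (`ancestors`, `is_kind_of`, `has_part`) are hypothetical rather than taken from any SO software.

```python
from collections import defaultdict

# Toy fragment of the ontology, restricted to relationships named in the text.
KIND_OF = {
    "mRNA": ["processed_transcript"],
    "processed_transcript": ["transcript"],
    "ncRNA": ["transcript"],
    "enhancer": ["regulatory_region"],
}
PART_OF = {"exon": ["transcript"], "transcript": ["gene"]}


def ancestors(term, edges):
    """Return every term reachable from `term` by following `edges` leaf-to-root."""
    found, stack = set(), list(edges.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(edges.get(parent, []))
    return found


def is_kind_of(child, parent):
    """True if `child` is transitively a kind_of `parent` (never the reverse)."""
    return parent in ancestors(child, KIND_OF)


def has_part(part_of_edges):
    """Derive the inverse has_part map that every part_of edge implies."""
    inverse = defaultdict(list)
    for part, wholes in part_of_edges.items():
        for whole in wholes:
            inverse[whole].append(part)
    return dict(inverse)


if __name__ == "__main__":
    print(is_kind_of("mRNA", "transcript"))            # inferred via processed_transcript
    print(is_kind_of("regulatory_region", "enhancer"))  # direction matters
    print(has_part(PART_OF))
```

Note that this walk treats every part_of chain as transitive, which holds only when all the links belong to the same subclass of part-whole relationship.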
Transitivity is a more complicated issue with regard to part-whole relationships than it is for the other relationships in SO. In general, part_of relationships are transitive - an exon is a part_of a gene, because an exon is a part_of a transcript, and a transcript is a part_of a gene. Not every chain of part-whole relationships, however, obeys the principle of transitivity. This is because parts can be combined to make wholes according to different organizing principles. Winston et al. [25] have described six different subclasses of the part-whole relationship, based on the following three properties: configuration, whether the parts have a structural or functional role with respect to one another or the whole they form; substance, whether the part is made of the same stuff as the whole (homomerous or heteromerous); and invariance, whether the part can be separated from the whole. These six relations and their associated part_of subclasses are detailed in Table 1. Winston et al. [25] argue that there is transitivity across a series of part_of relationships only if they all belong to the same subclass. In other words, an exon can only be part_of a gene if an exon is a component_part_of a transcript, and a transcript is component_part_of a gene. If, however, the two statements contain different types of part_of relationship, then transitivity does not hold. By addressing the vague English term 'part of' in this way, Winston et al. solve many of the problems associated with reasoning across part_of relationships; thus, we are adopting their approach with SO. The parts contained in the sequence ontology are mostly of the type component_part_of such as exon is a part_of transcript, although there are a few occurrences of member_part_of such as read is a part_of contig.

SO's relationships facilitate software design and bioinformatics research

Genomic annotations are substrates for a multitude of software applications.
Annotations, for example, are rendered by graphical viewers, or, as another example, their features are searched and queried for purposes of data validation and genomics research. Using an ontology for sequence annotation purposes offers many advantages over the traditional Feature Table approach. Because controlled vocabularies do not specify the relationships that obtain between their terms, using the Feature Table has meant that relationships between features have had to be hard-coded in software applications themselves; consequently, adding a new term to the Feature Table and/or changing the details of the relationships that obtain between its terms has meant revising every software application that made use of the Feature Table. Ontologies mitigate this problem as all of the knowledge about terms and their relationships to one another is contained in the ontology, not the software. SO-compliant software need only be provided with an updated version of the ontology, and everything else will follow automatically. This is because SO-compliant software need not hard-code the fact that a tRNA is a kind_of transcript; it need merely know that kind_of relationships are transitive and hierarchical and be capable of internally navigating the network of relationships specified by the ontology (see Figure 1) in order to logically infer this fact. This means that every time a new form of ncRNA is discovered, and added to SO, all SO-compliant software applications will automatically be able to infer that any data labeled with that new term is a kind_of transcript. This means that existing graphical viewers will render those data with the appropriate transcript glyph, and validation and query tools will automatically deal with this new data-type in a coherent fashion. Placing the biological knowledge in the ontology rather than in the software means that the ontology and the software that uses it can be developed, revised, and extended independently of one another. 
Thus ontologies offer the bioinformatics programming community significant opportunities as regards software design and the speed of the development cycle. Using an ontology does, however, mean that software applications must meet certain professional standards; namely, they must be capable of parsing an OBO file and navigating the network of relationships that constitute the ontology, but these are minimal hurdles. SO facilitates bioinformatics research in ways that reach far beyond its utility as regards software design. For example, SO's kind_of relationships provide a subsumption hierarchy, or classification system for its terms. This added depth of knowledge greatly improves the searching and querying capabilities of software using SO. The ontology's higher-level terms may be used to query via inference, even if they are never used for annotation. We recommend that annotators label their data using terms corresponding to terminal nodes in the ontology. Transcripts, for example, might be annotated using terms such as mRNA, tRNA, and rRNA (see Figure 1). Note that doing so means that if, for example, non-coding RNA sequences are required for some subsequent analysis, then SO-compliant software tools can locate annotations labelled with the subtypes of ncRNA, and retrieve tRNAs and rRNAs to the exclusion of mRNAs, even though these data have not been explicitly labelled with the term ncRNA. Thus, many analyses become easy, for example, how many ncRNAs are annotated in H. sapiens? Of these what percent have more than one exon? Are any maternally imprinted? Moreover, using SO as part of a database schema ensures that such questions 'mean' the same thing in different databases. SO also greatly facilitates the automatic validation of annotation data, as the relationships implied by an annotation can be compared to the allowable relationships specified in the ontology. 
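The two requirements mentioned above, parsing an OBO file and navigating its relationships in order to query by a higher-level term, can be sketched in a few lines. This is a simplified illustration, not a full OBO parser: it reads only `[Term]` stanzas and the `id`, `name`, and `is_a` tags (the OBO tag that corresponds to kind_of), the identifiers `EX:1` to `EX:5` are placeholders rather than real SO accessions, and the function names are hypothetical.

```python
OBO_TEXT = """\
[Term]
id: EX:1
name: transcript

[Term]
id: EX:2
name: mRNA
is_a: EX:1 ! transcript

[Term]
id: EX:3
name: ncRNA
is_a: EX:1 ! transcript

[Term]
id: EX:4
name: tRNA
is_a: EX:3 ! ncRNA

[Term]
id: EX:5
name: rRNA
is_a: EX:3 ! ncRNA
"""


def parse_terms(text):
    """Parse [Term] stanzas into {id: {"name": str, "is_a": [parent ids]}}."""
    terms, current = {}, None
    for raw in text.splitlines():
        line = raw.split("!", 1)[0].strip()  # drop the trailing "! comment"
        if line == "[Term]":
            current = {"name": None, "is_a": []}
        elif line.startswith("["):
            current = None  # other stanza types are ignored in this sketch
        elif current is not None and ": " in line:
            tag, value = line.split(": ", 1)
            if tag == "id":
                terms[value] = current
            elif tag == "name":
                current["name"] = value
            elif tag == "is_a":
                current["is_a"].append(value)
    return terms


def subtypes(terms, root_name):
    """Sorted names of all terms that are transitively is_a `root_name`."""
    root_ids = {tid for tid, t in terms.items() if t["name"] == root_name}
    found, changed = set(), True
    while changed:
        changed = False
        for tid, term in terms.items():
            if tid not in found and any(p in root_ids or p in found for p in term["is_a"]):
                found.add(tid)
                changed = True
    return sorted(terms[tid]["name"] for tid in found)


TERMS = parse_terms(OBO_TEXT)
# Annotations labelled only with terminal terms; "ncRNA" itself is never used.
ANNOTATIONS = [("a1", "mRNA"), ("a2", "tRNA"), ("a3", "rRNA")]
NCRNA_HITS = [aid for aid, so_type in ANNOTATIONS if so_type in subtypes(TERMS, "ncRNA")]

if __name__ == "__main__":
    print(subtypes(TERMS, "ncRNA"))
    print(NCRNA_HITS)
```

Because the hierarchy lives in the data rather than in the code, adding a new stanza under ncRNA would change the query result without any change to the functions.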
For example, an annotation that asserts an intron to be part_of an mRNA would be invalid, as this relationship is not specified in the ontology (Figure 1). On the other hand, an annotation that asserted that a UTR sequence was part_of mRNA would be valid (Figure 1). This makes possible better quality control of annotation data, and makes it possible to check existing annotations for such errors when converting them to a SO-compliant format such as GFF3. To summarize, by identifying the set of relationships between terms that are possible, we are also specifying the inferences that can be drawn from these relationships: that is, the software operations that can be carried out over the data. As a consequence, software is easier to maintain, SO can easily be extended to embrace new biological knowledge, quality controls can be readily implemented, and software to mine data can be written so as to be very flexible.

EM operators and SO

SO also enables some modes of analyses of genomics data that are completely new to the field. One such class of analyses involves the use of extensional mereology (EM) operators to ask questions about gene parts. Although new to genomics, EM operators are well known in the field of ontology, where they provide a basis for asking and answering questions pertaining to how parts are distributed within and among different wholes (reviewed in [26,27]). These operators are usually applied to studies of how parts are shared between complex wholes - such as different models of automobiles or personal computers - for the purpose of optimizing manufacturing procedures. Below we explain how these same operators can be applied to the analyses of genomics data. Although these operators, difference and overlap, share the same name as topological operators, they are different as they function on the parts of an object, not on its geometric coordinate space.
The topological operators, regarding the coincidence of edges and interiors - equality, overlap, disjointedness, containment and coverage of spatial analysis [21] - may also be applied to biological sequence. EM is a formal theory of parts: it defines the properties of the part_of relationship and then provides a set of operations (Table 2) that can be applied to those parts. These operators are akin to those of set theory, but whereas set theory makes use of an object's kind_of relationships, EM operators function on an object's part_of relationships. Only wholes and their 'proper parts' are legitimate substrates for EM operations. Proper parts are those parts that satisfy three self-evident criteria: first, nothing is a proper part of itself (a proper part is part of but not identical to the individual or whole); second, if A is a proper part of B then B is not a part of A; third, if A is a part of B and B is a part of C then A is a part of C. Note that the third criterion of proper parts is that they obey the rule of transitivity. As we discussed earlier, not all part_of relationships are transitive. Accordingly, we have restricted our analyses (see Results and discussion) to component parts (Table 2). Figure 2 illustrates the effects of applying EM operations to analyze the relationships 'transcript is a part_of gene' and 'exon is a part_of transcript'. The EM operations overlap and disjoint pertain to relationships between transcripts, whereas difference and binary product pertain to exons. Two transcripts overlap if they share one or more exons in common. Two transcripts are disjoint if they do not share any exons in common. The exons shared between two overlapping transcripts are the binary product of the two transcripts, and the exons not shared in common comprise the difference between the two transcripts. The binary sum of two transcripts is simply the sum of their parts.
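Since the EM operations just described act on the parts of a transcript, they can be illustrated by treating each transcript as a set of exon identifiers. The sketch below is an illustration of the definitions in the text, not the authors' CGL implementation; the exon identifiers are made up, and `difference(a, b)` is taken here to mean the exons of `a` that are not in `b`, which is an assumption about direction that the text does not spell out.

```python
# Each transcript is modelled as a set of exon identifiers (hypothetical IDs).
T1 = frozenset({"exon1", "exon2", "exon3"})
T2 = frozenset({"exon2", "exon3", "exon4"})
T3 = frozenset({"exon5"})


def overlap(a, b):
    """Two transcripts overlap if they share at least one exon."""
    return bool(a & b)


def disjoint(a, b):
    """Two transcripts are disjoint if they share no exons."""
    return not (a & b)


def binary_product(a, b):
    """Exons shared by both transcripts."""
    return a & b


def difference(a, b):
    """Exons of `a` that are not in `b` (direction is an assumption of this sketch)."""
    return a - b


def binary_sum(a, b):
    """All exons found in either transcript."""
    return a | b


if __name__ == "__main__":
    print(overlap(T1, T2), disjoint(T1, T3))
    print(sorted(binary_product(T1, T2)))
    print(sorted(difference(T1, T2)), sorted(difference(T2, T1)))
    print(sorted(binary_sum(T1, T2)))
```

All five operations compare identifiers only; nothing in the sketch looks at where an exon sits on the chromosome.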
One key feature of EM operations is that they operate in 'identifier space' rather than 'coordinate space'. Two transcripts overlap only if they share a part in common rather than if their genomic coordinates overlap. Thus, two transcripts may be disjoint even if their exons partially overlap one another. This is one way in which EM analyses differ from standard bioinformatics analyses, and it has some interesting repercussions. This is particularly so with regard to modes of alternative splicing, as each of the EM operations suggests a distinct category by means of which two alternatively spliced transcripts can be related to one another. We further explore the potential of these operations to classify alternative transcripts and their exons below.

Results and discussion

As part of a pilot project to evaluate the practical utility of SO as a tool for data management and analysis, we have used SO to name and enumerate the parts of every protein-coding annotation in the D. melanogaster genome. Doing so has allowed us to compare annotations with respect to their parts, for example, number of exons, amount of UTR sequence, and so on. These data afford many potential analyses, but as our motivation was primarily to demonstrate the practical utility of SO as a tool for data management, rather than comparative genomics per se, we have focused more on what exon-transcript-gene part-whole relationships have to say about the annotations themselves, than what the annotations have to say about the biology of the genome. Accordingly, we have used EM-operators to characterize the annotations with respect to their parts, especially with regard to alternative splicing. The current version of FlyBase (5 August, 2004) contained 13,539 genes (of which 10,653 have a single transcript and 2,886 are alternatively spliced), 18,735 transcripts and 61,853 exons.

An EM-based scheme for classifying alternatively spliced genes

As we had characterized the parts of the annotations using SO, we were able to employ the EM operators over these parts. This proved to be a natural way to explore the relative complexity of alternative splicing, as the alternatively spliced transcripts have different combinations of parts: that is, exons. We grouped alternatively spliced transcripts into two classes. An alternatively spliced gene will contain overlapping transcripts if at least one of its exons is shared between two of its transcripts, and will have disjoint transcripts if one of its transcripts shares no exons in common with any other transcript of that gene. For the purposes of this analysis, we further classified disjoint transcripts as sequence-disjoint and parts-disjoint. We term two disjoint transcripts sequence-disjoint if none of their exons shares any sequence in common with one another; and parts-disjoint if one or more of their exons overlap on the chromosome but have different exon boundaries. Note that the three operations are pairwise, and thus not mutually exclusive. To see why this is, imagine a gene having three transcripts, A, B, and C. Obviously, transcript A can be disjoint with respect to B, but overlap with respect to C. Thus, we can speak of a gene as having both disjoint and overlapping transcripts. The relative numbers of disjoint and overlapping transcripts in a genome say something about the relative complexity of alternative splicing in that genome. A gene may have any combination of these types of disjoint and overlapping transcripts, so we created a labeling system consisting of the seven possible combinations. We did this by asking three EM-based questions about the relationships between pairs of a gene's transcripts: How many pairs are there of sequence-disjoint transcripts? How many pairs are there of parts-disjoint transcripts? How many pairs are there of overlapping transcripts?
Doing so allowed us to place that gene into one of seven classes with regard to the properties of its alternatively spliced transcripts. We also kept track of the number of times each of the three relationships held true for each pair combination. For example, a gene having two transcripts that are parts-disjoint with respect to one another would be labeled 0:1:0. Keeping track of the number of transcript pairs falling into each class provides an easy means to prioritize them for manual review. These results are summarized in Figure 3. Of the alternatively spliced fly genes, none has a sequence-disjoint transcript, 275 have parts-disjoint transcripts, 2,664 have overlapping transcripts, and 53 have both parts-disjoint and overlapping transcripts. The percentage of D. melanogaster genes in each category is shown in Table 3. Most alternatively spliced genes contain at least one pair of overlapping transcripts. These data also have something to say about the ways in which research and management issues are intertwined with one another with respect to genome annotation, as some aspects of these data are clearly attributable to annotation practice. The lack of any sequence-disjoint transcripts in D. melanogaster, for example, is due to annotation practice; in fact, current FlyBase annotation practices forbid their creation, the reason being that any evidence for such transcripts is evidence for a new gene [28]. This is not true for all genomic annotations. Annotations converted from the genomes division of GenBank to a SO-compliant form were subjected to EM analysis, and inspection of the corresponding gene-centric annotations provided by Entrez Gene [29] revealed examples of genes that fall into each of the seven categories. Some of these annotations are shown in Figure 3. The frequencies of genes that fall into each of the seven classes shown in Table 3 provide a concise summary of genome-wide trends in alternative splicing in the fly.
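The pairwise classification and the x:y:z labels described above (sequence-disjoint : parts-disjoint : overlapping) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: each exon is represented by a hypothetical `(start, end)` pair of half-open coordinates on a single strand of a single chromosome, two exons count as the same part only if both boundaries match, and the function names are made up for this example.

```python
from itertools import combinations


def coords_overlap(exon_a, exon_b):
    """True if two half-open (start, end) intervals share at least one base."""
    return exon_a[0] < exon_b[1] and exon_b[0] < exon_a[1]


def classify_pair(transcript_a, transcript_b):
    """Classify one pair of transcripts, each given as a list of (start, end) exons."""
    if set(transcript_a) & set(transcript_b):
        return "overlapping"        # at least one identical exon is shared
    if any(coords_overlap(a, b) for a in transcript_a for b in transcript_b):
        return "parts_disjoint"     # exons overlap but boundaries differ
    return "sequence_disjoint"      # no sequence in common at all


def gene_label(transcripts):
    """Count the pairs in each class and format them as 'seq:parts:overlap'."""
    counts = {"sequence_disjoint": 0, "parts_disjoint": 0, "overlapping": 0}
    for t_a, t_b in combinations(transcripts, 2):
        counts[classify_pair(t_a, t_b)] += 1
    return "{sequence_disjoint}:{parts_disjoint}:{overlapping}".format(**counts)


# Made-up coordinates for three transcripts of one gene.
A = [(100, 200), (300, 400)]
B = [(100, 250), (300, 450)]
C = [(100, 200), (500, 600)]

if __name__ == "__main__":
    print(gene_label([A, B]))
    print(gene_label([A, B, C]))
```

Each transcript pair lands in exactly one class, but a gene with three or more transcripts can have non-zero counts in several positions of its label, which is why the per-gene categories are not mutually exclusive.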
This EM-based classification schema, when applied to many model organisms, from many original sources, makes very apparent the magnitude of the practical challenges that surround decentralized annotation, and the distribution and redistribution of annotations. Certainly, they highlight the need for data-management tools such as SO to assist the community in enforcing biological constraints and annotation standards. Only then will comparative genomic analyses show their full power. Exons as alternative parts of transcripts EM-operators can also be used to classify the exons of alternatively spliced genes. Exons shared between two transcripts comprise the binary product of the two transcripts; whereas those exons present in only one of the transcripts constitute their difference (see Table 2 and Figure 2 for more information). These basic facts suggest a very simple, three-part classification system. If an exon is the difference between all other transcripts, then it is only in one transcript; we term these UNIQUE exons. If an exon is the difference of some transcripts, and the binary product of others, it is in a fraction of transcripts; we term these SOMETIMES_FOUND exons. And, if an exon is the binary product of all combinations of transcripts, then it must be in all transcripts; we term such exons ALWAYS_FOUND exons. Classifying exons in this way allows us to look more closely at alternative splicing from the exon's perspective. As can be seen from Table 4, despite the low frequency of alternatively spliced genes, a large fraction of their exons are associated with alternatively spliced transcripts - almost 39%. A sizable proportion of SOMETIMES_FOUND and ALWAYS_FOUND exons are coding exons in some of the transcripts and entirely untranslated exons in others. In some cases, this is due to actual biology: some transcripts in D. melanogaster are known to produce more than one protein (see, for example [30]). 
In other cases, this situation appears to be a result of best attempts on the part of annotators to interpret ambiguous supporting evidence; in yet others the supporting data sometimes unambiguously points to patterns of alternative splicing that would seem to produce transcripts destined for nonsense-mediated decay [31]. Whatever the underlying cause, these exons, like the N:0:0 class annotations, should be subjected to further investigation. To investigate these conclusions in more detail, we further examined each exon with respect to its EM-based class and its coding and untranslated portions. These results are shown in Figure 4, and naturally extend the analyses presented in Table 4. First, regardless of exon class, most entirely untranslated exons are 5-prime exons; the lower frequency of 3-prime untranslated exons is perhaps due to nonsense-mediated decay [31], as the presence of splice junctions in a processed transcript downstream of its stop codon is believed to target that transcript for degradation. A second point made clear by the data in Table 4 is that alternatively spliced genes of D. melanogaster are highly enriched for 5-prime untranslated exons compared with single-transcript genes. Most of these exons belong to ALWAYS_FOUND; thus, there seems to be a strong tendency in D. melanogaster for alternative transcripts to begin with a unique 5' UTR region. This fact suggests that alternative transcription in the fly may, in many cases, be a consequence of alternative-promoter usage and perhaps tissue-specific transcription start sites. The high percentage of untranslated 5-prime UNIQUE exons in D. melanogaster may also be a consequence of the large numbers of 5' ESTs that have been sequenced in the fly [32]. Figure 4 also shows that most (> 95%) D. melanogaster ALWAYS_FOUND exons are coding. 
This makes sense, as it seems likely that one reason for an exon's inclusion in every one of a gene's alternative transcripts is that it encodes a portion of the protein essential for its function(s). As with our previous analyses of alternative transcripts, our analyses of alternatively transcribed exons also illustrate the ways in which basic biology and annotation-management issues intersect one another. The fact that most ALWAYS_FOUND exons are entirely coding, for example, may have something important to say about which parts of a protein are essential for its function(s). Whereas the over-abundance of un-translated UNIQUE exons probably has more to say about the resources available to, and the protocols used by, the annotation project than it does about biology. Such considerations make it clear that the evidence used to produce an annotation is an essential part of the annotation. In this regard SO has much to offer, as it provides a rational means by which to manage annotation evidence in the context of gene-parts and the relations between those parts. Conclusion We have sought to provide an introduction to the SO and justify why its use to unify genomic annotations is beneficial to the model organism community. We illustrate some of the ways in which SO can be used to analyze and manage annotations. Relationships are an essential component of SO, and understanding their role within the ontology is a basic prerequisite for using SO in an intelligent fashion. Much of this paper revolves around the part_of relationship because SO is largely a meronomy - a particular kind of ontology concerned with the relationships of parts to wholes. Extensional mereology (EM) is an area that is largely new to bioinformatics for which there are several excellent reference works available [26,27,33], and even a cursory examination of these texts will make it clear that EM has much to offer bioinformatics. 
Using all of the relationships in SO allows us to automatically draw logical conclusions about data that has been labelled with SO terms and thereby provide useful insights into the underlying annotations. We have shown how SO, together with the EM-based operations it enables, can be used to standardize, analyze, and manage genome annotations. Given any standardized set of genome annotations described with SO, these annotations can then be rigorously characterized. For our pilot analyses, we focused on alternatively transcribed genes and their exons, and explored the potential of EM-operators to classify and characterize them. We believe that the results of these analyses support two principal conclusions. First, EM-based classification schemes are simple to implement, and second, they capture important trends in the data and provide a concise, natural, and meaningful overview of annotations in these genomes. One criticism that might be justifiably leveled against the SO- and EM-based analyses presented here is that they are too formal, and that simpler approaches could have accomplished the same ends. As our discussion of part_of relationships made clear, however, reasoning across diverse types of parts is a complicated process; ad-hoc approaches will not suffice where the data are complex. The more formal approach afforded by SO means that analyses can easily be extended beyond the domain of transcripts and exons to include many other gene parts and relationships as well - including evidence. It seems clear that over the next few years both the number and complexity of annotations will increase, especially with regard to the diversity of their parts. Drawing valid conclusions from comparisons of these annotations will prove challenging. That SO has much to offer such analyses is indisputable. SO and SOFA provide the model organism community with a means to unify the semantics of sequence annotation. 
This facilitates communication within a group and between different model organism groups. Adopting SO terminology to type the features and properties of sequence will provide both the group and the community the advantages of a common vocabulary, to use for sharing and querying data and for automated reasoning over large amounts of sequence data. Materials and methods SO and SOFA have been built and are maintained using the ontology-editing tool OBO-Edit. The ontologies are available at [34]. The FlyBase D. melanogaster [35] data was derived from the GadFly [36] relational database and converted to Chaos-XML using the Bio-chaos tools. The features were annotated to the deepest concept in the ontology possible, given the available information. For example, the degree of information in annotations was sufficiently deep to describe the transcript features with the type of RNA such as mRNA, or tRNA. It was therefore possible to restrict the analysis to given types of transcript. CGL tools were used to validate each of the annotations, iterate through the genes and query the features. EM-operators were applied to the part features of genes. Other organism data was derived from the genomes section of GenBank [37]. GenBank flat files were converted to SO-compliant Chaos-XML using the script cx-genbank2chaos.pl (available from [19]) and BioPerl [23]. The BioPerl GenBank parser, Bio::SeqIO::genbank was used to convert GenBank flat files to Bioperl SeqFeature objects. Feature_relationships between these objects were inferred from location information using the Bioperl Bio::SeqFeature::Tools::Unflattener code. GenBank Feature Table types were converted to SO terms using the Bio::SeqFeature::Tools::TypeMapper class, which contains a hardcoded mapping for the subset of the GenBank Feature Table which is currently used in the genomes section of GenBank. The same Perl class was used to type the feature_relationships according to SO relationship types. 
The EM analysis was performed over the Chaos-XML annotations using the CGL suite of modules to iterate over the parts of each gene.
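
As an illustration of what iterating over the parts of each gene can look like, the three-part exon classification described earlier (UNIQUE, SOMETIMES_FOUND, ALWAYS_FOUND) can be sketched as follows. This is a simplified stand-in rather than the CGL code itself: it assumes transcripts are lists of hypothetical (start, end) exon tuples and replaces the binary product and difference operators with a count of how many transcripts contain each exon, which gives the same three classes for genes with two or more transcripts.

```python
def classify_exons(transcripts):
    """Map each exon to UNIQUE, SOMETIMES_FOUND, or ALWAYS_FOUND.

    Intended for alternatively spliced genes (two or more transcripts);
    with a single transcript every exon is reported as ALWAYS_FOUND.
    """
    n_transcripts = len(transcripts)
    # Count the number of transcripts in which each exon appears.
    occurrences = {}
    for transcript in transcripts:
        for exon in set(transcript):
            occurrences[exon] = occurrences.get(exon, 0) + 1

    classes = {}
    for exon, count in occurrences.items():
        if count == n_transcripts:
            classes[exon] = "ALWAYS_FOUND"      # in every transcript
        elif count == 1:
            classes[exon] = "UNIQUE"            # in exactly one transcript
        else:
            classes[exon] = "SOMETIMES_FOUND"   # in some, but not all
    return classes

gene = [
    [(1, 10), (20, 30), (40, 50)],
    [(1, 10), (20, 30)],
    [(1, 10), (60, 70)],
]
exon_classes = classify_exons(gene)
```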

Cited 304 times

Record: found
Abstract: found
Article: not found

Identifying Relationships among Genomic Disease Regions: Predicting Genes at Pathogenic SNP Associations and Rare Deletions

Soumya Raychaudhuri, Robert Plenge, Elizabeth Rossin … (2009)

Introduction An emerging challenge in genomics is the ability to examine multiple disease regions within the human genome, and to recognize a subset of key genes that are involved in a common cellular process or pathway. This is a key task to translate experimentally ascertained disease regions into meaningful understanding about pathogenesis. The importance of this challenge has been highlighted by advances in human genetics that are facilitating the rapid discovery of disease regions in the form of genomic regions around associated SNPs (single nucleotide polymorphisms) [1]–[6] or CNVs (copy number variants) [7]–[10]. These disease regions often overlap multiple genes – though only one is typically relevant to pathogenesis and the remaining are spuriously implicated by proximity. The difficulty of this task is heightened by the limited state of cataloged interactions, pathways, and functions for the vast majority of genes. However, undefined gene relationships might often be conjectured from the literature, even if they are not explicitly described yet. The general strategy of using function to prioritize genes in disease regions has been substantially explored [11]–[18]. However, predicted disease genes have not, in general, been easily validated. Thus far, published approaches have utilized a range of codified gene information including protein-interaction maps, gene expression data, carefully constructed gene networks based on multiple information sources, predefined gene sets and pathways, and disease-related keywords. We propose, instead, to use a flexible metric of gene relatedness that not only captures clearly established close gene relationships, but also has the ability to capture potential undocumented or distant ones. Such a metric may be a more powerful tool to approach this problem than relying on incomplete databases of gene functions, interactions, or relationships. 
To this end, we use established statistical text mining approaches to quantify relatedness between two genes – specifically, gene relatedness is the degree of similarity in the text describing them within article abstracts. The published literature represented in online PubMed abstracts encapsulates years of research on biological mechanisms. We and others have shown the great utility of statistical text mining to rapidly obtain functional information about genes, including protein-protein interactions, gene function annotation, and measuring gene-gene similarity [19]–[22]. Text is an abundant and underutilized resource in human genetics, and currently a total of 140,000 abstracts from articles that reference human genes are available through PubMed [23]. Additional valuable information can be seamlessly gained by including more than 100,000 references from orthologous genes; many important pathways have been more thoroughly explored in model systems than in humans. We have developed a novel statistical method to evaluate the degree of relatedness among genes within disease regions: Gene Relationships Among Implicated Loci (GRAIL). Given only a collection of disease regions, GRAIL uses our text-based definition of relatedness (or alternative metrics of relatedness) to identify a subset of genes, more highly related than by chance; it also assigns a select set of keywords that suggest putative biological pathways. It uses no information about the phenotype, such as known pathways or genes, and is therefore not tethered to potentially biased pre-existing concepts about the disease. In addition to a flexible text-based metric of relatedness, GRAIL's ability to successfully connect genes also leverages a statistical framework that carefully accounts for differential gene content across regions. 
We assume that each region contains a single pathogenic gene; therefore narrow regions with one or just a few genes are more informative than expansive regions with many genes, since the latter are likely to have many irrelevant genes. To take advantage of this, we have designed GRAIL to set a lower threshold in considering relatedness for those genes in narrow regions, allowing for more distant relationships to be considered; on the other hand it sets a more stringent threshold for genes located in expansive multigenic regions and considers only the very closest of relationships. This strategy prevents large regions with many genes from dominating the analysis. In this paper we apply GRAIL to four phenotypes. In each case GRAIL is able to identify a subset of genes enriched for relatedness – more than expected by random chance. We demonstrate enrichment for relatedness among true disease regions rigorously based on both GRAIL's theoretically derived p-value and also based on parallel analysis of either (1) carefully selected random regions matched for gene content and size or (2) experimentally derived false positive disease regions. GRAIL is able to identify subsets of highly related genes among validated SNP associations. First we use GRAIL to identify related genes from SNPs associated with serum lipid levels; GRAIL correctly identifies genes already known to influence lipid levels within the cholesterol biosynthesis pathway. In comparison to randomly selected matched SNP sets, the set of lipid SNPs demonstrates significantly more relatedness. Second, we use GRAIL to identify significantly related genes near height-associated SNPs; these genes highlight plausible pathways involved in height. In comparison to randomly selected matched SNP sets, the set of height SNPs also demonstrates significantly more relatedness. 
Encouraged by GRAIL's ability to recognize biologically meaningful connections, we tested its ability to distinguish true disease regions from false positive regions in two practical applications in human genetics. First, in Crohn's disease, we start with a long list of putative SNP associations from a recent GWA (genome-wide association) meta-analysis [24]. We demonstrate that a substantial fraction of these SNPs contain highly related genes, far beyond what can be expected by chance. We demonstrate that many of these SNPs subsequently validate in an independent replication genotyping experiment. Second, in schizophrenia, we previously identified an over-representation of rare deletions in schizophrenia cases compared to controls [8]. Despite the statistical excess, it is challenging to identify exactly which case deletions are causal, given the relatively high background rate of rare deletions in controls. Using GRAIL, however, we are able to demonstrate that a subset of case deletions contain related genes. We further demonstrate that these genes are highly and significantly enriched for central nervous system (CNS) expressed genes. In stark contrast, GRAIL finds no excess relatedness among genes implicated by control deletions. Results Summary of statistical approach GRAIL relies on two key methods: (1) a novel statistical framework that assesses the significance of relatedness between genes in disease regions and (2) a text-based similarity measure that scores two genes for relatedness to each other based on text in PubMed abstracts. Details for both are presented in the Methods. The GRAIL statistical framework consists of four steps (see Figure 1). First, given a set of disease regions we identify the genes overlapping them (Figure 1A); for SNPs we use LD (linkage disequilibrium) characteristics to define the region. Second, for each overlapping gene we score all other human genes by their relatedness to it (Figure 1B). 
In this paper we use a text-based similarity measure; alternative measures of relatedness, for example similarity in gene annotations or expression data, could be easily applied instead [25],[26]. Third, for each gene we count the number of independent regions with at least one highly related gene (Figure 1C); here the threshold for relatedness varies between regions depending on the number of genes within them. We assign a p-value to that count. Fourth, for each disease region we select the single most connected gene as the key gene. We assign the disease region that key gene's p-value after adjusting for multiple hypothesis testing (if there are multiple genes within the region) (Figure 1D). This final score is listed in this paper as pmetric where the metric is text, expression, or annotation based. Very low ptext scores for one region indicate that a gene within it is more related to genes in other disease regions through PubMed abstracts than expected by chance. Simulations on random groups of SNPs demonstrate that the ptext values approximately estimate Type I error rates, being approximately uniformly distributed under the null hypothesis (see Figure S1). However, we recommend the use of careful simulations or controls rather than actual theoretical p-values to reinforce the significance of GRAIL's findings – as we do in the examples below. 10.1371/journal.pgen.1000534.g001 Figure 1 Gene Relationships Among Implicated Loci (GRAIL) method consists of four steps. (A) Identifying genes in disease regions. For each independent associated SNP or CNV from a GWA study, GRAIL defines a disease region; then GRAIL identifies genes overlapping the region. In this region there are three genes. We use gene 1 (pink arrow) as an example. (B) Assess relatedness to other human genes. GRAIL scores each gene contained in a disease region for relatedness to all other human genes. 
GRAIL determines gene relatedness by looking at words in gene references; related genes are defined as those whose abstract references use similar words. Here gene 1 has word counts that are highly similar to gene A but not to gene B. All human genes are ranked according to text-based similarity (green bar), and the most similar genes are considered related. (C) Counting regions with similar genes. For each gene in a disease region, GRAIL assesses whether other independent disease regions contain highly significant genes. GRAIL assigns a significance score to the count. In this illustration gene 1 is similar to genes in three of the regions (green arrows), including gene A. (D) Assigning a significance score to a disease region. After all of the genes within a region are scored, GRAIL identifies the most significant gene as the likely candidate. GRAIL corrects its significance score for multiple hypothesis testing (by adjusting for the number of genes in the region), to assign a significance score to the region. The text-based similarity metric is based on standard approaches used in statistical text mining. To avoid publications that report on or are influenced by disease regions discovered in the recent scans, we use only those PubMed abstracts published prior to December 2006, before the recent onslaught of GWA papers identifying novel associations. This approach effectively avoids the evaluation of gene relationships being confounded by papers listing genes in regions discovered as associated to these phenotypes. In addition to including primary abstract references about genes listed in Entrez Gene, we augment our text compendium with references to orthologous genes listed in Homologene [23]; this increases the number of articles available per gene from 6 to 12 (see Table 1). 
We note that the distribution of articles per gene is skewed toward a small number of genes with many references; 0.4% of genes are referenced by >500 articles, while 26% of genes are referenced by 0.1. The scatter plot on the right illustrates ptext values for actual serum cholesterol associated SNPs (blue dots). Black horizontal line marks the median ptext value. We assessed the same SNP with similarity metrics based on gene annotation (green dots) and gene expression correlation (purple dots). (B) 42 SNPs associated with height. Similar plot for 42 height associated SNPs. The histogram on the left of the graph illustrates ptext values for random SNP sets carefully matched to height-associated SNP set. 86.5% of those SNPs have ptext values that are >0.1. The scatter plot on the right illustrates ptext values for actual SNPs associated with height (blue dots). Black horizontal line marks the median ptext value. We assessed the same SNP with similarity metrics based on gene annotation (green dots) and gene expression correlation (purple dots). On the right we list for each ptext threshold the number of expected SNPs less than the threshold based on matched sets, and the number of observed SNPs less than the threshold among height associated SNPs. Despite relatively comprehensive lipid biology annotation, GO does not identify relationships between regions as effectively as published text (Figure 2A). A total of 12 out of the 19 associated SNPs obtained pannotation 10−4); the remaining 22 regions had intermediate levels of significance following replication (and can be considered as yet unresolved associations) [24]. We applied GRAIL prospectively to these 74 nominally associated SNPs. GRAIL was initially operated independent of any knowledge of the contemporaneous replication genotyping experiment. Each region contained between 1 and 34 genes, except for two regions that contained no genes and were not scored. 
GRAIL identified 13 regions as significant (achieving ptext scores 0.1.

Table 2 (10.1371/journal.pgen.1000534.t002). High scoring regions from a Crohn's disease GWA meta-analysis.

SNP         Chr  Position (HG17)  passociation  Replication Study Result  N (genes)  Implicated Gene  ptext
rs2066845   16   49314041         1.5E-24       VALIDATED                 3          NOD2             0.00010
rs10863202  16   84545499         1.4E-05       INDETERMINATE             4          IRF8             0.00058
rs10045431  5    158747111        1.9E-13       VALIDATED-NOVEL           1          IL12B            0.00066
rs11465804  1    67414547         3.3E-63       VALIDATED                 1          IL23R            0.00094
rs2476601   1    114089610        7.3E-09       VALIDATED-NOVEL           8          PTPN22           0.0014
rs762421    21   44439989         7.0E-10       VALIDATED-NOVEL           1          ICOSLG           0.0023
rs2188962   5    131798704        1.2E-18       VALIDATED                 9          IRF1             0.0026
rs917997    2    102529086        1.1E-05       INDETERMINATE             5          IL18RAP          0.0027
rs11747270  5    150239060        1.7E-16       VALIDATED                 3          IRGM             0.0032
rs2738758   20   61820069         2.7E-06       INDETERMINATE             10         TNFRSF6B         0.0038
rs9286879   1    169593891        7.7E-10       VALIDATED-NOVEL           4          TNFSF18          0.0042
rs2301436   6    167408399        5.2E-13       VALIDATED-NOVEL           3          CCR6             0.0052
rs4263839   9    114645994        1.3E-10       VALIDATED                 2          TNFSF8           0.008
rs3828309   2    233962410        1.2E-32       VALIDATED                 4          USP40            0.019
rs744166    17   37767727         3.4E-12       VALIDATED-NOVEL           2          STAT3            0.023
rs7758080   6    149618772        4.4E-06       INDETERMINATE             4          SUMO4            0.033
rs7161377   14   75071147         2.3E-05       INDETERMINATE             1          BATF             0.09

Here we list the subset of the 74 regions that emerged from a Crohn's disease GWA meta-analysis to which GRAIL assigned the most compelling ptext scores. The first three columns list information about the associated SNP. The fourth column lists the combined p-value of association from a GWA meta-analysis and subsequent replication. The fifth column indicates whether the region was validated, indeterminate, or failed in replication. Those regions that represent novel findings, not previously published, are also indicated. The sixth column lists the number of genes in the disease region, and the seventh column lists the candidate gene identified by GRAIL. The eighth column lists the region's ptext score. 
Using these Crohn's results, we have compared GRAIL's performance to four other competing algorithms that also use functional information to prioritize genes, and GRAIL's performance is superior at predicting true positive associations (see Text S1, Figure S2, Table S5, Table S6). As a further test of GRAIL, we then evaluated the next most significant 75 associated SNPs that emerged from the Crohn's disease GWA meta-analysis (association p-values ranging from 5×10−5 to 2×10−4). Out of the 75 regions, 8 are not near any gene, and we did not score them. The remaining 67 regions were tested with GRAIL for relationships to the 52 replicated and indeterminate regions that emerged following replication. Two emerge with highly significant GRAIL scores: rs8178556 on chromosome 21 (IFNAR1, ptext = 1.7×10−4) and rs12928822 on chromosome 16 (SOCS1, ptext = 8.2×10−4), suggesting these independent regions may lead to novel associated SNPs for Crohn's disease (see Table S7). We next applied GRAIL to recently published sets of rare deletions seen in schizophrenia cases and matched controls. Multiple groups have recently demonstrated that extremely rare deletions, many of which are likely de novo, are notably enriched in schizophrenia [8]–[10],[29]. However, since rare deletions occur frequently in healthy individuals as well, many of these case deletions will also be non-pathogenic. In fact, we previously found that large (>100 kb), gene-overlapping, singleton deletions were present in 4.9% of cases but also in 3.8% of controls, suggesting that over two-thirds of these deletions are not relevant to disease [8]. We identified 165 published de novo or case-only deletions of >100 kb overlapping at least one gene; a total of 511 genes are deleted or disrupted by these deletions [8],[9],[10]. Additionally, we identified 122 similar control-only deletions; a total of 252 genes are deleted or disrupted by these deletions. 
We applied GRAIL separately to both the case and control sets of deletions. In the case deletions, we identified a subset containing highly connected genes (Figure 4A). Specifically, 12 of the 165 regions obtain ptext scores 0.5 (Figure 3B). These regions might have been missed since the relevant gene is either poorly studied, or even if the gene is well studied, the relevant function of that gene is not well documented in the text. An alternative possibility is that the SNP is tagging non-genic regulatory elements. Additionally, the SNP may be the first discovered representative association for a critical pathway, not represented by other SNP associations – and therefore cannot be connected to them. In this case future discoveries will clarify the significance of that association. In cases where there is no apparent published connection between associated genes, other similarity metrics based on experimentally derived data, such as gene expression, protein-protein interactions and transcription factor binding sites could also complement the text-based approaches presented here. In fact, we demonstrate how annotation-based metrics or gene expression-based metrics are able to identify a subset of the associated SNPs in lipid metabolism. As these and other metrics are optimized, they could be used in conjunction with the novel GRAIL statistical framework that we present here to help understand gene relationships. Methods Scoring regions for functional relatedness The Gene Relationships Among Implicated Loci (GRAIL) has four basic steps that are outlined below. It has two input sets of disease regions: (1) a collection of NSEED seed regions (SNPs or CNVs) and (2) a collection of NQUERY query regions. Genes in query regions are evaluated for relationships to genes in seed regions, and query regions are then assigned a significance score. 
In most applications, where we are examining a set of regions for relationships between implicated genes, the query regions and the seed regions are identical. In other circumstances, where we have a set of putative regions that are being tested against validated ones, the putative regions are defined as query regions, and the validated ones are defined as seed regions. Step 1. Defining disease regions and identifying overlapping genes For each query and seed SNP we find the furthest neighboring SNPs in the 3′ and 5′ direction in LD (r2>0.5, CEU HapMap [50]). We then proceed outwards in each direction to the nearest recombination hotspot [51]. The interval between those two hotspots, which would include the SNP of interest and all SNPs in LD, is defined as the disease region. The associated SNP could feasibly be tagging a stronger SNP signal from another SNP in that region. All genes that overlap that interval are considered implicated by the SNP. If there are no genes in that region, the interval is extended an additional 250 kb in either direction; we chose 250 kb as that distance since that is a range in which non-coding variants might express gene regulation [52]. For each query and seed CNV we define an interval that represents the deleted or duplicated region; all genes that overlap that interval are associated with the CNV for testing. Step 2. Ranking gene relatedness For each gene near a query region, we rank all human genes for relatedness. Ranking may be based on text similarity, or other metrics (see below for examples). Rank values range from 1 (most related) to NG (least related), where NG is the number of available human genes, which in our application is 18,875 (see Table 1). Step 3. Scoring candidate genes against regions To avoid double counting nearby regions, we first combine any seed regions sharing one or more genes. For a given gene g in a query region, we examine the degree of similarity to any of the ns genes in a given seed region s. 
To ensure independence, we only look at a seed region s, if it does not share a single gene with the query region that gene g is contained in. We identify in each region s, the rank of the most similar (or lowest ranking) gene in it to gene g, Rg,s . We convert the rank to a proportion: To transform this proportion to a uniformly distributed entity under the null, we recognize that Rg,s was the lowest rank selected from ns genes – and we correct accordingly for multiple hypothesis testing: Now we identify those seed regions where pg,s is less than a pre-specified threshold pf as regions connected to gene g. For all applications presented here pf is arbitrarily set to 0.1. The number of seed regions containing at least one gene exceeding this threshold, nhit , can be approximated under a random model with a Poisson distribution. We assign a greater weight to those cases where there is greater similarity; that is in the cases where pg,s is particularly small: Under a random model, if pg,s 0.2. We restrict keywords to those that appear in >500 documents, contain >3 letters, and have no numbers. For each term, i, we calculate a score which is the difference between averaged term frequencies among candidate genes and all genes: The top twenty highest scoring terms are selected as keywords. Annotation based relatedness We defined a relatedness metric between genes based on similarity in Gene Ontology annotation terms [27]. We downloaded Gene Ontology structure and annotations on December 19, 2006. In addition to human gene GO annotations, we added orthologous gene annotations. Since GO is a hierarchically structured vocabulary, for each gene annotation we also added all of the more general ancestral terms. This resulted in a total of 843,898 annotations for 18,050 genes with 10,803 unique GO terms; this corresponds to a median of 40 terms per gene. We weighted annotations proportionally to the inverse of their frequency, so common annotations received less emphasis. 
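
The rank-based scoring in Step 3 can be illustrated with a short, simplified Python sketch. The equations are missing from the text above, so the forms used here are assumptions rather than the published definitions: the proportion is taken to be the best rank divided by NG, and the multiple-testing correction is taken to be the standard one for the minimum of ns uniform draws, 1 − (1 − proportion)^ns. The sketch also counts hits without the similarity weighting mentioned above, and the function names are hypothetical.

```python
import math

def corrected_rank_p(best_rank, n_genes_total, n_genes_in_region):
    """Convert the best (lowest) rank in a seed region to a corrected p.

    Assumed form: p = 1 - (1 - R/N_G)^n_s, i.e. the chance that the best
    of n_s randomly ranked genes would rank at least this well.
    """
    proportion = best_rank / n_genes_total
    return 1.0 - (1.0 - proportion) ** n_genes_in_region

def poisson_tail(n_hit, mean):
    """P(X >= n_hit) for X ~ Poisson(mean)."""
    below = sum(
        math.exp(-mean) * mean ** k / math.factorial(k) for k in range(n_hit)
    )
    return max(0.0, 1.0 - below)

def gene_score(best_ranks, region_sizes, n_genes_total=18875, p_f=0.1):
    """Unweighted sketch: count seed regions below p_f and score the count.

    best_ranks[i] is the best rank of any gene in seed region i relative
    to the query gene; region_sizes[i] is the number of genes in region i.
    """
    p_values = [
        corrected_rank_p(rank, n_genes_total, size)
        for rank, size in zip(best_ranks, region_sizes)
    ]
    n_hit = sum(1 for p in p_values if p < p_f)
    # Under the null each corrected p is roughly uniform, so the expected
    # number of hits is p_f times the number of seed regions.
    expected_hits = p_f * len(p_values)
    return n_hit, poisson_tail(n_hit, expected_hits)

n_hit, p_gene = gene_score(best_ranks=[10, 50, 9000, 15000],
                           region_sizes=[1, 3, 2, 5])
```

The correction is what lets narrow regions count for more: the same best rank yields a smaller corrected value in a one-gene region than in a region with many genes.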
We used a weighting scheme analogous to the one we used for word weighting, in which $g_{ij}$ represents the weighted code $i$ for gene $j$, $N_G$ is the total number of genes, and $gf_i$ (the GO frequency) is the number of genes annotated with term $i$. Gene relatedness was the correlation between these weighted annotation vectors.

Gene expression based relatedness

To calculate gene relatedness based on expression we downloaded the Novartis Gene Expression Atlas [28]. The data set consists of measurements for 33,689 probes across 158 conditions. Probes were averaged into 17,581 gene profiles. Gene relatedness was calculated as the correlation between expression vectors.

Lipid and height applications

We applied GRAIL to score 19 lipid-associated SNPs and, separately, 42 height-associated SNPs. Specific SNPs are listed in Table S1 and Table S2. We used each SNP set as both the seed and the query set to look for relatedness between genes across regions. We scored SNPs separately using text, annotation, and expression similarity metrics. We compiled the best candidate genes and scores for the SNP regions.

Crohn's disease application

Prior to replication, we had access to 74 independent SNP regions that had emerged from a meta-analysis of Crohn's disease. All 74 SNPs were used as both the query set and the seed set in GRAIL. We assessed whether those SNPs that replicated had different text-based significance values than those that failed to replicate. To identify additional regions of interest, we took the next 75 most significant regions in the Crohn's disease meta-analysis and used them in GRAIL as a query set; the seed set included all SNPs that did not fail in replication.

Schizophrenia application

We identified singleton deletions or confirmed de novo deletions reported by one of three groups. We selected those deletions that were in cases only or in controls only, were at least 100 kb in size, and included at least one gene.
We obtained singleton deletions published online by the International Schizophrenia Consortium (2008) at [8]. We obtained de novo deletions published by Xu et al (2008) from Table 1 [10]. We obtained singleton deletions published in Walsh et al (2008) from Table 2 [9]. We identified a total of 165 case-only deletions and 122 control-only deletions. We applied the GRAIL algorithm separately to cases and controls. We speculated that the case deletions might hit genes from a common pathway and that GRAIL p-values might therefore be enriched for significant scores. On the other hand, we hypothesized that control deletions might be located effectively at random, so that no particular pathway or common function should necessarily be enriched in this collection.

To examine genes for tissue-specific expression in the CNS, we obtained a large publicly available human tissue expression microarray panel (GEO accession: GSE7307) [30]. We analyzed the data using the robust multi-array average (RMA) method for background correction, normalization and polishing [55]. We filtered the data, excluding probes with 100% ‘absent’ calls (MAS5.0 algorithm) across tissues, expression values <20 in all samples, or an expression range <100 across all tissues. To represent each gene, we selected the corresponding probe with the greatest intensity across all samples. The data contained expression profiles for 19,088 genes. We included expression profiles from 96 normal tissues and excluded disease tissues and treated cell lines. We averaged expression values from replicated tissues into a single value. To assess whether genes had differential expression in CNS tissues, we compared the 27 tissue profiles that represented brain or spinal cord to the remaining 69 tissue profiles with a one-tailed Mann-Whitney rank-sum test. Genes obtaining p<0.01 were identified as preferentially expressed.
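
As a hedged illustration of the CNS comparison described above, the following Python sketch computes a one-tailed Mann-Whitney rank-sum p-value for a single gene. It is not the authors' code: the function name `mann_whitney_one_tailed`, the use of a normal approximation without tie or continuity correction, and the example expression values are all assumptions made for illustration. Only the group sizes (27 and 69) and the 0.01 threshold come from the text.

```python
import math


def mann_whitney_one_tailed(cns, other):
    """Return (U, p) for the alternative that `cns` values tend to exceed `other`.

    Assumption: the p-value uses the large-sample normal approximation,
    with no tie correction and no continuity correction.
    """
    n1, n2 = len(cns), len(other)
    if n1 == 0 or n2 == 0:
        raise ValueError("both groups must be non-empty")

    # U counts the (cns, other) pairs in which the CNS value is larger;
    # tied pairs contribute one half.
    u = 0.0
    for x in cns:
        for y in other:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5

    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    # Upper-tail probability of a standard normal variable.
    p = 0.5 * math.erfc(z / math.sqrt(2.0))
    return u, p


# Hypothetical expression values for one gene (illustration only):
# 27 CNS tissue profiles and 69 non-CNS tissue profiles.
cns_values = [200.0 + i for i in range(27)]
other_values = [50.0 + i for i in range(69)]
u_stat, p_value = mann_whitney_one_tailed(cns_values, other_values)
preferentially_expressed = p_value < 0.01  # threshold taken from the text
```

With only 27 and 69 profiles per gene the pairwise loop is cheap; for larger panels a rank-based formulation or a statistics library would be the more usual choice.
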
Evaluation against other published methods

We compared GRAIL's ability to prospectively predict Crohn's associations with that of five other published methods. The selection of these methods and the evaluation are detailed in Text S1.

Software

An online version of this method is available (http://www.broad.mit.edu/mpg/grail/).

Supporting Information

Figure S1. GRAIL p-value scores for random SNPs. We scored 100 random groups of 50 SNPs with GRAIL. The y-axis is the fraction of SNPs in the group with values below the threshold; the x-axis lists the specific threshold. For each threshold, we plot the distribution of the fraction of the 50 SNPs below that threshold as a box plot. The bar is the median; the mean value is explicitly listed below the box plot. The box at each threshold spans the 25%–75% range. The error bars depict 1.5 times the inter-quartile range. The black dots illustrate outliers beyond 1.5 times the inter-quartile range. (0.39 MB PDF)

Figure S2. Sensitivity versus specificity for prioritization algorithms. We used 5 algorithms to score the 74 most promising putative SNP associations from the Crohn's meta-analysis study. We assessed each algorithm's ability to predict those SNP associations that ultimately validated in follow-up genotyping. For each algorithm, we created a receiver operating characteristic (ROC) curve. (0.40 MB PDF)

Table S1. 19 lipid regions scored with the text-based GRAIL strategy. Here we scored 19 SNPs associated with lipid metabolism. In the first three columns we list information about the SNP. In the fourth column we list the number of genes in the SNP-associated regions. In the fifth column we list the highest scoring gene in the associated region based on GRAIL using a text-based metric. In the sixth column we list the $p_{text}$ values for the associated regions. We have bolded those candidate genes that are the known or likely causative genes.
The seventh and eighth columns list similar results for GRAIL with a GO annotation-based metric. The ninth and tenth columns list similar results for GRAIL with an expression-based metric. (0.15 MB DOC)

Table S2. 42 height regions scored with the text-based GRAIL strategy. Here we scored 42 SNPs associated with height. In the first three columns we list information about the SNP. In the fourth column we list the number of genes in the SNP-associated regions. In the fifth column we list the highest scoring gene in the associated region for the SNP based on GRAIL using a text-based metric. In the sixth column we list the $p_{text}$ values for the associated regions. The seventh and eighth columns list similar results for GRAIL with an annotation-based metric. The ninth and tenth columns list similar results for GRAIL with an expression-based metric. (0.28 MB DOC)

Table S3. Keywords for lipid and height SNPs. We identified keywords associated with lipid-associated and height-associated SNPs; here we list the top 20. (0.06 MB DOC)

Table S4. Crohn's disease SNPs from a meta-analysis of GWA studies. Here we list GRAIL results and summarize genotyping results for Crohn's disease SNPs. These 74 SNPs emerged from a meta-analysis and, as a result of replication genotyping, were either validated (A), indeterminate (B), or failed (C). For each of the regions we list the SNP ID and the chromosome in the second and third columns. In the fourth column we list the final combined association significance score of the SNP to Crohn's disease. In the fifth, sixth, and seventh columns we list GRAIL results, including the number of genes in the region, the best candidate gene, and the text-based significance score for the region. (0.21 MB DOC)

Table S5. Algorithms to prioritize candidate genes.
Our search of the literature identified nine algorithms that could be used to prioritize genes for replication. Four methods require no user-specified disease information (unsupervised), and five require some disease information from the user. We list in each row the name of the disease, the website, the necessary genetic data, the functional data used to prioritize genes, the disease-specific information that must be included, and the availability of the method. (0.09 MB DOC)

Table S6. Performance measures for prioritization algorithms. We used five algorithms (column 1) to score putatively associated SNPs from the Crohn's meta-analysis. After calculating an ROC curve for each algorithm, we calculated the AUC (column 2). We also calculated a p-value with a one-tailed rank-sum test comparing the median rank of the validated SNPs to the median rank of the failed SNPs (column 3). (0.04 MB DOC)

Table S7. Other promising regions in the Crohn's disease GWA meta-analysis. Information about the top six regions identified by GRAIL from the next 75 most significant regions from the Crohn's GWA study. All associations are indeterminate, and association p-values are taken from the GWA meta-analysis; these regions have not yet been replicated. (0.05 MB DOC)

Table S8. Rare or de novo schizophrenia control deletions. Here we list all of the deletions that GRAIL identified as most related to other deleted genes ($p_{text}$ < 0.05). For each deletion we list the chromosome, the range of the deletion, the GRAIL p-value for the region, and the best candidate gene in the region identified by GRAIL. Most genomic coordinates are listed in HG17; an asterisk (*) marks HG18 coordinates. (0.06 MB DOC)

Text S1. A. Random SNP groups; B. Comparison of GRAIL to other related algorithms. (0.09 MB DOC)
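
The ROC and AUC summaries mentioned for Figure S2 and Table S6 can be sketched as follows. This is an illustrative sketch only, assuming each algorithm yields one score per SNP and that smaller scores indicate stronger predictions (as with p-values). The function names `roc_points` and `auc` and the sample scores are hypothetical and do not come from the study.

```python
def roc_points(validated, failed):
    """Return ROC points as (false positive rate, true positive rate) pairs.

    Assumption: smaller scores mean stronger predictions (e.g. p-values),
    so a SNP is called positive when its score is <= the threshold.
    """
    if not validated or not failed:
        raise ValueError("both groups must be non-empty")
    points = [(0.0, 0.0)]
    # Sweep the threshold over every distinct observed score.
    for threshold in sorted(set(validated) | set(failed)):
        tpr = sum(v <= threshold for v in validated) / len(validated)
        fpr = sum(f <= threshold for f in failed) / len(failed)
        points.append((fpr, tpr))
    return points


def auc(points):
    """Area under the ROC curve, computed with the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical scores for SNPs that validated and SNPs that failed.
validated_scores = [0.001, 0.02, 0.30]
failed_scores = [0.04, 0.50, 0.80]
example_auc = auc(roc_points(validated_scores, failed_scores))
```

An AUC near 0.5 would indicate that an algorithm's scores do not separate validated from failed SNPs, while values near 1.0 indicate that validated SNPs consistently receive the stronger scores.
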

ABNER: an open source tool for automatically tagging genes, proteins and other entity names in text.

Burr Settles (2005)

ABNER (A Biomedical Named Entity Recognizer) is an open source software tool for molecular biology text mining. At its core is a machine learning system using conditional random fields with a variety of orthographic and contextual features. The latest version is 1.5, which has an intuitive graphical interface and includes two modules for tagging entities (e.g. protein and cell line) trained on standard corpora, for which performance is roughly state of the art. It also includes a Java application programming interface allowing users to incorporate ABNER into their own systems and train models on new corpora.

Author and article information

Contributors

Karin Verspoor

Kevin Bretonnel Cohen

Arrick Lanfranchi

Colin Warner

Helen L Johnson

Christophe Roeder

Jinho D Choi

Christopher Funk

Yuriy Malenkiy

Miriam Eckert

Nianwen Xue

William A Baumgartner Jr

Michael Bada

Martha Palmer

Lawrence E Hunter

Journal

Journal ID (nlm-ta): BMC Bioinformatics

Journal ID (iso-abbrev): BMC Bioinformatics

Title: BMC Bioinformatics

Publisher: BioMed Central

ISSN (Electronic): 1471-2105

Publication date Collection: 2012

Publication date (Electronic): 17 August 2012

Volume: 13

Page: 207

Affiliations

[1] Computational Bioscience Program, U. Colorado School of Medicine, MS 8303, 12801 E 17th Ave, Aurora, CO 80045, USA

[2] Department of Linguistics, University of Colorado Boulder, 290 Hellems, Boulder, CO 80309, USA

[3] Institute of Cognitive Science, University of Colorado Boulder, MUEN PSYCH Building D414, Boulder, CO 80309, USA

[4] Department of Computer Science, Brandeis University, MS 018, Waltham, MA 02454, USA

Article

Publisher ID: 1471-2105-13-207

DOI: 10.1186/1471-2105-13-207

PMC ID: 3483229

PubMed ID: 22901054

SO-VID: 2cd8325c-443b-4f79-b08e-960a7ca6d261

License:

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

