      A Semantic Model for Species Description Applied to the Ensign Wasps (Hymenoptera: Evaniidae) of New Caledonia


          Abstract

          Taxonomic descriptions are unparalleled sources of knowledge of life's phenotypic diversity. As natural language prose, these data sets are largely refractory to computation and integration with other sources of phenotypic data. By formalizing taxonomic descriptions using ontology-based semantic representation, we aim to increase the reusability and computability of taxonomists' primary data. Here, we present a revision of the ensign wasp (Hymenoptera: Evaniidae) fauna of New Caledonia using this new model for species description. Descriptive matrices, specimen data, and taxonomic nomenclature are gathered in a unified Web-based application, mx, then exported as both traditional taxonomic treatments and semantic statements using the OWL Web Ontology Language. Character:character-state combinations are then annotated following the entity–quality phenotype model, originally developed to represent mutant model organism phenotype data; concepts of anatomy are drawn from the Hymenoptera Anatomy Ontology and linked to phenotype descriptors from the Phenotypic Quality Ontology. The resulting set of semantic statements is provided in Resource Description Framework format. Applying the model to real data, that is, specimens, taxonomic names, diagnoses, descriptions, and redescriptions, provides us with a foundation to discuss limitations and potential benefits such as automated data integration and reasoner-driven queries. Four species of ensign wasp are now known to occur in New Caledonia: Szepligetella levipetiolata, Szepligetella deercreeki Deans and Mikó sp. nov., Szepligetella irwini Deans and Mikó sp. nov., and the nearly cosmopolitan Evania appendigaster. A fifth species, Szepligetella sericea, including Szepligetella impressa, syn. nov., has not yet been collected in New Caledonia but can be found on islands throughout the Pacific and so is included in the diagnostic key. 
[Biodiversity informatics; Evaniidae; New Caledonia; new species; ontology; semantic phenotypes; semantic species description; taxonomy.]

          Most cited references 50


          Uberon, an integrative multi-species anatomy ontology

          We present Uberon, an integrated cross-species ontology consisting of over 6,500 classes representing a variety of anatomical entities, organized according to traditional anatomical classification criteria. The ontology represents structures in a species-neutral way and includes extensive associations to existing species-centric anatomical ontologies, allowing integration of model organism and human data. Uberon provides a necessary bridge between anatomical structures in different taxa for cross-species inference. It uses novel methods for representing taxonomic variation, and has proved to be essential for translational phenotype analyses. Uberon is available at http://uberon.org

            Textpresso: An Ontology-Based Information Retrieval and Extraction System for Biological Literature

            Introduction Text-mining tools have become indispensable for the biomedical sciences. The increasing wealth of literature in biology and medicine makes it difficult for the researcher to keep up to date with ongoing research. This problem is worsened by the fact that researchers in the biomedical sciences are turning their attention from small-scale projects involving only a few genes or proteins to large-scale projects including genome-wide analyses, making it necessary to capture extended biological networks from literature. Most information of biological discovery is stored in descriptive, full text. Distilling this information from scientific papers manually is expensive and slow, if the full text is available to the researcher at all. We therefore wanted to develop a useful text-mining tool for full-text articles that allows an individual biologist to locate efficiently information of interest. The natural language processing field distinguishes information retrieval from information extraction. Information retrieval recovers a pertinent subset of documents. Most such retrieval systems use searches for keywords. Many Internet search engines are of this type, such as PubMed (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi). Information extraction is the process of obtaining pertinent information (facts) from documents. The facts can concern any type of biological object (entity), events, or relationships among entities. Useful measures of the performance of retrieval and extraction systems are recall and precision. In the case of retrieval, recall is the number of pertinent documents returned compared to all pertinent documents in the corpus of text. Precision is the number of pertinent documents compared to the total number of documents returned. A fully attentive reader would have complete recall, but low precision, because he has to read the whole body of text to find information. 
The emphasis for most applications is on recall, and we thus sought a system with high recall and as high precision as possible. Attempts to annotate gene function automatically include statistical approaches, such as cooccurrence of biological entities with a keyword or Medical Subject Heading term (Stapley and Benoit 2000; Jenssen et al. 2001). These methods have high recall and low precision, as no effort is being made to identify the kind of relationship as it occurs in the literature. Another approach has involved semantic and/or syntactic text-pattern recognition methods with a keyword representing an interaction (Sekimizu et al. 1998; Thomas et al. 2000; Friedman et al. 2001; Ono et al. 2001). They have high precision but low recall, because recognition patterns are usually too specific. Other machine learning approaches have classified abstracts and sentences for relevant interactions, but have not extracted information (Marcotte et al. 2001; Donaldson et al. 2003). For a more detailed report of these and related projects, see reviews by Andrade and Bork (2000), de Bruijn and Martin (2002), and Staab (2002). The precision of a keyword search can be increased by searching for combinations of keywords. For example, a researcher might construct a search for “anchor cell” and the gene name “lin-12” because he is interested in learning whether lin-12 plays a role in the anchor cell. However, there are many potential ways to describe the same concept or biological entity. Also, one often wants to search for a category of terms such as any gene or any body part. In this case, the intended search might be of a more general nature: If the researcher asks which genes are of interest in the anchor cell at all, he might have a hard time typing in all the known gene names (either one by one or concatenated with the Boolean operator “or”) in combination with the cell name. 
We therefore sought to develop a system that uses categories of terms such as “gene,” “cell,” or “biological process.” We established these categories of terms and organized them as an ontology, a catalog of types of objects and concepts and their relationships. The categories impart a semantic quality to searches, because the categories are based on the meaning of the entries. In many cases literature databases only contain bibliographic information and abstracts. The latter suffer from the constraint of information compression and convolution imposed by a word limit. Access to the full text of articles is critical for sufficient coverage of facts and knowledge in the literature and for their retrieval (Blaschke and Valencia 2001); our results confirm these findings. We wanted to use the Caenorhabditis elegans literature as a test case for developing a useful information extraction system. C. elegans has a relatively small literature, so in principle we could use it to test a complete, well-defined corpus. We also wanted to support a new database curation effort involving manual literature curation (Stein et al. 2001). Literature curation consists of identifying scientific data in literature and depositing them in an appropriate manner in a database. One extreme curation method is to read through the whole corpus of literature, identifying and extracting all significant information. This approach has the advantage that quality control of the data is done to the highest degree, based on human expertise. However, the volume and growth of biological literature makes it hard to keep the biological database up to date. In addition, data in literature may be missed by oversight, an inevitable flaw of purely human curation. The other extreme curation method is to extract data automatically. We therefore wanted a system that uses the computer to assist the curators. 
Our system is defined by two key components: the introduction of an ontology and the searchability of full text. The ontology is organized into categories that facilitate broader searches of biological entities as illustrated above. To be useful, it should also contain other categories that are not composed of biological entities, but describe relationships between entities. We sought to offer the user an opportunity to query the literature in the framework of the ontology such that it returns sentences for inspection by the user. We hypothesized that searching the corpus of text with a combination of categories of an ontology could facilitate a query that contains the meaning of a question in a much better way than with keywords alone. For example, if there is a “gene” category containing all gene names and a “regulation” category that includes all terms (nouns, verbs, adjectives, etc.) describing regulation, searching for (at least) two instances of the category gene and one instance of the category regulation in a sentence increases the chance that the search engine will return a sentence describing a gene-gene regulation. The search could then be limited by using a particular gene name as a keyword to get a list of genes that regulate or are regulated by that particular gene. Results We have developed a text processing system, Textpresso, that splits papers into sentences, and sentences into words or phrases. Each word or phrase is then labeled using the eXtensible Markup Language (XML) according to the lexicon of our ontology (described below). We then index all sentences with respect to labels and words to allow a rapid search for sentences that have a desired label and/or keyword. The labels fall into 33 categories that comprise the Textpresso ontology. We built a database of 3,800 C. elegans papers, bibliographic information from WormBase, abstracts of C. elegans meetings and the Worm Breeder's Gazette, and some additional links and WormBase entities. 
See Materials and Methods for details on the database preparation. Textpresso Ontology Abstracts, titles, and full texts in the Textpresso system are processed for the purpose of marking them up semantically by the ontology we constructed. An ontology is a catalog of types of objects and (abstract) concepts devised for the purpose of discussing a domain of interest. An ontology helps to clarify a domain's semantics for everyday use, as is nicely demonstrated by Gene Ontology (GO; The Gene Ontology Consortium 2000). Although GO terms are not intended as a representation of natural language prose, they are a rich source of biologically meaningful terms and synonyms. They are the foundations for three corresponding categories in Textpresso, which are added to its 30 other categories. GO terms comprise approximately 80% of the lexicon. The first group of categories in the Textpresso ontology consists of biological entities: It contains the categories gene, transgene, allele, cell and cell group, cellular component, nucleic acid, organism, entity feature, life stage, phenotype, strain, sex, drugs and small molecules, molecular function, mutant, and clone. We have incorporated the GO molecular function category and proteins in the Textpresso molecular function category. A more detailed list with definitions can be found on the Textpresso Web site, and the most important ones are provided in Table 1. Many of these categories have subcategories. For example, the molecular function category has the subcategories “source = (Go|Textpresso)” and “protein = (yes|no).” As we have imported all terms from GO, the first subcategory makes it possible to search specifically for GO terms. Terms added by us have the attribute “Textpresso.” Similarly, not all molecular function terms are classified as protein. 
The word “co-transporter,” for example, conveys more of a function and would be used more in this context in the literature, even though its physical realization may in fact be a protein. A list of all subcategories can be found in Table 2. The second group of categories comprises terms that characterize a biological entity or establish a relation between two of them. It includes physical association (in the sense of binding) and consort (abstract association), effect, purpose, pathway, regulation, comparison, spatial and time relation, localization in time and space, involvement, characterization (terms that express the characterization of something), method, biological process, action, and descriptor (words that describe the state or condition of an entity). These categories, while well defined, have somewhat delicate boundaries, and the common-sense aspects of our ontology apply more to this group. It is likely that its categories are going to be changed as we continue to develop the system. In some instances terms are attributed to one category, even though they might as well fit into another. As an example, the term “coexpress” is put in the “consort” category to emphasize the concurrent aspect of the process, while it could as well be classified as a biological process. However, we believe that in most cases the first sense of the word is used in the literature. The last group (auxiliary) contains categories that can be used for more involved semantic analysis of sentences. These categories are auxiliary (forms of the verbs “be” and “have”), bracket, determiner, conjunction (and, or, because, since, although, etc.), conjecture (could, might, should, suggests), negation, pronoun, preposition, and punctuation. Some of them overlap with the syntactic categories that the part-of-speech tagger (used in the preprocessing steps; see Materials and Methods) assigns to terms, but are repeated here as they also contain some semantic component. 
The category “conjecture” is introduced to distinguish statements that convey hypotheses, speculations, or theoretical considerations from sentences that are expressed with confidence, thus representing more of a fact. The words of this category indicate the certainty of a statement. The Textpresso ontology is organized into a shallow hierarchy with 33 parent categories. The parent categories may have one or more subcategories, which are specializations of the parent category. For example, all of the terms in the parent category “biological process” will belong to one of its subcategories, “transcription,” “translation,” “expression,” “replication,” “other,” or “no biosynthesis.” This is user friendly and certainly serves the current implementation of the user interface well, which is oriented more towards information retrieval. The ontology is populated with 14,500 Practical Extraction and Report Language (PERL) regular expressions, each of which covers terms with a length from one to eight words. These expressions are contained in a lexicon. Table 3 shows examples of regular expressions for each category and examples of text strings matching them. Each regular expression can match multiple variable patterns. The multiple forms of regular verbs, for example, can be conveniently expressed as “[Ii]nteract(s|ed|ing)?” which stands for the eight cases “interact,” “interacts,” “interacted,” “interacting,” “Interact,” “Interacts,” “Interacted,” and “Interacting.” All regularly named C. elegans genes are matched with the expression “[A-Za-z][a-z][a-z]-\d+” matching three letters ([A-Za-z][a-z][a-z]), a dash (-), and a sequence of digits (\d+). As this example illustrates, the expressions can be made case sensitive. This is important as biological nomenclature becomes more elaborate, and the ability to distinguish subtle differences is pivotal for separating terms into the correct categories. 
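The two expressions quoted above can be tried directly. The following is a minimal sketch in Python rather than Perl, with the patterns written using ASCII hyphens as a regex engine requires; the helper names `interact_forms` and `find_genes` are illustrative and not part of Textpresso.

```python
import re

# Patterns quoted in the text, written with ASCII hyphens.
INTERACT = re.compile(r"[Ii]nteract(s|ed|ing)?")
GENE = re.compile(r"[A-Za-z][a-z][a-z]-\d+")


def interact_forms():
    """Enumerate the eight strings the verb pattern is said to cover."""
    return [first + "nteract" + suffix
            for first in ("i", "I")
            for suffix in ("", "s", "ed", "ing")]


def find_genes(sentence):
    """Return substrings shaped like regularly named C. elegans genes."""
    return GENE.findall(sentence)


if __name__ == "__main__":
    # Every enumerated form should be matched in full by the verb pattern.
    print(all(INTERACT.fullmatch(form) for form in interact_forms()))
    print(find_genes("lin-12 plays a role in the anchor cell with daf-16"))
```

The gene pattern as quoted has no word-boundary anchors, so it can also match inside longer tokens; a production lexicon would presumably guard against that.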
Many of the regular expressions are generated automatically via scripts, taking a list of plain words as input and transforming them as shown in this example, to account for regular forms of verbs and nouns. The text-to-XML converter (see Materials and Methods) marks up the whole corpus of abstracts, full texts, and titles and produces XML documents. Figure 1 illustrates this process with an example. The computer identifies terms by matching them against regular expressions (such as the one shown above) and encloses them with XML tags. The tag serves as a containment of terms not semantically marked up. These tags will be used for a repeated reevaluation of the lexicon, as these terms can be easily pulled out and analyzed. A list of the most frequently missed terms is then produced and included in the lexicon for the next markup. Applications of Textpresso The marked-up text is stored in a database and can be queried. We built a user interface for general queries and another one for a specific type of query for WormBase curators (gene-gene interactions; see below). Textpresso is used in several related ways. Individual biologists use it to find specific information. Database curators, whose job is to extract information from papers or abstracts and to add this to a database, use it repeatedly to find all information of a particular type, in addition to using it for individual queries. The current Textpresso user interface (http://www.textpresso.org/) includes a query interface, a side menu with links to informative pages about the ontology, a document type definition, a user guide, and example searches, as well as the two retrieval and customization interfaces. The Web site offers two different types of retrieval, simple and advanced. 
Options for the retrieval queries are offered: searching a combination of categories, subcategories, and keywords in a Boolean fashion, specifying the frequency of occurrences of particular items, and choosing where in the article to search (title, abstract, body). The user can also determine whether a query is to be met in the whole publication or in a sentence. These options make the search engine powerful; for example, if a query is met in the whole article, the search has the function of text categorization, while meeting it in a sentence aims at extracting facts, which can be viewed in the context of a paragraph. The specification of cooccurrence determines the character of a search. If a combination of keywords and categories is found in a sentence, the likelihood that a sentence contains a fact involving the chosen categories and keywords is quite high. If the user chooses cooccurrence within a document, he is more interested in finding a relevant document. The scope of a search can be confined to full text, abstract, title, author, year, or any combination thereof, for document searches as well as sentence searches. A typical result page shows a list of documents with all bibliographical information and the abstract as displayed in Figure 2. A simplified version of the Textpresso interface is incorporated within WormBase (http://www.wormbase.org). The result list retrieved by a query can be customized in such a way that the user can choose how to display the information. This list is sorted according to the number of occurrences of matches in the document, so the most relevant document will be on the top of the list. A series of buttons for the whole list as well as for each document is available, allowing the user to view matching sentences or prepare search results in various formats. 
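The difference between meeting a query within a sentence and within a whole document can be sketched as follows. This is an illustrative toy and not the Textpresso implementation: the two-category lexicon, the naive sentence splitter, and the function names are all assumptions, and ranking by number of matching units is only a stand-in for the match counts mentioned above.

```python
import re

# Hypothetical mini-lexicon: category name -> regular expression.
LEXICON = {
    "gene": re.compile(r"\b[A-Za-z][a-z][a-z]-\d+\b"),
    "regulation": re.compile(r"\b(?:regulat(?:es|ed|e|ing|ion)|repress(?:es|ed|ing)?)\b"),
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def count_matches(text, category):
    """Number of non-overlapping matches of one category in a piece of text."""
    return len(LEXICON[category].findall(text))


def meets(text, minimum_counts):
    """True if every category occurs at least the requested number of times."""
    return all(count_matches(text, c) >= n for c, n in minimum_counts.items())


def search(documents, minimum_counts, scope="sentence"):
    """Return (doc_id, matching_units) pairs, most matching units first.

    scope="sentence": the query must be met within a single sentence.
    scope="document": the query must be met somewhere in the whole text.
    """
    results = []
    for doc_id, text in documents.items():
        units = split_sentences(text) if scope == "sentence" else [text]
        hits = [u for u in units if meets(u, minimum_counts)]
        if hits:
            results.append((doc_id, hits))
    # Rank documents by how many units matched (a stand-in for match counts).
    results.sort(key=lambda pair: len(pair[1]), reverse=True)
    return results
```

With a query of one gene and one regulation term, a document that mentions a gene in one sentence and regulation in another is returned under document scope but not under sentence scope, which is the categorization-versus-fact-extraction distinction the text draws.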
The individual result entries have up to six links: One can view matches for each paper only, go to the Web site of the journal to read the online text of the article (this only works if the user is subscribed to the journal), view a list of related articles that is provided by PubMed, export the bibliographical information into Endnote (two different links), or, if the user is accessing Textpresso internally (currently at Caltech), one can download the PDF of the paper. The power of Textpresso's search engine unfolds when category searches are used. By searching for a category, the researcher is targeting all keywords that populate that category. For example, the researcher might be interested in facts about genetic regulation of cells. Assuming that many facts are expressed in one sentence, he would search for the categories “gene,” “regulation,” and “cell or cell group” in a sentence. He can then view the matches (and surrounding sentences) of the search return and decide which facts are relevant. If one is not interested in all genetic regulation instances mentioned in the literature, it might be more useful to combine keywords with categories. For example, the question “What entities interact with ‘daf-16' (a C. elegans gerontogene)?” can be answered by typing in the keyword “daf-16” and choosing the category “association.” Advanced Retrieval and Subcategories An extension (the advanced retrieval interface) allows the use of the subcategories of the ontology and the specification of Boolean operators, thereby concatenating categories and keywords with “or” or “not” to permit alternatives or exclude certain items. One special subdivision of terms is the distinction between named and unnamed entities: Categories can include both general terms and specific names of entities. For example, the word “gene” would be an unnamed term of the gene category, while “lin-11” is a named entity. 
The general terms will likely be used for fact extraction across several neighboring sentences, but they might also be useful for retrieval purposes, even though the rate of false positives might be much higher in the latter case. Lastly, the user can determine how a keyword or category term has to be matched numerically. The options “greater than,” “less than,” and “equal to” are available together with a drop-down menu for the number of occurrences. With these additional tools, document categorization can be made more effective. A detailed profile of which categories and keywords should occur a minimum, maximum, or exact number of times for triggering a match can be established. Similarly, searches on the sentence level acquire a semantic quality, i.e., they at least partially encompass a meaning. In many cases, the answers to questions, phrased in the form of a sophisticated query, can immediately be read off the result screen. If, for example, one were to ask in which cells lin-11 is expressed, one would search sentences for a combination of the category “biological process” (subcategory “biosynthesis: expression”), the category “cell or cell group” (subcategory “type: name”) and the exact keyword “lin-11.” The subcategory “expression” filters out all words that relate to expression, the subcategory “name” limits the search to specific cells which have a name, such as “anchor cell,” “HO neurons,” “IL sensillum,” etc. Other subcategory options would be “group” (for example, “head,” “vulva,” “tail”) and “lineage” (“AB lineage,” “EMS lineage,” etc.). To better understand the following results, note that the term “cell(s)” has the type “name,” to gain the correct meaning of phrases such as “AB lineage cells.” The first two words of this phrase are marked as lineage, but the last word makes the whole phrase named cells. The system returns sentences of different quality. 
Some of them answer the question posed immediately (returned sentences are taken from Gupta and Sternberg 2002; that paper produced the most hits). The underlined words mark the matched items: “An analysis of the expression pattern of lin-11 in vulva and uterine lineage cells earlier suggested that cellular defects arise due to a failure in the differentiation process”; “Our analysis of the expression of lin-11 in VPC granddaughters (Pn.pxx stage) has revealed the following pattern in P5.p and P7.p lineage cells (from anterior to posterior; L, low; H, high), LLHH and HHLL, respectively.” Other sentences meet the truth more by accident, as the terms are matched within a sentence, but the statement does not really express the fact sought. The cells where lin-11 is expressed might be inferred by the knowledgeable reader, and not stated explicitly: “Our results demonstrate that the tissue-specific expression of lin-11 is controlled by two distinct regulatory elements that function as independent modules and together specify a wild-type egg-laying system”; “Using a temporally controlled overexpression system, we show that lin-11 is initially required in vulval cells for establishing the correct invagination pattern.” Finally, some sentences just do not give any clue about the posed question: “lin-11 cDNA-expressing vectors under the control of lin-11-AB (pYK452F7-3) and lin-11-C (pYK452F7-2) elements were designed as follows.” Here, “AB” is marked up as a named cell, but this is not the semantically correct tag in this context. This false positive might have been prevented if specific sections of a paper could be searched, as this statement comes from the method section. Evaluation of the Textpresso System An automatic method for retrieving or extracting information from text is only useful if it is as accurate and reliable as human curation. We devised two tests based on two common tasks performed by human experts who extract biological data from journal articles. 
The first task was the automatic categorization of papers according to the types of biological data they contain. Our study used a large test set of papers scanned by a curator to examine the effectiveness of automatically searching for information in the full text of a journal article compared to its abstract. The second task focused on retrieving sentences containing a specific type of biological data from text. Sentences from eight journal articles were manually inspected on a sentence-by-sentence basis and compared to the return from a Textpresso query on the same articles. From this study we present a detailed error analysis outlining the strengths and weaknesses of the current Textpresso system as an automatic method for information retrieval. We evaluated the performance of Textpresso using the information extraction performance metrics of precision, which is a measure of the amount of true returned data compared to the amount of false returned data, and recall, which is a measure of the true data returned compared to the total amount of true data in the corpus. These values are formulated as recall = number of true returns / total number of true data items and precision = number of true returns / total number of returns. Classification of Journal Articles: Full Text Versus Abstract We examined the effectiveness of automatically identifying journal articles that contain particular types of data. A test set of 965 journal articles pertaining to C. elegans biology was assessed by a human expert and categorized into groups according to six different types of data (antibody data, ablation data, expression data, mapping data, RNAi data, and transgenes). Note that there can be more than one data type per article. We first measured the value of searching for keywords in the full text of an article as opposed to searching its abstracts (Table 4). 
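The two formulas defined above translate directly into code. The sketch below is illustrative only; the function names are assumptions. The example counts are those reported in this section for the eight-article study: 63 true sentences among 178 returned by the Textpresso query, and 102 true sentences in total.

```python
def recall(true_returns, total_true_items):
    """recall = number of true returns / total number of true data items."""
    if total_true_items == 0:
        raise ValueError("recall is undefined when there are no true items")
    return true_returns / total_true_items


def precision(true_returns, total_returns):
    """precision = number of true returns / total number of returns."""
    if total_returns == 0:
        raise ValueError("precision is undefined when nothing is returned")
    return true_returns / total_returns


if __name__ == "__main__":
    # Counts from the eight-article study described in this section.
    print(f"recall    = {recall(63, 102):.1%}")
    print(f"precision = {precision(63, 178):.1%}")
```

Both values fall between 0 and 1; the text reports them as percentages.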
The overall information recall when searching abstracts is low (∼44.6%) compared to the information recall when searching full text (∼94.7%). Furthermore, keywords for some specific types of data (e.g., antibody data, mapping data, transgene data) are very unlikely to appear in abstracts (∼10% recall) but can be found in full text (∼70% recall). However, precision of the keyword search is reduced by almost 40% when searching full text compared to abstracts (30.4% and 52.3%, respectively). Single keyword searches of full text return a large number of irrelevant documents for most searches. This higher false positive rate might reflect the writing style found in full text, where facts can be expressed within complex sentence structures (as compared to abstracts, where authors are forced to compress information), combined with the inability of a keyword search to capture context. Small-Scale Information Retrieval Study We tested the accuracy of a search combining word categories and keywords to retrieve sentences containing genetic interaction data. For this experiment we broadly defined genetic interaction as the effect of one or more genes on the function of another gene or genes (and thus it includes genetic interaction, regulation, and interaction of gene products). To directly assess how Textpresso performs, a human expert manually evaluated the text sentence by sentence (Figure 3). We formulated a Textpresso query that searched for the presence of at least two genes mentioned by name and at least one term belonging to the “regulation” or “association” word categories (see Materials and Methods). A total of 178 sentences were matched for this query in the eight journal articles, and the results are shown in Table 5. A human expert assessed the returned sentences and determined that 63 sentences contained gene-gene interaction data according to our criterion. 
The same set of journal articles had been independently manually evaluated for their description of genetic interactions, and 73 true sentences were identified. In both cases, information from the article title, abstract, contents of tables, and reference section was excluded. Sentences that described genetic interaction using the gene product name rather than the gene were also excluded from this study. To measure recall, we first determined the total number of sentences that contained genetic interaction data. For this analysis we took the union of true sentences manually identified in the journal articles and the true sentences returned by Textpresso. The total number of true sentences identified by the two methods was 102. The recall of sentences containing genetic interaction was ∼62% using Textpresso compared to ∼71% for those sentences manually identified in journal articles. One-third of the sentences returned by Textpresso were true positives (35%). Although the numbers of true sentences retrieved by the automatic and manual methods were similar (63 and 73, respectively), only 34 of these sentences overlapped. To investigate this discrepancy, we manually extracted the genetic interactions described in both sets of sentences and determined the number of distinct genetic interactions found by each method (Table 6). The sentences manually identified from the journal articles yielded 23 more distinct genetic interactions than those which were extracted from true sentences retrieved by Textpresso. However, 43 interactions derived from the Textpresso output overlapped with the manually identified set, and Textpresso located sentences describing seven genetic interactions that the human expert missed. The average redundancy (how many times the same gene-gene interaction occurred) of a distinct genetic interaction extracted from both the manual and automatic methods was 3-fold. We analyzed the gene-gene interaction sentences missed by Textpresso. 
In many cases (65%) the word or phrase used to describe the genetic interaction belonged to neither the “association” nor the “regulation” word category and so the sentence was not returned. In some cases, the term or phrase that determined “genetic interaction” belonged to some other Textpresso word category (e.g., some terms that implied genetic interaction and were not matched by the query were “epistatic,” which belongs to the “consort” word category, and “alters,” which belongs to the “effect” word category). This type of analysis is useful for revising and updating the ontology. In other cases, due to the intricacies of natural language prose, it was difficult to isolate an interaction term in the sentence (e.g., “Thus ref-2 alone is insufficient to keep P(3–6).p unfused when lin-39 is absent.”). Approximately 8% of true sentences were missed because the genetic interaction information was discussed over a number of sentences. This is a limitation of the current Textpresso system, as search queries are matched per sentence (or per entire article). Our analysis of the false positive sentences returned by Textpresso revealed that approximately 10% discussed gene-gene interactions that did not occur (e.g., “Neither pdk-1(gf) nor akt-1(gf) suppressed the Hyp phenotype of age-1(mg44).”). While we do have a “negation” category in our Textpresso ontology, we chose not to exclude negation terms from the posed query, to avoid missing true positives (in case the negation does not apply to the interaction term in a sentence, but to some other portion of it). Twenty-one percent of the false positive sentences were determined by inspection to suggest genetic interaction, but were too weakly phrased to extract the information in confidence without the context of the sentence. 
However, the majority of false positives (70%) were due to the lack of context of the search terms in the sentence, where they matched the query terms (underlined) but in a context that did not mention genetic interaction: “ lin-35 and lin-53 , two genes that antagonize a C. elegans pathway, encode proteins similar to Rb and its binding protein RbAp48.” This example strongly supports the idea that an information extraction method that considers semantic context of a search query would dramatically increase the precision of the return. Large-Scale Information Retrieval to Expedite Information Extraction We performed extraction of genetic interaction information from a corpus of 3,307 journal articles. A Textpresso query searched for the presence of at least two uniquely named genes and at least one term belonging to the “regulation” or “association” word categories (see Materials and Methods for more details). A total of 17,851 sentences were returned by this query. Due to the lack of context of some sentences, true sentences were determined by a more stringent definition of genetic interaction, i.e., where one or more named genes were described as modifying the phenotype of another named gene or genes by suppression, enhancement, epistasis, or some other genetic method. To determine the frequency of true sentences, a random sample of 200 of the sentences returned by Textpresso was evaluated by a human expert according to this more stringent criterion (Table 7, column C). This sample was compared to 200 sentences chosen from the whole corpus at random (Table 7, column A) and 200 sentences randomly chosen from the whole corpus that contained two or more named genes (Table 7, column B). 
A typical sentence that was determined to be true for genetic interaction data is “Interestingly, at lower temperatures, the akt-2(+) transgene can supply sufficient Akt/PKB activity to weakly suppress the dauer arrest caused by age-1(mg44).” Some of the sentences strongly suggested genetic interaction but did not quite meet the genetic interaction criterion. These were grouped as “possible genetic interaction,” for example, if a phenotype was not mentioned: “For example, lin-15(lf) animals display a 54% penetrance of P11 to P12 fate transformation, while all egl-5(lf);lin-15(lf) double mutants show a P12 to P11 fate transformation.” Sometimes it is unclear exactly which genes are participating in the genetic interaction: “Evidently the effect of the sir-2.1 transgene alone is too subtle to trigger dauer formation without the sensitizing daf-1 or daf-4 mutations.” Another group was highlighted as discussing interaction, but fell outside the criterion set for genetic interaction. These were classified “non-genetic interaction.” Some examples of this are sentences that specify gene regulation: “These studies have shown that smg-3(Upf2) and smg-4(Upf3) are required for SMG-2 to become phosphorylated.” Finally, sentences that describe physical interaction were also put into the category “possible genetic interaction”: “For example, GLD-1 represses translation of tra-2, one of the sex-determination genes, by binding to the 3′-UTR or the tra-2 mRNA (Jan et al. 1999).” This analysis shows that there is a 1 in 200 chance of a sentence discussing genetic interaction (as defined above) randomly occurring in the full text of the journal articles analyzed. The odds increase to 7 in 100 if one looks at sentences containing at least two named genes. The returned matches from the Textpresso search are enriched 39-fold for genetic interaction compared to random chance, and there is a significant 3-fold enrichment when compared to sentences containing at least two named genes. 
There is a 1 in 5 chance that a returned Textpresso match is true. To date, 2,015 of the 17,851 returned sentences have been evaluated. Of these, 370 discuss genetic interaction, yielding 160 distinct gene-gene interactions mined from the literature. There are 213 sentences that mention nongenetic interactions, and 419 sentences are classified as possible genetic interactions. Large-Scale Simple Fact Extraction We have extracted gene-allele reference associations from the corpus of papers to populate the WormBase database by searching for the pattern . Of the 10,286 gene-allele associations extracted, 9,230 were already known by WormBase, while 1,056 associations were new and could be added to the database. In addition, 1,464 references could be added to the 2,504 allele reference associations in WormBase. Ninety-eight percent of the data extracted went into the database without any manual correction, and the last 2% were compromised because of typographical errors in the original paper or the inherent character of the data (i.e., gene name synonyms and changes). Discussion Accomplishments We have developed a system to retrieve information from the full text of biological papers and applied it to the C. elegans literature. As of March 2004, the database contains full texts of 60% of all papers listed by the Caenorhabditis Genetics Center (CGC; http://www.cbs.umn.edu/CGC/CGChomepage.htm) and almost all abstracts that are information rich for C. elegans research. The introduction of semantic categories and subsequent marking up of the corpus of texts introduce powerful new ways of querying the literature, leading towards the formulation of meaningful questions that can be answered by the computer. We have demonstrated such queries with one example and have successfully tried many others. 
A more thorough evaluation of the system revealed that the availability of full text is crucial for building a retrieval system that covers many biological data types with a satisfying recall rate, and thus is truly useful for curators and researchers. For biologists, an automated system with high recall and even moderate precision (like the current Textpresso) confers a great advantage over skimming text by eye. Textpresso is already a useful system, and thus serves not only as proof of principle for ontology-based, full-text information retrieval, but also as motivation for further development of this and related systems to achieve higher precision and hence even greater time savings. It is apparent that the number of articles available in the C. elegans literature (currently about 6,000) can be curated with the assistance of Textpresso, as it is much more efficient than when done by human readers alone. The larger the corpus of papers, the more useful Textpresso will become. We have shown this by calculating the frequencies of genetic interaction data in sentences in three different cases: random sentences, sentences that contain at least two genes, and sentences returned from a Textpresso advanced query. The efficiency was shown to increase dramatically (39-fold in the best case). We have outlined the first steps of how Textpresso helps the curation effort by extracting gene-gene interactions. Overall, we have shown that Textpresso has several uses for researchers and curators: It helps to identify relevant papers and facts and focuses information retrieval efforts. Indeed, Textpresso is used daily by C. elegans researchers and WormBase curators: The server sends 530 files to requests daily via the Web, a quarter of which are to WormBase curators. Areas for Improvement Textpresso is limited in two ways: the lack of complete coverage of the C. elegans literature and the fact that the ontology and its corresponding lexicon are still in their infancy. 
The preparation of full texts has to be better and more efficient. The conversion of PDF to plain text was problematic because of the different layouts of each journal. Even with the software we developed, a layout template for each journal needs to be written to specify where different components of text can be found. Prior to the use of this software, we had to forgo the use of figure and table captions. Acquisition of processable text is a general problem for biologists. A new release of XPDF (a PDF viewer for X; http://www.foolabs.com/xpdf/) eases this problem considerably (see Materials and Methods). One of our studies on the effectiveness of the extraction of a specific type of biological fact, in this case gene-gene interaction, showed that the machine still cannot replace the human expert, although it increases efficiency greatly. We anticipate that the computer does better with a larger number of articles because of redundancy. While roughly 9% of distinct gene-gene interactions from a corpus of eight journal articles were missed by the human but revealed by Textpresso, 29% of the interactions were missed by Textpresso, primarily due to flaws in the ontology. Advancing the Textpresso ontology will help to increase the specificity of the retrieval system. A deeper, meaningful structure is likely to make extraction easier and more stable. Possible improvements are to include other biological ontologies and language systems, such as UMLS (http://www.nlm.nih.gov/research/umls/) and SNOMED (http://www.snomed.org/), and to establish a more sophisticated tree structure. Our core lexicon recognizes 5.5 tags per sentence (out of an average of 23.7 tags per sentence) that are of scientific interest. This density results in a term coverage of 23.2%, while the maximum that could theoretically be added is 36.5%, assuming that all terms currently not marked up belong to relevant categories. 
An average of 9.5 tags per sentence are apparently of no interest for information retrieval; however, this is due to the nature of human language (and will be nonetheless useful for information extraction purposes). Reevaluation of the corpus of text for terms and their meanings that have been missed is necessary. This process will result in an expansion of our ontology, thus continually expanding the resulting lexicon, or revising the structure of the ontology. Ontology and lexicon revision is most efficiently done by a human, and a feasible automated approach seems out of reach. However, we have illustrated semiautomatic methods to help make this task easier in the future: The containment of words that are not covered in our lexicon with tags serves several purposes. First, we are able to extract all words (or n-grams, which are represented as a consecutive sequence of words embedded in tags), assemble a histogram of the most frequent terms, and add important ones to our lexicon. Second, having identified frequently occurring semantic patterns in the corpus, we are able to infer likely candidates of words for specific categories. For example, one popular pattern that indicates a gene-allele association is . If one now searches for patterns such as and extracts the word enveloped by the tags, then a frequency-sorted list of words that are likely to be alleles can be assembled, presented to a curator for approval, and deposited into the lexicon. The alternative, , would give a list of possible gene names. Many other patterns, identified by statistical means and similarity measures, could be obtained and used in such a fashion. These two methods will help us to systematically and significantly reduce the number of terms not marked up in the corpus, making it more complete. The procedure can be repeated with every build of the Textpresso database and has the advantage that the list of words added to the lexicon is tailored to the literature for which it is used. 
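The first of these two methods, collecting words that the lexicon does not cover and ranking them by frequency, can be sketched as follows. This is a minimal illustration and not the Textpresso implementation: the function name, the whitespace tokenization, and the toy lexicon are all assumptions made for the example.

```python
from collections import Counter

def unmatched_term_histogram(sentences, lexicon, top_n=10):
    """Return the most frequent tokens that are absent from the lexicon."""
    counts = Counter(
        token.lower()
        for sentence in sentences
        for token in sentence.split()   # simplistic whitespace tokenization
        if token.lower() not in lexicon
    )
    return counts.most_common(top_n)

# Hypothetical toy corpus and lexicon.
corpus = [
    "daf-2 suppresses dauer arrest",
    "akt-1 suppresses dauer formation",
]
lexicon = {"daf-2", "akt-1", "arrest", "formation"}

candidates = unmatched_term_histogram(corpus, lexicon, top_n=2)
print(candidates)
```

In this toy run the uncovered words "suppresses" and "dauer" each appear twice, so they would head the list presented to a curator for approval.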
In addition, shortcomings in the general structure of the ontology can be detected and corrected, if those issues have not been caught in the research and development of the information extraction aspects of the system. If the strategy outlined above is applied continually, we will be able to close this gap and reach saturation, even with the addition of new papers and abstracts. About 89% of current users take advantage primarily of the full text and multiple keywords. Some (11%) proceed to keyword plus category. Only 0.3% of users use the advanced retrieval search. It is clear that the implementation of a user test interface improvement/education cycle will greatly help the development of Textpresso and subsequently help users take full advantage of this system. More generally, biologists will become increasingly familiar with ontology-based search engines. Prospects Future development of Textpresso can be undertaken at many different levels. A synonym search could be enabled for keyword searches: After having compiled lists of them, an option could be given to automatically include synonyms for a given term (e.g., genes, cells, cellular component) in a search. Similarly, GO annotations could be used to search for and display sentences involving genes associated with gene ontology terms, after the latter ones have been queried first. As already mentioned, search targeting could be made more flexible: Papers could be subdivided into more sections (such as introduction, methods, results, conclusion, etc.), and a query could then be applied only to the specified sections. In addition, the limitation of searching criteria to just one sentence can be relaxed to a set number of neighboring sentences. Finally, one could improve on links to other databases of relevance besides WormBase and PubMed and increase the wealth of links to the latter ones. An important issue is the portability of the system to other model organism databases. 
This undertaking is part of the Generic Model Organism Database (GMOD) project (http://www.gmod.org), and a downloadable package with software will be made available on their Web site. For a different model organism, parts of the lexicon, and maybe also parts of the ontology, need to be modified. Language and jargon in each community differ, and terms need to be systematically collected to accommodate their specific usage in the respective communities. However, this is not too laborious, as we have been able to generate a yeast version in a few weeks (E. E. Kenny, Q. Dong, R. S. Nash, and J. M. Cherry, unpublished data). We believe that Textpresso can be extended to achieve information extraction. The wealth of information buried in semantic tag sequences of 1 million sentences asks to be massively exploited by pattern-matching, statistical, and machine learning algorithms. Having the whole corpus semantically marked up provides bioinformaticians with the opportunity to develop fact extraction algorithms that might be quite similar to sequence alignment and gene-finding methods, or, more generally, algorithms that have similarity measures at their core, because sentences can now be represented as sequences of semantic tags. Furthermore, semantic sequences of related sentences show properties similar to those of related genomic sequences, such as recurring motifs, insertions, and deletions. The relatively rigid structure of the English language (subject-verb-object) and the comparatively low degree of inflections and transformations certainly help. In addition, some scientific information is stored in a structured manner. We have already started to run simple pattern-matching scripts to populate gene-allele associations from the literature for WormBase, as many of them are written in the form “gene name(allele name),” such as “lin-3(n1058).” Materials and Methods Sources. Textpresso builds its C. elegans database from four sources. 
A collection of articles in PDF format is compiled according to the canonical C. elegans bibliography maintained at the CGC (http://www.cbs.umn.edu/CGC/CGChomepage.htm). As of March 2004 we had around 3,800 (60%) CGC papers in our database. Software developed by us (see below) converts the PDFs to plain text. We import additional bibliographical information from WormBase: titles of documents and author and citation information. WormBase data comprise additional C. elegans-related documents such as C. elegans meeting abstracts and Worm Breeder's Gazette articles. We also curate certain types of data ourselves. Some C. elegans-related papers are not found in the CGC bibliography or WormBase. We compile lists of URLs of journal Web sites and their articles, and links to related articles (provided by PubMed). Citations are prepared in Endnote format for download. Finally, as Textpresso returns scientific text to the user, we construct links to report pages of WormBase that display detailed information about biological entities, such as genes, cells, phenotypes, clones, and proteins. All data and links produced by us are referred to as “Textpresso” data in Figure 4. Ontology. The objective of an ontology is to make the concepts of a domain and the relationships and constraints between these concepts computable. For an ontology to be utilized in a search engine for biological literature, it has to include the language of everyday use and common sense. We have therefore assigned the most commonly used meaning to a word even though it has several meanings in different contexts. We have consequently adopted a strategy of devising an ontology drawing from our own knowledge. Our ontology includes all terms of the three major ontologies of GO, namely “cellular component,” “biological process,” and “molecular function.” The current ontology is unstructured for the sake of straightforward usability, our first priority. 
A variety of approaches were utilized to construct and populate the 33 categories of the Textpresso ontology. We first designed individual categories for well-defined biological units or concepts such as strain, phenotype, clone, or gene. The terms in some of these categories (such as clone, allele, and gene) were represented by a PERL regular expression designed to match any text that looked like that particular biological unit. This was possible where a conserved and unique nomenclature for that biological concept had been established in the literature. Any exceptions to the established nomenclature recorded in WormBase were also added to these categories. For other biological concepts (e.g., “method,” “phenotype,” “cellular component,” and “drugs and small molecules”), we extracted information from publicly accessible biological databases, such as WormBase, GO, and PubMed/NCBI to construct lists of terms. We supplemented these lists through primary literature and textbook surveys. Next, we conceived categories of terms that would describe the relationship between the biological categories. To structure these “relationship” categories, we listed words of the text of 400 C. elegans journal articles for analysis. From this list we flagged natural prose words that we felt had at least some defined meaning within the context of biological literature (for example, “expressed,” “lineage,” “bound,” “required for”). From this list we constructed 14 new categories designed to encapsulate the natural language used by biologists to describe biological events and the relationship between them (action, characterization, comparison, consort, descriptor, effect, involvement, localization in time and space, pathway, purpose, physical association, regulation, spatial relation, and time relation). 
We made a second pass through the subset of flagged words from the list and assigned them to one of these categories according to what the sense of the word was in the biological literature for the majority of the time. Finally, a number of categories were designed to account for syntax and grammatical construction of text, such as “preposition,” “conjunction,” and “bracket.” Names. We have manually curated a lexicon of names because it has proved difficult in the past to automatically recognize names of biologically relevant entities (Fukuda et al. 1998; Proux et al. 1998; Rindflesch et al. 2000; Blaschke and Valencia 2002; Hanisch et al. 2003). We therefore chose to curate and maintain a lexicon with names of interest by hand. In this C. elegans-specific implementation of Textpresso, the effort was helped by the fact that the C. elegans community is somewhat disciplined in choosing names and WormBase includes names of interest. Of course, there is the danger that entities not listed in WormBase (and therefore in our lexicon) will be missed in our system, and those cases are of special interest to curators (of WormBase) and researchers, such as newly defined genes or newly isolated alleles. Dictionaries tend to be incomplete and turn stale rapidly, because of the issues of synonyms, lack of naming conventions, and the rapid pace of scientific discovery. Thus, we do not rely only on WormBase, but maintain an independent, Textpresso-specific part of the lexicon. Technical aspects of the system. Figure 4 shows the details of database preparation. The regular hexagons indicate the sources from which Textpresso is built. The PDF collection was converted to plain text by a software package written by Robert Li at Caltech. 
The development of such a software tool had become necessary, as current PDF-to-text converters do not comply with the typesetting of each journal, i.e., footnotes, headers, figure captions, and two-column texts in general are dispersed and mixed up senselessly in the converted text. The application works with templates that specify the structure and fonts used in a particular journal and uses this information to convert the articles correctly. A high-fidelity conversion is crucial for any information retrieval and extraction application. The software will be made available at the GMOD Web site (http://www.gmod.org). While this manuscript was being written, a new version (2.0.2) of XPDF (http://www.foolabs.com/xpdf/) was released. This version, unlike its predecessors, does a superb job in converting PDF into a congruent stream of plain text. Additional bibliographic data of references for which PDFs are not available are imported from WormBase (symbolized as “WormBase data” in Figure 4). These are mainly abstracts from various meetings. The data collected from our primary sources are treated in two different ways. Author, year, and citation information are deposited “as is” into the database, while abstracts, titles, and full texts are further processed. First, the texts are tokenized. Our tokenizer script reads the ASCII text derived from the conversion from PDF and splits the text into individual sentences based on the end-of-sentence period, where words hyphenated at the end of a line are concatenated and instances of periods within sentences (which are used mainly in technical terms and entity names) are ignored. The script also adds an extra space preceding any instance of punctuation within a sentence, which is a requirement for the Brill tagger (Brill 1992), a publicly available part-of-speech tagger, to attach 36 different grammatical tags to each tokenized word. The tagger has been trained specifically to handle the C. 
elegans literature, and additional tagging rules are applied. For example, gene names are forced to be tagged as nouns. The grammatical tags are not further used in the current Textpresso system. After this preprocessing step, the corpus of titles, abstracts, and full texts is marked up using the lexicon of the ontology (PERL expressions), as explained in Results and exemplified in Figure 1. The tags contain the name of the category as well as all attributes that apply to a matched term. Terms that are not matched by any of the 14,500 PERL expressions are given the tag , one token at a time. The corpus of searchable full texts, abstracts, and titles has 1,035,000 sentences. A total of 351,000 keywords have been indexed, covering 19,180,000 words in the texts. The semantic mark-up yields a total of 24,542,000 tags. Table 3 shows the distribution of tags. The number of meaningful tags (the ones that are not just ) is only 15,577,368, or 15.04 tags per sentence. An average of 5.5 tags per sentence are of scientific interest, i.e., are either biological entities or words that describe a relationship or characterize an entity. When displaying sentences and paragraphs, Textpresso provides links to report pages of several biological entities, such as proteins, transgenes, alleles, cells, phenotypes, strains, clones, and loci. There are a total of 165,000 different entities in WormBase to which Textpresso links, including links to journal articles and PubMed. All these links are produced statically and again deposited on disk for fast retrieval, and these data are referred to as “Textpresso data” in Figure 4. In this way the actual link is not made on the fly from generic URLs, and the response time for queries remains short. We generated an exhaustive keyword and category index for the whole corpus. This index makes the search extremely fast, using rapid file access algorithms. All keywords and tags in the corpus are indexed. 
Also, all terms in the corpus that have a report page in WormBase are indexed. For 2,700 full-text articles and 16,300 abstracts, the index takes up 1.7 Gb. The interfaces for submitting queries and customizing display options are written as CGI scripts. They are supported by simple HTML pages that contain documentation. The Web site runs with a RedHat Linux operating system and an Apache http server. No special changes to the standard configuration are required. The Web interface accesses the custom-made Textpresso database; no commercial-grade database systems have been used. It takes 2–3 d to build the complete 6.9-Gb database. Methodology of evaluation. For the preliminary study, a query was formulated using three category rows of the Textpresso “advanced retrieval” interface to identify sentences containing gene-gene interaction data from a test set of eight full-text journal articles (see Table 5): the PMID:11994313 (Norman and Moerman 2002), PMID:12091304 (Alper and Kenyon 2002), PMID:12051826 (Maduzia et al. 2002), PMID:12110170 (Francis et al. 2002), PMID:12110172 (Bei et al. 2002), PMID:12065745 (Scott et al. 2002), PMID:12006612 (Piekny and Mains 2002), and PMID:12062054 (Boxem and van den Heuvel 2002). In the top row of the advanced retrieval tool the “association” ontology was selected in the “category or keyword” column. No other changes in the first row were made, which implies that no subcategory or specification was selected, and the occurrences of association terms in one sentence were “greater than 0.” In the second row, the Boolean operator “or” and the category “regulation” were selected, with no further specification, again asking the machine to return sentences with at least one regulation term. 
Finally, in the third row, the category “gene” was chosen, with a specification of “named” and an occurrence of “greater than 1.” The Boolean operator to connect this row with the former ones is “and.” All other values remained as default, resulting in no further query specification. As the “advanced retrieval” search engine processes queries sequentially from the top row to the bottom row, this query asks to return sentences with at least one association or regulation term in conjunction with at least two genes mentioned by name. For the semiautomatic information extraction from text, the same query was utilized as above. In addition, sentences that did not mention at least two uniquely named genes were eliminated.
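The logic of this three-row query, (association OR regulation) AND at least two uniquely named genes, evaluated one sentence at a time, can be sketched as below. This is an illustration under stated assumptions, not the Textpresso code: the gene regular expression is a rough approximation of C. elegans gene nomenclature, the category word lists are tiny hypothetical stand-ins for the real lexicon, and all function names are ours.

```python
import re

# Hypothetical stand-ins for the Textpresso lexicon (assumptions for
# illustration only; the real categories contain far more terms).
GENE_PATTERN = re.compile(r"\b[a-z]{3,4}-\d+\b")
CATEGORY_TERMS = {
    "association": {"binds", "binding", "interacts", "associates"},
    "regulation": {"suppresses", "represses", "activates", "regulates"},
}

def count_category(sentence, category):
    """Count words in the sentence that belong to the given category."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return sum(word in CATEGORY_TERMS[category] for word in words)

def named_genes(sentence):
    """Return the set of distinct gene-like names in the sentence."""
    return set(GENE_PATTERN.findall(sentence))

def matches_query(sentence):
    """(association > 0 OR regulation > 0) AND at least two distinct genes."""
    has_relation = (
        count_category(sentence, "association") > 0
        or count_category(sentence, "regulation") > 0
    )
    return has_relation and len(named_genes(sentence)) >= 2

true_hit = "daf-16 suppresses the dauer phenotype of daf-2 mutants."
false_hit = ("lin-35 and lin-53, two genes that antagonize a C. elegans "
             "pathway, encode proteins similar to Rb and its binding "
             "protein RbAp48.")
no_hit = "daf-2 mutants live longer."

for sentence in (true_hit, false_hit, no_hit):
    print(matches_query(sentence), "|", sentence)
```

With this toy lexicon the second sentence is returned because "binding" counts as an association term, even though the sentence describes no genetic interaction. That mirrors the kind of context-free false positive discussed in the Results: matching categories per sentence cannot tell whether the relation term actually connects the two genes.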

              Integrating phenotype ontologies across multiple species

Background The completion of the Human Genome Project [1,2] has resulted in an increase in high-throughput systematic projects aimed at elucidating the molecular basis of human disease. Accurate, precise, and comparable phenotypic information is critical for gaining an in-depth understanding of the relationship between diseases and genes, as well as shedding light upon the influence of different environments on individual genotypes. Natural language free-text descriptions allow for maximum expressivity, but the results are difficult to compute over. Structured controlled vocabularies and ontologies provide an alternative means of recording phenotypes in a way that combines a large degree of expressivity with the benefits of computability. A number of different ontologies have been developed for describing phenotypes, and whilst this is a welcome improvement over free-text descriptions, one problem is that these ontologies are developed for use within a particular project or species, and are not mutually interoperable. This means that it is extremely difficult to combine genotype-phenotype data from multiple databases - for example, if we wanted to search a mouse or zebrafish database for genes associated with a particular set of phenotypes associated with a human disease, this would require mapping between the individual phenotype ontologies. If we are to combine the results of a variety of phenotypic studies, then phenotypes need to be recorded in a structured systematic fashion. At the same time, the system must allow for a high degree of expressivity to capture the wide range of phenotypes observed across a variety of organisms and types of investigation. Here we propose a methodology that can be used to add value to existing phenotype ontologies by mapping them to a common reference framework based on existing standard ontologies. 
We implement this methodology for four active phenotype ontologies, focusing primarily on a phenotype ontology used for the mouse. Our results also cover phenotype ontologies used for human and worm, and some exploratory work on a plant trait ontology to demonstrate the generic utility of the approach. We demonstrate how our approach assists with the ontology development cycle, and we show how the addition of a multi-species anatomical ontology can enable queries across species. Open biological ontologies Ontologies consist of collections of classes, arranged in a relational graph, to provide a computable representation of some domain. Examples of these domains include organismal anatomy, chemical entities, biological processes, phenotypes and diseases. The Open Biological Ontologies (OBO) project was created in 2001 as an umbrella body for the developers of life-sciences ontologies [3]. OBO was largely inspired by and grew out of the Gene Ontology (GO) Consortium. The GO [4] has been recognized as a key component in the integration of biological data, due in part to its wide use by disparate groups and its integration with other ontologies. One of the goals of OBO is to rationally partition the biological domain to minimize overlap between the ontologies, and to ensure logical coherence across ontologies, such that ontologies can be used in combination to describe complex biology. Figure 1 shows the OBO library's partitioning of different kinds of physical objects, from whole-organism scale (anatomy) down to the molecular scale (chemicals and proteins). In this paper we focus on two broad categories of ontology: anatomical and chemical structural ontologies, and phenotype ontologies. Figure 1 OBO-registered ontologies of physical objects, from the molecular scale up to gross anatomical scale. Above the cellular level, anatomical ontologies are partitioned taxonomically (the full breadth of taxonomic coverage in OBO is not shown). 
For mammals there is a second bipartite division, between fully formed structures and developing structures. The former are represented in the Foundational Model of Anatomy (FMA) and the adult Mouse Anatomy (MA), and the latter in the Edinburgh Human Developmental Anatomy (EHDA) and the Edinburgh Mouse Atlas Project (EMAP) ontologies. Anatomical ontologies There are a variety of ontologies representing anatomical entities such as hearts, brains and their parts. The current anatomical ontology space is segregated along taxonomic lines, with an anatomical ontology being maintained by each of the major multi-cellular model organism databases. In addition, there are anatomical ontologies for broader taxonomic groupings, such as teleost fishes and amphibians; these are focused on macroscopic anatomy and are used by evolutionary biologists [5,6]. Whilst this taxonomic division makes sense from an organizational perspective, the lack of a common ontology inhibits cross-species inferences (for example, finding zebrafish genes that are associated with phenotypes similar to those exhibited in a human disease). For the mouse, there are actually two ontologies - the mouse anatomy (MA) [7] and the Edinburgh Mouse Anatomy Project (EMAP) [8] ontologies, representing adult structures and developing structures, respectively. The situation is similar for humans, with adult human anatomy represented comprehensively in the Foundational Model of Anatomy (FMA) [9], and embryonic structures in the Edinburgh Human Developmental Atlas (EHDA). This division complicates queries even within a single species. The taxonomic partitioning of anatomical ontologies is largely at the gross anatomical level; cells and cellular components are represented in the OBO Cell ontology (CL) [10] and the GO cellular component ontology (GO-CC) and are applicable across multiple phyla. 
The decision to attempt to represent the full diversity of life across multiple phyla within these ontologies can complicate the development of the ontology, but the end result is more useful for cross-species queries. Similarly, the Common Anatomy Reference Ontology (CARO) [11] is an upper ontology for anatomy that consists of abstract structural classes that are extended by classes in individual anatomical ontologies in any taxon. CARO helps ensure that different anatomy ontologies are constructed consistently based upon common principles, but it does not attempt to represent specific entities present in different species, such as hearts, blood, eyes, and so on. These anatomy ontologies are arranged as is_a hierarchies and often include additional relations such as part_of and develops_from [12].

Molecular and chemical entity ontologies

Chemical Entities of Biological Interest (CHEBI) is an ontology of chemical entities [13]. The OBO Protein ontology (PRO) [14] is a classification of proteins and protein structures. At this time, PRO is a relatively new ontology, and many biologically important proteins are not yet represented. Combined with the anatomical ontologies mentioned above, these ontologies give broad coverage of physical entities at different levels of granularity, from the molecular scale up to the whole-organism level.

Phenotype ontologies

Phenotype information has traditionally been captured using free-text fields in databases. Whilst this does allow for the full expressivity of natural language, the descriptions are largely opaque to computational inference. For example, if one curator uses the phrase 'increased size of jaw' and another uses the phrase 'mandible hyperplasia' to describe the phenotype associated with alleles of an orthologous gene in two different species, it is difficult for a computer to detect the similarity in these phenotypic descriptions without resorting to error-prone natural language processing techniques.
The success of GO has led several groups and communities to adopt or create phenotype ontologies using species-centric phenotype terminological standards. The structure of these ontologies, with classes arranged in an is_a hierarchy, allows for more intelligent searching and grouping together of genotypes and phenotypes within a species. For example, the database might record an association between a genotype of the mouse Pten gene and the class 'Purkinje cell degeneration' (MP:0005405); this genotype would be returned in a query for 'neurodegeneration' due to the graph structure and the transitivity of the is_a relation (Figure 2). Figure 2 Example portion of the MP, and the equivalence relations between MP classes and EQ descriptions. Paths to the root over is_a links from 'Purkinje cell degeneration' and siblings. The is_a hierarchy is used for query-answering and genotype-phenotype analysis. Queries for 'neurodegeneration' or 'abnormal neuron morphology' should return genes or genotypes associated with 'Purkinje cell degeneration', such as the Pten gene. Note that prior to December 2008 MP lacked the highlighted link (indicated with the asterisk between two bold boxes), which resulted in false negatives for queries to 'neurodegeneration'. Using automated reasoning we were able to infer this link from the logical definitions and associated ontologies. We presented our results to the MP editors, who subsequently amended the ontology to include the link. Examples of these species-centric phenotype ontologies include: the Mammalian Phenotype ontology (MP) [15]; the Worm Phenotype ontology (WP); the Plant Trait ontology (TO) [16]; the Human Phenotype ontology (HP) [17]; the Ascomycete Phenotype ontology (APO); and the Mouse Pathology ontology (MPATH). Whilst these ontologies serve their respective communities well, they are difficult to use for data integration across communities because there is no single ontology that is applicable to all species. 
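The query behaviour described above for 'Purkinje cell degeneration' can be sketched in a few lines of Python. This is a minimal illustration, not the MP itself: the `IS_A` fragment, the intermediate links, and the `ANNOTATIONS` table are hypothetical stand-ins keyed by class label, and the function names are ours.

```python
# Hypothetical fragment of an is_a hierarchy, keyed by class label.
# The labels follow the example in the text; the exact links are illustrative.
IS_A = {
    "Purkinje cell degeneration": ["neuron degeneration"],
    "neuron degeneration": ["neurodegeneration", "abnormal neuron morphology"],
}

# Hypothetical genotype-to-phenotype annotations.
ANNOTATIONS = {"Pten genotype": ["Purkinje cell degeneration"]}


def ancestors(term, is_a):
    """Return the term itself plus every class reachable over is_a links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(is_a.get(current, []))
    return seen


def query(term, annotations, is_a):
    """Return genotypes annotated to the term or to any of its descendants."""
    return sorted(
        genotype
        for genotype, phenotypes in annotations.items()
        if any(term in ancestors(p, is_a) for p in phenotypes)
    )


print(query("neurodegeneration", ANNOTATIONS, IS_A))
```

Deleting the first entry of `IS_A` reproduces the false negative noted in the Figure 2 caption: the same query then returns no genotypes, because nothing connects the annotated class to 'neurodegeneration'.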
PATO quality ontology and post-composed phenotype descriptions

Some model organism communities, such as those for zebrafish and Drosophila, do not use species-centric phenotype ontologies but have instead opted for a compositional approach. That is, instead of choosing from predetermined lists of phenotypes, curators can compose descriptions of phenotypes on-the-fly using a combination of classes from several ontologies, including an ontology of qualities termed Phenotype and Trait ontology (PATO) [18]. These composed descriptions consist of at least two variables: the entity that is observed to be affected (for example, head, liver, Purkinje cell, and so on), and the specific characteristic or quality of that entity that is affected (for example, size, color, shape, structure). This is dubbed the 'EQ' model [19,20]. The E variable is filled with a class from any OBO ontology (for example, FMA, MA, EMAP or CL) and the Q variable is filled with a class from PATO. PATO covers both general qualities (for example, shape) and specific qualities (for example, branched), connected in a hierarchy of is_a relations. This EQ approach has been used in the annotation of human genotype-phenotype associations, as well as in model organism databases such as FlyBase (Drosophila) [21] and ZFIN (zebrafish) [22]. When phenotype descriptions are composed by the annotator at the time of annotation, we say that we are post-composing (or post-coordinating) the description. This is in contrast to the approach exemplified by the MP, in which descriptions are pre-composed (or pre-coordinated) in advance by the ontology editor. Table 1 shows the ontologies and methodologies currently used by different projects. The pre- and post-composed approaches appear incompatible; it may seem that if we are to fully utilize model organism data for both translational and basic research, conformance to a single scheme may be a prerequisite.
On the contrary, these differing methodologies and ontologies are complementary and fully compatible. We can still compute across species using these different approaches provided two criteria are met. First, there must be equivalence statements between classes in pre-composed ontologies and PATO-based EQ descriptions. For example, the MP class 'small ears' can be declared equivalent to the EQ description composed from the PATO class 'small' and the mouse anatomy class 'ear'. This equivalence relationship constitutes a 'logical definition' for the phenotype class. Second, there must be a means of linking across species-centric anatomical ontologies.

Table 1 Genotype-phenotype curation in different projects uses different ontologies and methodologies

Project | Organism | Methodology | Ontologies used | Entities annotated
MGI | Mouse | Pre-composed | MP | Genotypes
NIF | Mouse (neuro) | Post-composed | PATO, NIFSTD | Organisms
WormBase | Caenorhabditis elegans | Both pre-composed and post-composed | WP | Genes
SGD | Saccharomyces cerevisiae | Pre-composed | APO | Genotypes
Gramene | Viridiplantae | Pre-composed | TO | Genotypes
FlyBase | Drosophila melanogaster | Post-composed | PATO, FBbt, GO | Genotypes, alleles
ZFIN | Danio rerio (Zebrafish) | Post-composed | PATO, ZFA | Genotypes
DictyBase | Dictyostelium discoideum | Post-composed | PATO, DDANAT | Genotypes
PATO OMIM-annotation project | Homo sapiens | Post-composed | PATO, FMA, CHEBI, CL, GO | Genotypes (corresponding to OMIM sub-records, for example OMIM:601653.0001)

We exclude annotation efforts that use free text in place of a publicly available ontology or terminology (such as the various genome-wide association study projects), or those not specifically focused on genotypic curation.
NIF: Neuroscience Information Framework; DDANAT: Dictyostelium Discoideum Anatomy Ontology; FBbt: FlyBase anatomy ontology; MGI, Mouse Genome Informatics group at Jackson Laboratory; NIFSTD: Neuroscience Information Framework Standardized Ontology; SGD, Saccharomyces Genome Database; ZFA, Zebrafish Anatomy ontology; ZFIN, Zebrafish Information Network. The lack of a set of equivalence mappings has hitherto been an obstacle to data integration across species using these different annotation approaches. In this paper we describe our methodology for connecting classes in pre-composed ontologies to EQ descriptions using an ontological framework - providing logical definitions for these classes. We illustrate this methodology primarily using the MP, and show that these mappings can be used to assist in ontology development through the use of automated reasoners. We also describe the construction of a multi-species anatomy ontology, which when combined with our EQ descriptions can be used to make cross-species queries. Results Formal representation of phenotypes We logically define phenotypes by making an equivalence relation between classes in the pre-composed phenotype ontology to EQ descriptions, with each such description consisting of the following elements: Q, the type of quality (characteristic) that the genotype affects; E, the type of entity that bears the quality; E2, an additional optional entity type, for relational qualities; M, a modifier. We can then translate the EQ description to an ontology language such as OBO Format or OWL (Web Ontology Language) - this allows us to use powerful general-purpose ontology tools such as automated reasoners to query and manipulate phenotype descriptions, and to compute subsumption hierarchies in phenotype ontologies (Figure 3). Ontology languages have a means of composing descriptions in a logically unambiguous fashion as intersections between classes. 
The modeling strategy used is described in detail elsewhere [23], but a brief summary as background follows here.

Figure 3 Equivalence relations between MP classes and EQ descriptions. Equivalence relations between two MP classes and their equivalent EQ descriptions. Here we treat MP 'degeneration' terms as the PATO quality (Q) 'degenerate', rather than as the process of degeneration. Here the bearer entities (E) are represented in the OBO Cell Ontology (CL). The EQ notation can be translated to logical expressions using Table 2. The dotted line indicates a relationship in the MP that can be independently inferred by a reasoner. CNS, central nervous system.

We use the formal inheres_in relation for relating qualities to their bearers. We treat the phenotype 'femur shape' as the class intersection of (a) the class 'shape' and (b) the class of all things that stand in an inheres_in relationship to a 'femur'. In OBO Format this is written as:

intersection_of: PATO:0000052 ! shape
intersection_of: inheres_in MA:0001359 ! femur

Note that the text after the '!' is merely a comment, not a part of the format, used here to provide the human readable name for that class. This can be read as a genus-differentia style definition, a <genus> that <differentia>. We translate any EQ pair to <Q> that inheres_in some <E>. For relational qualities we use the towards relation to connect the quality to the additional entity type on which the quality depends (for example, the concentration in urine of calcium). Here we use a simple 'EQ syntax' to explain our results, although the underlying representation is in OBO format (OBO Format, 2009). Table 2 shows the mapping between these two schemes. Our equivalence mappings are available in both OBO and OWL formats from the PATO wiki [24], or alternatively from the OBO logical definitions download page [25].
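The translation of an EQ description into the inheres_in and towards forms described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names `eq_to_obo` and `eq_to_manchester` are hypothetical, the inputs are plain strings (IDs or labels), and no validation against any ontology is attempted.

```python
def eq_to_obo(q, e, e2=None):
    """Render an EQ description as OBO Format intersection_of lines.

    q  -- quality class (the genus), e.g. a PATO ID
    e  -- bearer entity class, linked with inheres_in
    e2 -- optional additional entity for relational qualities, linked with towards
    """
    lines = [f"intersection_of: {q}", f"intersection_of: inheres_in {e}"]
    if e2 is not None:
        lines.append(f"intersection_of: towards {e2}")
    return "\n".join(lines)


def eq_to_manchester(q, e, e2=None):
    """Render the same EQ description as an OWL Manchester syntax class expression."""
    expression = f"{q} that inheres_in some {e}"
    if e2 is not None:
        expression += f" and towards some {e2}"
    return expression


# The 'femur shape' example from the text, using the IDs given there.
print(eq_to_obo("PATO:0000052", "MA:0001359"))
# A relational quality: the concentration in urine of calcium (labels, not IDs).
print(eq_to_manchester("concentration_of", "urine", "calcium"))
```

Keeping the quality as the genus and adding each relation as a separate differentia mirrors the OBO fragment shown above, so the two renderings stay in step as further relations are added.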
Table 2 Translation between variables in EQ templates and logic based OBO or OWL class intersections

EQ syntax | OBO syntax | OWL Manchester syntax
E = <E> | intersection_of: <Q> | <Q> that inheres_in some <E>
Q = <Q> | intersection_of: inheres_in <E> |
E = <E> | intersection_of: <Q> | <Q> that inheres_in some <E> and towards some <E2>
Q = <Q> | intersection_of: inheres_in <E> |
E2 = <E2> | intersection_of: towards <E2> |
E = <E> | intersection_of: <Q> | <Q> that inheres_in some <E> and has_qualifier some <M>
Q = <Q> | intersection_of: inheres_in <E> |
M = <M> | intersection_of: has_qualifier <M> |

Phenotypes can be written using EQ syntax or as logical expressions in general purpose ontology languages such as OBO or OWL. Template variables are indicated by the angle brackets. For example, if <E> = 'femur' and <Q> = 'decreased diameter', then the OWL expression would be decreased_diameter that inheres_in some femur. Note that the qualifier relation is not yet in the Relations Ontology and is not formally defined, and is used as a placeholder for now.

We have developed a collection of equivalence mappings from classes in pre-composed phenotype ontologies to PATO-based formal description structures; we call these collections of mappings 'XP' ontologies (the 'XP' stands for cross-product). The descriptions are drawn from the cross-product of two sets of classes: the set of PATO classes and the set of classes from other OBO ontologies. For example, MP-XP is a collection of mappings between individual MP classes and their corresponding EQ descriptions. We can further partition the sets according to this scheme - for example, MP-XP-MA is the collection of such mappings whose descriptions are drawn from the cross-product of PATO classes and MA classes. Note that the mappings are all intended to be ones of equivalence - the EQ description should be neither more general nor more specific than the mapped pre-composed class. In this paper we focus on the MP ontology.
This is partly because of its relevance to translational research, maturity, comprehensiveness (6,844 classes), and to fulfill the data analysis needs of a particular project [20]. However, we also present preliminary results in mapping other pre-composed phenotype ontologies: HP, WP and TO. The last one was chosen to demonstrate the applicability of the technique outside metazoans. The mapping of the portion of HP corresponding to musculoskeletal phenotypes is described elsewhere [17]. The total number of classes, from MP, HP, WP and TO, for which we can map to PATO-based cross-product descriptions are summarized in Table 3. We attempt to achieve maximal coverage by combining initial automated term syntax parsing methods (see Materials and methods section), followed by manual curation of the results to check for biological validity. The MP-XP set has been curated most extensively, and of that set, the MP-XP-CL subset has been analyzed most thoroughly. Table 3 Summary of equivalence mapping results Entity ontologies used Precomposed ontology Total classes (non-obsolete) Classes mapped using PATO Gross anatomy ontology CL CHEBI GO MPATH MP (mouse) 7,048 5,156 (73%) 3421 (MA) 738 294 1,064 194 130 (EMAP) WP (worm) 6,341 1,177 (19%) 324 (WBbt) 32 114 570 HP (human) 8,996 1,762 (20%) 1667 (FMA) 9 43 114 35 TO (plant) 958 398 (42%) 334 (PO) 2 106 2 The number of classes in each pre-composed phenotype ontology is shown, together with the size of the subset of these classes that have been mapped to EQ descriptions. The EQ descriptions can be broken down further into subsets, depending on which ontologies are used. Note the subset numbers are not mutually exclusive, as there are scenarios where an EQ descriptions references multiple ontologies, so the numbers are not additive. PO, Plant Ontology (anatomical structure); WBbt, Worm anatomy ontology. 
Phenotypic mapping groups

The phenotype mappings fell into different overlapping categories, such as those based on basic anatomy, abnormality, compositional descriptions, processes, relational descriptions and absence. These categories are described below, and Table 4 shows examples of these phenotype classes and the breakdown of their EQ descriptions.

Table 4 Examples of equivalence mappings between pre-composed phenotype classes and EQ descriptions

Phenotype class | Bearer (E) | Quality (PATO) | Towards (E2) | Qualifier
MP | | | |
Decreased diameter of femur (MP:0008152) | Femur (MA:0001359) | Decreased diameter (PATO:0001715) | |
Spherocytosis (MP:0002812) | Erythrocyte (CL:0000232) | Spherical (PATO:0001499) | |
Abnormal spleen iron level (MP:0008739) | Spleen (MA:0000141) | Concentration of (PATO:0000033) | Iron (CHEBI:18248) | Abnormal (PATO:0000460)
Situs inversus (MP:0002766) | Visceral organ system (MA:0000019) | Inverted (PATO:0000625) | |
Delayed kidney development (MP:0000528) | Kidney development (GO:0001822) | Delayed (PATO:0000502) | |
Truncated notochord (MP:0004714) | TS20 notochord (EMAP:4109) | Truncated (PATO:0000936) | |
Motor neuron degeneration (MP:0000938) | Motor neuron (CL:0000100) | Degenerate (PATO:0000639) | |
Axon degeneration (MP:0005405) | Axon (GO:0030424) | Degenerate (PATO:0000639) | |
Loss of basal ganglion neurons (MP:0003242) | Basal ganglia (MA:0000184) | Has fewer parts of type (PATO:0002001) | Neuron (CL:0000540) |
Abnormal Purkinje cell dendrite morphology (MP:0008572) | Dendrite of Purkinje cell (GO:0030425^part_of(CL:0000121)) | Morphology (PATO:0000051) | | Abnormal (PATO:0000460)
HP | | | |
Hypoplastic uterus (HP:0000013) | Uterus (FMA:17558) | Hypoplastic (PATO:0000645) | |
Abnormality of vision (HP:0000504) | Visual perception (GO:0007601) | Quality (PATO:0000001) | | Abnormal (PATO:0000460)
Narrow pelvis (HP:0003275) | Pelvis (FMA:9578) | Decreased width (PATO:0000599) | |
WP | | | |
Shrunken intestine (WBPhenotype:0000086) | Intestine (WBbt:0005772) | Shrunken (PATO:0000585) | |
TO | | | |
Leaf area (TO:0000540) | Leaf (PO:0009025) | Area (PATO:0001323) | |
Auxin sensitivity (TO:000163) | Whole plant (PO:0000003) | Sensitivity (PATO:0000085) | Auxin (CHEBI:22676) |

Examples of pre-composed terms from four phenotype ontologies together with their logical definitions expressed as EQ expressions. The phenotype category can be seen from the ontologies used. Basic anatomical phenotypes use an anatomical ontology; unspecified abnormality can be seen in the final column. The one example of a compositional anatomical class (Purkinje cell dendrite) is written as an OBO intersection expression. Processual phenotypes use the GO process ontology, and relational qualities have the E2 column filled in. PO, Plant Ontology (anatomical structure); WBbt, Worm anatomy ontology.

Basic anatomical phenotypes

Most of the classes in the pre-composed phenotype ontologies are gross anatomy phenotypes - they can be defined in terms of a quality of some part of the body. For example: MP:decreased diameter of femur*; MP:hypothalamus hypoplasia; MP:large lymphoid organs; MP:muscular atrophy; MP:truncated notochord*; MP:motor neuron degeneration*; MP:axon degeneration*; HP:narrow pelvis*; TO:leaf area*; WP:shrunken intestine*; MP:situs inversus* (examples marked with an asterisk are shown in Table 4). The first step in creating mappings for these pre-composed phenotypes is selection of the appropriate anatomical ontology. For worm and plant phenotypes, there is a single unified gross anatomy ontology covering each. For human phenotypes from HP, we use the FMA, and although the FMA does not include developing structures, this is not currently a limitation because the HP does not include many phenotypes for developing structures such as 'neural tube'. The MP is intended as a mammalian phenotype ontology. Although most of the phenotypes defined are applicable to all mammals (and sometimes more general taxa) there is a bias towards mouse, as this ontology is generally used for mouse genotype annotation.
This, and the fact that there was no general mammalian anatomy ontology, led us to use solely mouse anatomy (MA) ontologies for the decomposition of MP. We used MA (the adult mouse anatomy ontology) wherever possible. EMAP (Theiler stages 1 to 26) posed a problem due to the lack of generalized classes for developmental structures, such as 'notochord', forcing us to choose an arbitrary time stage-specific class (for example, 'notochord at TS20' to define 'truncated notochord'; Table 4). For cellular phenotypes such as 'motor neuron degeneration' we used CL, which is applicable across all taxa. For subcellular anatomy phenotypes, such as 'axon degeneration', we used the GO-CC ontology (also applicable across all taxa). Many of the anatomical phenotypes are of the form 'abnormal X morphology' or 'increased/decreased size of X', where X is a class in the anatomy ontology or the cell ontology. Equivalence mappings for these were initially generated automatically (see Materials and methods). Manual assistance is required to map clinical terms such as 'situs inversus' (MP) to precise EQ descriptions (see Discussion). The majority of all mapped phenotype classes fall into this category. This holds across all phenotype ontologies, but particularly for HP, which is by nature highly morphological. Abnormality Both MP and HP are ontologies of abnormal phenotypes. Many classes are of the form 'abnormal X', where the exact nature of the abnormality is not specified; for example: MP:abnormal neuroepithelium of ampullary crest; MP:abnormal septation of the cloaca; HP:abnormality of vision*. Here we elide a detailed discussion of what constitutes 'normal' or 'abnormal', as this is beyond the scope of this paper. We simply use a has_qualifier relation to replicate the intended structure of the MP class. Note that the WP does not classify phenotypes as abnormal, but rather as 'variants'. 
Compositional descriptions of anatomical entities Mapping a class such as abnormal Purkinje cell dendrite morphology* (MP:0008572) requires a slight variation on the basic EQ scheme. 'Purkinje cell' is represented in CL, and 'dendrite' is represented in GO-CC, but GO-CC does not specifically pre-compose 'Purkinje cell dendrite'. Logically, this presents no problem, as we can make an anonymous class defined using an intersection construct to specify this entity, using the part_of relation from the Relations Ontology. To accomplish this, we extended the simple EQ syntax such that we can use compositional expressions as IDs [26], and write the following: E = dendrite^part_of(Purkinje_cell) Q = morphology M = abnormal When translating the above EQ description to OBO or OWL we end up with a nested description, for example, in OWL Manchester syntax: morphology that inheres_in some (dendrite that part_of some Purkinje cell) and has_qualifier some abnormal However, tools that are downstream consumers of nested MP-XP class expressions must be able to interpret these appropriately, and the additional expressivity may pose problems for these tools. In addition, we need a way in which to present the descriptions in an intuitive manner to biologists. We therefore extended EQ syntax to include the EW (Entity Whole) tag as below: E = dendrite EW = Purkinje cell Q = morphology M = abnormal This is equivalent to the above EQ description, but is simpler for tools to deal with, and simpler to present in tabular form to users. This approach could be termed 'post-compositional', as the expression denoting the anatomical entity class is created after the anatomical entity ontology is deployed. However, the terminology becomes confusing here, so we reserve the term post-compositional specifically for the creation of such expressions at annotation time. 
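The two presentations of a compositional entity described above, the nested expression and the flattened E/EW tags, can be derived from the same input. The sketch below is illustrative only: the regular expression handles a single `class^relation(filler)` differentia, which is an assumption about the extended syntax rather than a full grammar, and the function names are hypothetical.

```python
import re

# Matches "<class>^<relation>(<filler>)" with exactly one differentia.
_COMPOSED = re.compile(r"^(?P<genus>[^^]+)\^(?P<relation>\w+)\((?P<filler>[^()]+)\)$")


def parse_entity(expression):
    """Split an entity expression into (genus, relation, filler).

    A plain class name yields (name, None, None).
    """
    match = _COMPOSED.match(expression)
    if match is None:
        return expression, None, None
    return match.group("genus"), match.group("relation"), match.group("filler")


def to_manchester(entity, quality, modifier=None):
    """Build the nested OWL Manchester syntax expression for an EQ description."""
    genus, relation, filler = parse_entity(entity)
    bearer = genus if relation is None else f"({genus} that {relation} some {filler})"
    expression = f"{quality} that inheres_in some {bearer}"
    if modifier is not None:
        expression += f" and has_qualifier some {modifier}"
    return expression


def to_flat_tags(entity, quality, modifier=None):
    """Build the flattened tag form; only part_of is mapped to the EW tag."""
    genus, relation, filler = parse_entity(entity)
    if relation not in (None, "part_of"):
        raise ValueError(f"no flat tag for relation {relation!r}")
    tags = {"E": genus, "Q": quality}
    if relation == "part_of":
        tags["EW"] = filler
    if modifier is not None:
        tags["M"] = modifier
    return tags


entity = "dendrite^part_of(Purkinje_cell)"
print(to_manchester(entity, "morphology", "abnormal"))
print(to_flat_tags(entity, "morphology", "abnormal"))
```

The flattened form trades generality for simplicity: it covers only the part_of case, which is why the sketch rejects other relations instead of guessing a tag for them.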
Process oriented phenotypes A significant number of classes in MP are described in terms of a biological process rather than a static description of an anatomical part. Examples include: MP:delayed kidney development*; MP:increased mast cell degranulation; TO:respiration rate; WP:hyperactive egg laying; HP:impaired spermatogenesis. For these classes, we used PATO in combination with GO biological process (GO-BP) classes. PATO is divided at the top level between qualities of biological objects and qualities of processes. The former includes qualities such as size, shape, and structure and is used in conjunction with anatomical classes. The latter includes temporal qualities such as delayed, increased rate and is used in conjunction with GO-BP classes. Chemical entities and relational qualities MP definitions occasionally reference types of chemical entities. For example: MP:hypocalciuria (excretion of abnormally low amounts of calcium in the urine); MP:abnormal spleen iron level*; TO:abscisic acid concentration. Here we used the CHEBI ontology, typically using the CHEBI class as the related entity for a relational quality, where the bearer entity is a body substance such as blood or urine. In EQ syntax we would write the definition of hypocalciuria as: E = urine Q = decreased concentration of E2 = calcium For phenotypes that reference specific proteins such as 'interleukin-1' we can use the OBO PRO. At this time, the PRO does not include many of the required classes but these are easily added to the MP-XP definitions when they become available. Absence or change in number of parts Mutations in or deletions of genes may result in the loss of a body part, or a change in the number of parts. Some example phenotypes are: MP:absent middle ear ossicles; MP:loss of basal ganglia neurons*; MP:alopecia (loss of hair); MP:absent spleen; WP:no oocytes; HP:polydactyly. With PATO we typically describe absence in terms of the entity that is missing the part. 
For example, the following is problematic: Q = absent E = spleen Logically this is incoherent because there is no spleen to possess the quality of non-existence. Instead we can use a cognate 'relational quality' in order to compose a description: E = abdomen Q = lacking all parts of type E2 = spleen This second form is both more coherent and more expressive. For example, in defining 'loss of basal ganglia neurons' we can say: E = basal ganglion Q = has fewer parts of type E2 = neuron This obviates the need for a class 'basal ganglion neuron' (not present in the mouse anatomy ontology or the cell ontology). These PATO classes are grouped under the PATO class 'has number of' and have logical definitions that can be used in reasoning. When translating 'absence' phenotypes to representations in ontology language such as OBO or OWL we have the option of treating the above description as a logical construct called a cardinality restriction. In OWL Manchester Syntax the absent spleen phenotype could be written as: Abdomen that has_part exactly 0 spleen This works for stating a number or number range, but cannot be used to state a relative increase or decrease in number. Another issue with the explicit representation is that it can create inconsistencies if it contradicts what is stated in the anatomy ontology. A full discussion is outside the scope of this paper, but one solution that has been previously proposed is to use non-monotonic logic [27]. Validation using automated reasoners A reasoner can be used to automatically classify (that is, place terms in the is_a hierarchy) a compositional ontology, such as a pre-composed phenotype ontology. We can also reverse the direction of implication, and use reasoners to validate the XP mappings based on the existing asserted is_a links in these ontologies. We used a variety of reasoning strategies to validate the MP mappings to EQs. 
For each pre-composed phenotype ontology, we reasoned over the combined set consisting of the phenotype ontology, the XP mappings, and the ontologies referenced in those mappings. This yielded additional is_a links in the phenotype ontology, which were submitted to the maintainers of the ontology for approval, and often resulted in improvements to the ontology. For example, the reasoner suggested 'Purkinje cell degeneration' is_a 'neuron degeneration' (inferred from the CL is_a hierarchy), which was previously missing from MP, and was promptly added [28]. In other cases the reasoner suggestions were rejected, because of problems in either the XP mappings or the referenced ontologies. To validate this approach, we examined a particular subset, MP-XP-CL, the terms in MP for which there are mappings that involve CL. Using the OBO-Edit reasoner we inferred the existence of 88 possibly missing is_a relationships in MP. These were submitted to the MP curator for review. Of these, 48 were deemed to be correct, and the new links were added to the MP graph. One link was only partially correct, and resulted in a small rearrangement of a portion of the MP graph. Twenty-two links were rejected outright, and traced back to errors in the MP-XP-CL mappings, which were subsequently fixed. The remaining 17 are still pending, and mostly derive from inconsistencies between classification of normal cells in CL and abnormal cells in MP. We also performed a partial validation of the mappings by attempting to recapitulate is_a links asserted in existing phenotype ontologies. We started by removing all is_a links from the phenotype ontology (but not from the ontologies referenced in the mappings) and attempted to recover these links using a reasoner. We found that 37% of the existing links in MP and 14% of the links in HP can be automatically reconstructed (Table 5). 
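The two validation steps described above, proposing missing is_a links and recapitulating asserted ones, can be illustrated with a deliberately simplified rule: a phenotype defined as (Q1, E1) is placed under one defined as (Q2, E2) when Q1 is_a Q2 and E1 is_a E2 in the referenced ontologies. This is a hedged sketch, not the OBO-Edit reasoner. It ignores part_of, towards and qualifiers, does no redundancy removal, and all class labels and data below are hypothetical.

```python
def closure(term, is_a):
    """Return the term plus all of its is_a ancestors (reflexive, transitive)."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(is_a.get(current, ()))
    return seen


def infer_is_a(definitions, quality_is_a, entity_is_a):
    """Infer phenotype is_a links from (quality, entity) logical definitions."""
    inferred = set()
    for child, (child_q, child_e) in definitions.items():
        for parent, (parent_q, parent_e) in definitions.items():
            if child == parent:
                continue
            if (parent_q in closure(child_q, quality_is_a)
                    and parent_e in closure(child_e, entity_is_a)):
                inferred.add((child, parent))
    return inferred


# Hypothetical logical definitions and referenced-ontology links.
DEFINITIONS = {
    "Purkinje cell degeneration": ("degenerate", "Purkinje cell"),
    "neuron degeneration": ("degenerate", "neuron"),
    "axon degeneration": ("degenerate", "axon"),
}
ENTITY_IS_A = {"Purkinje cell": ["neuron"]}
QUALITY_IS_A = {}

# Links asserted by the ontology editors (here, the link of interest is missing).
ASSERTED = set()

inferred = infer_is_a(DEFINITIONS, QUALITY_IS_A, ENTITY_IS_A)
novel = inferred - ASSERTED          # candidate missing links to send for review
recapitulated = inferred & ASSERTED  # asserted links recovered independently
print(sorted(novel))
```

In the recapitulation experiment the comparison runs the other way: the asserted links are held out, and the fraction of them that appears in `inferred` gives the recovery rate.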
Of the false negatives (relationships between mapped classes that we cannot reconstruct), the problem was often an absence of supporting links in the referenced ontologies. For example, MP contains the statement 'asymmetric snout' is_a 'abnormal facial morphology'. At the time of reasoning, the MA contained no relationships linking the classes 'face' and 'snout', which means there is no way to infer the stated MP link from first principles. After discussion, the MA curator (TF Hayamizu, personal communication) added a part_of link to the ontology between 'snout' and 'face', which was sufficient to allow inference of the MP link from the logical definitions. This is an example of how the combination of composing logical descriptions and using a reasoner can contribute to the development of a suite of ontologies, enforcing more consistency with one another. This is a guiding principle of the OBO Foundry. Table 5 also lists the novel relationships inferred by the reasoner; not all have been evaluated, and some will be true positives that will result in additions to the MP, such as the previously mentioned Purkinje cell example. Table 5 Reasoner-inferred links for both human and mouse HP (human) MP (mouse) Number of is_a relationships asserted in ontology 10,162 7,950 Number of is_a relationships that can be inferred automatically 1,421 2,922 Number of novel is_a relationships proposed (unvetted) 407 478 To validate our approach, we attempted to derive existing non-redundant relationships in two phenotype ontologies based on equivalence mappings and external ontologies. The first row is the number of relationships manually asserted by the ontology editors. The second row is the number of these asserted relationships that we can independently infer from first principles. The final row is the number of novel new relationships found by the reasoner - some of these will be false positives, but others will represent genuine missing links in the ontology. 
A higher proportion was recapitulated for mouse than for human because of the higher number of mappings (Table 3; we only expect to recapitulate relationships when we have mappings). One problem we encountered was that the size of the combined ontologies proved too much for existing memory-bound reasoners to handle. We used two strategies to overcome this: using a relational database backed reasoner, which is not memory bound [29]; and ontology segmentation - dividing the reasoned set into manageable subsets. For example, rather than reasoning over all the ontologies referenced in MP-XP, we would select individual pair-wise subsets, such as MP-XP-MA, and reason over these sequentially. Both approaches have strengths and drawbacks; the relational database approach is too slow to be part of the ontology development cycle, and the simple pair-wise strategy can give incomplete results for complex phenotypes involving classes from more than one other ontology.

A multi-species anatomy ontology for translational research

Our results show how classes in phenotype ontologies can be mapped to logical descriptions utilizing species-centric anatomical ontologies plus PATO qualities. These mappings enable us to query a mouse dataset, annotated using MP IDs such as MP:0001314 (corneal opacity), using the MA class 'cornea'. However, if we wish to query across combined multi-species datasets for all morphological phenotypes of the cornea, we need a more generalized class representing that which is shared by all vertebrate corneas. We have commenced construction of such a multi-species anatomical ontology, called Uber-ontology or Uberon. The current version of Uberon consists of over 2,800 classes, and it also contains links to over 9,300 classes in external, mostly species-centric anatomical ontologies. We do not attempt to generalize beyond metazoans [30]. Uberon is available from the main OBO website [31].
Discussion

Completion of the mappings

At the time of writing, MP-XP had the most comprehensive set of mappings (Table 3). The coverage of human phenotypes in HP-XP is poor by comparison for a number of reasons. The HP ontology is newer and, in comparison with MP, contains finer-grained morphological detail (exemplified by classes such as 'Bracket epiphyses of the middle phalanx of the 5th finger', in which 'bracket' denotes a complex morphological phenomenon involving translocation along a radial-ulnar axis). We have recently started working with the editors of the HP ontology to extend PATO with the required morphological qualities and have proposed logical definitions for a further 1,000 classes that we are verifying with the HP editors and the assistance of a clinical geneticist (Peter Robinson, personal communication). The limited number of equivalence mappings for WP and TO reflects the fact that we have thus far focused on organisms more closely related to humans, but we have started working with the developers of these ontologies and training them to make these mappings as part of the ontology development cycle (Jolene Fernandez and Pankaj Jaiswal, personal communication). Even within the relatively comprehensive MP-XP set, 27% of classes remain without a logical definition. For many of these, the gap is due to missing classes in one or more ontologies. For these we make requests for new classes on the relevant OBO tracker and intend to go back and make the XP sets more comprehensive. In particular, we expect higher coverage as PRO becomes more comprehensive. Other classes make reference to pathological anatomical entities, such as hamartomas, which are outside the scope of MP - for these we are exploring the use of the MPATH ontology. At this time we have no good solution for classes such as MP:anhedonia, which require a publicly available behavior ontology (the Mammalian Behavior Ontology was not available at the time of writing).
Logical equivalence between pre-composition and post-composition

Model organism databases and sources of human genotype-phenotype data are divided as to whether they use a pre-composed ontology of phenotype classes (such as the MP) or post-compose descriptions at the time of curation using PATO and other OBO ontologies (Table 1). There are merits and drawbacks to both approaches. The post-composition approach affords a much higher degree of freedom, but this comes with the price of adding complexity to the curation process and the potential to introduce an additional source of curator inconsistency. For example, recently a curator was annotating a paper in which a mutant organism was observed to have its internal organs transposed across the left-right axis of symmetry. An informal poll (OBO-Phenotype, 2007) [32] revealed that different curators would annotate this differently, using different anatomy or PATO classes. A pre-composed ontology such as the MP leaves less room for cross-curator variability: there is a ready-made class 'situs inversus' (MP:0002766) with the text definition 'lateral transposition or mirroring of the viscera of the thorax and abdomen, sometimes incomplete, with all organs maintaining the normal relative position with respect to each other'. In addition, the term 'situs inversus' has been part of the medical lexicon for hundreds of years. This is an advantage of pre-composed ontologies. However, if a curator observes a more specific form of situs inversus (perhaps with certain specific organs inverted), they will have to either request a new class or make do with the more general class. Using a post-compositional approach in which descriptions are composed at the time of annotation gives curators freedom without introducing a bottleneck to the curation process. Happily we can have the best of all possible worlds.
MP-XP includes an equivalence relation between 'situs inversus' and E = 'visceral organ system' [MA:0000019], Q = 'inverted' [PATO:0000625]. This means that annotations can be converted back and forth automatically. In addition, curators employing PATO to post-compose classes can look up MP and MP-XP to determine which E and Q variables to use. In fact a mixed approach based on the work outlined here has been adopted by large scale mouse phenotyping efforts such as EUMODIC [19].

Reconciling static and process-oriented perspectives

Note that there is sometimes a fine line between a process-oriented description and one described in terms of anatomical parts. For example, 'abnormal tooth development' (MP:0000116) could be defined in terms of the anatomical entity 'tooth' and the quality 'morphology' rather than the GO process 'tooth development'. However, this violates our principle that the mappings are formal ones of strict equivalence, as opposed to near-equivalence. In fact, MP declares 'abnormal tooth morphology' as a separate class (MP:0002100). Abnormal tooth development is not the same as abnormal tooth morphology, although they are correlated and presumably frequently observed together. In these situations we opted to make mappings to descriptions that corresponded exactly to the text definition in MP, using GO-BP classes if the phenotype class textual definition indicates a process phenotype. So we define 'abnormal tooth development' using GO and 'abnormal tooth morphology' using MA (Figure 4).

Figure 4 Process and anatomical phenotypes. (a) MP mixes process and anatomical/morphology phenotypes in the same is_a hierarchy. (b) MP-XP maps these to GO-BP and MA based descriptions respectively. (c) GO-BP to Uberon mappings make the link between the process class 'tooth development' and the general anatomical class 'tooth' explicit. (d) Uberon declaration stating that a mouse tooth is a subclass of the more general 'tooth' class.
MP declares 'abnormal tooth development' to be a subtype of 'abnormal tooth morphology' (Figure 4a). The MP-XP mappings (Figure 4b) are insufficient to recapitulate this relationship automatically. We can add further mappings, such as GO-BP to Uberon [30] (Figure 4c) and the MA to Uberon mappings (Figure 4d). This is still insufficient to recapitulate the MP relationship using the axioms provided. However, it may be possible to generalize the logical definition of classes such as abnormal tooth morphology or to add logical rules to PATO such that it is possible to infer abnormal X morphology from abnormal X development using coordinated sets of ontologies. Or, alternatively, infer a new common subsuming phenotype such as 'abnormal tooth morphology or development'. This was outside the scope of the work described in the paper. We expect that using rules such as these will increase the number of relationships that can be recapitulated in pre-composed phenotype ontologies, and increase similarity scores between similar phenotypes that have been observed by different methods. For now we recommend curators follow principles of annotation laid down in [33] and annotate to both the process term and the anatomical structure term when indicated. For example, if it is known that the process of tooth mineralization was disrupted and that abnormal enamel morphology was observed, then curators should make two distinct annotations, one using the GO process class 'tooth mineralization' and another using an anatomical ontology class 'tooth mineral'.

The challenges of coordinated ontology development

In this paper we have demonstrated how reasoners can be used to partially automate the placement of classes in phenotype ontologies. This requires making equivalence relationships between classes and logic-based description expressions. We note that it takes considerable effort to do this retrospectively rather than prospectively.
Our approach here is retrospective - we take existing phenotype ontologies and then attempt to integrate them post hoc. Our preliminary work reveals that cryptic inconsistencies have evolved amongst ontologies that one would expect to be compatible (in that they all should conform to real-world biological knowledge); this will take some time and coordination to fix. For example, CL has 'pancreatic delta cell' as a subtype of 'enteroendocrine cell' but 'abnormal pancreatic delta cell morphology' and 'abnormal enteroendocrine cell' are unrelated in MP. In this case the MP hierarchy is correct, whereas the reference ontology is incorrect. These inconsistencies would continue to go unnoticed without explicit coordination. Although it requires more of an initial effort to build in logical definitions (that is, assign EQ descriptions) from the outset (the prospective approach), we recommend this as a course of action for phenotype ontology development. At the same time, whilst advocating this methodology, we recognize certain problems that need to be addressed. Describing phenotypes across a variety of scales and perspectives requires the use of a wide variety of ontologies. This requires that ontology developers become familiar with these ontologies, and that they coordinate more closely with the development of these ontologies. From a global OBO Foundry perspective this is a good thing, but it must be acknowledged that it requires additional effort from individual ontology developers. A more serious issue is that most reasoners do not scale to the combined union of ontologies within the OBO Foundry. More research on both improving reasoner scalability and ontology segmentation (that is, splitting the ontology into segments such as MP-XP-MA) is required.

Anatomical ontology issues

In many cases we found that the MP was more detailed than the corresponding MA ontology.
For example, the MP contains a class 'abnormal subarachnoid space morphology', but the MA does not contain the class 'subarachnoid space'. Our methodology here is to request classes from the MA editors and use these. Another acceptable approach would be to use ontologies specialized for a particular scientific field, such as the Neuroscience Informatics Framework (NIF) anatomical ontology (see [34] for brain phenotypes). The microscopic anatomical structures represented in the NIF-anatomy are, by design, applicable to both mouse and human; however, in this particular case the NIF-anatomy does not appear to contain the class that is needed. One might also consider using the FMA since it does contain a class 'arachnoid space' - however, we prefer not to mix and match classes from anatomical ontologies dedicated to different taxa as the differentia used for the logical definitions of a single pre-composed phenotype ontology (in this case MP), as this will be problematic for reasoning. We also faced a problem defining classes such as 'truncated notochord'. MA only includes classes for the adult mouse. The EMAP ontology covers Theiler stages 1 to 26; however, EMAP was constructed according to different principles, with the result that there are no is_a relations and no single class 'notochord'. Rather, there are multiple such classes, one for each time stage and with no single general class abstracting over these stage-specific classes. There is also a new ontology EMAPA, which is an abstracted version of EMAP, but this still suffers from the same problem, with stage-specific classes and no is_a relations. Adopting CARO as an upper level ontology may address some of these issues. The same dilemma arises with representing human anatomical entities (the FMA is for adult structures only), although currently most developmental phenotypes declared in the HP have a post-embryonic presentation. 
Uberon and translational research

We expect that perturbations in evolutionarily related genes and pathways across different species will give rise to similar phenotypes. This means that it should be possible to predict the phenotypic and clinical consequences of sequence variants based on genetic knowledge encoded in model organism databases. Previous studies have shown that these correlations can hold within a species for paralogous genes [35]. A major obstacle to extending this approach to orthologous genes is that phenotype data derived from multiple sources and species were semantically incompatible. Now, by using a reasoner-backed database combined with the anatomical associations in Uberon and the mappings between the phenotype ontologies and respective EQ descriptions, we can ask questions and perform analyses in an automated fashion [20]. For example, given a phenotype such as 'corneal opacity' we can query across human, mouse and zebrafish annotations despite the heterogeneity of ontologies involved. This presents a major opportunity for transforming vital model organism data into knowledge of relevance to human health.

Conclusions

We have provided a collection of equivalence mappings between classes in pre-composed phenotype ontologies and PATO-based EQ descriptions. Our mappings span four species. By translating EQ descriptions to logical axioms we used automated reasoners to validate our mappings, and demonstrated that many of the manually stated relationships in phenotype ontologies can be calculated automatically. This result indicates that logical definitions and automated reasoning can be used to make the ontology development cycle more efficient and consistent across ontologies. We have also constructed an anatomical ontology that generalizes over existing metazoan species-centric ontologies.
The combination of this ontology with our EQ mappings can be used to perform powerful translational cross-species queries and analyses of phenotypes recorded in separate databases using different ontologies. We believe that this will become a necessary and integral part of translational research involving genotype-phenotype associations.

Materials and methods

In order to partially automate the generation of logical definitions, we defined an Obol [36] grammar that recapitulated the terminological syntax used in the different phenotype ontologies. For example, many MP class labels use a syntax that follows the simple grammar production rule:

    phenotype → quality bearer

This yields a compositional description in which the quality inheres in the bearer. The terminal symbols in the grammar correspond to pre-composed classes in other ontologies. For example:

    quality → (any PATO label or exact synonym)
    bearer → (any OBO label or exact synonym)

For example, 'big ears' is translated to an OBO genus-differentia definition 'increased_size that inheres_in ears'. In OBO format this is:

    [Term]
    id: MP:0000017 ! big ears
    intersection_of: PATO:0000586 ! increased size
    intersection_of: inheres_in MA:0000236 ! ear

The grammar is context-free, allowing us to have complex expressions describing the bearer; for example:

    bearer → cell_component anatomical_structure

This yields a nested compositional description, which allows us to parse the MP class 'abnormal Purkinje cell dendrite morphology' as an equivalent nested expression. We can do this despite the absence of a pre-composed class 'Purkinje cell dendrite' in the GO cellular component hierarchy. The full set of grammars used can be seen at [37]. We employed a cyclical/iterative approach, with initial automatically generated cross-products manually inspected by two of us (GG and CJM) and fed into a curated cross-product ontology (MP-XP). The results were used to improve the grammar for subsequent runs.
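The production rule phenotype → quality bearer can be illustrated with a few lines of Python. This is only a toy sketch and not the Obol implementation: the lookup tables hold just the 'big ears' example from the text, the synonym entries ('big' for 'increased size', 'ears' for 'ear') are assumptions, and the function names are hypothetical.

```python
# Hypothetical lookup tables. Only the IDs appear in the text; the synonym
# keys ("big", "ears") are assumptions made for this example.
QUALITIES = {"big": ("PATO:0000586", "increased size")}
BEARERS = {"ears": ("MA:0000236", "ear")}


def parse_phenotype(label):
    """Apply the rule `phenotype -> quality bearer` to a class label.

    Returns ((quality_id, quality_name), (bearer_id, bearer_name)), or None
    when no split of the label matches both lookup tables.
    """
    words = label.lower().split()
    # Try every split point: words before it form the quality,
    # words after it form the bearer.
    for i in range(1, len(words)):
        quality = QUALITIES.get(" ".join(words[:i]))
        bearer = BEARERS.get(" ".join(words[i:]))
        if quality and bearer:
            return quality, bearer
    return None


def to_obo_stanza(class_id, label):
    """Render the parsed label as an OBO-format genus-differentia stanza."""
    parsed = parse_phenotype(label)
    if parsed is None:
        raise ValueError(f"cannot parse label: {label!r}")
    (q_id, q_name), (b_id, b_name) = parsed
    return "\n".join([
        "[Term]",
        f"id: {class_id} ! {label}",
        f"intersection_of: {q_id} ! {q_name}",
        f"intersection_of: inheres_in {b_id} ! {b_name}",
    ])


print(to_obo_stanza("MP:0000017", "big ears"))
```

A real grammar would draw its terminals from every PATO and OBO label and exact synonym, and would recurse into the bearer to handle nested cases such as 'Purkinje cell dendrite'; the sketch deliberately omits both.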
In addition, we used reasoners to check the logical entailments of the cross-product definitions. Sometimes this resulted in fixes to the pre-coordinated ontology; other times it revealed inconsistencies in our definitions. The entire process also resulted in numerous fixes to PATO and other OBO ontologies. Once we were confident in our definitions we engaged the editors of the phenotype ontologies more intensively to evaluate the cross-product definitions more thoroughly.

Reasoning methods and tools

We tried a variety of reasoning tools, including OWL-based reasoners such as Pellet, FaCT++ and HermiT [38-40]. We also tried the OBO-Edit reasoner [41], the Obol reasoner and the OBD-SQL reasoner [42]. The only reasoner that could scale over the full set of ontologies plus mappings was the OBD-SQL reasoner, as it is the only reasoner that is not memory bound. For other reasoners we devised an ontology segmentation strategy involving reasoning over individual cross-product sets. For example, MP-XP-MA is the union of MP, MP-XP, PATO and MA. The results reported in this paper were obtained using the OBD-SQL reasoner. This reasoner works by initializing a relational database consisting of all asserted ontology relationships and then iteratively applying rules to derive new relationships until no new relationships can be derived.
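The "apply rules until nothing new can be derived" loop described above can be sketched in memory as follows. This is not the OBD-SQL implementation, which works over a relational database; it is a toy illustration with two assumed rules (is_a transitivity, and subsumption between phenotypes that share a quality and whose bearers are related by is_a). The class hierarchy, the quality name and the function name are hypothetical simplifications, not copied from CL or PATO.

```python
def saturate(asserted, definitions):
    """Derive is_a links by applying two rules until nothing new appears.

    asserted:    set of (child, parent) is_a pairs
    definitions: dict mapping a phenotype class to its (quality, bearer) pair
    """
    links = set(asserted)
    while True:
        new = set()
        # Rule 1: is_a is transitive.
        for a, b in links:
            for c, d in links:
                if b == c and (a, d) not in links:
                    new.add((a, d))
        # Rule 2: two phenotypes with the same quality, where one bearer
        # is_a the other bearer, stand in the same is_a relationship.
        for x, (qx, bx) in definitions.items():
            for y, (qy, by) in definitions.items():
                if x != y and qx == qy and (bx, by) in links and (x, y) not in links:
                    new.add((x, y))
        if not new:
            return links  # fixpoint reached: no rule derives anything new
        links |= new


# Toy cell hierarchy (a simplification, not the actual CL graph).
CL_TOY = {
    ("Purkinje cell", "cerebellar neuron"),
    ("cerebellar neuron", "neuron"),
}
# Toy logical definitions in the style of the XP mappings.
DEFS = {
    "Purkinje cell degeneration": ("degenerate", "Purkinje cell"),
    "neuron degeneration": ("degenerate", "neuron"),
}

inferred = saturate(CL_TOY, DEFS) - CL_TOY
print(sorted(inferred))
```

On this toy input the loop derives 'Purkinje cell' is_a 'neuron' in the first pass and 'Purkinje cell degeneration' is_a 'neuron degeneration' in the second, mirroring the example discussed in the Results.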
Abbreviations

APO: Ascomycete Phenotype ontology; CARO: Common Anatomy Reference Ontology; CHEBI: Chemical Entities of Biological Interest; CL: OBO Cell ontology; EMAP: Edinburgh Mouse Atlas (Theiler stages 1-26); FMA: Foundational Model of Anatomy (adult human anatomy ontology); GO: Gene Ontology; GO-BP: GO biological process; GO-CC: GO cellular component ontology; HP: Human Phenotype ontology; MA: Adult Mouse Anatomy Ontology, developed by the Mouse Genome Informatics group at Jackson Laboratory (Bar Harbor, Maine, USA); MP: Mammalian Phenotype ontology (sometimes MPO); MPATH: Mouse Pathology ontology; NIF: Neurosciences Informatics Framework; OBO: Open Biological Ontologies; OWL: Web Ontology Language; PATO: Phenotype and Trait ontology, an ontology of phenotypic qualities; PRO: Protein Ontology; WP: Worm Phenotype ontology (sometimes WBPhenotype); XP: cross-product (that is, equivalence mapping to a logical definition).

Authors' contributions

CJM conceived of and coordinated the study, drafted the manuscript, created the initial mappings and performed the reasoner analysis. GG maintains mappings and coordinates changes with PATO. CS evaluated MP-XP for biological validity, evaluated reasoner results and coordinated changes with the MP. MAH and CJM conceived of and created Uberon. SEL and MA supervised the work and assisted with the manuscript.

                Author and article information

                Journal
                Systematic Biology (Syst. Biol.)
                Oxford University Press
                ISSN: 1063-5157, 1076-836X
                September 2013
                31 May 2013
                Volume: 62
                Issue: 5
                Pages: 639-659
                Affiliations
                1National Evolutionary Synthesis Center, Durham, NC 27705, USA; 2Department of Biology, University of North Carolina, Chapel Hill, NC 27599, USA; 3Insect Museum, Department of Entomology, North Carolina State University, Box 7613, Raleigh, NC 27695, USA; 4Department of Entomology, Pennsylvania State University, 501 ASI Building, University Park, PA 16802, USA; 5Illinois Natural History Survey, University of Illinois, 1816 South Oak Street, MC 652 Champaign, IL 61820, USA; and 6Department of Ecology, Evolution, and Organismal Biology, Iowa State University, Ames, IA 50011, USA
                Author notes
                *Correspondence to be sent to: National Evolutionary Synthesis Center, Durham, NC 27705, USA; E-mail: balhoff@nescent.org.

                Associate Editor: Thomas Buckley

                Article
                syt028
                DOI: 10.1093/sysbio/syt028
                PMCID: PMC3739881
                PMID: 23652347
                © The Author(s) 2013. Published by Oxford University Press, on behalf of the Society of Systematic Biologists.

                This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/3.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

                Page count
                Pages: 21
                Categories
                Regular Articles

                Animal science & Zoology
