
      Opportunities for text mining in the FlyBase genetic literature curation workflow

      research-article
      Database: The Journal of Biological Databases and Curation
      Oxford University Press


          Abstract

          FlyBase is the model organism database for Drosophila genetic and genomic information. Over the last 20 years, FlyBase has had to adapt and change to keep abreast of advances in biology and database design. We are continually looking for ways to improve curation efficiency and efficacy. Genetic literature curation focuses on the extraction of genetic entities (e.g. genes, mutant alleles, transgenic constructs) and their associated phenotypes and Gene Ontology terms from the published literature. Over 2000 Drosophila research articles are now published every year. These articles are becoming ever more data-rich and there is a growing need for text mining to shoulder some of the burden of paper triage and data extraction. In this article, we describe our curation workflow, along with some of the problems and bottlenecks therein, and highlight the opportunities for text mining. We do so in the hope of encouraging the BioCreative community to help us to develop effective methods to mine this torrent of information.

          Database URL: http://flybase.org


          Most cited references (13)


          The Sequence Ontology: a tool for the unification of genome annotations

          Background Why a sequence ontology is needed Genomic annotations are the focal point of sequencing, bioinformatics analysis, and molecular biology. They are the means by which we attach what we know about a genome to its sequence. Unfortunately, biological terminology is notoriously ambiguous; the same word is often used to describe more than one thing and there are many dialects. For example, does a coding sequence (CDS) contain the stop codon or is the stop codon part of the 3'-untranslated region (3' UTR)? There really is no right or wrong answer to such questions, but consistency is crucial when attempting to compare annotations from different sources, or even when comparing annotations performed by the same group over an extended period of time. At present, GenBank [1] houses 220 viral genomes, 152 bacterial genomes, 20 eukaryotic genomes and 18 archeal genomes. Other centers such as The Institute for Genomic Research (TIGR) [2] and the Joint Genome Institute (JGI) [3] also maintain and distribute annotations, as do many model organism databases such as FlyBase [4], WormBase [5], The Arabidopsis Information Resource (TAIR) [6] and the Saccharomyces Genome Database (SGD) [7]. Each of these groups has their own databases and many use their own data model to describe their annotations. There is no single place at which all sets of genome annotations can be found, and several sets are informally mirrored in multiple locations, leading to location-specific version differences. This can make it hazardous to exchange, combine and compare annotation data. Clearly, if genomic annotations were always described using the same language, then comparative analysis of the wealth of information distributed by these institutions would be enormously simplified: Hence the Sequence Ontology (SO) project. SO began 2 years ago, when a group of scientists and developers from the model organism databases - FlyBase, WormBase, Ensembl, SGD and MGI - came together to collect and unify the terms they used in their sequence annotation. The Goal of the SO is to provide a standardized set of terms and relationships with which to describe genomic annotations and provide the structure necessary for automated reasoning over their contents, thereby facilitating data exchange and comparative analyses of annotations. SO is a sister project to the Gene Ontology (GO) [8] and is part of the Open Biomedical Ontologies (OBO) project [9]. The scope of the SO project is the description of the features and properties of biological sequence. The features can be located in base coordinates, such as gene and intron, and the properties of these features describe an attribute of the feature; for example, a gene may be maternally_imprinted. SO terminology and format Like other ontologies, SO consists of a controlled vocabulary of terms or concepts and a restricted set of relationships between those terms. While the concepts and relationships of the sequence ontology make it possible to describe precisely the features of a genomic annotation, discussions of them can lead to much lexical confusion, as some of the terms used by SO are also common words; thus we begin our description of SO with a discussion of its naming conventions, and adhere to these rules throughout this document. Wherever possible, the terms used by SO to describe the parts of an annotation are those commonly used in the genomics community. 
In some cases, however, we have altered these terms in order to render them more computer-friendly so that users can create software classes and variables named after them. Thus, term names do not include spaces; instead, underscores are used to separate the words in phrases. Numbers are spelled out in full, for example five_prime_UTR, except in cases where the number is part of the accepted name. If the commonly used name begins with a number, such as 28S RNA, the stem is moved to the front - for example, RNA_28S. Symbols are spelled out in full where appropriate, for example, prime, plus, minus; as are Greek letters. Periods, points, slashes, hyphens, and brackets are not allowed. If there is a common abbreviation it is used as the term name, and case is always lower except when the term is an acronym, for example, UTR and CDS. Where there are differences in the accepted spelling between English and US usage, the US form is used. Synonyms are used to record the variant term names that have the same meaning as the term. They are used to facilitate searching of the ontology. There is no limit to the number of synonyms a term can have, nor do they adhere to SO naming conventions. They are, however, still lowercase except when they are acronyms. Throughout the remainder of this document, the terms from SO are highlighted in italics and the names of relationships between the terms are shown in bold. The terms are always depicted exactly as they appear in the ontology. The names of EM operators are underlined. SO, SOFA, and the feature table To facilitate the use of SO for the markup of gene annotation data, a subset of terms from SO consisting of some of those terms that can be located onto sequence has been selected; this condensed version of SO is especially well suited for labeling the outputs of automated or semi-automated sequence annotation pipelines. This subset is known as the Sequence Ontology Feature Annotation, or SOFA. SO, like GO, is an 'open source' ontology. New terms, definitions, and their location within the ontology are proposed, debated, and approved or rejected by an open group of individuals via a mailing list. SO is maintained in OBO format and the current version can be downloaded from the CVS repository of the SO website [10]. For development purposes, SOFA was stabilized and released (in May 2004) for at least 12 months to allow development of software and formats. SO is a directed acyclic graph (DAG), and can be viewed using the editor for OBO files, OBO-Edit [11]. The terms describing sequence features in SO and SOFA are richer than those of the Feature Table [12] of the three large genome databanks: GenBank [1], EMBL [13] and the DNA Data Bank of Japan (DDBJ) [14]. The Feature Table is a controlled vocabulary of terms describing sequence features and is used to describe the annotations distributed by these data banks. The Feature Table does provide a grouping of its terms for annotation purposes, based on the degree of specificity of the term. The relationships between the terms are not formalized; thus the interpretation of these relationships is left to the user to infer, and, more critically, must be hard-coded into software applications. Most of the terms in the Feature Table map directly to terms in SO, although the term names may have been changed to fit SO naming conventions. In general, SO contains a more extensive set of features for detailed annotation. There are currently 171 locatable sequence features in SOFA compared to 65 of the Feature Table. 
There are 11 terms in the Feature Table that are not included in SO. These terms fall into two categories: remarks and immunological features, both of which have been handled slightly differently in SO. A mapping between SO and the Feature Table is available from the SO website [10]. Database schemas, file formats and SO SO is not a database schema, nor is it a file format; it is an ontology. As such, SO transcends any particular database schema or file format. This means it can be used equally well as an external data-exchange format or internally as an integral component of a database. The simplest way to use SO is to label data destined for redistribution with SO terms and to make sure that the data adhere to the SO definition of the data type. Accordingly, SO provides a human-readable definition for each term that concisely states its biological meaning. Usually the definitions are drawn from standard authoritative sources such as The Molecular Biology of the Cell [15], and each definition contains a reference to its source. Defining each term in such a way is important as it aids communication and minimizes confusion and disputes as to just what data should consist of. For example, the term CDS is defined as a contiguous RNA sequence which begins with, and includes, a start codon and ends with, and includes, a stop codon. According to SO, the sequence of a three_prime_utr does not contain the stop_codon - and files with such sequences are SO-compliant; files of three_prime_utr containing stop_codons are not. This is a trivial example, illustrating one of the simplest use cases, but it does demonstrate the power of SO to put an end to needless negotiations between parties as to the details of a data exchange. This aspect of SO is especially well suited for use with the generic feature format (GFF) [16]. Indeed, the latest version, GFF3, uses SO terms and definitions to standardize the feature type described in each row of a file and SO terms as optional attributes to a feature. SO can also be employed in a much more sophisticated manner within a database. CHADO [17] is a modular relational database schema for integrating molecular and genetic data and is part of the Generic Model Organism Database project (GMOD) [18], currently used by both FlyBase and TIGR. The CHADO relational schema is extremely flexible, and is centered on genomic features and their relationships, both of which are described using SO terms. This use of SO ensures that software that queries, populates and exports data from different CHADO databases is interoperable, and thus greatly facilitates large-scale comparisons of even very complex genomics data. Like GFF3, Chaos-XML [19] is a file format that uses SO to label and structure data, but it is more intimately tied to the CHADO project than is GFF3. Chaos-XML is a hierarchical XML mapping of the CHADO relational schema. Annotations are represented as an ontology-typed feature graph. The central concept of Chaos-XML is the sequence-feature, which is any sequence entity typed by SO. The features are interconnected via feature relationship elements, whereby each relationship connects a subject feature and an object feature. Features are located via featureloc elements which use interbase (zero-based) coordinates. Chaos-XML and CHADO are richer models than GFF3 in that feature_relationships are typed, and a more sophisticated location model is used. 
Chaos-XML is the substrate of a suite of programs called Comparative Genomics Library (CGL), pronounced 'seagull' [20], which we have used for the analyses presented in our Results section. The basic types in SOFA, from which other types are defined, are region and junction, equivalent to the concepts of interiors and boundaries defined in the field of topological relationships [21]. A region is a length of sequence such as an exon or a transposable_element. A junction is the space between two bases, such as an insertion_site. Building on these basic data types, SOFA can be used to describe a wide range of sequence features. Raw sequence features such as assembly components are captured by terms like contig and read. Analysis features, defined by the results of sequence-analysis programs such as BLAST [22] are captured by terms such as nucleotide_match. Gene models can be defined on the sequence using terms like gene, exon and CDS. Variation in sequence is captured by subtypes of the term sequence_variant. These terms have multiple parentages with either region or junction. SOFA (and SO) can also be used to describe many other sequence features, for example, repeat, reagent, remark. Thus, SOFA together with GFF3 or Chaos-XML provide an easy means by which parties can describe, standardize, and document the data they distribute and exchange. The SO and SOFA controlled vocabularies can be used for de novo annotation. Several groups including SGD and FlyBase now use either SO or SOFA terms in their annotation efforts. SO is not restricted to new annotations, however, and may be applied to existing annotations. For example, annotations from GenBank may be converted into SO-compliant formats using Bioperl [23] (see Materials and methods). SO relationships One essential difference between a controlled vocabulary, such as the Feature Table, and an ontology is that an ontology is not merely a collection of predefined terms that are used to describe data. Ontologies also formally specify the relationships between their terms. Labeling data with terms from an ontology makes the data a substrate for software capable of logical inference. The information necessary for making logical inferences about data resides in the class designations of the relationships that unite terms within SO. We detail this aspect of the ontology below. For purposes of reference, a section of SO illustrating the various relationships between some of its terms is shown in Figure 1. Currently, SO uses three basic kinds of relationship between its terms: kind_of, derives_from, and part_of. These relationships are defined in the OBO relationship types ontology [24]. kind_of relationships specify what something 'is'. For example, an mRNA is a kind_of transcript. Likewise an enhancer is a kind_of regulatory_region. kind_of relationships are valid in only one direction. Hence, a regulatory_region is not a kind_of enhancer. One consequence of the directional nature of kind_of relationships is that their transitivity is hierarchical - inferences as to what something 'is' proceed from the leaves towards the root of the ontology. For example, an mRNA is a kind_of processed_transcript AND a processed_transcript is a kind_of transcript. Thus, an mRNA is a kind_of transcript. kind_of relationships are synonymous with is_a relationships. 
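The directional, transitive behaviour of kind_of just described is straightforward to operationalize in software. Below is a minimal Python sketch, assuming a small hand-written fragment of the ontology rather than the real SO OBO file; the terms and edges are copied from the examples above purely for illustration:

    # Minimal kind_of (is_a) inference over a hand-written SO fragment.
    # The edges below are illustrative; a real tool would parse the SO OBO file.
    KIND_OF = {
        "mRNA": {"processed_transcript"},
        "ncRNA": {"processed_transcript"},
        "tRNA": {"ncRNA"},
        "rRNA": {"ncRNA"},
        "processed_transcript": {"transcript"},
        "enhancer": {"regulatory_region"},
    }

    def ancestors(term):
        """Every term reachable by following kind_of edges towards the root."""
        seen, stack = set(), [term]
        while stack:
            for parent in KIND_OF.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    def is_kind_of(term, putative_parent):
        return putative_parent in ancestors(term)

    assert is_kind_of("mRNA", "transcript")                  # inferred transitively
    assert not is_kind_of("regulatory_region", "enhancer")   # kind_of is one-way

A subsumption query such as "all features typed as some subtype of ncRNA" then reduces to filtering annotation types with is_kind_of(type, "ncRNA"), which is how tRNAs and rRNAs can be retrieved even though they were never explicitly labelled ncRNA.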
We adopted the 'kind_of' notation to avoid the lexical confusion often encountered when describing relationships, as the phrase 'is a' is often used in conjunction with another relationships in English - for example 'is a part_of'. SO uses the term derives_from to denote relationships of process between two terms. For example, an EST derives_from an mRNA. derives_from relationships imply an inverse relationship; derives. Note that although a polypeptide derives_from an mRNA, a polypeptide cannot be derived from an ncRNA (non-coding RNA), because no derives_from relationship unites these two terms in the ontology. This fact illustrates another important aspect of how SO handles relationships: children always inherit from parents but never from siblings. An ncRNA is a kind_of transcript as is an mRNA. Labeling something as a transcript implies that it could possibly produce a polypeptide; labeling that same entity with the more specific term ncRNA rules that possibility out. Thus, a file that contained ncRNAs and their polypeptides would be semantically invalid. part_of relationships pertain to meronomies; that is to say 'part-whole' relationships. An exon, for example, is a part_of a transcript. part_of relationships are not valid in both directions. In other words, while an exon is a part_of a transcript, a transcript is not a part_of an exon. Instead, we say a transcript has_part exon. SO does not explicitly denote whole-part relationships, as every part_of relationship logically implies the inverse has_part relationship between the two terms. Transitivity is a more complicated issue with regards to part-whole relationships than it is for the other relationships in SO. In general, part_of relationships are transitive - an exon is a part_of a gene, because an exon is a part_of a transcript, and a transcript is a part_of a gene. Not every chain of part-whole relationships, however, obeys the principle of transitivity. This is because parts can be combined to make wholes according to different organizing principles. Winston et al. [25] have described six different subclasses of the part-whole relationship, based on the following three properties: configuration, whether the parts have a structural or functional role with respect to one another or the whole they form; substance, whether the part is made of the same stuff as the whole (homomerous or heteromerous); and invariance, whether the part can be separated from the whole. These six relations and their associated part_of subclasses are detailed in Table 1. Winston et al. [25] argue that there is transitivity across a series of part_of relationships only if they all belong to the same subclass. In other words, an exon can only be part_of a gene, if an exon is a component_part_of a transcript, and a transcript is component_part_of a gene. If, however, the two statements contain different types of part_of relationship, then transitivity does not hold. By addressing the vague English term 'part of' in this way, Winston et al. solve many of the problems associated with reasoning across part_of relationships; thus, we are adopting their approach with SO. The parts contained in the sequence ontology are mostly of the type component_part_of such as exon is a part_of transcript, although there are a few occurrences of member_part_of such as read is a part_of contig. SO's relationships facilitate software design and bioinformatics research Genomic annotations are substrates for a multitude of software applications. 
Annotations, for example, are rendered by graphical viewers, or, as another example, their features are searched and queried for purposes of data validation and genomics research. Using an ontology for sequence annotation purposes offers many advantages over the traditional Feature Table approach. Because controlled vocabularies do not specify the relationships that obtain between their terms, using the Feature Table has meant that relationships between features have had to be hard-coded in software applications themselves; consequently, adding a new term to the Feature Table and/or changing the details of the relationships that obtain between its terms has meant revising every software application that made use of the Feature Table. Ontologies mitigate this problem as all of the knowledge about terms and their relationships to one another is contained in the ontology, not the software. SO-compliant software need only be provided with an updated version of the ontology, and everything else will follow automatically. This is because SO-compliant software need not hard-code the fact that a tRNA is a kind_of transcript; it need merely know that kind_of relationships are transitive and hierarchical and be capable of internally navigating the network of relationships specified by the ontology (see Figure 1) in order to logically infer this fact. This means that every time a new form of ncRNA is discovered, and added to SO, all SO-compliant software applications will automatically be able to infer that any data labeled with that new term is a kind_of transcript. This means that existing graphical viewers will render those data with the appropriate transcript glyph, and validation and query tools will automatically deal with this new data-type in a coherent fashion. Placing the biological knowledge in the ontology rather than in the software means that the ontology and the software that uses it can be developed, revised, and extended independently of one another. Thus ontologies offer the bioinformatics programming community significant opportunities as regards software design and the speed of the development cycle. Using an ontology does, however, mean that software applications must meet certain professional standards; namely, they must be capable of parsing an OBO file and navigating the network of relationships that constitute the ontology, but these are minimal hurdles. SO facilitates bioinformatics research in ways that reach far beyond its utility as regards software design. For example, SO's kind_of relationships provide a subsumption hierarchy, or classification system for its terms. This added depth of knowledge greatly improves the searching and querying capabilities of software using SO. The ontology's higher-level terms may be used to query via inference, even if they are never used for annotation. We recommend that annotators label their data using terms corresponding to terminal nodes in the ontology. Transcripts, for example, might be annotated using terms such as mRNA, tRNA, and rRNA (see Figure 1). Note that doing so means that if, for example, non-coding RNA sequences are required for some subsequent analysis, then SO-compliant software tools can locate annotations labelled with the subtypes of ncRNA, and retrieve tRNAs and rRNAs to the exclusion of mRNAs, even though these data have not been explicitly labelled with the term ncRNA. Thus, many analyses become easy, for example, how many ncRNAs are annotated in H. sapiens? Of these what percent have more than one exon? 
Are any maternally imprinted? Moreover, using SO as part of a database schema ensures that such questions 'mean' the same thing in different databases. SO also greatly facilitates the automatic validation of annotation data, as the relationships implied by an annotation can be compared to the allowable relationships specified in the ontology. For example, an annotation that asserts an intron to be part_of an mRNA would be invalid, as this relationship is not specified in the ontology (Figure 1). On the other hand, an annotation that asserted that an UTR sequence was part_of mRNA would be valid (Figure 1). This makes possible better quality control of annotation data, and makes it possible to check existing annotations for such errors when converting them to a SO-compliant format such as GFF3. To summarize, by identifying the set of relationships between terms that are possible, we are also specifying the inferences that can be drawn from these relationships: that is, the software operations that can be carried out over the data. As a consequence, software is easier to maintain, SO can easily be extended to embrace new biological knowledge, quality controls can be readily implemented, and software to mine data can be written so as to be very flexible. EM operators and SO SO also enables some modes of analyses of genomics data that are completely new to the field. One such class of analyses involves the use of extensional mereology (EM) operators to ask questions about gene parts. Although new to genomics, EM operators are well known in the field of ontology, where they provide a basis for asking and answering questions pertaining to how parts are distributed within and among different wholes (reviewed in [26,27]). These operators are usually applied to studies of how parts are shared between complex wholes - such as different models of automobiles or personal computers - for the purpose of optimizing manufacturing procedures. Below we explain how these same operators can be applied to the analyses of genomics data. Although these operators, difference and overlap, share the same name as topological operators, they are different as they function on the parts of an object, not on its geometric coordinate space. The topological operators, regarding the coincidence of edges and interiors - equality, overlap, disjointedness, containment and coverage of spatial analysis [21] - may also be applied to biological sequence. EM is a formal theory of parts: it defines the properties of the part_of relationship and then provides a set of operations (Table 2) that can be applied to those parts. These operators are akin to those of set theory, but whereas set theory makes use of an object's kind_of relationships, EM operators function on an object's part_of relationships. Only wholes and their 'proper parts' are legitimate substrates for EM operations. Proper parts are those parts that satisfy three self-evident criteria: first, nothing is a proper part of itself (a proper part is part of but not identical to the individual or whole); second, if A is a proper part of B then the B is not a part of A; third, if A is a part of B and B is a part of C then A is a part of C. Note that the third criterion of proper parts is that they obey the rule of transitivity. As we discussed earlier, not all part_of relationships are transitive. Accordingly, we have restricted our analyses (see Results and discussion) to component parts (Table 2). 
Figure 2 illustrates the effects of applying EM operations to analyze the relationships 'transcript is a part_of gene' and 'exon is a part_of transcript'. The EM operations overlap and disjoint pertain to relationships between transcripts, whereas difference and binary product pertain to exons. Two transcripts overlap if they share one or more exon in common. Two transcripts are disjoint if they do not share any exons in common. The exons shared between two overlapping transcripts are the binary product of the two transcripts, and the exons not shared in common comprise the difference between the two transcripts. The binary sum of two transcripts is simply the sum of their parts. One key feature of EM operations is that they operate in 'identifier space' rather than 'coordinate space'. Two transcripts overlap only if they share a part in common rather than if their genomic coordinates overlap. Thus, two transcripts may be disjoint even if their exons partially overlap one another. This is one way in which EM analyses differ from standard bioinformatics analyses, and it has some interesting repercussions. This is particularly so with regard to modes of alternative splicing, as each of the EM operations suggests a distinct category by means of which two alternatively spliced transcripts can be related to one another. We further explore the potential of these operations to classify alternative transcripts and their exons below. Results and discussion As part of a pilot project to evaluate the practical utility of SO as a tool for data management and analysis, we have used SO to name and enumerate the parts of every protein-coding annotation in the D. melanogaster genome. Doing so has allowed us to compare annotations with respect to their parts, for example, number of exons, amount of UTR sequence, and so on. These data afford many potential analyses, but as our motivation was primarily to demonstrate the practical utility of SO as a tool for data management, rather than comparative genomics per se, we have focused more on what exon-transcript-gene part-whole relationships have to say about the annotations themselves, than what the annotations have to say about the biology of the genome. Accordingly, we have used EM-operators to characterize the annotations with respect to their parts, especially with regard to alternative splicing. The current version of FlyBase (5 August, 2004) contained 13,539 genes, (of which 10,653 have a single transcript and 2,886 are alternatively spliced), 18,735 transcripts and 61,853 exons. An EM-based scheme for classifying alternatively spliced genes As we had characterized the parts of the annotations using SO, we were able to employ the EM operators over these parts. This proved to be a natural way to explore the relative complexity of alternative splicing, as the alternatively spliced transcripts have different combinations of parts: that is, exons. We grouped alternatively spliced transcripts into two classes. An alternatively spliced gene will contain overlapping transcripts if at least one of its exons is shared between two of its transcripts, and will have disjoint transcripts if one of its transcripts shares no exons in common with any other transcript of that gene. For the purposes of this analysis, we further classified disjoint transcripts as sequence-disjoint and parts-disjoint. 
We term two disjoint transcripts sequence-disjoint if none of their exons shares any sequence in common with one another; and parts-disjoint if one or more of their exons overlap on the chromosome but have different exon boundaries. Note that the three operations are pairwise, and thus not mutually exclusive. To see why this is, imagine a gene having three transcripts, A, B, and C. Obviously, transcript A can be disjoint with respect to B, but overlap with respect to C. Thus, we can speak of a gene as having both disjoint and overlapping transcripts. The relative numbers of disjoint and overlapping transcripts in a genome says something about the relative complexity of alternative splicing in that genome. A gene may have any combination of these types of disjoint and overlapping transcripts, so we created a labeling system consisting of the seven possible combinations. We did this by asking three EM-based questions about the relationships between pairs of a gene's transcripts: How many pairs are there of sequence-disjoint transcripts? How many pairs are there of parts-disjoint transcripts? How many pairs are there of overlapping transcripts? Doing so allowed us to place that gene into one of seven classes with regards to the properties of its alternatively spliced transcripts. We also kept track of the number of times each of the three relationships held true for each pair combination. For example, a gene having two transcripts that are parts-disjoint with respect to one another would be labeled 0:1:0. Keeping track of the number of transcript pairs falling into each class provides an easy means to prioritize them for manual review. These results are summarized in Figure 3. Of the alternatively spliced fly genes, none has a sequence-disjoint transcript, 275 have parts-disjoint transcripts, and 2,664 have overlapping transcripts, and 53 have both parts-disjoint and overlapping transcripts. The percentage of D. melanogaster genes in each category is shown in Table 3. Most alternatively spliced genes contain at least one pair of overlapping transcripts. These data also have something to say about the ways in which research and management issues are intertwined with one another with respect to genome annotation, as some aspects of these data are clearly attributable to annotation practice. The lack of any sequence-disjoint transcripts in D. melanogaster, for example, is due to annotation practice; in fact, current FlyBase annotation practices forbid their creation, the reason being that any evidence for such transcripts is evidence for a new gene [28]. This is not true for all genomic annotations. Annotations converted from the genomes division of GenBank to a SO-compliant form, were subjected to EM analysis, and inspection of the corresponding gene-centric annotations provided by Entrez Gene [29] revealed examples of genes that fall into each of the seven categories. Some of these annotations are shown in Figure 3. The frequencies of genes that fall into each of the seven classes shown in Table 3 provides a concise summary of genome-wide trends in alternative splicing in the fly. This EM-based classification schema, when applied to many model organisms, from many original sources, makes very apparent the magnitude of the practical challenges that surround decentralized annotation, and the distribution and redistribution of annotations. Certainly, they highlight the need for data-management tools such as SO to assist the community in enforcing biological constraints and annotation standards. 
Only then will comparative genomic analyses show their full power. Exons as alternative parts of transcripts EM-operators can also be used to classify the exons of alternatively spliced genes. Exons shared between two transcripts comprise the binary product of the two transcripts; whereas those exons present in only one of the transcripts constitute their difference (see Table 2 and Figure 2 for more information). These basic facts suggest a very simple, three-part classification system. If an exon is the difference between all other transcripts, then it is only in one transcript; we term these UNIQUE exons. If an exon is the difference of some transcripts, and the binary product of others, it is in a fraction of transcripts; we term these SOMETIMES_FOUND exons. And, if an exon is the binary product of all combinations of transcripts, then it must be in all transcripts; we term such exons ALWAYS_FOUND exons. Classifying exons in this way allows us to look more closely at alternative splicing from the exon's perspective. As can be seen from Table 4, despite the low frequency of alternatively spliced genes, a large fraction of their exons are associated with alternatively spliced transcripts - almost 39%. A sizable proportion of SOMETIMES_FOUND and ALWAYS_FOUND exons are coding exons in some of the transcripts and entirely untranslated exons in others. In some cases, this is due to actual biology: some transcripts in D. melanogaster are known to produce more than one protein (see, for example [30]). In other cases, this situation appears to be a result of best attempts on the part of annotators to interpret ambiguous supporting evidence; in yet others the supporting data sometimes unambiguously points to patterns of alternative splicing that would seem to produce transcripts destined for nonsense-mediated decay [31]. Whatever the underlying cause, these exons, like the N:0:0 class annotations, should be subjected to further investigation. To investigate these conclusions in more detail, we further examined each exon with respect to its EM-based class and its coding and untranslated portions. These results are shown Figure 4, and naturally extend the analyses presented in Table 4. First, regardless of exon class, most entirely untranslated exons are 5-prime exons; the lower frequency of 3-prime untranslated exons is perhaps due to nonsense-mediated decay [31], as the presence of splice junctions in a processed transcript downstream of its stop codon are believed to target that transcript for degradation. A second point made clear by the data in Table 4 is that alternatively spliced genes of D. melanogaster are highly enriched for 5-prime untranslated exons compared with single-transcript genes. Most of these exons belong to ALWAYS_FOUND; thus, there seems to be a strong tendency in D. melanogaster for alternative transcripts to begin with a unique 5' UTR region. This fact suggests that alternative transcription in the fly may, in many cases, be a consequence of alternative-promoter usage and perhaps tissue-specific transcription start sites. The high percentage of untranslated 5-prime UNIQUE exons in D. melanogaster may also be a consequence of the large numbers of 5' ESTs that have been sequenced in the fly [32]. Figure 4 also shows that most (> 95%) D. melanogaster ALWAYS_FOUND exons are coding. 
This makes sense, as it seems likely that one reason for an exon's inclusion in every one of a gene's alternative transcripts is that it encodes a portion of the protein essential for its function(s). As with our previous analyses of alternative transcripts, our analyses of alternatively transcribed exons also illustrate the ways in which basic biology and annotation-management issues intersect one another. The fact that most ALWAYS_FOUND exons are entirely coding, for example, may have something important to say about which parts of a protein are essential for its function(s). Whereas the over-abundance of un-translated UNIQUE exons probably has more to say about the resources available to, and the protocols used by, the annotation project than it does about biology. Such considerations make it clear that the evidence used to produce an annotation is an essential part of the annotation. In this regard SO has much to offer, as it provides a rational means by which to manage annotation evidence in the context of gene-parts and the relations between those parts. Conclusion We have sought to provide an introduction to the SO and justify why its use to unify genomic annotations is beneficial to the model organism community. We illustrate some of the ways in which SO can be used to analyze and manage annotations. Relationships are an essential component of SO, and understanding their role within the ontology is a basic prerequisite for using SO in an intelligent fashion. Much of this paper revolves around the part_of relationship because SO is largely a meronomy - a particular kind of ontology concerned with the relationships of parts to wholes. Extensional mereology (EM) is an area that is largely new to bioinformatics for which there are several excellent reference works available [26,27,33], and even a cursory examination of these texts will make it clear that EM has much to offer bioinformatics. Using all of the relationships in SO allows us to automatically draw logical conclusions about data that has been labelled with SO terms and thereby provide useful insights into the underlying annotations. We have shown how SO, together with the EM-based operations it enables, can be used to standardize, analyze, and manage genome annotations. Given any standardized set of genome annotations described with SO these annotations can then be rigorously characterized. For our pilot analyses, we focused on alternatively transcribed genes and their exons, and explored the potential of EM-operators to classify and characterize them. We believe that the results of these analyses support two principle conclusions. First, EM-based classification schemes are simple to implement, and second, they capture important trends in the data and provide a concise, natural, and meaningful overview of annotations in these genomes. One criticism that might be justifiably leveled against the SO- and EM-based analyses presented here is that they are too formal, and that simpler approaches could have accomplished the same ends. As our discussion of part_of relationships made clear, however, reasoning across diverse types of parts is a complicated process; ad-hoc approaches will not suffice where the data are complex. The more formal approach afforded by SO means that analyses can be easily be extended beyond the domain of transcripts and exons to include many other gene parts and relationships as well - including evidence. 
It seems clear that over the next few years both the number and complexity of annotations will increase, especially with regard to the diversity of their parts. Drawing valid conclusions from comparisons of these annotations will prove challenging. That SO has much to offer such analyses is indisputable. SO and SOFA provide the model organism community with a means to unify the semantics of sequence annotation. This facilitates communication within a group and between different model organism groups. Adopting SO terminology to type the features and properties of sequence will provide both the group and the community the advantages of a common vocabulary, to use for sharing and querying data and for automated reasoning over large amounts of sequence data. Materials and methods SO and SOFA have been built and are maintained using the ontology-editing tool OBO-Edit. The ontologies are available at [34]. The FlyBase D. melanogaster [35] data was derived from the GadFly [36] relational database and converted to Chaos-XML using the Bio-chaos tools. The features were annotated to the deepest concept in the ontology possible, given the available information. For example, the degree of information in annotations was sufficiently deep to describe the transcript features with the type of RNA such as mRNA, or tRNA. It was therefore possible to restrict the analysis to given types of transcript. CGL tools were used to validate each of the annotations, iterate through the genes and query the features. EM-operators were applied to the part features of genes. Other organism data was derived from the genomes section of GenBank [37]. GenBank flat files were converted to SO-compliant Chaos-XML using the script cx-genbank2chaos.pl (available from [19]) and BioPerl [23]. The BioPerl GenBank parser, Bio::SeqIO::genbank was used to convert GenBank flat files to Bioperl SeqFeature objects. Feature_relationships between these objects were inferred from location information using the Bioperl Bio::SeqFeature::Tools::Unflattener code. GenBank Feature Table types were converted to SO terms using the Bio::SeqFeature::Tools::TypeMapper class, which contains a hardcoded mapping for the subset of the GenBank Feature Table which is currently used in the genomes section of GenBank. The same Perl class was used to type the feature_relationships according to SO relationship types. The EM analysis was performed over the Chaos-XML annotations using the CGL suite of modules to iterate over the parts of each gene.
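The type-mapping step described in Materials and methods can be pictured as a lookup from Feature Table keys to SO terms. The fragment below is only a schematic Python rendering of that idea; the entries are illustrative examples, not the authoritative mapping hard-coded in Bio::SeqFeature::Tools::TypeMapper:

    # Schematic Feature Table -> SO type mapping; entries are illustrative only.
    FEATURE_TABLE_TO_SO = {
        "gene": "gene",
        "mRNA": "mRNA",
        "CDS": "CDS",
        "5'UTR": "five_prime_UTR",   # SO spells numbers out and avoids symbols
        "3'UTR": "three_prime_UTR",
    }

    def to_so_type(feature_table_key):
        if feature_table_key not in FEATURE_TABLE_TO_SO:
            raise ValueError("no SO mapping defined for %r" % feature_table_key)
        return FEATURE_TABLE_TO_SO[feature_table_key]

    print(to_so_type("5'UTR"))   # five_prime_UTR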

            Textpresso: An Ontology-Based Information Retrieval and Extraction System for Biological Literature

            Introduction Text-mining tools have become indispensable for the biomedical sciences. The increasing wealth of literature in biology and medicine makes it difficult for the researcher to keep up to date with ongoing research. This problem is worsened by the fact that researchers in the biomedical sciences are turning their attention from small-scale projects involving only a few genes or proteins to large-scale projects including genome-wide analyses, making it necessary to capture extended biological networks from literature. Most information of biological discovery is stored in descriptive, full text. Distilling this information from scientific papers manually is expensive and slow, if the full text is available to the researcher at all. We therefore wanted to develop a useful text-mining tool for full-text articles that allows an individual biologist to locate efficiently information of interest. The natural language processing field distinguishes information retrieval from information extraction. Information retrieval recovers a pertinent subset of documents. Most such retrieval systems use searches for keywords. Many Internet search engines are of this type, such as PubMed (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi). Information extraction is the process of obtaining pertinent information (facts) from documents. The facts can concern any type of biological object (entity), events, or relationships among entities. Useful measures of the performance of retrieval and extraction systems are recall and precision. In the case of retrieval, recall is the number of pertinent documents returned compared to all pertinent documents in the corpus of text. Precision is the number of pertinent documents compared to the total number of documents returned. A fully attentive reader would have complete recall, but low precision, because he has to read the whole body of text to find information. The emphasis for most applications is on recall, and we thus sought a system with high recall and as high precision as possible. Attempts to annotate gene function automatically include statistical approaches, such as cooccurrence of biological entities with a keyword or Medical Subject Heading term (Stapley and Benoit 2000; Jenssen et al. 2001). These methods have high recall and low precision, as no effort is being made to identify the kind of relationship as it occurs in the literature. Another approach has involved semantic and/or syntactic text-pattern recognition methods with a keyword representing an interaction (Sekimizu et al. 1998; Thomas et al. 2000; Friedman et al. 2001; Ono et al. 2001). They have high precision but low recall, because recognition patterns are usually too specific. Other machine learning approaches have classified abstracts and sentences for relevant interactions, but have not extracted information (Marcotte et al. 2001; Donaldson et al. 2003). For a more detailed report of these and related projects, see reviews by Andrade and Bork (2000), de Bruijn and Martin (2002), and Staab (2002). The precision of a keyword search can be increased by searching for combinations of keywords. For example, a researcher might construct a search for “anchor cell” and the gene name “lin-12” because he is interested in learning whether lin-12 plays a role in the anchor cell. However, there are many potential ways to describe the same concept or biological entity. Also, one often wants to search for a category of terms such as any gene or any body part. 
In this case, the intended search might be of a more general nature: If the researcher asks which genes are of interest in the anchor cell at all, he might have a hard time typing in all the known gene names (either one by one or concatenated with the Boolean operator “or”) in combination with the cell name. We therefore sought to develop a system that uses categories of terms such as “gene,” “cell,” or “biological process.” We established these categories of terms and organized them as an ontology, a catalog of types of objects and concepts and their relationships. The categories impart a semantic quality to searches, because the categories are based on the meaning of the entries. In many cases literature databases only contain bibliographic information and abstracts. The latter suffer from the constraint of information compression and convolution imposed by a word limit. Access to the full text of articles is critical for sufficient coverage of facts and knowledge in the literature and for their retrieval (Blaschke and Valencia 2001); our results confirm these findings. We wanted to use the Caenorhabditis elegans literature as a test case for developing a useful information extraction system. C. elegans has a relatively small literature, so in principle we could use it to test a complete, well-defined corpus. We also wanted to support a new database curation effort involving manual literature curation (Stein et al. 2001). Literature curation consists of identifying scientific data in literature and depositing them in an appropriate manner in a database. One extreme curation method is to read through the whole corpus of literature, identifying and extracting all significant information. This approach has the advantage that quality control of the data is done to the highest degree, based on human expertise. However, the volume and growth of biological literature makes it hard to keep the biological database up to date. In addition, data in literature may be missed by oversight, an inevitable flaw of purely human curation. The other extreme curation method is to extract data automatically. We therefore wanted a system that uses the computer to assist the curators. Our system is defined by two key components: the introduction of an ontology and the searchability of full text. The ontology is organized into categories that facilitate broader searches of biological entities as illustrated above. To be useful, it should also contain other categories that are not composed of biological entities, but describe relationships between entities. We sought to offer the user an opportunity to query the literature in the framework of the ontology such that it returns sentences for inspection by the user. We hypothesized that searching the corpus of text with a combination of categories of an ontology could facilitate a query that contains the meaning of a question in a much better way than with keywords alone. For example, if there is a “gene” category containing all gene names and a “regulation” category that includes all terms (nouns, verbs, adjectives, etc.) describing regulation, searching for (at least) two instances of the category gene and one instance of the category regulation in a sentence increases the chance that the search engine will return a sentence describing a gene-gene regulation. The search could then be limited by using a particular gene name as a keyword to get a list of genes that regulate or are regulated by that particular gene. 
Results We have developed a text processing system, Textpresso, that splits papers into sentences, and sentences into words or phrases. Each word or phrase is then labeled using the e X tensible M arkup L anguage (XML) according to the lexicon of our ontology (described below). We then index all sentences with respect to labels and words to allow a rapid search for sentences that have a desired label and/or keyword. The labels fall into 33 categories that comprise the Textpresso ontology. We built a database of 3,800 C. elegans papers, bibliographic information from WormBase, abstracts of C. elegans meetings and the Worm Breeder's Gazette, and some additional links and WormBase entities. See Materials and Methods for details on the database preparation. Textpresso Ontology Abstracts, titles, and full texts in the Textpresso system are processed for the purpose of marking them up semantically by the ontology we constructed. An ontology is a catalog of types of objects and (abstract) concepts devised for the purpose of discussing a domain of interest. An ontology helps to clarify a domain's semantics for everyday use, as is nicely demonstrated by Gene Ontology (GO; The Gene Ontology Consortium 2000). Although GO terms are not intended as a representation of natural language prose, they are a rich source of biologically meaningful terms and synonyms. They are the foundations for three corresponding categories in Textpresso, which are added to its 30 other categories. GO terms comprise approximately 80% of the lexicon. The first group of categories in the Textpresso ontology consists of biological entities: It contains the categories gene, transgene, allele, cell and cell group, cellular component, nucleic acid, organism, entity feature, life stage, phenotype, strain, sex, drugs and small molecules, molecular function, mutant, and clone. We have incorporated the GO molecular function category and proteins in the Textpresso molecular function category. A more detailed list with definitions can be found on the Textpresso Web site, and the most important ones are provided in Table 1. Many of these categories have subcategories. For example, the molecular function category has the subcategories “source = (Go|Textpresso)” and “protein = (yes|no).” As we have imported all terms from GO, the first subcategory makes it possible to search specifically for GO terms. Terms added by us have the attribute “Textpresso.” Similarly, not all molecular function terms are classified as protein. The word “co-transporter,” for example, conveys more of a function and would be used more in this context in the literature, even though its physical realization may in fact be a protein. A list of all subcategories can be found in Table 2. The second group of categories comprises terms that characterize a biological entity or establish a relation between two of them. It includes physical association (in the sense of binding) and consort (abstract association), effect, purpose, pathway, regulation, comparison, spatial and time relation, localization in time and space, involvement, characterization (terms that express the characterization of something), method, biological process, action, and descriptor (words that describe the state or condition of an entity). These categories, while well defined, have somewhat delicate boundaries, and the common-sense aspects of our ontology apply more to this group. It is likely that its categories are going to be changed as we continue to develop the system. 
In some instances terms are attributed to one category, even though they might as well fit into another. As an example, the term “coexpress” is put in the “consort” category to emphasize the concurrent aspect of the process, while it could as well be classified as a biological process. However, we believe that in most cases the first sense of the word is used in the literature. The last group (auxiliary) contains categories that can be used for more involved semantic analysis of sentences. These categories are auxiliary (forms of the verbs “be” and “have”), bracket, determiner, conjunction (and, or, because, since, although, etc.), conjecture (could, might, should, suggests), negation, pronoun, preposition, and punctuation. Some of them overlap with the syntactic categories that the part-of-speech tagger (used in the preprocessing steps; see Materials and Methods) assigns to terms, but are repeated here as they also contain some semantic component. The category “conjecture” is introduced to distinguish statements that convey hypotheses, speculations, or theoretical considerations from sentences that are expressed with confidence, thus representing more of a fact. The words of this category indicate the certainty of a statement. The Textpresso ontology is organized into a shallow hierarchy with 33 parent categories. The parent categories may have one or more subcategories, which are specializations of the parent category. For example, all of the terms in the parent category “biological process” will belong to one of its subcategories, “transcription,” “translation,” “expression,” “replication,” “other,” or “no biosynthesis.” This is user friendly and certainly serves the current implementation of the user interface well, which is oriented more towards information retrieval. The ontology is populated with 14,500 Practical Extraction and Report Language (PERL) regular expressions, each of which covers terms with a length from one to eight words. These expressions are contained in a lexicon. Table 3 shows examples of regular expressions for each category and examples of text strings matching them. Each regular expression can match multiple variable patterns. The multiple forms of regular verbs, for example, can be conveniently expressed as “[Ii]nteract(s|ed|ing)?” which stands for the eight cases “interact,” “interacts,” “interacted,” “interacting,” “Interact,” “Interacts,” “Interacted,” and “Interacting.” All regularly named C. elegans genes are matched with the expression “[A–Za–z][a–z][a–z]–\d+” matching three letters ([A–Za–z][a–z][a–z]), a dash (–), and a sequence of digits (\d+). As this example illustrates, the expressions can be made case sensitive. This is important as biological nomenclature becomes more elaborate, and the ability to distinguish subtle differences is pivotal for separating terms into the correct categories. Many of the regular expressions are generated automatically via scripts, taking a list of plain words as input and transforming them as shown in this example, to account for regular forms of verbs and nouns. The text-to-XML converter (see Materials and Methods) marks up the whole corpus of abstracts, full texts, and titles and produces XML documents. Figure 1 illustrates this process with an example. The computer identifies terms by matching them against regular expressions (such as the one shown above) and encloses them with XML tags. The tag serves as a containment of terms not semantically marked up. 
These tags will be used for a repeated reevaluation of the lexicon, as these terms can be easily pulled out and analyzed. A list of the most frequently missed terms is then produced and included in the lexicon for the next markup. Applications of Textpresso The marked-up text is stored in a database and can be queried. We built a user interface for general queries and another one for a specific type of query for WormBase curators (gene-gene interactions; see below). Textpresso is used in several related ways. Individual biologists use it to find specific information. Database curators, whose job is to extract information from papers or abstracts and to add this to a database, use it repeatedly to find all information of a particular type, in addition to using it for individual queries. The current Textpresso user interface (http://www.textpresso.org/) includes a query interface, a side menu with links to informative pages about the ontology, a document type definition, a user guide, and example searches, as well as the two retrieval and customization interfaces. The Web site offers two different types of retrieval, simple and advanced. Options for the retrieval queries are offered: searching a combination of categories, subcategories, and keywords in a Boolean fashion, specifying the frequency of occurrences of particular items, and choosing where in the article to search (title, abstract, body). The user can also determine whether a query is to be met in the whole publication or in a sentence. These options make the search engine powerful; for example, if a query is met in the whole article, the search has the function of text categorization, while meeting it in a sentence aims at extracting facts, which can be viewed in the context of a paragraph. The specification of cooccurrence determines the character of a search. If a combination of keywords and categories is found in a sentence, the likelihood that a sentence contains a fact involving the chosen categories and keywords is quite high. If the user chooses cooccurrence within a document, he is more interested in finding a relevant document. The scope of a search can be confined to full text, abstract, title, author, year, or any combination thereof, for document searches as well as sentence searches. A typical result page shows a list of documents with all bibliographical information and the abstract as displayed in Figure 2. A simplified version of the Textpresso interface is incorporated within WormBase (http://www.wormbase.org). The result list retrieved by a query can be customized in such a way that the user can choose how to display the information. This list is sorted according to the number of occurrences of matches in the document, so the most relevant document will be on the top of the list. A series of buttons for the whole list as well as for each document is available, allowing the user to view matching sentences or prepare search results in various formats. The individual result entries have up to six links: One can view matches for each paper only, go to the Web site of the journal to read the online text of the article (this only works if the user is subscribed to the journal), view a list of related articles that is provided by PubMed, export the bibliographical information into Endnote (two different links), or, if the user is accessing Textpresso internally (currently at Caltech), one can download the PDF of the paper. The power of Textpresso's search engine unfolds when category searches are used. 
By searching for a category, the researcher is targeting all keywords that populate that category. For example, the researcher might be interested in facts about genetic regulation of cells. Assuming that many facts are expressed in one sentence, he would search for the categories “gene,” “regulation,” and “cell or cell group” in a sentence. He can then view the matches (and surrounding sentences) of the search return and decide which facts are relevant. If one is not interested in all genetic regulation instances mentioned in the literature, it might be more useful to combine keywords with categories. For example, the question “What entities interact with ‘daf-16' (a C. elegans gerontogene)?” can be answered by typing in the keyword “daf-16” and choosing the category “association.” Advanced Retrieval and Subcategories An extension (the advanced retrieval interface) allows the use of the subcategories of the ontology and the specification of Boolean operators, thereby concatenating categories and keywords with “or” or “not” to permit alternatives or exclude certain items. One special subdivision of terms is the distinction between named and unnamed entities: Categories can include both general terms and specific names of entities. For example, the word “gene” would be an unnamed term of the gene category, while “lin-11” is a named entity. The general terms will likely be used for fact extraction across several neighboring sentences, but they might also be useful for retrieval purposes, even though the rate of false positives might be much higher in the latter case. Lastly, the user can determine how a keyword or category term has to be matched numerically. The options “greater than,” “less than,” and “equal to” are available together with a drop-down menu for the number of occurrences. With these additional tools, document categorization can be made more effective. A detailed profile of which categories and keywords should occur a minimum, maximum, or exact number of times for triggering a match can be established. Similarly, searches on the sentence level acquire a semantic quality, i.e., they at least partially encompass a meaning. In many cases, the answers to questions, phrased in the form of a sophisticated query, can immediately be read off the result screen. If, for example, one were to ask in which cells lin-11 is expressed, one would search sentences for a combination of the category “biological process” (subcategory “biosynthesis: expression”), the category “cell or cell group” (subcategory “type: name”) and the exact keyword “lin-11.” The subcategory “expression” filters out all words that relate to expression, the subcategory “name” limits the search to specific cells which have a name, such as “anchor cell,” “HO neurons,” “IL sensillum,” etc. Other subcategory options would be “group” (for example, “head,” “vulva,” “tail”) and “lineage” (“AB lineage,” “EMS lineage,” etc.). To better understand the following results, note that the term “cell(s)” has the type “name,” to gain the correct meaning of phrases such as “AB lineage cells.” The first two words of this phrase are marked as lineage, but the last word makes the whole phrase named cells. The system returns sentences of different quality. Some of them answer the question posed immediately (returned sentences are taken from Gupta and Sternberg 2002; that paper produced the most hits). 
The underlined words mark the matched items: “An analysis of the expression pattern of lin-11 in vulva and uterine lineage cells earlier suggested that cellular defects arise due to a failure in the differentiation process”; “Our analysis of the expression of lin-11 in VPC granddaughters (Pn.pxx stage) has revealed the following pattern in P5.p and P7.p lineage cells (from anterior to posterior; L, low; H, high), LLHH and HHLL, respectively.” Other sentences are matched more by accident: the terms co-occur within a sentence, but the statement does not really express the fact sought. The cells where lin-11 is expressed might be inferred by the knowledgeable reader, but are not stated explicitly: “Our results demonstrate that the tissue-specific expression of lin-11 is controlled by two distinct regulatory elements that function as independent modules and together specify a wild-type egg-laying system”; “Using a temporally controlled overexpression system, we show that lin-11 is initially required in vulval cells for establishing the correct invagination pattern.” Finally, some sentences simply give no clue about the posed question: “lin-11 cDNA-expressing vectors under the control of lin-11-AB (pYK452F7-3) and lin-11-C (pYK452F7-2) elements were designed as follows.” Here, “AB” is marked up as a named cell, but this is not the semantically correct tag in this context. This false positive might have been prevented if specific sections of a paper could be searched, as this statement comes from the methods section. Evaluation of the Textpresso System An automatic method for retrieving or extracting information from text is only useful if it is as accurate and reliable as human curation. We devised two tests based on two common tasks performed by human experts who extract biological data from journal articles. The first task was the automatic categorization of papers according to the types of biological data they contain. Our study used a large test set of papers scanned by a curator to examine the effectiveness of automatically searching for information in the full text of a journal article compared to its abstract. The second task focused on retrieving sentences containing a specific type of biological data from text. Sentences from eight journal articles were manually inspected on a sentence-by-sentence basis and compared to the return from a Textpresso query on the same articles. From this study we present a detailed error analysis outlining the strengths and weaknesses of the current Textpresso system as an automatic method for information retrieval. We evaluated the performance of Textpresso using the information extraction performance metrics of precision, which measures the proportion of returned data that is true rather than false, and recall, which measures the proportion of the true data in the corpus that is returned. These values are formulated as recall = number of true returns / total number of true data items and precision = number of true returns / total number of returns. Classification of Journal Articles: Full Text Versus Abstract We examined the effectiveness of automatically identifying journal articles that contain particular types of data. A test set of 965 journal articles pertaining to C. elegans biology was assessed by a human expert and categorized into groups according to six different types of data (antibody data, ablation data, expression data, mapping data, RNAi data, and transgenes).
Note that there can be more than one data type per article. We first measured the value of searching for keywords in the full text of an article as opposed to searching its abstracts (Table 4). The overall information recall when searching abstracts is low (∼44.6%) compared to the information recall when searching full text (∼94.7%). Furthermore, keywords for some specific types of data (e.g., antibody data, mapping data, transgene data) are very unlikely to appear in abstracts (∼10% recall) but can be found in full text (∼70% recall). However, precision of the keyword search is reduced by almost 40% when searching full text compared to abstracts (30.4% and 52.3%, respectively). Single keyword searches of full text return a large number of irrelevant documents for most searches. This higher false positive rate might reflect the writing style found in full text, where facts can be expressed within complex sentence structures (as compared to abstracts, where authors are forced to compress information), combined with the inability of a keyword search to capture context. Small-Scale Information Retrieval Study We tested the accuracy of a search combining word categories and keywords to retrieve sentences containing genetic interaction data. For this experiment we broadly defined genetic interaction as the effect of one or more genes on the function of another gene or genes (and thus it includes genetic interaction, regulation, and interaction of gene products). To directly assess how Textpresso performs, a human expert manually evaluated the text sentence by sentence (Figure 3). We formulated a Textpresso query that searched for the presence of at least two genes mentioned by name and at least one term belonging to the “regulation” or “association” word categories (see Materials and Methods). A total of 178 sentences were matched for this query in the eight journal articles, and the results are shown in Table 5. A human expert assessed the returned sentences and determined that 63 sentences contained gene-gene interaction data according to our criterion. The same set of journal articles had been independently manually evaluated for their description of genetic interactions, and 73 true sentences were identified. In both cases, information from the article title, abstract, contents of tables, and reference section was excluded. Sentences that described genetic interaction using the gene product name rather than the gene were also excluded from this study. To measure recall, we first determined the total number of sentences that contained genetic interaction data. For this analysis we took the union of true sentences manually identified in the journal articles and the true sentences returned by Textpresso. The total number of true sentences identified by the two methods was 102. The recall of sentences containing genetic interaction was ∼62% using Textpresso compared to ∼71% for those sentences manually identified in journal articles. One-third of the sentences returned by Textpresso were true positives (35%). Although the numbers of true sentences retrieved by the automatic and manual methods were similar (63 and 73, respectively), only 34 of these sentences overlapped. To investigate this discrepancy, we manually extracted the genetic interactions described in both sets of sentences and determined the number of distinct genetic interactions found by each method (Table 6). 
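As a quick sanity check on these figures, the recall and precision definitions given above can be applied directly to the counts from this small-scale study; the short Perl snippet below is just that arithmetic and is not part of the Textpresso pipeline.

    # Recomputing the reported recall/precision from the counts in the
    # small-scale gene-gene interaction study (not part of Textpresso itself).
    use strict;
    use warnings;

    my $returned     = 178;  # sentences returned by the Textpresso query
    my $true_returns = 63;   # returned sentences judged to describe interactions
    my $manual_true  = 73;   # true sentences found by manual inspection
    my $total_true   = 102;  # union of true sentences from both methods

    printf "Textpresso recall:    %.1f%%\n", 100 * $true_returns / $total_true;  # 61.8
    printf "Textpresso precision: %.1f%%\n", 100 * $true_returns / $returned;    # 35.4
    printf "Manual recall:        %.1f%%\n", 100 * $manual_true  / $total_true;  # 71.6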
The sentences manually identified from the journal articles yielded 23 more distinct genetic interactions than those extracted from true sentences retrieved by Textpresso. However, 43 interactions derived from the Textpresso output overlapped with the manually identified set, and Textpresso located sentences describing seven genetic interactions that the human expert missed. The average redundancy (how many times the same gene-gene interaction occurred) of a distinct genetic interaction extracted by both the manual and automatic methods was 3-fold. We analyzed the gene-gene interaction sentences missed by Textpresso. In many cases (65%) the word or phrase used to describe the genetic interaction belonged to neither the “association” nor the “regulation” word category, and so the sentence was not returned. In some cases, the term or phrase that determined “genetic interaction” belonged to some other Textpresso word category (e.g., some terms that implied genetic interaction and were not matched by the query were “epistatic,” which belongs to the “consort” word category, and “alters,” which belongs to the “effect” word category). This type of analysis is useful for revising and updating the ontology. In other cases, due to the intricacies of natural language prose, it was difficult to isolate an interaction term in the sentence (e.g., “Thus ref-2 alone is insufficient to keep P(3–6).p unfused when lin-39 is absent.”). Approximately 8% of true sentences were missed because the genetic interaction information was discussed over a number of sentences. This is a limitation of the current Textpresso system, as search queries are matched per sentence (or per entire article). Our analysis of the false positive sentences returned by Textpresso revealed that approximately 10% discussed gene-gene interactions that did not occur (e.g., “Neither pdk-1(gf) nor akt-1(gf) suppressed the Hyp phenotype of age-1(mg44).”). While we do have a “negation” category in our Textpresso ontology, we chose not to exclude negation terms from the posed query, to avoid missing true positives (in case the negation applies not to the interaction term in a sentence, but to some other portion of it). Twenty-one percent of the false positive sentences were judged on inspection to suggest genetic interaction, but were too weakly phrased to extract the information with confidence without the context of the sentence. However, the majority of false positives (70%) were due to the lack of context of the search terms in the sentence, where they matched the query terms (underlined) but in a context that did not mention genetic interaction: “lin-35 and lin-53, two genes that antagonize a C. elegans pathway, encode proteins similar to Rb and its binding protein RbAp48.” This example strongly supports the idea that an information extraction method that considers the semantic context of a search query would dramatically increase the precision of the return. Large-Scale Information Retrieval to Expedite Information Extraction We performed extraction of genetic interaction information from a corpus of 3,307 journal articles. A Textpresso query searched for the presence of at least two uniquely named genes and at least one term belonging to the “regulation” or “association” word categories (see Materials and Methods for more details). A total of 17,851 sentences were returned by this query.
Due to the lack of context of some sentences, true sentences were determined by a more stringent definition of genetic interaction, i.e., one in which one or more named genes were described as modifying the phenotype of another named gene or genes by suppression, enhancement, epistasis, or some other genetic method. To determine the frequency of true sentences, a random sample of 200 of the sentences returned by Textpresso was evaluated by a human expert according to this more stringent criterion (Table 7, column C). This sample was compared to 200 sentences chosen from the whole corpus at random (Table 7, column A) and 200 sentences randomly chosen from the whole corpus that contained two or more named genes (Table 7, column B). A typical sentence that was determined to be true for genetic interaction data is “Interestingly, at lower temperatures, the akt-2(+) transgene can supply sufficient Akt/PKB activity to weakly suppress the dauer arrest caused by age-1(mg44).” Some of the sentences strongly suggested genetic interaction but did not quite meet the genetic interaction criterion. These were grouped as “possible genetic interaction,” for example, if a phenotype was not mentioned: “For example, lin-15(lf) animals display a 54% penetrance of P11 to P12 fate transformation, while all egl-5(lf);lin-15(lf) double mutants show a P12 to P11 fate transformation.” Sometimes it is unclear exactly which genes are participating in the genetic interaction: “Evidently the effect of the sir-2.1 transgene alone is too subtle to trigger dauer formation without the sensitizing daf-1 or daf-4 mutations.” Another group was highlighted as discussing interaction, but fell outside the criterion set for genetic interaction. These were classified as “non-genetic interaction.” Some examples of this are sentences that specify gene regulation: “These studies have shown that smg-3(Upf2) and smg-4(Upf3) are required for SMG-2 to become phosphorylated.” Finally, sentences that describe physical interaction were also put into the category “possible genetic interaction”: “For example, GLD-1 represses translation of tra-2, one of the sex-determination genes, by binding to the 3′-UTR of the tra-2 mRNA (Jan et al. 1999).” This analysis shows that there is a 1 in 200 chance that a sentence discussing genetic interaction (as defined above) occurs at random in the full text of the journal articles analyzed. The odds increase to 7 in 100 if one looks at sentences containing at least two named genes. The returned matches from the Textpresso search are enriched 39-fold for genetic interaction compared to random chance, and there is a significant 3-fold enrichment when compared to sentences containing at least two named genes. There is a 1 in 5 chance that a returned Textpresso match is true. To date, 2,015 of the 17,851 returned sentences have been evaluated. Of these, 370 discuss genetic interaction, yielding 160 distinct gene-gene interactions mined from the literature. There are 213 sentences that mention non-genetic interactions, and 419 sentences are classified as possible genetic interactions. Large-Scale Simple Fact Extraction We have extracted gene-allele-reference associations from the corpus of papers to populate the WormBase database by searching for a pattern in which a tagged gene name is immediately followed by a bracketed allele name (the “gene name(allele name)” form described in Materials and Methods). Of the 10,286 gene-allele associations extracted, 9,230 were already known to WormBase, while 1,056 associations were new and could be added to the database. In addition, 1,464 references could be added to the 2,504 allele-reference associations in WormBase.
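The surface form of these associations is simple enough that the core pattern can be sketched in a few lines of Perl. This is an illustrative approximation only: the actual Textpresso scripts operate on the semantically tagged corpus rather than on raw text, and the exact expressions they use are not reproduced here.

    # Illustrative extraction of "gene(allele)" mentions such as "lin-3(n1058)"
    # from plain text; the real scripts work on the tagged corpus instead.
    use strict;
    use warnings;

    my $text = 'lin-3(n1058) and let-23(sy1) were examined.';

    while ( $text =~ /\b([a-z]{3}-\d+)\(([a-z]{1,3}\d+)\)/g ) {
        print "gene $1 -> allele $2\n";
    }
    # gene lin-3 -> allele n1058
    # gene let-23 -> allele sy1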
Ninety-eight percent of the data extracted went into the database without any manual correction, and the last 2% were compromised because of typographical errors in the original paper or the inherent character of the data (i.e., gene name synonyms and changes). Discussion Accomplishments We have developed a system to retrieve information from the full text of biological papers and applied it to the C. elegans literature. As of March 2004, the database contains full texts of 60% of all papers listed by the Caenorhabditis Genetics Center (CGC; http://www.cbs.umn.edu/CGC/CGChomepage.htm) and almost all abstracts that are information rich for C. elegans research. The introduction of semantic categories and subsequent marking up of the corpus of texts introduce powerful new ways of querying the literature, leading towards the formulation of meaningful questions that can be answered by the computer. We have demonstrated such queries with one example and have successfully tried many others. A more thorough evaluation of the system revealed that the availability of full text is crucial for building a retrieval system that covers many biological data types with a satisfying recall rate, and thus is truly useful for curators and researchers. For biologists, an automated system with high recall and even moderate precision (like the current Textpresso) confers a great advantage over skimming text by eye. Textpresso is already a useful system, and thus serves not only as proof of principle for ontology-based, full-text information retrieval, but also as motivation for further development of this and related systems to achieve higher precision and hence even greater time savings. It is apparent that the C. elegans literature (currently about 6,000 articles) can be curated with the assistance of Textpresso, as this is much more efficient than curation by human readers alone. The larger the corpus of papers, the more useful Textpresso will become. We have shown this by calculating the frequencies of genetic interaction data in sentences in three different cases: random sentences, sentences that contain at least two genes, and sentences returned from a Textpresso advanced query. The efficiency was shown to increase dramatically (39-fold in the best case). We have outlined the first steps of how Textpresso helps the curation effort by extracting gene-gene interactions. Overall, we have shown that Textpresso has several uses for researchers and curators: It helps to identify relevant papers and facts and focuses information retrieval efforts. Indeed, Textpresso is used daily by C. elegans researchers and WormBase curators: The server delivers about 530 files per day in response to Web requests, a quarter of which go to WormBase curators. Areas for Improvement Textpresso is limited in two ways: the lack of complete coverage of the C. elegans literature and the fact that the ontology and its corresponding lexicon are still in their infancy. The preparation of full texts needs to become better and more efficient. The conversion of PDFs to plain text was problematic because each journal has a different layout. Even with the software we developed, a layout template for each journal needs to be written to specify where different components of text can be found. Prior to the use of this software, we had to forgo the use of figure and table captions. Acquisition of processable text is a general problem for biologists.
A new release of XPDF (a PDF viewer for X; http://www.foolabs.com/xpdf/) eases this problem considerably (see Materials and Methods). One of our studies on the effectiveness of the extraction of a specific type of biological fact, in this case gene-gene interaction, showed that the machine still cannot replace the human expert, although it increases efficiency greatly. We anticipate that the computer does better with a larger number of articles because of redundancy. While roughly 9% of distinct gene-gene interactions from a corpus of eight journal articles were missed by the human expert but revealed by Textpresso, 29% of the interactions were missed by Textpresso, primarily due to flaws in the ontology. Advancing the Textpresso ontology will help to increase the specificity of the retrieval system. A deeper, meaningful structure is likely to make extraction easier and more stable. Possible improvements are to include other biological ontologies and language systems, such as UMLS (http://www.nlm.nih.gov/research/umls/) and SNOMED (http://www.snomed.org/), and to establish a more sophisticated tree structure. Our core lexicon recognizes 5.5 tags per sentence (out of an average of 23.7 tags per sentence) that are of scientific interest. This density results in a term coverage of 23.2%, while the maximum that could theoretically be added is 36.5%, assuming that all terms currently not marked up belong to relevant categories. An average of 9.5 tags per sentence are apparently of no interest for information retrieval; however, this is due to the nature of human language (and they will nonetheless be useful for information extraction purposes). Reevaluation of the corpus of text for terms and their meanings that have been missed is necessary. This process will result in an expansion of our ontology and the resulting lexicon, or in revisions to the structure of the ontology. Ontology and lexicon revision is most efficiently done by a human, and a feasible automated approach seems out of reach. However, we have illustrated semiautomatic methods to help make this task easier in the future: Enclosing words that are not covered in our lexicon in a generic containment tag serves several purposes. First, we are able to extract all words (or n-grams, represented as consecutive sequences of words embedded in such tags), assemble a histogram of the most frequent terms, and add important ones to our lexicon. Second, having identified frequently occurring semantic patterns in the corpus, we are able to infer likely candidates of words for specific categories. For example, one popular pattern that indicates a gene-allele association is a tagged gene name immediately followed by a bracketed allele name. If one now searches for the same pattern with the allele position occupied by a word that is not yet marked up, and extracts the word enveloped by the containment tags, then a frequency-sorted list of words that are likely to be alleles can be assembled, presented to a curator for approval, and deposited into the lexicon. The alternative, an unmarked word in the gene position followed by a bracketed tagged allele, would give a list of possible gene names. Many other patterns, identified by statistical means and similarity measures, could be obtained and used in such a fashion. These two methods will help us to systematically and significantly reduce the number of terms not marked up in the corpus, making it more complete. The procedure can be repeated with every build of the Textpresso database and has the advantage that the list of words added to the lexicon is tailored to the literature for which it is used.
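A toy version of this candidate-harvesting step is sketched below. The tag names (<gene>, <notmarkedup>) and the exact pattern are placeholders standing in for the markup examples described above, not the actual Textpresso vocabulary; the point is only the mechanism of collecting and frequency-sorting unmatched words that occupy the allele slot.

    # Sketch of semi-automatic lexicon expansion: find tokens that are not yet
    # marked up but occupy the allele slot of a gene(allele) pattern, and rank
    # them by frequency for curator review. Tag names are placeholders.
    use strict;
    use warnings;

    my @sentences = (
        '<gene>lin-3</gene>(<notmarkedup>n1058</notmarkedup>) suppressed the defect.',
        '<gene>lin-3</gene>(<notmarkedup>n1058</notmarkedup>) was used as a control.',
        '<gene>daf-2</gene>(<notmarkedup>e1370</notmarkedup>) animals arrest as dauers.',
    );

    my %candidate_alleles;
    for my $s (@sentences) {
        while ( $s =~ m{<gene>[^<]+</gene>\(<notmarkedup>([^<]+)</notmarkedup>\)}g ) {
            $candidate_alleles{$1}++;
        }
    }

    # Frequency-sorted list, to be approved by a curator and added to the lexicon.
    for my $term ( sort { $candidate_alleles{$b} <=> $candidate_alleles{$a} } keys %candidate_alleles ) {
        print "$term\t$candidate_alleles{$term}\n";
    }
    # n1058   2
    # e1370   1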
In addition, shortcomings in the general structure of the ontology can be detected and corrected, if those issues have not already been caught during the research and development of the information extraction aspects of the system. If the strategy outlined above is applied continually, we will be able to close this gap and reach saturation, even as new papers and abstracts are added. About 89% of current users rely primarily on full text and multiple keywords. Some (11%) proceed to keyword plus category, and only 0.3% of users use the advanced retrieval search. It is clear that a cycle of user testing, interface improvement, and user education will greatly help the development of Textpresso and subsequently help users take full advantage of the system. More generally, biologists will become increasingly familiar with ontology-based search engines. Prospects Future development of Textpresso can be undertaken at many different levels. A synonym search could be enabled for keyword searches: after lists of synonyms have been compiled, an option could be offered to automatically include synonyms for a given term (e.g., genes, cells, cellular components) in a search. Similarly, GO annotations could be used to search for and display sentences involving genes associated with Gene Ontology terms, once the GO terms themselves have been queried. As already mentioned, search targeting could be made more flexible: papers could be subdivided into more sections (such as introduction, methods, results, conclusion, etc.), and a query could then be applied only to the specified sections. In addition, the restriction of search criteria to a single sentence could be relaxed to a set number of neighboring sentences. Finally, one could improve the links to relevant databases other than WormBase and PubMed and increase the wealth of links to the latter two. An important issue is the portability of the system to other model organism databases. This undertaking is part of the Generic Model Organism Database (GMOD) project (http://www.gmod.org), and a downloadable software package will be made available on their Web site. For a different model organism, parts of the lexicon, and perhaps also parts of the ontology, need to be modified. Language and jargon differ between communities, and terms need to be systematically collected to accommodate their specific usage in the respective communities. However, this is not too laborious, as we were able to generate a yeast version in a few weeks (E. E. Kenny, Q. Dong, R. S. Nash, and J. M. Cherry, unpublished data). We believe that Textpresso can be extended to achieve information extraction. The wealth of information buried in the semantic tag sequences of 1 million sentences invites massive exploitation by pattern-matching, statistical, and machine learning algorithms. Having the whole corpus semantically marked up provides bioinformaticians with the opportunity to develop fact extraction algorithms that might be quite similar to sequence alignment and gene-finding methods, or, more generally, algorithms that have similarity measures at their core, because sentences can now be represented as sequences of semantic tags. Furthermore, semantic sequences of related sentences show properties similar to those of related genomic sequences, such as recurring motifs, insertions, and deletions. The relatively rigid structure of the English language (subject-verb-object) and the comparatively low degree of inflection and transformation certainly help.
In addition, some scientific information is stored in a structured manner. We have already started to run simple pattern-matching scripts to populate gene-allele associations from the literature for WormBase, as many of them are written in the form “gene name(allele name),” such as “lin-3(n1058).” Materials and Methods Sources. Textpresso builds its C. elegans database from four sources. A collection of articles in PDF format is compiled according to the canonical C. elegans bibliography maintained at the CGC (http://www.cbs.umn.edu/CGC/CGChomepage.htm). As of March 2004 we had around 3,800 (60%) CGC papers in our database. Software developed by us (see below) converts the PDFs to plain text. We import additional bibliographical information from WormBase: titles of documents and author and citation information. WormBase data comprise additional C. elegans-related documents such as C. elegans meeting abstracts and Worm Breeder's Gazette articles. We also curate certain types of data ourselves. Some C. elegans-related papers are not found in the CGC bibliography or WormBase. We compile lists of URLs of journal Web sites and their articles, and links to related articles (provided by PubMed). Citations are prepared in Endnote format for download. Finally, as Textpresso returns scientific text to the user, we construct links to report pages of WormBase that display detailed information about biological entities, such as genes, cells, phenotypes, clones, and proteins. All data and links produced by us are referred to as “Textpresso” data in Figure 4. Ontology. The objective of an ontology is to make the concepts of a domain and the relationships and constraints between these concepts computable. For an ontology to be utilized in a search engine for biological literature, it has to include the language of everyday use and common sense. We have therefore assigned the most commonly used meaning to a word even though it has several meanings in different contexts. We have consequently adopted a strategy of devising the ontology by drawing on our own knowledge. Our ontology includes all terms of the three major ontologies of GO, namely “cellular component,” “biological process,” and “molecular function.” The current ontology is unstructured for the sake of straightforward usability, our first priority. A variety of approaches were utilized to construct and populate the 33 categories of the Textpresso ontology. We first designed individual categories for well-defined biological units or concepts such as strain, phenotype, clone, or gene. The terms in some of these categories (such as clone, allele, and gene) were represented by a PERL regular expression designed to match any text that looked like that particular biological unit. This was possible where a conserved and unique nomenclature for that biological concept had been established in the literature. Any exceptions to the established nomenclature recorded in WormBase were also added to these categories. For other biological concepts (e.g., “method,” “phenotype,” “cellular component,” and “drugs and small molecules”), we extracted information from publicly accessible biological databases, such as WormBase, GO, and PubMed/NCBI, to construct lists of terms. We supplemented these lists through primary literature and textbook surveys. Next, we conceived categories of terms that would describe the relationships between the biological categories. To structure these “relationship” categories, we compiled a list of words from the text of 400 C.
elegans journal articles for analysis. From this list we flagged natural prose words that we felt had at least some defined meaning within the context of biological literature (for example, “expressed,” “lineage,” “bound,” “required for”). From these flagged words we constructed 14 new categories designed to encapsulate the natural language used by biologists to describe biological events and the relationships between them (action, characterization, comparison, consort, descriptor, effect, involvement, localization in time and space, pathway, purpose, physical association, regulation, spatial relation, and time relation). We then made a second pass through the flagged words and assigned each to one of these categories according to the sense in which the word is most often used in the biological literature. Finally, a number of categories were designed to account for the syntax and grammatical construction of text, such as “preposition,” “conjunction,” and “bracket.” Names. Automatic recognition of the names of biologically relevant entities has proved difficult in the past (Fukuda et al. 1998; Proux et al. 1998; Rindflesch et al. 2000; Blaschke and Valencia 2002; Hanisch et al. 2003). We therefore chose to curate and maintain a lexicon of names of interest by hand. In this C. elegans-specific implementation of Textpresso, the effort was helped by the fact that the C. elegans community is somewhat disciplined in choosing names and that WormBase already includes names of interest. Of course, there is the danger that entities not listed in WormBase (and therefore not in our lexicon) will be missed by our system, and such cases, for example newly defined genes or newly isolated alleles, are of special interest to WormBase curators and researchers. Dictionaries tend to be incomplete and turn stale rapidly, because of the issues of synonyms, lack of naming conventions, and the rapid pace of scientific discovery. Thus, we do not rely only on WormBase, but maintain an independent, Textpresso-specific part of the lexicon. Technical aspects of the system. Figure 4 shows the details of database preparation. The regular hexagons indicate the sources from which Textpresso is built. The PDF collection was converted to plain text by a software package written by Robert Li at Caltech. The development of such a software tool had become necessary, as current PDF-to-text converters do not cope with the typesetting of each journal: footnotes, headers, figure captions, and two-column text in general end up dispersed and senselessly intermixed in the converted text. The application works with templates that specify the structure and fonts used in a particular journal and uses this information to convert the articles correctly. A high-fidelity conversion is crucial for any information retrieval and extraction application. The software will be made available at the GMOD Web site (http://www.gmod.org). While this manuscript was being written, a new version (2.0.2) of XPDF (http://www.foolabs.com/xpdf/) was released. This version, unlike its predecessors, does a superb job of converting PDF into a congruent stream of plain text. Additional bibliographic data for references for which PDFs are not available are imported from WormBase (symbolized as “WormBase data” in Figure 4). These are mainly abstracts from various meetings. The data collected from our primary sources are treated in two different ways.
Author, year, and citation information are deposited “as is” into the database, while abstracts, titles, and full texts are further processed. First, the texts are tokenized. Our tokenizer script reads the ASCII text derived from the PDF conversion and splits it into individual sentences based on the end-of-sentence period; words hyphenated at the end of a line are concatenated, and instances of periods within sentences (which occur mainly in technical terms and entity names) are ignored. The script also adds an extra space before any instance of punctuation within a sentence, which is a requirement for the Brill tagger (Brill 1992), a publicly available part-of-speech tagger, to attach 36 different grammatical tags to each tokenized word. The tagger has been trained specifically to handle the C. elegans literature, and additional tagging rules are applied; for example, gene names are forced to be tagged as nouns. The grammatical tags are not further used in the current Textpresso system. After this preprocessing step, the corpus of titles, abstracts, and full texts is marked up using the lexicon of the ontology (PERL expressions), as explained in Results and exemplified in Figure 1. The tags contain the name of the category as well as all attributes that apply to a matched term. Terms that are not matched by any of the 14,500 PERL expressions are enclosed, one token at a time, in the generic containment tag for unmarked terms. The corpus of searchable full texts, abstracts, and titles comprises 1,035,000 sentences. A total of 351,000 keywords have been indexed, covering 19,180,000 words in the texts. The semantic mark-up yields a total of 24,542,000 tags. Table 3 shows the distribution of tags. The number of meaningful tags (those other than the generic containment tag) is only 15,577,368, or 15.04 tags per sentence. An average of 5.5 tags per sentence are of scientific interest, i.e., are either biological entities or words that describe a relationship or characterize an entity. When displaying sentences and paragraphs, Textpresso provides links to report pages for several kinds of biological entities, such as proteins, transgenes, alleles, cells, phenotypes, strains, clones, and loci. There are a total of 165,000 different entities in WormBase to which Textpresso links, including links to journal articles and PubMed. All these links are produced statically and deposited on disk for fast retrieval; these data are referred to as “Textpresso data” in Figure 4. In this way the actual link is not constructed on the fly from generic URLs, and the response time for queries remains short. We generated an exhaustive keyword and category index for the whole corpus. This index, together with rapid file access algorithms, makes searches extremely fast. All keywords and tags in the corpus are indexed, as are all terms in the corpus that have a report page in WormBase. For 2,700 full-text articles and 16,300 abstracts, the index takes up 1.7 Gb. The interfaces for submitting queries and customizing display options are written as CGI scripts. They are supported by simple HTML pages that contain documentation. The Web site runs on a Red Hat Linux operating system with an Apache HTTP server; no special changes to the standard configuration are required. The Web interface accesses the custom-made Textpresso database; no commercial-grade database system is used. It takes 2–3 days to build the complete 6.9-Gb database. Methodology of evaluation.
For the preliminary study, a query was formulated using three category rows of the Textpresso “advanced retrieval” interface to identify sentences containing gene-gene interaction data in a test set of eight full-text journal articles (see Table 5): PMID:11994313 (Norman and Moerman 2002), PMID:12091304 (Alper and Kenyon 2002), PMID:12051826 (Maduzia et al. 2002), PMID:12110170 (Francis et al. 2002), PMID:12110172 (Bei et al. 2002), PMID:12065745 (Scott et al. 2002), PMID:12006612 (Piekny and Mains 2002), and PMID:12062054 (Boxem and van den Heuvel 2002). In the top row of the advanced retrieval tool, the “association” category was selected in the “category or keyword” column. No other changes were made in the first row, which means that no subcategory or specification was selected and the number of occurrences of association terms in one sentence was set to “greater than 0.” In the second row, the Boolean operator “or” and the category “regulation” were selected, with no further specification, again asking the system to return sentences with at least one regulation term. Finally, in the third row, the category “gene” was chosen, with a specification of “named” and an occurrence of “greater than 1”; the Boolean operator connecting this row with the previous ones was “and.” All other values remained at their defaults, resulting in no further query specification. As the “advanced retrieval” search engine processes queries sequentially from the top row to the bottom row, this query asks the system to return sentences containing at least one association or regulation term together with at least two genes mentioned by name. For the semiautomatic information extraction from text, the same query was used; in addition, sentences that did not mention at least two uniquely named genes were eliminated.
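For readers who find the row-by-row description hard to follow, the same query can be restated as a small data structure. This is purely a paraphrase for clarity; Textpresso does not expose such a programmatic interface.

    # The three-row "advanced retrieval" query above, restated as a data
    # structure for clarity (illustrative only; not an actual Textpresso API).
    use strict;
    use warnings;

    my @query = (
        { operator => undef, category => 'association', occurrences => '> 0' },
        { operator => 'or',  category => 'regulation',  occurrences => '> 0' },
        { operator => 'and', category => 'gene', specification => 'named', occurrences => '> 1' },
    );

    # Read top to bottom: (association OR regulation) AND at least two named
    # genes, evaluated per sentence.
    printf "query has %d rows\n", scalar @query;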
              • Record: found
              • Abstract: found
              • Article: found
              Is Open Access

              FlyBase 101 – the basics of navigating FlyBase

              FlyBase (http://flybase.org) is the leading database and web portal for genetic and genomic information on the fruit fly Drosophila melanogaster and related fly species. Whether you use the fruit fly as an experimental system or want to apply Drosophila biological knowledge to another field of study, FlyBase can help you successfully navigate the wealth of available Drosophila data. Here, we review the FlyBase web site with novice and less-experienced users of FlyBase in mind and point out recent developments stemming from the availability of genome-wide data from the modENCODE project. The first section of this paper explains the organization of the web site and describes the report pages available on FlyBase, focusing on the most popular, the Gene Report. The next section introduces some of the search tools available on FlyBase, in particular, our heavily used and recently redesigned search tool QuickSearch, found on the FlyBase homepage. The final section concerns genomic data, including recent modENCODE (http://www.modencode.org) data, available through our Genome Browser, GBrowse.

                Author and article information

                Journal
                Database (Oxford)
                Database: The Journal of Biological Databases and Curation
                Oxford University Press
                1758-0463
                2012
                15 November 2012
                : 2012
                : bas039
                Affiliations
                Department of Genetics, University of Cambridge, Downing Street, Cambridge CB2 3EH, UK
                Author notes
*Corresponding author: Email: p.mcquilton@gen.cam.ac.uk. Tel: 0044 (0) 1223 333963. Fax: 0044 (0) 1223 766732

                †The Current FlyBase Consortium comprises: William Gelbart, Nick Brown, Thomas Kaufman, Kathy Matthews, Maggie Werner-Washburne, Richard Cripps, Kris Broll, Lynn Crosby, Adam Dirkmaat, Gil dos Santos, David Emmert, L. Sian Gramates, Kathleen Falls, Beverley Matthews, Susan Russo, Andy Schroeder, Susan St. Pierre, Pinglei Zhou, Mark Zytkovicz, Boris Adryan, Marta Costa, Helen Field, Steven Marygold, Peter McQuilton, Gillian Millburn, Laura Ponting, David Osumi-Sutherland, Ray Stefancsik, Susan Tweedie, Helen Attrill, Josh Goodman, Gary Grumbling, Victor Strelets, Jim Thurmond, J. D. Wong and Harriett Platero.

                Article
                bas039
                10.1093/database/bas039
                3500518
                23160412
                © The Author(s) 2012. Published by Oxford University Press.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by-nc/3.0/), which permits non-commercial reuse, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com.

                History
                : 18 June 2012
                : 16 August 2012
                : 2 October 2012
                Page count
                Pages: 8
                Categories
                BioCreative Virtual Issue

                Bioinformatics & Computational biology
