      Textpresso: An Ontology-Based Information Retrieval and Extraction System for Biological Literature

      research-article
      PLoS Biology
      Public Library of Science

          Abstract

          We have developed Textpresso, a new text-mining system for scientific literature whose capabilities go far beyond those of a simple keyword search engine. Textpresso's two major elements are a collection of the full text of scientific articles split into individual sentences, and the implementation of categories of terms for which a database of articles and individual sentences can be searched. The categories are classes of biological concepts (e.g., gene, allele, cell or cell group, phenotype) and classes that relate two objects (e.g., association, regulation) or describe one (e.g., biological process). Together they form a catalog of types of objects and concepts called an ontology. After this ontology is populated with terms, the whole corpus of articles and abstracts is marked up to identify terms of these categories. The current ontology comprises 33 categories of terms. A search engine enables the user to search for one or a combination of these tags and/or keywords within a sentence or document, and as the ontology allows word meaning to be queried, it is possible to formulate semantic queries. Full-text access increases recall of biological data types from 45% to 95%. Extraction of particular biological facts, such as gene-gene interactions, can be accelerated significantly by ontologies, with Textpresso automatically performing nearly as well as expert curators at identifying sentences; in searches for two uniquely named genes and an interaction term, the ontology confers a 3-fold increase in search efficiency. Textpresso currently focuses on Caenorhabditis elegans literature, with 3,800 full-text articles and 16,000 abstracts. The lexicon of the ontology contains 14,500 entries, each of which includes all versions of a specific word or phrase, and it includes all categories of the Gene Ontology database. Textpresso is a useful curation tool, as well as a search engine for researchers, and can readily be extended to other organism-specific corpora of text. Textpresso can be accessed at http://www.textpresso.org or via WormBase at http://www.wormbase.org.
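
          The sentence-level category search described in the abstract can be illustrated with a minimal Python sketch. This is not Textpresso's implementation: the three-category lexicon, the naive sentence splitter, and the function names (split_sentences, tag_sentence, search) are all assumptions made for illustration. It only shows the general idea of marking up sentences with ontology categories and then querying for a combination of categories and keywords, such as two uniquely named genes plus an association term.

```python
import re

# Hypothetical miniature lexicon (category -> terms). The real lexicon,
# with about 14,500 entries in 33 categories, is not reproduced here.
LEXICON = {
    "gene": ["aph-1", "pen-2", "sel-12"],
    "association": ["interacts with", "binds"],
    "regulation": ["represses", "activates"],
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tag_sentence(sentence):
    """Return {category: [matched terms]} for a single sentence."""
    tags = {}
    lowered = sentence.lower()
    for category, terms in LEXICON.items():
        for term in terms:
            # Whole-term match only, so "sel-1" would not match inside "sel-12".
            pattern = r"(?<![\w-])" + re.escape(term) + r"(?![\w-])"
            hits = re.findall(pattern, lowered)
            if hits:
                tags.setdefault(category, []).extend(hits)
    return tags


def search(sentences, required, keywords=()):
    """Return sentences that contain at least required[category] distinct
    tagged terms for every category, and every plain keyword."""
    results = []
    for sentence in sentences:
        tags = tag_sentence(sentence)
        lowered = sentence.lower()
        enough = all(
            len(set(tags.get(category, []))) >= count
            for category, count in required.items()
        )
        if enough and all(k.lower() in lowered for k in keywords):
            results.append(sentence)
    return results


corpus = (
    "aph-1 interacts with pen-2 in the embryo. "
    "The sel-12 mutant is viable. "
    "Notch signaling activates transcription."
)
sentences = split_sentences(corpus)

# Query: two distinct genes and one association term in the same sentence.
matches = search(sentences, {"gene": 2, "association": 1})
print(matches)
```

          In this toy corpus only the first sentence satisfies the query, because it is the only one that names two different genes together with an association term. A call such as search(sentences, {"gene": 1}, keywords=["mutant"]) combines a category with a plain keyword in the same way.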

          Abstract

          With the increasing availability of full-text scientific papers online, new tools such as Textpresso will help to extract information and knowledge from the research literature.

                Author and article information

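                Journal
                Title: PLoS Biology
                Abbreviated title: PLoS Biol (pbio)
                Publisher: Public Library of Science (San Francisco, USA)
                ISSN (print): 1544-9173
                ISSN (electronic): 1545-7885
                Issue date: November 2004
                Published online: 21 September 2004
                Volume: 2
                Issue: 11
                Article number: e309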
                Affiliations
                [1] Division of Biology and Howard Hughes Medical Institute, California Institute of Technology, Pasadena, California, United States of America
                Article
                DOI: 10.1371/journal.pbio.0020309
                PMC ID: 517822
                PubMed ID: 15383839
                990175ca-0474-4642-8dae-dcf33ca00de0
                Copyright: © 2004 Müller et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
                History
                Received: 17 November 2003
                Accepted: 19 July 2004
                Categories
                Research Article
                Bioinformatics/Computational Biology
                Caenorhabditis

                Life sciences
