
      Getting More Out of Biomedical Documents with GATE's Full Lifecycle Open Source Text Analytics

      research-article


          Abstract

          This software article describes the GATE family of open source text analysis tools and processes. GATE is one of the most widely used systems of its type, with yearly download rates in the tens of thousands and many active users in both academic and industrial contexts. In this paper we report three examples of GATE-based systems operating in the life sciences and in medicine. The first is in genome-wide association studies, where it has contributed to the discovery of a head and neck cancer mutation association. The second is in medical records analysis, where it has significantly increased the statistical power of treatment/outcome models in the UK's largest psychiatric patient cohort. The third is in drug-related searching, where it supports richer search constructs. We also explore the ways in which the GATE family supports the various stages of the lifecycle present in our examples. We conclude that the deployment of text mining for document abstraction or rich search and navigation is best thought of as a process, and that with the right computational tools and data collection strategies this process can be made well defined and repeatable. The GATE research programme is now 20 years old and has grown from its roots as a specialist development tool for text processing into a rather comprehensive ecosystem that brings together software developers, language engineers and research staff from diverse fields. GATE now has a strong claim to cover a uniquely wide range of the lifecycle of text analysis systems. It forms a focal point for the integration and reuse of advances made by many people (the majority outside the authors' own group) who work in text processing for biomedicine and other areas. GATE is available online <1> under GNU open source licences and runs on all major operating systems. Support is available from an active user and developer community and also on a commercial basis.



                Author and article information

                Contributors
                Role: Editor
                Journal
                PLoS Computational Biology (PLoS Comput Biol)
                Public Library of Science (San Francisco, USA)
                ISSN: 1553-734X (print), 1553-7358 (electronic)
                Published: 7 February 2013
                Volume: 9
                Issue: 2
                Article: e1002854
                Affiliations
                [1] Department of Computer Science, University of Sheffield, Sheffield, United Kingdom
                UCSD, United States of America
                Author notes

                The authors have declared that no competing interests exist.

                Conceived and designed the experiments: HC VT AR KB. Performed the experiments: HC VT AR KB. Analyzed the data: HC VT AR KB. Contributed reagents/materials/analysis tools: HC VT AR KB. Wrote the paper: HC VT AR KB.

                Article
                Manuscript: PCOMPBIOL-D-12-00425
                DOI: 10.1371/journal.pcbi.1002854
                PMCID: PMC3567135
                PMID: 23408875
                Copyright © 2013

                This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

                History
                Received: 16 March 2012
                Accepted: 10 November 2012
                Page count
                Pages: 16
                Funding
                GATE has been funded by UK Research Councils (EPSRC, BBSRC, and AHRC), the European Commission's Framework Research programmes, the UK National Health Service (NHS), volunteer contributors and commercial contracts. The specific results presented in this paper were funded by the European Commission (contracts: LarKC, Khresmoi), the NHS (contracts SLaM/IE 1–3), and the Information Retrieval Facility (a non-profit foundation based in Vienna, Austria; contracts SAM 1–3). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
                Categories
                Research Article
                Biology
                Computational Biology
                Text Mining
                Genetics
                Genome-Wide Association Studies
                Computer Science
                Natural Language Processing
                Medicine
                Mental Health
                Psychiatry

                Quantitative & Systems biology
