      The gene normalization task in BioCreative III

      research-article
      BMC Bioinformatics
      BioMed Central
      The Third BioCreative, Critical Assessment of Information Extraction in Biology Challenge
      13-15 September 2010


          Abstract

          Background

          We report the Gene Normalization (GN) challenge in BioCreative III where participating teams were asked to return a ranked list of identifiers of the genes detected in full-text articles. For training, 32 fully and 500 partially annotated articles were prepared. A total of 507 articles were selected as the test set. Due to the high annotation cost, it was not feasible to obtain gold-standard human annotations for all test articles. Instead, we developed an Expectation Maximization (EM) algorithm approach for choosing a small number of test articles for manual annotation that were most capable of differentiating team performance. Moreover, the same algorithm was subsequently used for inferring ground truth based solely on team submissions. We report team performance on both gold standard and inferred ground truth using a newly proposed metric called Threshold Average Precision (TAP-k).
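
To make the idea of inferring ground truth from team submissions concrete, here is a minimal sketch. The abstract does not describe the model behind the EM algorithm, so everything below is an assumption: a latent-label model with one sensitivity and one specificity per team, a hypothetical function name `infer_truth_em`, and a made-up vote matrix in which rows are candidate (article, gene) pairs and columns are teams. It is not the authors' implementation.

```python
import math

def infer_truth_em(votes, n_iter=50, eps=1e-6):
    """Estimate, for each candidate, the probability that it is a true annotation.

    votes[i][t] is 1 if team t returned candidate i, else 0 (illustrative input).
    Assumed model: each team has its own sensitivity and specificity.
    """
    n_items, n_teams = len(votes), len(votes[0])
    # Start from the fraction of teams that returned each candidate.
    post = [sum(row) / n_teams for row in votes]
    sens = spec = None
    for _ in range(n_iter):
        pos_mass = sum(post)
        neg_mass = n_items - pos_mass
        # M-step: re-estimate each team's sensitivity and specificity,
        # weighting candidates by the current posteriors (eps avoids 0 and 1).
        sens = [(sum(p * row[t] for p, row in zip(post, votes)) + eps)
                / (pos_mass + 2 * eps) for t in range(n_teams)]
        spec = [(sum((1 - p) * (1 - row[t]) for p, row in zip(post, votes)) + eps)
                / (neg_mass + 2 * eps) for t in range(n_teams)]
        pi = min(max(pos_mass / n_items, eps), 1 - eps)
        # E-step: recompute each candidate's posterior in log space.
        new_post = []
        for row in votes:
            log_pos, log_neg = math.log(pi), math.log(1 - pi)
            for t, v in enumerate(row):
                log_pos += math.log(sens[t] if v else 1 - sens[t])
                log_neg += math.log(1 - spec[t] if v else spec[t])
            diff = max(min(log_neg - log_pos, 700.0), -700.0)
            new_post.append(1.0 / (1.0 + math.exp(diff)))
        post = new_post
    return post, sens, spec

# Made-up votes from four hypothetical teams on six candidates.
votes = [
    [1, 1, 1, 0],
    [1, 1, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
]
posteriors, sensitivity, specificity = infer_truth_em(votes)
```

Under this assumed model, candidates whose posterior exceeds a chosen cutoff would be treated as inferred ground truth, and teams that agree more often with the consensus receive higher estimated reliability.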

          Results

          We received a total of 37 runs from 14 different teams for the task. When evaluated using the gold-standard annotations of the 50 articles, the highest TAP-k scores were 0.3297 (k=5), 0.3538 (k=10), and 0.3535 (k=20), respectively. Higher TAP-k scores of 0.4916 (k=5, 10, 20) were observed when evaluated using the inferred ground truth over the full test set. When combining team results using machine learning, the best composite system achieved TAP-k scores of 0.3707 (k=5), 0.4311 (k=10), and 0.4477 (k=20) on the gold standard, representing improvements of 12.4%, 21.8%, and 26.6% over the best team results, respectively.
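
The percentage gains quoted above are relative improvements of the composite system over the best single team at each k. The short check below recomputes them from the scores in the paragraph; the variable and function names are illustrative only.

```python
# Scores on the gold standard, copied from the Results paragraph.
best_team = {5: 0.3297, 10: 0.3538, 20: 0.3535}
composite = {5: 0.3707, 10: 0.4311, 20: 0.4477}

def relative_improvement(new, old):
    """Return the relative gain of `new` over `old`, in percent."""
    return (new - old) / old * 100.0

# Relative improvement per k, rounded to one decimal place.
improvements = {
    k: round(relative_improvement(composite[k], best_team[k]), 1)
    for k in best_team
}
print(improvements)  # {5: 12.4, 10: 21.8, 20: 26.6}
```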

          Conclusions

          By using full text and being species non-specific, the GN task in BioCreative III has moved closer to a real literature curation task than similar tasks in the past and presents additional challenges for the text mining community, as revealed in the overall team results. By evaluating teams using the gold standard, we show that the EM algorithm allows team submissions to be differentiated while keeping the manual annotation effort feasible. Using the inferred ground truth we show measures of comparative performance between teams. Finally, by comparing team rankings on gold standard vs. inferred ground truth, we further demonstrate that the inferred ground truth is as effective as the gold standard for detecting good team performance.
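
One common way to compare two team rankings is a rank correlation. The abstract does not say which measure the authors used, so the sketch below is only an assumption: it computes Kendall's tau-a on hypothetical scores (the numbers are invented for illustration and are not results from the challenge).

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall tau-a between two equally long score lists.

    Tied pairs count as neither concordant nor discordant.
    """
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical scores for four teams (NOT taken from the paper).
gold_scores = [0.30, 0.25, 0.20, 0.10]
inferred_scores = [0.45, 0.40, 0.42, 0.15]

# Five of the six team pairs are ordered the same way, one is swapped.
tau = kendall_tau(gold_scores, inferred_scores)
```

A value near 1 would indicate that the two evaluations order the teams almost identically; a value near 0 would indicate little agreement.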


                Author and article information

                Journal: BMC Bioinformatics
                Publisher: BioMed Central
                ISSN: 1471-2105
                Published: 3 October 2011
                Volume 12, Suppl 8, article S2
                Affiliations
                [1] National Center for Biotechnology Information (NCBI), 8600 Rockville Pike, Bethesda, Maryland 20894, USA
                [2] Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C.
                [3] Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, China
                [4] Institute of Information Science, Academia Sinica, Taipei 115, Taiwan
                [5] Information Science Institute, University of Southern California, Marina del Rey, California, USA
                [6] Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, Taiwan, R.O.C.
                [7] Department of Computer Science, National Tsing-Hua University, Hsinchu, Taiwan, R.O.C.
                [8] Institute of Information Science, Academia Sinica, Taipei, Taiwan, R.O.C.
                [9] Interfaculty Initiative in Information Studies, University of Tokyo, Japan
                [10] Graduate School of Information Science and Technology, University of Tokyo, Japan
                [11] Faculty of Life Sciences, University of Manchester, Manchester, M13 9PT, UK
                [12] Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, 1117 Budapest, Hungary
                [13] Medical Informatics, University of Wisconsin-Milwaukee, Milwaukee, Wisconsin, USA
                [14] BiTeM Group, Division of Medical Information Sciences, University of Geneva, Switzerland
                [15] BiTeM Group, Information Science Department, University of Applied Science, Geneva, Switzerland
                [16] NITAS/TMS, Text Mining Services, Novartis AG, Switzerland
                [17] Institute of Computational Linguistics, University of Zurich, Zurich, Switzerland
                [18] Department of Computer Science, The University of Iowa, Iowa City, Iowa 52242, USA
                [19] Department of Health Sciences Research, Mayo Clinic College of Medicine, Rochester, MN 55905, USA
                [20] Lab of Text Intelligence in Biomedicine, Georgetown University Medical Center, 4000 Reservoir Rd., NW, Washington, DC 20057, USA
                [21] DETI/IEETA, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
                [22] Center for Computational Pharmacology, University of Colorado School of Medicine, Aurora, Colorado, USA
                Article
                1471-2105-12-S8-S2
                10.1186/1471-2105-12-S8-S2
                3269937
                22151901
                2f2c58bb-d6ac-4488-8e3b-67bafa391ec1
                Copyright ©2011 Lu et al; licensee BioMed Central Ltd.

                This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

                The Third BioCreative, Critical Assessment of Information Extraction in Biology Challenge
                Bethesda, MD, USA
                13-15 September 2010
                Categories
                Research

                Bioinformatics & Computational biology