A Survey of Bioinformatics Database and Software Usage through Mining the Literature

Abstract

Computer-based resources are central to much, if not most, biological and medical research. However, while there is an ever-expanding choice of bioinformatics resources to use, described within the biomedical literature, little work to date has provided an evaluation of the full range of availability or levels of usage of database and software resources. Here we use text mining to process the PubMed Central full-text corpus, identifying mentions of databases or software within the scientific literature. We provide an audit of the resources contained within the biomedical literature, and a comparison of their relative usage, both over time and between the sub-disciplines of bioinformatics, biology and medicine. We find that trends in resource usage differ between these domains. The bioinformatics literature emphasises novel resource development, while database and software usage within biology and medicine is more stable and conservative. Many resources are only mentioned in the bioinformatics literature, with a relatively small number making it out into general biology, and fewer still into the medical literature. In addition, many resources are seeing a steady decline in their usage (e.g., BLAST, SWISS-PROT), though some are instead seeing rapid growth (e.g., the GO, R). We find a striking imbalance in resource usage, with the top 5% of resource names (133 names) accounting for 47% of total usage, and over 70% of resources extracted being mentioned only once each. While these results highlight the dynamic and creative nature of bioinformatics research, they raise questions about software reuse, choice and the sharing of bioinformatics practice. Is it acceptable that so many resources are apparently never reused? Finally, our work is a step towards automated extraction of scientific method from text. We make the dataset generated by our study available under the CC0 license here: http://dx.doi.org/10.6084/m9.figshare.1281371.
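
The imbalance figures quoted in the abstract (share of usage held by the top 5% of names, and the fraction of names mentioned only once) are simple to compute from a table of mention counts. The following is a minimal Python sketch under stated assumptions: the function name `usage_concentration` and the toy counts are hypothetical and invented for illustration; they are not taken from the study's software or its Figshare dataset.

```python
import math


def usage_concentration(mention_counts, top_fraction=0.05):
    """Summarise how unevenly mentions are spread across resource names.

    mention_counts: mapping of resource name -> number of mentions.
    Returns (share of all mentions held by the top names,
             fraction of names mentioned exactly once).
    """
    counts = sorted(mention_counts.values(), reverse=True)
    if not counts:
        raise ValueError("no resource names supplied")
    total = sum(counts)
    # Always include at least one name in the "top" group.
    n_top = max(1, math.ceil(len(counts) * top_fraction))
    top_share = sum(counts[:n_top]) / total
    singleton_fraction = sum(1 for c in counts if c == 1) / len(counts)
    return top_share, singleton_fraction


# Toy counts, invented for illustration only (not real survey data).
toy_mentions = {
    "BLAST": 40, "R": 30, "GO": 14, "ToolA": 10,
    "ToolB": 3, "ToolC": 1, "ToolD": 1, "ToolE": 1,
}

top_share, singleton_fraction = usage_concentration(toy_mentions, top_fraction=0.25)
print(f"top-name share of mentions: {top_share:.2f}")
print(f"names mentioned once: {singleton_fraction:.3f}")
```

With the default `top_fraction=0.05`, the same function would give the kind of "top 5% of names" share reported above, assuming the input holds one count per extracted resource name.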

Most cited references 19

Is Open Access

Overview of BioCreAtIvE: critical assessment of information extraction for biology

Lynette Hirschman, Alexander Yeh, Christian Blaschke … (2005)

Background The goal of the first BioCreAtIvE challenge (Critical Assessment of Information Extraction in Biology) was to provide a set of common evaluation tasks to assess the state of the art for text mining applied to biological problems. The results were presented in a workshop held in Granada, Spain March 28–31, 2004. The articles collected in this BMC Bioinformatics supplement entitled "A critical assessment of text mining methods in molecular biology" describe the BioCreAtIvE tasks, systems, results and their independent evaluation. Results BioCreAtIvE focused on two tasks. The first dealt with extraction of gene or protein names from text, and their mapping into standardized gene identifiers for three model organism databases (fly, mouse, yeast). The second task addressed issues of functional annotation, requiring systems to identify specific text passages that supported Gene Ontology annotations for specific proteins, given full text articles. Conclusion The first BioCreAtIvE assessment achieved a high level of international participation (27 groups from 10 countries). The assessment provided state-of-the-art performance results for a basic task (gene name finding and normalization), where the best systems achieved a balanced 80% precision / recall or better, which potentially makes them suitable for real applications in biology. The results for the advanced task (functional annotation from free text) were significantly lower, demonstrating the current limitations of text-mining approaches where knowledge extrapolation and interpretation are required. In addition, an important contribution of BioCreAtIvE has been the creation and release of training and test data sets for both tasks. There are 22 articles in this special issue, including six that provide analyses of results or data quality for the data sets, including a novel inter-annotator consistency assessment for the test set used in task 2.

Cited 123 times

Computational science: ...Error.

Zeeya Merali (2010)

Cited 67 times

Time to Organize the Bioinformatics Resourceome

Nicola Cannata, Emanuela Merelli, Russ Altman (2005)

We will be witnessing the birth of the artificial, or in-silico, scientist. —J. D. Wren [1] The field of bioinformatics has blossomed in the last ten years, and as a result, there is a large and increasing number of researchers generating computational tools for solving problems relevant to biology. Because the number of artifacts has increased greatly, it is impossible for many bioinformatics researchers to track tools, databases, and methods in the field—or even perhaps within their own specialty area. More critically, however, biologist users and scientists approaching the field do not have a comprehensive index of bioinformatics algorithms, databases, and literature annotated with information about their context and appropriate use. We suggest that the full set of bioinformatics resources—the “resourceome”—should be explicitly characterized and organized. A hierarchical and machine-understandable organization of the field, along with rich cross-links (an ontology!) would be a useful start. It is likely that a distributed development approach would be required so that those with focused expertise can classify resources in their area, while providing the metadata that would allow easier access to useful existing resources. The growth of bioinformatics can be quantified in many ways. The Intelligent Systems for Molecular Biology Meeting began in 1993, and numerous other meetings have been established. The International Society for Computational Biology (ISCB) was formed in 1995, and recent membership numbers have reached 2,000. The field has gone from having one or two journals to having more than a dozen—if one considers “-omics” (i.e., subjects relating to high-throughput functional genomics, where computation plays a central role) and the emerging field of systems biology. Because bioinformatics has a strong element of engineering, the creation and maintenance of tools provide value only insofar as they are used. 
These tools may be databases that hold biological data, or they may be algorithms that act on this data to draw inferences. Access to these artifacts is currently uneven. Of course, the published literature is the archival resting place for the initial description of these innovations, but it only contains a snapshot of most tools early in their lifetime. The literature does not use any standard classification system to describe tools, so the sensitivity of searches for specific functions is not generally high. Indeed, the bibliome itself is idiosyncratically organized, and finding the right article is often like searching for a needle in a haystack [2]. Finally, the published literature does not contain reliable references to the location and to the availability of most bioinformatics resources [3,4]. One could also argue that Google (http://www.google.com) provides adequate access to tools based on keyword searching [5]. However, the lack of standard terms makes sensitive and specific searches difficult. In addition, most search hits confound papers, Web sites, tools, departments, and people in a manner that makes extracting useful information very difficult. Recognizing this limitation, there have been some grassroots attempts to organize the bioinformatics resourceome. Among the most famous are the “archaeological” Pedro's List—a list of computer tools for molecular biologists (http://www.public.iastate.edu/~pedro/research_tools.html)—and the Expasy Life Sciences Directory, formerly known as the Amos's WWW links page (http://www.expasy.org/links.html). The Bioinformatics Links Directory (http://www.bioinformatics.ubc.ca/resources/links_directory/) today contains more than 700 curated links to bioinformatics resources, organized into eleven main categories, including all the databases and Web servers yearly listed in the dedicated Nucleic Acids Research special issues [6]. 
The National Center for Biotechnology Information has tried to make access to its suite of tools transparent, with moderate success. Many Web sites can be found listing “useful sites,” especially concerning special interest or limited topics (e.g., microarrays, text mining, and gene regulation). But all of these efforts are limited by the difficulty in maintaining currency and by the lack of a uniformly recognized classification scheme. Yet our colleagues in bioinformatics and biology are constantly asking about the availability of tools or databases with certain characteristics. The lack of a useful index, thus, routinely costs time and opportunities. In addition, there is no “peer-review” system for bioinformatics tools so that the most useful ones can be highlighted by happy users. A secure and reliable system for rating (similar to that used by Amazon.com, for example) would also be an important prerequisite. An “ontology” is a specification of a conceptual space, often used by computer programs. The field of ontology engineering has matured in the last 20 years, making fundamental contributions in computer science and establishing applications in biology. The success of the Gene Ontology Project (it is used by multiple model organism databases, and is used to annotate high-throughput data routinely [8]) is one example of an ontology that was developed for the narrow purpose of supporting comparative genomics, but which has found a multitude of other uses. A primitive bioinformatics-specific ontology is available in Google Directory (http://directory.google.com/Top/Science/Biology/Bioinformatics), assembled in the collaborative Open Directory effort (http://www.dmoz.org), but it, too, mixes all different classes of objects (personal Web sites, organization Web sites, databases, and tools) in a way that is not transparent. 
It seems clear that a well-organized and intuitive ontology of bioinformatics resources would provide a very valuable framework on which a fully distributed system of registration and annotation of biology-related computational resources could be constructed. The Transparent Access to Multiple Bioinformatics Information Sources (TAMBIS) [9] work was a bold attempt to describe bioinformatics concepts, including resources, using formal description languages. Unfortunately, it has not been widely used, perhaps because it was ahead of its time or because the underlying knowledge representation techniques are somewhat sophisticated and complex. In the foreseeable future the web of links between documents, databases, and programs can provide a new level of interaction among scientific communities. —J. Hendler [10] Ontologies are important, but their use is often hindered by the lack of “killer apps” for using them. It is often unclear how to exchange information about ontologies, and how to link them to other resources on the Web. Emerging technologies that contribute important infrastructures to the resourceome are represented by the semantic Web and Web services. It is now possible to have standardized descriptors of Web resources, using an ontology, in order to “publish” the availability of tools or simply to announce their existence. Thus, the vision for using an ontology to support the resourceome becomes clear: each individual who has created or who is maintaining a resource uses a standard ontology to describe the basic features of that particular resource using the semantic Web, and these are automatically included in a distributed index of resources. Thus, the index is created by querying the semantic net for descriptions of all available tools, which can then be registered and updated on a regular basis. 
The development of a browser for this index could be the final step (or “killer app”) in building a self-sustaining, distributed index of bioinformatics resources. Adoption of agent technology may be helpful in overcoming the inherent complexity of this challenge [11]. We believe that the need for a bioinformatics resourceome project and the technical requirements for it are both present. We therefore urge the community to come together to start the process of creating a simple distributed system for describing resources, announcing their availability, and presenting this information to biologists and bioinformaticians in an easy-to-navigate manner. The World Wide Web Consortium already launched its first workshop on Semantic Web for Life Sciences, bringing together more than 100 participants from academia, industry, and international organizations. Another important event is the recent creation of the National Center for Biomedical Ontology (http://www.bioontology.org). The initial steps toward a bioinformatics resourceome are clear. First, an overall ontology with the high-level concepts (algorithms, databases, organizations, papers, people, etc.) must be created, with a set of standard attributes and a standard set of relations between these concepts (e.g., people publish papers, papers describe algorithms or databases, organizations house people, etc.). The initial ontology should be compact and built for distributed collaborative extension. Second, a mechanism for people to extend this ontology with subconcepts in order to describe their own resources should be designed. The precise location of a tool within a taxonomy is not critical—the author will place it somewhere based on the location of similar/competing resources or based on a best-informed guess. 
Others may create links to the resource from other appropriate locations in the taxonomy in order to ensure that competing interpretations of the appropriate conceptual location for the resource are accommodated. Third, the formats for the ontologies and the resource descriptions should be published so enterprising software engineers can create interfaces for surfing, searching, and viewing the resources. The resulting distributed system of resource descriptions would be extensible, robust, and useful to the entire biomedical research community. Who can take leadership in this effort? We believe that a coalition of publishers with an open-access ethic, funding agencies, and scientists who want to contribute to an improved computational infrastructure for biomedicine would be most effective. Companies with an interest in cost-effective research and development may also want to be involved. Most likely, a small group of devoted scientists with both biological domain knowledge and understanding of semantic Web technologies must take the lead. A critical mass of resources must be indexed so that the value of the effort can be assessed. Most likely, the initial indexing will not include all possible resources, but rather algorithms and databases. The community can decide later if Web sites, publications, people, and institutions should also be indexed. The system should also include from the start a capability for routinely evaluating sites for availability (no 404s!). There is increasing discussion of the requirements and technologies for the resourceome at bioinformatics conferences, including Intelligent Systems for Molecular Biology (http://ismb2006.cbi.cnptia.embrapa.br), Pacific Symposium on Biocomputing (http://psb.stanford.edu), and others (see http://www.iscb.org).
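
The first step proposed in the text above, an overall ontology of high-level concepts with a standard set of relations between them, could be sketched as typed triples. This is an illustrative Python sketch only: the names `Resource`, `link`, `CONCEPTS` and `ALLOWED_RELATIONS` are hypothetical, and the example records are placeholders; none of them come from the resourceome proposal or from an existing library.

```python
from dataclasses import dataclass

# High-level concepts named in the text.
CONCEPTS = {"Algorithm", "Database", "Organization", "Paper", "Person"}

# (subject concept, relation, object concept) triples from the text's examples:
# people publish papers, papers describe algorithms or databases,
# organizations house people.
ALLOWED_RELATIONS = {
    ("Person", "publishes", "Paper"),
    ("Paper", "describes", "Algorithm"),
    ("Paper", "describes", "Database"),
    ("Organization", "houses", "Person"),
}


@dataclass(frozen=True)
class Resource:
    name: str
    concept: str

    def __post_init__(self):
        # Reject concepts that are not part of the compact top-level ontology.
        if self.concept not in CONCEPTS:
            raise ValueError(f"unknown concept: {self.concept}")


def link(subject, relation, obj):
    """Return a (subject, relation, object) triple if the ontology allows it."""
    if (subject.concept, relation, obj.concept) not in ALLOWED_RELATIONS:
        raise ValueError(
            f"{subject.concept} -{relation}-> {obj.concept} is not a defined relation"
        )
    return (subject.name, relation, obj.name)


# Placeholder records, invented for illustration.
paper = Resource("Example Paper", "Paper")
database = Resource("ExampleDB", "Database")
print(link(paper, "describes", database))
```

Keeping the allowed relations in one small set mirrors the suggestion that the initial ontology be compact; extending it with subconcepts would be a separate design step not shown here.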

Cited 28 times

Author and article information

Contributors

Shoba Ranganathan: Role: Editor

Journal

Journal ID (nlm-ta): PLoS One

Journal ID (iso-abbrev): PLoS ONE

Journal ID (publisher-id): plos

Journal ID (pmc): plosone

Title: PLoS ONE

Publisher: Public Library of Science (San Francisco, CA, USA)

ISSN (Electronic): 1932-6203

Publication date Collection: 2016

Publication date (Electronic): 22 June 2016

Volume: 11

Issue: 6

Electronic Location Identifier: e0157989

Affiliations

[1] School of Computer Science, The University of Manchester, Manchester, United Kingdom

[2] Manchester Institute of Biotechnology, The University of Manchester, Manchester, United Kingdom

[3] Computational and Evolutionary Biology, Faculty of Life Sciences, The University of Manchester, Manchester, United Kingdom

Macquarie University, AUSTRALIA

Author notes

Competing Interests: The authors have declared that no competing interests exist.

Conceived and designed the experiments: GD GN DLR RS. Performed the experiments: GD. Analyzed the data: GD GN MF AB DLR RS. Wrote the paper: GD GN MF DLR RS. Extended bioNerDS: GD MF. Provided statistical guidance: AB.

* E-mail: robert.stevens@manchester.ac.uk

Author information

Geraint Duck http://orcid.org/0000-0002-1002-4458

Article

Publisher ID: PONE-D-15-44429

DOI: 10.1371/journal.pone.0157989

PMC ID: 4917176

PubMed ID: 27331905

SO-VID: 374b7e19-4e7e-43f7-a578-8ab8bd4db5aa

License:

This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

History

Date received: 13 October 2015

Date accepted: 8 June 2016

Page count

Figures: 11, Tables: 10, Pages: 25

Funding

Funded by: Biotechnology and Biological Sciences Research Council (funder-id http://dx.doi.org/10.13039/501100000268)

Award Recipient: Geraint Duck (ORCID: http://orcid.org/0000-0002-1002-4458)

Funded by: Engineering and Physical Sciences Research Council (funder-id http://dx.doi.org/10.13039/501100000266)

Award Recipient: Michele Filannino

GD is funded by a studentship from the Biotechnology and Biological Sciences Research Council (BBSRC) to GN, DLR and RS. MF would like to acknowledge the support of the UK Engineering and Physical Sciences Research Council (EPSRC), in the form of a doctoral training grant. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Custom metadata

Data Availability: The software binary and source code can be found on SourceForge (http://bionerds.sourceforge.net/). The full dataset generated and used for this study is available from Figshare: http://dx.doi.org/10.6084/m9.figshare.1281371.
