Crowdsourcing and curation: perspectives from biology and natural language processing

There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

Abstract

Crowdsourcing is increasingly utilized for performing tasks in both natural language processing and biocuration. Although there have been many applications of crowdsourcing in these fields, there have been fewer high-level discussions of the methodology and its applicability to biocuration. This paper explores crowdsourcing for biocuration through several case studies that highlight different ways of leveraging ‘the crowd’; these raise issues about the kind(s) of expertise needed, the motivations of participants, and questions related to feasibility, cost and quality. The paper is an outgrowth of a panel session held at BioCreative V (Seville, September 9–11, 2015). The session consisted of four short talks, followed by a discussion. In their talks, the panelists explored the role of expertise and the potential to improve crowd performance by training; the challenge of decomposing tasks to make them amenable to crowdsourcing; and the capture of biological data and metadata through community editing.

Database URL: http://www.mitre.org/publications/technical-papers/crowdsourcing-and-curation-perspectives

Related collections

Most cited references 16

Record: found
Abstract: found
Article: not found

The Genomes OnLine Database (GOLD) v.5: a metadata management system based on a four level (meta)genome project classification

T.B.K. Reddy, Alex D Thomas, Dimitri Stamatis … (2014)

The Genomes OnLine Database (GOLD; http://www.genomesonline.org) is a comprehensive online resource to catalog and monitor genetic studies worldwide. GOLD provides up-to-date status on complete and ongoing sequencing projects along with a broad array of curated metadata. Here we report version 5 (v.5) of the database. The newly designed database schema and web user interface supports several new features including the implementation of a four level (meta)genome project classification system and a simplified intuitive web interface to access reports and launch search tools. The database currently hosts information for about 19 200 studies, 56 000 Biosamples, 56 000 sequencing projects and 39 400 analysis projects. More than just a catalog of worldwide genome projects, GOLD is a manually curated, quality-controlled metadata warehouse. The problems encountered in integrating disparate and varying quality data into GOLD are briefly highlighted. GOLD fully supports and follows the Genomic Standards Consortium (GSC) Minimum Information standards.

0 comments Cited 152 times – based on 0 reviews      Review now

Bookmark

Record: found
Abstract: found
Article: not found

Strengths and limitations of microarray-based phenotype prediction: lessons learned from the IMPROVER Diagnostic Signature Challenge.

Adi L Tarca, Mario Lauria, Michael Unger … (2013)

After more than a decade since microarrays were used to predict phenotype of biological samples, real-life applications for disease screening and identification of patients who would best benefit from treatment are still emerging. The interest of the scientific community in identifying best approaches to develop such prediction models was reaffirmed in a competition style international collaboration called IMPROVER Diagnostic Signature Challenge whose results we describe herein. Fifty-four teams used public data to develop prediction models in four disease areas including multiple sclerosis, lung cancer, psoriasis and chronic obstructive pulmonary disease, and made predictions on blinded new data that we generated. Teams were scored using three metrics that captured various aspects of the quality of predictions, and best performers were awarded. This article presents the challenge results and introduces to the community the approaches of the best overall three performers, as well as an R package that implements the approach of the best overall team. The analyses of model performance data submitted in the challenge as well as additional simulations that we have performed revealed that (i) the quality of predictions depends more on the disease endpoint than on the particular approaches used in the challenge; (ii) the most important modeling factor (e.g. data preprocessing, feature selection and classifier type) is problem dependent; and (iii) for optimal results datasets and methods have to be carefully matched. Biomedical factors such as the disease severity and confidence in diagnostic were found to be associated with the misclassification rates across the different teams. The lung cancer dataset is available from Gene Expression Omnibus (accession, GSE43580). The maPredictDSC R package implementing the approach of the best overall team is available at www.bioconductor.org or http://bioinformaticsprb.med.wayne.edu/.

0 comments Cited 62 times – based on 0 reviews      Review now

Bookmark

Record: found
Abstract: found
Article: found

Is Open Access

A CTD–Pfizer collaboration: manual curation of 88 000 scientific articles text mined for drug–disease and drug–phenotype interactions

Allan Peter Davis, Thomas Wiegers, Phoebe M. Roberts … (2013)

Improving the prediction of chemical toxicity is a goal common to both environmental health research and pharmaceutical drug development. To improve safety detection assays, it is critical to have a reference set of molecules with well-defined toxicity annotations for training and validation purposes. Here, we describe a collaboration between safety researchers at Pfizer and the research team at the Comparative Toxicogenomics Database (CTD) to text mine and manually review a collection of 88 629 articles relating over 1 200 pharmaceutical drugs to their potential involvement in cardiovascular, neurological, renal and hepatic toxicity. In 1 year, CTD biocurators curated 2 54 173 toxicogenomic interactions (1 52 173 chemical–disease, 58 572 chemical–gene, 5 345 gene–disease and 38 083 phenotype interactions). All chemical–gene–disease interactions are fully integrated with public CTD, and phenotype interactions can be downloaded. We describe Pfizer’s text-mining process to collate the articles, and CTD’s curation strategy, performance metrics, enhanced data content and new module to curate phenotype information. As well, we show how data integration can connect phenotypes to diseases. This curation can be leveraged for information about toxic endpoints important to drug safety and help develop testable hypotheses for drug–disease events. The availability of these detailed, contextualized, high-quality annotations curated from seven decades’ worth of the scientific literature should help facilitate new mechanistic screening assays for pharmaceutical compound survival. This unique partnership demonstrates the importance of resource sharing and collaboration between public and private entities and underscores the complementary needs of the environmental health science and pharmaceutical communities. Database URL: http://ctdbase.org/

0 comments Cited 35 times – based on 0 reviews      Review now

Bookmark

All references

Author and article information

Journal

Journal ID (nlm-ta): Database (Oxford)

Journal ID (iso-abbrev): Database (Oxford)

Journal ID (publisher-id): databa

Journal ID (hwp): databa

Title: Database: The Journal of Biological Databases and Curation

Publisher: Oxford University Press

ISSN (Electronic): 1758-0463

Publication date Collection: 2016

Publication date (Electronic): 08 August 2016

Publication date PMC-release: 08 August 2016

Volume: 2016

Electronic Location Identifier: baw115

Affiliations

¹The MITRE Corporation, Bedford, MA, USA

²University of Paris-Sorbonne/STIH Team, Paris, France

³Philip Morris International R&D, Philip Morris Products S.A., Neuchâtel, Switzerland

⁴Joint Genome Institute, Walnut, Creek, CA, USA

⁵National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA

⁶University of Colorado, School of Medicine, Denver, CO, USA

Author notes

*Corresponding author: Email: lynette@ 123456mitre.org

Citation details: Hirschman,L., Fort,K., Boué,S. et al. Crowdsourcing and curation: perspectives from biology and natural language processing. Database (2016) Vol. 2016: article ID baw115; doi:10.1093/database/baw1115

Article

Publisher ID: baw115

DOI: 10.1093/database/baw115

PMC ID: 4976298

PubMed ID: 27504010

SO-VID: fa116138-b585-4c7a-853f-4b72ce47d70d

License:

This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

History

Date received : 19 January 2016

Date revision received : 28 June 2016

Date accepted : 11 July 2016

Page count

Pages: 11

Crowdsourcing and curation: perspectives from biology and natural language processing

Read this article at

Abstract

Related collections

Genetoberfest

Most cited references 16

The Genomes OnLine Database (GOLD) v.5: a metadata management system based on a four level (meta)genome project classification

Strengths and limitations of microarray-based phenotype prediction: lessons learned from the IMPROVER Diagnostic Signature Challenge.

A CTD–Pfizer collaboration: manual curation of 88 000 scientific articles text mined for drug–disease and drug–phenotype interactions

Author and article information

Journal

Affiliations

Author notes

Article

History

Page count

Categories

Comments

Comment on this article

Similar content 125

Cited by 1

Most referenced authors 264