ukbREST: efficient and streamlined data access for reproducible research in large biobanks

Abstract

Summary

Large biobanks, such as UK Biobank with its half a million participants, are changing the scale and availability of genotypic and phenotypic data that researchers can use to ask fundamental questions about the biology of health and disease. The breadth of the UK Biobank data is enabling discoveries at an unprecedented pace. However, this size and complexity pose new challenges for investigators, who need to keep the accruing data up to date, comply with potential consent changes, and efficiently and reproducibly extract subsets of the data to answer specific scientific questions. Here we propose ukbREST, a tool designed for the UK Biobank study (and easily extensible to other biobanks) that allows authorized users to efficiently retrieve phenotypic and genetic data. It exposes a REST API that makes the data highly accessible inside a private and secure network and lets users specify the data they need in a human-readable text format that is easily shared with other researchers. These characteristics make ukbREST an important tool for making biobanks' valuable data more readily accessible to the research community and for facilitating reproducibility of analyses, a key aspect of science.

Availability and implementation

ukbREST is implemented in Python using the Flask-RESTful framework for the API and is released under the MIT license. It works with PostgreSQL, and a Docker image is available for easy deployment. The source code and documentation are available on GitHub: https://github.com/hakyimlab/ukbrest.
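To make the idea of a shareable, human-readable data specification sent to a REST API more concrete, the following is a minimal sketch and not ukbREST's actual interface. The host, the `/query` path, the `columns` and `filters` keys, and the field names are all hypothetical placeholders chosen for illustration; the project documentation is the authority on the real endpoints and query format. The sketch only builds the request and does not send it.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical placeholders: neither the host nor the path is taken from
# the ukbREST documentation.
BASE_URL = "http://localhost:5000"
QUERY_PATH = "/query"


def parse_spec(text):
    """Parse a small 'key: value' text specification into a dict of lists."""
    spec = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition(":")
        spec.setdefault(key.strip(), []).append(value.strip())
    return spec


def build_query_request(spec, base_url=BASE_URL):
    """Build (but do not send) a POST request carrying the spec as form fields."""
    pairs = [(key, value) for key, values in spec.items() for value in values]
    body = urlencode(pairs).encode("utf-8")
    return Request(base_url + QUERY_PATH, data=body, method="POST")


# A plain-text specification that could be shared alongside an analysis.
# The keys and field names are invented for this example.
SPEC_TEXT = """
# which fields to retrieve
columns: field_a
columns: field_b
# which rows to keep
filters: field_a > 0
"""

request = build_query_request(parse_spec(SPEC_TEXT))
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Because the specification is plain text, it can be version-controlled and shared with collaborators, who can then issue the same request against their own authorized deployment.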


Author and article information

Contributors

Russell Schwartz: Role: Associate Editor

Journal

Journal ID (nlm-ta): Bioinformatics

Journal ID (iso-abbrev): Bioinformatics

Journal ID (publisher-id): bioinformatics

Title: Bioinformatics

Publisher: Oxford University Press

ISSN (Print): 1367-4803

ISSN (Electronic): 1367-4811

Publication date (Print): 01 June 2019

Publication date (Electronic): 05 November 2018

Publication date PMC-release: 05 November 2018

Volume: 35

Issue: 11

Pages: 1971-1973

Affiliations

[1] Department of Medicine, Section of Genetic Medicine, The University of Chicago, Chicago, IL, USA

[2] Center for Translational Data Science, The University of Chicago, Chicago, IL, USA

Author notes

To whom correspondence should be addressed. haky@uchicago.edu

Author information

Milton Pividori http://orcid.org/0000-0002-3035-4403

Hae Kyung Im http://orcid.org/0000-0003-0333-5685

Article

Publisher ID: bty925

DOI: 10.1093/bioinformatics/bty925

PMC ID: 6546122

PubMed ID: 30395166

SO-VID: ce636072-1bbe-45dd-add3-358766ad9fb4

License:

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

History

Date received: 3 August 2018

Date revision received: 26 October 2018

Date accepted: 3 November 2018

Page count

Pages: 3

Funding

Funded by: National Institutes of Health Cloud Credits Model Pilot

Award ID: R01 MH107666

Funded by: DRTC 10.13039/100007800

Award ID: P30 DK020595
