Applications of the MapReduce programming framework to clinical big data analysis: current landscape and future trends

Abstract

The emergence of massive datasets in a clinical setting presents both challenges and opportunities in data storage and analysis. This so-called “big data” challenges traditional analytic tools and will increasingly require novel solutions adapted from other fields. Advances in information and communication technology present the most viable solutions to big data analysis in terms of efficiency and scalability. It is vital that those big data solutions be multithreaded and that data access approaches be precisely tailored to large volumes of semi-structured/unstructured data.

The MapReduce programming framework uses two tasks common in functional programming: Map and Reduce. MapReduce is a new parallel processing framework and Hadoop is its open-source implementation on a single computing node or on clusters. Compared with existing parallel processing paradigms (e.g. grid computing and graphics processing units (GPUs)), MapReduce and Hadoop have two advantages: 1) fault-tolerant storage, resulting in reliable data processing, by replicating the computing tasks and cloning the data chunks on different computing nodes across the computing cluster; 2) high-throughput data processing via a batch processing framework and the Hadoop distributed file system (HDFS). Data are stored in the HDFS and made available to the slave nodes for computation.
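The Map and Reduce tasks described above can be illustrated with a minimal single-process sketch in Python. It is not Hadoop code and not the authors' implementation; the toy records, the diagnosis codes, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are assumptions made purely for illustration. On a real cluster the framework would distribute the map and reduce calls across nodes and perform the grouping step itself.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical toy input: (patient_id, diagnosis_code) pairs.
# These values are illustrative and do not come from the article.
records = [
    ("p1", "E11"),
    ("p2", "I10"),
    ("p3", "E11"),
    ("p4", "J45"),
    ("p5", "I10"),
    ("p6", "E11"),
]


def map_phase(record):
    """Map: emit one (key, value) pair per input record."""
    _patient_id, code = record
    return (code, 1)


def shuffle(pairs):
    """Group values by key, the step a framework performs between Map and Reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: fold all values for one key into a single result."""
    return key, reduce(lambda a, b: a + b, values, 0)


def run_job(data):
    """Run map, shuffle, and reduce sequentially in one process."""
    mapped = map(map_phase, data)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


counts = run_job(records)
print(counts)  # {'E11': 3, 'I10': 2, 'J45': 1}
```

Because each call to `map_phase` depends only on its own record, and each call to `reduce_phase` only on the values for one key, both phases could in principle run in parallel on separate nodes, which is the property the framework exploits.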

In this paper, we review the existing applications of the MapReduce programming framework and its implementation platform Hadoop in clinical big data and related medical health informatics fields. The usage of MapReduce and Hadoop on a distributed system represents a significant advance in clinical big data processing and utilization, and opens up new opportunities in the emerging era of big data analytics. The objective of this paper is to summarize the state-of-the-art efforts in clinical big data analytics and highlight what might be needed to enhance the outcomes of clinical big data analytics tools. The paper concludes by summarizing the potential usage of the MapReduce programming framework and Hadoop platform to process huge volumes of clinical data in fields related to medical health informatics.

Author and article information

Contributors

Emad A Mohammed

Behrouz H Far

Christopher Naugler

Journal

Journal ID (nlm-ta): BioData Min

Journal ID (iso-abbrev): BioData Min

Title: BioData Mining

Publisher: BioMed Central

ISSN (Electronic): 1756-0381

Publication date (Collection): 2014

Publication date (Electronic): 29 October 2014

Volume: 7

Page: 22

Affiliations

[1] Department of Electrical and Computer Engineering, Schulich School of Engineering, University of Calgary, Calgary, AB, Canada

[2] Department of Pathology and Laboratory Medicine, University of Calgary and Calgary Laboratory Services, Calgary, AB, Canada

Article

Publisher ID: 1756-0381-7-22

DOI: 10.1186/1756-0381-7-22

PMC ID: 4224309

SO-VID: 145aa44e-ff8a-41ea-8737-bccfaa52da6a

License:

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
