
      An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics

      Research article
      BMC Bioinformatics (BioMed Central)
      The 11th Annual Bioinformatics Open Source Conference (BOSC) 2010, 9–10 July 2010


          Abstract

          Background

          Bioinformatics researchers are now confronted with the analysis of ultra-large-scale data sets, a problem that will only grow more acute in coming years. Recent developments in open source software, namely the Hadoop project and its associated software, provide a foundation for scaling to petabyte-scale data warehouses on Linux clusters, offering fault-tolerant, parallelized analysis of such data using a programming style named MapReduce.

          Description

          An overview is given of the current usage within the bioinformatics community of Hadoop, a top-level Apache Software Foundation project, and of associated open source software projects. The concepts behind Hadoop and the associated HBase project are defined, and current bioinformatics software that employs Hadoop is described. The focus is on next-generation sequencing, as the leading application area to date.

          Conclusions

          Hadoop and the MapReduce programming paradigm already have a substantial base in the bioinformatics community, especially in the field of next-generation sequencing analysis, and such use is increasing. This is due to the cost-effectiveness of Hadoop-based analysis, both on commodity Linux clusters and in the cloud via data upload to cloud vendors who have implemented Hadoop/HBase, and to the effectiveness and ease of use of the MapReduce method in parallelizing many data analysis algorithms.


          Most cited references (13)


          Searching for SNPs with cloud computing

          Rationale Improvements in DNA sequencing have made sequencing an increasingly valuable tool for the study of human variation and disease. Technologies from Illumina (San Diego, CA, USA), Applied Biosystems (Foster City, CA, USA) and 454 Life Sciences (Branford, CT, USA) have been used to detect genomic variations among humans [1-5], to profile methylation patterns [6], to map DNA-protein interactions [7], and to identify differentially expressed genes and novel splice junctions [8,9]. Meanwhile, technical improvements have greatly decreased the cost and increased the size of sequencing datasets. For example, at the beginning of 2009 a single Illumina instrument was capable of generating 15 to 20 billion bases of sequencing data per run. Illumina has projected [10] that its instrument will generate 90 to 95 billion bases per run by the end of 2009, quintupling its throughput in one year. Another study shows the per-subject cost for whole-human resequencing declining rapidly over the past year [11], which will fuel further adoption. Growth in throughput and adoption are vastly outpacing improvements in computer speed, demanding a level of computational power achievable only via large-scale parallelization. Two recent projects have leveraged parallelism for whole-genome assembly with short reads. Simpson et al. [12] use ABySS to assemble the genome of a human from 42-fold coverage of short reads [2] using a cluster of 168 cores (21 computers), in about 3 days of wall clock time. Jackson and colleagues [13] assembled a Drosophila melanogaster genome from simulated short reads on a 512-node BlueGene/L supercomputer in less than 4 hours of total elapsed time. Though these efforts demonstrate the promise of parallelization, they are not widely applicable because they require access to a specific type of hardware resource. No two clusters are exactly alike, so scripts and software designed to run well on one cluster may run poorly or fail entirely on another cluster. Software written for large supercomputers like BlueGene/L is less reusable still, since only select researchers have access to such machines. Lack of reusability also makes it difficult for peers to recreate scientific results obtained using such systems. An increasingly popular alternative for large-scale computations is cloud computing. Instead of owning and maintaining dedicated hardware, cloud computing offers a 'utility computing' model, that is, the ability to rent and perform computation on standard, commodity computer hardware over the Internet. These rented computers run in a virtualized environment where the user is free to customize the operating system and software installed. Cloud computing also offers a parallel computing framework called MapReduce [14], which was designed by Google to efficiently scale computation to many hundreds or thousands of commodity computers. Hadoop [15] is an open source implementation of MapReduce that is widely used to process very large datasets, including at companies such as Google, Yahoo, Microsoft, IBM, and Amazon. Hadoop programs can run on any cluster where the portable, Java-based Hadoop framework is installed. This may be a local or institutional cluster to which the user has free access, or it may be a cluster rented over the Internet through a utility computing service. In addition to high scalability, the use of both standard software (Hadoop) and standard hardware (utility computing) affords reusability and reproducibility. 
The CloudBurst project [16] explored the benefits of using Hadoop as a platform for alignment of short reads. CloudBurst is capable of reporting all alignments for millions of human short reads in minutes, but does not scale well to human resequencing applications involving billions of reads. Whereas CloudBurst aligns about 1 million short reads per minute on a 24-core cluster, a typical human resequencing project generates billions of reads, requiring more than 100 days of cluster time or a much larger cluster. Also, whereas CloudBurst is designed to efficiently discover all valid alignments per read, resequencing applications often ignore or discount evidence from repetitively aligned reads as they tend to confound genotyping. Our goal for this work was to explore whether cloud computing could be profitably applied to the largest problems in comparative genomics. We focus on human resequencing, and single nucleotide polymorphism (SNP) detection specifically, in order to allow comparisons to previous studies. We present Crossbow, a Hadoop-based software tool that combines the speed of the short read aligner Bowtie [17] with the accuracy of the SNP caller SOAPsnp [18] to perform alignment and SNP detection for multiple whole-human datasets per day. In our experiments, Crossbow aligns and calls SNPs from 38-fold coverage of a Han Chinese male genome [5] in as little as 3 hours (4 hours 30 minutes including transfer time) using a 320-core cluster. SOAPsnp was previously shown to make SNP calls that agree closely with genotyping results obtained with an Illumina 1 M BeadChip assay of the Han Chinese genome [18] when used in conjunction with the short read aligner SOAP [19]. We show that SNPs reported by Crossbow exhibit a level of BeadChip agreement comparable to that achieved in the original SOAPsnp study, but in far less time. Crossbow is open source software available from the Bowtie website [20]. Crossbow can be run on any cluster with appropriate versions of Hadoop, Bowtie, and SOAPsnp installed. Crossbow is distributed with scripts allowing it to run either on a local cluster or on a cluster rented through Amazon's Elastic Compute Cloud (EC2) [21] utility computing service. Version 0.1.3 of the Crossbow software is also provided as Additional data file 1. Results Crossbow harnesses cloud computing to efficiently and accurately align billions of reads and call SNPs in hours, including for high-coverage whole-human datasets. Within Crossbow, alignment and SNP calling are performed by Bowtie and SOAPsnp, respectively, in a seamless, automatic pipeline. Crossbow can be run on any computer cluster with the prerequisite software installed. The Crossbow package includes scripts that allow the user to run an entire Crossbow session remotely on an Amazon EC2 cluster of any size. Resequencing simulated data To measure Crossbow's accuracy where true SNPs are known, we conducted two experiments using simulated paired-end read data from human chromosomes 22 and X. Results are shown in Tables 1 and 2. For both experiments, 40-fold coverage of 35-bp paired-end reads were simulated from the human reference sequence (National Center for Biotechnology Information (NCBI) 36.3). Quality values and insert lengths were simulated based on empirically observed qualities and inserts in the Wang et al. dataset [5]. 
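To make the simulation scale above concrete, the short sketch below estimates how many 35-bp reads (and read pairs) 40-fold coverage implies for each chromosome. The chromosome lengths are the approximate values given in Table 1; the simple coverage = reads x read length / reference length relation is our simplification, not the authors' read simulator.

```python
# Back-of-envelope read counts for 40-fold coverage with 35-bp paired-end reads.
# Chromosome lengths are the Table 1 values; the coverage relation
# (coverage = reads * read_length / reference_length) is our simplification.

READ_LEN = 35          # bp
COVERAGE = 40          # fold

chromosomes = {
    "chr22": 49_700_000,   # ~49.7 million bp (Table 1)
    "chrX": 155_000_000,   # ~155 million bp (Table 1)
}

for name, length in chromosomes.items():
    reads = COVERAGE * length / READ_LEN
    pairs = reads / 2      # two 35-bp ends per simulated fragment
    print(f"{name}: ~{reads/1e6:.1f} million reads (~{pairs/1e6:.1f} million pairs)")
```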
Table 1. Experimental parameters for Crossbow experiments using simulated reads from human chromosomes 22 and X

                                Chromosome 22    Chromosome X
  Reference base pairs          49.7 million     155 million
  Chromosome copy number        Diploid          Haploid
  HapMap SNPs introduced        36,096           71,976
    Heterozygous                24,761           0
    Homozygous                  11,335           71,976
  Novel SNPs introduced         10,490           30,243
    Heterozygous                6,967            0
    Homozygous                  3,523            30,243
  Simulated coverage            40-fold          40-fold
  Read type                     35-bp paired     35-bp paired

Table 2. SNP calling measurements for Crossbow experiments using simulated reads from human chromosomes 22 and X

                            ---------- Chromosome 22 -----------   ----------- Chromosome X -----------
                            True sites   Sensitivity   Precision   True sites   Sensitivity   Precision
  All SNP sites             46,586       99.0%         99.1%       102,219      99.0%         99.6%
  Only HapMap SNP sites     36,096       99.8%         99.9%       71,976       99.9%         99.9%
  Only novel SNP sites      10,490       96.3%         96.3%       30,243       96.8%         98.8%
  Only homozygous           14,858       98.7%         99.9%       NA           NA            NA
  Only heterozygous         31,728       99.2%         98.8%       NA           NA            NA
  Only novel het            6,967        96.6%         94.6%       NA           NA            NA
  All other                 39,619       99.4%         99.9%       NA           NA            NA

  Sensitivity is the proportion of true SNPs that were correctly identified. Precision is the proportion of called SNPs that were genuine. NA denotes "not applicable" because of the ploidy of the chromosome.

SOAPsnp can exploit user-supplied information about known SNP loci and allele frequencies to refine its prior probabilities and improve accuracy. Therefore, the read simulator was designed to simulate both known HapMap [22] SNPs and novel SNPs. This mimics resequencing experiments where many SNPs are known but some are novel. Known SNPs were selected at random from actual HapMap alleles for human chromosomes 22 and X. Positions and allele frequencies for known SNPs were calculated according to the same HapMap SNP data used to simulate SNPs.

For these simulated data, Crossbow agrees substantially with the true calls, with greater than 99% precision and sensitivity overall for chromosome 22. Performance for HapMap SNPs is noticeably better than for novel SNPs, owing to SOAPsnp's ability to adjust SNP-calling priors according to known allele frequencies. Performance is similar for homozygous and heterozygous SNPs overall, but novel heterozygous SNPs yielded the worst performance of any subset studied, with 96.6% sensitivity and 94.6% precision on chromosome 22. This is as expected, since novel SNPs do not benefit from prior knowledge, and heterozygous SNPs are more difficult than homozygous SNPs to distinguish from the background of sequencing errors.
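The sensitivity and precision figures reported in Table 2 come down to a set comparison between true and called SNP positions. The following is a minimal sketch of that computation with toy positions; it is our own illustration, not the evaluation code used in the study.

```python
# Sensitivity and precision as defined in the Table 2 footnote:
# sensitivity = fraction of true SNPs that were called,
# precision   = fraction of called SNPs that are true.
# Positions here are toy values for illustration only.

true_snps = {1_000, 2_500, 4_200, 7_777, 9_010}      # simulated (known) SNP sites
called_snps = {1_000, 2_500, 4_200, 5_555, 9_010}    # sites reported by the caller

true_positives = true_snps & called_snps
sensitivity = len(true_positives) / len(true_snps)
precision = len(true_positives) / len(called_snps)

print(f"sensitivity = {sensitivity:.1%}, precision = {precision:.1%}")
# -> sensitivity = 80.0%, precision = 80.0% for these toy sets
```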
Whole-human resequencing

To demonstrate performance on real-world data, we used Crossbow to align and call SNPs from the set of 2.7 billion unpaired and paired-end reads sequenced from a Han Chinese male by Wang et al. [5]. Previous work demonstrated that SNPs called from this dataset by a combination of SOAP and SOAPsnp are highly concordant with genotypes called by an Illumina 1 M BeadChip genotyping assay of the same individual [18]. Since Crossbow uses SOAPsnp as its SNP caller, we expected Crossbow to yield very similar, but not identical, output.

Differences may occur because: Crossbow uses Bowtie whereas the previous study used SOAP to align the reads; the Crossbow version of SOAPsnp has been modified somewhat to operate within a MapReduce context; in this study, alignments are binned into non-overlapping 2-Mbp partitions rather than into chromosomes prior to being given to SOAPsnp; and the SOAPsnp study used additional filters to remove some low-confidence SNPs. Despite these differences, Crossbow achieves comparable agreement with the BeadChip assay and at a greatly accelerated rate.

We downloaded 2.66 billion reads from a mirror of the YanHuang site [23]. These reads cover the assembled human genome sequence to 38-fold coverage. They consist of 2.02 billion unpaired reads with sizes ranging from 25 to 44 bp, and 658 million paired-end reads. The most common unpaired read lengths are 35 and 40 bp, comprising 73.0% and 17.4% of unpaired reads, respectively. The most common paired-end read length is 35 bp, comprising 88.8% of all paired-end reads. The distribution of paired-end separation distances is bimodal with peaks in the 120 to 150 bp and 420 to 460 bp ranges.

Table 3 shows a comparison of SNPs called by either of the sequencing-based assays - Crossbow labeled 'CB' and SOAP+SOAPsnp labeled 'SS' - against SNPs obtained with the Illumina 1 M BeadChip assay from the SOAPsnp study [18]. The 'Sites covered' column reports the proportion of BeadChip sites covered by a sufficient number of sequencing reads. Sufficient coverage is roughly four reads for diploid chromosomes and two reads for haploid chromosomes (see Materials and methods for more details about how sufficient coverage is determined). The 'Agreed' column shows the proportion of covered BeadChip sites where the BeadChip call equaled the SOAPsnp or Crossbow call. The 'Missed allele' column shows the proportion of covered sites where SOAPsnp or Crossbow called a position as homozygous for one of two heterozygous alleles called by BeadChip at that position. The 'Other disagreement' column shows the proportion of covered sites where the BeadChip call differed from the SOAPsnp/Crossbow call in any other way. Definitions of the 'Missed allele' and 'Other disagreement' columns correspond to the definitions of 'false negatives' and 'false positives', respectively, in the SOAPsnp study.

Table 3. Coverage and agreement measurements comparing Crossbow (CB) and SOAP+SOAPsnp (SS) to the genotyping results obtained by an Illumina 1 M genotyping assay in the SOAPsnp study

  Illumina 1 M genotype   Sites     Sites covered (SS)   Sites covered (CB)   Agreed (SS)   Agreed (CB)   Missed allele (SS)   Other disagreement (SS)   Missed allele (CB)   Other disagreement (CB)
  Chromosome X
    HOM reference         27,196    98.65%               99.83%               99.99%        99.99%        NA                   0.004%                    NA                   0.011%
    HOM mutant            10,737    98.49%               99.19%               99.89%        99.85%        NA                   0.113%                    NA                   0.150%
    Total                 37,933    98.61%               99.65%               99.97%        99.95%        NA                   0.035%                    NA                   0.050%
  Autosomal
    HOM reference         540,878   99.11%               99.88%               99.96%        99.92%        NA                   0.044%                    NA                   0.078%
    HOM mutant            208,436   98.79%               99.28%               99.81%        99.70%        NA                   0.194%                    NA                   0.296%
    HET                   250,667   94.81%               99.64%               99.61%        99.75%        0.374%               0.017%                    0.236%               0.014%
    Total                 999,981   97.97%               99.70%               99.84%        99.83%        0.091%               0.069%                    0.059%               0.108%

  'Sites covered' is the proportion of BeadChip sites covered by a sufficient number of sequencing reads (roughly four reads for diploid and two reads for haploid chromosomes). 'Agreed' is the proportion of covered BeadChip sites where the BeadChip call equaled the SOAPsnp/Crossbow call. 'Missed allele' is the proportion of covered sites where SOAPsnp/Crossbow called a position as homozygous for one of two heterozygous alleles called by BeadChip. 'Other disagreement' is the proportion of covered sites where the BeadChip call differed from the SOAPsnp/Crossbow call in any other way. NA denotes "not applicable" due to ploidy.
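The 'Agreed', 'Missed allele' and 'Other disagreement' categories just defined can be expressed as a small classifier over genotype pairs. The sketch below is only an illustration of those definitions (representing genotypes as unordered allele pairs is our choice), not the comparison script used in the SOAPsnp or Crossbow studies.

```python
# Classify a sequencing-based call against a BeadChip genotype using the
# Table 3 definitions. Genotypes are unordered pairs of alleles, e.g. "AG".

def classify(beadchip: str, seq_call: str) -> str:
    bead = frozenset(beadchip)
    seq = frozenset(seq_call)
    if bead == seq:
        return "agreed"
    # Heterozygous BeadChip site called homozygous for one of its two alleles
    if len(bead) == 2 and len(seq) == 1 and seq <= bead:
        return "missed allele"
    return "other disagreement"

examples = [("AG", "AG"), ("AG", "AA"), ("AG", "CC"), ("AA", "AG")]
for bead, call in examples:
    print(f"BeadChip {bead} vs call {call}: {classify(bead, call)}")
```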
Both Crossbow and SOAP+SOAPsnp exhibit a very high level of agreement with the BeadChip genotype calls. The small differences in number of covered sites ( 10 Gb/s JANET and Internet2 network backbones, as do many academic institutions. However, even at these institutions, the bandwidth available for a given server or workstation can be considerably less (commonly 100 Mb/s or less). Delays due to slow uplinks can be mitigated by transferring large datasets in stages as reads are generated by the sequencer, rather than all at once.

To measure how the whole-genome Crossbow computation scales, separate experiments were performed using 10, 20 and 40 EC2 Extra Large High CPU nodes. Table 4 presents the wall clock running time and approximate cost for each experiment. The experiment was performed once for each cluster size. The results show that Crossbow is capable of calling SNPs from 38-fold coverage of the human genome in under 3 hours of wall clock time and for about $85 ($96 in Europe).

Table 4. Timing and cost for Crossbow experiments using reads from the Wang et al. study [5]

  EC2 nodes                        1 master, 10 workers   1 master, 20 workers   1 master, 40 workers
  Worker CPU cores                 80                     160                    320
  Wall clock time                  6 h 30 m               4 h 33 m               2 h 53 m
  Approximate cluster setup time   18 m                   18 m                   21 m
  Approximate Crossbow time        6 h 12 m               4 h 15 m               2 h 32 m
  Approximate cost (US/Europe)     $52.36/$60.06          $71.40/$81.90          $83.64/$95.94

  Costs are approximate and based on the pricing as of this writing, that is, $0.68 per extra-large high-CPU EC2 node per hour in the US and $0.78 in Europe. Times can vary subject to, for example, congestion and Internet traffic conditions.

Figure 1 illustrates scalability of the computation as a function of the number of processor cores allocated. Units on the vertical axis are the reciprocal of the wall clock time. Whereas wall clock time measures elapsed time, its reciprocal measures throughput - that is, experiments per hour. The straight diagonal line extending from the 80-core point represents hypothetical linear speedup, that is, extrapolated throughput under the assumption that doubling the number of processors also doubles throughput. In practice, parallel algorithms usually exhibit worse-than-linear speedup because portions of the computation are not fully parallel. In the case of Crossbow, deviation from linear speedup is primarily due to load imbalance among CPUs in the map and reduce phases, which can cause a handful of work-intensive 'straggler' tasks to delay progress. The reduce phase can also experience imbalance due to, for example, variation in coverage.

Figure 1. Number of worker CPU cores allocated from EC2 versus throughput measured in experiments per hour: that is, the reciprocal of the wall clock time required to conduct a whole-human experiment on the Wang et al. dataset [5]. The line labeled 'linear speedup' traces hypothetical linear speedup relative to the throughput for 80 CPU cores.

Materials and methods

Alignment and SNP calling in Hadoop

Hadoop is an implementation of the MapReduce parallel programming model. Under Hadoop, programs are expressed as a series of map and reduce phases operating on tuples of data.
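As a minimal, self-contained illustration of the 'map and reduce phases operating on tuples' model just described, the toy emulation below maps input tuples to key/value tuples, groups them by key, and reduces each group. It is a teaching sketch, not Hadoop, and the word-count mapper is a stand-in for a real workload.

```python
# Toy emulation of the MapReduce tuple model: map each input tuple to zero or
# more (key, value) tuples, shuffle by key, then reduce each key group.
from collections import defaultdict

def mapper(record):
    # Example mapper: count words in a line of text.
    _, line = record
    for word in line.split():
        yield (word, 1)

def reducer(key, values):
    yield (key, sum(values))

def run(records):
    groups = defaultdict(list)
    for record in records:                      # map phase
        for key, value in mapper(record):
            groups[key].append(value)
    out = []
    for key in sorted(groups):                  # shuffle/sort phase (by key)
        out.extend(reducer(key, groups[key]))   # reduce phase
    return out

print(run([(0, "call SNPs from reads"), (1, "reads in reads out")]))
```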
Though not all programs are easily expressed this way, Hadoop programs stand to benefit from services provided by Hadoop. For instance, Hadoop programs need not deal with particulars of how work and data are distributed across the cluster; these details are handled by Hadoop, which automatically partitions, sorts and routes data among computers and processes. Hadoop also provides fault tolerance by partitioning files into chunks and storing them redundantly on the HDFS. When a subtask fails due to hardware or software errors, Hadoop restarts the task automatically, using a cached copy of its input data.

A mapper is a short program that runs during the map phase. A mapper receives a tuple of input data, performs a computation, and outputs zero or more tuples of data. A tuple consists of a key and a value. For example, within Crossbow a read is represented as a tuple where the key is the read's name and the value equals the read's sequence and quality strings. The mapper is generally constrained to be stateless - that is, the content of an output tuple may depend only on the content of the corresponding input tuple, and not on previously observed tuples. This enables MapReduce to safely execute many instances of the mapper in parallel.

Similar to a mapper, a reducer is a short program that runs during the reduce phase, but with the added condition that a single instance of the reducer will receive all tuples from the map phase with the same key. In this way, the mappers typically compute partial results, and the reducer finalizes the computation using all the tuples with the same key, and outputs zero or more output tuples. The reducer is also constrained to be stateless - that is, the content of an output tuple may depend only on the content of the tuples in the incoming batch, not on any other previously observed input tuples. Between the map and reduce phases, Hadoop automatically executes a sort/shuffle phase that bins and sorts tuples according to primary and secondary keys before passing batches on to reducers. Because mappers and reducers are stateless, and because Hadoop itself handles the sort/shuffle phase, Hadoop has significant freedom in how it distributes parallel chunks of work across the cluster.

The chief insight behind Crossbow is that alignment and SNP calling can be framed as a series of map, sort/shuffle and reduce phases. The map phase is short read alignment, where input tuples represent reads and output tuples represent alignments. The sort/shuffle phase bins alignments according to the genomic region ('partition') aligned to. The sort/shuffle phase also sorts alignments along the forward strand of the reference in preparation for consensus calling. The reduce phase calls SNPs for a given partition, where input tuples represent the sorted list of alignments occurring in the partition and output tuples represent SNP calls.

A typical Hadoop program consists of Java classes implementing the mapper and reducer running in parallel on many compute nodes. However, Hadoop also supports a 'streaming' mode of operation whereby the map and reduce functions are delegated to command-line scripts or compiled programs written in any language. In streaming mode, Hadoop executes the streaming programs in parallel on different compute nodes, and passes tuples into and out of the program as tab-delimited lines of text written to the 'standard in' and 'standard out' file handles.
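In streaming mode, then, a mapper or reducer is simply a program that reads tab-delimited key/value lines on standard input and writes tab-delimited lines on standard output. The pair of functions below sketches that contract; it is a generic illustration, not Crossbow's actual mapper (Bowtie) or reducer (a SOAPsnp wrapper).

```python
# Minimal Hadoop-streaming-style scripts: tuples are tab-delimited lines.
import sys

def streaming_mapper(stdin=sys.stdin, stdout=sys.stdout):
    """Emit one (key, value) line per input line; a real mapper would
    transform the value (for example, align a read)."""
    for line in stdin:
        key, _, value = line.rstrip("\n").partition("\t")
        stdout.write(f"{key}\t{value}\n")

def streaming_reducer(stdin=sys.stdin, stdout=sys.stdout):
    """Hadoop delivers all lines with the same key to one reducer, grouped
    together; this toy reducer just counts values per key."""
    current_key, count = None, 0
    for line in stdin:
        key, _, _ = line.rstrip("\n").partition("\t")
        if key != current_key and current_key is not None:
            stdout.write(f"{current_key}\t{count}\n")
            count = 0
        current_key = key
        count += 1
    if current_key is not None:
        stdout.write(f"{current_key}\t{count}\n")
```

Such scripts are normally handed to Hadoop's streaming jar via its -mapper and -reducer options; the exact invocation depends on the Hadoop installation.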
This allows Crossbow to reuse existing software for aligning reads and calling SNPs while automatically gaining the scaling benefits of Hadoop. For alignment, Crossbow uses Bowtie [17], which employs a Burrows-Wheeler index [25] based on the full-text minute-space (FM) index [26] to enable fast and memory-efficient alignment of short reads to mammalian genomes. To report SNPs, Crossbow uses SOAPsnp [18], which combines multiple techniques to provide high-accuracy haploid or diploid consensus calls from short read alignment data. At the core of SOAPsnp is a Bayesian SNP model with configurable prior probabilities. SOAPsnp's priors take into account differences in prevalence between, for example, heterozygous versus homozygous SNPs and SNPs representing transitions versus those representing transversions. SOAPsnp can also use previously discovered SNP loci and allele frequencies to refine priors. Finally, SOAPsnp recalibrates the quality values provided by the sequencer according to a four-dimensional training matrix representing observed error rates among uniquely aligned reads. In a previous study, human genotype calls obtained using the SOAP aligner and SOAPsnp exhibited greater than 99% agreement with genotype calls obtained using an Illumina 1 M BeadChip assay of the same Han Chinese individual [18]. Crossbow's efficiency requires that the three MapReduce phases, map, sort/shuffle and reduce, each be efficient. The map and reduce phases are handled by Bowtie and SOAPsnp, respectively, which have been shown to perform efficiently in the context of human resequencing. But another advantage of Hadoop is that its implementation of the sort/shuffle phase is extremely efficient, even for human resequencing where mappers typically output billions of alignments and hundreds of gigabytes of data to be sorted. Hadoop's file system (HDFS) and intelligent work scheduling make it especially well suited for huge sort tasks, as evidenced by the fact that a 1,460-node Hadoop cluster currently holds the speed record for sorting 1 TB of data on commodity hardware (62 seconds) [27]. Modifications to existing software Several new features were added to Bowtie to allow it to operate within Hadoop. A new input format (option --12) was added, allowing Bowtie to recognize the one-read-per-line format produced by the Crossbow preprocessor. New command-line options --mm and --shmem instruct Bowtie to use memory-mapped files or shared memory, respectively, for loading and storing the reference index. These features allow many Bowtie processes, each acting as an independent mapper, to run in parallel on a multi-core computer while sharing a single in-memory image of the reference index. This maximizes alignment throughput when cluster computers contain many CPUs but limited memory. Finally, a Crossbow-specific output format was implemented that encodes an alignment as a tuple where the tuple's key identifies a reference partition and the value describes the alignment. Bowtie detects instances where a reported alignment spans a boundary between two reference partitions, in which case Bowtie outputs a pair of alignment tuples with identical values but different keys, each identifying one of the spanned partitions. These features are enabled via the --partition option, which also sets the reference partition size. The version of SOAPsnp used in Crossbow was modified to accept alignment records output by modified Bowtie. 
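The boundary-spanning behavior just described (an alignment that crosses a partition boundary is emitted once per spanned partition, with identical values but different keys) can be sketched as follows. The key layout and record format here are illustrative assumptions; only the 2-Mbp partition size and the duplicate-emission rule come from the text.

```python
# Emit one (key, value) tuple per reference partition overlapped by an
# alignment. Keys are (chromosome, partition index); values carry the
# alignment itself. Layout and record format are illustrative.

PARTITION_SIZE = 2_000_000  # 2 Mbp, as used by Crossbow

def partition_tuples(chrom, start, end, alignment_record):
    first = start // PARTITION_SIZE
    last = (end - 1) // PARTITION_SIZE
    for part in range(first, last + 1):
        # Same value, different key for each spanned partition.
        yield ((chrom, part), alignment_record)

# An alignment spanning the boundary at 2,000,000 yields two tuples:
for key, value in partition_tuples("chr22", 1_999_990, 2_000_025, "ALN..."):
    print(key, value)
# -> ('chr22', 0) ALN...
#    ('chr22', 1) ALN...
```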
Speed improvements were also made to SOAPsnp, including an improvement for the case where the input alignments cover only a small interval of a chromosome, as is the case when Crossbow invokes SOAPsnp on a single partition. None of the modifications made to SOAPsnp fundamentally affect how consensus bases or SNPs are called.

Workflow

The input to Crossbow is a set of preprocessed read files, where each read is encoded as a tab-delimited tuple. For paired-end reads, both ends are stored on a single line. Conversion takes place as part of a bulk-copy procedure, implemented as a Hadoop program driven by automatic scripts included with Crossbow. Once preprocessed reads are situated on a filesystem accessible to the Hadoop cluster, the Crossbow MapReduce job is invoked (Figure 2). Crossbow's map phase is short read alignment by Bowtie. For fast alignment, Bowtie employs a compact index of the reference sequence, requiring about 3 GB of memory for the human genome. The index is distributed to all computers in the cluster either via Hadoop's file caching facility or by instructing each node to independently obtain the index from a shared filesystem. The map phase outputs a stream of alignment tuples where each tuple has a primary key containing chromosome and partition identifiers, and a secondary key containing the chromosome offset. The tuple's value contains the aligned sequence and quality values. The sort/shuffle phase, which is handled by Hadoop, uses Hadoop's KeyFieldBasedPartitioner to bin alignments according to the primary key and sort according to the secondary key. This allows separate reference partitions to be processed in parallel by separate reducers. It also ensures that each reducer receives alignments for a given partition in sorted order, a necessary first step for calling SNPs with SOAPsnp.

Figure 2. Crossbow workflow. Previously copied and pre-processed read files are downloaded to the cluster, decompressed and aligned using many parallel instances of Bowtie. Hadoop then bins and sorts the alignments according to primary and secondary keys. Sorted alignments falling into each reference partition are then submitted to parallel instances of SOAPsnp. The final output is a stream of SNP calls made by SOAPsnp.

The reduce phase performs SNP calling using SOAPsnp. A wrapper script performs a separate invocation of the SOAPsnp program per partition. The wrapper also ensures that SOAPsnp is invoked with appropriate options given the ploidy of the reference partition. Files containing known SNP locations and allele frequencies derived from dbSNP [28] are distributed to worker nodes via the same mechanism used to distribute the Bowtie index. The output of the reduce phase is a stream of SNP tuples, which are stored on the cluster's distributed filesystem. The final stage of the Crossbow workflow archives the SNP calls and transfers them from the cluster's distributed filesystem to the local filesystem.
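The bin-and-sort step that Hadoop performs between Crossbow's map and reduce phases (bin by the chromosome/partition primary key, sort by the offset secondary key) can be emulated in a few lines. This is a toy illustration of what the KeyFieldBasedPartitioner arrangement described above achieves, not Hadoop's implementation, and the alignment tuples are made up.

```python
# Group alignment tuples by primary key (chromosome, partition) and sort each
# group by secondary key (offset), so every "reducer" sees its partition's
# alignments in reference order. Tuples here are toy values.
from collections import defaultdict

alignments = [
    # (chromosome, partition, offset, aligned sequence/qualities)
    ("chr22", 1, 2_250_000, "ACGT..."),
    ("chr22", 0, 150_000, "TTAG..."),
    ("chr22", 1, 2_100_000, "GGCA..."),
    ("chrX",  0, 90_000, "CATG..."),
]

bins = defaultdict(list)
for chrom, part, offset, value in alignments:
    bins[(chrom, part)].append((offset, value))    # bin by primary key

for primary_key in sorted(bins):
    batch = sorted(bins[primary_key])              # sort by secondary key
    # In Crossbow, each sorted batch would feed one SOAPsnp invocation.
    print(primary_key, [offset for offset, _ in batch])
```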
Cloud support

Crossbow comes with scripts that automate the Crossbow pipeline on a local cluster or on the EC2 [21] utility computing service. The EC2 driver script can be run from any Internet-connected computer; however, all the genomic computation is executed remotely. The script runs Crossbow by: allocating an EC2 cluster using the Amazon Web Services tools; uploading the Crossbow program code to the master node; launching Crossbow from the master; downloading the results from the cluster to the local computer; and optionally terminating the cluster, as illustrated in Figure 3. The driver script detects common problems that can occur in the cluster allocation process, including when EC2 cannot provide the requested number of instances due to high demand. The overall process is identical to running on a local dedicated cluster, except that cluster nodes are allocated as requested.

Figure 3. Four basic steps to running the Crossbow computation. Two scenarios are shown: one where Amazon's EC2 and S3 services are used, and one where a local cluster is used. In step 1 (red) short reads are copied to the permanent store. In step 2 (green) the cluster is allocated (may not be necessary for a local cluster) and the scripts driving the computation are uploaded to the master node. In step 3 (blue) the computation is run. The computation downloads reads from the permanent store, operates on them, and stores the results in the Hadoop distributed filesystem. In step 4 (orange), the results are copied to the client machine and the job completes. SAN (Storage Area Network) and NAS (Network-Attached Storage) are two common ways of sharing filesystems across a local network.

Genotyping experiment

We generated 40-fold coverage of chromosomes 22 and X (NCBI 36.3) using 35-bp paired-end reads. Quality values were assigned by randomly selecting observed quality strings from a pair of FASTQ files in the Wang et al. [5] dataset (080110_EAS51_FC20B21AAXX_L7_YHPE_PE1). The mean and median quality values among those in this subset are 21.4 and 27, respectively, on the Solexa scale. Sequencing errors were simulated at each position at the rate dictated by the quality value at that position. For instance, a position with Solexa quality 30 was changed to a different base with a probability of 1 in 1,000. The three alternative bases were considered equally likely. Insert lengths were assigned by randomly selecting from a set of observed insert lengths. Observed insert lengths were obtained by aligning a pair of paired-end FASTQ files (the same pair used to simulate the quality values) using Bowtie with options '-X 10000 -v 2 --strata --best -m 1'. The observed mean mate-pair distance and standard deviation for this subset were 422 bp and 68.8 bp, respectively.

Bowtie version 0.10.2 was run with the '-v 2 --best --strata -m 1' options to obtain unique alignments with up to two mismatches. We define an alignment as unique if all other alignments for that read have strictly more mismatches. SOAPsnp was run with the rank-sum and binomial tests enabled (-u and -n options, respectively) and with known-SNP refinement enabled (-2 and -s options). Positions and allele frequencies for known SNPs were calculated according to the same HapMap SNP data used to simulate SNPs. SOAPsnp's prior probabilities for novel homozygous and heterozygous SNPs were set to the rates used by the simulator (-r 0.0001 -e 0.0002 for chromosome 22 and -r 0.0002 for chromosome X).

An instance where Crossbow reports a SNP on a diploid portion of the genome was discarded (that is, considered to be homozygous for the reference allele) if it was covered by fewer than four uniquely aligned reads. For a haploid portion, a SNP was discarded if covered by fewer than two uniquely aligned reads. For either diploid or haploid portions, a SNP was discarded if the call quality as reported by SOAPsnp was less than 20.

Whole-human resequencing experiment

Bowtie version 0.10.2 and a modified version of SOAPsnp 1.02 were used. Both were compiled for 64-bit Linux.
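The error model used by the read simulator in the genotyping experiment above ties each base's substitution probability to its quality value, with the three alternative bases equally likely (Solexa quality 30 corresponding to roughly a 1-in-1,000 error). The sketch below uses the Solexa odds-based conversion as our reading of 'on the Solexa scale'; the authors' simulator may differ in detail.

```python
# Simulate sequencing errors at a rate dictated by per-base quality values.
# Solexa-scale quality Q corresponds to error probability p with
# p/(1-p) = 10**(-Q/10), so Q=30 gives p ~= 1/1001 (about 1 in 1,000).
import random

def solexa_error_prob(q: int) -> float:
    odds = 10 ** (-q / 10)
    return odds / (1 + odds)

def mutate_base(base: str, q: int, rng: random.Random) -> str:
    if rng.random() < solexa_error_prob(q):
        # The three alternative bases are considered equally likely.
        return rng.choice([b for b in "ACGT" if b != base])
    return base

rng = random.Random(0)
read = "ACGTACGTAC"
quals = [30, 30, 10, 40, 30, 5, 30, 30, 20, 30]
simulated = "".join(mutate_base(b, q, rng) for b, q in zip(read, quals))
print(read)
print(simulated)
print(f"P(error) at Q=30: {solexa_error_prob(30):.6f}")
```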
Bowtie was run with the '-v 2 --best --strata -m 1' options, mimicking the alignment and reporting modes used in the SOAPsnp study. A modified version of SOAPsnp 1.02 was run with the rank-sum and binomial tests enabled (-u and -n options, respectively) and with known-SNP refinement enabled (-2 and -s options). Positions for known SNPs were calculated according to data in dbSNP [28] versions 128 and 130, and allele frequencies were calculated according to data from the HapMap project [22]. Only positions occurring in dbSNP version 128 were provided to SOAPsnp. This was to avoid biasing the result by including SNPs submitted by Wang et al. [5] to dbSNP version 130. SOAPsnp's prior probabilities for novel homozygous and heterozygous SNPs were left at their default values of 0.0005 and 0.001, respectively. Since the subject was male, SOAPsnp was configured to treat autosomal chromosomes as diploid and sex chromosomes as haploid. To account for base-calling errors and inaccurate quality values reported by the Illumina software pipeline [29,30], SOAPsnp recalibrates quality values according to a four-dimensional matrix recording observed error rates. Rates are calculated across a large space of parameters, the dimensions of which include sequencing cycle, reported quality value, reference allele and subject allele. In the previous study, separate recalibration matrices were trained for each human chromosome; that is, a given chromosome's matrix was trained using all reads aligning uniquely to that chromosome. In this study, each chromosome is divided into non-overlapping stretches of 2 million bases and a separate matrix is trained and used for each partition. Thus, each recalibration matrix receives less training data than if matrices were trained per-chromosome. Though the results indicate that this does not affect accuracy significantly, future work for Crossbow includes merging recalibration matrices for partitions within a chromosome prior to genotyping. An instance where Crossbow reports a SNP on a diploid portion of the genome is discarded (that is, considered to be homozygous for the reference allele) if it is covered by fewer than four unique alignments. For a haploid portion, a SNP is discarded if covered by fewer than two unique alignments. For either diploid or haploid portions, a SNP is discarded if the call quality as reported by SOAPsnp is less than 20. Note that the SOAPsnp study applies additional filters to discard SNPs at positions that, for example, are not covered by any paired-end reads or appear to have a high copy number. Adding such filters to Crossbow is future work. Discussion In this paper we have demonstrated that cloud computing realized by MapReduce and Hadoop can be leveraged to efficiently parallelize existing serial implementations of sequence alignment and genotyping algorithms. This combination allows large datasets of DNA sequences to be analyzed rapidly without sacrificing accuracy or requiring extensive software engineering efforts to parallelize the computation. We describe the implementation of an efficient whole-genome genotyping tool, Crossbow, that combines two previously published software tools: the sequence aligner Bowtie and the SNP caller SOAPsnp. Crossbow achieves at least 98.9% accuracy on simulated datasets of individual chromosomes, and better than 99.8% concordance with the Illumina 1 M BeadChip assay of a sequenced individual. 
These accuracies are comparable to those achieved in the prior SOAPsnp study once filtering stringencies are taken into account. When run on conventional computers, a deep-coverage human resequencing project requires weeks of analysis time on a single computer; by contrast, Crossbow aligns and calls SNPs from the same dataset in less than 3 hours on a 320-core cluster. By taking advantage of commodity processors available via cloud computing services, Crossbow condenses over 1,000 hours of computation into a few hours without requiring the user to own or operate a computer cluster. In addition, running on standard software (Hadoop) and hardware (EC2 instances) makes it easier for other researchers to reproduce our results or execute their own analysis with Crossbow.

Crossbow scales well to large clusters by leveraging Hadoop and the established, fast Bowtie and SOAPsnp algorithms with limited modifications. The ultrafast Bowtie alignment algorithm, utilizing a quality-directed best-first search of the FM index, is especially important to the overall performance of Crossbow relative to CloudBurst. Crossbow's alignment stage vastly outperforms the fixed-seed seed-and-extend search algorithm of CloudBurst on clusters of the same size. We expect that the Crossbow infrastructure will serve as a foundation for bringing massive scalability to other high-volume sequencing experiments, such as RNA-seq and ChIP-seq. In our experiments, we demonstrated that Crossbow works equally well on either a local cluster or a remote cluster, but in the future we expect that utility computing services will make cloud computing applications widely available to any researcher.

Abbreviations

EC2: Elastic Compute Cloud; FM: full-text minute-space; HDFS: Hadoop Distributed Filesystem; NCBI: National Center for Biotechnology Information; S3: Simple Storage Service; SNP: single nucleotide polymorphism.

Authors' contributions

BL and MCS developed the algorithms, collected results, and wrote the software. JL contributed to discussions on algorithms. BL, MCS, JL, MP, and SLS wrote the manuscript.

Additional data files

The following additional data are included with the online version of this article: version 0.1.3 of the Crossbow software (Additional data file 1).

Supplementary Material

Additional data file 1: Version 0.1.3 of the Crossbow software.

            CloudBurst: highly sensitive read mapping with MapReduce

            Motivation: Next-generation DNA sequencing machines are generating an enormous amount of sequence data, placing unprecedented demands on traditional single-processor read-mapping algorithms. CloudBurst is a new parallel read-mapping algorithm optimized for mapping next-generation sequence data to the human genome and other reference genomes, for use in a variety of biological analyses including SNP discovery, genotyping and personal genomics. It is modeled after the short read-mapping program RMAP, and reports either all alignments or the unambiguous best alignment for each read with any number of mismatches or differences. This level of sensitivity could be prohibitively time consuming, but CloudBurst uses the open-source Hadoop implementation of MapReduce to parallelize execution using multiple compute nodes. Results: CloudBurst's running time scales linearly with the number of reads mapped, and with near linear speedup as the number of processors increases. In a 24-processor core configuration, CloudBurst is up to 30 times faster than RMAP executing on a single core, while computing an identical set of alignments. Using a larger remote compute cloud with 96 cores, CloudBurst improved performance by >100-fold, reducing the running time from hours to mere minutes for typical jobs involving mapping of millions of short reads to the human genome. Availability: CloudBurst is available open-source as a model for parallelizing algorithms with MapReduce at http://cloudburst-bio.sourceforge.net/. Contact: mschatz@umiacs.umd.edu
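For a rough sense of the scaling figures quoted in this abstract, the arithmetic below compares the 24-core and 96-core configurations. Because the quoted speedups are measured against a different program (RMAP), the resulting efficiency is only a ballpark reading of ours, not a number from the paper.

```python
# Rough scaling arithmetic from the figures quoted above: ~30x over single-core
# RMAP with 24 cores, and >100x with 96 cores. Baselines differ (RMAP vs
# CloudBurst), so treat these as ballpark numbers only.

speedup_24, cores_24 = 30, 24
speedup_96, cores_96 = 100, 96

core_ratio = cores_96 / cores_24        # 4x more cores
speedup_ratio = speedup_96 / speedup_24 # observed gain between configurations
print(f"cores x{core_ratio:.1f}, speedup x{speedup_ratio:.2f}, "
      f"parallel efficiency ~{speedup_ratio / core_ratio:.0%}")
# -> roughly 83% of ideal linear scaling between the two configurations
```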

              The case for cloud computing in genome informatics

              The impending collapse of the genome informatics ecosystem Since the 1980s, we have had the great fortune to work in a comfortable and effective ecosystem for the production and consumption of genomic information (Figure 1). Sequencing labs submit their data to big archival databases such as GenBank at the National Center for Biotechnology Information (NCBI) [1], the European Bioinformatics Institute EMBL database [2], DNA Data Bank of Japan (DDBJ) [3], the Short Read Archive (SRA) [4], the Gene Expression Omnibus (GEO) [5] and the microarray database ArrayExpress [6]. These databases maintain, organize and distribute the sequencing data. Most users access the information either through websites created by the archival databases, or through value-added integrators of genomic data, such as Ensembl [7], the University of California at Santa Cruz (UCSC) Genome Browser [8], Galaxy [9], or one of the many model organism databases [10-13]. Bioinformaticians and other power users download genomic data from these primary and secondary sources to their high performance clusters of computers ('compute clusters'), work with them and discard them when no longer needed (Figure 1). Figure 1 The old genome informatics ecosystem. Under the traditional flow of genome information, sequencing laboratories transmit raw and interpreted sequencing information across the internet to one of several sequencing archives. This information is accessed either directly by casual users or indirectly via a website run by one of the value-added genome integrators. Power users typically download large datasets from the archives onto their local compute clusters for computationally intensive number crunching. Under this model, the sequencing archives, value-added integrators and power users all maintain their own compute and storage clusters and keep local copies of the sequencing datasets. The whole basis for this ecosystem is Moore's Law [14], a long-term trend first described in 1965 by Intel co-founder Gordon Moore. Moore's Law states that the number of transistors that can be placed on an integrated circuit board is increasing exponentially, with a doubling time of roughly 18 months. The trend has held up remarkably well for 35 years across multiple changes in semiconductor technology and manufacturing techniques. Similar laws for disk storage and network capacity have also been observed. Hard disk capacity doubles roughly annually (Kryder's Law [15]), and the cost of sending a bit of information over optical networks halves every 9 months (Butter's Law [16]). Genome sequencing technology has also improved dramatically, and the number of bases that can be sequenced per unit cost has also been growing at an exponential rate. However, until just a few years ago, the doubling time for DNA sequencing was just a bit slower than the growth of compute and storage capacity. This was great for the genome informatics ecosystem. The archival databases and the value-added genome distributors did not need to worry about running out of disk storage space because the long-term trends allowed them to upgrade their capacity faster than the world's sequencing labs could update theirs. Computational biologists did not worry about not having access to sufficiently powerful networks or compute clusters because they were always slightly ahead of the curve. 
However, the advent of 'next generation' sequencing technologies in the mid-2000s changed these long-term trends and now threatens the conventional genome informatics ecosystem. To illustrate this, I recently plotted long-term trends in hard disk prices and DNA sequencing prices by using the Internet Archive's 'Wayback Machine' [17], which keeps archives of websites as they appeared in the past, to view vendors' catalogs, websites and press releases as they appeared over the past 20 years (Figure 2). Notice that this is a logarithmic plot, so exponential curves appear as straight lines. I made no attempt to factor in inflation or to calculate the cost of DNA sequencing with labor and overheads included, but the trends are clear. From 1990 to 2010, the cost of storing a byte of data has halved every 14 months, consistent with Kryder's Law. From 1990 to 2004, the cost of sequencing a base decreased more slowly than this, halving every 19 months - good news if you are running the bioinformatics core for a genome sequencing center. Figure 2 Historical trends in storage prices versus DNA sequencing costs. The blue squares describe the historic cost of disk prices in megabytes per US dollar. The long-term trend (blue line, which is a straight line here because the plot is logarithmic) shows exponential growth in storage per dollar with a doubling time of roughly 1.5 years. The cost of DNA sequencing, expressed in base pairs per dollar, is shown by the red triangles. It follows an exponential curve (yellow line) with a doubling time slightly slower than disk storage until 2004, when next generation sequencing (NGS) causes an inflection in the curve to a doubling time of less than 6 months (red line). These curves are not corrected for inflation or for the 'fully loaded' cost of sequencing and disk storage, which would include personnel costs, depreciation and overhead. However, from 2005 the slope of the DNA sequencing curve increases abruptly. This corresponds to the advent of the 454 Sequencer [18], quickly followed by the Solexa/Illumina [19] and ABI SOLiD [20] technologies. Since then, the cost of sequencing a base has been dropping by half every 5 months. The cost of genome sequencing is now decreasing several times faster than the cost of storage, promising that at some time in the not too distant future it will cost less to sequence a base of DNA than to store it on a hard disk. Of course there is no guarantee that this accelerated trend will continue indefinitely, but recent and announced offerings from Illumina [21], Pacific Biosystems [22], Helicos [23] and Ion Torrent [24], among others, promise to continue the trend until the middle of the decade. This change in the long-term trend overthrows the assumptions that support the current ecosystem. The various members of the genome informatics ecosystem are now facing a potential tsunami of genome data that will swamp our storage systems and crush our compute clusters. Just consider this one statistic: the first big genome project based on next generation sequencing technologies, the 1000 Genomes Project [25], which is cataloguing human genetic variation, deposited twice as much raw sequencing data into GenBank's SRA division during the project's first 6 months of operation as had been deposited into all of GenBank for the entire 30 years preceding (Paul Flicek, personal communication). But the 1000 Genomes Project is just the first ripple of the tsunami. 
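The competing halving times quoted above (storage cost halving roughly every 14 months, sequencing cost halving roughly every 5 months since the advent of next generation sequencing) can be played forward with a couple of lines of arithmetic. Only the halving times come from the text; the initial 100-fold price gap between sequencing and storing a base is an arbitrary illustrative value.

```python
# Project relative costs under exponential halving:
# cost(t) = cost0 * 0.5 ** (t / halving_time).
# Halving times (months) are the ones quoted in the text; the initial 100x
# price gap between sequencing and storing a base is a made-up illustration.

def cost(cost0, months, halving_months):
    return cost0 * 0.5 ** (months / halving_months)

STORAGE_HALVING = 14      # months (storage trend quoted above)
SEQUENCING_HALVING = 5    # months (post-2005 sequencing trend quoted above)

storage0, sequencing0 = 1.0, 100.0   # hypothetical relative costs per base at t=0

for years in range(0, 7):
    months = 12 * years
    s = cost(storage0, months, STORAGE_HALVING)
    q = cost(sequencing0, months, SEQUENCING_HALVING)
    print(f"year {years}: sequencing/storage cost ratio ~ {q / s:.2f}")
# With these assumptions the ratio drops below 1 within about five years,
# i.e. sequencing a base becomes cheaper than storing it.
```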
Projects like ENCODE [26] and modENCODE [27], which use next generation sequencing for high-resolution mapping of epigenetic marks, chromatin-binding proteins and other functional elements, are currently generating raw sequence at tremendous rates. Cancer genome projects such as The Cancer Genome Atlas [28] and the International Cancer Genome Sequencing Consortium [29] are an order of magnitude larger than the 1000 Genomes Project, and the various Human Microbiome Projects [30,31] are potentially even larger still. Run for the hills? First, we must face up to reality. The ability of laboratories around the world to produce sequence faster and more cheaply than information technology groups can upgrade their storage systems is a fundamental challenge that admits no easy solution. At some future point it will become simply unfeasible to store all raw sequencing reads in a central archive or even in local storage. Genome biologists will have to start acting like the high energy physicists, who filter the huge datasets coming out of their collectors for a tiny number of informative events and then discard the rest. Even though raw read sets may not be preserved in their entirety, it will remain imperative for the assembled genomes of animals, plants and ecological communities to be maintained in publicly accessible form. But these are also rapidly growing in size and complexity because of the drop in sequencing costs and the growth of derivative technologies such as chromatin immunoprecipitation with sequencing (ChIP-seq [32]), DNA methylation sequencing [33] and chromatin interaction mapping [34]. These large datasets pose significant challenges for both the primary and secondary genome sequence repositories who must maintain the data, as well as the 'power users' who are accustomed to downloading the data to local computers for analysis. Reconsider the traditional genome informatics ecosystem of Figure 1. It is inefficient and wasteful in several ways. For the value-added genome integrators to do their magic with the data, they must download it from the archival databases across the internet and store copies in their local storage systems. The power users must do the same thing: either downloading the data directly from the archive, or downloading it from one of the integrators. This entails moving the same datasets across the network repeatedly and mirroring them in multiple local storage systems. When datasets are updated, each of the mirrors must detect that fact and refresh their copies. As datasets get larger, this process of mirroring and refreshing becomes increasingly cumbersome, error prone and expensive. A less obvious inefficiency comes from the need of the archives, integrators and power users to maintain local compute clusters to meet their analysis needs. NCBI, UCSC and the other genome data providers maintain large server farms that process genome data and serve it out via the web. The load on the server farm fluctuates hourly, daily and seasonally. At any time, a good portion of their clusters is sitting idle, waiting in reserve for periods of peak activity when a big new genome dataset comes in, or a major scientific meeting is getting close. However, even though much of the cluster is idle, it still consumes electricity and requires the care of a systems administration staff. Bioinformaticians and other computational biologists face similar problems. 
They can choose between building a cluster that is adequate to meet their everyday needs, or build one with the capacity to handle peak usage. In the former case, the researcher risks being unable to run an unusually involved analysis in reasonable running time and possibly being scooped by a competitor. In the latter case, they waste money purchasing and maintaining a system that they are not using to capacity much of the time. These inefficiencies have been tolerable in a world in which most genome-scale datasets have fit on a DVD (uncompressed, the human genome is about 3 gigabytes). When datasets are measured in terabytes these inefficiencies add up. Cloud computing to the rescue Which brings us, at last, to 'cloud computing.' This is a general term for computation-as-a-service. There are various different types of cloud computing, but the one that is closest to the way that computational biologists currently work depends on the concept of a 'virtual machine'. In the traditional economic model of computation, customers purchase server, storage and networking hardware, configure it the way they need, and run software on it. In computation-as-a-service, customers essentially rent the hardware and storage for as long or as short a time as they need to achieve their goals. Customers pay only for the time the rented systems are running and only for the storage they actually use. This model would be lunatic if the rented machines were physical ones. However, in cloud computing, the rentals are virtual: without ever touching a power cable, customers can power up a fully functional 10-computer server farm with a terabyte of shared storage, upgrade the cluster in minutes to 100 servers when needed for some heavy duty calculations, and then return to the baseline 10-server system when the extra virtual machines are no longer needed. The way it works is that a service provider puts up the capital expenditure of creating an extremely large compute and storage farm (tens of thousands of nodes and petabytes of storage) with all the frills needed to maintain an operation of this size, including a dedicated system administration staff, storage redundancy, data centers distributed to strategically placed parts of the world, and broadband network connectivity. The service provider then implements the infrastructure to give users the ability to create, upload and launch virtual machines on this compute farm. Because of economies of scale, the service provider can obtain highly discounted rates on hardware, electricity and network connectivity, and can pass these savings on to the end users to make virtual machine rental economically competitive with purchasing the real thing. A virtual machine is a piece of software running on the host computer (the real hardware) that emulates the properties of a computer: the emulator provides a virtual central processing unit (CPU), network card, hard disk, keyboard and so forth. You can run the operating system of your choice on the virtual machine, log into it remotely via the internet, configure it to run web servers, databases, load management software, parallel computation libraries, and any other software you favor. You may be familiar with virtual machines from working with consumer products such as VMware [35] or open source projects such as KVM [36]. 
A single physical machine can host multiple virtual machines, and software running on the physical server farm can distribute requests for new virtual machines across the server farm in a way that intelligently distributes load. The experience of working with virtual machines is relatively painless. Choose the physical aspects of the virtual machine you wish to make, including CPU type, memory size and hard disk capacity, specify the operating system you wish to run, and power up one or more machines. Within a couple of minutes, your virtual machines are up and running. Log into them over the network and get to work. When a virtual machine is not running, you can store an image of its bootable hard disk. You can then use this image as a template on which to start up multiple virtual machines, which is how you can launch a virtual compute cluster in a matter of minutes. For the field of genome informatics, a key feature of cloud computing is the ability of service providers and their customers to store large datasets in the cloud. These datasets typically take the form of virtual disk images that can be attached to virtual machines as local hard disks and/or shared as networked volumes. For example, the entire GenBank archive could be (and in fact is, see below) stored in the cloud as a disk image that can be loaded and unloaded as needed. Figure 3 shows what the genome informatics ecosystem might look like in a cloud computing environment. Here, instead of there being separate copies of genome datasets stored at diverse locations and groups copying the data to their local machines in order to work with them, most datasets are stored in the cloud as virtual disks and databases. Web services that run on top of these datasets, including both the primary archives and the value-added integrators, run as virtual machines within the cloud. Casual users, who are accustomed to accessing the data via the web pages at NCBI, DDBJ, Ensembl or UCSC, continue to work with the data in their accustomed way; the fact that these servers are now located inside the cloud is invisible to them. Figure 3 The 'new' genome informatics ecosystem based on cloud computing. In this model, the community's storage and compute resources are co-located in a 'cloud' maintained by a large service provider. The sequence archives and value-added integrators maintain servers and storage systems within the cloud, and use more or less capacity as needed for daily and seasonal fluctuations in usage. Casual users continue to access the data via the websites of the archives and integrators, but power users now have the option of creating virtual on-demand compute clusters within the cloud, which have direct access to the sequencing datasets. Power users can continue to download the data, but they now have an attractive alternative. Instead of moving the data to the compute cluster, they move the compute cluster to the data. Using the facilities provided by the service provider, they configure a virtual machine image that contains the software they wish to run, launch as many copies as they need, mount the disks and databases containing the public datasets they need, and do the analysis. When the job is complete, their virtual cluster sends them the results and then vanishes until it is needed again. Cloud computing also creates a new niche in the ecosystem for genome software developers to package their work in the form of virtual machines. 
Many genome annotation groups, for example, have developed pipelines for identifying and classifying genes and other functional elements. Although many of these pipelines are open source, packaging and distributing them for use by other groups has been challenging given their many software dependencies and site-specific configuration options. In a cloud computing environment these pipelines can be packaged into virtual machine images and stored in a way that lets anyone copy them, run them and customize them for their own needs, thus avoiding the software installation and configuration complexities.

But will it work?

Cloud computing is real. The earliest service provider to realize a practical cloud computing environment was Amazon, with its Elastic Compute Cloud (EC2) service [37], introduced in 2006. It supports a variety of Linux and Windows virtual machines, a virtual storage system, and mechanisms for managing internet protocol (IP) addresses. Amazon also provides a virtual private network service that allows organizations with their own compute resources to extend their local area network into Amazon's cloud, creating what is sometimes called a 'hybrid' cloud. Other service providers, notably Rackspace Cloud [38] and Flexiant [39], offer cloud services with similar overall functionality but many distinguishing differences of detail.

As of today, you can establish an account with Amazon Web Services or one of the other commercial vendors, launch a virtual machine instance from a wide variety of generic and bioinformatics-oriented images, and attach any one of several large public genome-oriented datasets. For virtual machine images, you can choose images prepopulated with Galaxy [40], a powerful web-based system for performing many common genome analysis tasks; Bioconductor [41], a programming environment integrated with the R statistics package [42]; GBrowse [43], a genome browser; BioPerl [44], a comprehensive set of bioinformatics modules written in the Perl programming language; JCVI Cloud BioLinux [45], a collection of bioinformatics tools that includes the Celera Assembler; and a variety of others. Several images that run specialized instances of the UCSC Genome Browser are under development [46].

In addition to these useful images, Amazon provides several large genomic datasets in its cloud. These include a complete copy of GenBank (200 gigabytes), the 30× coverage sequencing reads of a trio of individuals from the 1000 Genomes Project (700 gigabytes) and the genome databases from Ensembl, which include the annotated genomes of human and 50 other species (150 gigabytes of annotations plus 100 gigabytes of sequence). These datasets were contributed to Amazon's repository of public datasets by a variety of institutions and can be attached to virtual machine images for a nominal fee (a short sketch of discovering such hosted snapshots programmatically appears below).

There is also a growing number of academic compute cloud projects based on open source cloud management software such as Eucalyptus [47]. One such project is the Open Cloud Consortium [48], with participants from a group of American universities and industrial partners; another is the Cloud Computing University Initiative, an effort initiated by IBM and Google in partnership with a series of academic institutions [49], and supplemented by grants from the US National Science Foundation [50], for use by themselves and the community.
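As a brief aside on the hosted datasets mentioned above, publicly restorable snapshots can also be discovered programmatically. The boto3 call below is a hedged sketch; the '1000 Genomes' search string and the region are my assumptions, and real snapshot descriptions vary by provider and over time.

# A minimal sketch, assuming boto3 and Amazon EC2; the description filter is
# illustrative only.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

resp = ec2.describe_snapshots(
    RestorableByUserIds=["all"],   # snapshots shared publicly
    Filters=[{"Name": "description", "Values": ["*1000 Genomes*"]}],
)
for snap in resp["Snapshots"]:
    print(snap["SnapshotId"], f'{snap["VolumeSize"]} GiB', snap.get("Description", ""))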
Academic clouds such as these may in fact be a better long-term solution for genome informatics than commercial systems, because genome computing has requirements for high data read and write speeds that are quite different from typical business applications. Academic clouds will likely be able to tune their performance characteristics to the needs of scientific computing.

The economics of cloud computing

Is this change in the ecosystem really going to happen? There are some significant downsides to moving genomics into the cloud. An important one is the cost of migrating existing systems into an environment that is unlike what exists today. Both the genome databases and the value-added integrators will need to make significant changes to their standard operating procedures and their funding models as capital expenditures are shifted into recurrent costs; genomics power users will also need to adjust to the new paradigm. Another issue is how to handle potentially identifiable genetic data, such as that produced by whole-genome association studies or disease sequencing projects. These data are currently stored in restricted-access databases. To move such datasets into a public cloud operated by Amazon or another service provider, they will have to be encrypted before entering the cloud, and a layer of software developed that gives authorized users access to them. Such a system would be covered by a variety of privacy regulations and would take time to get right at both the technological and the legal level.

Then there is the money question. Does cloud computing make economic sense for genomics? It is difficult to draw blanket conclusions about the relative costs of renting versus buying computational services, but a good discussion of the issues can be found in a technical report on cloud computing published about a year ago by the UC Berkeley Reliable Adaptive Distributed Systems Laboratory [51]. The conclusion of that report is that when all the costs of running a data center are factored in, including hardware depreciation, electricity, cooling, network connectivity, service contracts and administrator salaries, renting a data center from Amazon is marginally more expensive than buying one. However, when the flexibility of the cloud to support a virtual data center that shrinks and grows as needed is factored in, the economics start to look downright good.

For genomics, the biggest obstacle to moving to the cloud may well be network bandwidth. A typical research institution will have network bandwidth of about a gigabit/second (roughly 125 megabytes/second). On a good day this will support sustained transfer rates of 5 to 10 megabytes/second across the internet. At that rate, a 100 gigabyte next-generation sequencing data file takes roughly three to six hours to transfer, and a terabyte-scale dataset takes one to two days. A 10 gigabit/second connection (1.25 gigabytes/second), which is typical for major universities and some of the larger research institutions, cuts these times substantially, but only at the cost of hogging much of the institution's bandwidth. A production center generating many such files each week would saturate even a fast link, so cloud services will clearly not be used for production sequencing any time soon. If cloud computing is to work for genomics, the service providers will have to offer some flexibility in how large datasets get into the system. For instance, they could accept external disks shipped by mail, the way that the Protein Data Bank [52] once accepted atomic structure submissions on tape and floppy disk.
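The bandwidth argument is easy to reproduce with back-of-the-envelope arithmetic. The short Python calculation below simply restates the figures quoted above (5 to 10 megabytes/second sustained, 100 gigabyte and terabyte-scale datasets); it is not part of the original article.

# Back-of-the-envelope transfer times for the bandwidth figures quoted above
# (decimal units: 1 GB = 1000 MB).
def transfer_hours(dataset_gb, rate_mb_per_s):
    return dataset_gb * 1000.0 / rate_mb_per_s / 3600.0

for size_gb, label in [(100, "100 GB sequencing file"), (1000, "1 TB dataset")]:
    for rate in (5, 10, 100):
        print(f"{label} at {rate} MB/s: {transfer_hours(size_gb, rate):.1f} h")
# 100 GB at 5-10 MB/s comes to roughly 3-6 hours; a terabyte takes one to two
# days, and even a sustained 100 MB/s still needs close to three hours per terabyte.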
Shipping physical media is not far-fetched: a now-defunct Google initiative called Google Research Datasets once planned to collect large scientific datasets by shipping 3-terabyte disk arrays around [53].

Sequencing costs are now falling faster than computing power is growing, and this reversal of the advantage that Moore's Law has long held over sequencing will have long-term consequences for the field of genome informatics. In my opinion the most likely outcome is to turn the current genome analysis paradigm on its head and force the software to come to the data rather than the other way around. Cloud computing is an attractive technology at this critical juncture.

                Author and article information

                Conference proceedings
                Journal: BMC Bioinformatics
                Publisher: BioMed Central
                ISSN: 1471-2105
                Published: 21 December 2010
                Volume 11, Supplement 12, Article S1
                Affiliations
                [1] Computational Biology and Bioinformatics Group, Pacific Northwest National Laboratory, Richland, Washington 99352, USA
                Article: 1471-2105-11-S12-S1
                DOI: 10.1186/1471-2105-11-S12-S1
                PMCID: PMC3040523
                PMID: 21210976
                Copyright ©2010 Taylor; licensee BioMed Central Ltd.

                This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

                The 11th Annual Bioinformatics Open Source Conference (BOSC) 2010
                Boston, MA, USA
                9–10 July 2010
                Categories
                Proceedings

                Bioinformatics & Computational biology
