31
views
0
recommends
+1 Recommend
1 collections
    0
    shares
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      Performance optimization algorithm for MapReduce based on obtaining frequent keys

      research-article

      Read this article at

      ScienceOpenPublisher
      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          MapReduce is getting much attention in academia and industry for use in cloud computing to quickly deal with huge amounts of data. However, when MapReduce deals with text-centric applications, the algorithm generates is large amount of duplicate data in the intermediate results that increases the run time. A frequency buffering (FB) algorithm was used to add a Hash table before the ring memory to store frequent keys in a Hash table. However, since the algorithm is implemented by sampling, the algorithm may not accurately estimate the overhead and the frequent keys. Therefore, this study added a performance optimization algorithm to MapReduce to obtain the frequent keys by adding a counting Bloom filter (CBF) and a Hash table to dynamically filter the frequent keys before storing them in the ring memory. This algorithm more accurately identifies the frequent keys and greatly reduces the data sorting overhead and the disk I/O. Tests show that this performance optimization algorithm for MapReduce for obtaining the frequent keys significantly improves the execution speed by 17.04% compared to the original MapReduce and 9.31% higher than the frequency buffering algorithm.

          Abstract

          摘要 在云计算技术领域中, MapReduce能够帮助人们快速处理海量数据, 因此在学术界以及工业界越来越受到重视。但是MapReduce在处理以文本为中心的应用时, 中间结果中数据重复较多。针对该情况, 已有的高频率缓冲 (frequency buffering, FB) 算法提出在环形内存缓冲之前添加哈希表, 并将高频率键存储在哈希表中。该算法通过采样来实现, 有额外开销并且统计出的高频率键并不一定准确。该文提出一种基于动态获取高频率键的MapReduce性能优化算法, 通过在环形内存缓冲之前增加计数Bloom过滤器 (counting Bloom filter, CBF) 和哈希表, 将高频率键动态地存储在哈希表中。该算法获得的高频率键更准确, 同时大大减少了数据排序和磁盘I/O的开销。实际测试结果表明:该算法明显提高了作业的执行速度, 比原始MapReduce提高17.04%, 比FB算法提高9.31%。

          Author and article information

          Journal
          J Tsinghua Univ (Sci & Technol)
          Journal of Tsinghua University (Science and Technology)
          Tsinghua University Press
          1000-0054
          15 December 2018
          13 December 2018
          : 58
          : 12
          : 1059-1065
          Affiliations
          [1] 1Department of Computer Science and Technology, University of Science and Technology Beijing, Beijing 100083, China
          [2] 2State Grid Zhangjiakou Power Supply Company & Telecommunication Branch, Zhangjiakou 075000, China
          [3] 3Department of Computer and Information Sciences, Temple University, Philadelphia 19122, USA
          Author notes
          *Corresponding author: ZHANG Kai, E-mail: zhchhz@ 123456163.com
          Article
          j.cnki.qhdxxb.2018.21.022
          10.16511/j.cnki.qhdxxb.2018.21.022
          c10f26c7-bcd0-4c17-95ae-b98795c6d80f
          Copyright © Journal of Tsinghua University

          This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 Unported License (CC BY-NC 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. See https://creativecommons.org/licenses/by-nc/4.0/.

          History
          : 19 June 2018

          Software engineering,Data structures & Algorithms,Applied computer science,Computer science,Artificial intelligence,Hardware architecture
          performance optimization,frequent key,MapReduce,counting Bloom filter

          Comments

          Comment on this article