      Is Open Access

      Data Engineering for HPC with Python

      Preprint

          Abstract

          Data engineering is becoming an increasingly important part of scientific discovery with the adoption of deep learning and machine learning. It deals with a variety of data formats, storage, data extraction, transformation, and data movement. One goal of data engineering is to transform raw data into the vector, matrix, or tensor formats accepted by deep learning and machine learning applications. Many structures, such as tables, graphs, and trees, can represent data in these data engineering phases. Among them, tables are a versatile and commonly used format for loading and processing data. In this paper, we present a distributed Python API based on a table abstraction for representing and processing data. Unlike existing state-of-the-art data engineering tools written purely in Python, our solution adopts high-performance compute kernels in C++, with an in-memory table representation and Cython-based Python bindings. In the core system, we use MPI for distributed-memory computation with a data-parallel approach for processing large datasets on HPC clusters.
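
          The data-parallel table processing described in the abstract can be illustrated with a minimal sketch. This is not the paper's API: the function names (`hash_partition`, `local_join`, `distributed_join`, `to_matrix`) are hypothetical, the workers are simulated with plain Python lists instead of MPI ranks, and rows are plain dictionaries rather than an in-memory columnar table. The sketch assumes a hash-partitioned inner join, one common way to run a join in a data-parallel fashion, followed by conversion of the joined table into a matrix of floats.

```python
from collections import defaultdict


def hash_partition(rows, key, num_workers):
    """Assign each row to a simulated worker by hashing its key column."""
    parts = [[] for _ in range(num_workers)]
    for row in rows:
        parts[hash(row[key]) % num_workers].append(row)
    return parts


def local_join(left, right, key):
    """Inner hash join of two lists of dict rows held by one worker."""
    index = defaultdict(list)
    for r in right:
        index[r[key]].append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def distributed_join(left, right, key, num_workers=4):
    """Partition both tables by key, then join each pair of partitions.

    Rows with equal keys land on the same worker, so every worker can
    join its partitions independently (here: sequentially, for clarity).
    """
    left_parts = hash_partition(left, key, num_workers)
    right_parts = hash_partition(right, key, num_workers)
    joined = []
    for lp, rp in zip(left_parts, right_parts):
        joined.extend(local_join(lp, rp, key))
    return joined


def to_matrix(rows, columns):
    """Convert selected columns of a table into a row-major list of floats."""
    return [[float(row[c]) for c in columns] for row in rows]


# Hypothetical example data, keyed by integer ids.
left = [{"id": 1, "x": 0.5}, {"id": 2, "x": 1.5}, {"id": 3, "x": 2.5}]
right = [{"id": 1, "y": 10}, {"id": 3, "y": 30}, {"id": 4, "y": 40}]

joined = distributed_join(left, right, "id", num_workers=2)
matrix = to_matrix(sorted(joined, key=lambda r: r["id"]), ["x", "y"])
print(matrix)
```

          In a real MPI setting, the partitioning step would correspond to an all-to-all exchange of rows between ranks, and each rank would run only its own local join. The result does not depend on the number of workers, only on which rows share a key.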

          Author and article information

          Journal
          13 October 2020
          Article
          2010.06312
          f575139b-75c1-4a26-ac27-7685c256c40f

          http://arxiv.org/licenses/nonexclusive-distrib/1.0/

          Custom metadata
          9 pages, 11 images. Accepted at the 9th Workshop on Python for High-Performance and Scientific Computing (in conjunction with Supercomputing 20)
          cs.DC cs.CY cs.PF cs.SE

          Software engineering, Applied computer science, Performance, Systems & Control, Networking & Internet architecture
