Title

A Hybrid Spark MPI OpenACC System

Author

Waleed Al Shehri, Maher Khemakhem, Abdullah Basuhail and Fathy E. Eassa

Citation

Vol. 19, No. 5, pp. 81-86

Abstract

Apache Spark is a widely used big data platform built on the Resilient Distributed Dataset (RDD). This data structure abstraction handles large datasets by partitioning the data and computing on it in parallel across many nodes. Apache Spark also provides fault tolerance and interoperability with the Hadoop ecosystem. However, Apache Spark is written in high-level programming languages, which do not support the degree of parallelism offered by native parallel programming models such as Message Passing Interface (MPI) and OpenACC. Furthermore, the use of the Java Virtual Machine (JVM) in the Spark implementation negatively affects performance. On the other hand, tools such as MPI and OpenACC may not be well suited to the tremendous volume of big data while sustaining a high level of parallelism. The distributed architecture of big data platforms differs from the architecture of High Performance Computing (HPC) clusters, so big data applications running on HPC clusters cannot exploit the capabilities afforded by HPC. In this paper, a hybrid approach is proposed that takes the best of both worlds by combining Spark's handling of big data with the fast processing of MPI. In addition, the graphics processing units (GPUs) available in modern systems can further reduce the computation time of an application. Therefore, the hybrid Spark+MPI approach may be extended with OpenACC to make use of the GPU as well. To test the approach, the PageRank algorithm was implemented using all three methods: Spark, Spark+MPI and Spark+MPI+OpenACC.
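
For readers unfamiliar with the benchmark, the following is a minimal single-process sketch of the PageRank iteration in plain Python. It is not the authors' implementation and uses neither Spark, MPI nor OpenACC; the function name `pagerank`, the damping factor 0.85, the iteration count and the handling of pages without outgoing links are all assumptions chosen for illustration. The inner loop that distributes each page's rank to its targets is the kind of per-element work that a parallel implementation would spread across partitions, processes or GPU threads.

```python
def pagerank(links, damping=0.85, iterations=20):
    """Compute PageRank by power iteration.

    links: dict mapping each page to the list of pages it links to
           (hypothetical input format, assumed for this sketch).
    Returns a dict mapping each page to its rank; ranks sum to 1.
    """
    # Collect every page that appears as a source or as a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start from a uniform distribution.
    ranks = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Rank held by pages with no outgoing links is spread evenly
        # over all pages (one common convention, assumed here).
        dangling = sum(ranks[p] for p in pages if not links.get(p))
        base = (1.0 - damping) / n + damping * dangling / n
        new_ranks = {page: base for page in pages}

        # Each page sends an equal share of its rank to every target.
        for page, targets in links.items():
            if targets:
                share = damping * ranks[page] / len(targets)
                for target in targets:
                    new_ranks[target] += share

        ranks = new_ranks

    return ranks
```

Calling `pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})` returns one rank per page, with page `c` ranked above page `b` because it receives links from both `a` and `b`.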

Keywords

High-Performance Computing, Big Data, Spark, MPI, OpenACC, Hybrid Programming Model, Power Consumption

URL

http://paper.ijcsns.org/07_book/201905/20190511.pdf