A Time-Energy Performance Analysis of MapReduce on Heterogeneous Systems with GPUs

Abstract

Motivated by the explosion of Big Data analytics, performance improvements of low-power (wimpy) systems and the increasing energy efficiency of GPUs, this paper presents a time-energy performance analysis of MapReduce on heterogeneous systems with GPUs. We evaluate the time and energy performance of three MapReduce applications with diverse resource demands on a Hadoop-CUDA framework. As executing these applications on heterogeneous systems with GPUs is challenging, we introduce a novel lazy processing technique which requires no modifications to the underlying Hadoop framework. To analyze the impact of heterogeneity, we compare heterogeneous CPU+GPU execution with homogeneous CPU-only execution across three systems with diverse characteristics: (i) a traditional high-performance (brawny) Intel i7 system hosting a discrete 640-core Nvidia GPU of the latest Maxwell generation, (ii) a wimpy platform consisting of a quad-core ARM Cortex-A9 hosting the same discrete Maxwell GPU, and (iii) a wimpy platform integrating four ARM Cortex-A15 cores and 192 Nvidia Kepler GPU cores on the same chip. These systems encompass both intra-node heterogeneity with discrete GPUs and intra-chip heterogeneity with integrated GPUs. Our measurement-based performance analysis highlights the following results. For compute-intensive workloads, the brawny heterogeneous system achieves speedups of up to 2.3 and reduces energy usage by almost half compared to the brawny homogeneous system. As expected, for applications where data transfers dominate the execution time, heterogeneous execution exhibits worse time-energy performance than homogeneous execution. For such applications, the heterogeneous wimpy A9 system with the discrete GPU uses around 14 times the energy of the homogeneous A9 system, due to both system resource imbalances and the high power overhead of the discrete GPU. However, among the heterogeneous systems, the wimpy A15 with integrated GPU uses the lowest energy across all workloads. This allows us to establish an execution-time equivalence ratio between a single brawny node and multiple wimpy nodes. Based on this equivalence ratio, the wimpy nodes exhibit energy savings of two-thirds while maintaining the same execution time. This result advocates the potential usage of heterogeneous wimpy systems with integrated GPUs for Big Data analytics.

Overview

This web page provides the source code, logs and details about the experimental setup for the article “A Time-Energy Performance Analysis of MapReduce on Heterogeneous Systems with GPUs”, which will be presented at IFIP WG 7.3 Performance 2015 and published in the Performance Evaluation (PEVA) journal.

The source code archive (download or on GitHub) contains the code for the three Hadoop MapReduce applications. For all other benchmarks and workloads, we specify their versions and provide download links.

For more information send an e-mail to dumitrel [at] comp [dot] nus [dot] edu [dot] sg.

Hadoop-CUDA with lazy processing

We first explain the MapReduce flow. Next, we show how lazy processing is integrated into Hadoop MapReduce.

MapReduce flow

As shown in the figure, the input files are split by Hadoop into chunks that are processed by Map tasks assigned to worker nodes. Each chunk consists of a number of input records, or <key,value> pairs. Hadoop applies the user-defined map() function to each of these pairs. The output <key,value> pairs are collected, shuffled and merged into lists based on their key. Each Reduce task processes a list of values sharing the same key and outputs the resulting <key,value> pairs. Each Map and Reduce task is processed by a single CPU core. At the end of each Map or Reduce task, Hadoop calls the user-defined close() function.

The typical structure of the user-defined map() function is:

map(Context ctx) {
   // get the input pair <key_in, val_in>
   key_in = ctx.getInputKey();
   val_in = ctx.getInputValue();
   // process <key_in, val_in> to obtain <key_out, val_out>
   // ...
   // emit the output pair
   ctx.emit(key_out, val_out);
}

Our lazy processing technique sends m <key,value> pairs to m CUDA threads to be processed in parallel. The map() function is modified as follows:

// counter of buffered records
static int i = 0;

map(Context ctx) {
   // get the input pair <key_in, val_in>
   key_in = ctx.getInputKey();
   val_in = ctx.getInputValue();
   // buffer the pair
   buffer[i++] = <key_in, val_in>;
   if (i == m) {
      // send the m buffered records to the GPU
      // process them in parallel by calling the CUDA kernel
      // get the output pairs and emit the results
      // ...
      // reset the counter
      i = 0;
   }
}

Since at the end of each Map task there may be some buffered records, we process them in the close() function:

close(Context ctx) {
   if (i > 0) {
      // send buffered records to GPU
      // process the records on GPU by calling the CUDA kernel
      // get the output pairs and emit the results 
   }
}
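
To make the GPU step concrete, below is a minimal CUDA sketch of what the commented lines above ("send buffered records to GPU", "process the records on GPU by calling the CUDA kernel") could look like. The record layout, the kernel name process_batch and the per-record computation are illustrative assumptions, not the actual kernels of the three workloads.

// Minimal sketch of the GPU step; Record, process_batch and the dummy
// computation are placeholders, not the actual workload kernels.
#include <cuda_runtime.h>

struct Record { float key; float value; };   // hypothetical record layout

// one CUDA thread processes one buffered <key,value> pair
__global__ void process_batch(const Record* in, Record* out, int n) {
   int idx = blockIdx.x * blockDim.x + threadIdx.x;
   if (idx < n) {
      out[idx].key   = in[idx].key;
      out[idx].value = 2.0f * in[idx].value;   // dummy map() computation
   }
}

// called from map() when the buffer is full (n == m) or from close() (n < m)
void process_on_gpu(const Record* host_in, Record* host_out, int n) {
   Record *dev_in, *dev_out;
   cudaMalloc(&dev_in,  n * sizeof(Record));
   cudaMalloc(&dev_out, n * sizeof(Record));
   cudaMemcpy(dev_in, host_in, n * sizeof(Record), cudaMemcpyHostToDevice);

   int threads = 256;
   int blocks  = (n + threads - 1) / threads;
   process_batch<<<blocks, threads>>>(dev_in, dev_out, n);

   cudaMemcpy(host_out, dev_out, n * sizeof(Record), cudaMemcpyDeviceToHost);
   cudaFree(dev_in);
   cudaFree(dev_out);
   // the caller then emits the n output pairs via ctx.emit(...)
}

In the actual workloads, the kernel body would contain the per-record computation of BlackScholes, Kmeans or Grep.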

Experimental setup

Systems

Specs | Brawny (i7) | Wimpy (A9 - Kayla) | Wimpy (A15 - Jetson TK1)
CPU: Type | Intel Core i7-2600 | NVIDIA Tegra 3 (ARM Cortex-A9) | NVIDIA Tegra K1 (ARM Cortex-A15)
CPU: ISA | x86-64 | ARMv7 | ARMv7
CPU: Cores | 4 (8 threads) | 4 | 4
CPU: Frequency | 1.60 - 3.40 GHz | 0.05 - 1.40 GHz | 0.05 - 2.32 GHz
CPU: Cache | 32kB L1, 256kB L2, 8MB L3 | 32kB L1, 1MB L2 | 32kB L1, 2MB L2
CPU: Memory | 16 GB DDR3 | 2 GB LPDDR2 | 2 GB LPDDR3
GPU: Type | NVIDIA GTX 750 Ti (Maxwell architecture) | NVIDIA GTX 750 Ti (Maxwell architecture) | NVIDIA GK20A
GPU: Cores | 640 | 640 | 192
GPU: Memory | 2 GB GDDR5 | 2 GB GDDR5 | 2 GB (shared)
Storage: Device | 512GB SSD (Crucial M4) | 512GB SSD (Crucial M4) | 512GB SSD (Crucial M4)
Storage: Port | SATA 3.0 | SATA 2.0 | SATA 3.0
Networking | Gigabit Ethernet | Gigabit Ethernet | Gigabit Ethernet
Software: OS | Ubuntu 13.04 | Ubuntu 12.04 | Ubuntu 14.04
Software: OS kernel | Linux 3.11.0 | Linux 3.1.10-carma | Linux 3.10.40
Software: Compiler | gcc 4.8.1 | gcc 4.6.3 | gcc 4.8.2
Software: Java | jdk1.8.0 | jdk1.8.0 | jdk1.8.0
Software: NVIDIA driver | 340.32 | 340.24 | L4T R21.2
Software: CUDA toolkit | 6.5 | 6.5 | 6.5

Systems setup

Photo: Kayla wimpy system with power monitor

Benchmarks

Tool | Version on brawny (i7) | Version on wimpy A9 (Kayla) | Version on wimpy A15 (Jetson TK1) | Comment | Download
CoreMark | 1.01 | 1.01 | 1.01 | CoreMark benchmark | Download CoreMark
pmbw | 0.6.2 | 0.6.2 | 0.6.2 | parallel memory bandwidth | Download pmbw
SHOC | commit 90e5734864ca61ccb541b0e894bec0242ac2c0b2 | same commit | same commit | MaxFlops and BusSpeedDownload | Download SHOC
dd | 8.20 (coreutils) | 8.13 (coreutils) | 8.21 (coreutils) | storage throughput | -

CoreMark

Compilation flags for i7:

 gcc -O3 -m64 -march=native -ffast-math -funroll-loops 

Compilation flags for A9 (Kayla):

 gcc -O3 -marm -mcpu=cortex-a9 -mtune=cortex-a9 -mfpu=neon 
 -ftree-vectorize -mfloat-abi=hard -ffast-math -faggressive-loop-optimizations

Compilation flags for A15 (Jetson TK1):

 gcc -O3 -marm -mcpu=cortex-a15 -mtune=cortex-a15 -mfpu=neon 
 -ftree-vectorize -mfloat-abi=hard -ffast-math -faggressive-loop-optimizations

Run command:

 ./coremark.exe  0x0 0x0 0x66 200000 7 1 2000

SHOC

To measure FLOPS and CPU-GPU bandwidth, we use the MaxFlops and BusSpeedDownload benchmarks, which are located under:

 src/cuda/level0

To build SHOC for the NVIDIA GTX 750 Ti, we run the following commands:

 export PATH=$PATH:$CUDA_DIR/bin
 export CUDA_CPPFLAGS="-gencode=arch=compute_50,code=sm_50"
 ./configure --with-cuda --without-mpi
 make

To build SHOC for the Jetson TK1 board, only the CUDA flags change:

 export CUDA_CPPFLAGS="-gencode=arch=compute_32,code=sm_32"

Results

Subsystem | Benchmark | Measure | Brawny (i7) | Wimpy (A9 - Kayla) | Wimpy (A15 - Jetson TK1)
CPU | CoreMark | Performance per core [iterations/s] | 17237.4 | 3952.2 | 8155.0
CPU | CoreMark | System peak power [W] | 51.7 | 10.1 | 5.9
CPU | CoreMark | System idle power [W] | 25.8 | 7.7 | 3.2
Memory | pmbw | Bandwidth [GB/s] | 17.6 | 0.9 | 4.3
GPU | SHOC | Performance [GFLOPS] | 1514.3 | 1511.9 | 209.0
GPU | SHOC | System peak power [W] | 105.2 | 44.0 | 6.1
GPU | SHOC | System idle power [W] | 40.6 | 19.2 | 3.2
Storage | dd | Write throughput [MB/s] | 198 | 90 | 161
Storage | dd | Read throughput [MB/s] | 284 | 85 | 275
Storage | dd | Buffered read throughput [GB/s] | 10 | 0.44 | 1.6

Workloads

The MapReduce workloads are implemented in C++/CUDA and run on Hadoop 1.2.1 using the Pipes mechanism. The workloads' source code can be found in the provided archive under hadoop-1.2.1/src/examples/pipes. Common headers and the Makefile are under hadoop-1.2.1/src/examples/pipes/common.
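
For readers unfamiliar with Hadoop Pipes, the sketch below shows the generic structure of a Pipes C++ binary (mapper, reducer and the runTask() entry point) using the standard Pipes API headers. The class names ExampleMapper and ExampleReducer and the pass-through logic are placeholders, not the code of the three workloads.

#include <string>
#include "hadoop/Pipes.hh"
#include "hadoop/TemplateFactory.hh"

class ExampleMapper : public HadoopPipes::Mapper {
public:
   ExampleMapper(HadoopPipes::TaskContext& context) {}
   void map(HadoopPipes::MapContext& ctx) {
      // one <key,value> input record per call
      std::string key_in = ctx.getInputKey();
      std::string val_in = ctx.getInputValue();
      ctx.emit(key_in, val_in);   // per-record processing goes here
   }
   void close() {
      // process any records still buffered at the end of the Map task
   }
};

class ExampleReducer : public HadoopPipes::Reducer {
public:
   ExampleReducer(HadoopPipes::TaskContext& context) {}
   void reduce(HadoopPipes::ReduceContext& ctx) {
      // iterate over all values sharing the same key
      while (ctx.nextValue()) {
         ctx.emit(ctx.getInputKey(), ctx.getInputValue());
      }
   }
};

int main(int argc, char* argv[]) {
   return HadoopPipes::runTask(
      HadoopPipes::TemplateFactory<ExampleMapper, ExampleReducer>());
}

In our setup, binaries of this form are compiled into the <bench>-cpu and <bench>-gpu executables expected by the launch script described below.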

Workload | Input source | Input size | Size on disk [GB] | Details | Download
BlackScholes (BS) | PARSEC 3.0 [1] BlackScholes input generator | S | 0.8 | 12,000,000 options | download
BlackScholes (BS) | PARSEC 3.0 [1] BlackScholes input generator | M | 8.0 | 120,000,000 options | download
BlackScholes (BS) | PARSEC 3.0 [1] BlackScholes input generator | L | 24.2 | 360,000,000 options | 3x M dataset
Kmeans (KM) | Mars [2] Kmeans input generator | S | 0.3 | n=3,474,500, m=34, k=5 | download
Kmeans (KM) | Mars [2] Kmeans input generator | M | 7.7 | n=83,388,000, m=34, k=5 | 4x download
Kmeans (KM) | Mars [2] Kmeans input generator | L | 19.3 | n=208,470,000, m=34, k=5 | 10x M dataset
Grep (GR) | Wikipedia articles (truncated) | S | 0.6 | 7,800,963 lines | download
Grep (GR) | Wikipedia articles (truncated) | M | 11.1 | 166,656,938 lines | download
Grep (GR) | Wikipedia articles (truncated) | L | 22.3 | 368,789,935 lines | download

Reference outputs for all MapReduce workloads can be downloaded from here.

The Bash script launch-hadoop-pipes.sh can be used to run the MapReduce workloads on Hadoop. The script can be found in the source code archive under the scripts folder. It expects the following HDFS folder structure:

bin/
   bs-cpu
   bs-gpu
   ...
bs-in-s/
bs-in-m/
bs-in-l/
...

where bin is the folder containing all the executables, named using the <bench>-[cpu|gpu] convention, and the input files are stored under the <bench>-in-<size> folders.

References

[1] C. Bienia, S. Kumar, J. P. Singh, K. Li, The PARSEC Benchmark Suite: Characterization and Architectural Implications, Proc. of 17th International Conference on Parallel Architectures and Compilation Techniques, pp. 72–81, 2008.

[2] B. He, W. Fang, Q. Luo, N. K. Govindaraju, T. Wang, Mars: A MapReduce Framework on Graphics Processors, Proc. of 17th International Conference on Parallel Architectures and Compilation Techniques, pp. 260–269, 2008.