Big Data on Small Nodes

Abstract

The continuous increase in volume, variety and velocity of Big Data exposes datacenter resource scaling to an energy utilization problem. Traditionally, datacenters employ x86-64 (big) server nodes with power usage of tens to hundreds of Watts. But lately, low-power (small) systems originally developed for mobile devices have seen significant improvements in performance. These improvements could lead to the adoption of such small systems in servers, as announced by major industry players. In this context, we systematically conduct a performance study of Big Data execution on small nodes in comparison with traditional big nodes, and present insights that would be useful for future development. We run Hadoop MapReduce, MySQL and in-memory Shark workloads on clusters of ARM big.LITTLE boards and Intel Xeon server systems. We evaluate execution time, energy usage and total cost of running the workloads on self-hosted ARM and Xeon nodes. Our study shows that there is no one-size-fits-all rule for judging the efficiency of executing Big Data workloads on small and big nodes. But small memory size, low memory and I/O bandwidths, and software immaturity combine to cancel the lower-power advantage of ARM servers. We show that I/O-intensive MapReduce workloads are more energy-efficient to run on Xeon nodes. In contrast, database query processing is always more energy-efficient on ARM servers, at the cost of slightly lower throughput. With minor software modifications, CPU-intensive MapReduce workloads are almost four times cheaper to execute on ARM servers.

Overview

This web page provides the source code, logs and details about the experimental setup for the “Big Data on Small Nodes” project.

The source code archive ( download) contains the code for the Dhrystone benchmark, the Java micro-benchmark, the storage benchmark and the MapReduce Pi estimator in C++. For all other benchmarks and workloads, we use the default versions from the selected frameworks. We provide download links for these frameworks.

The logs archive ( download) contains all the experiment log files for static system characterization, Hadoop MapReduce, query processing and the TCO analysis.

For more information send an e-mail to dumitrel [at] comp [.] nus [.] edu [.] sg

System characterization

Systems characteristics

System Xeon Odroid XU (LITTLE) Odroid XU (big)
CPU Xeon E5-2603 Cortex-A7 Cortex-A15
ISA x86_64 ARMv7l ARMv7l
Cores 4 4 4
Frequency [GHz] 1.20-1.80 0.25-0.60 0.60-1.60
L1 Data Cache 128 kB 32 kB 32 kB
L2 Cache 1 MB 2 MB 2 MB
L3 Cache 10 MB - -
Memory 8 GB 2 GB LPDDR 2 GB LPDDR
Storage 1 TB HDD 64 GB eMMC 64 GB eMMC
Network 1 Gbit Ethernet 1 Gbit Ethernet (USB 3.0) 1 Gbit Ethernet (USB 3.0)
OS Kernel Linux 3.8.0-35-generic Linux 3.4.67 Linux 3.4.67
OS Ubuntu 13.04 Ubuntu 13.10 Ubuntu 13.10
C/C++ compiler gcc 4.7.3 gcc 4.7.3 (arm-linux-gnueabihf) gcc 4.7.3 (arm-linux-gnueabihf)
Java jdk1.7.0_54 jdk1.7.0_54 jdk1.7.0_54
dstat 0.7.2 0.7.2 0.7.2

Download Ubuntu 13.10 image for Odroid XU

Cluster Setup

Hadoop cluster under test

The experiments are executed on clusters of up to six Xeon and Odroid XU systems. Cluster nodes are connected through a Gigabit Ethernet switch. Power and energy are measured using a Yokogawa WT210 power meter connected to a controller system through a serial interface. All the logs are collected by the controller system.

[Photos: ARM big.LITTLE cluster; ARM big.LITTLE cluster with power meter]
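
To obtain energy, the power samples logged by the meter are integrated over time. The sketch below is only a hypothetical illustration of that computation: it assumes a CSV log with one "timestamp_seconds,power_watts" sample per line, which is an assumption and not necessarily the format of the logs in the archive.

 // Hypothetical example: integrate sampled power (Watts) into energy (Joules).
 // Assumes CSV lines "timestamp_seconds,power_watts"; the real log format may differ.
 import java.io.BufferedReader;
 import java.io.FileReader;
 import java.io.IOException;

 public class EnergyFromLog {
     public static void main(String[] args) throws IOException {
         double energyJoules = 0.0;
         double prevTime = Double.NaN, prevPower = Double.NaN;
         try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
             String line;
             while ((line = in.readLine()) != null) {
                 String[] fields = line.split(",");
                 double time = Double.parseDouble(fields[0]);   // seconds
                 double power = Double.parseDouble(fields[1]);  // Watts
                 if (!Double.isNaN(prevTime)) {
                     // trapezoidal rule between consecutive samples
                     energyJoules += 0.5 * (power + prevPower) * (time - prevTime);
                 }
                 prevTime = time;
                 prevPower = power;
             }
         }
         System.out.printf("Energy: %.1f J (%.6f kWh)%n", energyJoules, energyJoules / 3.6e6);
     }
 }

The trapezoidal rule keeps the estimate reasonable even if the sampling interval is not perfectly constant.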

Benchmarks

Tool Version (Xeon) Version (Odroid XU) Comment Command
Dhrystone 2.1 2.1 Dhrystone benchmark dhrystone
CoreMark 1.01 1.01 CoreMark benchmark coremark.exe
pmbw 0.6.2 0.6.2 parallel memory bandwidth pmbw
dd 8.20 (coreutils) 8.20 (coreutils) storage throughput dd --version
ioping 0.7 0.7 storage latency ioping -v
iperf 2.0.5 2.0.5 TCP/UDP bandwidth iperf -v
ping iputils-sss20101006 iputils-s20121221 network latency ping -V

Dhrystone

Download Dhrystone

We had to adapt the Dhrystone source code to compile and run on modern systems. The modified code, a Makefile and a run script are provided in the source code archive.

CoreMark

Download CoreMark.

Compilation flags for ARM:

 gcc -O3 -marm -mcpu=cortex-a15 -mtune=cortex-a15 -mfpu=neon 
 -ftree-vectorize -mfloat-abi=hard -ffast-math -faggressive-loop-optimizations

Compilation flags for Xeon:

 gcc -O3 -m64 -march=native -ffast-math -funroll-loops   

Run command:

 ./coremark.exe  0x0 0x0 0x66 200000 7 1 2000

Java microbenchmark

Since Hadoop runs on a Java virtual machine, we benchmark Java execution by running a small program that stresses the CPU's pipeline. The program consists of a loop that performs integer and floating-point operations. We count each operation as one instruction and compute the MIPS as the number of iterations multiplied by the number of operations per iteration, divided by one million times the execution time in seconds. The time is measured only during loop execution.

The Java micro-benchmark, a Makefile and a run script are provided in the source code archive.
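
For illustration only, a minimal sketch of such a loop is shown below. It is not the code from the archive; the iteration count, the constants and the choice of four operations per iteration are arbitrary assumptions.

 // Illustrative sketch of the Java micro-benchmark idea (not the provided code).
 public class PipelineBench {
     public static void main(String[] args) {
         final long iterations = 200_000_000L;
         final int opsPerIteration = 4;         // two integer and two floating-point operations
         long intAcc = 1;
         double fpAcc = 1.0;
         long start = System.nanoTime();        // time only the loop
         for (long i = 0; i < iterations; i++) {
             intAcc += i;                       // integer add
             intAcc *= 31;                      // integer multiply
             fpAcc += 0.25;                     // floating-point add
             fpAcc *= 1.0000001;                // floating-point multiply
         }
         double seconds = (System.nanoTime() - start) / 1e9;
         double mips = (iterations * opsPerIteration) / (1e6 * seconds);
         // printing the accumulators prevents the JIT from removing the loop body
         System.out.println("MIPS = " + mips + " (checksum: " + intAcc + ", " + fpAcc + ")");
     }
 }

In practice, a warm-up run before the timed loop helps exclude JIT compilation overhead from the measurement.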

Memory Bandwidth

To measure memory bandwidth, we use the pmbw tool (Download pmbw).

Run command:

 ./pmbw -P <cores> 

Storage

dd is a versatile tool that performs read and write operations and reports their throughput. We write and read zeroes with a block size of 1 MB and a total size (1 GB) that fits in main memory: the first command measures write throughput, the read issued after dropping the caches measures storage read throughput, and the repeated read is served from the page cache and measures buffered read throughput. We use the following commands:

 # dd if=/dev/zero of=tmp bs=1M count=1024 conv=fdatasync,notrunc
 # echo 1 > /proc/sys/vm/drop_caches
 # dd if=tmp of=/dev/null bs=1M count=1024
 # dd if=tmp of=/dev/null bs=1M count=1024

The ioping tool reports the latency of the storage system for read/write operations. We run ioping ten times and report the average:

 # ioping -D . -c 10

(-D uses direct rather than cached I/O; the second run uses -W to measure write latency)

 # ioping -W . -c 10

Network

iperf is a client/server network benchmarking tool. First, start the iperf server on one system:

 # iperf -s

then start the client on a different system:

 # iperf -c <server_ip_addr>

For UDP, start server using -u flag:

 # iperf -u -s   

then start the client and provide an upper bound for the bandwidth:

 # iperf -u -c <server_ip_addr> -b 1Gbits/sec

If this upper bound is not provided, iperf will report a much lower value for UDP bandwidth.

For network latency, we ping another system in the same cluster. We run ping ten times and report the average:

 # ping <ip_addr> -c 10

Hadoop

Hadoop 1.2.1 ( download)

Java JDK 1.7.0_45 ( download)

Workloads

Name Input source Input size Benchmark origin
TestDFSIO synthetic (TestDFSIO) 12 GB Hadoop examples
Pi synthetic (Pi) 16 Gsamples Hadoop examples
Pi C++ synthetic (Pi) 16 Gsamples Ported from Java
Terasort synthetic (Teragen) 12 GB Hadoop examples
Wordcount Wikipedia articles 12 GB Hadoop examples
Grep Wikipedia articles 12 GB Hadoop examples
Kmeans PUMA dataset 4 GB PUMA benchmark

Datasets

Input data is either synthetic or taken from well-known, publicly available datasets.

Wikipedia Latest Articles Dump

Wikipedia Articles 12GB

Kmeans PUMA 30GB

Kmeans 4GB

Code

The Pi estimator in C++ is available in the provided source code archive (a standalone sketch of the underlying Monte Carlo idea is shown after this list).

Kmeans was adapted from the PUMA benchmark suite.

All the other benchmarks are from Hadoop 1.2.1 examples.
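
For background, the Pi estimator is a Monte Carlo computation: it samples points in the unit square and counts how many fall inside the quarter circle, so that pi is approximately 4 x inside/total. The standalone Java sketch below only illustrates this idea; the Hadoop example (and our C++ port) distributes the sampling across map tasks and uses a quasi-random sequence rather than a pseudo-random generator.

 // Standalone Monte Carlo sketch of the Pi estimation idea; the MapReduce
 // versions split the sampling across map tasks and sum the counts in reduce.
 import java.util.Random;

 public class PiSketch {
     public static void main(String[] args) {
         long samples = 10_000_000L;
         long inside = 0;
         Random rnd = new Random(42);
         for (long i = 0; i < samples; i++) {
             double x = rnd.nextDouble();
             double y = rnd.nextDouble();
             if (x * x + y * y <= 1.0) {
                 inside++;                      // point falls inside the quarter circle
             }
         }
         System.out.println("Pi is roughly " + 4.0 * inside / samples);
     }
 }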

Query processing

MySQL

We run TPC-C and TPC-H benchmarks on top of MySQL version 5.6.16 ( download).

For the TPC-C benchmark, we use the tool provided by Percona ( download).

For the TPC-H benchmark, we use the official TPC-H dbgen 2.16.1 provided by the TPC ( download).

Benchmark Size
TPC-C 12 GB (130 warehouses)
TPC-H 2 GB

Shark

We run scan and join queries on Shark 0.9.1 ( download).

Workloads

Workload Input type Input Size
Scan Ranking 21 GB
Join Ranking/UserVisit 43 MB / 1.3 GB
Join Ranking/UserVisit 86 MB / 2.5 GB

Scan query:

 SELECT pageURL, pageRank 
 FROM rankings WHERE pageRank > X

Join query:

 SELECT sourceIP, totalRevenue, avgPageRank FROM
   (SELECT sourceIP,
     AVG(pageRank) as avgPageRank, 
     SUM(adRevenue) as totalRevenue
   FROM Rankings AS R, UserVisits AS UV 
   WHERE R.pageURL = UV.destURL
     AND UV.visitDate BETWEEN Date('1980-01-01') 
      AND Date('X')
   GROUP BY UV.sourceIP)
 ORDER BY totalRevenue DESC LIMIT 1 

We use the datasets provided by the AMPLab Big Data Benchmark and modify them to match our setting. The modified dataset can be downloaded from here.

TCO

All the logs and the spreadsheet with TCO model formulae can be found in the logs archive.

For MaxTCO we use the Google TCO model described in the book “The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines” (second edition) and available as a spreadsheet at http://goo.gl/eb6Ui.
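
As a rough illustration only, and not the formulae from our spreadsheet or from the Google model, a yearly cluster TCO can be approximated as amortized hardware cost plus electricity cost. All figures in the sketch below are made-up placeholders rather than measured values.

 // Hypothetical, simplified TCO illustration; not the model from the spreadsheet.
 public class SimpleTco {
     public static void main(String[] args) {
         // Placeholder inputs (not measured values)
         double nodePriceUsd = 1000.0;          // purchase price per node
         int nodes = 6;                         // cluster size
         double lifetimeYears = 3.0;            // amortization period
         double avgNodePowerWatts = 100.0;      // average power draw per node
         double electricityUsdPerKwh = 0.10;    // electricity price
         double hoursPerYear = 24 * 365.0;

         double capexPerYear = nodePriceUsd * nodes / lifetimeYears;
         double energyKwhPerYear = avgNodePowerWatts * nodes / 1000.0 * hoursPerYear;
         double opexPerYear = energyKwhPerYear * electricityUsdPerKwh;

         System.out.printf("Yearly TCO: %.2f USD (amortized capex %.2f + energy %.2f)%n",
                 capexPerYear + opexPerYear, capexPerYear, opexPerYear);
     }
 }

The full Google model additionally accounts for datacenter construction, power and cooling infrastructure, and maintenance costs.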