The continuous increase in volume, variety and velocity of Big Data exposes datacenter resource scaling to an energy utilization problem. Traditionally, datacenters employ x86-64 (big) server nodes with power usage of tens to hundreds of Watts. Lately, however, low-power (small) systems originally developed for mobile devices have seen significant performance improvements. These improvements could lead to the adoption of such small systems in servers, as announced by major industry players. In this context, we systematically conduct a performance study of Big Data execution on small nodes in comparison with traditional big nodes, and present insights that would be useful for future development. We run Hadoop MapReduce, MySQL and in-memory Shark workloads on clusters of ARM big.LITTLE boards and Intel Xeon server systems. We evaluate execution time, energy usage and total cost of running the workloads on self-hosted ARM and Xeon nodes. Our study shows that there is no one-size-fits-all rule for judging the efficiency of executing Big Data workloads on small and big nodes. However, small memory size, low memory and I/O bandwidths, and software immaturity combine to cancel the low-power advantage of ARM servers. We show that I/O-intensive MapReduce workloads are more energy-efficient to run on Xeon nodes. In contrast, database query processing is always more energy-efficient on ARM servers, at the cost of slightly lower throughput. With minor software modifications, CPU-intensive MapReduce workloads are almost four times cheaper to execute on ARM servers.
This web page provides the source code, logs and details of the experimental setup for the “Big Data on Small Nodes” project.
The source code archive (download) contains the code for the Dhrystone benchmark, the Java micro-benchmark, the storage benchmark and the MapReduce Pi estimator in C++. For all the other benchmarks and workloads, we use the default versions from the selected frameworks; we provide download links for these frameworks.
The logs archive (download) contains all the experiment log files for static system characterization, Hadoop MapReduce, query processing and the TCO analysis.
For more information send an e-mail to dumitrel [at] comp [.] nus [.] edu [.] sg
| System | Xeon | Odroid XU (Cortex-A7) | Odroid XU (Cortex-A15) |
|---|---|---|---|
| Frequency [GHz] | 1.20 - 1.80 | 0.25 - 0.60 | 0.60 - 1.60 |
| L1 data cache | 128 kB | 32 kB | 32 kB |
| L2 cache | 1 MB | 2 MB | 2 MB |
| L3 cache | 10 MB | - | - |
| Memory | 8 GB | 2 GB LPDDR | 2 GB LPDDR |
| Storage | 1 TB HDD | 64 GB eMMC | 64 GB eMMC |
| Network | 1 Gbit Ethernet | 1 Gbit Ethernet (USB 3.0) | 1 Gbit Ethernet (USB 3.0) |
| OS kernel | Linux 3.8.0-35-generic | Linux 3.4.67 | Linux 3.4.67 |
| OS | Ubuntu 13.04 | Ubuntu 13.10 | Ubuntu 13.10 |
| C/C++ compiler | gcc 4.7.3 | gcc 4.7.3 (arm-linux-gnueabihf) | gcc 4.7.3 (arm-linux-gnueabihf) |
The experiments are executed on clusters of up to six Xeon and Odroid XU systems. Cluster nodes are connected through a Gigabit Ethernet switch. Power and energy are measured using a Yokogawa 210 power meter connected to a controller system through a serial interface. All the logs are collected by the controller system.
| Tool | Version (Xeon) | Version (ARM) | Measures | Version command |
|---|---|---|---|---|
| pmbw | 0.6.2 | 0.6.2 | parallel memory bandwidth | pmbw |
| dd | 8.20 (coreutils) | 8.20 (coreutils) | storage throughput | dd --version |
| ioping | 0.7 | 0.7 | storage latency | ioping -v |
| iperf | 2.0.5 | 2.0.5 | TCP/UDP bandwidth | iperf -v |
| ping | iputils-sss20101006 | iputils-s20121221 | network latency | ping -V |
We had to adapt the Dhrystone source code to compile and run on modern systems. The modified code, Makefile and run script are provided in the source code archive.
Compilation flags for ARM:
gcc -O3 -marm -mcpu=cortex-a15 -mtune=cortex-a15 -mfpu=neon -ftree-vectorize -mfloat-abi=hard -ffast-math -faggressive-loop-optimizations
Compilation flags for Xeon:
gcc -O3 -m64 -march=native -ffast-math -funroll-loops
We run CoreMark with the following command:
./coremark.exe 0x0 0x0 0x66 200000 7 1 2000
Since Hadoop runs on a Java virtual machine, we benchmark Java execution by running a small program that stresses the CPU's pipeline. The program consists of a loop that performs integer and floating-point operations. We count each operation as one instruction and compute MIPS as (number of iterations × operations per iteration) / (10^6 × time in seconds). Time is measured only during loop execution.
Java micro-benchmark, a Makefile and run script are provided in the source code archive.
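A minimal sketch of such a loop-based MIPS measurement is shown below; the class name, operation mix and iteration count here are illustrative assumptions, not the archived benchmark:

```java
public class MipsBench {
    // Hypothetical sketch of the loop-based MIPS measurement described above;
    // the archived micro-benchmark may use a different operation mix.
    static final int OPS_PER_ITERATION = 4; // the four arithmetic ops in the loop body

    static double run(long iterations) {
        long intAcc = 1;
        double fpAcc = 1.0;
        long start = System.nanoTime(); // time only the loop, as described above
        for (long i = 0; i < iterations; i++) {
            intAcc += 3;       // integer add
            intAcc *= 7;       // integer multiply
            fpAcc += 0.5;      // floating-point add
            fpAcc *= 1.000001; // floating-point multiply
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        // MIPS = iterations * ops-per-iteration / (1e6 * seconds)
        double mips = iterations * OPS_PER_ITERATION / (1e6 * seconds);
        // use the accumulators so the JIT cannot drop the loop entirely
        if (intAcc == 42 && fpAcc == 42.0) System.out.println("unreachable");
        return mips;
    }

    public static void main(String[] args) {
        System.out.printf("%.1f MIPS%n", run(100_000_000L));
    }
}
```

Keeping the accumulators live at the end of the method matters: without it, an optimizing JIT may eliminate the loop body and report an arbitrarily high MIPS value.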
To measure memory bandwidth, we use the pmbw tool (download).
./pmbw -P <cores>
dd is a powerful tool that performs read/write operations and reports the throughput. We run dd writing/reading zeroes with a block size of 1 MB and a total size that fits in main memory, so that the repeated read reports buffered read throughput. We use the following commands:
# dd if=/dev/zero of=tmp bs=1M count=1024 conv=fdatasync,notrunc   (write throughput)
# echo 1 > /proc/sys/vm/drop_caches                                (drop the page cache)
# dd if=tmp of=/dev/null bs=1M count=1024                          (uncached read throughput)
# dd if=tmp of=/dev/null bs=1M count=1024                          (buffered read throughput)
The ioping tool reports the latency of the storage system for read/write operations. We issue ten requests and report the average:
# ioping -D . -c 10
(-D for using direct and not cached I/O)
# ioping -W . -c 10
iperf is a client/server network benchmarking tool. First, start the iperf server on one system:
# iperf -s
then start the client on a different system:
# iperf -c <server_ip_addr>
For UDP, start the server using the -u flag:
# iperf -u -s
then start the client and provide an upper bound for the bandwidth:
# iperf -u -c <server_ip_addr> -b 1Gbits/sec
If this upper bound is not provided, iperf will report a much lower value for UDP bandwidth.
For network latency, we ping another system in the same cluster. We run ping ten times and report the average:
# ping <ip_addr> -c 10
| Name | Input source | Input size | Benchmark origin |
|---|---|---|---|
| TestDFSIO | synthetic (TestDFSIO) | 12 GB | Hadoop examples |
| Pi | synthetic (Pi) | 16 Gsamples | Hadoop examples |
| Pi C++ | synthetic (Pi) | 16 Gsamples | Ported from Java |
| Terasort | synthetic (Teragen) | 12 GB | Hadoop examples |
| Wordcount | Wikipedia articles | 12 GB | Hadoop examples |
| Grep | Wikipedia articles | 12 GB | Hadoop examples |
| Kmeans | PUMA dataset | 4 GB | PUMA benchmark |
Input data is either synthetic or taken from well-known publicly available datasets.
The Pi estimator in C++ is available in the provided source code archive.
Kmeans was adapted from the PUMA benchmark suite.
All the other benchmarks are from the Hadoop 1.2.1 examples.
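For reference, the core of the Pi estimator is Monte Carlo sampling of the unit square. The sketch below is a hypothetical single-threaded illustration (class and method names are ours); the Hadoop example actually distributes the samples across map tasks and uses a quasi-random sequence rather than `java.util.Random`:

```java
import java.util.Random;

public class PiSketch {
    // Hypothetical single-threaded sketch of Monte Carlo Pi estimation; the
    // MapReduce versions split the samples across map tasks and sum the counts.
    static double estimate(long samples, long seed) {
        Random rnd = new Random(seed);
        long inside = 0;
        for (long i = 0; i < samples; i++) {
            double x = rnd.nextDouble();
            double y = rnd.nextDouble();
            if (x * x + y * y <= 1.0) inside++; // point lands in the quarter circle
        }
        return 4.0 * inside / samples; // area ratio converges to Pi
    }

    public static void main(String[] args) {
        System.out.println("Pi ~= " + estimate(10_000_000L, 42L));
    }
}
```

Because each sample is independent, the workload is embarrassingly parallel and CPU-bound, which is why Pi serves as the CPU-intensive MapReduce workload in the study.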
For the TPC-C benchmark, we use the tool provided by Percona (download).
For the TPC-H benchmark, we use the official TPC-H dbgen 2.16.1 provided by the TPC (download).
| Benchmark | Input size |
|---|---|
| TPC-C | 12 GB (130 warehouses) |
We run scan and join queries on Shark 0.9.1 (download).
| Workload | Input type | Input size |
|---|---|---|
| Join | Rankings/UserVisits | 43 MB / 1.3 GB |
| Join | Rankings/UserVisits | 86 MB / 2.5 GB |
SELECT pageURL, pageRank FROM rankings WHERE pageRank > X
SELECT sourceIP, totalRevenue, avgPageRank
FROM (SELECT sourceIP, AVG(pageRank) AS avgPageRank, SUM(adRevenue) AS totalRevenue
      FROM Rankings AS R, UserVisits AS UV
      WHERE R.pageURL = UV.destURL
        AND UV.visitDate BETWEEN Date('1980-01-01') AND Date('X')
      GROUP BY UV.sourceIP)
ORDER BY totalRevenue DESC LIMIT 1
All the logs and the spreadsheet with TCO model formulae can be found in the logs archive.
For MaxTCO, we use the Google TCO model described in the book “The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines” (second edition), available as a spreadsheet at http://goo.gl/eb6Ui.
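At a high level, the model amortizes capital costs and adds operational costs on both the datacenter and the server side; this is our rough reading of the book's decomposition (the actual spreadsheet breaks each term down much further):

```latex
% Rough structure of the monthly TCO in the Warehouse-Scale Machines model
% (symbols are ours; D = depreciation, O = operational expense):
\mathrm{TCO} \;=\;
  \underbrace{D_{\mathrm{dc}} + O_{\mathrm{dc}}}_{\text{datacenter depreciation and opex}}
  \;+\;
  \underbrace{D_{\mathrm{srv}} + O_{\mathrm{srv}}}_{\text{server depreciation and opex (incl.\ energy)}}
```

The server energy term is where the measured power draw of the ARM and Xeon nodes enters the comparison.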