The Energy Efficiency of Modern Multicore Systems

Abstract

Motivated by the proliferation of both homogeneous and heterogeneous shared-memory multicore systems with large core counts, and by the energy usage issue faced by computing today, we extend Amdahl’s and Gustafson’s laws for speedup to estimate the energy savings of a multicore system compared to the same system using a single core. To define energy savings, we introduce two key parameters: (i) the active power fraction (APF) of a core, representing the ratio between the core’s average active power and the power of the idle system, and (ii) the inter-core speedup (ICS), depicting the difference in speed among different types of cores in heterogeneous multicores. We show that energy savings are achievable, but they rapidly plateau at large core counts and are affected by the system’s APF, such that a lower APF value leads to higher energy savings. However, at low core counts, energy savings are affected by both the APF and the sequential fraction, such that for workloads with a large sequential fraction, energy savings are small regardless of the system’s APF. We validate our analytical models for energy savings with seven applications covering a wide range of sequential fractions, on two homogeneous server systems with 48 cores representing both traditional brawny x86/64 and emerging wimpy ARM server nodes, and on one heterogeneous system representing the emerging ARM big.LITTLE architecture.

Overview

This web page provides details about the experimental setup used in the article "The Energy Efficiency of Modern Multicore Systems".

For more information send an e-mail to dumitrel [at] comp [.] nus [.] edu [.] sg

System characterization

Table 1. Applications

Application | Benchmark suite | Input size | OMP scheduling
EP (Embarrassingly Parallel) | NPB [1] | Class C (Random-number pairs: 2^32) [2] | default
BT (Block Tri-diagonal Solver) | NPB [1] | Class C (Grid size: 162 x 162 x 162, Iterations: 200) [2] | static
SP (Scalar Penta-diagonal Solver) | NPB [1] | Class C (Grid size: 162 x 162 x 162, Iterations: 400) [2] | static
LV (LavaMD) | Rodinia [3] | Boxes1d: 24 | default
KM (Kmeans) | Rodinia [3] | n=1,000,000 m=34 k=5 | static
PF (Pathfinder) | Rodinia [3] | Width (rows): 900,000, number of steps (columns): 500 | default
BS (BlackScholes) | Parsec [4] | 4,000,000 options | default

Figure 1. Odroid XU3 cluster connected to a Yokogawa power meter

Active Power Fraction of Modern Multicore Systems

In this section, we assess the influence of different measured APF values, representing modern multicore systems, on energy savings. For this analysis, we select seven systems covering both brawny and wimpy nodes from the server, desktop and embedded markets. As representatives of the server market, we select (i) a 48-core AMD Opteron (AMD) server system with 64 GB of RAM in a Non-Uniform Memory Access (NUMA) configuration, (ii) a system based on Intel Xeon E5-2630 v4 with 10 cores clocked at 2.20 GHz, and (iii) a 48-core ARM-based server system (ARM) produced by Gigabyte and powered by a 64-bit Cavium ThunderX CPU. Xeon and AMD servers are widely used in cloud computing and in Top500 supercomputers. As a representative of the desktop and laptop market, we select a system based on an Intel i7 CPU with four physical cores clocked at 3.40 GHz. Representing the emerging wimpy nodes capable of running advanced data analytics and machine learning, we analyze a Jetson TX1 node from Nvidia. This system has four ARM Cortex-A57 cores running at a maximum of 1.73 GHz. We also select the Raspberry Pi 3 model B (Pi3), which is widely used by hobbyists for a variety of IoT projects. This tiny system has a quad-core ARM Cortex-A53 CPU running at 1.2 GHz. Lastly, as a representative of the emerging wimpy heterogeneous systems, we use the Odroid XU3 (XU3) development board, powered by a Samsung Exynos 5422 chip with the 32-bit ARM big.LITTLE architecture, which has four ARM Cortex-A7 little cores and four ARM Cortex-A15 big cores.

To measure the APF, we follow the same steps as described in the paper. The results are summarized in Table 2. Surprisingly, the i7’s APF is large, since its active core power is almost half of the idle system power. At the other extreme is the 48-core ARM server, which has the lowest APF. We observe that server systems tend to have lower APF since their idle power is high, due to additional I/O sub-systems and cooling. On the other hand, consumer devices have higher APF, suggesting that the CPU is the main active part of these systems. Server systems exhibit higher energy savings due to their lower APF, as highlighted by the proposed energy savings models. Assuming that all systems have 48 cores while maintaining the same APF, Figure 2 shows that the Xeon, AMD and ARM servers achieve higher energy savings than the i7 desktop system and the XU3 embedded system, when either Amdahl’s or Gustafson’s law is used to determine the speedup for a workload with a 10% sequential fraction.
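The trend described above can be illustrated with a small sketch. The energy model below is a simplifying assumption made for this page and is not necessarily the model derived in the paper: the idle system power is drawn for the whole run, one core is active during the sequential part, all n cores are active during the parallel part, and the run time follows Amdahl's law. The function names are hypothetical; the APF values are taken from Table 2.

```python
def amdahl_speedup(seq_fraction, cores):
    """Classic Amdahl speedup for a fixed-size workload."""
    return 1.0 / (seq_fraction + (1.0 - seq_fraction) / cores)


def energy_savings_amdahl(apf, seq_fraction, cores):
    """Fraction of energy saved on `cores` cores versus one core.

    Assumed model (not taken from the paper): energies are normalized by
    (idle system power x single-core run time), so one active core adds
    `apf` to the normalized power of 1.0 drawn by the idle system.
    """
    single_core_energy = 1.0 + apf
    # Sequential part: one active core for a fraction `seq_fraction` of the time.
    seq_energy = seq_fraction * (1.0 + apf)
    # Parallel part: all cores active, time shrinks by a factor of `cores`.
    par_energy = (1.0 - seq_fraction) / cores * (1.0 + cores * apf)
    return 1.0 - (seq_energy + par_energy) / single_core_energy


if __name__ == "__main__":
    # APF values from Table 2; 10% sequential fraction and 48 cores as in Figure 2.
    for name, apf in [("ARM", 0.0095), ("AMD", 0.0135), ("Xeon", 0.0437),
                      ("i7", 0.4808), ("XU3 big", 0.575)]:
        print(f"{name:8s} savings = {energy_savings_amdahl(apf, 0.10, 48):.3f}")
```

Under this assumed model, the savings approach (1 - s) / (1 + APF) as the core count grows, where s is the sequential fraction. This is consistent with the plateau at large core counts and with lower APF leading to higher savings.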

Table 2. APF of Modern Multicore Systems

System | AMD | Xeon | i7 | ARM | Jetson | Pi3 | XU3 big | XU3 little
Psys [W] | 386 | 63 | 26 | 126 | 2.65 | 1.92 | 6 | 6
Pcore [W] | 5.2 | 2.8 | 12.5 | 1.2 | 0.6 | 0.3 | 3.45 | 0.45
APF | 0.0135 | 0.0437 | 0.4808 | 0.0095 | 0.2264 | 0.1536 | 0.575 | 0.075
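As a quick cross-check of Table 2, the APF can be recomputed as the ratio of the active core power to the idle system power, following the definition in the abstract. The sketch below is a minimal illustration with hypothetical names. It assumes that Psys in Table 2 is the idle system power and Pcore is the average active power of one core.

```python
# (Psys [W], Pcore [W], APF as listed) transcribed from Table 2.
TABLE_2 = {
    "AMD": (386.0, 5.2, 0.0135),
    "Xeon": (63.0, 2.8, 0.0437),
    "i7": (26.0, 12.5, 0.4808),
    "ARM": (126.0, 1.2, 0.0095),
    "Jetson": (2.65, 0.6, 0.2264),
    "Pi3": (1.92, 0.3, 0.1536),
    "XU3 big": (6.0, 3.45, 0.575),
    "XU3 little": (6.0, 0.45, 0.075),
}


def active_power_fraction(p_core, p_sys_idle):
    """APF = average active power of one core / idle system power."""
    if p_sys_idle <= 0:
        raise ValueError("idle system power must be positive")
    return p_core / p_sys_idle


if __name__ == "__main__":
    for name, (p_sys, p_core, listed) in TABLE_2.items():
        apf = active_power_fraction(p_core, p_sys)
        print(f"{name:10s} recomputed = {apf:.4f}   listed = {listed}")
```

For the Xeon and Pi3 columns, the recomputed ratio differs from the listed APF in the third decimal place. One possible explanation is that the power values shown in the table are rounded.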

(a) Using Amdahl's law for speedup

(b) Using Gustafson's law for speedup

Figure 2. Estimated Energy Savings on Modern Multicore Systems

References