# Modeling the Energy Efficiency of Heterogeneous Clusters

Lavanya Ramapantulu, Bogdan Marius Tudor, Dumitrel Loghin, Trang Vu, Yong Meng Teo Department of Computer Science National University of Singapore

> 10<sup>th</sup> September 2014 43<sup>rd</sup> International Conference on Parallel Processing, Minneapolis, MN, USA

## Outline

- Motivation
- Objective
- Methodology
- Analysis
- Conclusions

#### **Energy Use of Datacenters**

- Energy consumption of large-scale data centers and its costs are significant
  - 2006 6,000 data centers in US consumed 61x10<sup>9</sup> KWh of energy, 1.5% of all electricity consumption, at a cost of \$4.5 billion
  - 2006-2011 from 7 GW to 12 GW, 10 new power plants
- 1998-2007: performance of supercomputers (+7,000%) has increased 3.5 times faster than their operating efficiency\* (+2,000%)

\*operating efficiency of a system = *performance per Watt of power* 

#### **Datacenter Power Usage**



Source: L. Barroso, J Clidaras and U. Hölzle, *The Datacenter as a Computer: An Introduction to Warehouse-Scale Machines*, Morgan Claypool, 2013

#### Wimpy vs Brawny Servers [Gupta et al. 2013]



#### performance [MFLOPS]

#### **Intra-node Heterogeneity**

- KnightShift: [Wong et al. 2012]
  - Low utilizations, lower energy proportionality
  - Knight responds to low-utilization requests
  - Enables two energy-efficient operating regions

- Thin servers with smart pipes: [Lim et al. 2013]
  - Accelerator for memcached
  - 6X-16X power-performance improvement

## **Inter-node Heterogeneity**

- Dynamic request allocation on heterogeneous clusters
  - Throughput vs. power [Heath et al. 2005]
  - Pikachu: dynamic load balancing among fast and slow nodes for MapReduce [Gandhi et al. 2013]
- Static analysis of single workload on heterogeneous clusters
  - Unexplored from energy-time performance perspective

## Objective

- For a given application with a power budget, to determine energy efficient heterogeneous configurations that meet an execution time deadline.
  - Energy efficient configurations meet a given deadline with the minimum energy

## Contributions

 A measurement-based analytical model to determine energy efficient configurations on a mix of heterogeneous nodes

– Meets a deadline with minimum energy

 Our analysis shows that energy-deadline Pareto frontier consisting of heterogeneous mixes is almost always more energy-efficient than homogeneous clusters

## Outline

- Motivation
- Objective
- Methodology
- Analysis
- Conclusions

#### Approach



## Approach



## **Applications**

#### Broad range of datacenter application domains

| Domain             | Program       | Problem Size                 |
|--------------------|---------------|------------------------------|
| HPC                | EP            | 2,147,483,648 random numbers |
| Web Server         | memcached     | 600,000 GET/SET operations   |
| Streaming video    | x264          | 600 frames 704 × 576         |
| Financial          | Black-scholes | 500,000 stock options        |
| Speech recognition | Julius        | 2,310,559 samples            |
| Web security       | RSA-2048      | 5000 keys verifications      |

#### **Heterogeneous System**

- AMD K10, x86\_64
- six-core, 0.8 to 2.1GHz
- ARM v7-A Cortex-A9
- quad-core, 0.2 to 1.4GHz





#### **Baseline Execution**

- Measurements needed only for a single node, for each type of node
  - non-intrusive hardware performance counters
- Execute the program for a very small problem size
  - measure instructions, computation cycles and stall cycles
  - Ex: measure instructions per GET operation of memcached
- Execute micro-benchmarks to measure active and stall power of processor cores

#### **Execution Time Model**



## **Execution Time Model**

•  $T_{core,work} \approx \frac{computational work cycles}{clock frequency}$ 

•  $T_{core,stall} \approx \frac{stall cycles due to memory contention}{clock frequency}$ 

- Stall cycles increase linearly with
  - increase in core clock frequency
  - increase in the number of cores

# **Energy Model**

- Total Energy =  $E_{ARM} \times n_{ARM} + E_{AMD} \times n_{AMD}$
- $E_{node} = E_{core} + E_{mem} + E_{I/O} + E_{idle}$
- $E_{core} = P_{core,act} \times T_{core,work} + P_{core,stall} \times T_{core,stall}$ 
  - Power × Time
  - uses execution time model
  - measured values for P<sub>core,act</sub> , P<sub>core,stall</sub>

#### **Model Summary**

| Execution Time Model  |                                                                                                                                          |  |  |  |
|-----------------------|------------------------------------------------------------------------------------------------------------------------------------------|--|--|--|
| т                     | max(T <sub>ARM</sub> ,T <sub>AMD</sub> )                                                                                                 |  |  |  |
| T <sub>ARM</sub>      | max(T <sub>CPU,ARM</sub> ,T <sub>I/O,ARM</sub> )                                                                                         |  |  |  |
| T <sub>CPU,ARM</sub>  | max(T <sub>core,ARM</sub> ,T <sub>mem,ARM</sub> )                                                                                        |  |  |  |
| T <sub>core,ARM</sub> | I <sub>core,ARM</sub> × (WPI <sub>ARM</sub> + SPI <sub>core,ARM</sub> )                                                                  |  |  |  |
|                       | f <sub>ARM</sub>                                                                                                                         |  |  |  |
| T <sub>mem,AR</sub>   | I <sub>core,ARM</sub> × (WPI <sub>ARM</sub> + SPI <sub>mem,ARM</sub> )                                                                   |  |  |  |
| М                     | f <sub>ARM</sub>                                                                                                                         |  |  |  |
| T <sub>i/o,ARM</sub>  | $max(T_{I/O,ARM}, 1/\lambda_{I/O})$                                                                                                      |  |  |  |
| Energy Model          |                                                                                                                                          |  |  |  |
| E                     | E <sub>ARM</sub> +E <sub>AMD</sub>                                                                                                       |  |  |  |
| E <sub>ARM</sub>      | (E <sub>core,ARM</sub> +E <sub>mem,ARM</sub> +E <sub>I/O,ARM</sub> +E <sub>idle,ARM</sub> ) × n <sub>ARM</sub>                           |  |  |  |
| E <sub>core,ARM</sub> | $(P_{\text{core,act,ARM}} \times T_{\text{act,ARM}} + P_{\text{core,stall,ARM}} \times T_{\text{stall,ARM}}) \times c_{\text{act, ARM}}$ |  |  |  |

#### **Model Validation**

| Program      | Configuration |              | Execution<br>time | Energy   |
|--------------|---------------|--------------|-------------------|----------|
| Trogram      | ARM<br>nodes  | AMD<br>nodes | error[%]          | error[%] |
| EP           | 8             | 1            | 3                 | 10       |
|              | 8             | 0            | 3                 | 2        |
| memcached    | 8             | 1            | 10                | 8        |
|              | 8             | 0            | 3                 | 1        |
| x264         | 8             | 1            | 11                | 10       |
|              | 8             | 0            | 13                | 11       |
| blackscholes | 8             | 1            | 4                 | 7        |
|              | 8             | 0            | 4                 | 13       |
| Julius       | 8             | 1            | 13                | 1        |
|              | 8             | 0            | 1                 | 2        |
| RSA-2048     | 8             | 1            | 2                 | 8        |
|              | 8             | 0            | 1                 | 12       |

## Outline

- Motivation
- Objective
- Methodology
- Analysis
- Conclusions

#### **Performance-to-Power Ratio**



#### **Research Questions**

- 1. Is heterogeneity better than homogeneity ?
- 2. Are larger mixes of heterogeneous nodes better ?
- 3. ..

#### **Heterogeneity versus Homogeneity**



#### **Heterogeneity versus Homogeneity**



#### **Heterogeneity versus Homogeneity**



## Are larger mixes better ?



#### **Observations**

- 1. Heterogeneity allows larger energy savings compared to homogeneous systems.
- 2. Larger mixes increase the number of configurations in the sweet region.

3. ..

## Conclusions

- measurement-driven analytical model to determine energy-efficient configurations for a single workload on a heterogeneous mix with different ISA's
- Heterogeneity is almost always more energyefficient than homogeneity
  - But not for programs with large sequential fraction and high parallel overhead

#### **Questions**?

#### Thank you [lavanya,teoym]@comp.nus.edu.sg

L. Ramapantulu, B.M. Tudor, D.Loghin, T. Vu and Y.M. Teo, **Modeling the Energy Efficiency of Heterogeneous Clusters**, Proceedings of 43rd International Conference on Parallel Processing, Minneapolis, USA, Sep 9-12, 2014.

## Thank you

#### **BACKUP SLIDES**

#### **System Overview**

| Node                 | AMD K10     | ARM Cortex-A9 |
|----------------------|-------------|---------------|
| ISA                  | X86_64      | ARM v7-A      |
| Cores/node           | 6           | 4             |
| Core clock frequency | 0.8-2.1 GHz | 0.2-1.4 GHz   |
| L1 data cache        | 64KB/core   | 32KB/core     |
| L2 cache             | 512KB/core  | 1MB/node      |
| L3 cache             | 6MB /node   | NA            |
| Memory               | 8GB DDR3    | 1GB LP-DDR2   |
| I/O bandwidth        | 1Gbps       | 100MBps       |

#### Stalls due to memory contention



# **CPU and I/O Overlap**

- Tudor et al. [SIGMETRICS'13]
- Server workloads (Ex: memcached)



## What is a good mix ?



#### **Other Research Questions**

- Queuing Delay
  - As cluster utilization increases due to faster job arrivals, the energy savings are further amplified, but the minimal response time achievable is reduced

## **Queuing Delay**

