Prediction-Based Task Migration on S-NUCA Many-Cores

Martin Rapp∗, Anuj Pathania†, Tulika Mitra†, Jörg Henkel∗
∗Karlsruhe Institute of Technology, Germany, †National University of Singapore, Singapore
martin.rapp@kit.edu, pathania@comp.nus.edu.sg, tulika@comp.nus.edu.sg, henkel@kit.edu

Abstract—Performance of a task running on a many-core with distributed shared Last-Level Cache (LLC) strongly depends on two factors: the power budget needed to guarantee thermally safe operation and the LLC latency. The task’s thread-to-core mapping determines both the factors. Arrival and departure of tasks on a many-core deployed in an open system can change its state significantly in terms of available cores and power budget. Task migrations can therefore be used as a tool to keep the many-core operating at the peak performance. Furthermore, the relative impacts of power budget and LLC latency on a task’s performance can change with its different execution phases mandating its migration on-the-fly.

We propose the first run-time algorithm PCMig that increases the performance of a many-core with distributed shared LLC by migrating tasks based on their phases and the many-core’s state. PCMig is based on a performance-prediction model that predicts the performance impact of migrations. PCMig results in up to 16% reduction in the average response time compared to state-of-the-art.

Index Terms—Cache Memory, Processor Scheduling, Power Dissipation, Thermal Stability

I. INTRODUCTION

Increasing power densities ended the race to higher and higher processor frequencies. Instead, the number of cores in processors kept on rising to satisfy the never-ending need for more computational power. Many-core processors with dozens to hundreds of cores housed on a single die emerged as promising parallel processing platforms [1]. With such large numbers of cores, the one-thread-per-core model is commonly used [2]. The task-to-core mappings are the key to derive maximum performance out of many-cores.

The task arrivals and departures, as well as the tasks themselves, are not known in advance in open systems [3]. Furthermore, the mappings need to be constantly readjusted at run-time by task migrations to cope with changing state of the many-core and task requirements. Mapping multiple tasks to multiple cores optimally is an NP-hard problem in the general case [4]. Therefore, we are required to develop fast heuristics for run-time task migrations in many-cores.

Power Budgets: Increasing power densities cause elevated on-chip temperatures, which above a critical temperature can cause permanent damage to the chip and thereby necessitate run-time thermal management. One way to ensure thermal safety on many-cores is through the use of power budgets that constrain the power consumption of cores. We use in this work a mapping-determined non-uniform power budget called Thermal Safe Power (TSP) [5], which maximizes the sum of all per-core power budgets under a thermal constraint.

Many-Core Architecture with S-NUCA: Figure 1 presents an overview of a many-core architecture with a distributed shared LLC. The many-core consists of tiles, each containing a core with its associated caches. The L1 caches are private to the core on the tile. The shared L2 cache (LLC) is physically distributed among all tiles with each tile holding one LLC bank. All the tiles are connected by a Network-on-Chip (NoC) with a router in every tile. Memory controllers on the periphery provide DRAM access. Static Non-Uniform Cache Access (S-NUCA) is a scalable policy for managing physically distributed, yet logically shared LLCs. It determines a static mapping of memory address to LLC bank at design-time [6].

The latency of an LLC access under S-NUCA depends upon the hop count on the NoC between the tile where the thread is running and the tile where the LLC bank containing the memory address is located. The hop count is measured by the Manhattan Distance (MD). The average LLC latency experienced by a thread executing at a core depends upon its Average Manhattan Distance (AMD) to all tiles of a the many-core. The closer a core is to the center of the many-core, the lower is its AMD and thereby the lower is the average LLC latency for threads running on it. Therefore, the S-NUCA policy imposes an inherent heterogeneity on the otherwise perfectly homogeneous many-core. In order to minimize their average LLC latency, tasks should be mapped to cores as close as possible to the center of the many-core [7].

However, mapping all the tasks to the center of the many-core creates a thermal hotspot, which reduces their power budgets and therefore degrades their performance. The tasks need to be mapped as far away as possible from each other to maximize their power budgets. However, this causes cores near the corners of many-core to be used excessively, which are

Due to space constraints, not all of the originally submitted content is included in this IP paper.
Authors of [10] propose to perform defragmentation of the many-core. Thereby, already running tasks are migrated in order to let the idle cores form a contiguous shape. This allows new arriving tasks to be mapped in a contiguous shape. 

The relative impacts of power budget and LLC latency in this trade-off depend upon the thread characteristics. Figure 2 shows the performance (measured in Instructions per Second (IPS)) for PARSEC [9] blackscholes master and slave threads depending on the power budget and the AMD of the core on which they are executing. The master threads prepares data, which is processed by the slave threads. Due to more frequent LLC accesses, the master thread is more sensitive to the AMD. With a high power budget, the performance of the master thread drops significantly with increasing AMD whereas slave threads’ performance remains nearly constant for all AMD values (Lines a in Figure 2). Contrarily, the slave threads, which mainly perform computations, are more sensitive to the power budget. For a low AMD, the performance of the master thread saturates early with increasing power budget whereas the performance of the slave threads saturates at a higher power budget (Lines b in Figure 2). Thread characteristics must be considered to determine the performance-maximizing trade-off between power budget and LLC latency.

Arriving or departing tasks in an open system can cause sudden changes in the workload that require adjustments to the current mappings in order to maximize the many-core’s performance. Furthermore, the characteristics of a certain task may change over time as the task proceeds through various phases in its execution. This may change the trade-off between power budget and LLC latency at run-time for the task and can only be satisfactorily addressed by changing its mapping. Therefore, performance of a many-core in an open system can only be maximized by dynamically adapting the mappings of its executing tasks using task migrations.

Authors of [10] propose to perform defragmentation of the many-core. Thereby, already running tasks are migrated in order to let the idle cores form a contiguous shape. Defrag also the cores with a high AMD and therefore a high average LLC latency. Therefore, a trade-off needs to be made between minimizing the LLC latency and maximizing the power budget for a task in order to maximize its performance [8]. In our previous work [8], we proposed a task mapping algorithm PCMap which exploits this trade-off. However, this algorithm does not employ task migration and hence is not able to react to changing workload and thread characteristics.

The relative impacts of power budget and LLC latency in this trade-off depend upon the thread characteristics. Figure 2 shows the performance (measured in Instructions per Second (IPS)) for PARSEC [9] blackscholes master and slave threads depending on the power budget and the AMD of the core on which they are executing. The master threads prepares data, which is processed by the slave threads. Due to more frequent LLC accesses, the master thread is more sensitive to the AMD. With a high power budget, the performance of the master thread drops significantly with increasing AMD whereas slave threads’ performance remains nearly constant for all AMD values (Lines a in Figure 2). Contrarily, the slave threads, which mainly perform computations, are more sensitive to the power budget. For a low AMD, the performance of the master thread saturates early with increasing power budget whereas the performance of the slave threads saturates at a higher power budget (Lines b in Figure 2). Thread characteristics must be considered to determine the performance-maximizing trade-off between power budget and LLC latency.

Arriving or departing tasks in an open system can cause sudden changes in the workload that require adjustments to the current mappings in order to maximize the many-core’s performance. Furthermore, the characteristics of a certain task may change over time as the task proceeds through various phases in its execution. This may change the trade-off between power budget and LLC latency at run-time for the task and can only be satisfactorily addressed by changing its mapping. Therefore, performance of a many-core in an open system can only be maximized by dynamically adapting the mappings of its executing tasks using task migrations.

Authors of [10] propose to perform defragmentation of the many-core. Thereby, already running tasks are migrated in order to let the idle cores form a contiguous shape. This allows new arriving tasks to be mapped in a contiguous shape. Defrag...
performance loss due to LLC accesses. The IPS and characteristics are represented by the thread’s current IPS 

\[ \text{IPS} \]

Cores of a many-core with S-NUCA differ in their AMDs, as thread characteristics to distinguish different thread behaviors.

This model needs features for both the target core and the thread that predicts the IPS of a thread if migrated to another core. 

Fig. 4: X264 shows different phases during its execution, which results in changing impacts of power budget and LLC latency on its performance. Migrating x264 in each phase to the core best suited for that phase increases the per-phase performance by up to 14 %.

(b) Mappings of the tasks

(c) IPS of the phases under different mappings

Hence, we add the AMD and power budget \( P_B \) of this core to the set of our features. Details on the internals of this model have been omitted due to space constraints.

IV. ONLINE TASK MIGRATION ALGORITHM

Our run-time task migration algorithm PCMig is based on the IPS prediction model. Task migration periodically traverses the following steps:

First, migration candidates are created. There are \( O(n^T) \) possible migrations in a many-core with \( n \) cores and set of threads \( T \). The number of migrations is too large to be explored in its entirety. We therefore limit our algorithm to only migrate one thread at a time or swapping two threads. This reduces the number of possible migrations to \( O(n \cdot |T|) \).

All these migration candidates need to be rated in order to select the most promising one. Let \( M_0 \) be the mapping before migration. A migration candidate \( m \) is defined by the changed mapping \( M_m \). The IPS prediction model is used to rate each migration candidate. The AMDs and power budgets before and after migration can be determined from the mappings \( M_0 \) and \( M_m \). The IPS and \( \varepsilon_{LLC} \) before migration are obtained from performance counters. Migration of a thread does not only affect this thread, but also all other threads because their power budgets may change as well. Therefore, the model is used to predict the IPS of all threads before and after migration. In order to be able to select one migration, we calculate the average relative IPS improvement \( \Delta_{IPS}(m) \) among all threads for all migration candidates \( m \).

\[ \Delta_{IPS}(m) = \frac{\text{IPS}(t, M_m)}{\text{IPS}(t, M_0)} - 1 \]

Note that no actual migration is performed to calculate this metric, but the IPS prediction model \( \text{IPS}(t, M) \) described in Section III is used instead. Finally, the migration with the highest IPS improvement \( \Delta_{IPS}(m) \) is selected. However, this migration is executed only if \( \Delta_{IPS}(m) > \delta \). The threshold \( \delta \) prevents migrations with only very little expected speedup, where the performance penalty of the migration itself is likely to outweigh the expected benefit. Also, it prevents oscillations due to small prediction inaccuracies. We set \( \delta = 3\% \).

System Integration: Thread characteristics are not available during the initial mapping of a task. Therefore, our IPS model cannot be used. Instead, we use PCMMap [8], which is a task-agnostic mapping algorithm to decide the initial mapping.

Migrating a thread causes a performance penalty due to cold caches. It takes some time until the caches are warm and the thread reaches its peak performance, which strongly depends on the size of the L1 caches and the LLC latency. On our infrastructure, it takes less than 0.2 ms for the IPS to saturate after a migration. The migration interval is set to 10 ms, which is small enough to react fast to changes, but large enough to maintain a reasonable overhead.

V. EVALUATION

Experimental Setup: We perform the evaluation using HotSniper [12], which offers multi-threaded multi-program simulation of many-cores with full modeling of shared resource contentions. HotSniper combines the Sniper many-core
workloads and thread characteristics. Performance of a thread strongly depends on the power budget and the AMDs of its core. However, both factors affect the performance of different threads differently. We proposed a run-time task migration algorithm PCMig that uses a performance-prediction model to rate potential migration candidates prior to actual migration. Our algorithm results in up to 16% improvement in the response time compared to the state-of-the-art.

ACKNOWLEDGMENTS

This work was supported by the German Research Foundation (DFG) as part of the Transregional Collaborative Research Centre “Invasive Computing” (SFB/TR 89), by the National Research Foundation, Prime Ministers Office, Singapore under its Industry-IHL Partnership Grant and by Huawei International Pte. Ltd. NRF2015-IIP003.

REFERENCES