Chronos for multi-cores: a WCET analysis tool for multi-core platforms


Background


This page describes multi-core Chronos --- a worst-case execution time (WCET) analyzer for multi-core platforms. The tool was originally built upon Chronos --- a WCET analyzer for single-core platforms. The single-core version of Chronos is still actively maintained and distributed here.

Over the last decade, the following developers have contributed to Chronos (single-core and multi-core) by implementing some of its crucial components:

  1. Xianfeng Li (past graduate student, NUS)
  2. Liang Yun (past graduate student, NUS)
  3. Ju Lei (past graduate student, NUS)
  4. Bach Khoa Huynh (past graduate student, NUS)
  5. Vivy Suhendra (past graduate student, NUS)
  6. Sudipta Chattopadhyay (past graduate student, NUS)
  7. Abhijeet Banerjee (past graduate student, NUS; primary contact for Chronos)
  8. Lee Kee Chong (current graduate student and research assistant, NUS)
  9. Timon Kelter (current graduate student, TU Dortmund)
  10. Abhik Roychoudhury (Professor at NUS and additional contact)
  11. Tulika Mitra (Professor at NUS and additional contact) 

Key Extension


Chronos for multi-cores extends its single-core counterpart through two crucial analyses: (1) shared Level 2 instruction cache analysis and (2) shared TDMA round-robin bus analysis. Additionally, the multi-core version of Chronos applies novel analysis techniques that capture the key interactions of the shared cache and shared bus with other micro-architectural features (e.g. pipeline and branch prediction). The multi-core version of Chronos currently supports the following micro-architectural features:

  1. Pipeline (inorder, out-of-order, superscalar)
  2. Branch prediction and speculative execution
  3. L1 instruction cache
  4. L2 (shared) instruction cache
  5. Shared time-division multiple accessed (TDMA) bus

The multi-core version of Chronos retains all of the pipeline and branch prediction functionality provided by the single-core version. Consequently, it can also handle the timing anomaly problem in multi-cores that arises from the interaction of the shared cache and shared bus with the pipeline and branch prediction.

Current / Future work:

The current distribution of multi-core Chronos does not include the timing effect of data caches. Integration of data cache analysis into the multi-core extension is under development and will be available in the next release.

Publication related to the multi-core Chronos tool:

The technical aspects of our tool are described in the following research report:

A Unified WCET Analysis Framework for Multi-core Platforms
Sudipta Chattopadhyay, Chong Lee Kee, Abhik Roychoudhury, Timon Kelter, Peter Marwedel and Heiko Falk
Technical Report, October 2011 [a shorter version will appear in the 18th IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS) 2012]

Related publications:

1. Chronos: A Timing Analyzer for Embedded Software
Xianfeng Li, Yun Liang, Tulika Mitra and Abhik Roychoudhury
Science of Computer Programming, Volume 69, December 2007

2. Modeling Out-of-Order Processors for Software Timing Analysis
Xianfeng Li, Abhik Roychoudhury and Tulika Mitra
IEEE Real-Time Systems Symposium (RTSS) 2004

3. Timing Analysis of Concurrent Programs Running on Shared Cache Multi-cores
Yan Li, Vivy Suhendra, Yun Liang, Tulika Mitra and Abhik Roychoudhury
IEEE Real-Time Systems Symposium (RTSS) 2009

4. Modeling Shared Cache and Bus in Multi-core Platforms for Timing Analysis
Sudipta Chattopadhyay, Abhik Roychoudhury and Tulika Mitra
13th International Workshop on Software and Compilers for Embedded Systems (SCOPES) 2010

5. Bus-Aware Multicore WCET Analysis through TDMA Offset Bounds
Timon Kelter, Heiko Falk, Peter Marwedel, Sudipta Chattopadhyay and Abhik Roychoudhury
23rd Euromicro Conference on Real-Time Systems (ECRTS) 2011


DOWNLOAD: Open-source version of multi-core Chronos


Contact Sudipta Chattopadhyay (sudiptaonline@gmail.com) to get the first prototype version of the Chronos multi-core WCET analysis tool. We also provide a precompiled binary of a simulator (a multi-core extension of the simplescalar simulator) to validate the obtained WCET results.

A version is also available on Bitbucket: https://bitbucket.org/sudiptac/multi-core-chronos


Prototype and benchmarks


The prototype of multi-core Chronos contains the following directories:

chronos-multi-core

This directory contains the WCET analyzer targeting simplescalar PISA. The 32-bit precompiled analyzer binary is chronos-multi-core/est/est. The analyzer was built on an Intel Pentium 4 machine running the Ubuntu 8.10 operating system.

CMP_SIM

This directory contains the multi-core simulator supporting the shared cache and shared TDMA bus. The target architecture is simplescalar PISA (i.e. the same as the WCET analyzer). The 32-bit precompiled simulator binary is CMP_SIM/simplesim-3.0/sim-outorder. The simulator was also built on an Intel Pentium 4 machine running the Ubuntu 8.10 operating system.

benchmarks

Contains precompiled binaries of sample benchmarks targeting simplescalar PISA. Each subdirectory, say <bm>, in the benchmarks directory contains the following relevant files:

<bm>.c : C source file of the benchmark

<bm> : Simplescalar PISA binary of <bm>

<bm>.cfg : Internal representation of the <bm> control flow graph

<bm>.loopbound : Loop bound specifier. Each line contains the loop bound of a specific loop in the source file, in the following format:

    <procedure id> <loop id> <relative loop bound>

where <procedure id> is the id of the procedure in source code order and <loop id> is the relative id of the loop (inside the procedure) in source code order.
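
For illustration only (the ids and bounds below are hypothetical and not taken from any shipped benchmark; check a shipped .loopbound file to confirm whether ids start from 0 or 1), a <bm>.loopbound file for a program whose first procedure contains one loop bounded by 100 iterations and whose second procedure contains one loop bounded by 20 iterations could look as follows:

    0 0 100
    1 0 20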

<bm>.dis : Disassembled file from <bm> binary.

<bm>.cons : User-specified ILP constraints that assist the WCET analysis process.

<bm>.lp : The ILP problem formulated to compute the WCET of <bm>. This <bm>.lp file can be solved by lp_solve.

<bm>.ilp : Same as <bm>.lp, but in a format supported by CPLEX. If you use CPLEX, the input file should be <bm>.ilp instead of <bm>.lp.

lp_solve

This is a freely available LP solver which we used for our experiments. The solver is used to obtain the WCET value, which is formulated as an ILP problem. "lp_solve/lp_solve" is the binary which invokes the solver. However, we strongly recommend using the commercial solver CPLEX, as we found it much faster than lp_solve.
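
As an illustration (the exact path of the .lp file may differ; check the shipped benchmark directories), solving the ILP formulation for the cnt benchmark with the bundled solver, run from the extraction directory $IDIR, could look like:

cmd> ./lp_solve/lp_solve benchmarks/cnt/cnt.lp

The objective value reported by the solver should correspond to the estimated WCET in processor cycles.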


Download and test


Download the package "chronos-mc-v1-exe.zip" (contact sudiptaonline@gmail.com to obtain it) and extract it in a directory, say $IDIR:

cmd> unzip chronos-mc-v1-exe.zip

Extracting it produces the following four directories:

1) chronos-multi-core

2) CMP_SIM

3) benchmarks

4) lp_solve

Go inside the "chronos-multi-core" directory and execute the following commands to test the WCET analyzer:

cmd> ./est/est -config processor_config/exp_cache/processor_l2.opt task_config/exp_cache/tgraph_mrtc_est_2

If the above command executes successfully and gives a valid WCET value, then you have installed the analyzer successfully.

Execute the following command from the CMP_SIM/simplesim-3.0 directory to test the simulator:

cmd> ./sim-outorder -sconfig sim-test/exp_cache/processor.opt sim-test/exp_cache/cnt.arg sim-test/exp_cache/jfdcint.arg

If the above command produces valid simulation results at the end, then you have successfully installed the multi-core simulator.


Running the WCET analyzer


From the directory "chronos-multi-core", you can run the run the chronos multicore analyzer using the following command:

cmd> ./est/est -config <processor_config> <task_config>

where <processor_config> is the file providing the micro-architectural configuration and <task_config> is the file describing the mapping of tasks to different cores. The format of these files is described below:

processor_config:

processor_config describes the micro-architectural configuration. It follows the same format as simplescalar, except for a few modifications introduced to handle multi-core specific features. A typical processor configuration file looks as follows (descriptions are shown after the # sign):

    -cache:il1 il1:16:32:2:l      # 2-way associative, 1 KB L1 cache
    -cache:dl1 none               # perfect L1 data cache
    -cache:il2 il2:32:32:4:l      # 4-way associative, 4 KB L2 cache
    -cache:il2lat 6               # L2 cache hit latency = 6 cycles
    -mem:lat 30 2                 # memory latency = 30 cycles
    -bpred 2lev                   # 2-level branch predictor
    -fetch:ifqsize 4              # 4-entry instruction fetch queue
    -issue:inorder false          # out-of-order processor
    -decode:width 2               # 2-way superscalar
    -issue:width 2                # 2-way superscalar
    -commit:width 2               # 2-way superscalar
    -ruu:size 8                   # 8-entry reorder buffer
    -il2:share 2                  # 2 cores share an L2 cache
    -core:ncores 2                # total number of cores = 2
    -bus:bustype 0                # TDMA round robin shared bus
    -bus:slotlength 50            # bus slot length assigned to each core = 50 cycles


Note that except for the last four parameters, all other parameters are identical to simplescalar. A more detailed description of these parameters can be found by running sim-outorder without any input. The last four parameters are introduced for multi-core WCET analysis and are detailed as follows:

-core:ncores

Total number of cores in the processor. The default value is 1.

-il2:share

Total number of cores sharing an L2 cache. The default value is 1. Therefore, providing only "-core:ncores 2" does not mean that the two cores share an L2 cache; "-il2:share 2" must additionally be provided (as shown in the example above). Some examples of using these two arguments:

-core:ncores 2 , -il2:share 1 : 2 cores with private L2 caches

-core:ncores 2 , -il2:share 2 : 2 cores with shared L2 cache

-core:ncores 4 , -il2:share 2 : 4 cores with a group of two cores sharing an L2 cache

-bus:bustype

The default value is -1, which means a perfect shared bus that introduces zero bus delay for any bus transaction. If the value is 0 (as in the example), it models a round-robin TDMA bus where a fixed-length bus slot is assigned to each available core.

-bus:slotlength

Only relevant if -bus:bustype is set to 0. Represents the bus slot length assigned to each core in the round-robin bus schedule.
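
As a rough illustrative calculation (not taken from the tool documentation): with -core:ncores 2 and -bus:slotlength 50, the TDMA schedule repeats every 2 x 50 = 100 cycles, so a bus request issued by a core just after its own slot has ended may wait up to about 50 cycles (the other core's slot) before it can be served; the shared bus analysis has to account for such worst-case waiting delays.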

Example processor configurations:

Numerous processor configuration examples are provided in the directories chronos-multi-core/processor_config/exp_*. These can be used to run the example programs.

task_config:

task_config represents the task names and their mapping to different cores. This file is viewed by the analyzer front end as a task graph. The abstract syntax of any <task_config> is as follows:

<Total number of tasks>

<Path to task 1>

<priority of task 1> <assigned core to task 1> <MSC id to task 1> <list of successor task ids of task 1 according to the partial order>

<Path to task 2>

<priority of task 2> <assigned core to task 2> <MSC id to task 2> <list of successor task ids of task 2 according to the partial order>

.........(for all the tasks)

Therefore, each pair of lines in the task_config file represents a task name and its parameters (priority, assigned core etc.). Although the front end supports reading the file in a task graph format, there is currently no support for scheduling the task graph; we currently derive the WCETs of interfering tasks on different cores. Therefore, for the current release of the tool, only the following parameters are important:

  1. <Total number of tasks> : Total number of tasks running on different cores

  2. <path to task i> : Relative or full path name pointing to the simplescalar PISA binary of the task.

  3. <assigned core to task i> : The assigned core number to the task. 

The following is an example <task_config> file:

2

../benchmarks/cnt/cnt

0 0 0 0

../benchmarks/jfdcint/jfdcint

0 1 1 0


The above task_config file states that we are running "../benchmarks/cnt/cnt" on core 0 and "../benchmarks/jfdcint/jfdcint" on core 1 concurrently.

Example task configurations:

Numerous task configuration examples are provided in the directories chronos-multi-core/task_config/exp_*. These can be used to run the example programs. Nevertheless, it should be verified that the path names of the benchmarks/tasks are specified correctly in the task_config file.

(Note: the MSC id is the message sequence chart id. It is retained mainly for compatibility reasons and may be removed later; it can be set to the same value for every task (say 0). The list of successor task ids is a list of integers representing the identities of the successor tasks.)


Running the simulator


From the directory "CMP_SIM/simplesim-3.0", you can run the run the multi-core simulator using the following command: 

cmd> ./sim-outorder -sconfig <processor_config> <arg_file 1> <arg_file 2> .....

The above command simulates the task provided by <arg_file 1> on core 0, the task provided by <arg_file 2> on core 1, and so on, with the shared configuration given through <processor_config> (explained below in more detail). Here each <arg_file i> describes the task running on a dedicated core and the configuration of the corresponding core.

A typical arg file looks like this:

../../benchmarks/cnt/cnt > sim-test/exp_cache/cnt.out

-issue:inorder true

-issue:wrongpath false

-bpred 2lev

-ruu:size 8

-fetch:ifqsize 4

-cache:il1 il1:16:32:2:l

-cache:il2 il2:32:32:4:l

-cache:il2lat 6

-mem:lat 30 2

-decode:width 1

-issue:width 1

-commit:width 1

The first line of the above arg file provides the benchmark running on the core (i.e. ../../benchmarks/cnt/cnt). The benchmark name must point to the simplescalar PISA binary of the corresponding benchmark (check that path). The detailed output of the simulation is redirected to the file "sim-test/exp_cache/cnt.out". The simulation also prints the output on standard output. Note that to compare the result with the WCET analyzer, the comparison must be made with the "effective" cycles, as the effective cycles ignore the time spent in libraries (as does the WCET analyzer).


The rest of the lines in the arg file specify the micro-architectural configuration of the core, using exactly the same format as the single-core simplescalar release. The <processor_config> file (given with the -sconfig option) provides the configuration shared among the cores. The following is an example:

    -core:ncores 2         # total number of cores
    -bus:bustype 0         # TDMA round robin bus
    -bus:slotlength 50     # bus slot length assigned to each core
    -il2:share 2           # number of cores sharing an instruction L2 cache
    -dl2:share 2           # number of cores sharing a data L2 cache


Note that the above parameters have the same interpretation as in the multi-core WCET analyzer.

Example simulations:

Numerous examples are provided in the directories CMP_SIM/simplesim-3.0/sim-test/exp_*. Check the "run_one" script in these directories for sample simulation commands.


Contact


Please contact Sudipta Chattopadhyay (sudiptaonline@gmail.com) for any feedback, comments or difficulties in running the tools.


Experimental results


Default processor configuration

Component        | Default Setting                                                          | Perfect Setting
Cores            | 2                                                                        | NA
Pipeline         | 1-way, inorder, 4-entry instruction fetch queue, 8-entry reorder buffer | NA
L1 cache         | 2-way associative, 1 KB, miss penalty = 6 cycles                         | All accesses are L1 hits
L2 cache         | 4-way associative, 4 KB, miss penalty = 30 cycles                        | NA
Shared bus       | slot length = 50 cycles                                                  | Zero bus delay
Branch predictor | 2-level predictor, L1 size = 1, L2 size = 4, history size = 2            | Branch prediction is always correct

Task configuration

Core 1   | Core 2
matmult  | jfdctint
cnt      | jfdctint
fir      | jfdctint
fdct     | jfdctint
expint   | jfdctint
nsichneu | jfdctint
edn      | statemate
ludcmp   | statemate
ns       | statemate
bsort100 | statemate
adpcm    | statemate
st       | statemate

Experimental Data

To reproduce the experimental results, please follow this README.