



# **MDPeek:** Breaking Balanced Branches in SGX with Memory Disambiguation Unit Side Channels

Chang Liu, Shuaihu Feng, Yuan Li, Dongsheng Wang, Wenjian He, Yongqiang Lyu and Trevor E. Carlson









- [1] Sangho Lee, Ming-Wei Shih, Prasun Gera, Taesoo Kim, Hyesoon Kim, and Marcus Peinado. Inferring Fine-grained Control Flow Inside SGX Enclaves with Branch Shadowing. In USENIX Security Symposium (USENIX Security), pages 557–574, 2017.
- [2] Jiyong Yu, Trent Jaeger, and Christopher W. Fletcher. All Your PC Are Belong to Us: Exploiting Non-control-Transfer Instruction BTB Updates for Dynamic PC Extraction. In Proceedings of the Annual International Symposium on Computer Architecture (ISCA), pages 1–14, 2023.
- [3] Hosein Yavarzadeh, Archit Agarwal, Max Christman, Christina Garman, Daniel Genkin, Andrew Kwong, Daniel Moghimi, Deian Stefan, Kazem Taram, and Dean Tullsen. Pathfinder: High-Resolution Control-Flow Attacks Exploiting the Conditional Branch Predictor. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 770–784, 2024.
- [4] Daniel Moghimi, Jo Van Bulck, Nadia Heninger, Frank Piessens, and Berk Sunar. CopyCat: Controlled Instruction-Level Attacks on Enclaves. In USENIX Security Symposium (USENIX Security), pages 469–486, 2020.
- [5] Jo Van Bulck, Frank Piessens, and Raoul Strackx. Nemesis: Studying Microarchitectural Timing Leaks in Rudimentary CPU Interrupt Logic. In Proceedings of the Conference on Computer and Communications Security (CCS), pages 178–195, 2018.
- [6] Alejandro Cabrera Aldaya, Billy Bob Brumley, Sohaib ul Hassan, Cesar Pereida García, and Nicola Tuveri. Port Contention for Fun and Profit. In Symposium on Security and Privacy (SP), pages 870–887, 2019.
- [7] Ahmad Moghimi, Gorka Irazoqui, and Thomas Eisenbarth. CacheZoom: How SGX Amplifies the Power of Cache Attacks. In International Conference on Cryptographic Hardware and Embedded Systems (CHES), pages 69–90, 2017.
- [8] Ivan Puddu, Moritz Schneider, Miro Haller, and Srdjan Čapkun. Frontal Attack: Leaking Control-Flow in SGX via the CPU Frontend. In USENIX Security Symposium (USENIX Security), pages 663–680, 2021.
- [9] Yun Chen, Lingfeng Pei, and Trevor E. Carlson. AfterImage: Leaking Control Flow Data and Tracking Load Operations via the Hardware Prefetcher. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 16–32, 2023.

#### **Recent Control Flow Attacks**

- **Effects on Branch Predictor:** Branch Shadowing [1], NightVision [2], Pathfinder [3]
- **Instruction Count or Type:** CopyCat [4], Nemesis [5], PortSmash [6]
- **Effects on other HW Units:** CacheZoom [7], Frontal Attack [8], AfterImage [9]



Page Fault Side Channel



Symposium on Security and Privacy (SP), pages 640–656, 2015.






Single-step Execution Timing Side Channel

Jo Van Bulck, Frank Piessens, and Raoul Strackx. Nemesis: Studying Microarchitectural Timing Leaks in Rudimentary CPU Interrupt Logic. In Proceedings of the Conference on Computer and Communications Security (CCS), pages 178–195, 2018.






Daniel Moghimi, Jo Van Bulck, Nadia Heninger, Frank Piessens, and Berk Sunar. CopyCat: Controlled Instruction-Level Attacks on Enclaves. In USENIX Security Symposium (USENIX Security), pages 469–486, 2020.






Ivan Puddu, Moritz Schneider, Miro Haller, and Srdjan Čapkun. Frontal Attack: Leaking Control-Flow in SGX via the CPU Frontend. In USENIX Security Symposium (USENIX Security), pages 663–680, 2021.






Hosein Yavarzadeh, Archit Agarwal, Max Christman, Christina Garman, Daniel Genkin, Andrew Kwong, Daniel Moghimi, Deian Stefan, Kazem Taram, and Dean Tullsen. Pathfinder: High-Resolution Control-Flow Attacks Exploiting the Conditional Branch Predictor. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 770–784, 2024.









- Systematic MDU Characterization
  - Update condition
  - Interaction with cache, TLB and ROB
  - Multiple stores and loads
  - NOT isolated between the normal and secure world
- Vulnerable Loads Identification in Real-world Applications
  - Modeling address generation delay
  - Measuring Delay Capacity

#### MDPeek: End-to-end Attacks with MDU Side Channel

- Attacking Libjpeg
- Attacking RSA Generation (MbedTLS and WolfSSL)
- Defenses against MDPeek
  - Naive defenses
  - Store-to-load Coupling



### **Background: Memory Disambiguation Unit**

#### Unresolved Data Dependence

| Address | Instruction | Step |
|---------|-------------|------|
|         | `movq (%rdx), %rdi` | (1) Fetch address from [%rdx] |
| 0x0e    | `movq $0, (%rdi)`   | (2) Store 0 to this address |
| 0x15    | `movl (%rsi), %eax` | (3) Load a value from [%rsi] |
|         | `addl %eax, %ebx`   | (4) Use the loaded value |
|         | `addl %eax, %edx`   | (5) Use the loaded value |
|         | `addl %ebx, %ecx`   | (6) Use the loaded value after calculation |

Before %rdi is ready:

- **Delayed store:** (2) cannot execute because its address is not yet known
- **Ready load:** (3) already has its address
- Should the load execute out of order, or wait and execute in order?






The MDU counter is selected by the least significant 8 bits of the load PC.
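The indexing rule can be sketched in a few lines of Python. This is only an illustration of the slide's statement that the low 8 bits of the load PC pick the counter; the addresses are hypothetical, and the sketch assumes no extra hashing of the PC.

```python
MDU_INDEX_BITS = 8  # from the slide: the low 8 bits of the load PC select the counter


def mdu_index(load_pc: int) -> int:
    """Return the MDU counter index for a load located at `load_pc`."""
    return load_pc & ((1 << MDU_INDEX_BITS) - 1)


def aliases(pc_a: int, pc_b: int) -> bool:
    """Two loads share an MDU counter when their low 8 PC bits match."""
    return mdu_index(pc_a) == mdu_index(pc_b)


# Hypothetical addresses, chosen only for illustration.
victim_load_pc = 0x401215
attacker_load_pc = 0x7F0000000015

print(hex(mdu_index(victim_load_pc)))             # 0x15
print(aliases(victim_load_pc, attacker_load_pc))  # True
```

Because only 8 bits are used, a load in a different address space can be placed so that it maps to the same counter as a victim load.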


### Main Idea: MDU Side Channel



Intel: Memory Disambiguation Unit






#### Method

- Microbenchmark
- Transient Execution
- Performance Monitor Counter



#### Workflow of Characterization




#### An Example of Test Case

Effects of ROB on MDU

| Testcase | Role |
|----------|------|
| `mdu_update_dispatch:` | |
| `mov $1, %rcx` | |
| `mov %rdi, %rax` | |
| `clflush (%rdx)` | |
| `mfence` | |
| `lfence` | |
| `movq (%rdx), %rdx` | Delay the instruction in the ROB head |
| `.rep NUM_NOP` | Adjust the number of nop to control the layout of ROB |
| `nop` | |
| `.endr` | |
| `mov $0, %rdx` | |
| `div %rcx` | |
| `mov %rax, %rdi` | |
| `movq %rdi, (%rdi)` | Store with delayed address generation |
| `.rep 60` | |
| `nop` | |
| `.endr` | |
| `movq (%rsi), %rsi` | Load to be tested |
| `lfence` | |
| `ret` | |


**Insight:** The MDU can update only when both the delayed store and the load are in the ROB.



- Results
  - ROB size: 224



#### Insights on Update Condition


- Store is allocated in the ROB earlier than load
- Unresolved dependence is necessary
- Address of the store is generated slower than load
- Physical address of the load is ready
- Both the addresses of the store and load are valid, or the page offset of store and load $\geq$ 4 bytes
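The list above reads as a conjunction, which the following Python sketch makes explicit. The class and field names are hypothetical, and the fifth condition is treated as an opaque boolean because the slide does not spell out how it is evaluated.

```python
from dataclasses import dataclass


@dataclass
class StoreLoadPair:
    store_allocated_before_load: bool  # store enters the ROB earlier than the load
    dependence_unresolved: bool        # the store-load dependence is still unresolved
    store_address_slower: bool         # store address is generated later than the load's
    load_physical_address_ready: bool  # the load's physical address is available
    address_condition_holds: bool      # fifth condition on the slide, taken as given


def mdu_may_update(pair: StoreLoadPair) -> bool:
    """Return True only if every listed update condition holds."""
    return all(vars(pair).values())


pair = StoreLoadPair(True, True, True, True, True)
print(mdu_may_update(pair))
```

Flipping any single field to `False` makes the predicate fail, which mirrors how the characterization varies one factor per test case.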








## **Modeling Vulnerable Codes**

#### Instruction Model

store [rd], load [rs], op@rd, op^rd, op^rs

## Distance Model

- Distance: number of instructions
- Delay capacity

## Update Condition

Def distance + LS distance < Delay capacity
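As a rough illustration, the update condition can be applied as a filter over candidate store-load pairs. The sketch below is an assumption-laden reading of the slide: "Def distance" is taken to be the instruction count from the definition of the store's address register to the store, "LS distance" the count from the store to the load, and all names and numbers are hypothetical.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    def_distance: int  # assumed: instructions from the address definition to the store
    ls_distance: int   # assumed: instructions from the store to the following load


def may_update_mdu(c: Candidate, delay_capacity: int) -> bool:
    """Apply the slide's condition: Def distance + LS distance < Delay capacity."""
    return c.def_distance + c.ls_distance < delay_capacity


# Hypothetical candidates and a hypothetical delay capacity.
candidates = [Candidate("pair_a", 2, 5), Candidate("pair_b", 30, 40)]
DELAY_CAPACITY = 60

vulnerable = [c.name for c in candidates if may_update_mdu(c, DELAY_CAPACITY)]
print(vulnerable)  # ['pair_a']
```

The comparison is strict, so a pair whose distances sum exactly to the delay capacity is not flagged.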




## Precomputed Delay Capacity

- Input: uops.info
- Distance computing: using nop instructions
- Cache state of the load
- Instruction chains

#### **Delay Capacity Experiments on Loads**

[Figure: update rate (0.0 to 1.0) versus Def distance (0 to 220) for a load (cache hit), a load (cache miss), and 2 loads (cache hit).]

#### **Delay Capacity Experiments on Some Arithmetic Instructions**

Andreas Abel and Jan Reineke. uops.info: Characterizing Latency, Throughput, and Port Usage of Instructions on Intel Microarchitectures. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 673–686, 2019.

## Implementation

- LLVM v11.0.0



## An Example

WolfSSL (v5.7.2)

```
// mp_invmod_slow: secret-dependent branch
// if u >= v
if (mp_cmp(&u, &v) != MP_LT) {
    // u = u - v
    mp_sub(&u, &v, &u);
}
else {
    // v = v - u
    mp_sub(&v, &u, &v);
}
```
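To see why this branch leaks, the following Python sketch runs a simplified binary Euclid loop and logs which side of the `u >= v` comparison is taken. It is not WolfSSL's code: it keeps only the `u` and `v` updates, omits the coefficient bookkeeping of a real modular inverse, and the function name is hypothetical.

```python
def subtract_branch_trace(x: int, y: int) -> list[str]:
    """Log each `u >= v` decision of a simplified binary Euclid loop."""
    if x <= 0 or y <= 0 or (x % 2 == 0 and y % 2 == 0):
        raise ValueError("expects positive inputs that are not both even")
    u, v = x, y
    trace = []
    while u != 0:
        while u % 2 == 0:  # strip factors of two from u
            u //= 2
        while v % 2 == 0:  # strip factors of two from v
            v //= 2
        if u >= v:
            trace.append("u-=v")  # corresponds to mp_sub(&u, &v, &u)
            u -= v
        else:
            trace.append("v-=u")  # corresponds to mp_sub(&v, &u, &v)
            v -= u
    return trace


print(subtract_branch_trace(3, 10))
print(subtract_branch_trace(3, 8))
```

The two calls produce different branch sequences for the same first operand, so an observer who learns the sequence gains information about the second, secret operand.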









## **Attack Framework and Primitives**

#### Attack Framework

- Synchronization: page fault (with SGX-Step, page level)
- Control flow Leakage: MDU counter update (byte level)
- Reset: aliased store-load pairs
- Prime: non-aliased store-load pairs
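The reset, prime, and leak steps can be mimicked with a toy counter model. Everything below is a sketch under stated assumptions: each counter is assumed to be a saturating counter that is cleared by an aliased store-load pair and incremented by a non-aliased one, the saturation value of 15 and the load addresses are hypothetical, and the counter is read directly, whereas a real attacker would have to infer its state indirectly.

```python
class MduCounter:
    """Toy model of one MDU counter (assumed saturating, cleared on aliasing)."""

    def __init__(self, saturation: int = 15):  # saturation value is an assumption
        self.saturation = saturation
        self.value = 0

    def aliased_pair(self) -> None:
        self.value = 0  # Reset primitive

    def non_aliased_pair(self) -> None:
        self.value = min(self.value + 1, self.saturation)  # Prime primitive

    def is_saturated(self) -> bool:
        return self.value == self.saturation


def index(pc: int) -> int:
    return pc & 0xFF  # low 8 bits of the load PC


def attack_round(victim_load_pcs: dict[str, int], taken: str) -> list[str]:
    """Return the branches whose counter changed after one victim step."""
    table = [MduCounter() for _ in range(256)]
    for pc in victim_load_pcs.values():
        counter = table[index(pc)]
        counter.aliased_pair()                   # Reset
        for _ in range(counter.saturation - 1):  # Prime to one step below saturation
            counter.non_aliased_pair()
    # Victim executes the load on the taken path with a delayed, non-aliased store.
    table[index(victim_load_pcs[taken])].non_aliased_pair()
    return [b for b, pc in victim_load_pcs.items() if table[index(pc)].is_saturated()]


# Hypothetical load addresses on the two sides of a balanced branch.
pcs = {"if": 0x40112A, "else": 0x40117C}
print(attack_round(pcs, "if"))    # ['if']
print(attack_round(pcs, "else"))  # ['else']
```

The model only distinguishes the two paths because their loads map to different counter indices; loads sharing the same low 8 PC bits would be indistinguishable.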






## **Attacking Libjpeg**

#### Observation

- IDCT function iterates 8 times for each 8×8 pixel block
- Different pixel layout results in different control flow
- After scaling the image 12×, only 16 layouts are possible



#### jidcint.c: jpeg\_idct\_islow

- Pass 1: process columns from input, store into work array. Note results are scaled up by sqrt(8) compared to a true IDCT; furthermore, the results are scaled by 2\*\*PASS1\_BITS. Loop: `for (ctr = DCTSIZE; ctr > 0; ctr--) {`
- Pass 2: process rows from work array, store into output array. Note that the results must be descaled by a factor of 8 == 2\*\*3, and the PASS1\_BITS scaling undone. Loop: `for (ctr = 0; ctr < DCTSIZE; ctr++) {`

[Figure labels: Input Image, 8 pixels × 8 pixels, Scan, idct slow, Branch-1, Branch-2, page 12, page 13.]


### Method

- Synchronize with page faults
- Measure the branch taken through MDU counters
- Recover leaked pixels with pre-computed patterns





## **Attacking RSA Key Generation**

#### Observation

- Modular inversion (invmod) is used during RSA key generation
- Secrets (p, q or lcm(p-1,q-1)) serve as parameters of invmod
- Secret-dependent branches exist in the invmod function



#### WolfSSL v5.7.2

    int wc_MakeRsaKey(RsaKey* key, int size, long e, WC_RNG* rng)
    {
        ...
        if (err == MP_OKAY) /* key->d = 1/e mod lcm(p-1, q-1) */
            err = mp_invmod(&key->e, tmp3, &key->d);
    }

#### MbedTLS v3.6.1

    int mbedtls_rsa_deduce_crt(const mbedtls_mpi *P, const mbedtls_mpi *Q,
                               const mbedtls_mpi *D, mbedtls_mpi *DP,
                               mbedtls_mpi *DQ, mbedtls_mpi *QP)
    {
        ...
        if (QP != NULL) {
            /* QP = Q^{-1} mod P */
            MBEDTLS_MPI_CHK(mbedtls_mpi_inv_mod(QP, Q, P));
        }
    }

#### invmod function


#### Evaluation

- 1000 attacks on 2048-bit key
- MbedTLS: 830 ms for a single trace, with success rate exceeding 97%
- WolfSSL: 880 ms for a single trace, with success rate exceeding 95%








## **Defenses: Serialization and Alignment**

### Serialization

- Insight: MDU is enabled only when both a delayed store and load are allocated in the **ROB**
- Method: Insert an `lfence` instruction between a potential delayed store and the following loads
- Performance Overhead: ~140%




### Alignment

- Insight: The MDU counter is selected by the lowest 8 bits of the load PC
- Method: Align the load PC to 256 bytes by inserting nop instructions
- Performance Overhead: ~160%
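The padding needed for the alignment defense is simple modular arithmetic. The sketch below assumes single-byte `nop` instructions and uses a hypothetical load address; it only shows how many padding bytes move a load to the next 256-byte boundary.

```python
ALIGNMENT = 256  # from the slide: align the load PC to 256 bytes


def nop_padding(load_pc: int, alignment: int = ALIGNMENT) -> int:
    """Number of single-byte nops that move `load_pc` to the next aligned address."""
    return -load_pc % alignment


load_pc = 0x401215  # hypothetical address of a sensitive load
pad = nop_padding(load_pc)
print(pad)                   # 235
print(hex(load_pc + pad))    # 0x401300
```

After padding, the low 8 bits of every protected load PC are zero, so loads on both sides of a branch select the same counter. The amount of inserted padding grows with the number of protected loads, which is in line with the high overhead reported above.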



## **Defenses: Store-to-load Coupling**

#### Insight

- Unresolved data dependence is necessary to update the MDU
- Making the dependence explicit to the CPU
- Adding deterministic dependence between the store and load addresses




#### Evaluation

Performance overhead of each defense:

- Serialization: ~140%
- Alignment: ~160%
- Store-to-load Coupling: ~20%



## Conclusion

#### Systematic MDU Characterization

- Update condition
- Interaction with cache, TLB and ROB
- Multiple stores and loads
- NOT isolated between the normal and secure world

#### Vulnerable Loads Identification in Real-world Applications

- Modeling address generation delay
- Measuring Delay Capacity

#### MDPeek: End-to-end Attacks with MDU Side Channel

- Attacking Libjpeg
- Attacking RSA Generation (MbedTLS and WolfSSL)
- Defenses against MDPeek
  - Naive defenses
  - Store-to-load Coupling












