GoFetch

GoFetch: Breaking Constant-Time Cryptographic Implementations Using Data Memory-Dependent Prefetchers -- USENIX Security'24

Generate Convert Improve

Install / Use

/learn @FPSG-UIUC/GoFetch

About this skill

Quality Score

0/100

README

GoFetch

This repository is the open-source code for our USENIX Security 2024 paper: GoFetch: Breaking Constant-Time Cryptographic Implementations Using Data Memory-Dependent Prefetchers. Please check our website for more information.

Introduction

GoFetch is a microarchitectural side-channel attack that can extract secret keys from constant-time cryptographic implementations via data memory-dependent prefetchers (DMPs).

We show that DMPs are present in many Apple CPUs and pose a real threat to multiple cryptographic implementations, allowing us to extract keys from OpenSSL Diffie-Hellman, Go RSA, as well as CRYSTALS Kyber and Dilithium.

Source Code Overview

This repository comprises the following artifacts:

.
|-- linuxsetup: kernel module to set up performance counter access in Asahi Linux.
|-- re: reverse engineering experiments.
|-- poc: proof-of-concept attacks.

Reverse Engineering DMPs

Set up Environment

The high-resolution timing source used for reverse engineering experiments comes from performance counters. To configure and access the performance counter, in macOS, we load the dynamic library named kperf. In Asahi Linux, extra actions should be taken before go to the init.sh.

M2 and M3 have different L2 cache configurations. For M2 and M3 machines, go to src/lib/basics/arch.h and uncomment the M2 macro.

// uncomment below for M2 and M3
// #define M2 0

Run init.sh script to set up the timer and use it to profile the latency of accessing data from L1, L2, DRAM (also the overhead of the measurement). You need to re-do the timer set up every time you reboot the machine.

Set up Asahi Linux Performance Counters

To enable userspace access to performance counters, we need to load a kernel module and set a system register (SYS_APL_PMCR0_EL1 [30]).

Build and load kernel module

cd linuxsetup
cd kmod
make
sudo insmod kmod.ko

Enable userspace access

# Install Rust required
cd pmuctl
cargo b -r
sudo ./target/release/pmuctl

DIT Bit Test (Quick Test)

Run quick_check.out to check whether setting the data independent timing bit (DIT) turns off the DMP. Argument set_dit is used to set (set_dit=1) or unset (set_dit=0) the DIT bit.

sudo ./src/quick_check.out <set_dit>

Examine Access Patterns

Reproduce the Augury access pattern: Augury access pattern is the access pattern described in the prior work. Basically, when the program loads and dereferences each entry of an array of pointers (aop), *aop[0] ... *aop[N-1], the DMP will set out to prefetch *aop[N] ... *aop[M-1]. Run ./aopstream.sh 264 256 <trial> to create a 264 entries aop, perform access+dereference to the first 256 entries, and see the DMP dereferences pointers in the next 8 entries.

Avoiding architectural pointer dereferencing: We take off the dereference operations in the program and find that the traversal access pattern aop[0] ... aop[N-1] can trigger the stride (or stream) prefetcher to bring aop[N] ... aop[M-1] and the DMP dereferences pointers brought in by the stride prefetcher. Run ./noderef.sh 264 8 256 256 <trial> to create a 264 entries aop, fill the first 256 entries with dummy value and the next 8 entries with pointers. Traversing first 256 entries results in DMP derefercing the next 8 pointers.

In-bounds DMP dereferencing: This time we put the pointers to entries the program is going to traverse. Run ./noderef.sh 8 8 0 8 <trial> to create an 8 entries aop filled with pointers. Traversing these 8 entries results in DMP dereferencing corresponding 8 pointers.

One load: Run ./noderef.sh 8 8 0 1 <trial>. The same aop configuration as above, but the program only touches the first entry. We find that all 8 pointers are still dereferenced by the DMP, when they are in the same L1 cache (the aop is L1 line aligned).

DMP Activation Criteria

History filter: Run ./history.sh <trial> to examine how many DMP dereferences is required to let the DMP dereference the same pointer again.

L1 and L2 cache fills: Run ./dmpline.sh <trial>. Create an aop fitting a L2 cache line (128 bytes) with 16 pointers. First 8 entries locate in the lower L1 cache line (64 bytes), while the next 8 entries corresponds to the upper L1 cache line (64 bytes). If we only touch one of L1 lines, then only the touched line will be cached in L1. Meanwhile, only pointers in the touched L1 line are dereferenced by the DMP. This shows that the DMP monitors L1 cache fills.

Do-not-scan hint: Run ./l2l1fetch.sh <trial>. We load a pointer from the L2 cache and differentiate where this pointer comes from. We find that if the pointer comes from L1 cache (L1->L2) then it will not be dereferenced, if it comes from DRAM (DRAM->L2), it will be dereferenced. The reason is that the pointer comes from L1 is likely to be inspected before, and the DMP tries to avoid redundant inspection.

Restrictions on Dereferenced Pointers

4GByte prefetch region: Run ./addrsweep.sh 0x10000000 16 0x380000000 <trial>. Create a big buffer across the 4GByte boundary. Here, we start the buffer from 0x380000000 and select 16 candidate addresses with a fixed stride 0x10000000. Every time we pick two addresses from the 16 candidates, one is selected as the pointer value, the other one is the location we put the selected pointer value. We find that these two should be in the same 4GByte region to activate the DMP.

Top byte ignore: Run ./tbi.sh <trial>. Select a pointer as the test pointer, flip one of upper bits of it and then place it in the aop. We trigger the DMP on the flipped pointer to see if the DMP can still dereference the original pointer. We find that the DMP ignore the bit flip of the upper 8 bits.

Auxiliary next-line prefetch: Run nextline.sh <trial>. We trigger the DMP on a specific pointer and find that not only the cache line pointed by the pointer is prefetched by the DMP, but the next line is also prefetched.

Constant-Time Cryptography PoCs

From the understanding of how the DMP works, we develop a new type of chosen-input attack framework, where the attacker engineers a secret-dependent pointer value in victim's memory (by giving the victim a chosen input to process) and exploit the DMP to deduce the secret from it. To this end, the attacker has to inspect the victim program:

Spot the mixture of secret data and chosen input. Since the DMP only leaks pointer values, the attacker should have enough control over the mixture, such that the mixture could be a valid pointer value depending on the secret data.
To confirm the DMP scanning over the mixture, one eviction set is required to evict the mixture (later on the victim reloads it from the memory) and the attacker needs the other eviction set to monitor whether the DMP dereferences the secret-dependent pointer (Prime+Probe channel). To get both of them,
- The page offset of mixture address should be stable across runs and the space of eviction set candidates can shrink.
- Cold, valid pages locating in the same 4GByte region as the mixture should be stable across runs. And the attacker will search the DMP prefetch target (pointed by the secret-dependent pointer) within these pages.
- Having pre-knowledge in hand (page offset of mixture's address, address range to pick as secret-dependent pointer), the attacker has to force the mixture as the pointer (secret-independent) and try different combination of eviction sets until detect the DMP signal.

To show the idea, we build end-to-end key extraction PoCs for constant-time cryptographic implementations on Apple M1 machines running macOS.

Constant-Time Conditional Swap

Constant-time conditional swap performs a swap between two arrays, a and b, depending on the secret s (s=1 swap, s=0 no swap). The attacker places the pointer in one of arrays (a), and target the other one (b) to see if the content of it triggers the DMP dereferencing the pointer, in other words, whether the swap happens.

Determine the address of dyld cache: the location of the dyld shared library is randomized for each boot time, so we need to run dyld_search every time reboot the machine.

cd crypto_attacker
cargo b -r --example dyld_search
./target/release/examples/dyld_search

Run experiments: Run the attacker and victim separately. The victim has one input argument, <secret>, it could be set as 1 or 0. The attacker has three arguments. In our example, we configure the attacker measures 32 samples for each chosen input, set the threshold of the Prime+Probe channel as 680 ticks and group 8 standard eviction sets as the eviction set for the mixture.

# attacker
cd crypto_attacker
cargo b -r --example ctswap_attacker
./target/release/examples/ctswap_attacker 32 680 8
# victim
cd crypto_victim
cargo b -r --example ctswap_victim
./target/release/examples/ctswap_victim <secret>

Analyze leakage result: The attacker records the time consumption of each stage in ctswap_bench.txt and Prime+Probe latency samples depending on victim's secret in ctswap.txt. Four real cryptographic examples below also record above benchmarking results.

Go's RSA Encryption

Go's RSA implementation adopts Chinese Remainder Theorem to accelerate the decryption. One necessary step in the decryption procedure is a modular operation between chosen cipher c and secret prime p (or q), c mod p. By placing a pointer value in c, the mixture c mod p contains a secret-dependent pointer value.

Run experiments: Run the attacker and victim separately. Apart from the same three arguments as in constant-time conditional swap, the attacker has fou

Related Skills

node-connect

348.5k

Diagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps

frontend-design

109.1k

Create distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.

openai-whisper-api

348.5k

Transcribe audio via OpenAI Audio Transcriptions API (Whisper).

qqbot-media

348.5k

QQBot 富媒体收发能力。使用 <qqmedia> 标签，系统根据文件扩展名自动识别类型（图片/语音/视频/文件）。