How Hard Is Weak-Memory Testing?

SOHAM CHAKRABORTY, TU Delft, Netherlands
SHANKARA NARAYANAN KRISHNA, IIT Bombay, India
UMANG MATHUR, National University of Singapore, Singapore and IIT Bombay, India
ANDREAS PAVLOGIANNIS, Aarhus University, Denmark

Weak-memory models are standard formal specifications of concurrency across hardware, programming languages, and distributed systems. A fundamental computational problem is consistency testing: is the observed execution of a concurrent program in alignment with the specification of the underlying system? The problem has been studied extensively across Sequential Consistency (SC) and weak memory, and proven to be NP-complete when some aspect of the input (e.g., number of threads/memory locations) is unbounded. This unboundedness has left a natural question open: are there efficient parameterized algorithms for testing?

The main contribution of this paper is a deep hardness result for consistency testing under many popular weak-memory models: the problem remains NP-complete even in its bounded setting, where candidate executions contain a bounded number of threads, memory locations, and values. This hardness spreads across several Release-Acquire variants of C11, a popular variant of its Relaxed fragment, popular Causal Consistency models, and the POWER architecture. To our knowledge, this is the first result that fully exposes the hardness of weak-memory testing and proves that the problem admits no parameterization under standard input parameters. It also yields a computational separation of these models from SC, x86-TSO, PSO, and Relaxed, for which bounded consistency testing is either known (for SC), or shown here (for the rest), to be in polynomial time.

CCS Concepts: • Software and its engineering → Software verification and validation; • Theory of computation → Theory and algorithms for application domains; Program analysis.

Additional Key Words and Phrases: concurrency, consistency checking, weak memory models, complexity

## 1 INTRODUCTION

Memory-consistency models play a crucial role in the design, use, and verification of concurrent systems spanning hardware, programming languages, and distributed computing. These models formally define the set of behaviors that the system can exhibit as a whole, accounting for the intricate communication patterns between its entities due to buffers, caching, message delays, etc. The simplest and most widespread, general model is Sequential Consistency (SC) [Lamport 1978], which defines program behavior by thread interleaving. Although its simplicity is a major advantage, SC fails to capture the additional, complex behaviors that are abundant in modern concurrency.

In contrast, weak-memory models are richer and more faithful specifications of concurrent/distributed communication and are developed specifically for the system under consideration. For example, the x86, POWER, and Arm architectures follow their own memory models [Alglave et al. 2021, 2014; Owens 2010], programming languages provide certain primitives for writing weak-memory concurrent programs [Batty et al. 2013], and distributed systems implement various models of causal consistency [Bouajjani et al. 2017; Burckhardt 2014; Hutto and Ahamad 1990]. Naturally, verification techniques are specific to the memory model at hand, so as to account for (and verify) all the possible behaviors that the system can exhibit according to the model.

One of the core computational problems associated with a memory model is that of consistency testing: is a high-level, observed behavior of a program in alignment with the semantics of the underlying model [Gibbons and Korach 1997]? The observed behavior is specified in terms of an abstract execution that defines the sequence of instructions each process/thread executed, the shared memory locations it read from/wrote to, along with the respective values read/written. Answering this question requires determining the low-level, unobserved behavior of the architecture that gave rise to the observed behavior of the program; for example, the order in which writes were made visible to (one or more of) the threads, and the dataflow between writes and reads.

Consistency testing is a natural task in both the development and the implementation of memory models. In particular, memory models are contracts between the designers of a system and its users [Adve and Hill 1990]. When designing hardware architectures, memory subsystems, compiler optimizations, and distributed-communication protocols, consistency-testing serves to validate that the contract has been respected [Chen et al. 2009; Gibbons and Korach 1997; Manovit and Hangal 2006; Qadeer 2003; Windsor et al. 2022]. From the opposite direction, litmus testing is a standard approach to understanding the semantics of hardware architectures [Alglave et al. 2011, 2014] so as to design faithful models around them. Here, given a candidate memory model and the observed execution of a litmus test, consistency checking verifies whether the execution is a counterexample to the model. Finally, consistency testing is also used as a separability criterion between different memory models [Kokologiannakis et al. 2023; Wickerson et al. 2017].

Consistency checks are also a widespread task in program verification and testing. In stateless model checking, the goal of the model checker is to enumerate-and-check the absence of errors in all program executions (typically up to some bound). To reduce the load of the verification task, an abstraction mechanism partitions the space of all behaviors into equivalence classes, each represented by an abstract execution. Instead of enumerating concrete executions, the model checker enumerates abstract executions, which yields an exponential reduction of the search space. Each candidate abstract execution undergoes a consistency check to ensure that the model checker does not diverge to unrealizable parts of the search space [Abdulla et al. 2023, 2019, 2018; Agarwal et al. 2021; Bui et al. 2021; Chalupa et al. 2017; Chatterjee et al. 2019; Kokologiannakis et al. 2022, 2019]. In runtime testing, predictive techniques aim to infer the presence of unobserved, erroneous executions from observed executions that are bug-free. Such techniques operate by constructing a candidate execution that manifests the bug (using the observed execution as a guide) and then applying a consistency check (explicitly or implicitly) to verify that the execution indeed represents valid program behavior (hence the bug report is a true positive) [Huang et al. 2014; Kalhauge and Palsberg 2018; Kini et al. 2017; Luo and Demsky 2021; Mathur et al. 2020, 2021; Pavlogiannis 2019].

Owing to its widespread applicability, the computational complexity of consistency testing has been studied thoroughly for a wide variety of memory models and settings. The seminal work of Gibbons and Korach [1997] showed that the problem is NP-complete for SC, even when either the number of threads or the number of memory locations is bounded (but not both). Later, Cantin et al. [2005] proved that the problem remains NP-complete even with a single memory location (but with unboundedly many threads), which also implies NP-completeness for all memory models that adhere to the “SC-per-location” property, such as TSO, PSO, RA, SRA and Relaxed. Gontmakher et al. [2003] showed a similar NP-completeness result for a Java memory model. Furbach et al. [2015] proposed a unified treatment of the consistency problem on many weak-memory models, which led to similar NP-completeness results, while the NP-completeness of consistency testing under various causal-consistency models was proven in [Bouajjani et al. 2017].

A popular approach to tackle the intractability of consistency testing is via parameterization: an intractable problem becomes tractable when some of its input parameters, such as the number of threads, are bounded [Abdulla et al. 2018; Gibbons and Korach 1994; Mathur et al. 2020]. On the other hand, all existing results that establish the NP-hardness of consistency testing in all memory models rely on some input parameters being unbounded. For SC, this unboundedness is, in fact, a prerequisite for intractability: the problem becomes polynomial-time when both the number of threads and memory locations are bounded, a result that has led to efficient parameterized model checking [Agarwal et al. 2021]. For weak memory, however, analogous results have thus far remained elusive. In particular, are there any efficient parameterized algorithms for consistency in weak memories? How hard, actually, is weak-memory testing?

Here we resolve this question by establishing a deep hardness result for many popular weak-memory models: consistency testing is NP-complete even in its bounded setting, where executions contain a constant number of threads, memory locations and values, and the size of the input is solely determined by the (unbounded) number of events. To our knowledge, this is the first result that fully exposes the hardness of weak-memory testing and proves that the problem admits no parameterization under standard input parameters. In turn, this implies that practical approaches to testing have to resort to heuristics, while model checkers might be more performant when exploring finer abstractions (such as reads-from [Abdulla et al. 2019; Chalupa et al. 2017; Tunç et al. 2023], or those based on execution graphs [Kokologiannakis et al. 2017; Lahav and Margalit 2019]).

Our contributions. We study the bounded consistency testing problem for many popular weak-memory models found in software, hardware, and distributed systems. The input is always an abstract execution \( \overline{X} = (E, po) \) consisting of a set of events \( E \) and a program order \( po \) defining the order of execution of these events in each thread. The task is to determine whether \( \overline{X} \) is consistent in a given memory model. The boundedness of the problem refers to the number of threads, memory locations, and values accessed by \( \overline{X} \) being bounded (i.e., constant). It is easy to see that the problem is in NP in all the models we consider, and we will not be establishing this fact formally. We write \( M_1 \preceq M_2 \) to denote that memory model \( M_2 \) is weaker than memory model \( M_1 \), i.e., any execution that is consistent in \( M_1 \) is also consistent in \( M_2 \).

Fig. 1. The complexity landscape of bounded weak-memory testing. An arrow \( M_1 \rightarrow M_2 \) means that \( M_1 \) is stronger than \( M_2 \). Thick arrows represent range-hardness for all models between the endpoints. The complexity of bounded consistency checking for all models except SC is established in this paper.

We begin with release-acquire semantics, as popularized by C11. We consider the Release-Acquire model (RA), as well as its Strong (SRA) and Weak variants (WRA) [Lahav and Boker 2022]. In addition, we consider Relaxed-Acyclic, the standard Relaxed semantics of C11 equipped with the common assumption of causal acyclicity (aka \((po \cup rf)\) acyclicity). This assumption is often used as an additional axiom [Margalit and Lahav 2021; Norris and Demsky 2013] as it has been argued that \((po \cup rf)\)-cycles do not arise in practice [Lee et al. 2023]. We prove the following theorem.

**Theorem 1.1.** Consistency testing for bounded inputs is \( NP \)-complete for Relaxed-Acyclic as well as for any memory model \( M \) such that \( SRA \preceq M \preceq WRA \), even in their atomic read-modify-write (RMW)-free fragment.

Note that Theorem 1.1 establishes hardness for Relaxed-Acyclic, WRA, RA, SRA as well as the whole range of models between SRA and WRA. This result improves existing results on consistency checking, for which hardness relied on an unbounded domain of threads and/or memory locations [Cantin et al. 2005; Furbach et al. 2015; Gibbons and Korach 1994].

Next, we turn our attention to popular causal-consistency models [Fidge 1988; Lamport 1978]. There have been several efforts to formalize various aspects of causal consistency, out of which have emerged three well-accepted models, namely Causal Consistency CC [Bouajjani et al. 2017; Hutto and Ahamad 1990], Causal Convergence CCv [Bouajjani et al. 2017; Burckhardt 2014; Perrin et al. 2016], and Causal Memory CM [Ahamad et al. 1995; Bouajjani et al. 2017; Perrin et al. 2016]. It was recently shown that CC coincides with WRA while CCv coincides with SRA [Lahav and Boker 2022]. Thus Theorem 1.1 extends to CC and CCv. We prove that the problem is also hard for CM, thereby establishing hardness for the ranges defined by the three main models.

**Theorem 1.2.** Consistency testing for bounded inputs is \( NP \)-complete for any memory model \( M \) such that (i) \( CCv \preceq M \preceq CC \) or (ii) \( CM \preceq M \preceq CC \).

Next, we turn our attention to the POWER architecture. Lahav et al. [2016] show that SRA captures precisely the guarantees of POWER for programs that are compiled from the release-acquire fragment of C/C++. Thus Theorem 1.1 extends to the following corollary.

**Corollary 1.3.** Consistency testing for bounded inputs is \( NP \)-complete for POWER.

Continuing with hardware models, we study Total Store Order (TSO) as employed in x86 architectures (aka x86-TSO) and its extension to Partial Store Order (PSO). It turns out that, in the bounded setting, consistency checks become tractable in these models.

**Theorem 1.4.** Consistency testing for bounded threads and memory locations is in polynomial time for TSO and PSO.

One natural, final question concerns the vanilla Relaxed model, i.e., if we remove the acyclicity condition from Relaxed-Acyclic. In this case, the problem becomes polynomial-time, which is a corollary of the corresponding result for SC [Agarwal et al. 2021].

**Corollary 1.5.** Consistency testing for bounded threads is in polynomial time for Relaxed.

Although Corollary 1.5 is technically straightforward, it is conceptually interesting under the following realization. For all of our previous results (as well as for SC), the hardness of consistency coincides with whether the corresponding model exhibits multi-copy atomicity. In contrast, Relaxed is non-multi-copy atomic, yet consistency testing is in polynomial time.

Following the results of this paper, Fig. 1 pictorially presents the full landscape of the tractability and the hardness in testing weak memories.

**High-level intuition.** Our proofs exploit complex combinatorial properties that arise in weak memory. Although it is hard to pinpoint one key insight that fully explains our hardness results, our proofs rely on the fact that most of the models we consider (i) are causally consistent, and (ii) allow \((po' \cup rf \cup fr)\)-cycles, where \(po'\) is the standard program order restricted to instructions of the same type (read-read and write-write orderings) on different locations. In contrast, the polynomial-time models SC and TSO forbid (ii), while PSO allows (ii) but also fails (i).

**Outline.** The rest of the paper is organized as follows.

- In Section 2, we define our problem setting and the memory models we consider based on C/C++ atomics. We also develop relevant notation that will be helpful in later sections.
- In Section 3, we prove Theorem 1.1 for Relaxed-Acyclic. For readability, we prove a weaker version of Theorem 1.1 in which the inputs use boundedly many threads and locations but manipulate unboundedly many values. Later in Section 6, we explain how to perform simple modifications to our reduction to make it work even for bounded values.
- In Section 4, we prove Theorem 1.1 for all models \(SRA \preceq M \preceq WRA\). Similarly to the previous case, our reduction uses unboundedly many values, while the modifications described in Section 6 also apply to this model, to arrive at the final result.
- In Section 5, we establish Theorem 1.2, Corollary 1.3, Theorem 1.4 and Corollary 1.5.
- Finally, in Section 6, we present the modifications in the reductions of Section 3 and Section 4 that fully establish Theorem 1.1.

Due to space restrictions, the full paper appears as a technical report in [Chakraborty et al. 2023].

## 2 PRELIMINARIES

This section defines the axiomatic semantics of the SRA, RA, WRA, and Relaxed memory models. As these are standard concepts, our exposition follows recent work on the topic (e.g., [Lahav and Boker 2022; Margalit and Lahav 2021; Tunç et al. 2023]). In axiomatic semantics, program executions consist of sets of events and relations between them. Given an integer \(i\), we let \([i] = \{1, 2, \ldots, i\}\).

**Events.** An event is a tuple \((id, tid, lab)\) where \(id\), \(tid\), and \(lab\) denote a unique identifier, a thread identifier, and a label, respectively. The label is of the form \(lab = \langle op, loc, Val, ord \rangle\), where \(op\), \(loc\), \(Val\), and \(ord\) denote a read \((r)\) or write \((w)\) memory operation, the accessed memory location, the read or written value, and the memory order, respectively. For the SRA, RA, and WRA models, reads and writes are of acquire and release orders respectively. For the Relaxed model, the read and write accesses have relaxed order. These memory orders are used to define the semantics of models like C11, but we will not be using them explicitly here. As we treat each model separately, all access orders are determined by the models and are never mixed. Hence, we will simply write \( r(t, x, v)/w(t, x, v) \) to denote a read/write event of thread \( t \), accessing location \( x \) and reading/writing value \( v \). We occasionally omit \( x \) and/or \( v \) when they are irrelevant or clear from the context, while we let \( \text{tid}(e) \) denote the thread of event \( e \). We do not introduce fences or atomic read-modify-write (RMW) events, as all our hardness results hold even with only read/write events, while our positive results can be easily extended to handle fences and RMWs. Finally, we denote the set of read and write accesses by \( R \) and \( W \) respectively.

**Notation on relations.** Let \( B \) be a binary relation over a set of events \( E \). The reflexive, transitive, reflexive-transitive closures, and inverse relations of \( B \) are denoted as \( B^\ast, B^+, B^\dagger, \) and \( B^{-1} \), respectively. We compose two relations \( B_1 \) and \( B_2 \) as \( B_1; B_2 \). We write \( [A] \) for the identity relation on a set \( A \). We write \( \text{irr}(B) \) and \( \text{acy}(B) \) to denote that \( B \) is irreflexive and acyclic, respectively. We occasionally write that there exists a \( B \)-edge \( e \xrightarrow{B} e' \) to denote that \( (e, e') \in B \). We naturally extend this notation to paths, so that a \( B \)-path \( P : e \xrightarrow{B} e' \) is a sequence of \( B \)-edges \( e = e_1 \xrightarrow{B} e_2 \xrightarrow{B} \cdots \xrightarrow{B} e_\ell = e' \). Finally, we write \( B_x \) to restrict \( B \) on events accessing location \( x \).
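As an illustration only, the relational notation above can be mirrored by a few set-based helpers. This is a minimal Python sketch that represents a relation as a set of pairs; the function names (`compose`, `transitive_closure`, `identity`, `irr`, `acy`) are hypothetical and chosen here to echo the notation, not taken from any tool.

```python
def compose(b1, b2):
    """B1; B2: all pairs (a, c) such that (a, b) in B1 and (b, c) in B2."""
    return {(a, c) for (a, b) in b1 for (b_src, c) in b2 if b == b_src}


def transitive_closure(b):
    """B^+: repeatedly add composed pairs until nothing new appears."""
    closure = set(b)
    while True:
        extra = compose(closure, closure) - closure
        if not extra:
            return closure
        closure |= extra


def inverse(b):
    """B^{-1}: every pair flipped."""
    return {(y, x) for (x, y) in b}


def identity(a):
    """[A]: the identity relation on the set A."""
    return {(e, e) for e in a}


def irr(b):
    """irr(B): no element is related to itself."""
    return all(x != y for (x, y) in b)


def acy(b):
    """acy(B): B has no cycle, i.e. its transitive closure is irreflexive."""
    return irr(transitive_closure(b))


edges = {(1, 2), (2, 3)}
print(acy(edges))               # the chain 1 -> 2 -> 3 has no cycle
print(acy(edges | {(3, 1)}))    # adding 3 -> 1 closes a cycle
```

Composing with `identity(A)` on the left or right restricts a relation to pairs that start or end in `A`, which is how expressions such as \([W]; hb_x; [W]\) can be read.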

**Executions and relations.** An execution is a tuple \( X = (E, po, rf, mo) \) where \( E \) is a set of events and \( po, rf, mo \) are binary relations over \( E \). In particular, the program order \( (po \subseteq (E \times E)) \) is a strict total order on the events of each thread. The reads-from relation \( (rf \subseteq (W \times R)) \) relates a write and read event pair \( (w, r) \), denoting that \( r \) obtains its value from \( w \). Every read reads from exactly one write on the same memory location and having the same value (thus \( rf^{-1} \) is a function). The modification order \( (mo \subseteq (\bigcup_x (W_x \times W_x))) \) is a strict total order over same-location writes in an execution. Finally, the happens-before relation is defined as \( hb \triangleq (po \cup rf)^+ \). Fig. 2 shows examples of executions presented as execution graphs. In each execution graph the nodes represent events and the edges represent relations. We omit some relation-edges that are clear from the context.
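The well-formedness conditions on \(rf\) and \(mo\) stated above can be checked mechanically. The following Python sketch is one possible encoding under our own assumptions: the `Event` class and the `well_formed` function are hypothetical names, and relations are again plain sets of pairs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    id: int
    tid: int
    op: str    # "r" for a read, "w" for a write
    loc: str
    val: int


def well_formed(events, rf, mo):
    """Check the structural conditions on rf and mo (not any memory-model axiom)."""
    reads = [e for e in events if e.op == "r"]
    writes = [e for e in events if e.op == "w"]
    # rf relates a write to a read on the same location with the same value.
    for (w, r) in rf:
        if w.op != "w" or r.op != "r" or w.loc != r.loc or w.val != r.val:
            return False
    # Every read reads from exactly one write (rf^{-1} is a function).
    for r in reads:
        if sum(1 for (_, r2) in rf if r2 == r) != 1:
            return False
    # mo relates only distinct same-location writes.
    for (a, b) in mo:
        if a.op != "w" or b.op != "w" or a.loc != b.loc or a == b:
            return False
    # mo is total and antisymmetric on same-location writes.
    for a in writes:
        for b in writes:
            if a != b and a.loc == b.loc and ((a, b) in mo) == ((b, a) in mo):
                return False
    # mo is transitive.
    for (a, b) in mo:
        for (c, d) in mo:
            if b == c and (a, d) not in mo:
                return False
    return True


w1 = Event(1, 1, "w", "x", 1)
w2 = Event(2, 1, "w", "x", 2)
r = Event(3, 2, "r", "x", 2)
print(well_formed([w1, w2, r], {(w2, r)}, {(w1, w2)}))
```

The check deliberately ignores \(po\) and \(hb\): whether a well-formed execution is also consistent depends on the axioms of the memory model at hand.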

**Consistency Axioms.** Consistency axioms capture different aspects or properties of an execution, such as coherence and causality cycles, under a memory model. These properties are interpreted differently in different memory models.

**Coherence.** In an execution, coherence enforces an ordering between same-location events. For events using the release-acquire memory orders, write-coherence requires that each \( mo_x \) order agrees with \( hb \). A stronger variant is strong-write-coherence, which requires that \( mo \) agrees with \( hb \), transitively. Read coherence enforces that a read \( r \) can read from a write \( w \) when there is no intermediate write \( w' \) on the same-location that happens-before \( r \). Depending upon how “intermediate” writes are treated, two variations of read coherence are popular — in standard read-coherence, \( w \) and \( w' \) are ordered by \( mo_x \) whereas in weak-read-coherence they are ordered by \( hb_x \). Finally, we also have variants of write and read coherence when all accesses are relaxed. Here \( hb \) is replaced with \( po \), as \( rf \)-edges do not contribute to \( hb \) between different memory locations.

It is also common to define a from-reads relation \( fr \triangleq rf^{-1}; mo \). However, we will not be using \( fr \) explicitly in this paper.

**Causality Cycles.** A causality cycle arises in the presence of relaxed accesses and consists of \(po\) and \(rf\) orderings. A causality cycle may result in ‘out-of-thin-air’ behavior in an execution. To avoid such ‘out-of-thin-air’ behavior, many consistency models and verification tools explicitly disallow such cycles [Luo and Demsky 2021; Margalit and Lahav 2021; Norris and Demsky 2013].

- \(\text{acy}(po \cup rf)\) (porf-acyclicity)

Fig. 2 shows examples of executions forbidden by different axioms. The write-coherence axiom forbids the execution in Fig. 2a as it violates the irreflexivity of \((\text{mo}_x; \text{hb})\). The \((\text{hb} \cup \text{mo})\) cycle in Fig. 2b is forbidden by strong-write-coherence. The execution in Fig. 2c violates irreflexivity of \((\text{rf}^{-1}; \text{mo}_x; \text{hb})\) and thus fails read-coherence. In Fig. 2d, we have \((w(x), w(x)) \in \text{hb}_x; [W], (w(x), r(x)) \in \text{po} \subseteq \text{hb}, \) and \((r(x), w(x)) \in \text{rf}^{-1}\), violating weak-read-coherence. The execution in Fig. 2e violates porf-acyclicity. Finally, the executions in Fig. 2f and Fig. 2g violate relaxed-write-coherence and relaxed-read-coherence, respectively.

**Memory Models.** We can now describe the main memory models we consider in this work, by listing the axioms that each execution needs to satisfy in the respective model (Table 2).

**Release-Acquire and variants.** The release-acquire (RA) memory model is weaker than sequential consistency and is arguably the most well-understood fragment of C11. Here, the reads-from relation \(rf\) induces synchronization between threads, which is captured in the semantics by the happens-before relation \(hb\). Following [Lahav and Boker 2022], we consider three variants of release-acquire models: Release-Acquire (RA), and its Strong (SRA) and Weak (WRA) variants.

SRA enforces strong-write-coherence on write accesses whereas RA enforces write-coherence. On the other hand, WRA does not place any ordering between same-location writes by \(\text{mo}_x\). Instead, the only orderings considered between same-location writes are through the \([W]; \text{hb}_x; [W]\) relation.

**Relaxed.** All accesses in the Relaxed model satisfy the corresponding coherence axioms relaxed-write-coherence and relaxed-read-coherence, which guarantee SC-per-location. The Relaxed-Acyclic model strengthens Relaxed by also requiring the acyclicity of \((\text{po} \cup \text{rf})\).

---

Table 2. The main weak-memory models based on C11 that we consider in this work.

<table>
<thead>
<tr>
<th></th>
<th>Release-Acquire</th>
<th>Relaxed</th>
</tr>
</thead>
<tbody>
<tr>
<td>WRA</td>
<td>porf-acyclicity</td>
<td>relaxed-write-coherence</td>
</tr>
<tr>
<td>RA</td>
<td>write-coherence</td>
<td>relaxed-read-coherence</td>
</tr>
<tr>
<td>SRA</td>
<td>strong-write-coherence</td>
<td>porf-acyclicity</td>
</tr>
</tbody>
</table>

Fig. 2. Executions forbidden by (a) write-coherence, (b) strong-write-coherence, (c) read-coherence, (d) weak-read-coherence, (e) porf-acyclicity, (f) relaxed-write-coherence, (g) relaxed-read-coherence.

Based on the set of allowed behaviors, these models can be partially ordered as \( \text{SRA} \preceq \text{RA} \preceq \{\text{WRA}, \{\text{Relaxed-Acyclic} \preceq \text{Relaxed}\}\} \), where models towards the right allow more behaviors.

**The consistency-testing problem.** An execution \( X \) is consistent in a memory model \( M \), written \( X \models M \), if it satisfies the axioms of \( M \). For example, the execution in Fig. 2b satisfies all axioms except strong-write-coherence, and hence it is consistent in RA, WRA, and Relaxed(-Acyclic).

When testing the behavior of a program within a memory model, one does not have access to concrete executions, but rather to abstract executions. The latter contains only information observed by the program, i.e., the events it executed and the values it read/wrote. Formally, an abstract execution \( \overline{X} = \langle E, po \rangle \) is a coarser object than concrete executions, missing the mo and rf relations, and a concrete execution \( X = \langle E', po', rf', mo' \rangle \) is said to be an extension of \( \overline{X} \) if \( E' = E \) and \( po' = po \). We call \( \overline{X} \) consistent in \( M \), written similarly as \( \overline{X} \models M \), if there exists an rf and an mo such that the extension \( X = \langle E, po, rf, mo \rangle \) is an execution with \( X \models M \). The problem of consistency testing in a memory model \( M \) is to determine whether \( \overline{X} \) is consistent in \( M \), i.e., whether there is a way to resolve rf and mo in \( M \) that would give rise to the observed behavior \( \overline{X} \) on the program level.
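To make the problem statement concrete, the following Python sketch tests consistency by brute force: it enumerates every candidate \(rf\) and \(mo\) and checks the axioms on each extension. It is exponential and meant only as an illustration. We assume here that RA consists of porf-acyclicity, write-coherence \(\text{irr}(mo_x; hb)\), and read-coherence \(\text{irr}(rf^{-1}; mo_x; hb)\), in the forms quoted in the discussion of Fig. 2; the function name `is_consistent_ra` and the encoding of events as tuples are hypothetical.

```python
from itertools import permutations, product


def transitive_closure(rel):
    """Naive transitive closure of a set of pairs."""
    closure = set(rel)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def is_consistent_ra(threads):
    """threads: one list of (op, loc, val) per thread, in program order."""
    # An event is (thread index, position, op, loc, val).
    events = [(t, i, op, loc, val)
              for t, thread in enumerate(threads)
              for i, (op, loc, val) in enumerate(thread)]
    po = {(a, b) for a in events for b in events
          if a[0] == b[0] and a[1] < b[1]}
    reads = [e for e in events if e[2] == "r"]
    writes = [e for e in events if e[2] == "w"]
    locs = sorted({w[3] for w in writes})
    # Candidate writers per read: same location, same value.
    rf_options = [[w for w in writes if w[3] == r[3] and w[4] == r[4]]
                  for r in reads]
    # Candidate mo: one total order of the writes of each location.
    mo_options = [list(permutations([w for w in writes if w[3] == x]))
                  for x in locs]
    for choice in product(*rf_options):
        rf = set(zip(choice, reads))
        hb = transitive_closure(po | rf)
        if any(a == b for (a, b) in hb):            # porf-acyclicity
            continue
        for orders in product(*mo_options):
            mo = {(o[i], o[j]) for o in orders
                  for i in range(len(o)) for j in range(i + 1, len(o))}
            if any((b, a) in hb for (a, b) in mo):  # write-coherence
                continue
            if any((w, w2) in mo and (w2, r) in hb  # read-coherence
                   for (w, r) in rf for w2 in writes):
                continue
            return True
    return False


writer = [("w", "x", 1), ("w", "x", 2)]
print(is_consistent_ra([writer, [("r", "x", 1), ("r", "x", 2)]]))
print(is_consistent_ra([writer, [("r", "x", 2), ("r", "x", 1)]]))
```

In the second call the reader observes the two writes against their program order, so every choice of \(mo\) violates either write-coherence or read-coherence. The sketch also reflects why membership in NP is straightforward: once \(rf\) and \(mo\) are guessed, each axiom is checked in polynomial time.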

**Conflicting triplets.** In the coming sections, we use the notion of conflicting triplets. Given an abstract execution \( \overline{X} = \langle E, po \rangle \), we say that two events \( e_1, e_2 \in E \) conflict if they access the same location and at least one of them is a write. Given additionally a reads-from relation rf, a conflicting triplet (or triplet, for short) is a tuple \((w, r, w')\) of pairwise conflicting events such that \((w, r) \in rf\).
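As a small illustration of this definition, the sketch below enumerates the conflicting triplets of an execution fragment. Events are encoded as hypothetical `(id, op, loc)` tuples, and we additionally assume that conflicting events are distinct, which the definition leaves implicit.

```python
def conflict(e1, e2):
    """Two distinct events conflict if they share a location and one is a write."""
    return e1 != e2 and e1[2] == e2[2] and "w" in (e1[1], e2[1])


def conflicting_triplets(events, rf):
    """All (w, r, w2) with (w, r) in rf and w2 conflicting with both w and r."""
    return [(w, r, w2)
            for (w, r) in rf
            for w2 in events
            if conflict(w, w2) and conflict(r, w2)]


w1 = (1, "w", "x")
w2 = (2, "w", "x")
r = (3, "r", "x")
wy = (4, "w", "y")
print(conflicting_triplets([w1, w2, r, wy], {(w1, r)}))
```

Since \(r\) is a read, the third component of a triplet is necessarily a write on the same location as \(w\) and \(r\).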

## 3 HARDNESS FOR RELAXED-ACYCLIC

We start with the Relaxed-Acyclic memory model and show that consistency testing is NP-complete under bounded threads and memory locations. This differs slightly from Theorem 1.1, which states that the problem remains hard even with bounded values. Since our proof is rather technical, we choose to present this intermediate result here. We will make the final step towards Theorem 1.1 in Section 6, which consists of a simple modification of the technique presented here. In Section 3.1, we present the hardness reduction and argue about its correctness in Sections 3.2 and 3.3.

### 3.1 Reduction

Our reduction is from Monotone 1-in-3 SAT which is known to be NP-complete [Garey and Johnson 1990]. The input is a monotone formula \( \varphi \) in conjunctive normal form, where each clause contains three literals, all of which are positive. The task is to determine if there exists a 1-in-3 truth assignment for \( \varphi \), i.e., one that sets exactly one literal to true in each clause.
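For concreteness, the source problem can be stated as a brute-force search. The Python sketch below is an illustration under our own encoding (variables are indices `0..num_vars-1` and each clause is a triple of indices); the function name is hypothetical.

```python
from itertools import product


def one_in_three_assignments(num_vars, clauses):
    """Yield every assignment that sets exactly one variable to true per clause."""
    for assignment in product([False, True], repeat=num_vars):
        if all(sum(assignment[v] for v in clause) == 1 for clause in clauses):
            yield assignment


# A single clause over three variables has exactly three 1-in-3 assignments.
print(len(list(one_in_three_assignments(3, [(0, 1, 2)]))))

# All four triples over four variables: every variable occurs in three clauses,
# so k true variables account for 3k true literals, which would have to equal 4.
all_triples = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]
print(list(one_in_three_assignments(4, all_triples)))
```

The second formula has no 1-in-3 assignment even though it is trivially satisfiable in the usual sense, which is the distinction the footnoted convention on 'satisfiable' refers to.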

We remark that our reduction is combinatorially elaborate. We found that complex interactions between threads are necessary to expose the nuances that make the consistency problem for the Relaxed-Acyclic memory model (or for that matter, other memory models we consider) hard. Nevertheless, we assist the text with illustrations that help visualize and generalize the interaction patterns that are exploited in our reduction. To further enhance readability, we distinguish different memory locations with different colors (in both figures and main text).

Let \( \varphi = \{C_i\}_{i \in [m]} \) be a monotone Boolean formula over \( n \) variables \( \{s_j\}_{j \in [n]} \) and \( m \) clauses of the form \( C_i = (s_j, s_k, s_l) \). We construct an abstract execution \( \overline{X} = \langle E, po \rangle \) such that \( \overline{X} \models \) Relaxed-Acyclic iff \( \varphi \) is satisfiable using a 1-in-3 assignment\(^\dagger\).

\(^\dagger\)We often use the phrase '\( \varphi \) is satisfiable' to mean '\( \varphi \) is satisfiable by a 1-in-3 assignment'.
The main gadget in our construction is the copy gadget $\text{Copy}_j$. The main gadget in our construction is the copy gadget $\text{Copy}_j$, defined for each $i \in [m]$ and $j \in [n]$, and shown in Fig. 4. This gadget contains (i) the three focal events $w(t_1, x_1, v'_j), w(t_2, x_1, v'_j)$ and $r(t_3, x_1, v'_j)$ that determine the truth value of $s_j$, (ii) three “mirror” events $w(t_1, x_2, v'_j), w(t_5, x_2, v'_j)$ and $r(t_6, x_2, v'_j)$, and (iii) other events on memory locations $y_1, y_2, y_3, y_4$.

The gadget couples the writers of $r(t_3, x_1, v'_j)$ and $r(t_6, x_2, v'_j)$: if $r(t_3, x_1, v'_j)$ reads from thread $t_1$ then $r(t_6, x_2, v'_j)$ reads from thread $t_4$ (see Fig. 4b), while if $r(t_3, x_1, v'_j)$ reads from thread $t_2$ then $r(t_6, x_2, v'_j)$ reads from thread $t_5$ (see Fig. 4c).

The copy-down gadget $\overline{\text{Copy}_j}$. We use a copy-down gadget $\overline{\text{Copy}_j}$, defined for $i \in [m - 1]$ and $j \in [n]$, with structure identical to $\text{Copy}_j$, and shown in Fig. 5. This gadget contains (i) the three focal events $w(t_1, x_1, v^{j+1}_j), w(t_2, x_1, v^{j+1}_j)$ and $r(t_3, x_1, v^{j+1}_j)$, (ii) the three mirror events $w(t_4, x_2, v'_j), w(t_5, x_2, v'_j)$ and $r(t_6, x_2, v'_j)$, and (iii) other events on memory locations $z_1, z_2, z_3, z_4$. 

Fig. 3. The schematic reduction from a monotone formula $\varphi$ to an abstract execution $\overline{X}$. 

**High-level description.** Our reduction constructs an abstract execution with $O(n \cdot m)$ events accessing $d = 14$ memory locations in $\kappa = 23$ threads, of which the three threads $t_1, t_2$ and $t_3$ form the core of the construction. Events appear in each of these threads in $m$ phases (one phase per clause $C_i$), starting from phase 1 and going to larger phases as we go downwards in the threads. Each phase, in turn, consists of $n$ steps (one step per variable $s_j$), again starting from step 1 and going to larger steps as we go downwards. In phase $i$ and step $j$ we have a read event $r(t_3, x_1, v'_j)$ that can read from either of two writes $w(t_1, x_1, v'_j)$ or $w(t_2, x_1, v'_j)$. The former case corresponds to the assignment $s_j = \bot$, while the latter case corresponds to the assignment $s_j = \top$. See Fig. 3 for an illustration. Our construction guarantees that the choices for the writer of $r(t_3, x_1, v'_j)$ are consistent across all phases $i$: either each $r(t_3, x_1, v'_j)$ reads from $w(t_1, x_1, v'_j)$, which corresponds to setting $s_j = \bot$ in $\varphi$, or each $r(t_3, x_1, v'_j)$ reads from $w(t_2, x_1, v'_j)$, which corresponds to setting $s_j = \top$ in $\varphi$. Moreover, for each clause $C_i = (s_j, s_k, s_t)$, our reduction guarantees that exactly one of $r(t_3, x_1, v'_j)$, $r(t_3, x_1, v'_k)$, and $r(t_3, x_1, v'_t)$ reads from thread $t_2$, which implies that the corresponding assignment on $s_j, s_k$ and $s_t$ satisfies the 1-in-3 property. To achieve all these constraints, we introduce four gadgets, which consist of events on additional threads and memory locations, that guarantee the desired properties. In the following, we first describe each gadget separately, and then explain how to interleave them in order to obtain the abstract execution $\overline{X}$.
The at-most-one-true gadgets AMOne[c]j,k – Consider a clause \( C_i \), and for each \( c \in [3] \), let \( (s_j, s_k) \) be the \( c \)-th pair of literals that appear in \( C_i \) (according to some arbitrary but fixed total ordering on pairs of propositional variables). The at-most-one-true gadget \( \text{AMOne}[c]_{j,k} \) is shown in Fig. 6a and contains (i) the six focal events \( w(t_1, x_1, v^i_j) \), \( w(t_2, x_1, v^i_j) \) and \( r(t_3, x_1, v^i_j) \); and \( w(t_1, x_1, v^i_k) \), \( w(t_2, x_1, v^i_k) \) and \( r(t_3, x_1, v^i_k) \), and (ii) other events on memory location \( a_c \). The gadget guarantees that at most

Fig. 4. The copy gadget \( \text{Copy}^i_j \) (a) captures the Boolean assignment to variable \( s_j \) in phase \( i \). There are two ways to realize this gadget, by choosing which of the two writes \( w(x_1, v^i_j) \) the read \( r(x_1, v^i_j) \) observes. Choosing the write of \( t_1 \) (b) corresponds to setting \( s_j = \bot \) and also forces \( r(x_2, v^i_j) \) to read from \( t_4 \). Choosing the write of \( t_2 \) (c) corresponds to setting \( s_j = \top \) and also forces \( r(x_2, v^i_j) \) to read from \( t_5 \). This \( rf \) coupling is formalized in Lemma 3.1. The edge numbers specify the order in which \( rf \)-edges are inferred.

**The at-least-one-true gadget \( \text{ALOne}^i_{j,k,t} \).** Consider a clause \( C_i = (s_j, s_k, s_t) \). The at-least-one-true gadget \( \text{ALOne}^i_{j,k,t} \) is shown in Fig. 7a and contains (i) the nine focal events \( w(t_1, x_1, v^i_j) \), \( w(t_2, x_1, v^i_j) \) and \( r(t_3, x_1, v^i_j) \); \( w(t_1, x_1, v^i_k) \), \( w(t_2, x_1, v^i_k) \) and \( r(t_3, x_1, v^i_k) \); and \( w(t_1, x_1, v^i_t) \), \( w(t_2, x_1, v^i_t) \) and \( r(t_3, x_1, v^i_t) \); and (ii) other events on memory location \( b \). The gadget guarantees that at least one of \( r(t_3, x_1, v^i_j) \), \( r(t_3, x_1, v^i_k) \), and \( r(t_3, x_1, v^i_t) \) reads from thread \( t_2 \), which corresponds to assigning \( \top \) to at least one of \( s_j, s_k, s_t \) (shown in Figs. 7b to 7d). Note that this gadget, by itself, allows two or even all three of the literals to be assigned \( \top \); these cases are not shown, as they are prohibited by the at-most-one-true gadgets above.

**Putting the gadgets together.** We serially connect all gadgets in their common threads by po. In particular, \( \text{Copy}^{i_1}_{j_1} \) appears before \( \text{Copy}^{i_2}_{j_2} \) if \( i_1 < i_2 \), or \( i_1 = i_2 \) and \( j_1 < j_2 \); \( \overline{\text{Copy}}^{i_1}_{j_1} \) appears before \( \overline{\text{Copy}}^{i_2}_{j_2} \) if \( i_1 < i_2 \), or \( i_1 = i_2 \) and \( j_1 < j_2 \); each \( \text{AMOne}[c]^{i_1}_{j_1,k_1} \) appears before \( \text{AMOne}[c]^{i_2}_{j_2,k_2} \) if \( i_1 < i_2 \); and finally \( \text{ALOne}^{i_1}_{j_1,k_1,t_1} \) appears before \( \text{ALOne}^{i_2}_{j_2,k_2,t_2} \) if \( i_1 < i_2 \). As various gadgets have common threads and events, besides connecting them, we also need to specify the interleaving between them. However, this interleaving can be arbitrary and we will not fix it here. Finally, we have indeed used \( \kappa = 23 \) threads and \( d = 14 \) memory locations.
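The serial order of the copy gadgets is the lexicographic order on (phase, step) pairs. A minimal Python sketch, assuming gadgets are identified by hypothetical `(i, j)` tuples:

```python
from typing import List, Tuple

Gadget = Tuple[int, int]  # hypothetical identifier: (phase i, step j)

def appears_before(g1: Gadget, g2: Gadget) -> bool:
    """True iff i1 < i2, or i1 == i2 and j1 < j2 (the po order of copy gadgets)."""
    (i1, j1), (i2, j2) = g1, g2
    return i1 < i2 or (i1 == i2 and j1 < j2)

def copy_gadget_order(m: int, n: int) -> List[Gadget]:
    """Enumerate all (phase, step) pairs in the order they are connected by po."""
    return [(i, j) for i in range(1, m + 1) for j in range(1, n + 1)]
```

The same comparison is what Python's built-in tuple ordering computes, so `sorted` on such pairs yields the same sequence.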
### 3.2 Soundness

We first establish the soundness of the reduction, i.e., if $\overline{X}$ is consistent (using an extension $X$) within Relaxed-Acyclic, then $\varphi$ has a satisfying assignment. To this end, we establish some intermediate lemmas. Recall that we obtain the satisfying assignment for $\varphi$ by assigning $s_j = \top$ if $(w(t_2, x_1, v^i_j), r(t_3, x_1, v^i_j)) \in rf$ for all $i \in [m]$, and $s_j = \bot$ if $(w(t_1, x_1, v^i_j), r(t_3, x_1, v^i_j)) \in rf$ for all $i \in [m]$. The first lemma is based on the copy gadgets $\text{Copy}^i_j$ and $\overline{\text{Copy}}^i_j$, and states that each phase of $X$ makes consistent choices for the writer of $r(t_3, x_1, v^i_j)$ (i.e., whether it reads from $t_1$ or $t_2$), which makes the above assignment well-defined. It follows from the high-level description of these two gadgets and the accompanying Fig. 4 and Fig. 5.

**Lemma 3.1.** Let $X = (E, po, rf, mo)$ be a concrete extension of $\overline{X}$ with $X \models$ Relaxed-Acyclic. For all $i_1, i_2 \in [m]$ and $j \in [n]$, we have that $(w(t_1, x_1, v^{i_1}_j), r(t_3, x_1, v^{i_1}_j)) \in rf$ iff $(w(t_1, x_1, v^{i_2}_j), r(t_3, x_1, v^{i_2}_j)) \in rf$.

**Proof.** We argue by induction that for every $i \in [m - 1]$, we have $(w(t_1, x_1, v^i_j), r(t_3, x_1, v^i_j)) \in rf$ iff $(w(t_1, x_1, v^{i+1}_j), r(t_3, x_1, v^{i+1}_j)) \in rf$. First, note that if $(w(t_1, x_1, v^i_j), r(t_3, x_1, v^i_j)) \in rf$, then the copy gadget $\text{Copy}^i_j$ forces $(w(t_4, x_2, v^i_j), r(t_6, x_2, v^i_j)) \in rf$. Indeed, we have the following inferred sequence of $rf$ edges (see 1-5 in Fig. 4b, where 1 represents $(w(t_1, x_1, v^i_j), r(t_3, x_1, v^i_j)) \in rf$).

1. We have $(r(t_1, y_1, v^i_j), w(t_3, y_1, v^i_j)) \in (po \cup rf)^+$ and thus $(w(t_3, y_1, v^i_j), r(t_1, y_1, v^i_j)) \notin rf$ due to porf-acyclicity. Thus $r(t_1, y_1, v^i_j)$ is forced to read from the only other available write of the same value, i.e., $(w(f_1, y_1, v^i_j), r(t_1, y_1, v^i_j)) \in rf$, depicted as 2 in Fig. 4b.
2. We now have \((w(f_1, y_1, v^i_j), r(t_1, y_1, u^i_j)) \in (\text{po} \cup \text{rf})^+\), and since \((w(f_2, y_1, u^i_j), r(t_1, y_1, u^i_j)) \in \text{rf}\), we have \((w(f_1, y_1, v^i_j), w(f_2, y_1, u^i_j)) \in (\text{mo} \cup \text{rf})^+\) due to relaxed-read-coherence. Observe that now \((w(f_1, y_1, v^i_j), w(f_2, y_1, u^i_j)) \in (\text{po} \cup \text{mo})^+\), thus due to relaxed-write-coherence, we have \((w(f_1, y_1, v^i_j), w(f_2, y_1, u^i_j)) \in (\text{mo} \cup \text{rf})^+\). Since \((w(f_2, y_1, u^i_j), r(f_3, y_1, v^i_j)) \in (\text{po} \cup \text{rf})^+\), we have \((w(f_1, y_1, v^i_j), r(f_3, y_1, v^i_j)) \notin \text{rf}\) due to relaxed-read-coherence. Thus \(r(f_3, y_1, v^i_j)\) is forced to read from the only other available write, i.e., \((w(t_5, y_1, v^i_j), r(f_3, y_1, v^i_j)) \in \text{rf}\), depicted by 3.

3. We now have \((r(t_5, y_4, v^i_j), w(f_3, y_4, v^i_j)) \in (\text{po} \cup \text{rf})^+\) and thus \((w(f_3, y_4, v^i_j), r(t_5, y_4, v^i_j)) \notin \text{rf}\), due to porf-acyclicity. Thus \(r(t_5, y_4, v^i_j)\) is forced to read from the only other available write, i.e., \((w(t_6, y_4, v^i_j), r(t_5, y_4, v^i_j)) \in \text{rf}\), depicted by 4.

4. We now have \((r(t_6, x_2, v^i_j), w(t_5, x_2, v^i_j)) \in (\text{po} \cup \text{rf})^+\) and thus \((w(t_5, x_2, v^i_j), r(t_6, x_2, v^i_j)) \notin \text{rf}\) due to porf-acyclicity. Thus \(r(t_6, x_2, v^i_j)\) is forced to read from the only other available write, i.e., \((w(t_4, x_2, v^i_j), r(t_6, x_2, v^i_j)) \in \text{rf}\), depicted by 5.

Fig. 7. The at-least-one-true gadget \(\text{ALOne}^i_{j,k,t}\) (a) and the three ways to resolve it depending on the Boolean assignment to \(s_j, s_k\) and \(s_t\) (d, c, b). These \(rf\) constraints are formalized in Lemma 3.3. The edge numbers specify the order in which \(rf\)-edges are inferred. The crossed-out events are ignored in our analysis for simplicity (see Lemma 3.5).

On the other hand, if \((w(t_2, x_1, v^i_j), r(t_3, x_1, v^i_j)) \in rf\), then the copy gadget \(Copy^i_j\) forces \((w(t_5, x_2, v^i_j), r(t_6, x_2, v^i_j)) \in rf\), by a similar analysis (see Fig. 4c, depicted by 1 - 5).

Finally, a similar analysis on the copy-down gadget \(\overline{\text{Copy}}^i_j\) (see Fig. 5, and the forced \(rf\) edges 1 - 5) concludes that \((w(t_1, x_1, v^{i+1}_j), r(t_3, x_1, v^{i+1}_j)) \in rf\) iff \((w(t_4, x_2, v^i_j), r(t_6, x_2, v^i_j)) \in rf\), and hence we have \((w(t_1, x_1, v^i_j), r(t_3, x_1, v^i_j)) \in rf\) iff \((w(t_1, x_1, v^{i+1}_j), r(t_3, x_1, v^{i+1}_j)) \in rf\), as desired. □

The next lemma is based on the at-most-one-true gadgets \(AMOne[c]^i_{j,k}\), and it is used to show that for every clause \(C_i\) and for each of the three pairs of literals \((s_j, s_k)\) in \(C_i\), at most one of them is assigned to true. Again, it follows by a direct analysis of the accompanying figure, Fig. 6.

**Lemma 3.2.** Let \(X = (E, po, rf, mo)\) be a concrete extension of \(\overline{X}\) with \(X \models\) Relaxed-Acyclic. For every \(i \in [m]\) and \(j, k \in [n]\) such that \(s_j\) and \(s_k\) appear in clause \(C_i\), we have \((w(t_2, x_1, v^i_j), r(t_3, x_1, v^i_j)) \notin rf\) or \((w(t_2, x_1, v^i_k), r(t_3, x_1, v^i_k)) \notin rf\).

**Proof.** The statement follows by analyzing the at-most-one-true gadget \(\text{AMOne}[c]^i_{j,k}\), where \(c \in [3]\) is such that \((s_j, s_k)\) is the \(c\)-th pair of variables in \(C_i\) (see Fig. 6).

First, if \((w(t_2, x_1, v^i_j), r(t_3, x_1, v^i_j)) \in rf\) holds (marked 1 in Fig. 6b) then \((w(t_2, x_1, v^i_k), r(t_3, x_1, v^i_k)) \notin rf\). Indeed, we have the following sequence of \(rf\) edges (see 1 - 4, Fig. 6b).

1. We have \((r(t_2, a_c, v^i_j), w(t_3, a_c, v^i_j)) \in (po \cup rf)^+\) and thus \((w(t_3, a_c, v^i_j), r(t_2, a_c, v^i_j)) \notin rf\) due to porf-acyclicity. Thus \(r(t_2, a_c, v^i_j)\) is forced to read from the only other available write, i.e., \((w(h_c, a_c, v^i_j), r(t_2, a_c, v^i_j)) \in rf\), depicted by 2.

2. We now have \((w(h_c, a_c, v^i_j), r(t_2, a_c, v^i_k)) \in (po_{a_c} \cup rf_{a_c})^+\), and due to relaxed-write-coherence, we also have \((w(h_c, a_c, v^i_k), w(h_c, a_c, v^i_j)) \in mo_{a_c}\). In turn, this implies \((w(h_c, a_c, v^i_k), r(t_2, a_c, v^i_k)) \notin rf\) due to relaxed-read-coherence. Thus \(r(t_2, a_c, v^i_k)\) is forced to read from the only other available write, i.e., \((w(t_3, a_c, v^i_k), r(t_2, a_c, v^i_k)) \in rf\), depicted by 3.

3. We now have \((r(t_3, x_1, v^i_k), w(t_2, x_1, v^i_k)) \in (po \cup rf)^+\) and thus \((w(t_2, x_1, v^i_k), r(t_3, x_1, v^i_k)) \notin rf\) due to porf-acyclicity. Thus \(r(t_3, x_1, v^i_k)\) is forced to read from the only other available write, i.e., \((w(t_1, x_1, v^i_k), r(t_3, x_1, v^i_k)) \in rf\), depicted by 4.

Second, if \((w(t_2, x_1, v^i_k), r(t_3, x_1, v^i_k)) \in rf\) (marked 1 in Fig. 6c), then \((w(t_2, x_1, v^i_j), r(t_3, x_1, v^i_j)) \notin rf\). Indeed, we have the following forced sequence of \(rf\) edges (see Fig. 6c).

1. We have \((r(t_2, a_c, v^i_k), w(t_3, a_c, v^i_k)) \in (po \cup rf)^+\) and thus \((w(t_3, a_c, v^i_k), r(t_2, a_c, v^i_k)) \notin rf\) due to porf-acyclicity. Thus \(r(t_2, a_c, v^i_k)\) is forced to read from the only other available write, i.e., \((w(h_c, a_c, v^i_k), r(t_2, a_c, v^i_k)) \in rf\), marked 2.

2. Due to relaxed-write-coherence, we have \((w(h_c, a_c, v^i_k), w(h_c, a_c, v^i_j)) \in mo_{a_c}\). We thus have \((w(h_c, a_c, v^i_j), r(t_2, a_c, v^i_j)) \notin rf\), as this would imply that \((w(h_c, a_c, v^i_j), r(t_2, a_c, v^i_k)) \in (po_{a_c} \cup rf_{a_c})^+\), which would violate relaxed-read-coherence. Thus \(r(t_2, a_c, v^i_j)\) is forced to read from the only other available write, i.e., \((w(t_3, a_c, v^i_j), r(t_2, a_c, v^i_j)) \in rf\), marked 3.

3. We now have \((r(t_3, x_1, v^i_j), w(t_2, x_1, v^i_j)) \in (po \cup rf)^+\) and thus \((w(t_2, x_1, v^i_j), r(t_3, x_1, v^i_j)) \notin rf\) due to porf-acyclicity. Thus \(r(t_3, x_1, v^i_j)\) is forced to read from the only other available write, i.e., \((w(t_1, x_1, v^i_j), r(t_3, x_1, v^i_j)) \in rf\), depicted by 4. □
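The recurring inference "the read is forced to read from the only other available write" can be mimicked with a reachability test: a candidate writer is ruled out by porf-acyclicity whenever the read already reaches it along $(po \cup rf)^+$. The following Python sketch illustrates this on step 1 of the first case above; the string labels for events and the helper names are assumptions made for the example, and the three edges are the ones implied by that step.

```python
from typing import Dict, Hashable, Iterable, List, Tuple

Edge = Tuple[Hashable, Hashable]

def reachable(edges: Iterable[Edge], src: Hashable, dst: Hashable) -> bool:
    """Depth-first search: is dst reachable from src along the given edges?"""
    succ: Dict[Hashable, List[Hashable]] = {}
    for a, b in edges:
        succ.setdefault(a, []).append(b)
    stack, seen = [src], {src}
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        for nxt in succ.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False

def admissible_writers(read: Hashable, candidates: List[Hashable],
                       porf_edges: Iterable[Edge]) -> List[Hashable]:
    """Keep the writers w for which adding w -rf-> read closes no (po ∪ rf)-cycle."""
    edges = list(porf_edges)
    return [w for w in candidates if not reachable(edges, read, w)]

# Hypothetical labels for the events of step 1 (phase i, variable s_j).
edges = [
    ("r(t2,a_c)", "w(t2,x1)"),   # po in thread t2
    ("w(t2,x1)", "r(t3,x1)"),    # rf edge marked 1
    ("r(t3,x1)", "w(t3,a_c)"),   # po in thread t3
]
forced = admissible_writers("r(t2,a_c)", ["w(t3,a_c)", "w(h_c,a_c)"], edges)
```

Here `forced` contains only the write of thread $h_c$, matching the edge marked 2.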
The third lemma is based on the at-least-one-true gadget $\text{ALOne}^i_{j,k,t}$ and it is used to show that for every clause $C_i = (s_j, s_k, s_t)$, at least one of its literals is assigned to true. Again, it follows by a direct analysis of the accompanying figure, Fig. 7.

**Lemma 3.3.** Let $X = (E, \text{po}, \text{rf}, \text{mo})$ be a concrete extension of $\overline{X}$ with $X \models \text{Relaxed-Acyclic}$. For every $i \in [m]$ and $j, k, t \in [n]$ such that $s_j, s_k$ and $s_t$ appear in clause $C_i$, we have $(w(t_2, x_1, v^i_j), r(t_3, x_1, v^i_j)) \in \text{rf}$ or $(w(t_2, x_1, v^i_k), r(t_3, x_1, v^i_k)) \in \text{rf}$ or $(w(t_2, x_1, v^i_t), r(t_3, x_1, v^i_t)) \in \text{rf}$.

**Proof.** First, if $(w(t_2, x_1, v^i_j), r(t_3, x_1, v^i_j)) \notin \text{rf}$ and $(w(t_2, x_1, v^i_k), r(t_3, x_1, v^i_k)) \notin \text{rf}$ (hence both read from $t_1$, marked 1 in Fig. 7b) then $(w(t_2, x_1, v^i_t), r(t_3, x_1, v^i_t)) \in \text{rf}$. Indeed, we have the following forced sequence of rf edges (see Fig. 7b, 1 - 4).

1. We have $(r(t_1, b, v^i_j), w(t_3, b, v^i_j)) \in (\text{po} \cup \text{rf})^+$ and thus $(w(t_3, b, v^i_j), r(t_1, b, v^i_j)) \notin \text{rf}$ due to porf-acyclicity. Thus $r(t_1, b, v^i_j)$ is forced to read from the only other available write, i.e., $(w(p, b, v^i_j), r(t_1, b, v^i_j)) \in \text{rf}$. Similarly, we have $(r(t_1, b, v^i_k), w(t_3, b, v^i_k)) \in (\text{po} \cup \text{rf})^+$ and thus $(w(t_3, b, v^i_k), r(t_1, b, v^i_k)) \notin \text{rf}$ due to porf-acyclicity. Thus $r(t_1, b, v^i_k)$ is forced to read from the only other available write, i.e., $(w(q, b, v^i_k), r(t_1, b, v^i_k)) \in \text{rf}$.

2. We now have $(w(p, b, v^i_j), r(t_1, b, v^i_t)) \in (\text{po} \cup \text{rf})^+$, and due to relaxed-write-coherence, we also have $(w(p, b, v^i_t), w(p, b, v^i_j)) \in \text{mo}_b$. In turn, this implies $(w(p, b, v^i_t), r(t_1, b, v^i_t)) \notin \text{rf}$ due to relaxed-read-coherence. Similarly, we now have $(w(q, b, v^i_k), r(t_1, b, v^i_t)) \in (\text{po} \cup \text{rf})^+$, and due to relaxed-write-coherence, we also have $(w(q, b, v^i_t), w(q, b, v^i_k)) \in \text{mo}_b$. In turn, this implies $(w(q, b, v^i_t), r(t_1, b, v^i_t)) \notin \text{rf}$ due to relaxed-read-coherence. Thus $r(t_1, b, v^i_t)$ is forced to read from the only other available write, i.e., $(w(t_3, b, v^i_t), r(t_1, b, v^i_t)) \in \text{rf}$, depicted by 3.

3. We now have $(r(t_3, x_1, v^i_t), w(t_1, x_1, v^i_t)) \in (\text{po} \cup \text{rf})^+$ and thus $(w(t_1, x_1, v^i_t), r(t_3, x_1, v^i_t)) \notin \text{rf}$ due to porf-acyclicity. Thus $r(t_3, x_1, v^i_t)$ is forced to read from the only other available write, i.e., $(w(t_2, x_1, v^i_t), r(t_3, x_1, v^i_t)) \in \text{rf}$, depicted by 4.

A similar analysis establishes that if $(w(t_2, x_1, v^i_j), r(t_3, x_1, v^i_j)) \notin \text{rf}$ and $(w(t_2, x_1, v^i_t), r(t_3, x_1, v^i_t)) \notin \text{rf}$ then $(w(t_2, x_1, v^i_k), r(t_3, x_1, v^i_k)) \in \text{rf}$ (see Fig. 7c), as well as that if $(w(t_2, x_1, v^i_k), r(t_3, x_1, v^i_k)) \notin \text{rf}$ and $(w(t_2, x_1, v^i_t), r(t_3, x_1, v^i_t)) \notin \text{rf}$ then $(w(t_2, x_1, v^i_j), r(t_3, x_1, v^i_j)) \in \text{rf}$ (see Fig. 7d). □

**Corollary 3.4.** If $\overline{X} \models \text{Relaxed-Acyclic}$ then $\varphi$ is satisfiable.

### 3.3 Completeness

We now turn our attention to the completeness property, i.e., if $\varphi$ is satisfiable then $\overline{X}$ is consistent in the Relaxed-Acyclic model. To this end, we construct a reads-from relation $\text{rf}$ and a partial modification order $\overline{\text{mo}}$ as follows.

1. For each gadget, we insert $\text{rf}$-edges according to Figs. 4 to 7 and the truth assignments on literals $s_j, s_k, s_t$ involved in that gadget. In particular, this implies that, for each $i \in [m]$ and $j \in [n]$, we have $(w(t_1, x_1, v^i_j), r(t_3, x_1, v^i_j)) \in \text{rf}$ if $s_j = \bot$ and $(w(t_2, x_1, v^i_j), r(t_3, x_1, v^i_j)) \in \text{rf}$ if $s_j = \top$. Observe that $\text{rf}$ is fully specified, i.e., every read has been assigned a write.

2. For every two conflicting writes $w_1, w_2$ such that (i) $\text{tid}(w_1) \neq \text{tid}(w_2)$, and (ii) there exist two reads $r_1, r_2$ with $(r_1, r_2) \in \text{po}$ and $(w_i, r_i) \in \text{rf}$ for each $i \in [2]$, we have $(w_1, w_2) \in \overline{\text{mo}}$.
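Item 2 is a purely mechanical rule, which the following Python sketch spells out. The event encoding as `(kind, thread, location, value)` tuples, the dictionary form of `rf`, and the function name are assumptions made for the illustration.

```python
from itertools import product
from typing import Dict, Set, Tuple

Event = Tuple[str, str, str, str]  # hypothetical: (kind, thread, location, value)

def build_partial_mo(rf: Dict[Event, Event],
                     po: Set[Tuple[Event, Event]]) -> Set[Tuple[Event, Event]]:
    """Order w1 before w2 whenever they are conflicting writes of different
    threads that are read by r1 and r2, respectively, with (r1, r2) in po.

    rf maps each read to the write it reads from; po holds (earlier, later) pairs.
    """
    mo: Set[Tuple[Event, Event]] = set()
    for (r1, w1), (r2, w2) in product(rf.items(), repeat=2):
        conflicting = w1 != w2 and w1[2] == w2[2]   # distinct writes, same location
        different_threads = w1[1] != w2[1]
        if conflicting and different_threads and (r1, r2) in po:
            mo.add((w1, w2))
    return mo
```

The result is only a partial order on writes; extending it to a total modification order per location is a separate step.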

Proc. ACM Program. Lang., Vol. 8, No. POPL, Article 66. Publication date: January 2024.
We call a triplet \((w, r, w')\) on location \(x\), with \((w, r) \in rf\), safe in Relaxed-Acyclic if either \((w', r) \notin (po_x \cup rf_x)^+\) or \((w', w) \in (po_x \cup rf_x \cup \overline{mo}_x)^+\). To prove the consistency of \(\overline{X}\) it suffices to prove the following:

1. \((po \cup rf)\) is acyclic, and
2. \(\overline{mo}\) is minimally coherent for Relaxed-Acyclic, namely, (i) \((po_x \cup rf_x \cup \overline{mo}_x)\) is acyclic for every location \(x\), and (ii) every triplet is safe.

Indeed, minimal coherence guarantees that, for each location \(x\), there exists a total extension \(mo_x\) of \(\overline{mo}_x\) that satisfies relaxed-write-coherence and relaxed-read-coherence [Tunc et al. 2023].
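Both conditions are plain graph checks. The Python sketch below shows one naive way to test them on small instances, with relations given as sets of pairs; it uses a simple fixpoint transitive closure for readability rather than an efficient algorithm, and all names are illustrative.

```python
from typing import Hashable, Set, Tuple

Rel = Set[Tuple[Hashable, Hashable]]

def transitive_closure(edges: Rel) -> Rel:
    """Compute R^+ by repeatedly composing edges until nothing new appears."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

def is_acyclic(edges: Rel) -> bool:
    """A relation is acyclic iff its transitive closure is irreflexive."""
    return all(a != b for (a, b) in transitive_closure(edges))

def triplet_is_safe(w, r, w_prime, po_x: Rel, rf_x: Rel, mo_x: Rel) -> bool:
    """Safe iff (w', r) is not in (po_x ∪ rf_x)^+ or (w', w) is in (po_x ∪ rf_x ∪ mo_x)^+."""
    reaches_read = (w_prime, r) in transitive_closure(po_x | rf_x)
    ordered_before = (w_prime, w) in transitive_closure(po_x | rf_x | mo_x)
    return (not reaches_read) or ordered_before
```

For example, if $w'$ reaches $r$ by a po-edge and $r$ reads from $w$, the triplet is unsafe until the partial modification order places $w'$ before $w$.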

To simplify our analysis, we ignore the writes in threads \(\{h_c\}_{c \in [3]}\), \(p\) and \(q\) that are not read in \(rf\) (crossed out in Fig. 6 and Fig. 7). This makes our formal statements more uniform and does not affect the correctness of the analysis. Indeed, let \(InactiveWr = \{w \in W \mid tid(w) \in \{h_1, h_2, h_3, p, q\} \text{ and } \nexists r \in R.\ (w, r) \in rf\}\). The following lemma is straightforward.

**Lemma 3.5.** The following statements hold.

1. If there is a \((po \cup rf)\)-cycle then there is such a cycle without any write in \(InactiveWr\).
2. If there is a \((po_x \cup rf_x \cup mo_x)\)-cycle for some memory location \(x\) then there is such a cycle without any write in \(InactiveWr\).
3. If there is an unsafe triplet, then there is such a triplet \((w, r, w')\) where \(w' \notin \) \(InactiveWr\).

**Proof.** We prove each item separately.

1. We first observe that for any \(w \in \) \(InactiveWr\), there is no outgoing \(rf\) or even \(mo\) edge. Hence any cycle containing a write \(w \in \) \(InactiveWr\) must contain a sequence of edges \(e_1 \xrightarrow{po} w \xrightarrow{po} e_2\), which can be replaced by the single edge \(e_1 \xrightarrow{po} e_2\).

2. The proof is the same as for Item 1.

3. Consider a memory location \(x\) and an unsafe triplet \((w, r, w')\) on \(x\) with \(w' \in InactiveWr\). This means that there is a \((po_x \cup rf_x)\)-path \(P: w' \xrightarrow{po_x \cup rf_x} r\). Since the threads \(\{h_c\}_{c \in [3]}\), \(p\) and \(q\) contain only same-location writes, \(P\) must be of the form \(P: w' \xrightarrow{po_x} w'' \xrightarrow{rf_x} r'' \xrightarrow{po_x \cup rf_x} r\). Here, \(w''\) and \(r''\) conflict with \(w'\). Note that \(r'' \neq r\), otherwise \(w'' = w\) since \((w, r) \in rf\), implying that \((w, r, w')\) is safe. Since \((w, r, w')\) is unsafe, we have \((w', w) \notin (po_x \cup rf_x \cup mo_x)^+\) and hence, \((w'', w) \notin (po_x \cup rf_x \cup mo_x)^+\). Thus \((w, r, w'')\) is also unsafe, while clearly \(w'' \notin InactiveWr\). □

The completeness can now be established in three lemmas, one for each of the properties (1), (2i), and (2ii) above (i.e., \((po \cup rf)\)-acyclicity and minimal coherence). Before we proceed, we will use the following notation to make our analysis easier.

**Notation.** Given an event \(e\), we say that \(e\) appears in phase \(i\), written phase\((e) = i\), if it writes/reads a value superscripted by \(i\) (i.e., of the form \(v^i\) or \(\overline{v}^i\)). Similarly, we say that \(e\) appears in step \(j\), written step\((e) = j\), if it writes/reads a value subscripted by \(j\) (i.e., of the form \(v_j\) or \(\overline{v}_j\)). We define a quasi order \(\preceq\) on the event set \(E\) with \(e_1 \preceq e_2\) if either (i) phase\((e_1) <\) phase\((e_2)\) or (ii) phase\((e_1) = \) phase\((e_2)\) and step\((e_1) \leq \) step\((e_2)\). We also write \(e_1 \prec e_2\) to denote that \(e_1 \preceq e_2\) and \(e_2 \not\preceq e_1\). Now consider an arbitrary path \(P = e_1, \ldots, e_k\). We say that \(P\) crosses a thread \(t\) if \(tid(e_\ell) = t\) for some \(\ell \in [k]\). We call \(P\) monotonic if it linearizes \(\preceq\), i.e., \(e_\ell \preceq e_{\ell+1}\) for all \(\ell \in [k - 1]\).
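As a small illustration of this notation, the quasi order and the monotonicity test can be written as follows in Python, assuming each event is abstracted to a hypothetical `(phase, step)` pair.

```python
from typing import List, Tuple

PhaseStep = Tuple[int, int]  # hypothetical abstraction of an event: (phase, step)

def preceq(e1: PhaseStep, e2: PhaseStep) -> bool:
    """e1 ⪯ e2: smaller phase, or equal phase and step no larger."""
    (p1, s1), (p2, s2) = e1, e2
    return p1 < p2 or (p1 == p2 and s1 <= s2)

def strictly_before(e1: PhaseStep, e2: PhaseStep) -> bool:
    """e1 ≺ e2: e1 ⪯ e2 holds but e2 ⪯ e1 does not."""
    return preceq(e1, e2) and not preceq(e2, e1)

def is_monotonic(path: List[PhaseStep]) -> bool:
    """A path is monotonic if every consecutive pair respects ⪯."""
    return all(preceq(a, b) for a, b in zip(path, path[1:]))
```

Two distinct events of the same phase and step are related by $\preceq$ in both directions, which is why $\preceq$ is a quasi order rather than a partial order.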

We first establish the safety of each triplet.

**Lemma 3.6.** Every triplet \((w, r, w')\) is safe.

**Proof.** First, observe that the following hold by construction.

1. There is no memory location that is both written and read by the same thread.
2. For every two conflicting writes \( w_1, w_2 \), if there exist two reads \( r_1, r_2 \) such that \((w_i, r_i) \in rf\) for each \( i \in [2] \) and \((r_1, r_2) \in po\), then \((w_2, w_1) \notin po\).

Now, let \( x \) be the location accessed by a triplet \((w, r, w')\), and assume there exists a \((po_x \cup rf_x)\)-path \( P: w' \xrightarrow{(po_x \cup rf_x)^+} r \). Due to Item (1), \( P \) must leave a thread and enter another at most once. Thus, \( P \) must be of the form \( P: w' \xrightarrow{po_x^?} w'' \xrightarrow{rf_x} r'' \xrightarrow{po_x^?} r \). If \((w'', w) \in po_x^?\), we are done. Otherwise, due to Item (2), we cannot have \((w, w'') \in po_x\). Hence, \( w \) and \( w'' \) are in different threads, and by the construction of \( \overline{mo}_x \), we have \((w'', w) \in \overline{mo}_x\), implying that \((w', w) \in (po_x \cup \overline{mo}_x)^+\). Hence the triplet is safe, as desired. \( \square \)

Next, we establish the second ingredient of completeness, i.e., the acyclicity of \((po \cup rf)\).

**Lemma 3.7.** \((po \cup rf)\) is acyclic.

**Proof.** First, note that there is no \((po \cup rf)\)-cycle crossing either of \( g_2 \) and \( g_3 \), as these threads only contain writes, and hence there is no way to enter them by an \( rf\)-edge. Moreover, by construction, any \((po \cup rf)\)-path \( P \) that starts from a thread other than \( g_2 \) and \( g_3 \) is necessarily monotonic.

Indeed, all \( rf\)-edges connect events of the same phase and step, while the only \( po\)-edges \( e_1 \xrightarrow{po} e_2 \) with \( \text{phase}(e_1) > \text{phase}(e_2) \) appear in \( g_2 \) and \( g_3 \). Finally, the only \( po\)-edges \( e_1 \xrightarrow{po} e_2 \) with \( \text{phase}(e_1) = \text{phase}(e_2) \) and \( \text{step}(e_1) > \text{step}(e_2) \) occur in threads \( h_1, h_2, h_3, p \) and \( q \) (see Fig. 6 and Fig. 7). However, in light of Lemma 3.5, these edges can be ignored in our analysis.

Thus, any potential \((po \cup rf)\)-cycle has to traverse events of the same phase and step. A straightforward analysis of each instantiation of each gadget (see Figs. 4b and 4c, Figs. 5a and 5b, Figs. 6b to 6d, Figs. 7b to 7d), and their combinations on common events, establishes that no \((po \cup rf)\)-cycle exists, as desired. \( \square \)

Finally, we establish the acyclicity of \((po_x \cup rf_x \cup \overline{mo}_x)\).

**Lemma 3.8.** \((po_x \cup rf_x \cup \overline{mo}_x)\) is acyclic for all locations \( x \).

**Proof.** Observe that no memory location is both read and written by the same thread in any of the gadgets. Hence, any \((po_x \cup rf_x \cup \overline{mo}_x)\)-cycle \( C \) following an \( rf\)-edge \( w \xrightarrow{rf} r \) enters a thread that it cannot exit. This means that \( C \) can only contain events of the same thread, which is forbidden by Lemma 3.7. Thus we only need to reason about the absence of \((po_x \cup \overline{mo}_x)\)-cycles.

Observe that for every memory location \( x \) except \( z_1 \) and \( z_2 \), every \((po_x \cup \overline{mo}_x)\)-path is monotonic. Hence any potential \((po_x \cup \overline{mo}_x)\)-cycle must traverse paths of the same phase and step, and thus be contained in one of the gadgets. The absence of such cycles can be easily established by looking at the instantiations of these gadgets (Figs. 4 to 7).

On the other hand, consider the case \( x = z_1 \) (the analysis is similar for \( x = z_2 \)). Let \( C \) be a shortest \((po_{z_1} \cup \overline{mo}_{z_1})\)-cycle. If \( C \) contains only events of the same phase and step, it can be dismissed again by looking at the gadget in Fig. 5. Otherwise, \( C \) is non-monotonic and must thus traverse the non-monotonic edge \( w(g_2, z_1, u_j^{i+1}) \xrightarrow{po_{z_1}} w(g_2, z_1, \overline{u}_j^i) \). From there it can continue to threads \( g_1 \) […] is a shortest cycle. The desired result follows. □

Lemma 3.6, Lemma 3.7 and Lemma 3.8 imply the completeness of the reduction.

**Corollary 3.9.** If \( \varphi \) is satisfiable then \( \overline{X} \models \text{Relaxed-Acyclic} \).

4 HARDNESS FOR WRA, RA AND SRA

In this section we prove Theorem 1.1 for the range of models SRA \( \preceq M \preceq \) WRA, which also implies hardness specifically for SRA, RA, and WRA. Similarly to Section 3, our reduction uses unbounded values. Again, we will make the final step towards Theorem 1.1 in Section 6, which consists of a simple modification of the technique presented here.

### 4.1 Reduction

Given a monotone formula $\varphi = \{C_i\}_{i \in [m]}$ over $n$ variables $\{s_j\}_{j \in [n]}$, we construct an abstract execution $\overline{X} = (E, po)$ with $\kappa = 23$ threads, accessing $d = 26$ memory locations, such that $\varphi$ is satisfiable iff $\overline{X} \models M$, for any memory model $M$ such that SRA $\preceq M \preceq$ WRA. $\overline{X}$ follows the general scheme of Fig. 3. Again, in phase $i$ and step $j$ we have a read event $r(t_3, x_1, v^i_j)$ that can read from either of two writes $w(t_1, x_1, v^i_j)$ (corresponding to $s_j = \bot$) or $w(t_2, x_1, v^i_j)$ (corresponding to $s_j = \top$). We use the same types of gadgets as in the previous section to force the desirable interaction patterns between threads. However, the contents of each gadget are different to account for the different memory models. In particular, these gadgets now rely on the weak-read-coherence axiom to couple the readers in threads $t_3$ and $t_6$ and force the 1-in-3 property in each clause $C_i$, as opposed to the porf-acyclicity and relaxed-read-coherence axioms used in Section 3 for Relaxed-Acyclic.

**The copy gadget $\text{Copy}^i_j$.** The copy gadget $\text{Copy}^i_j$ (Fig. 8) guarantees that $r(t_3, x_1, v^i_j)$ reads from thread $t_1$ iff $r(t_6, x_2, v^i_j)$ reads from thread $t_4$. Besides $x_1$ and $x_2$, this gadget also uses locations $y_1$, $y_2$, $y_3$, $y_4$, $y_5$, $y_6$, $y_7$ and $y_8$. Moreover, the gadget also contains two $hb$-edges out of $r(t_6, x_2, v^i_j)$.

Though not shown explicitly, these $hb$-edges can be easily enforced by an $rf$-edge $w \xrightarrow{rf} r$, where $w$ and $r$ access a new memory location. These events play no role in our later analysis and will thus be ignored, i.e., we will only consider the $hb$-edges as they appear in the gadget.

**The copy-down gadget $\overline{\text{Copy}}^i_j$.** The copy-down gadget $\overline{\text{Copy}}^i_j$ (Fig. 9), as before, has a similar structure to the copy gadget $\text{Copy}^i_j$ and guarantees that $r(t_3, x_1, v^{i+1}_j)$ reads from thread $t_1$ iff $r(t_6, x_2, v^i_j)$ reads from thread $t_4$. Together, the two copy gadgets $\text{Copy}^i_j$ and $\overline{\text{Copy}}^i_j$ ensure that $r(t_3, x_1, v^i_j)$ reads from thread $t_1$ iff $r(t_3, x_1, v^{i+1}_j)$ reads from thread $t_1$, guaranteeing a valid truth assignment on $s_j$. Besides $x_1$ and $x_2$, this gadget also uses locations $z_1$, $z_2$, $z_3$, $z_4$, $z_5$, $z_6$, $z_7$ and $z_8$. Moreover, the gadget also contains two $hb$-edges out of $r(t_6, x_2, v^i_j)$, which, as argued above, will be ignored in our analysis.

**The at-most-one-true gadgets \( \text{AMOne}[c]^i_{j,k} \).** For each \( c \in [3] \), the at-most-one-true gadget (Fig. 10) guarantees that for every clause \( C_i \), the \( c \)-th pair of literals \( (s_j, s_k) \) in \( C_i \) is such that at most one of \( s_j \) and \( s_k \) is set to true. For each \( c \in [3] \), the corresponding gadget contains one additional memory location \( a_c \).

**The at-least-one-true gadget \( \text{ALOne}^i_{j,k,t} \).** The at-least-one-true gadget (Fig. 11) guarantees that for every clause \( C_i \) with three literals \( s_j, s_k, s_t \), at least one of them is set to true. It contains one additional memory location \( b \).

**Putting the gadgets together.** We obtain the abstract execution \( \overline{X} \) by appropriately connecting all gadgets and specifying the interleaving of events in common threads.

Fig. 11. The at-least-one-true gadget \( \text{ALOne}^i_{j,k,t} \).

First, we serially connect all gadgets in their common threads by po. In particular, \( \text{Copy}^{i_1}_{j_1} \) appears before \( \text{Copy}^{i_2}_{j_2} \) if \( i_1 < i_2 \), or \( i_1 = i_2 \) and \( j_1 < j_2 \); \( \overline{\text{Copy}}^{i_1}_{j_1} \) appears before \( \overline{\text{Copy}}^{i_2}_{j_2} \) if \( i_1 < i_2 \), or \( i_1 = i_2 \) and \( j_1 < j_2 \); \( \text{AMOne}[c]^{i_1}_{j_1,k_1} \) appears before \( \text{AMOne}[c]^{i_2}_{j_2,k_2} \) if \( i_1 < i_2 \); and finally \( \text{ALOne}^{i_1}_{j_1,k_1,t_1} \) appears before \( \text{ALOne}^{i_2}_{j_2,k_2,t_2} \) if \( i_1 < i_2 \). Second, observe that the threads \( t_i \), for \( i \in [3] \), appear in multiple gadgets, thus we have to specify how to interleave their corresponding events. We do so by first fixing a total order on memory locations \( \sigma = y_1, y_2, y_3, z_1, z_2, z_3, a_1, a_2, a_3, b \).

(1) The read events succeeding every \( r(t_3, x_1, v^i_j) \) in each gadget (i.e., those on locations \( \{y_t, z_t\}_{t \in [4]} \), \( \{a_c\}_{c \in [3]} \) and \( b \)) are po-ordered according to \( \sigma \) (note that reads on some locations, such as \( a_c \) or \( b \), might not appear at all after \( r(t_3, x_1, v^i_j) \), e.g., if clause \( C_i \) does not contain variable \( s_j \)). Moreover, all these reads appear before any read on \( x_1 \) that is a po-successor of \( r(t_3, x_1, v^i_j) \) (in particular, reads on \( x_1 \) with phase \( i \) or reads with phase \( i+1 \) and step \( j \)).

(2) Similarly, the write events preceding every \(w(t_1, x_1, v^i_j)\) in each gadget are ordered according to \(\sigma\). Moreover, all these writes appear after any write on \(x_1\) that is a po-predecessor of \(w(t_1, x_1, v^i_j)\). Likewise for \(w(t_2, x_1, v^i_j)\), \(w(t_4, x_2, v^i_j)\) and \(w(t_5, x_2, v^i_j)\).

Finally, observe that we have indeed used \(\kappa = 23\) threads and \(d = 26\) memory locations. For the latter, we also count one extra memory location for each hb-edge out of thread \(t_6\), thus \(4\) in total.
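As a quick check of this count, the tally below adds up the locations named in the gadget descriptions of this section: the two shared locations, the eight locations of each of the two copy gadgets, one location per at-most-one-true gadget, the location of the at-least-one-true gadget, and the four auxiliary locations for the hb-edges. It assumes that no location other than those named above is used.

\[
d \;=\; \underbrace{2}_{x_1,\, x_2} \;+\; \underbrace{8}_{y_1, \ldots, y_8} \;+\; \underbrace{8}_{z_1, \ldots, z_8} \;+\; \underbrace{3}_{a_1,\, a_2,\, a_3} \;+\; \underbrace{1}_{b} \;+\; \underbrace{4}_{\text{hb-edges}} \;=\; 26.
\]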

### 4.2 Soundness

We start with the soundness of the reduction, in particular, if \(\overline{X} \models \text{WRA}\) (which holds, in particular, whenever \(\overline{X} \models M\)) then \(\varphi\) is satisfiable. We achieve this by proving a sequence of intermediate lemmas similar to Section 3; however, each lemma now requires reasoning about the semantics of WRA. Recall that we assign \(s_j = \top\) if \((w(t_2, x_1, v^i_j), r(t_3, x_1, v^i_j)) \in \text{rf}\) for all \(i \in [m]\), and \(s_j = \bot\) if \((w(t_1, x_1, v^i_j), r(t_3, x_1, v^i_j)) \in \text{rf}\) for all \(i \in [m]\).

The first lemma is based on the copy gadgets \(\text{Copy}^i_j\) and \(\overline{\text{Copy}}^i_j\), and states that each phase of \(X\) makes consistent choices for the writer of \(r(t_3, x_1, v^i_j)\) (i.e., whether it reads from \(t_1\) or \(t_2\)), which makes the above assignment well-defined. It follows from the high-level description of these two gadgets and the accompanying Fig. 8 and Fig. 9.

**Lemma 4.1.** Let \(X = (E, \text{po}, \text{rf}, \text{mo})\) be a concrete extension of \(\overline{X}\) that satisfies weak-read-coherence. For all \(i_1, i_2 \in [m]\) and \(j \in [n]\), we have \((w(t_1, x_1, v^{i_1}_j), r(t_3, x_1, v^{i_1}_j)) \in \text{rf}\) iff \((w(t_1, x_1, v^{i_2}_j), r(t_3, x_1, v^{i_2}_j)) \in \text{rf}\).

The next lemma is based on the at-most-one-true gadgets \(\text{AMOne}[c]^i_{j,k}\), and states that for every clause \(C_i\) and for each of the three pairs of variables \((s_j, s_k)\) in \(C_i\), at most one of them is assigned to true. Again, it follows by a direct analysis of the accompanying Fig. 10.

**Lemma 4.2.** Let \(X = (E, \text{po}, \text{rf}, \text{mo})\) be a concrete extension of \(\overline{X}\) that satisfies weak-read-coherence. For every \(i \in [m]\) and \(j, k \in [n]\) such that \(s_j\) and \(s_k\) appear in clause \(C_i\), we have \((w(t_2, x_1, v^i_j), r(t_3, x_1, v^i_j)) \notin \text{rf}\) or \((w(t_2, x_1, v^i_k), r(t_3, x_1, v^i_k)) \notin \text{rf}\).

The third lemma is based on the at-least-one-true gadget \(\text{ALOne}^i_{j,k,t}\), and it is used to show that for every clause \(C_i = (s_j, s_k, s_t)\), at least one of its variables is assigned to true. Again, it follows by a direct analysis of the accompanying Fig. 11.

**Lemma 4.3.** Let \(X = (E, \text{po}, \text{rf}, \text{mo})\) be a concrete extension of \(\overline{X}\) that satisfies weak-read-coherence. For every \(i \in [m]\) and \(j, k, t \in [n]\) such that \(s_j, s_k\) and \(s_t\) appear in clause \(C_i\), we have \((w(t_2, x_1, v^i_j), r(t_3, x_1, v^i_j)) \in \text{rf}\) or \((w(t_2, x_1, v^i_k), r(t_3, x_1, v^i_k)) \in \text{rf}\) or \((w(t_2, x_1, v^i_t), r(t_3, x_1, v^i_t)) \in \text{rf}\).

Lemma 4.1 states that our truth assignment for \(\phi\) is valid, while Lemma 4.2 and Lemma 4.3 guarantee that in every clause, exactly one literal is set to true. Hence we have the following corollary.

**Corollary 4.4.** If \(\overline{X} \models \text{WRA}\) then \(\phi\) is satisfiable.
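
As a small sanity check of how Lemma 4.2 and Lemma 4.3 combine, the following sketch enumerates all assignments to a three-variable clause and confirms that pairwise at-most-one together with at-least-one is the same as exactly-one. The function names and the example clauses are illustrative assumptions and do not come from the construction itself.

```python
from itertools import combinations, product

def at_most_one(values):
    """Pairwise constraint in the spirit of Lemma 4.2: no two variables are both true."""
    return all(not (a and b) for a, b in combinations(values, 2))

def at_least_one(values):
    """Constraint in the spirit of Lemma 4.3: some variable is true."""
    return any(values)

def exactly_one_per_clause(clauses, assignment):
    """Check that each clause (a tuple of variable names) has exactly one true variable."""
    return all(sum(assignment[v] for v in clause) == 1 for clause in clauses)

# Over all 8 assignments to a 3-variable clause, the two lemma-style
# constraints hold together precisely when exactly one variable is true.
EQUIVALENT = all(
    (at_most_one(vals) and at_least_one(vals)) == (sum(vals) == 1)
    for vals in product([False, True], repeat=3)
)

# Hypothetical formula with two clauses over variables s1..s4.
example_clauses = [("s1", "s2", "s3"), ("s2", "s3", "s4")]
example_assignment = {"s1": True, "s2": False, "s3": False, "s4": True}
```

Here `exactly_one_per_clause(example_clauses, example_assignment)` evaluates the notion of satisfiability used in Corollary 4.4 on a concrete assignment.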

### 4.3 Completeness

We now turn our attention to the completeness property, i.e., if \(\phi\) is satisfiable then \(\overline{X} \models \text{SRA} \) (and thus also \(\overline{X} \models \text{M} \)). We use the notions of phase, step, and monotonicity as in Section 3. We construct a reads-from relation \(\text{rf}\) and a partial modification order \(\overline{\text{mo}}\) as follows.

(1) For each gadget, we insert \(\text{rf}\) and \(\text{mo}\) edges according to Figs. 8 to 11 and the truth assignments on variables \(s_j, s_k, s_t\) involved in that gadget. In particular, this implies that, for each \(i \in [m]\) and \(j \in [n]\), we have \((w(t_1, x_1, v^i_j), r(t_3, x_1, v^i_j)) \in \text{rf}\) if \(s_j = \bot\) and \((w(t_2, x_1, v^i_j), r(t_3, x_1, v^i_j)) \in \text{rf}\) if \(s_j = \top\). Observe that \(\text{rf}\) is fully specified, i.e., every read has a write to read from. Note that for every pair \((w, r) \in \text{rf}\) we have \(\text{phase}(w) = \text{phase}(r)\) and \(\text{step}(w) = \text{step}(r)\), thus \(r \preceq w\).

(2) For every two conflicting writes $w_1, w_2$ such that (i) $w_1 < w_2$, and (ii) $(w_1, w_2) \notin \text{hb}$, we have $(w_1, w_2) \in \text{mo}$. Thus, for any two conflicting writes $w_1, w_2$ with $w_1 < w_2$ we have $(w_1, w_2) \in (\text{hb} \cup \text{mo})^\ast$.
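
Rule (2) can be sketched as follows. The sketch assumes, purely for illustration, that the order $w_1 < w_2$ compares hypothetical (phase, step) pairs lexicographically; the actual order is the one defined in Section 3. All names are illustrative.

```python
def build_partial_mo(writes, loc, order_key, hb):
    """Order conflicting writes by `order_key` unless hb already relates them.

    writes    : iterable of write identifiers
    loc       : dict mapping each write to its memory location
    order_key : dict mapping each write to a comparable key
                (assumption: a (phase, step) pair)
    hb        : set of (e1, e2) pairs, assumed transitively closed
    """
    mo = set()
    for w1 in writes:
        for w2 in writes:
            conflicting = w1 != w2 and loc[w1] == loc[w2]
            if conflicting and order_key[w1] < order_key[w2] and (w1, w2) not in hb:
                mo.add((w1, w2))
    return mo

# Hypothetical example: writes a and b conflict on x, c is on another location.
writes = ["a", "b", "c"]
loc = {"a": "x", "b": "x", "c": "y"}
order_key = {"a": (1, 1), "b": (1, 2), "c": (2, 1)}

mo_without_hb = build_partial_mo(writes, loc, order_key, hb=set())
mo_with_hb = build_partial_mo(writes, loc, order_key, hb={("a", "b")})
```

When hb already orders the two conflicting writes, no mo edge is added, so every ordered conflicting pair ends up related by either hb or mo.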

We call a triplet $(w, r, w')$ safe if either $(w', r) \notin \text{hb}$ or $(w', w) \in (\text{hb} \cup \overline{\text{mo}})^\ast$. To prove the consistency of $\overline{X}$, it suffices to argue that $\overline{\text{mo}}$ is minimally coherent, namely, that (i) $(\text{hb} \cup \overline{\text{mo}})$ is acyclic, and (ii) every triplet $(w, r, w')$ is safe. Indeed, these two conditions guarantee that $\overline{\text{mo}}$ can be linearized to a total $\text{mo}$ such that the resulting extension $X = (E, \text{po}, \text{rf}, \text{mo})$ of $\overline{X}$ satisfies $X \models \text{SRA}$ [Tunç et al. 2023], which also implies $X \models \text{M}$.
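
The two conditions of minimal coherence can be checked mechanically on small inputs. The following sketch assumes that hb is the transitive closure of po and rf (the usual release-acquire reading, not restated in this section) and that a triplet $(w, r, w')$ consists of a pair $(w, r) \in \text{rf}$ and a different write $w'$ on the same location. The event names and the example execution are hypothetical.

```python
def transitive_closure(nodes, edges):
    """Transitive closure of a binary relation given as a set of pairs."""
    reach = {n: set() for n in nodes}
    for a, b in edges:
        reach[a].add(b)
    for k in nodes:                      # Warshall-style propagation
        for a in nodes:
            if k in reach[a]:
                reach[a] |= reach[k]
    return {(a, b) for a in nodes for b in reach[a]}

def minimally_coherent(events, loc, is_write, po, rf, mo):
    hb = transitive_closure(events, po | rf)     # assumption: hb = (po U rf)^+
    hb_mo = transitive_closure(events, hb | mo)
    # Condition (i): hb U mo has no cycle.
    if any((e, e) in hb_mo for e in events):
        return False
    # Condition (ii): every triplet (w, r, w_other) is safe.
    for w, r in rf:
        for w_other in events:
            if w_other == w or not is_write[w_other] or loc[w_other] != loc[w]:
                continue
            if (w_other, r) in hb and (w_other, w) not in hb_mo:
                return False                     # unsafe triplet found
    return True

# Hypothetical execution: thread 1 runs w1(x); w2(x); w3(y), thread 2 runs ry; rx.
events = ["w1", "w2", "w3", "ry", "rx"]
loc = {"w1": "x", "w2": "x", "w3": "y", "ry": "y", "rx": "x"}
is_write = {e: e.startswith("w") for e in events}
po = {("w1", "w2"), ("w2", "w3"), ("w1", "w3"), ("ry", "rx")}

rf_stale = {("w3", "ry"), ("w1", "rx")}   # rx reads the overwritten w1
rf_fresh = {("w3", "ry"), ("w2", "rx")}   # rx reads the latest visible write w2
```

In the `rf_stale` variant the triplet `(w1, rx, w2)` is unsafe, because `w2` happens before `rx` but is not ordered before `w1`; the `rf_fresh` variant passes both conditions with an empty partial mo.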

In order to simplify our analysis, we again ignore the writes in threads $\{h_c\}_{c \in [3]}, p$, and $q$ that are not read in $\text{rf}$ (crossed-out in Fig. 10 and Fig. 11). This allows us to make our formal statements more uniform, while it does not affect the correctness of the analysis. Indeed, let $\text{InactiveWr} = \{w \in W : \text{tid}(w) \in \{h_1, h_2, h_3, p, q\} \text{ and } \nexists r \in R. (w, r) \in \text{rf}\}$. The following lemma is straightforward.

**Lemma 4.5.** The following statements hold.

1. If there is an $(\text{hb} \cup \overline{\text{mo}})$-cycle then there is such a cycle without any write in $\text{InactiveWr}$.
2. If there is an unsafe triplet, then there is such a triplet $(w, r, w')$ where $w' \notin \text{InactiveWr}$.

We start with condition (i) of minimal coherence, i.e., we need to show that $(\text{hb} \cup \overline{\text{mo}})$ is acyclic. Observe that each individual gadget is free from $(\text{hb} \cup \overline{\text{mo}})$-cycles, regardless of how we resolve the $\text{rf}$ edges associated with it (see Figs. 8b and 8c, Figs. 9a and 9b, Figs. 10b to 10d, Figs. 11b to 11d). However, we also have to argue that the interleaving of these gadgets is free from $(\text{hb} \cup \overline{\text{mo}})$-cycles.

Our first key lemma states that $\text{hb}$ paths between writes are, without loss of generality, monotonic. This is based on three observations. First, due to Lemma 4.5, we can ignore writes in the threads $h_1, h_2, h_3, p$ and $q$, which contain $\text{po}$-edges that would violate this statement. Second, all $\text{rf}$-edges connect events of the same phase and step. Third, the only $\text{po}$-edges that are non-monotonic enter read events (in particular, a read $r(v^i_j, z_6, g_1)$ or a read $r(v^i_j, z_8, g_4)$ in $\text{Copy}_j$). Since the only possible continuation of an $\text{hb}$-path out of a read event is to take another $\text{po}$-edge, we can remove the first non-monotonic edge (as $\text{po}$ is transitive) and obtain a new valid $\text{hb}$-path. Repeating this process results in a monotonic $\text{hb}$-path between the writes. Formally, we have the following.

**Lemma 4.6.** For every two writes $w_1, w_2$, if $(w_1, w_2) \in \text{hb}$ then there exists a monotonic $\text{hb}$-path $w_1 \overset{\text{hb}}{\rightarrow} w_2$.

We can now prove the acyclicity condition of minimal coherence. Intuitively, any potential $(\text{hb} \cup \overline{\text{mo}})$-cycle $C$ can be seen as a sequence of write events connected by $\text{hb}$ and $\overline{\text{mo}}$. By construction, every edge $w_1 \overset{\text{mo}}{\rightarrow} w_2$ is monotonic, while, due to Lemma 4.6, every subpath $w_1 \overset{\text{hb}}{\rightarrow} w_2$ of $C$ is, without loss of generality, monotonic. Thus $C$ is monotonic, and since it is a cycle, every event in $C$ has the same phase and step. The absence of such cycles can then be directly established by inspecting the gadgets in Figs. 8 to 11.

**Lemma 4.7.** $(\text{hb} \cup \overline{\text{mo}})$ is acyclic.

Next, we turn our attention to the second condition of minimal coherence, i.e., we argue that every triplet is safe. We first prove a general statement that prohibits \( \text{hb} \)-paths to a read \( r \) from writes \( w' \) that are po-successors to the write \( w \) that \( r \) reads from. This will help us establish the safety of each triplet, and will also prove useful later in Section 5.1 when we address Causal Memory.

**Lemma 4.8.** For every pair \((w, r) \in \text{rf}\) and write \( w' \) with \((w, w') \in \text{po}\), we have \((w', r) \notin \text{hb}\).

To realize Lemma 4.8, we first argue that any path \( P: w' \xrightarrow{\text{hb}} r \) contains events of the same phase and step. Indeed, as no location is ever written and read by the same thread, \( P \) has the general form \( P: w' \xrightarrow{\text{hb}} w'' \xrightarrow{\text{rf}} r'' \xrightarrow{\text{po}} r \) for some write \( w'' \) and read \( r'' \). Due to Lemma 4.6, the subpath \( w' \xrightarrow{\text{hb}} w'' \) is monotonic (wlog), while the last two edges of \( P \) are also monotonic (rf-edges are monotonic, while non-monotonic po-edges go from writes to reads). Hence \( P \) is monotonic. On the other hand, we have \( w \leq w' \) (as \((w, w') \in \text{po}\)), while, by construction, \( r \leq w \). Hence \( r \leq w' \), and since \( P \) is monotonic, it must contain only events of the same phase and step. In particular, \( P \) must be contained in the gadgets in Figs. 8 to 11. The absence of such paths \( P \) can then be established by a careful inspection of these gadgets.

We can now prove the safety of each triplet \((w, r, w')\). Intuitively, if \( w' < w \), then we have \((w', w) \in \text{mo}\) by construction. On the other hand, if \( w < w' \), Lemma 4.6 and Lemma 4.8 exclude the existence of \( \text{hb} \)-paths \( w' \xrightarrow{\text{hb}} r \). Hence, it again suffices to only consider \( \text{hb} \)-paths \( P: w' \xrightarrow{\text{hb}} r \) that are contained in the same gadget. Again, a careful inspection of Figs. 8 to 11 and the use of Lemma 4.8 show that \((w, r, w')\) is indeed safe.

**Lemma 4.9.** Every triplet \((w, r, w')\) is safe.

Lemma 4.7 and Lemma 4.9 show that \( \overline{\text{mo}} \) is indeed minimally coherent, which implies that \( X = (E, \text{po}, \text{rf}, \text{mo}) \models \text{SRA} \). Thus we have the following corollary.

**Corollary 4.10.** If \( \varphi \) is satisfiable then \( \overline{X} \models \text{SRA} \).

Together, Corollary 4.4 and Corollary 4.10 establish the correctness of the reduction, i.e., \( \varphi \) is satisfiable iff \( \overline{X} \models M \) for any memory model \( \text{SRA} \preceq M \preceq \text{WRA} \).

## 5 IMPLICATIONS AND OTHER MEMORY MODELS

Our proof of Theorem 1.1 is strong enough to yield hardness on other popular memory models across different domains. In this section, we explore its implications.

**Causal Consistency models.** In a distributed setting, consistency commonly captures the concept of causality. Three of the most standard causal consistency models are the basic Causal Consistency (CC), Causal Convergence (CCv), and Causal Memory (CM) [Bouajjani et al. 2017]. It was recently shown that CC coincides with WRA while CCv coincides with SRA [Lahav and Boker 2022]. Thus, Theorem 1.1 implies NP-completeness for all models between CCv and CC. In Section 5.1 we also establish NP-completeness for CM, by extending the proof of Theorem 1.1, thereby completing Theorem 1.2.

**Hardware memory models.** Next, we turn our attention to some popular hardware memory models, namely, for the POWER and x86-64-TSO architectures, as well as PSO. We show that Theorem 1.1 implies NP-completeness for POWER, but consistency checks for TSO/PSO run in polynomial time.

Given an event \(e\), the observed-before relation for \(e\) is the smallest transitive relation \(\text{ob}_e \subseteq E \times E\) with the following properties.

1. For every \((e_1, e_2) \in \text{hb}\) such that \((e_i, e) \in \text{hb}\) for each \(i \in [2]\), we have \((e_1, e_2) \in \text{ob}_e\).
2. For every conflicting triplet \((w, r, w')\) such that (i) \((w', r) \in \text{ob}_e\) and (ii) \((r, e) \in \text{po}\), we have \((w', w) \in \text{ob}_e\).

Intuitively, when the thread \(t\) of \(r\) executed \(r\), it must have observed \(w\) after \(w'\), so that \(r\) indeed obtained its value from \(w\). The \(\text{ob}_e\) relation specifies that this ordering cannot change later when \(t\) executes \(e\). Notice the fixpoint style of the above definition. As we order \((w_1, w_2) \in \text{ob}_e\) (and since the relation is transitive and contains \(\text{hb}\)), more and more write-read pairs satisfy property (i) \((w', r) \in \text{ob}_e\), triggering the addition of new orderings \((w', w) \in \text{ob}_e\). Other works refer to this relation as “happened-before” for \(e\); we avoid this term here to not confuse it with \(\text{hb}\).

For any two events \(e_1, e_2\) with \((e_1, e_2) \in \text{po}\), we have $\text{ob}_{e_1} \subseteq \text{ob}_{e_2}$, i.e., $\text{ob}$ grows monotonically as we go downwards in each thread. The observed-before relation for a thread $t$ is defined as $\text{ob}_t = \text{ob}_{e_{\max}}$, where $e_{\max}$ is the po-maximal event of $t$. The new axiom requires that $\text{ob}_t$ is irreflexive [Bouajjani et al. 2017].

- $\text{ob}_t$ is irreflexive for each thread $t$ (ob-acyclicity)

In turn, CM is equal to WRA with ob-acyclicity as an extra axiom.

- $(\text{porf-acyclicity}) \land (\text{weak-read-coherence}) \land (\text{ob-acyclicity})$ [CM]

Observe that WRA $\preceq$ CM, but CM is incomparable with RA/SRA, i.e., CM allows executions that are inconsistent in RA/SRA and vice versa. See Fig. 12a and Fig. 12b for illustrations.

Next, we prove the completeness of the construction, i.e., if $\varphi$ is satisfiable then $\overline{X} \models \text{CM}$. Consider the reads-from relation $\text{rf}$ and the partial modification order $\overline{\text{mo}}$ exactly as constructed in the completeness argument of Section 4 (i.e., following the gadgets in Figs. 8 to 11). Let $X = (E, \text{po}, \text{rf}, \text{mo})$ be the execution witnessing the SRA-consistency of $\overline{X}$ according to Corollary 4.10. Since SRA satisfies the porf-acyclicity and weak-read-coherence axioms, we only need to argue that $X$ also satisfies ob-acyclicity to conclude that $X \models \text{CM}$. For this, we have to establish some additional lemmas.

Our first lemma stems from Lemma 4.8 and states an important property of $\text{rf}$: for every pair $(w, r) \in \text{rf}$, $w$ has no $\text{hb}$-path to po-predecessors of $r$. In other words, the first event of the thread of $r$ that $w$ can reach by means of an $\text{hb}$-path is $r$ itself via the $\text{rf}$-edge $w \xrightarrow{\text{rf}} r$.

**Lemma 5.2.** For every $(w, r) \in \text{rf}$ and event $e$ such that $(e, r) \in \text{po}$, we have $(w, e) \notin \text{hb}$.

Next, we define a “one-hop” variant of $\text{ob}$. Given an event $e$, the one-hop observed-before relation for $e$ is the smallest transitive relation $\text{ob}^1_e \subseteq E \times E$ with the following properties.

1. For every $(e_1, e_2) \in \text{hb}$ such that $(e_i, e) \in \text{hb}$ for each $i \in [2]$, we have $(e_1, e_2) \in \text{ob}^1_e$.
2. For every conflicting triplet $(w, r, w')$ such that (i) $(w', r) \in \text{hb}$ and (ii) $(r, e) \in \text{po}$, we have $(w', w) \in \text{ob}^1_e$.

Contrasting $\text{ob}^1_e$ to $\text{ob}_e$, the only difference is in condition (2i): $\text{ob}_e$ checks whether $(w', r) \in \text{ob}_e$, while $\text{ob}^1_e$ checks the weaker condition $(w', r) \in \text{hb}$. Thus $\text{ob}^1_e$ does not have the fixpoint style of $\text{ob}_e$. Similarly to $\text{ob}_t$, we let $\text{ob}^1_t = \text{ob}^1_{e_{\max}}$, where $e_{\max}$ is the po-maximal event of thread $t$.
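
To make the difference between the two definitions concrete, the following sketch computes both relations on a small example: $\text{ob}_e$ by iterating to a fixpoint, and $\text{ob}^1_e$ in a single pass. It takes the two properties literally as stated above (with strict hb and po), assumes hb is the transitive closure of po and rf, and uses hypothetical event names.

```python
def transitive_closure(nodes, edges):
    """Transitive closure of a binary relation given as a set of pairs."""
    reach = {n: set() for n in nodes}
    for a, b in edges:
        reach[a].add(b)
    for k in nodes:
        for a in nodes:
            if k in reach[a]:
                reach[a] |= reach[k]
    return {(a, b) for a in nodes for b in reach[a]}

def observed_before(events, e, hb, po, triplets, one_hop=False):
    """triplets: conflicting (w, r, w_other) with (w, r) in rf."""
    # Property 1: hb-edges between events that both hb-reach e.
    base = {(a, b) for a, b in hb if (a, e) in hb and (b, e) in hb}
    ob = transitive_closure(events, base)
    while True:
        trigger = hb if one_hop else ob          # condition (2i)
        extra = {(w_other, w) for (w, r, w_other) in triplets
                 if (w_other, r) in trigger and (r, e) in po}
        grown = transitive_closure(events, ob | extra)
        if one_hop or grown == ob:               # one pass, or fixpoint reached
            return grown
        ob = grown

# Hypothetical execution: thread 1 runs w1(x); w2(x); w3(y),
# thread 2 runs ry; rx; end, where `end` is its po-maximal event.
events = ["w1", "w2", "w3", "ry", "rx", "end"]
po = {("w1", "w2"), ("w1", "w3"), ("w2", "w3"),
      ("ry", "rx"), ("ry", "end"), ("rx", "end")}

hb_stale = transitive_closure(events, po | {("w3", "ry"), ("w1", "rx")})
ob_stale = observed_before(events, "end", hb_stale, po, [("w1", "rx", "w2")])

hb_fresh = transitive_closure(events, po | {("w3", "ry"), ("w2", "rx")})
ob_fresh = observed_before(events, "end", hb_fresh, po, [("w2", "rx", "w1")])
ob1_fresh = observed_before(events, "end", hb_fresh, po, [("w2", "rx", "w1")], one_hop=True)
```

In the first variant `rx` reads from the overwritten `w1`, which places both `(w1, w2)` and `(w2, w1)` in the relation and hence violates ob-acyclicity; in the second variant the relation stays irreflexive.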

Our next lemma states that for our execution $X$, $\text{ob}_e$ coincides with $\text{ob}^1_e$. In other words, $\text{ob}_e$ reaches a fixpoint after only one iteration. This observation stems from Lemma 5.2. Intuitively, since for each triplet $(w, r, w')$, $w$ cannot $\text{hb}$-reach any po-predecessor of $r$, traversing an edge $w' \xrightarrow{\text{ob}_e} w$ cannot lead to any events of the thread of $r$ that weren’t already reachable via the $\text{hb}$-path $w' \xrightarrow{\text{hb}} r$ that made us insert $(w', w) \in \text{ob}_e$ in the first place (see Fig. 12c). Hence, adding such an ordering $(w', w) \in \text{ob}_e$ cannot lead to further firings of condition (2i) of $\text{ob}_e$. Formally, we have the following.

**Lemma 5.3.** For every thread $t$, we have $\text{ob}^1_t = \text{ob}_t$.

Finally, observe that whenever we add $(w', w) \in \text{ob}^1_t$, we have $(w', r) \in \text{hb}$. Due to Lemma 4.9, the triplet $(w, r, w')$ is safe, thus $(w', w) \in (\text{hb} \cup \overline{\text{mo}})^*$. Hence, the acyclicity of $\text{ob}_t = \text{ob}^1_t$ follows from the acyclicity of $(\text{hb} \cup \overline{\text{mo}})$ (Lemma 4.7). We thus have the following lemma, which, together with Corollary 5.1, completes the proof of case (ii) of Theorem 1.2.

**Lemma 5.4.** If $\varphi$ is satisfiable, then $\overline{X} \models \text{CM}$.

---

Proc. ACM Program. Lang., Vol. 8, No. POPL, Article 66. Publication date: January 2024.
Fig. 13. (a) An execution consistent in TSO, as well as in PSO and all other memory models we have considered, but not SC. (b) An execution consistent in PSO, as well as in Relaxed-Acyclic but not in TSO or even WRA. (c) An execution inconsistent in TSO/PSO, but consistent in CC/WRA.

### 5.2 Implications for POWER

The memory model of the POWER architecture is defined on load, store, atomic read-modify-write memory accesses, and various types of fences. POWER orders memory accesses based on fences, address, data, and control dependencies, while, again, coherence forces a total order on same-location accesses. In addition, POWER defines two global orderings, namely, happens-before, and propagation. The happens-before relation is based on dependencies, fences, and the rf relation across threads. The propagation relation captures the propagation of read and written values by combining fences, happens-before, rf, and mo. Based on these relations POWER defines its consistency axioms, which we will not present here; instead, we refer the interested readers to [Alglave et al. 2014]. Lahav et al. [2016] showed that SRA captures precisely the guarantees of POWER for programs that are compiled from the release-acquire fragment of C/C++. In turn, this implies that the result established in Section 4 for SRA transfers over to POWER. We thus have the following corollary.

**Corollary 1.3.** Consistency testing for bounded inputs is NP-complete for POWER.

### 5.3 What About x86-TSO and PSO?

Our results so far prove strong hardness for testing a variety of weak-memory models. In contrast, in this section, we outline that the problem is solvable in polynomial time for x86-TSO and PSO. Conceptually, this is less surprising for TSO, which diverges only a little from SC, but is more so for PSO, which allows for behaviors that are not even causally consistent.

**Total Store Order.** The TSO model deviates from SC by introducing a write-buffer for each thread, which acts as a FIFO queue [Sewell et al. 2010]. When a thread \( t \) executes a write \( w(t, x) \), this does not modify the shared memory immediately and is thus not visible to the other threads. Instead, \( w(t, x) \) is stored in the buffer of \( t \). The buffer non-deterministically flushes some of its writes to the shared memory, at which point they become visible to the other threads. On the other hand, when a thread \( t \) executes a read \( r(t, x) \), it is forced to read from the most recent write \( w(t, x) \) in \( t \)'s buffer. If no such write exists then \( r(t, x) \) reads from the shared memory. See Fig. 13a for an illustration.

For capturing the complexity of consistency testing of an abstract execution \( \overline{X} = (E, \text{po}) \) under TSO, it is helpful to switch to operational semantics. The semantics are defined by means of a labeled transition system \( \mathcal{L}_{\text{TSO}} \). At a high level, a state in \( \mathcal{L}_{\text{TSO}} \) is a triplet \( (P, B, M) \), where

1. \( P \subseteq E \) is the set of events that have been executed so far.
2. \( B : \mathcal{T} \rightarrow (W)^{+} \) maps every thread \( t \) to a sequence of writes \( w(t, x_1), w(t, x_2), \ldots, w(t, x_i) \), which represents the state of the buffer of thread \( t \).
3. \( M : \mathcal{V} \rightarrow W \) maps every memory location of the shared memory to the most recent write to it.

A counting argument shows that the size of \( \mathcal{L}_{\text{TSO}} \) is bounded by \( k^d \cdot n^{O(k^2)} \), for \( n = |E| \), \( k \) threads, and \( d \) locations, and thus becomes polynomial when \( k, d = O(1) \).
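
The operational view can be illustrated with a small memoized search over such states. This is only a sketch under simplifying assumptions, not the construction used for the bound above: a state records how many events each thread has executed, how many of its writes have been flushed, and the current value of each location; every location is assumed to start with the hypothetical initial value `0`; and a read is matched against writes by value. All names are illustrative.

```python
def bump(counts, i):
    """Return a copy of the tuple `counts` with entry i incremented."""
    return counts[:i] + (counts[i] + 1,) + counts[i + 1:]

def tso_consistent(threads, init=0):
    """threads: list of per-thread lists of ("w" | "r", location, value)."""
    k = len(threads)
    locs = sorted({x for t in threads for (_, x, _) in t})
    index = {x: i for i, x in enumerate(locs)}
    writes = [[ev for ev in t if ev[0] == "w"] for t in threads]
    start = ((0,) * k, (0,) * k, (init,) * len(locs))
    seen, stack = {start}, [start]
    while stack:
        pos, flushed, mem = stack.pop()
        if all(pos[t] == len(threads[t]) for t in range(k)):
            return True                      # every event has been executed
        succs = []
        for t in range(k):
            issued = sum(ev[0] == "w" for ev in threads[t][:pos[t]])
            if flushed[t] < issued:          # flush the oldest buffered write of t
                _, x, v = writes[t][flushed[t]]
                new_mem = mem[:index[x]] + (v,) + mem[index[x] + 1:]
                succs.append((pos, bump(flushed, t), new_mem))
            if pos[t] < len(threads[t]):
                kind, x, v = threads[t][pos[t]]
                if kind == "w":              # a write first enters the buffer
                    succs.append((bump(pos, t), flushed, mem))
                else:                        # a read checks its own buffer, then memory
                    pending = [w for w in writes[t][flushed[t]:issued] if w[1] == x]
                    visible = pending[-1][2] if pending else mem[index[x]]
                    if visible == v:
                        succs.append((bump(pos, t), flushed, mem))
        for s in succs:
            if s not in seen:
                seen.add(s)
                stack.append(s)
    return False

# Store buffering: both reads may miss the other thread's buffered write.
store_buffering = [[("w", "x", 1), ("r", "y", 0)], [("w", "y", 1), ("r", "x", 0)]]
# Message passing with a stale read: the FIFO buffer flushes x before y.
message_passing = [[("w", "x", 1), ("w", "y", 1)], [("r", "y", 1), ("r", "x", 0)]]
```

With these inputs, `tso_consistent(store_buffering)` succeeds while `tso_consistent(message_passing)` fails. The number of distinct states explored is what stays polynomial when the numbers of threads and locations are bounded.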

**Partial Store Order (PSO).** The PSO model [SPARC International 1994] is similar to TSO, with the difference that every thread has a different buffer for each location. This allows both write-read reorderings (like TSO) and write-write reorderings on different locations. This induces more behaviors than TSO, but is incomparable to some other models like WRA (see Fig. 13b and Fig. 13c). The operational semantics can be defined by means of an LTS \( \mathcal{L}_{\text{PSO}} \) analogously to \( \mathcal{L}_{\text{TSO}} \). A similar analysis shows that the size of \( \mathcal{L}_{\text{PSO}} \) is bounded by \( n^{O(k(k+d))} \). Hence we have the following theorem, which differentiates TSO/PSO from the other weak-memory models we have seen so far.
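
A PSO variant of such a search differs only in keeping one FIFO buffer per thread and location. The sketch below follows that reading under the same illustrative assumptions as before (hypothetical initial value `0`, reads matched to writes by value, made-up names); it is not the transition system used for the bound above.

```python
def bump(counts, i):
    """Return a copy of the tuple `counts` with entry i incremented."""
    return counts[:i] + (counts[i] + 1,) + counts[i + 1:]

def pso_consistent(threads, init=0):
    """threads: list of per-thread lists of ("w" | "r", location, value)."""
    k = len(threads)
    locs = sorted({x for t in threads for (_, x, _) in t})
    d = len(locs)
    index = {x: i for i, x in enumerate(locs)}
    # vals[t][i]: values thread t writes to locs[i], in program order.
    vals = [[[v for (kind, x, v) in t if kind == "w" and x == loc]
             for loc in locs] for t in threads]
    start = ((0,) * k, (0,) * (k * d), (init,) * d)
    seen, stack = {start}, [start]
    while stack:
        pos, flushed, mem = stack.pop()
        if all(pos[t] == len(threads[t]) for t in range(k)):
            return True                      # every event has been executed
        succs = []
        for t in range(k):
            done = threads[t][:pos[t]]
            for i, loc in enumerate(locs):
                issued = sum(kind == "w" and x == loc for (kind, x, _) in done)
                f = flushed[t * d + i]
                if f < issued:               # flush oldest write in buffer (t, loc)
                    new_mem = mem[:i] + (vals[t][i][f],) + mem[i + 1:]
                    succs.append((pos, bump(flushed, t * d + i), new_mem))
            if pos[t] < len(threads[t]):
                kind, x, v = threads[t][pos[t]]
                i = index[x]
                if kind == "w":              # a write first enters its buffer
                    succs.append((bump(pos, t), flushed, mem))
                else:                        # read own buffer for x, else memory
                    issued = sum(kd == "w" and y == x for (kd, y, _) in done)
                    pending = flushed[t * d + i] < issued
                    visible = vals[t][i][issued - 1] if pending else mem[i]
                    if visible == v:
                        succs.append((bump(pos, t), flushed, mem))
        for s in succs:
            if s not in seen:
                seen.add(s)
                stack.append(s)
    return False

# Writes to different locations may be flushed out of program order.
message_passing = [[("w", "x", 1), ("w", "y", 1)], [("r", "y", 1), ("r", "x", 0)]]
# Writes to the same location keep their order.
same_location = [[("w", "x", 1), ("w", "x", 2)], [("r", "x", 2), ("r", "x", 1)]]
```

Here `pso_consistent(message_passing)` succeeds because the write to `y` may reach memory before the write to `x`, whereas `pso_consistent(same_location)` fails.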

**Theorem 1.4.** Consistency testing for bounded threads and memory locations is in polynomial time for TSO and PSO.

### 5.4 A Final Note on Relaxed

Finally, we turn our attention to the Relaxed model. The only two axioms of this model are relaxed-write-coherence and relaxed-read-coherence, which concern individual memory locations and guarantee per-location coherence: when we focus on each location individually, the corresponding execution is SC-consistent. Given an abstract execution \( \overline{X} \), to decide whether \( \overline{X} \models \text{Relaxed} \), it suffices to check whether \( \overline{X}_x \models \text{SC} \) for each location \( x \), where \( \overline{X}_x \) is obtained from \( \overline{X} \) by considering only the events accessing \( x \). As each consistency check \( \overline{X}_x \models \text{SC} \) takes polynomial time [Agarwal et al. 2021], and we clearly have polynomially many such checks, we arrive at Corollary 1.5.
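
The per-location decomposition can be sketched as follows. The SC check used here is a simple memoized interleaving search chosen for brevity, not the algorithm of the cited work; the hypothetical initial value `0`, the value-based matching of reads to writes, and all names are assumptions made for illustration.

```python
def sc_consistent(threads, init=0):
    """Search for an interleaving in which every read sees the latest write."""
    k = len(threads)
    locs = sorted({x for t in threads for (_, x, _) in t})
    index = {x: i for i, x in enumerate(locs)}
    start = ((0,) * k, (init,) * len(locs))
    seen, stack = {start}, [start]
    while stack:
        pos, mem = stack.pop()
        if all(pos[t] == len(threads[t]) for t in range(k)):
            return True
        for t in range(k):
            if pos[t] == len(threads[t]):
                continue
            kind, x, v = threads[t][pos[t]]
            i = index[x]
            if kind == "r" and mem[i] != v:
                continue                     # this read cannot be executed now
            new_mem = mem[:i] + (v,) + mem[i + 1:] if kind == "w" else mem
            nxt = (pos[:t] + (pos[t] + 1,) + pos[t + 1:], new_mem)
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False

def project(threads, x):
    """Keep only the events accessing location x, preserving program order."""
    return [[ev for ev in t if ev[1] == x] for t in threads]

def relaxed_consistent(threads, init=0):
    """Per-location SC check, one projection per location."""
    locs = {x for t in threads for (_, x, _) in t}
    return all(sc_consistent(project(threads, x), init) for x in locs)

# Not SC as a whole, but each location on its own is SC-consistent.
store_buffering = [[("w", "x", 1), ("r", "y", 0)], [("w", "y", 1), ("r", "x", 0)]]
# Violates coherence on the single location x.
same_location = [[("w", "x", 1), ("w", "x", 2)], [("r", "x", 2), ("r", "x", 1)]]
```

The first example separates the two notions: the whole execution fails `sc_consistent`, yet `relaxed_consistent` accepts it because each projection is checked independently.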

**Corollary 1.5.** Consistency testing for bounded threads is in polynomial time for Relaxed.

## 6 HARDNESS WITH BOUNDED VALUES

For ease of presentation, our reductions in Section 3 and Section 4 use a bounded number of threads and memory locations but an unbounded value domain. Indeed, given the Boolean formula \( \varphi = \{C_i\}_{i \in [m]} \) on \( m \) clauses and \( n \) variables, the value domain of the abstract execution \( \overline{X} \) has size \( \Theta(n \cdot m) \). In this section we outline how to modify those reductions so that \( \overline{X} \) also uses a bounded value domain, thereby arriving at Theorem 1.1 and Theorem 1.2.

**Intuition.** Our two reductions are such that every read \( r \) can read from at most three writes, and these appear in the same gadget as \( r \). However, the values of these events are specific to the gadget, and in particular, specific to the phase \( i \) and the step \( j \) of the events (i.e., events are of the form \( r(t_3, x_1, v^i_j) \)). Our strategy for decreasing the size of the value domain (in both reductions of Section 3 and Section 4) is to use repeating values that are not parameterized by the superscript \( i \) and subscript \( j \) (i.e., the events in the constructed executions now look like \( r(t_3, x_1, v) \) or \( w(t_1, x_1, v) \)). This change does not affect completeness but threatens soundness, as some read events may now read from write events in other gadgets, which was previously forbidden simply because their values did not match. To avoid this, we slightly modify our abstract executions \( \overline{X} \) by inserting a bounded number of auxiliary write and read events between consecutive gadgets, which also access a bounded number of values. The auxiliary write events write dummy values that are read by the auxiliary read events. The effect of the additional reads-from edges due to auxiliary events is to create \( (\text{po}_x \cup \text{rf}_x)^+ \) paths that once again forbid the original (i.e., non-auxiliary) read events of a gadget from accessing write events of other gadgets (while obeying the desired consistency axioms).

**Construction.** We now outline the construction. The process is similar for both the abstract executions of Section 3 and Section 4. For this reason, we describe it generically on an abstract execution $\overline{X}$. Our transformation is carried out in two steps, $\overline{X} \rightarrow \overline{X}_1 \rightarrow \overline{X}_2$, where $\overline{X}_1$ and $\overline{X}_2$ have the same number of threads and locations as $\overline{X}$, and $\overline{X}_2$ additionally has a bounded value domain.

**Step 1.** We obtain $\overline{X}_1$ by inserting various events in $\overline{X}$ while keeping the threads and memory locations the same. We start by fixing a total order $\sigma_1$ on locations, and a total order $\sigma_2$ on threads.

$$\sigma_1 = x_1, x_2, y_1, \ldots, y_6, z_1, \ldots, z_6, a_1, \ldots, a_3, b$$

$$\sigma_2 = t_1, \ldots, t_6, f_1, \ldots, f_6, g_1, \ldots, g_6, h_1, \ldots, h_3, p, q$$

Fix a phase $i$ and step $j$ and let $\zeta = (i + m + j) \mod 2$. For a location $x$ of $\overline{X}$, different from $a_1, a_2, a_3, b$, we introduce auxiliary write and read events on $x$ as follows: (i) if a thread $t$ writes on $x$, we insert a write $w(t, x, v^t_\zeta)$ after all events of phase $i$ and step $j$ in $t$, and (ii) if a thread $t$ reads from $x$, we insert a sequence of read events $r(t, x, v^{t^1}_\zeta), r(t, x, v^{t^2}_\zeta), \ldots$ before all events of phase $i$ and step $j + 1$ (or phase $i + 1$ and step 1, if $j = n$), where $t^1, t^2, \ldots$ is the subsequence of $\sigma_2$ of threads writing to $x$ values read by thread $t$. We repeat this process for all locations $x \notin \{a_1, a_2, a_3, b\}$ in the order of appearance in the total order $\sigma_1$, placing the auxiliary writes before the auxiliary reads in each thread. Observe that each $r(t, x, v^{t'}_\zeta)$ event is forced to read from the respective $w(t', x, v^{t'}_\zeta)$.

Next, we turn our attention to the locations $a_1, a_2, a_3$ and $b$. The auxiliary events are positioned similarly, except for the detail about the step number $j$, because accesses to these locations span an entire phase (in the at-most-one-true and at-least-one-true gadgets). In particular, we have $\zeta = i \mod 2$, while auxiliary write events are placed in each thread after all events of phase $i$, and read events are placed before events of phase $i + 1$.

Observe that since the number of threads and locations is bounded in $\overline{X}$, the same holds for $\overline{X}_1$, while the additional values accessed by the auxiliary events in $\overline{X}_1$ are also bounded.

**Step 2.** In the second step, we transform $\overline{X}_1$ to $\overline{X}_2$ so that the latter only accesses a bounded number of values. In particular, we make $\overline{X}_2$ identical to $\overline{X}_1$ with the difference that, for every event of $\overline{X}_1$ that also appears in $\overline{X}$ (i.e., non-auxiliary events), we remove from its value the superscript of the phase and the subscript of the step of that event. For example, each write $w(t_1, x_1, v^i_j)$ in $\overline{X}_1$ becomes $w(t_1, x_1, v)$ in $\overline{X}_2$. It is straightforward to verify that $\overline{X}_2$ has a bounded domain of threads, locations, and values. Moreover, $\overline{X}_2$ is consistent in the respective memory model iff $\overline{X}$ is, which follows by repeating the arguments in Section 3 and Section 4, this time also accounting for the auxiliary events.
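
The effect of Step 2 on the size of the value domain can be illustrated with a small sketch. The `Value` record, its field names, the auxiliary value `"d0"`, and the choice of $m = 3$ phases and $n = 4$ steps are hypothetical and serve only to show that dropping the phase and step indices collapses many gadget-specific values into one, while auxiliary values are left untouched.

```python
from typing import NamedTuple, Optional

class Value(NamedTuple):
    name: str
    phase: Optional[int] = None    # superscript i of the value
    step: Optional[int] = None     # subscript j of the value
    auxiliary: bool = False        # values of auxiliary events are kept as they are

def strip_indices(events):
    """events: list of (kind, thread, location, Value) tuples."""
    out = []
    for kind, thread, location, value in events:
        if not value.auxiliary:
            value = Value(value.name)          # drop phase and step
        out.append((kind, thread, location, value))
    return out

m, n = 3, 4  # hypothetical numbers of phases and steps
x1_writes = [("w", "t1", "x1", Value("v", i, j))
             for i in range(1, m + 1) for j in range(1, n + 1)]
aux_write = ("w", "t1", "x1", Value("d0", auxiliary=True))

values_before = {ev[3] for ev in x1_writes}
values_after = {ev[3] for ev in strip_indices(x1_writes)}
aux_after = strip_indices([aux_write])[0]
```

In this toy instance the twelve distinct values written to `x1` become a single repeated value, which is exactly why the auxiliary events of Step 1 are needed to keep reads from matching writes of other gadgets.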

## 7 CONCLUSION

We have studied the standard problem of consistency-testing for various popular weak-memory models spanning across software, hardware, and distributed systems. We have shown that even the bounded version of consistency testing is NP-complete in most of these models, i.e., when every natural input parameter is bounded. This is a significant improvement over an abundance of prior hardness results which primarily stemmed from parameters such as the number of threads or memory locations being unbounded. Our results thus highlight the true intricacies of weak-memory testing. In particular, our results imply that the problem provably admits no parameterization with respect to natural input parameters. Interesting future work includes the possibility of extending our hardness to other memory models such as the one in ARM architectures, as well as recovering tractability by imposing further restrictions (such as context/view-switching).

## ACKNOWLEDGMENTS

Andreas Pavlogiannis was partially supported by a research grant (VIL42117) from VILLUM FONDEN. S. Krishna was partially supported by the SERB MATRICS grant MTR/2019/000095. Umang Mathur was partially supported by a Singapore Ministry of Education (MoE) Academic Research Fund (AcRF) Tier 1 grant.

REFERENCES


Soham Chakraborty, Shankara Narayanan Krishna, Umang Mathur, and Andreas Pavlogiannis


Received 2023-07-11; accepted 2023-11-07