When you specify a concurrent system, you aren't just describing a state machine. You are describing how that machine is allowed to evolve over time. This creates two distinct correctness categories: forbidden states and required futures.
The first category, Safety, is about ensuring a "bad thing" never happens. The second, Liveness, is about ensuring a "good thing" eventually happens. In TLA+, these aren't just philosophical labels; they change how TLC explores your state space and what its counterexamples actually tell you.
Table of Contents

1. Safety and the Finite Counterexample
2. Liveness and the Infinite Lasso
3. The Mechanics of Fairness (WF vs. SF)
4. Stuttering and the [Next]_vars Pattern
5. The Performance Tax of Liveness
6. Debugging the "Lasso"
1. Safety and the Finite Counterexample
A safety property, usually defined as a state invariant like TypeOK or MutualExclusion, is checked against every reachable state in the state space. The logic is simple: []P (Always $P$).
During Breadth-First Search (BFS), TLC maintains a queue of discovered states. For every state $s$, it evaluates $P(s)$. If $P(s)$ is false, TLC has found a finite violation. This is the "Oops" moment: a trace $s_0 \rightarrow s_1 \rightarrow \dots \rightarrow s_n$ where $s_n$ is a forbidden state.
Because safety is prefix-closed, once a safety property is broken, it stays broken. No future state can "undo" the fact that two processes were in the critical section at the same time. If a safety check fails, your bug is likely in the initial condition or the next-state relation—you allowed a transition that is operationally illegal.
Common Safety Invariants:
- `pc1 = "cs" => pc2 /= "cs"`
- `balance >= 0`
- `Len(Queue) <= MaxSize`
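To make the failure mode concrete, here is a minimal sketch of a spec with a buggy next-state relation. The module name, the bound, and the unguarded `Withdraw` action are all illustrative, not from any standard example:

```tla
---------------- MODULE Account ----------------
EXTENDS Integers

VARIABLE balance

Init == balance = 5

\* Bug: Withdraw has no guard, so the balance can go negative.
Withdraw == balance' = balance - 2
Deposit  == balance < 10 /\ balance' = balance + 1

Next == Withdraw \/ Deposit

\* Safety invariant. TLC refutes it with a finite trace: 5 -> 3 -> 1 -> -1.
NonNegative == balance >= 0
================================================
```

Because the violation is a single bad state, TLC stops as soon as BFS reaches `balance = -1` and prints the shortest trace leading there.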
2. Liveness and the Infinite Lasso
Liveness properties, such as P ~> Q ($P$ leads to $Q$) or []<>Done (infinitely often $Done$), cannot be refuted by a finite trace. You can never point to a finite log and prove a liveness property was violated, because the "good thing" could always happen in the next (unseen) step.
To refute a liveness property, TLC must find an infinite behavior where the property is never satisfied. Since we are working with finite-state models, an infinite behavior is always a Lasso: a finite path that enters a cycle: $s_0 \rightarrow s_1 \rightarrow \dots \rightarrow (s_k \rightarrow \dots \rightarrow s_m)^\omega$.
When TLC detects a liveness violation, it is showing you a cycle where the system can loop forever without making progress. This is often a Livelock or a Fairness hazard.
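A minimal spec that TLC refutes with a lasso might look like the following sketch (all names are illustrative):

```tla
------------- MODULE Lasso -------------
VARIABLE pc

Init == pc = "start"

Work   == pc = "start" /\ pc' = "loop"
Spin   == pc = "loop"  /\ pc' = "loop"   \* legal step, but no progress
Finish == pc = "loop"  /\ pc' = "done"

Next == Work \/ Spin \/ Finish

\* No fairness assumption yet.
Spec == Init /\ [][Next]_pc

\* TLC refutes this with a lasso: start -> loop -> (loop)^w.
Eventually == <>(pc = "done")
========================================
```

Nothing forces `Finish` to ever be taken, so the behavior that enters `"loop"` and stays there forever is a valid counterexample to `Eventually`.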
3. The Mechanics of Fairness (WF vs. SF)
By default, TLA+ allows stuttering. A system can stay in its current state forever even if actions are enabled. To prove progress, we use Fairness assumptions to constrain the scheduler.
Weak Fairness (WF_vars(A))
Weak fairness on action $A$ means that if $A$ stays continuously enabled, it must eventually occur. In the state graph, TLC looks for cycles where $A$ is enabled in every single state of the loop, yet $A$ is never taken. If it finds such a cycle, and you assumed WF, TLC reports a violation.
WF is usually enough for "internal" progress, like a local timer that can't be blocked by other processes.
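A sketch of the "internal timer" case (names and the bound are illustrative): `Tick` stays continuously enabled until it fires, so weak fairness is sufficient to prove progress.

```tla
----------- MODULE Timer -----------
EXTENDS Naturals
VARIABLE t

Init == t = 0
Tick == t < 3 /\ t' = t + 1

\* Without WF_t(Tick), a behavior may stutter at t = 0 forever
\* and Progress fails. With it, Tick must eventually fire.
Spec == Init /\ [][Tick]_t /\ WF_t(Tick)

Progress == <>(t = 3)
====================================
```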
Strong Fairness (SF_vars(A))
Strong fairness on action $A$ means that if $A$ is infinitely often enabled, it must eventually occur. Checking this is significantly more expensive; TLC must monitor the "enabledness" of $A$ throughout the entire SCC (Strongly Connected Component) to see if $A$ could have been taken at any point in the cycle.
SF is often necessary when processes contend for a shared resource. If a process is repeatedly blocked and unblocked, WF won't guarantee it ever gets the resource, but SF will.
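A sketch of the contention case (names are illustrative). Here `Acquire(2)` is enabled only while the lock is free, so in the loop where process 1 repeatedly takes and releases the lock, it is enabled infinitely often but never continuously:

```tla
----------- MODULE Contend -----------
VARIABLE owner   \* 0 = free, otherwise the holding process

Init == owner = 0
Acquire(p) == owner = 0 /\ owner' = p
Release(p) == owner = p /\ owner' = 0

Next == \E p \in {1, 2} : Acquire(p) \/ Release(p)

\* WF_owner(Acquire(2)) would admit the unfair loop 0 -> 1 -> 0 -> 1 -> ...
\* because Acquire(2) is disabled whenever owner = 1.
\* SF_owner(Acquire(2)) rules that loop out.
Spec == Init /\ [][Next]_owner
             /\ WF_owner(Release(1))
             /\ SF_owner(Acquire(2))

Eventually2 == <>(owner = 2)
======================================
```

The `WF_owner(Release(1))` conjunct is also needed: without it, process 1 could hold the lock forever and `Acquire(2)` would be enabled only finitely often, so even SF would demand nothing.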
4. Stuttering and the [Next]_vars Pattern
Canonical TLA+ specs often take the form Init /\ [][Next]_vars /\ Fairness. The [Next]_vars notation abbreviates Next \/ (vars' = vars), explicitly permitting the "stuttering" step in which no variable changes.
This makes stuttering a first-class citizen. If a liveness property fails, TLC often shows a trace ending in "Back to state X". This tells you that your spec permits an infinite execution that avoids progress. Sometimes the cycle is pure stuttering (the system hangs); sometimes the variables change, but the "required" variable remains static (the system is busy doing useless work).
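The canonical skeleton can be written out as the following fragment, assuming `Init` and `Next` are defined elsewhere in the module:

```tla
VARIABLES x, y
vars == <<x, y>>

\* [Next]_vars abbreviates: Next \/ UNCHANGED vars,
\* i.e. every step either takes Next or stutters (vars' = vars).
\* The fairness conjunct is what forbids stuttering forever.
Spec == Init /\ [][Next]_vars /\ WF_vars(Next)
```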
5. The Performance Tax of Liveness
Checking liveness is orders of magnitude slower than checking safety because TLC cannot exit early upon finding a bad state. It must build the state graph and perform SCC analysis (typically using Tarjan's or a tableau-based approach).
| Task | Logic | Algorithm | Complexity |
|---|---|---|---|
| Safety | []P | BFS / State Fingerprinting | $O(V + E)$ |
| Liveness | P ~> Q | SCC / Tableau Analysis | $O(V + E) \times \text{Fairness}$ |
While a safety check might handle 10 million states in minutes, a liveness check on the same model may take hours or require massive memory to track the edges required for cycle detection.
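The split is visible in the TLC configuration file: invariants are checked per state as BFS proceeds, while temporal properties trigger the more expensive graph analysis. An illustrative `.cfg` sketch (the `TypeOK` and `Termination` names are assumed to be defined in the spec):

```
SPECIFICATION Spec
INVARIANT  TypeOK          \* safety: checked on the fly during BFS
PROPERTY   Termination     \* liveness: forces graph construction + SCC analysis
```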
6. Debugging the "Lasso"
When TLC gives you a liveness counterexample:
- Identify the Cycle: Look for the "Back to state [N]" message at the end of the trace.
- Evaluate Enabledness: For every state in the cycle, check which actions were `ENABLED`.
- Verify Fairness: If you expected an action to fire, was it `ENABLED` continuously (for `WF`) or infinitely often (for `SF`)?
A common trap is the Request/Response failure: if Response is only enabled when a buffer isn't full, and Request keeps filling it, Response may be disabled exactly when the scheduler would have picked it. In this case, even SF_vars(Response) won't save you if the buffer stays full—the protocol itself must be changed to ensure Response eventually becomes and stays enabled.
Ultimately, safety prevents catastrophe, but liveness prevents the "zombie" state where your system is legal but useless. If you can't prove liveness, your spec is just a list of things that won't happen, not a description of a system that works.