Why on earth do we need TLA+?

19 minute read Published: 2025-11-16

Every few years the industry collectively rediscovers the same uncomfortable fact: you can have good engineers, solid testing, careful code reviews, and still ship a distributed system that behaves in ways nobody expected.

You get a rare race condition that only shows up under a very particular pattern of failures. Or a recovery protocol that works for every test case you wrote, but corrupts data when three retries line up with a network partition and a scheduled restart. The kind of thing that, when you finally understand it, makes you think “of course”, yet you know you never would have found it with more unit tests.

TLA+ exists for that specific problem. It is not magical, and it is not for everything. It is for the situations where “we think it is correct” is not good enough, and where the behavior space is too large for intuition and tests to cover.

This is the perspective that finally made TLA+ click for me, and it came from a strange place: learning Rust.

From Rust’s borrow checker to specification languages

When I first spent serious time with Rust, I was fascinated by how much the compiler would refuse to let me do. The borrow checker would complain about aliasing, lifetimes, and ownership in a way that felt pedantic at first and then, slowly, started to feel like guardrails.

Once I internalized the rules, I could write fairly low level code without constantly worrying that I was accidentally holding a dangling pointer or racing on shared state. The language’s type system and ownership model were doing real work for me: they encoded invariants about memory in a way that the compiler could enforce.

That was my entry point to TLA+.

I kept thinking: if we can encode memory safety rules in a type system so that the compiler protects us, is there a way to encode higher level rules for distributed systems and concurrency so that some tool can protect us there as well? Not just “this variable is an Option<T>” but “this cluster never loses acknowledged writes” or “this lock protocol never deadlocks”.

Rust had trained me to see invariants as first class citizens. TLA+ was the first tool that felt like Rust’s borrow checker for system designs, not for individual programs.

What TLA+ actually is

There are many good introductions, but one line I like is that TLA+ is “exhaustively testable pseudocode” for designs, especially concurrent and distributed ones.1

You write a specification in a math-flavored language. It looks like a mixture of set theory, logic, and high-level pseudocode. You describe:

  • The state of your system: variables that represent things like “set of committed transactions”, “log entries on each node”, “current leader”.
  • The possible actions: rules that describe how the state can change when a client sends a write, when a node crashes, when a timeout triggers.
  • Properties you want to be true: invariants such as “committed data is never lost” and liveness properties such as “every accepted write eventually gets a response”.

For liveness properties in particular, you usually add fairness assumptions so the tool does not consider schedules where progress is indefinitely postponed.
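
To make that concrete, here is a minimal sketch of what those ingredients look like in TLA+. Everything in it is invented for illustration: a toy system where a write has to become durable before it can be acknowledged, one invariant, and a weak fairness condition on the next-state action.

    ---- MODULE ToyDurability ----
    \* Hypothetical toy spec: a write must be durable before it is acknowledged.
    CONSTANT Writes              \* the set of writes; a small finite set in the model
    VARIABLES durable, acked     \* writes on disk / writes acknowledged to clients

    Init == durable = {} /\ acked = {}

    \* Action: some write becomes durable.
    Persist(w) ==
        /\ w \in Writes \ durable
        /\ durable' = durable \cup {w}
        /\ UNCHANGED acked

    \* Action: a durable write is acknowledged to the client.
    Ack(w) ==
        /\ w \in durable \ acked
        /\ acked' = acked \cup {w}
        /\ UNCHANGED durable

    Next == \E w \in Writes : Persist(w) \/ Ack(w)

    \* Invariant: nothing is ever acknowledged unless it is durable.
    AckedImpliesDurable == acked \subseteq durable

    \* Weak fairness so a liveness property like "every write is eventually
    \* acknowledged" is not defeated by a schedule that simply does nothing.
    Spec == Init /\ [][Next]_<<durable, acked>> /\ WF_<<durable, acked>>(Next)
    ====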

Then you run a model checker (TLC) that explores all reachable states of that abstract system, within the finite bounds you give its variables (for example, numbers of nodes or sizes of sets), and checks whether your properties always hold.

If you think “this sounds like writing a tiny interpreter for my protocol, then brute forcing every possible interleaving of events”, you are not far off. The win is that the language is designed to make that state space exploration feasible for realistic designs, and the tooling gives you counterexamples as concrete execution traces when something goes wrong.

The important distinction: TLA+ checks your design, not your code. It is closer to checking a carefully written technical spec than to running unit tests on a codebase.

That sounds abstract. It is easier to see the value when you look at what people have caught with it.

Concrete incidents where TLA+ earned its keep

TLA+ is not just a toy for academics. There are several public stories that are worth reading in full; I will only sketch the highlights here.

Amazon Web Services: bugs that testing never found

Engineers at Amazon Web Services started using TLA+ around 2011 for services like S3, DynamoDB, EBS, and internal coordination systems.2 They wrote a detailed article in Communications of the ACM about this experience.

A few points from their account:

  • They had already been doing “all the normal good things”: code reviews, deep design reviews, extensive testing, fault injection. They still found that subtle design bugs kept slipping through in their distributed systems.
  • Once they started writing TLA+ specifications for the core protocols, the model checker found bugs that required surprisingly long and counterintuitive sequences of events to manifest. One famous example needed a 35-step execution trace to reproduce, which no human was ever going to write as a test.1
  • In several documented cases the TLA+ model exposed flaws that would have led to data loss or violation of S3’s durability promises. These were issues in the design, not in the implementation, so more code-level testing would not have helped.

There is an interesting twist in their writeup. Once they had confidence in a design, they also used the model as a “what if” tool for performance work. They could remove locks, weaken ordering constraints, or simplify recovery flows, then re-run the model checker to see whether the design still preserved the critical invariants.2 That made it less scary to attempt aggressive optimizations in distributed algorithms.

Microsoft: hardware and databases

Microsoft has used TLA+ in a few different contexts.

The Wikipedia page on TLA+ summarizes that a serious bug in the memory controller for the Xbox 360 was caught while engineers were writing a specification in TLA+.3 The bug was in the design of the hardware protocol, not in any specific implementation, so again you would not catch it with more unit tests. You needed a way to reason about all possible sequences of reads and writes at the controller boundary.

On the cloud side, Microsoft Azure used TLA+ to design Cosmos DB, which offers several distinct consistency models in one globally distributed database.3 Getting the interactions between those models right, especially under failures and reconfiguration, is precisely the kind of problem where informal reasoning gets very shaky.

The common thread is that TLA+ let them explore behaviors of complex protocols that were simply too subtle to reason about by inspection.

Smaller teams: failures that cost real money

Big cloud providers are not the only ones who have stories like this. Hillel Wayne’s Learn TLA+ site collects several public success cases. One example describes Espark Learning, a small edtech company, using TLA+ to model a distributed app installer.1 Their model uncovered complex failure scenarios that would have been expensive to debug in production and could have cost them significant revenue.

The same page mentions that TLA+ has been used by companies like CrowdStrike, MongoDB, Confluent, Elastic, and Cockroach Labs to find bugs in their systems, often in a matter of days once they had a working specification.1

These stories are not about proving rocket control systems correct to a mathematical standard of perfection. They are about fairly ordinary commercial systems where concurrency, failure, and scale make the behavior space too wide for ad hoc testing.

Where tests, reviews, and type systems hit their limits

If you are comfortable with modern engineering practice, you already rely on multiple lines of defense:

  • Code review to catch logic errors and misunderstandings.
  • Static analysis and type systems to catch an entire class of mistakes (null dereferences, lifetime errors, glaringly obvious concurrency bugs).
  • Unit tests, property-based tests, and integration tests to codify expectations.
  • Staging environments and canary deployments to approximate real world conditions.

All of this matters. None of this goes away if you adopt TLA+.

The problem is that there are entire classes of bugs that these techniques practically cannot touch, simply because the state space is too big and the timing interactions are too weird.

Some patterns that should feel familiar:

You have a pretty standard web service that takes an order, writes it to a database, and hands off fulfillment to a background job. Under a particular mix of client retries, message queue redeliveries, and database timeouts, the “idempotent” workflow isn’t actually idempotent. One unlucky sequence of timeouts and partial commits double-charges a customer or ships the same box twice. None of your tests ever lined up the failures in exactly that order.
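
As a flavor of what “lining up the failures” means, here is a deliberately tiny sketch of that first scenario, assuming at-least-once delivery from the queue. The module and all of its names (Orders, inflight, charged) are invented; the point is that the redelivery path becomes an explicit move the model checker is allowed to take.

    ---- MODULE ToyCharge ----
    EXTENDS Naturals
    CONSTANT Orders                 \* the set of orders; one is enough in the model
    VARIABLES inflight, charged     \* undelivered charge messages / charges applied per order

    Init ==
        /\ inflight = Orders
        /\ charged = [o \in Orders |-> 0]

    \* Happy path: the worker charges the order and the message is removed.
    Process(o) ==
        /\ o \in inflight
        /\ charged' = [charged EXCEPT ![o] = @ + 1]
        /\ inflight' = inflight \ {o}

    \* Failure path: the charge succeeds but the ack to the queue is lost,
    \* so the queue keeps the message and will deliver it again.
    ProcessAckLost(o) ==
        /\ o \in inflight
        /\ charged' = [charged EXCEPT ![o] = @ + 1]
        /\ UNCHANGED inflight

    Next == \E o \in Orders : Process(o) \/ ProcessAckLost(o)

    \* The property we would like to be true: nobody is ever charged twice.
    ChargedAtMostOnce == \A o \in Orders : charged[o] <= 1

    ====

Run with deadlock checking disabled (the toy simply stops once the queue drains), TLC finds a two-step counterexample: ProcessAckLost followed by Process leaves one order charged twice. Whether the fix is worker-side deduplication or a conditional charge is then a design decision the model forces you to make explicitly.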

You have a feature flag system with a central config store and per-service caches. In staging, rollouts look clean. In production, a short-lived network blip plus staggered cache expiry means some instances see the new flag state and others don’t. A user bouncing between nodes hits a path where one service enforces the new rule and another still runs the old behavior, violating assumptions you thought were globally true once the flag flipped.

You have a multi-step onboarding workflow that spans three services and a couple of external APIs. Each step is retried “safely” on failure. Once in a while, a timeout in one downstream system plus a retry in another leaves a tenant half-provisioned: billed in one place, missing permissions in another, stuck forever in a “pending” state. Nobody can quite remember all the states and transitions, so the team debugs by trawling logs for hours.

These are cases where what is missing is not more tests; it is coverage of behaviors your imagination never produced as test cases. You do not know that a certain weave of events is worth testing until somebody has already suffered for it.

TLA+ attacks exactly that gap. You force yourself to be explicit about the rules of your system, then you let an automated tool explore a huge range of event interleavings and failure scenarios inside a controlled model.

Why model instead of just writing more code tests?

There is a reasonable objection here: why learn a new specification language when you can write randomized tests or use something like Jepsen or integration chaos tools?

Two answers, one pragmatic and one philosophical.

The pragmatic answer is that models can be vastly smaller than real systems while still exposing the same bugs. A 50-line TLA+ spec can represent a distributed algorithm that takes thousands of lines of production code to implement.1 That smaller state description means a model checker can actually explore most or all of the reachable states, rather than relying on clever sampling.

The philosophical answer is that the act of modeling forces you to answer questions you did not know you were fudging.

When you write a TLA+ spec for a protocol, you must decide what happens if messages are reordered, duplicated, or dropped. You must decide what “durable write” means in terms of abstract state. You must decide whether your invariant is “no acknowledged write is ever lost” or “no write that was committed to quorum N is lost, but the client might not have seen the ack”.
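
As a small illustration, those last two readings are literally different formulas. Assuming hypothetical variables acked (writes confirmed to clients), committed (writes stored by a write quorum), and surviving (writes the cluster can still serve after failures), the two invariants might be written as:

    \* Strong reading: every write a client saw confirmed is still servable.
    NoAckedWriteLost     == acked \subseteq surviving

    \* Weaker reading: every quorum-committed write survives, even if the
    \* client never saw the acknowledgement.
    NoCommittedWriteLost == committed \subseteq surviving

Which of the two your system actually promises is exactly the kind of question the spec refuses to let you postpone.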

You can sometimes avoid making those choices in code, or you make them implicitly and inconsistently across modules. The spec drags them into the open.

Once they are in the open, a model checker can exhaustively explore what happens under those rules. If there is a 35-step execution trace where an acknowledged write disappears, the tool will show you that specific trace. You can then decide whether to adjust the design, tighten the invariant, or accept the behavior.

The mental model: seeing your system as states and rules

Rust’s borrow checker was what first made this style of thinking feel concrete to me. You write code, the compiler enforces a small set of precise rules about ownership and lifetimes, and if those rules hold you get strong guarantees about memory safety that would be hard to maintain by hand.

TLA+ applies the same basic idea one level up: instead of rules about pointers and references, you write down rules about system state and allowed transitions. If those rules are consistent and your invariants hold under all the modeled behaviors, you get stronger guarantees about your system’s behavior than you would by relying on intuition and a finite set of tests.

Concretely, the mental model I keep is: a system is “what the world looks like right now” plus “the moves that are allowed to happen next” plus “the things that are never supposed to go wrong”. TLA+ asks you to write those three ingredients down explicitly. The state is a bunch of variables that describe the world. The moves are the rules that say how those variables are allowed to change. Your invariants are the conditions that rule out those “never supposed to happen” situations.

When the model checker runs, it does not care about your implementation details. It starts from an initial state, applies every move in every order it can, and watches how the state evolves. At each step it asks: “did any of the forbidden situations appear?” If yes, it keeps a concrete example and hands it back to you as a counterexample trace. If not, within the explored bounds, you have evidence that your rules really do prevent those bad situations.

This is a different way of thinking from writing tests. With tests, you pick a handful of concrete scenarios and check what happens. With a TLA+ model, you describe the shape of all possible scenarios and let a tool grind through them mechanically. Your job shifts from “did I think of enough test cases?” to “did I write down the right rules and properties?”

The nice part is that the rules themselves do not have to be very complicated. A surprisingly large class of ugly bugs comes from small gaps in a handful of simple rules: forgetting that messages can be duplicated, assuming a step is atomic when it is not, or quietly allowing a variable to take on a value you never intended. Forcing yourself to express the system as states, moves, and invariants flushes those gaps out into the open, where a model checker can poke at them.

What TLA+ is good at, and what it is not

It is easy to get overly excited and start imagining TLA+ as the answer to everything. That is a trap, and it is one reason many engineers bounce off formal methods.

In practice, TLA+ shines in a few specific regions:

It is very good at protocols and algorithms where concurrency, failures, and ordering rules interact in complex ways. This includes replication protocols, distributed locks, consensus algorithms, leader election, sharded metadata services, and similar patterns.

It is very good at life-cycle-heavy systems with tricky invariants. For example, a distributed job scheduler that must move tasks between queues without losing or duplicating work, under partial failures and restarts.
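
For a scheduler like that, the invariants are short to state even when the failure modes are not. Here is a hedged sketch with invented names, modeling the queues as sets and a worker crash as a requeue:

    ---- MODULE ToyScheduler ----
    CONSTANT Tasks                       \* the set of tasks; a small finite set in the model
    VARIABLES pending, running, done     \* three queues, modeled as sets

    Init == pending = Tasks /\ running = {} /\ done = {}

    \* A worker picks up a pending task.
    Start(t) ==
        /\ t \in pending
        /\ pending' = pending \ {t}
        /\ running' = running \cup {t}
        /\ UNCHANGED done

    \* A running task completes.
    Finish(t) ==
        /\ t \in running
        /\ running' = running \ {t}
        /\ done' = done \cup {t}
        /\ UNCHANGED pending

    \* A worker crashes mid-task and the task is requeued.
    Requeue(t) ==
        /\ t \in running
        /\ running' = running \ {t}
        /\ pending' = pending \cup {t}
        /\ UNCHANGED done

    Next == \E t \in Tasks : Start(t) \/ Finish(t) \/ Requeue(t)

    \* No task is ever lost...
    NothingLost == pending \cup running \cup done = Tasks

    \* ...and no task sits in two queues at once.
    NothingDuplicated ==
        /\ pending \cap running = {}
        /\ running \cap done = {}
        /\ pending \cap done = {}
    ====

TLC can then check that no interleaving of starts, finishes, and crash-requeues, within the modeled behaviors, ever drops a task or puts it in two queues at once.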

It is quite good at intricate state machines where the next allowed action depends on subtle combinations of prior transitions.

It starts to lose its edge when the behavior of interest depends heavily on low level performance characteristics (like CPU cache effects or garbage collector interactions) or on unmodeled external systems. Amazon’s own writeup mentions that formal specification did not help them much with “emergent sustained performance degradation”; slowdowns caused by feedback loops between timeouts, retries, and resource contention were better tackled with different tools.2

It is also not a silver bullet for correctness. The model only checks what you specify. If you forget to express a critical property, or if you oversimplify the environment, TLA+ will cheerfully tell you that your underspecified toy is correct.

In other words, it is powerful, but it still requires taste and judgment.

What about the math and the learning curve?

The first time you open the TLA+ Toolbox or read a spec, it can feel like falling into a different discipline. There are quantifiers, temporal operators, and more Greek letters than you might see in a typical web service.

In practice, I find the barrier lower than it first appears. If you are comfortable with “discrete math” basics—sets, quantifiers, simple logic—it mostly feels like normal math written in a slightly quirky syntax. Most engineers have seen this at some point; TLA+ just asks you to use it more deliberately.

Personally I prefer to write specs directly in TLA+; the more mathematical notation keeps me honest. If you like a more program-like notation, PlusCal is worth a look: it is an algorithm language that compiles to TLA+ and uses the same tools.2 Hillel Wayne’s book Practical TLA+ has a substantial PlusCal-focused introduction and is an excellent way to learn that style.4

For external resources, I usually point people to:

  • Leslie Lamport’s TLA+ Video Course, a structured set of lectures that walk through the language and tools from first principles.5
  • The tlaplus Google group, which acts as the main Q&A forum and mailing list for the community.6
  • Markus Kuppe’s talks, especially “Weeks of Debugging Can Save You Hours of TLA+” and “Large Scale Model Checking 101”, which show how engineers actually use TLA+ and TLC in practice.7

The mental shift is the important bit: you start thinking of your system not as “code with tests”, but as “a set of states and transitions with properties that should hold”.

When is TLA+ worth the pain?

I do not reach for TLA+ on every project. For a straightforward CRUD backend with tame failure modes, I am happier spending time on types, tests, and keeping the architecture boring.

Two examples where it has clearly earned its keep for me:

One was a restaurant booking algorithm: many customers trying to reserve limited tables and timeslots at the same time, with constraints about overlapping bookings and fairness. Getting the concurrency and edge cases right in code alone felt brittle. Writing a small TLA+ model forced me to be precise about what a “valid” allocation meant and how concurrent requests were allowed to succeed or fail under load.

Another was a state machine that governs how a business object’s status is allowed to change. Scattering “if status is X, allow transition to Y” checks across endpoints is a recipe for bugs. Encoding the allowed states and transitions as a single, explicit algorithm—and then modeling that algorithm—made it much easier to see which transitions were legal, which ones were forbidden, and which invariants had to hold no matter how the endpoints were wired.
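
A minimal sketch of that shape, with made-up statuses; a real object would have more states and more guards, but the structure is the same: one explicit transition relation instead of status checks scattered across endpoints.

    ---- MODULE ToyStatus ----
    VARIABLE status

    \* Made-up statuses and the transitions we intend to allow.
    Statuses == {"draft", "submitted", "approved", "rejected", "archived"}
    Allowed ==
        { <<"draft", "submitted">>,
          <<"submitted", "approved">>,
          <<"submitted", "rejected">>,
          <<"approved", "archived">>,
          <<"rejected", "archived">> }

    Init == status = "draft"

    \* The only way the status may change is via an allowed transition.
    Transition(to) ==
        /\ <<status, to>> \in Allowed
        /\ status' = to

    Next == \E to \in Statuses : Transition(to)

    \* Invariant: the status never takes a value outside the intended set.
    \* (Here, nothing ever leaves "archived" because no allowed transition starts there.)
    TypeOK == status \in Statuses
    ====

With the relation in one place, questions like “can anything ever leave archived?” become checkable properties instead of archaeology across handlers.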

Those experiences nudged me toward a rule of thumb: TLA+ is most useful when you have a small, critical piece of logic with fiddly rules that matter a lot under concurrency or over time. In those cases, a model that explores all the weird interleavings buys you more confidence than another page of prose or another handful of tests.

Why we still ask “why on earth”

None of this fully answers the gut level skepticism many engineers have.

There is a long history of formal methods being sold as an academic cure for all software ills, then failing to deliver in mainstream practice. Many people have sat through talks where someone writes a temporal logic formula on a slide and declares victory over bugs.

So when someone says “we should use TLA+”, it is natural to respond with “we already barely have time to write tests; why on earth are we adding math”.

For me, the honest answer is: you only reach for a tool like TLA+ when you can no longer fool yourself that intuition is enough.

If you have ever stared at a race condition in a distributed system for days, only to realize that the root cause was a sequence of events nobody imagined, this is the tool you wish you had earlier. If you have ever argued for hours about whether a protocol “can” lose data and felt the conversation circling, this is the tool that lets you settle the question.

Learning Rust made me appreciate what it feels like when a language enforces invariants that I only had as fuzzy intentions and could not reliably encode by hand. Learning TLA+ extended that feeling to the level of system designs. In both cases, the cost was real, and the path was sometimes frustrating. In both cases, once I saw the kinds of mistakes I stopped worrying about, it was difficult to go back.

We do not need TLA+ for every system, and pretending otherwise would just turn it into another buzzword. Most software lives perfectly fine on a diet of tests, monitoring, and engineers using their judgment.

For me, the point of having TLA+ around is different. It is a way to take the stories I tell myself about how a design “obviously” behaves and make them precise enough that something unforgiving can argue back. If the model checker finds a trace that violates the rules I thought I had written, that is a gift: it shows me the gap between the system in my head and the system I have actually described.


1. Hillel Wayne, “FAQ”, Learn TLA+, https://learntla.com/intro/faq.html.
2. Chris Newcombe et al., “How Amazon Web Services Uses Formal Methods”, Communications of the ACM, https://cacm.acm.org/research/how-amazon-web-services-uses-formal-methods/.
3. “TLA+”, Wikipedia, overview and industry uses, https://en.wikipedia.org/wiki/TLA%2B.
4. Hillel Wayne, “Practical TLA+: Planning Driven Development”, Apress, announcement and details at https://www.hillelwayne.com/post/practical-tla/.
5. Leslie Lamport, “The TLA+ Video Course”, recorded lectures on learning TLA+, https://lamport.azurewebsites.net/video/videos.html.
6. “tlaplus – Google Groups”, main discussion list for TLA+ and PlusCal, https://groups.google.com/g/tlaplus.
7. Markus A. Kuppe, talks such as “Weeks of Debugging Can Save You Hours of TLA+” and “Large Scale Model Checking 101”, linked from https://lamport.azurewebsites.net/tla/more-stuff.html.